Question: Clarification needed on required columns for pretrain.py input

Hello, thank you for sharing the project.

I’m currently trying to reproduce your project using the MIMIC-IV Demo data.
I followed the instructions and successfully completed the data preprocessing step, resulting in a patient sequence in `.parquet` format.

Now I’m trying to pretrain the model by running the `pretrain.py` script, but execution fails with `KeyError: 'type_tokens'`.

It seems the model requires the input columns `concept_ids`, `type_ids`, `time_stamps`, `ages`, `visit_orders`, and `visit_segments`.
From what I understand, these can be added by setting the `additional_token_types` argument when initializing `PretrainDataset` like so:

```python
PretrainDataset(
    data=pre_train,
    tokenizer=tokenizer,
    max_len=args.max_len,
    mask_prob=args.mask_prob,
    additional_token_types=['type_ids', 'ages', 'time_stamps', 'visit_orders', 'visit_segments'],
    padding_side=args.padding_side,
)
```

However, the patient sequence produced by preprocessing does not contain the required columns `['type_tokens', 'age_tokens', 'time_tokens', 'position_tokens', 'visit_tokens']`, which appears to be what triggers the `KeyError` when running `pretrain.py`.
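To make the mismatch concrete, here is a minimal sketch of the check. The expected names are only the ones mentioned in this issue, so the full list the project needs is an assumption, and `missing_columns` is a hypothetical helper, not part of the repository.

```python
# Column names taken from this issue; whether this is the complete list
# that pretrain.py needs is an assumption.
EXPECTED_COLUMNS = [
    "type_tokens",
    "age_tokens",
    "time_tokens",
    "position_tokens",
    "visit_tokens",
]


def missing_columns(available, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `available`,
    in the order they appear in `expected`."""
    present = set(available)
    return [name for name in expected if name not in present]


# Columns found in my preprocessed patient sequence.
observed = ["subject_id", "code"]
print(missing_columns(observed))
```

With pandas installed, the column names of the real file could be passed in instead, for example `missing_columns(pd.read_parquet(path).columns)`, where `path` points to the preprocessed `.parquet` file.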

I’d like to ask for more information so I can resolve this and proceed further:

1. What columns should the patient sequence contain after preprocessing?
   From my current result, it only includes `['subject_id', 'code']`.

2. Is there an additional processing step required for the patient sequence before running `pretrain.py`?
   If so, could you provide details or code for that step?

3. Do you have more detailed instructions or examples covering the entire process, from preprocessing through pretraining and fine-tuning?




