Hello, thank you for sharing the project.
I’m currently trying to reproduce your project using the MIMIC-IV Demo data.
I followed the instructions and successfully completed the data preprocessing step, resulting in a patient sequence in .parquet format.
Now I’m trying to pretrain the model by running the pretrain.py script, but execution fails with KeyError: 'type_tokens'.
It seems the model requires input columns: ["concept_ids"], ["type_ids"], ["time_stamps"], ["ages"], ["visit_orders"], and ["visit_segments"].
From what I understand, these can be added by setting the additional_token_types argument when initializing PretrainDataset like so:
PretrainDataset(
    data=pre_train,
    tokenizer=tokenizer,
    max_len=args.max_len,
    mask_prob=args.mask_prob,
    additional_token_types=['type_ids', 'ages', 'time_stamps', 'visit_orders', 'visit_segments'],
    padding_side=args.padding_side,
)
However, the patient sequence produced by preprocessing does not contain the columns these token types appear to map to, ['type_tokens', 'age_tokens', 'time_tokens', 'position_tokens', 'visit_tokens'], so pretrain.py fails with the KeyError above.
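To make the gap concrete, here is a minimal sketch of how I compared my output with what the script seems to expect. It assumes the column names from the error message are the ones pretrain.py actually needs; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names of my own, not part of the project.

```python
# Columns pretrain.py appears to look up, taken from the KeyError and the
# list above. This is my assumption, not a confirmed schema.
REQUIRED_COLUMNS = [
    "type_tokens",
    "age_tokens",
    "time_tokens",
    "position_tokens",
    "visit_tokens",
]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their original order."""
    present = set(available)
    return [name for name in required if name not in present]


# With a real file, the names could come from pandas.read_parquet(path).columns.
# The list below mirrors what my preprocessing output currently contains.
current_columns = ["subject_id", "code"]
print(missing_columns(current_columns))
```

With my current output, every one of the five columns is reported as missing.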
I’d like to ask for more information so I can resolve this and proceed further:
- What columns should the patient sequence contain after preprocessing? My current result only includes ['subject_id', 'code'].
- Is there an additional processing step required for the patient sequence before running pretrain.py? If so, could you provide details or code for that step?
- Do you have more detailed instructions or examples covering the entire process, from preprocessing through pretraining to fine-tuning?