MedTok: Multimodal Medical Code Tokenizer

Overview of MedTok

MedTok is a multimodal tokenizer for medical codes. It combines the text descriptions of codes with graph-based representations of the dependencies between codes, derived from clinical ontologies and standard medical terminologies. MedTok is a general-purpose tokenizer that can be integrated into any transformer-based model or system that requires tokenization.

How to use MedTok?

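from transformers import AutoTokenizer

# trust_remote_code=True lets transformers run the custom tokenizer code shipped in the model repository
tokenizer = AutoTokenizer.from_pretrained("mims-harvard/MedTok", trust_remote_code=True)

# Tokenize a medical code (E11.9 is the ICD-10-CM code for type 2 diabetes mellitus without complications)
tokens = tokenizer("E11.9")

# Quantized embedding for the same code
embed = tokenizer.embed("E11.9")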

Here, embed is the quantized embedding of the input medical code.

To use the tokenized embedding of each medical code, download it from mims-harvard/MedTok or directly from code2embeddings.json.zip. Place the downloaded embedding file at 'MedTok/embedding.npy' to run the EHR or QA tasks built on MedTok.
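As a minimal sketch of inspecting the downloaded archive before wiring it into a task, the helper below reads a zipped JSON file into a dictionary using only the standard library. The function name load_code_embeddings is hypothetical, and the file layout (a single .json member whose top level maps each code string to its embedding) is an assumption, not something this page confirms. Converting the result to the .npy file mentioned above is left out because the expected array layout is not documented here.

```python
import json
import zipfile


def load_code_embeddings(zip_path):
    """Return {medical_code: embedding} from a zipped JSON file.

    Assumed layout (not confirmed by the MedTok docs): the archive holds
    exactly one .json member whose top level is an object mapping each
    code string to its embedding.
    """
    with zipfile.ZipFile(zip_path) as archive:
        json_members = [n for n in archive.namelist() if n.endswith(".json")]
        if len(json_members) != 1:
            raise ValueError(
                f"expected exactly one .json member, found {json_members}"
            )
        # json.load accepts the binary file object returned by archive.open
        with archive.open(json_members[0]) as handle:
            table = json.load(handle)
    if not isinstance(table, dict):
        raise ValueError("expected a JSON object keyed by medical code")
    return table
```

If the real file turns out to be structured differently, the two ValueError checks should fail early rather than return a misleading table.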

🏥MedTok for EHR & MedicalQA

Please refer to our GitHub repository, MedTok.

Note

MedTok tokenizer V1.0 currently supports only the medical codes used in our paper. For unseen codes, the output will be the '' token. We will continue to update MedTok so that it covers more coding systems and tokenizes medical codes dynamically.
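Because unseen codes are not tokenized meaningfully, it can help to screen a batch of codes before passing them to the tokenizer. The sketch below is one possible way to do that; split_supported_codes is a hypothetical helper, and the vocabulary it checks against is an assumption (for example, the keys of the downloaded embedding table), not an API provided by MedTok.

```python
def split_supported_codes(codes, known_codes):
    """Partition codes into (supported, unsupported), preserving input order."""
    known = set(known_codes)
    supported, unsupported = [], []
    for code in codes:
        (supported if code in known else unsupported).append(code)
    return supported, unsupported


# Hypothetical vocabulary for illustration only; the real set of supported
# codes is whatever MedTok V1.0 was trained on.
vocabulary = {"E11.9", "I10"}
supported, unsupported = split_supported_codes(["E11.9", "ZZZ.0", "I10"], vocabulary)
print(unsupported)  # ['ZZZ.0']
```

Logging or counting the unsupported list gives a quick measure of how much of a dataset falls outside the current vocabulary.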

Citation

@article{su2025multimodal,
  title={Multimodal Medical Code Tokenizer},
  author={Su, Xiaorui and Messica, Shvat and Huang, Yepeng and Johnson, Ruth and Fesser, Lukas and Gao, Shanghua and Sahneh, Faryad and Zitnik, Marinka},
  journal={International Conference on Machine Learning, ICML},
  year={2025}
}

Contact

Thank you for your support! If you have any questions or suggestions, please email Xiaorui Su and Marinka Zitnik.
