MOJO

MOJO (MultiOmics JOint representation learning) is a model that learns joint representations of bulk RNA-seq and DNA methylation data through bimodal masked language modeling. It is tailored for cancer-type classification and survival analysis on the TCGA dataset.

Developed by: InstaDeep

Model Sources

How to use

Until the next release of the transformers library, it must be installed from source with the command below in order to use the model. PyTorch must also be installed.

pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
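
To confirm that both packages are visible to the current Python environment before loading the model, a quick check along the following lines can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of either package.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages named in the install instructions.
for package in ("transformers", "torch"):
    print(package, installed_version(package) or "not installed")
```

A source install of transformers typically reports a development version string, which is one way to tell it apart from a released build.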

Other notes

The parameters of the JAX version of MOJO are also provided in jax_params.

The snippet below runs inference with the model on bulk RNA-seq and DNA methylation samples from the TCGA dataset.

import numpy as np
import pandas as pd
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/MOJO", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "InstaDeepAI/MOJO",
    trust_remote_code=True,
)

n_examples = 4
omic_dict = {}

for omic in ["rnaseq", "methylation"]:
    csv_path = hf_hub_download(
        repo_id="InstaDeepAI/MOJO",
        filename=f"data/tcga_{omic}_sample.csv",
        repo_type="model",
    )
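    # Drop the metadata columns and keep the first n_examples samples.
    omic_array = (
        pd.read_csv(csv_path)
        .drop(["identifier", "cohort"], axis=1)
        .to_numpy()[:n_examples, :]
    )
    if omic == "rnaseq":
        # RNA-seq values are log-transformed before tokenization.
        omic_array = np.log10(1 + omic_array)
    # The number of features must match the model's sequence length.
    assert omic_array.shape[1] == model.config.sequence_length
    omic_dict[omic] = omic_array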

omic_ids = {
    omic: tokens["input_ids"]
    for omic, tokens in tokenizer.batch_encode_plus(omic_dict, pad_to_fixed_length=True, return_tensors="pt").items()
}

# Average the embeddings over axis 1; the result can be used for downstream tasks.
omic_mean_embeddings = model(omic_ids)["after_transformer_embedding"].mean(axis=1)
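
As an illustration of how the mean embeddings might feed a downstream task, the sketch below fits a nearest-centroid classifier on pooled embeddings. It is not the method used by the authors. It assumes the embedding has shape (batch, sequence_length, embedding_dim), uses random NumPy arrays as stand-ins for the model output, and uses made-up labels; the array sizes and the helper names `fit_centroids` and `predict` are hypothetical. With the PyTorch model, the tensor would first need converting, for example with `.detach().cpu().numpy()`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model(omic_ids)["after_transformer_embedding"].
# Assumed shape: (batch, sequence_length, embedding_dim); sizes are illustrative only.
token_embeddings = rng.normal(size=(6, 10, 8))

# Mean pooling over axis 1 yields one vector per sample.
sample_embeddings = token_embeddings.mean(axis=1)  # shape (6, 8)

# Hypothetical class labels (e.g. two cohorts) for the six samples.
labels = np.array([0, 0, 0, 1, 1, 1])


def fit_centroids(embeddings, labels):
    """Compute one mean embedding (centroid) per class."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(embeddings, classes, centroids):
    """Assign each sample to the class with the closest centroid."""
    # Euclidean distances, shape (n_samples, n_classes).
    distances = np.linalg.norm(
        embeddings[:, None, :] - centroids[None, :, :], axis=-1
    )
    return classes[distances.argmin(axis=1)]


classes, centroids = fit_centroids(sample_embeddings, labels)
predictions = predict(sample_embeddings, classes, centroids)
print(predictions.shape)
```

In practice the classifier would be fitted on a training split and evaluated on held-out samples; the random inputs here only demonstrate the shapes involved.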

Citing our work

@article {G{\'e}lard2025.06.25.661237,
    author = {G{\'e}lard, Maxence and Benkirane, Hakim and Pierrot, Thomas and Richard, Guillaume and Courn{\`e}de, Paul-Henry},
    title = {Bimodal masked language modeling for bulk RNA-seq and DNA methylation representation learning},
    elocation-id = {2025.06.25.661237},
    year = {2025},
    doi = {10.1101/2025.06.25.661237},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/06/27/2025.06.25.661237},
    journal = {bioRxiv}
}

Model size: 52.3M params (Safetensors, tensor type F32)