library_name: biomed-multi-omic
license: apache-2.0
tags:
  - Biology
  - RNA
datasets:
  - PanglaoDB
  - CELLxGENE

ibm-research/biomed.rna.bert.110m.wced.v1

Biomedical foundation models for omics data. This package supports the development of foundation models for scRNA or DNA data.

biomed-multi-omic enables development and testing of foundation models for DNA sequences and RNA expression, with modular model and training methods for pretraining and fine-tuning, controllable via a declarative no-code interface. biomed-multi-omic builds on anndata, HuggingFace Transformers, PyTorch Lightning and Hydra.

  • 🧬 A single package for DNA and RNA foundation models. scRNA pretraining on h5ad files or TileDB (e.g. CELLxGENE); DNA pretraining on the reference human genome (GRCh38/hg38) and on a variant-imputed genome based on common SNPs available from the GWAS Catalog and ClinVar datasets.
  • 🚀 Leverages the latest open-source tools: anndata, HuggingFace Transformers and PyTorch Lightning.
  • 📈 Zero-shot and fine-tuning support for diverse downstream tasks: cell type annotation and perturbation prediction for scRNA; promoter prediction and regulatory region prediction using massively parallel reporter assays (MPRAs) for DNA sequences.
  • Novel pretraining strategies for scRNA and DNA implemented alongside existing methods to enable experimentation and comparison.

For details on how the models were trained, please refer to the BMFM-RNA preprint.

Checkpoint

Whole-cell Expression Decoder (WCED): Using the BMFM-RNA framework, we implemented a new pretraining objective centered on predicting the expression levels for the whole cell at once, rather than only for the masked genes.
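
To make the contrast concrete, the sketch below compares a loss computed only on masked genes with a loss computed over every gene in the cell. It is an illustration only: the mean-squared-error loss, the function names and the toy values are assumptions, not the BMFM-RNA implementation; the manuscript describes the actual objective.

```python
from typing import Sequence


def masked_gene_loss(
    pred: Sequence[float], target: Sequence[float], masked: Sequence[bool]
) -> float:
    """Mean squared error over the masked genes only (hypothetical helper)."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, masked) if m]
    if not errors:
        raise ValueError("at least one gene must be masked")
    return sum(errors) / len(errors)


def whole_cell_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over every gene in the cell (hypothetical helper)."""
    errors = [(p - t) ** 2 for p, t in zip(pred, target)]
    if not errors:
        raise ValueError("the cell must contain at least one gene")
    return sum(errors) / len(errors)


# Toy cell with four genes; only the two middle genes are masked.
target = [0.0, 1.0, 2.0, 3.0]
pred = [0.5, 1.0, 2.0, 1.0]
masked = [False, True, True, False]

# The masked-only loss ignores the errors on the first and last gene,
# while the whole-cell loss penalizes them as well.
print("masked-gene loss:", masked_gene_loss(pred, target, masked))
print("whole-cell loss:", whole_cell_loss(pred, target))
```

In this toy case the masked-only loss cannot see the errors on the unmasked genes, which is the limitation a whole-cell objective is meant to address.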

WCED 10 pct: Trained using WCED with random gene order and log-normalization.

See section 2.3.4 of the BMFM-RNA manuscript for more details.
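
For readers unfamiliar with the two preprocessing terms, here is a minimal sketch of log-normalization and random gene ordering for a single cell. It assumes the common scRNA-seq convention of scaling each cell to a fixed total (10,000 here) before applying log(1 + x); the exact preprocessing used for this checkpoint is not stated here, and the function names and gene list are hypothetical.

```python
import math
import random


def log_normalize(counts, target_sum=10_000.0):
    """Scale raw counts to `target_sum` per cell, then apply log(1 + x).

    The target sum is an assumed convention, not a documented setting.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("cell has no counts")
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


def shuffle_genes(genes, values, seed=None):
    """Return genes and their values in a random order, kept aligned."""
    rng = random.Random(seed)
    order = list(range(len(genes)))
    rng.shuffle(order)
    return [genes[i] for i in order], [values[i] for i in order]


# Toy cell: four genes with raw counts.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7"]
raw_counts = [3, 0, 12, 5]

normalized = log_normalize(raw_counts)
shuffled_genes, shuffled_values = shuffle_genes(genes, normalized, seed=0)

for gene, value in zip(shuffled_genes, shuffled_values):
    print(f"{gene}\t{value:.3f}")
```

Shuffling genes and values with one shared permutation keeps each expression value attached to its gene, so only the order of the input changes.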

Usage

Using biomed.rna.bert.110m.wced.v1 requires the biomed-multi-omic codebase: https://github.com/BiomedSciAI/biomed-multi-omic

For installation, follow the instructions on GitHub.

RNA Inference

To get embeddings and predictions for scRNA data run:

export MY_DATA_FILE=... # path to h5ad file with raw counts and gene symbols
bmfm-targets-run -cn predict input_file=$MY_DATA_FILE working_dir=/tmp checkpoint=ibm-research/biomed.rna.bert.110m.wced.v1

For more details, see the RNA tutorials on GitHub.
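
When the prediction step has to run over many files, it can be convenient to assemble the command from Python. The sketch below only builds the argument list for the command shown above; `build_predict_command` and the file name `my_cells.h5ad` are hypothetical, and the arguments are limited to those that appear in the command above.

```python
import shlex


def build_predict_command(
    input_file,
    working_dir="/tmp",
    checkpoint="ibm-research/biomed.rna.bert.110m.wced.v1",
):
    """Return the bmfm-targets-run predict command as an argument list."""
    return [
        "bmfm-targets-run",
        "-cn",
        "predict",
        f"input_file={input_file}",
        f"working_dir={working_dir}",
        f"checkpoint={checkpoint}",
    ]


# Hypothetical input file: an h5ad file with raw counts and gene symbols.
command = build_predict_command("my_cells.h5ad")

# Print the shell-quoted command for inspection.
print(shlex.join(command))

# To execute it (requires biomed-multi-omic to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues for most paths; paths with unusual characters may still need additional quoting for Hydra overrides.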

Citation

@misc{dandala2025bmfmrnaopenframeworkbuilding,
      title={BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models},
      author={Bharath Dandala and Michael M. Danziger and Ella Barkan and Tanwi Biswas and Viatcheslav Gurev and Jianying Hu and Matthew Madgwick and Akira Koseki and Tal Kozlovski and Michal Rosen-Zvi and Yishai Shimoni and Ching-Huei Tsou},
      year={2025},
      eprint={2506.14861},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN},
      url={https://arxiv.org/abs/2506.14861},
}