thrumbel committed · Commit 9cbb4e4 (verified) · 1 parent: 01008c4

Create README.md

Files changed (1): README.md +74 -0
README.md ADDED
@@ -0,0 +1,74 @@
+ ---
+ library_name: biomed-multi-omic
+ license: apache-2.0
+ tags:
+ - Biology
+ - RNA
+ datasets:
+ - PanglaoDB
+ - CELLxGENE
+ ---
+
+ # ibm-research/biomed.rna.bert.110m.wced.multitask.v1
+
+ Biomedical foundation models for omics data. This package supports the development of foundation models for scRNA and DNA data.
+
+ `biomed-multi-omic` enables development and testing of foundation models for DNA sequences and for RNA expression,
+ with modular model and training methods for pretraining and fine-tuning, controllable via a declarative no-code interface.
+ `biomed-multi-omic` leverages anndata, HuggingFace Transformers, PyTorch Lightning and Hydra.
+
+ - 🧬 A single package for DNA and RNA foundation models. scRNA pretraining on h5ad files or TileDB (e.g., CellXGene); DNA pretraining on the reference human genome (GRCh38/hg38) and on variant-imputed genomes based on common SNPs from the GWAS Catalog and ClinVar datasets.
+ - 🚀 Leverages the latest open-source tools: anndata, HuggingFace Transformers and PyTorch Lightning.
+ - 📈 Zero-shot and fine-tuning support for diverse downstream tasks: cell type annotation and perturbation prediction for scRNA; promoter prediction and regulatory-region tasks using massively parallel reporter assays (MPRAs)
+ for DNA sequences.
+ - Novel pretraining strategies for scRNA and DNA implemented alongside existing methods to enable experimentation and comparison.
+
+ For details on how the models were trained, please refer to [the BMFM-RNA preprint](https://arxiv.org/abs/2506.14861).
+
+ - **Developers:** IBM Research
+ - **GitHub Repository:** [https://github.com/BiomedSciAI/biomed-multi-omic](https://github.com/BiomedSciAI/biomed-multi-omic)
+ - **Paper:** [BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models](https://arxiv.org/abs/2506.14861)
+ - **Release Date:** Jun 17th, 2025
+ - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ## Checkpoint
+
+ Whole-cell Expression Decoder (WCED): Using the BMFM-RNA framework, we implemented a new pretraining objective that predicts the expression levels for the whole cell at once, rather than only for the masked
+ genes.
+
+ Multitask objectives: multi-label classification (cell type, tissue, tissue general) and an adversarial loss to unlearn donor ID.
+
+ **WCED + Multitask:** Trained first using WCED with random gene order and log-normalization, then fine-tuned with multitask objectives.
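
As a rough illustration of the two input transforms named above, the sketch below log-normalizes one cell's raw counts and then shuffles the gene order. It is an assumption-laden toy, not the framework's code: the `target_sum` of 10,000 follows a common single-cell convention, reading "random gene order" as shuffling genes rather than ranking them by expression is an interpretation, and the function names and gene list are hypothetical.

```python
import math
import random


def log_normalize(counts, target_sum=10_000.0):
    """Scale one cell's raw counts to `target_sum`, then apply log(1 + x).

    `target_sum` is an assumed convention; the checkpoint's exact
    normalization is described in the BMFM-RNA manuscript.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


def random_gene_order(genes, values, seed=None):
    """Shuffle (gene, value) pairs together so each value stays with its gene."""
    pairs = list(zip(genes, values))
    random.Random(seed).shuffle(pairs)
    shuffled_genes = [g for g, _ in pairs]
    shuffled_values = [v for _, v in pairs]
    return shuffled_genes, shuffled_values


# Hypothetical toy cell with four genes and raw counts.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7"]
raw_counts = [3, 0, 12, 5]

normalized = log_normalize(raw_counts)
ordered_genes, ordered_values = random_gene_order(genes, normalized, seed=0)
```

Shuffling pairs rather than the two lists separately keeps every expression value attached to its gene, which is what matters once the order no longer carries information.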
+
+ See section 2.3.3 of [the BMFM-RNA manuscript](https://arxiv.org/abs/2506.14861) for more details.
+
+ ## Usage
+
+ Using `biomed.rna.bert.110m.wced.multitask.v1` requires the codebase at [https://github.com/BiomedSciAI/biomed-multi-omic](https://github.com/BiomedSciAI/biomed-multi-omic).
+
+ For installation, please follow the [instructions on GitHub](https://github.com/BiomedSciAI/biomed-multi-omic?tab=readme-ov-file#installation).
+
+ ## RNA Inference
+
+ To get embeddings and predictions for scRNA data, run:
+
+ ```bash
+ export MY_DATA_FILE=... # path to h5ad file with raw counts and gene symbols
+ bmfm-targets-run -cn predict input_file=$MY_DATA_FILE working_dir=/tmp checkpoint=ibm-research/biomed.rna.bert.110m.wced.multitask.v1
+ ```
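
When the same command has to be run over several files, it can be convenient to assemble it from Python. The sketch below only rebuilds the argument list shown above and prints it in dry-run mode. The helper names and the `.h5ad` file names are hypothetical, and it assumes `bmfm-targets-run` is on `PATH` once `dry_run=False` is used.

```python
import shlex
import subprocess

CHECKPOINT = "ibm-research/biomed.rna.bert.110m.wced.multitask.v1"


def build_predict_command(input_file, working_dir, checkpoint=CHECKPOINT):
    """Return the argument list for the predict run shown above."""
    return [
        "bmfm-targets-run",
        "-cn",
        "predict",
        f"input_file={input_file}",
        f"working_dir={working_dir}",
        f"checkpoint={checkpoint}",
    ]


def run_predict(input_file, working_dir, checkpoint=CHECKPOINT, dry_run=True):
    """Print the command (dry run) or execute it and raise on failure."""
    cmd = build_predict_command(input_file, working_dir, checkpoint)
    if dry_run:
        print(shlex.join(cmd))
        return None
    return subprocess.run(cmd, check=True)


# Hypothetical input files; replace with real h5ad paths.
for path in ["sample_a.h5ad", "sample_b.h5ad"]:
    run_predict(path, "/tmp")
```

The README does not say how outputs are named inside `working_dir`, so using a separate directory per input file is the cautious choice when batching.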
+
+ For more details, see the [RNA tutorials on GitHub](https://github.com/BiomedSciAI/biomed-multi-omic/tree/main/tutorials/RNA).
+
+ ## Citation
+
+ ```bibtex
+ @misc{dandala2025bmfmrnaopenframeworkbuilding,
+   title={BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models},
+   author={Bharath Dandala and Michael M. Danziger and Ella Barkan and Tanwi Biswas and Viatcheslav Gurev and Jianying Hu and Matthew Madgwick and Akira Koseki and Tal Kozlovski and Michal Rosen-Zvi and Yishai Shimoni and Ching-Huei Tsou},
+   year={2025},
+   eprint={2506.14861},
+   archivePrefix={arXiv},
+   primaryClass={q-bio.GN},
+   url={https://arxiv.org/abs/2506.14861},
+ }
+ ```