---
license: cc-by-nc-4.0
tags:
- m42
- genomics
- biology
- GFM
- BioFM
- BioToken
---

# BioFM: A Biologically-Informed Genomic Foundation Model

BioFM is a genomic foundation model that addresses limitations in existing genomic sequence modeling. By introducing BioToken, a novel tokenization framework, BioFM encodes genomic variants and structural annotations with added biological context, enabling more nuanced and accurate representation learning.

![BioFM](figures/biotoken_biofm.png)

## Model Highlights

- With the introduction of BioToken, BioFM achieves competitive genomic prediction results using only 265 million parameters, significantly reducing computational requirements and training costs.
- BioFM demonstrates superior performance compared to specialized models such as Enformer and SpliceTransformer on key genomic tasks (expression prediction and sQTL prediction, respectively).
- BioFM excels at various genomic tasks that require long-range genomic context (e.g., expression prediction, coding/non-coding pathogenicity prediction, and sQTL prediction), outperforming existing genomic foundation models (GFMs).

## Model Details

- **Model developers:** M42 Health AI Team
- **Base architecture:** [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM)
- **Context length:**
  - **Training:** 6k tokens
  - **Inference:** 12k tokens
- **Training data:** 1000 Genomes
- **Input format:** Annotated DNA sequences using BioToken
- **Output options:**
  - DNA sequences only
  - Embeddings
- **License:** CC BY-NC 4.0
- **Publication:** [Paper link]()

## Model Inference

We developed the BioFM-Eval Python package for inference and embedding extraction from genomic sequences. Refer to the [BioFM-Eval](https://github.com/m42-health/biofm-eval/) library for setup and installation instructions.

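As a rough sizing aid before loading the model, the weight memory can be estimated from the parameter count and the dtype. The sketch below is only back-of-the-envelope arithmetic: it assumes the nominal 265 million parameters stated above, uses standard per-dtype byte sizes, and ignores activations, caches, and framework overhead. The helper `weight_memory_gib` is hypothetical and not part of BioFM-Eval.

```python
# Bytes per parameter for common dtypes (standard sizes, not BioFM-specific).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str = "bfloat16") -> float:
    """Approximate memory for the model weights alone, in GiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal size taken from the model card; the exact parameter count may differ.
NUM_PARAMS = 265_000_000

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB for weights")
```

Under these assumptions the weights take roughly 0.5 GiB in bfloat16 and about 1 GiB in float32; actual usage during inference will be higher and grows with context length and batch size.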
### Creating Variant Embeddings with BioFM

This guide will help you quickly generate BioFM embeddings for the variants in your VCF file. These embeddings are created using the method described in our publication.

```python
from biofm_eval import AnnotatedModel, AnnotationTokenizer, Embedder, VCFConverter
import torch

# Define paths to the pre-trained BioFM model and tokenizer
MODEL_PATH = "m42-health/BioFM-265M"
TOKENIZER_PATH = "m42-health/BioFM-265M"

# Load the pre-trained BioFM model and BioToken tokenizer
model = AnnotatedModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
)
tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH)

# Initialize the embedder using the model and tokenizer
embedder = Embedder(model, tokenizer)

# Set up the VCF converter with paths to gene annotations and reference genome
vcf_converter = VCFConverter(
    gene_annotation_path="./gencode.v38.annotation.gff3",
    reference_genome_path="./GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna",
)

# Convert a VCF file into an annotated dataset using BioTokens
annotated_dataset = vcf_converter.vcf_to_annotated_dataset(
    vcf_path="./HG01779_b.vcf.gz",
    max_variants=200,  # Set to None to process all variants in the VCF file
)

# Extract BioFM embeddings for all annotated variants
embeddings = embedder.get_dataset_embeddings(annotated_dataset)
print(embeddings)
# Example output (dict):
# {
#   'embeddings': array of shape (num_variants, 2*embedding_dim),  # Numeric embeddings for each variant
#   'labels': array of shape (num_variants,)  # Present only during supervised embedding extraction
# }
```

- Sample reference genome FASTA file: [download link](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/)
- Gene annotation file: [download link](https://www.gencodegenes.org/human/release_38.html)
- Sample VCF file from 1000 Genomes data: [download link](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/)

### Generation with BioFM

BioFM can generate genomic sequences
based on input DNA prompts.

```python
from biofm_eval import AnnotatedModel, AnnotationTokenizer, Generator
import torch

# Define paths to the pre-trained BioFM model and tokenizer
MODEL_PATH = "m42-health/BioFM-265M"
TOKENIZER_PATH = "m42-health/BioFM-265M"

# Load the pre-trained BioFM model and BioToken tokenizer
model = AnnotatedModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
)
tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH)

# Initialize the generator using the model and tokenizer
seq_generator = Generator(model, tokenizer)

# Generate DNA sequences
input_sequences = ['AGCT', 'GACTGCA']
output = seq_generator.generate(
    input_sequences,
    max_new_tokens=10,
    temperature=1.0,
    do_sample=True,
    top_k=4,
)
print(output)
# Example output: List[str] = ['AGCTACTCCCCTCC', 'GACTGCACCACTGTACT']
```

## Training Setup

Training was conducted on an NVIDIA DGX cluster with H100 GPUs, using PyTorch's Fully Sharded Data Parallel (FSDP) framework.

## Evaluation Results

To demonstrate the effectiveness of BioToken, we evaluated BioFM against strong supervised baselines: Enformer for gene expression prediction and SpliceTransformer for sQTL prediction.

- *Gene Expression Prediction:* BioFM matches Enformer's performance when both models use a 12K context, making it the first GFM to do so. Notably, Enformer fails to reach this performance level even with a 98K context.
- *sQTL Prediction:* BioFM significantly outperforms SpliceTransformer across all tissues, highlighting its robustness and generalizability.

| sQTL prediction | Expression prediction |
|---------|---------|
| ![Alt1](figures/sqtl_model_comparison.png) | ![Alt2](figures/expression_model_comparison.png) |

We further evaluated BioFM on the Variant Benchmark we curated and the Genomics Long-Range Benchmark.
- *Variant Benchmark:* Across a broad spectrum of [variant prediction tasks](https://huggingface.co/datasets/m42-health/variant-benchmark), BioFM outperforms other GFMs.
- *Long-Range Genomic Dependencies:* On the Genomics Long-Range Benchmark, BioFM sets new performance standards, surpassing previous GFMs that required extensive fine-tuning and longer genomic contexts. This highlights BioFM’s ability to capture and use long-range genomic dependencies.

| Variant benchmark | Genomics long-range benchmark |
|---------|---------|
| ![Alt1](figures/vb_heatmap_and_barh_max.png) | ![Alt2](figures/nt_lr_heatmap_and_barh.png) |

Please see the [paper]() for more results and ablations.

## Citation

```
@article {Medvedev2025.03.27.645711,
    author = {Medvedev, Aleksandr and Viswanathan, Karthik and Kanithi, Praveenkumar and Vishniakov, Kirill and Munjal, Prateek and Christophe, Clement and Pimentel, Marco AF and Rajan, Ronnie and Khan, Shadab},
    title = {BioToken and BioFM - Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models},
    elocation-id = {2025.03.27.645711},
    year = {2025},
    doi = {10.1101/2025.03.27.645711},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711},
    eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711.full.pdf},
    journal = {bioRxiv}
}
```