---
license: cc-by-nc-4.0
---

Mistral-based 500M-parameter decoder-only genomic foundation model (GFM) trained on 50 genomes from the 1000 Genomes Project with a sequence length of 4K nucleotides. It serves as a baseline for the gfm-random-eval paper.

## Model Details

- **Model developers:** M42 Health AI Team
- **Base architecture:** [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM)
- **Context length:**
  - **Training:** 4k tokens
  - **Inference:** 4k tokens
- **Training data:** 1000 Genomes Project
- **Input format:** Raw DNA sequences
- **Output options:**
  - DNA sequences only
  - Embeddings
- **License:** CC BY-NC 4.0
- **Publication:** [paper link](https://www.biorxiv.org/content/10.1101/2024.12.18.628606v2)
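
## Usage

Since the model is a standard `MistralForCausalLM`, it should load through the usual 🤗 Transformers causal-LM interface. The sketch below shows both output options listed above, sequence continuation and embedding extraction. The repository id `m42-health/gfm-mistral-500m` is a placeholder assumption, not confirmed by this card; substitute the actual Hub id.

```python
# Minimal usage sketch, assuming the standard transformers causal-LM API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m42-health/gfm-mistral-500m"  # hypothetical id; replace with the real one
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

dna = "ACGTACGTGGCTAGCTAACGT"  # raw DNA sequence, up to the 4k-token context

inputs = tokenizer(dna, return_tensors="pt")

# Option 1: generate a DNA sequence continuation.
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

# Option 2: extract embeddings by mean-pooling the last hidden state.
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1)  # shape: (batch, hidden_size)
print(embedding.shape)
```

Mean pooling over the last hidden state is one common choice for sequence-level embeddings from a decoder-only model; the paper may use a different pooling scheme.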