---
license: cc-by-nc-4.0
---

A Mistral-based 500M-parameter decoder-only genomic foundation model (GFM) trained on 50 genomes from the 1000 Genomes Project with a sequence length of 4K nucleotides. It serves as a baseline for the gfm-random-eval paper.
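
Below is a minimal usage sketch, assuming the model loads through the standard Hugging Face `transformers` causal-LM API; the repository ID is a placeholder you should replace with this repo's actual ID.

```python
# Minimal sketch: load the model and continue a raw DNA sequence.
# Assumes the standard transformers AutoModelForCausalLM interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder: substitute the actual repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

sequence = "ATGCGTACGTTAGC"  # raw DNA input, as described in Model Details
inputs = tokenizer(sequence, return_tensors="pt")

# Autoregressively generate a continuation of the DNA sequence
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0]))
```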

## Model Details

- **Model developers:** M42 Health AI Team
- **Base architecture:** MistralForCausalLM
- **Context length:**
  - Training: 4k tokens
  - Inference: 4k tokens
- **Training data:** 1000 Genomes
- **Input format:** Raw DNA sequences
- **Output options:**
  - DNA sequences only
  - Embeddings (see the sketch after this list)
- **License:** CC BY-NC 4.0
- **Publication:** paper link
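
The model card lists embeddings as an output option. A minimal sketch for extracting them is shown below, assuming the last hidden state (mean-pooled over tokens) is used as the sequence embedding; the pooling choice and the placeholder repository ID are assumptions, not prescribed by this card.

```python
# Minimal sketch: extract a sequence embedding from the hidden states.
# Assumes mean-pooling of the final hidden layer; other pooling schemes may be preferable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder: substitute the actual repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

sequence = "ATGCGTACGTTAGC"  # raw DNA input
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the final hidden layer to get one vector per sequence
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```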