Model Card for BiomedBERT_mgnify_studies

We fine-tuned BiomedBERT on study descriptions from metagenomic projects sourced from MGnify, applying masked language modeling (MLM) to this unlabelled text. By adapting the model to domain-specific text, it better captures the language and nuances of metagenomics study descriptions, which improves performance on downstream biome classification tasks.


Model Details

Model Description

  • Developed by: SantiagoSanchezF
  • Model type: MLM
  • Language(s) (NLP): English
  • License: [More Information Needed]
  • Finetuned from model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext

Downstream Use

This model is the base of SantiagoSanchezF/trapiche-biome-classifier.
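The model can also be queried directly for masked-token prediction. A minimal sketch, assuming the `transformers` library is installed; the example sentence and the helper name `top_fill_mask_predictions` are illustrative, not part of any API:

```python
from transformers import pipeline

# Model id on the Hugging Face Hub (from this model card).
MODEL_ID = "SantiagoSanchezF/BiomedBERT_mgnify_studies"

def top_fill_mask_predictions(text: str, k: int = 5):
    """Return the top-k token predictions for the [MASK] position in `text`."""
    fill = pipeline("fill-mask", model=MODEL_ID, top_k=k)
    return [pred["token_str"] for pred in fill(text)]

if __name__ == "__main__":
    # Hypothetical example sentence; any text containing one [MASK] works.
    print(top_fill_mask_predictions(
        "Metagenomic sequencing of [MASK] samples collected from hydrothermal vents."))
```

For biome classification itself, use the downstream classifier above rather than this base model.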

Training Details

Training Data

The training corpus consists of unlabelled study descriptions from metagenomic projects hosted in MGnify.

Training Procedure

The model was domain adapted by applying masked language modeling (MLM) to a corpus of study descriptions derived from metagenomic projects in MGnify. The input text was tokenized with a maximum sequence length of 256 tokens. A data collator was configured to randomly mask 15% of the input tokens for the MLM task. Training was performed with a batch size of 8, over 3 epochs, and with a learning rate of 5e-5.
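The collator's 15% masking step can be sketched in plain Python. This is a simplified, hypothetical illustration of BERT-style masking (the `mask_tokens` helper below is not a library function; in practice Hugging Face's `DataCollatorForLanguageModeling` performs this on token-id tensors):

```python
import random

MASK_TOKEN = "[MASK]"
MLM_PROBABILITY = 0.15  # fraction of tokens selected for prediction

def mask_tokens(tokens, vocab, rng):
    """BERT-style masking sketch: each token is selected with probability 0.15.
    Of the selected tokens, 80% become [MASK], 10% become a random vocabulary
    token, and 10% are left unchanged. Returns (inputs, labels), where labels
    holds the original token at selected positions and None elsewhere."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < MLM_PROBABILITY:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_TOKEN)
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)  # position ignored by the MLM loss
    return inputs, labels
```

The 80/10/10 split is the standard BERT recipe (and the default of `DataCollatorForLanguageModeling`): keeping some selected tokens unchanged or replaced at random discourages the model from only attending to literal `[MASK]` symbols.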

Citation

TBD
