Model Card for Model ID
We fine-tuned BiomedBERT on study descriptions from metagenomic projects sourced from MGnify. We applied masked language modeling (MLM) to this unlabelled text, focusing specifically on project study descriptions. By adapting the model to domain-specific text, it better captures the language and nuances of metagenomics study descriptions, which improves performance on downstream biome classification tasks.
Model Details
Model Description
- Developed by: SantiagoSanchezF
- Model type: Masked language model (BERT-style encoder)
- Language(s) (NLP): English
- License: [More Information Needed]
- Finetuned from model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
Downstream Use [optional]
This model is the base of SantiagoSanchezF/trapiche-biome-classifier.
Training Details
Training Data
[More Information Needed]
Training Procedure
The model was domain adapted by applying masked language modeling (MLM) to a corpus of study descriptions derived from metagenomic projects in MGnify. The input text was tokenized with a maximum sequence length of 256 tokens. A data collator was configured to randomly mask 15% of the input tokens for the MLM task. Training was performed with a batch size of 8, over 3 epochs, and with a learning rate of 5e-5.
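The masking step described above can be sketched in plain Python. This is a minimal, self-contained illustration of the collator's 15% random-masking behaviour, not the actual training code; the mask token id and the `-100` ignore-label follow BERT-style conventions (in practice, Hugging Face's `DataCollatorForLanguageModeling` with `mlm_probability=0.15` handles this, including refinements such as occasionally keeping or randomizing the selected tokens).

```python
import random

MASK_TOKEN_ID = 103   # [MASK] id in BERT-style vocabularies (assumed here)
MLM_PROBABILITY = 0.15

def mask_tokens(token_ids, rng=random):
    """Randomly replace ~15% of token ids with the mask token.

    Returns (masked_ids, labels), where labels hold the original id at
    masked positions and -100 elsewhere so the loss ignores them.
    """
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < MLM_PROBABILITY:
            masked.append(MASK_TOKEN_ID)
            labels.append(tid)      # model must recover the original token
        else:
            masked.append(tid)
            labels.append(-100)     # position ignored by the MLM loss
    return masked, labels
```

During fine-tuning, the model sees the masked sequence and is trained to predict the original tokens at the masked positions only.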
Citation [optional]
TBD