Model Card for Model ID
We fine-tuned BiomedBERT on study descriptions from metagenomic projects sourced from MGnify. We applied masked language modeling (MLM) to this unlabelled text, focusing specifically on project study descriptions. By adapting the model to domain-specific text, it better captures the language and nuances of metagenomics study descriptions, which improves performance on downstream biome classification tasks.
Model Details
Model Description
- Developed by: SantiagoSanchezF
- Model type: Masked language model (BERT-style encoder)
- Language(s) (NLP): English
- License: [More Information Needed]
- Finetuned from model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
Downstream Use [optional]
This model is the base of SantiagoSanchezF/trapiche-biome-classifier.
Training Details
Training Data
[More Information Needed]
Training Procedure
The model was domain adapted by applying masked language modeling (MLM) to a corpus of study descriptions derived from metagenomic projects in MGnify. The input text was tokenized with a maximum sequence length of 256 tokens. A data collator was configured to randomly mask 15% of the input tokens for the MLM task. Training was performed with a batch size of 8, over 3 epochs, and with a learning rate of 5e-5.
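The masking step described above can be sketched in plain Python. This is a minimal, self-contained illustration of the collator's 15% random-masking behaviour, not the actual training code; the mask token id and the `-100` ignore-label follow BERT-style conventions (in practice, Hugging Face's `DataCollatorForLanguageModeling` with `mlm_probability=0.15` handles this, including refinements such as occasionally keeping or randomizing the selected tokens).

```python
import random

MASK_TOKEN_ID = 103   # [MASK] id in BERT-style vocabularies (assumed here)
MLM_PROBABILITY = 0.15

def mask_tokens(token_ids, rng=random):
    """Randomly replace ~15% of token ids with the mask token.

    Returns (masked_ids, labels), where labels hold the original id at
    masked positions and -100 elsewhere so the loss ignores them.
    """
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < MLM_PROBABILITY:
            masked.append(MASK_TOKEN_ID)
            labels.append(tid)      # model must recover the original token
        else:
            masked.append(tid)
            labels.append(-100)     # position ignored by the MLM loss
    return masked, labels
```

During fine-tuning, the model sees the masked sequence and is trained to predict the original tokens at the masked positions only.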
Citation [optional]
TBD