Training Data

#2
by jo-mengr - opened

Hi!
Love the model and am working with it for my PhD. Would it be possible for you to share the training dataset? I would like to train a ModernBERT model with a larger context window using the same objective.
Thanks!
Jonatan

Thank you, I appreciate it!

The dataset is just a random sample of PubMed title/abstract pairs. For each randomly selected article, a similar title is also found. I don't think it's hard to reproduce, and it could probably even be improved upon with good dataset engineering, analysis, and parameter tuning. PaperETL can handle all the PubMed article processing.
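
For anyone trying to reproduce this, the steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual pipeline: the records are made-up toy data standing in for parsed PubMed articles, `build_examples` is a hypothetical helper, and token-overlap (Jaccard) similarity is a stand-in because the thread does not say how the similar title is found.

```python
import random

# Toy records standing in for parsed PubMed articles (made up for illustration).
ARTICLES = [
    {"title": "Aspirin use and cardiovascular risk in adults",
     "abstract": "Toy abstract about aspirin and cardiovascular outcomes."},
    {"title": "Statin use and cardiovascular risk in older adults",
     "abstract": "Toy abstract about statins and cardiovascular outcomes."},
    {"title": "Deep learning for protein structure prediction",
     "abstract": "Toy abstract about neural networks and protein folding."},
    {"title": "Machine learning for protein function prediction",
     "abstract": "Toy abstract about classifiers and protein annotation."},
]


def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-overlap similarity between two strings (assumed stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_examples(articles, sample_size, seed=0):
    """Randomly sample articles, then attach the most similar other title."""
    rng = random.Random(seed)
    sampled = rng.sample(articles, min(sample_size, len(articles)))
    examples = []
    for article in sampled:
        # Candidate titles are all other articles in the pool.
        candidates = [a["title"] for a in articles if a is not article]
        similar = max(
            candidates,
            key=lambda t: jaccard(article["title"], t),
            default=None,
        )
        examples.append({
            "title": article["title"],
            "abstract": article["abstract"],
            "similar_title": similar,
        })
    return examples


if __name__ == "__main__":
    for example in build_examples(ARTICLES, sample_size=2):
        print(example["title"], "->", example["similar_title"])
```

At real scale, the pairwise comparison would likely need to be replaced with an embedding or index-based search, since scoring every title against every other title grows quadratically.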

There is also another model that uses a fine-tuned ModernBERT model as its base: https://huggingface.co/NeuML/bioclinical-modernbert-base-embeddings

Perfect! In that case I'll just use that model instead.
Thanks!

jo-mengr changed discussion status to closed
