Question about pretraining data

#1
by Doctor-Chad-PhD - opened

Hi, in your readme it says:

The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pretraining data followed by 1.5 trillion tokens of midtraining data with enhanced focus on mathematical reasoning and code generation.

If you can't disclose it, I understand, but do you have an idea of what topics those 6.5 trillion tokens mostly consisted of? Thanks.