Question about pretraining data
by Doctor-Chad-PhD
Hi, in your readme it says:
The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pretraining data followed by 1.5 trillion tokens of midtraining data with enhanced focus on mathematical reasoning and code generation.
If you can't disclose, I understand, but do you have an idea of what topics those 6.5 trillion tokens mostly consisted of? Thanks.