Question about pretraining data
by Doctor-Chad-PhD
Hi, in your readme it says:
The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pretraining data followed by 1.5 trillion tokens of midtraining data with enhanced focus on mathematical reasoning and code generation.
If you can't disclose, I understand, but do you have an idea of what topics those 6.5 trillion tokens mostly consisted of? Thanks.