Question about the size of training tokens

#34
by feiyulv - opened

Hi, the StarCoder paper says the model was trained on 1 trillion tokens, but the StarCoder dataset card says the total dataset is about 250 billion tokens. How should I understand the difference? Is it caused by the FIM strategy?

The model was trained for multiple epochs to reach 1 trillion tokens: roughly four passes over the ~250B-token dataset, rather than anything added by FIM.
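For a quick sanity check, the two figures are consistent with about four epochs. A back-of-the-envelope sketch, using the approximate token counts quoted above:

```python
# Approximate figures from the thread, not exact training logs
dataset_tokens = 250e9   # ~250B tokens in the StarCoder dataset
trained_tokens = 1e12    # 1T tokens seen during training, per the paper

epochs = trained_tokens / dataset_tokens
print(f"~{epochs:.1f} epochs")  # -> ~4.0 epochs
```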

Thank you

feiyulv changed discussion status to closed