Data/Training requirements?
#4 opened by ToastyPigeon
Hi Gryphe, this model is really cool! I'm curious if you could share some details about what it took in terms of data and compute to make this model?
You do mention that you use mixed-source synthetic data and a full fine-tune, which is awesome. I'm wondering how much data in total (in tokens) went into this training run to get it behaving nicely from the base model. Can you give an approximate number? (Hundreds of millions? Billions?)
Approximately 34 million tokens, divided like this:
- 25% instruct, from my Sonnet dataset
- 25% generic roleplay data, much like my public charcard set
- 25% Pantheon-specific personas (300 scenarios per character)
- 25%ish text adventure data, borrowed from Wayfarer and some other stuff I had lying about
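For anyone wanting rough per-category token counts, here's a quick back-of-envelope sketch based on the ~34M total and the 25% splits above (the category labels are just shorthand for the list items, not official dataset names):

```python
# Approximate token budget per data category, assuming ~34M total
# and an even four-way 25% split as described above.
TOTAL_TOKENS = 34_000_000  # approximate figure from the reply

splits = {
    "instruct (Sonnet dataset)": 0.25,
    "generic roleplay (charcard-style)": 0.25,
    "Pantheon personas": 0.25,
    "text adventure (Wayfarer + misc)": 0.25,
}

for name, frac in splits.items():
    tokens = int(TOTAL_TOKENS * frac)
    print(f"{name}: ~{tokens:,} tokens")
```

So each slice works out to roughly 8.5M tokens, give or take the "25%ish" wiggle room.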