Data/Training requirements?
#4 opened by ToastyPigeon
Hi Gryphe, this model is really cool! I'm curious if you could share some details about what it took in terms of data and compute to make this model?
You do mention that you use mixed-source synthetic data and a full fine-tune, which is awesome. I'm wondering how much data in total (in tokens) went into this training run to get it behaving nicely from the base model. Can you give an approximate number? (Hundreds of millions? Billions?)
Approximately 34 million tokens, divided like this:
- 25% instruct, from my Sonnet dataset
- 25% generic roleplay data, much like my public charcard set
- 25% Pantheon-specific personas (300 scenarios per character)
- 25%ish text adventure data, borrowed from Wayfarer and some other stuff I had lying about
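For anyone wanting rough per-category token counts, here's a quick back-of-envelope sketch based on the ~34M total and the 25% splits above (the category labels are just shorthand for the list items, not official dataset names):

```python
# Approximate token budget per data category, assuming ~34M total
# and an even four-way 25% split as described above.
TOTAL_TOKENS = 34_000_000  # approximate figure from the reply

splits = {
    "instruct (Sonnet dataset)": 0.25,
    "generic roleplay (charcard-style)": 0.25,
    "Pantheon personas": 0.25,
    "text adventure (Wayfarer + misc)": 0.25,
}

for name, frac in splits.items():
    tokens = int(TOTAL_TOKENS * frac)
    print(f"{name}: ~{tokens:,} tokens")
```

So each slice works out to roughly 8.5M tokens, give or take the "25%ish" wiggle room.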