From Babble to Words
The models, tokenizers, and datasets used in "From Babble to Words", one of the winning BabyLM 2024 submissions, which explores phoneme-based training.
- Paper • 2410.22906 • Published
- phonemetransformers/IPA-BabyLM
  Viewer • Updated • 12.5M
- phonemetransformers/IPA-BabyLM-evaluation
  Preview • Updated
- phonemetransformers/babble-tokenizers
  Updated
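A minimal sketch of peeking at the two datasets listed above with the `datasets` library. The configuration and split names are not given on this page, so the helper discovers them at run time and simply takes the first of each. That choice, and the helper name `peek`, are assumptions for illustration only.

```python
from itertools import islice

# Dataset repositories named in this collection.
DATASET_REPOS = [
    "phonemetransformers/IPA-BabyLM",
    "phonemetransformers/IPA-BabyLM-evaluation",
]


def peek(repo_id, n=3):
    """Return the first n rows of the first config/split of a dataset repo.

    Hypothetical helper: picking the first config and split is an assumption,
    since the page does not say how the datasets are organized.
    """
    # Imported lazily so the module can be read without `datasets` installed.
    from datasets import (
        get_dataset_config_names,
        get_dataset_split_names,
        load_dataset,
    )

    config = get_dataset_config_names(repo_id)[0]
    split = get_dataset_split_names(repo_id, config)[0]
    # Streaming avoids downloading the full dataset just to inspect a few rows.
    stream = load_dataset(repo_id, config, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for row in peek(DATASET_REPOS[0]):
        print(row)
```

Running the script needs network access and the `datasets` package. The column names it prints are whatever the dataset defines; none are assumed here.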
- phonemetransformers/GPT2-85M-BPE-PHON
  Updated • 5
  Note: GPT2 with 85M non-embedding parameters trained using the BPE-PHON tokenizer.
- phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS
  Updated • 6
  Note: GPT2 with 85M non-embedding parameters trained using the BPE-PHON-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS
  Updated • 4
  Note: GPT2 with 85M non-embedding parameters trained using the CHAR-TXT-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-CHAR-PHON
  Updated • 12
  Note: GPT2 with 85M non-embedding parameters trained using the CHAR-PHON tokenizer.
- phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS
  Updated • 29
  Note: GPT2 with 85M non-embedding parameters trained using the CHAR-PHON-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-CHAR-TXT
  Updated • 11
  Note: GPT2 with 85M non-embedding parameters trained using the CHAR-TXT tokenizer.
- phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS
  Updated • 5
  Note: GPT2 with 85M non-embedding parameters trained using the BPE-TXT-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-BPE-TXT
  Updated • 4.57k
  Note: GPT2 with 85M non-embedding parameters trained using the BPE-TXT tokenizer.
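The eight model names above appear to follow one pattern: a tokenizer type (BPE or CHAR), an input representation (TXT or PHON), and an optional SPACELESS suffix. The sketch below enumerates the repository IDs from that pattern and shows one way a checkpoint might be loaded with `transformers`. It assumes each model repository ships its own tokenizer files and loads through the standard `Auto*` classes. Neither is confirmed on this page, and the tokenizers may instead need to be fetched from phonemetransformers/babble-tokenizers. The helper names are hypothetical.

```python
from itertools import product

ORG = "phonemetransformers"


def model_repo_ids(prefix="GPT2-85M"):
    """Enumerate the eight checkpoint IDs listed in this collection."""
    ids = []
    # tokenizer type x input representation x optional SPACELESS suffix
    for tok, rep, suffix in product(("BPE", "CHAR"), ("TXT", "PHON"), ("", "-SPACELESS")):
        ids.append(f"{ORG}/{prefix}-{tok}-{rep}{suffix}")
    return ids


def load_checkpoint(repo_id):
    """Load tokenizer and model for one checkpoint (needs network access).

    Assumption: the repo contains tokenizer files and a config that the
    standard Auto classes can resolve.
    """
    # Imported lazily so listing the IDs works without `transformers` installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    for repo_id in model_repo_ids():
        print(repo_id)
```

Generating the IDs from the three factors, rather than hard-coding them, makes it easy to loop over matched pairs, for example comparing each TXT model against its PHON counterpart.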