From Babble to Words
Collection
The models, tokenizers and datasets used in From Babble to Words, one of the winning BabyLM 2024 submissions, exploring phoneme-based training.
Tokenizers trained for From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes.
This repository contains the eight tokenizers trained for the project, covering every combination of the three transformations:
- Character-based tokenization (`CHAR`) vs. subword tokenization (`BPE`)
- Phonemic data (`PHON`) vs. orthographic data (`TXT`)
- Whitespace removed (`SPACELESS`) vs. whitespace kept

To load a tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('phonemetransformers/babble-tokenizers', subfolder='BABYLM-TOKENIZER-CHAR-TXT')
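
Since the eight tokenizers form a 2 x 2 x 2 grid, it can be handy to enumerate the variants programmatically. The sketch below is only illustrative: the naming pattern is an assumption extrapolated from the single documented subfolder, `BABYLM-TOKENIZER-CHAR-TXT`, and the helper names `candidate_subfolders` and `load_variant` are hypothetical. Check the repository's file listing for the actual subfolder names before relying on it.

```python
from itertools import product

# Repository ID taken from the loading example above.
REPO_ID = "phonemetransformers/babble-tokenizers"


def candidate_subfolders():
    """Enumerate the 2 x 2 x 2 grid of tokenizer variants.

    ASSUMPTION: every subfolder follows the pattern of the one documented
    name, 'BABYLM-TOKENIZER-CHAR-TXT', with '-SPACELESS' appended for the
    whitespace-free variants. The real names may differ.
    """
    names = []
    for tokenization, data, suffix in product(
        ("CHAR", "BPE"),      # character-based vs. subword
        ("PHON", "TXT"),      # phonemic vs. orthographic
        ("", "-SPACELESS"),   # whitespace kept vs. removed
    ):
        names.append(f"BABYLM-TOKENIZER-{tokenization}-{data}{suffix}")
    return names


def load_variant(subfolder):
    """Load one variant (requires network access and `transformers`)."""
    # Imported lazily so the enumeration above works without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(REPO_ID, subfolder=subfolder)


if __name__ == "__main__":
    for name in candidate_subfolders():
        print(name)
```

Calling `load_variant('BABYLM-TOKENIZER-CHAR-TXT')` is equivalent to the loading snippet above; any other name produced by `candidate_subfolders()` should be treated as a guess until confirmed against the repository.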