Model Card for TICL
A RoBERTa model pre-trained on a 10M-word dataset using Training-data Influence-driven Curriculum Learning (TICL).
Model Details
See our paper at REDACTED for details on our method.
Model Description
This is a model submitted to the strict-small track of the 2025 BabyLM challenge.
- Developed by: REDACTED
- Funded by: REDACTED
- Model type: Masked language model
- Language(s) (NLP): eng
- License: CC-By-4.0
Model Sources
- Repository: https://anonymous.4open.science/r/cl-4B5C
Uses
This model was trained to demonstrate the effectiveness of a novel curriculum learning method over training in random order.
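Because this is a masked language model, it can be loaded with the Hugging Face `transformers` library for fill-mask inference or fine-tuning. The sketch below uses a placeholder model path, since the released checkpoint location is anonymized.

```python
from transformers import pipeline

# Placeholder path: substitute the released TICL checkpoint once de-anonymized.
fill_mask = pipeline("fill-mask", model="path/to/ticl-roberta")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
print(fill_mask("The child picked up the <mask> and started to read."))
```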
Training Details
Training Data
We train on a 10M-word dataset built from the following existing corpora:
- C1: Child Directed Speech
- CHILDES (MacWhinney 2000)
- C2: Children's Books
- Children Stories Text Corpus (Bensaid et al. 2021)
- Children's Book Test (Hill et al. 2016)
- C3: Dialogue
- OpenSubtitles (Lison and Tiedemann 2016)
- Switchboard Dialog Act Corpus (Stolcke et al. 2000)
- British National Corpus (BNC), dialogue portion
- C4: Educational
- Simple Wiki (Warstadt et al. 2023)
- QED (Abdelali et al. 2014)
- C5: Written English
- Standardized Project Gutenberg Corpus (Gerlach and Font-Clos 2018)
- Wikipedia (Warstadt et al. 2023)
Data mix
| Corpus | Words | % of Words | Documents | % of Documents |
|---|---|---|---|---|
| C1: Child Directed Speech | 1,999,999 | 20.00% | 360,533 | 33.68% |
| C2: Children's Books | 1,999,995 | 20.00% | 77,384 | 7.23% |
| C3: Dialogue | 1,999,987 | 20.00% | 349,650 | 32.67% |
| C4: Educational | 1,999,999 | 20.00% | 161,554 | 15.09% |
| C5: Written English | 1,999,945 | 20.00% | 121,200 | 11.32% |
Training Procedure
We extract training-data influence estimates from models trained in random order and then sort the training data by those estimates, using the various strategies detailed in the paper. This is the overall best-performing model in our experiments: it is trained in order of increasing influence and re-weighted with a lognormal filter; see the paper for details.
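As an illustration, a minimal sketch of the ordering step is shown below. It assumes per-example influence scores are already available and uses `scipy.stats.lognorm` over normalized curriculum positions as a stand-in for the lognormal re-weighting; the exact filter and its parameters follow the paper, not this sketch.

```python
import numpy as np
from scipy.stats import lognorm

def build_curriculum(examples, influence_scores, sigma=1.0):
    """Order examples by increasing influence and attach a per-example weight
    from a lognormal density over curriculum position (illustrative only)."""
    order = np.argsort(influence_scores)                # increasing influence
    positions = np.linspace(1e-3, 1.0, num=len(order))  # normalized position in the curriculum
    weights = lognorm.pdf(positions, s=sigma)           # assumed shape of the lognormal filter
    weights = weights / weights.mean()                  # keep the average weight at 1
    return [(examples[i], float(w)) for i, w in zip(order, weights)]
```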
Training Hyperparameters
We employ a novel curriculum learning strategy in which the model is trained on the data in a non-random order, for a total of 100M words of training (i.e., repeated passes over the 10M-word dataset).
| Parameter | Value |
|---|---|
| **Shared Hyperparameters** | |
| Vocabulary size | 52k |
| Hidden size | 768 |
| Number of layers | 12 |
| Number of attention heads | 12 |
| Initializer range | 0.02 |
| Tie word embeddings | True |
| **Model-Specific Settings** | |
| Max position embeddings | 514 |
| Intermediate (FFN) size | 3072 |
| Layer norm epsilon | 1e-5 |
| Attention dropout | 0.1 |
| Activation function | gelu |
| Hidden dropout | 0.1 |
| **Training Setup** | |
| FP16 | False |
| Per-device batch size | 32 |
| Gradient accumulation steps | 16 |
| GPUs | 4 |
| Adam β₁ | 0.9 |
| Adam β₂ | 0.98 |
| Adam ε | 1e-6 |
| Weight decay | 0.01 |
| Learning rate | 5e-4 |
| Scheduler | polynomial |
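For reference, a minimal sketch of how these settings map onto a standard `RobertaConfig` and `TrainingArguments` from the `transformers` library; the output directory is a placeholder, the tokenizer and data pipeline are omitted, and this is not our actual training script.

```python
from transformers import RobertaConfig, RobertaForMaskedLM, TrainingArguments

# Model architecture settings from the table above.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-5,
    initializer_range=0.02,
    tie_word_embeddings=True,
)
model = RobertaForMaskedLM(config)

# Optimization settings from the table above.
training_args = TrainingArguments(
    output_dir="ticl-roberta",        # placeholder output path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,
    learning_rate=5e-4,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="polynomial",
    fp16=False,
)
```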
Evaluation
We use the evaluation pipeline of the 2025 BabyLM challenge.
Results
| Task | Score |
|---|---|
| (Super) GLUE | 0.579 |
| blimp_filtered | 0.688 |
| supplement_filtered | 0.559 |
| entity_tracking | 0.302 |
| ewok_filtered | 0.509 |
| wug_adj_nominalization | 0.570 |
| **Macro accuracy** | 0.584 |
Model Card Contact
REDACTED