Model Card for TICL
A RoBERTa model pre-trained on a 10M-word dataset using Training-data Influence-driven Curriculum Learning (TICL).
Model Details
See our paper at REDACTED for details on our method.
Model Description
This is a model submitted to the strict-small track of the 2025 BabyLM challenge.
- Developed by: REDACTED
- Funded by: REDACTED
- Model type: Masked language model
- Language(s) (NLP): eng
- License: CC-By-4.0
Model Sources
- Repository: https://anonymous.4open.science/r/cl-4B5C
Uses
This model was trained to demonstrate the effectiveness of a novel curriculum learning method over training in random order.
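Because this is a masked language model, it can be loaded with the Hugging Face `transformers` library for fill-mask inference or fine-tuning. The sketch below uses a placeholder model path, since the released checkpoint location is anonymized.

```python
from transformers import pipeline

# Placeholder path: substitute the released TICL checkpoint once de-anonymized.
fill_mask = pipeline("fill-mask", model="path/to/ticl-roberta")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
print(fill_mask("The child picked up the <mask> and started to read."))
```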
Training Details
Training Data
We train on a 10M-word dataset built from the following existing corpora:
- C1: Child Directed Speech
- CHILDES (MacWhinney 2000)
- C2: Children's Books
- Children Stories Text Corpus (Bensaid et al. 2021)
- Children's Book Test (Hill et al. 2016)
- C3: Dialogue
- OpenSubtitles (Lison and Tiedemann 2016)
- Switchboard Dialog Act Corpus (Stolcke et al. 2000)
- British National Corpus (BNC), dialogue portion
- C4: Educational
- Simple Wiki (Warstadt et al. 2023)
- QED (Abdelali et al. 2014)
- C5: Written English
- Standardized Project Gutenberg Corpus (Gerlach and Font-Clos 2018)
- Wikipedia (Warstadt et al. 2023)
Data mix
| Corpus | Words | % of Words | Documents | % of Documents |
|---|---|---|---|---|
| C1: Child Directed Speech | 1,999,999 | 20.00% | 360,533 | 33.68% |
| C2: Children's Books | 1,999,995 | 20.00% | 77,384 | 7.23% |
| C3: Dialogue | 1,999,987 | 20.00% | 349,650 | 32.67% |
| C4: Educational | 1,999,999 | 20.00% | 161,554 | 15.09% |
| C5: Written English | 1,999,945 | 20.00% | 121,200 | 11.32% |
Training Procedure
We extract training-data influence estimates from models trained in random order and then sort the training data by those estimates, using the various strategies detailed in the paper. This is the overall best-performing model in our experiments: it is trained in order of increasing influence and re-weighted with a lognormal filter; see the paper for details.
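As an illustration, a minimal sketch of the ordering step is shown below. It assumes per-example influence scores are already available and uses `scipy.stats.lognorm` over normalized curriculum positions as a stand-in for the lognormal re-weighting; the exact filter and its parameters follow the paper, not this sketch.

```python
import numpy as np
from scipy.stats import lognorm

def build_curriculum(examples, influence_scores, sigma=1.0):
    """Order examples by increasing influence and attach a per-example weight
    from a lognormal density over curriculum position (illustrative only)."""
    order = np.argsort(influence_scores)                # increasing influence
    positions = np.linspace(1e-3, 1.0, num=len(order))  # normalized position in the curriculum
    weights = lognorm.pdf(positions, s=sigma)           # assumed shape of the lognormal filter
    weights = weights / weights.mean()                  # keep the average weight at 1
    return [(examples[i], float(w)) for i, w in zip(order, weights)]
```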
Training Hyperparameters
We employ a novel curriculum learning strategy in which the model is trained on the data in a non-random order, for a total of 100M words of training (i.e., repeated passes over the 10M-word dataset).
| Parameter | Value |
|---|---|
| **Shared Hyperparameters** | |
| Vocabulary size | 52k |
| Hidden size | 768 |
| Number of layers | 12 |
| Number of attention heads | 12 |
| Initializer range | 0.02 |
| Tie word embeddings | True |
| **Model-Specific Settings** | |
| Max position embeddings | 514 |
| Intermediate (FFN) size | 3072 |
| Layer norm epsilon | 1e-5 |
| Attention dropout | 0.1 |
| Activation function | gelu |
| Hidden dropout | 0.1 |
| **Training Setup** | |
| FP16 | False |
| Per-device batch size | 32 |
| Gradient accumulation steps | 16 |
| GPUs | 4 |
| Adam β₁ | 0.9 |
| Adam β₂ | 0.98 |
| Adam ε | 1e-6 |
| Weight decay | 0.01 |
| Learning rate | 5e-4 |
| Scheduler | polynomial |
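For reference, a minimal sketch of how these settings map onto a standard `RobertaConfig` and `TrainingArguments` from the `transformers` library; the output directory is a placeholder, the tokenizer and data pipeline are omitted, and this is not our actual training script.

```python
from transformers import RobertaConfig, RobertaForMaskedLM, TrainingArguments

# Model architecture settings from the table above.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-5,
    initializer_range=0.02,
    tie_word_embeddings=True,
)
model = RobertaForMaskedLM(config)

# Optimization settings from the table above.
training_args = TrainingArguments(
    output_dir="ticl-roberta",        # placeholder output path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,
    learning_rate=5e-4,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="polynomial",
    fp16=False,
)
```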
Evaluation
We use the evaluation pipeline of the 2025 BabyLM challenge.
Results
| Task | Score |
|---|---|
| (Super) GLUE | 0.579 |
| blimp_filtered | 0.688 |
| supplement_filtered | 0.559 |
| entity_tracking | 0.302 |
| ewok_filtered | 0.509 |
| wug_adj_nominalization | 0.570 |
| **Macro accuracy** | 0.584 |
Model Card Contact
REDACTED