Tags: Fill-Mask · Transformers · Safetensors · English · roberta

Model Card for TICL

A RoBERTa model pre-trained on a 10M-word dataset using (Training Data) Influence-driven Curriculum Learning (TICL).

Model Details

See our paper at REDACTED for details on our method.

Model Description

This is a model submitted to the strict-small track of the 2025 BabyLM challenge.

  • Developed by: REDACTED
  • Funded by: REDACTED
  • Model type: Masked language model (RoBERTa, ~126M parameters)
  • Language(s): English (eng)
  • License: CC-By-4.0

Uses

This model was trained to demonstrate the effectiveness of a novel curriculum learning method over training in random order.
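
The model can be loaded with the standard transformers fill-mask pipeline. The following is a minimal usage sketch, assuming the checkpoint is published under the babylm-anon/TICL repository id; the example sentence is purely illustrative.

```python
from transformers import pipeline

# Load the masked-language-modelling head; RoBERTa tokenizers use "<mask>"
# as the mask string.
fill_mask = pipeline("fill-mask", model="babylm-anon/TICL")

# Print the top predictions and their probabilities for the masked position.
for prediction in fill_mask("The children listened to a <mask> before bed."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```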

Training Details

Training Data

We train on a 10M-word dataset built from the following existing corpora:

Data mix

Source                      Words       Share    Documents   Share
C1: Child Directed Speech   1,999,999   20.00%   360,533     33.68%
C2: Children's Books        1,999,995   20.00%    77,384      7.23%
C3: Dialogue                1,999,987   20.00%   349,650     32.67%
C4: Educational             1,999,999   20.00%   161,554     15.09%
C5: Written English         1,999,945   20.00%   121,200     11.32%

Training Procedure

We extract training-data influence estimates from models trained in random order and sort the training data based on that information, using the strategies detailed in the paper. This is the overall best-performing model in our experiments: it was trained in order of increasing influence and re-weighted with a lognormal filter; see the paper for details.
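
The exact influence estimator and lognormal re-weighting are defined in the paper. The snippet below is only an illustrative sketch of the general idea, assuming per-example influence scores are already available as a NumPy array; the names and filter parameters here (influence, s, scale) are placeholders, not values from the released code.

```python
import numpy as np
from scipy.stats import lognorm

# Placeholder influence estimates for 10,000 training examples.
rng = np.random.default_rng(0)
influence = rng.normal(size=10_000)

# Curriculum order: present examples in order of increasing influence.
order = np.argsort(influence)

# Normalized curriculum position in (0, 1] for each sorted example.
positions = np.linspace(1e-3, 1.0, num=len(order))

# Hypothetical lognormal re-weighting over curriculum positions,
# normalized to sum to one so it can act as a sampling/loss weight.
weights = lognorm.pdf(positions, s=0.5, scale=0.5)
weights /= weights.sum()

curriculum = order          # presentation order for training
position_weights = weights  # per-position example weights
```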

Training Hyperparameters

We employ a novel curriculum learning strategy in which the model is trained in a non-random order, seeing a total of 100M words.

Parameter                      Value

Shared Hyperparameters
Vocabulary size                52k
Hidden size                    768
Number of layers               12
Number of attention heads      12
Initializer range              0.02
Tie word embeddings            True

Model-Specific Settings
Max position embeddings        514
Intermediate (FFN) size        3072
Norm epsilon                   1e-5
Attention dropout              0.1
Activation function            gelu
Hidden dropout                 0.1

Training Setup
FP16                           False
Per-device batch size          32
Gradient accumulation steps    16
GPUs                           4
Adam β₁                        0.9
Adam β₂                        0.98
Adam ε                         1e-6
Weight decay                   0.01
Learning rate                  5e-4
LR scheduler                   polynomial
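
For reference, the table above corresponds roughly to the following transformers configuration and optimizer settings. This is a hedged reconstruction from the table, not the released training script; the output_dir value is illustrative.

```python
from transformers import RobertaConfig, TrainingArguments

# Architecture settings from the "Shared Hyperparameters" and
# "Model-Specific Settings" rows above.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
    layer_norm_eps=1e-5,
    attention_probs_dropout_prob=0.1,
    hidden_dropout_prob=0.1,
    hidden_act="gelu",
    initializer_range=0.02,
    tie_word_embeddings=True,
)

# Optimizer and schedule settings from the "Training Setup" rows above;
# the effective batch size is 32 * 16 accumulation steps * 4 GPUs = 2048.
training_args = TrainingArguments(
    output_dir="ticl-roberta",  # illustrative path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,
    learning_rate=5e-4,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="polynomial",
    fp16=False,
)
```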

Evaluation

We use the official evaluation pipeline of the 2025 BabyLM challenge.

Results

Task                      Score
(Super)GLUE               0.579
blimp_filtered            0.688
supplement_filtered       0.559
entity_tracking           0.302
ewok_filtered             0.509
wug_adj_nominalization    0.570
Macro accuracy            0.584

Model Card Contact

REDACTED
