
Model Card for PAPRIKA Fine-Tuned Meta-Llama-3.1-8B-Instruct

This is a saved checkpoint of meta-llama/Meta-Llama-3.1-8B-Instruct trained with supervised fine-tuning followed by RPO, using the data and methodology described in our paper, "Training a Generally Curious Agent". In that work, we introduce PAPRIKA, a fine-tuning framework for teaching large language models (LLMs) strategic exploration.

Model Details

Model Description

This is the model card of a meta-llama/Meta-Llama-3.1-8B-Instruct model fine-tuned using PAPRIKA.

  • Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct
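
The checkpoint can be loaded like any other Llama-3.1-based model with the transformers library. The snippet below is a minimal sketch rather than official inference code: the repository ID is a placeholder for this model's Hugging Face path, and the prompt is only illustrative.

# Minimal loading sketch (assumes a transformers version with Llama 3.1 support).
# "<this-repo-id>" is a placeholder for this model's Hugging Face repository path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative prompt: PAPRIKA targets multi-turn, exploration-heavy tasks such as twenty questions.
messages = [{"role": "user", "content": "Let's play twenty questions. Ask your first question."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))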

Model Sources

  • Paper: https://arxiv.org/abs/2502.17543

Training Details

Training Data

Our training dataset for supervised fine-tuning can be found here: SFT dataset

Similarly, the training dataset for preference fine-tuning can be found here: Preference learning dataset

Training Procedure

The attached Wandb link shows the training loss per gradient step during both supervised fine-tuning and preference fine-tuning.

Training Hyperparameters

For supervised fine-tuning, we use the AdamW optimizer with a learning rate of 1e-6, a batch size of 32, and cosine annealing learning rate decay with a warmup ratio of 0.04; we train on a total of 17,181 trajectories.
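
As an illustration, these settings map roughly onto Hugging Face TrainingArguments as in the sketch below; the per-device batch split, precision, and epoch count are assumptions, not the released training configuration.

# Hypothetical mapping of the SFT hyperparameters above onto transformers.TrainingArguments.
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="paprika-sft",           # placeholder output path
    optim="adamw_torch",                # AdamW optimizer
    learning_rate=1e-6,
    per_device_train_batch_size=4,      # assumed split: 4 per GPU x 8 GPUs = effective batch size 32
    lr_scheduler_type="cosine",         # cosine annealing decay
    warmup_ratio=0.04,
    bf16=True,                          # assumed mixed precision
    num_train_epochs=1,                 # assumed; the card only states the trajectory count
)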

For preference fine-tuning, we use the RPO objective with the AdamW optimizer, a learning rate of 2e-7, a batch size of 32, and cosine annealing learning rate decay with a warmup ratio of 0.04; we train on a total of 5,260 (preferred, dispreferred) trajectory pairs.
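
A matching preference-tuning configuration could look like the following sketch with TRL's DPOConfig, assuming a TRL version whose DPOTrainer exposes the RPO-style loss via rpo_alpha; the rpo_alpha value and batch split are assumptions.

# Hypothetical mapping of the preference fine-tuning hyperparameters onto trl.DPOConfig.
from trl import DPOConfig

rpo_args = DPOConfig(
    output_dir="paprika-rpo",           # placeholder output path
    optim="adamw_torch",                # AdamW optimizer
    learning_rate=2e-7,
    per_device_train_batch_size=4,      # assumed split: 4 per GPU x 8 GPUs = effective batch size 32
    lr_scheduler_type="cosine",         # cosine annealing decay
    warmup_ratio=0.04,
    rpo_alpha=1.0,                      # assumed weight for the RPO NLL term
    bf16=True,                          # assumed mixed precision
)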

Hardware

This model was fine-tuned using 8 NVIDIA L40S GPUs.

Citation

BibTeX:

@misc{tajwar2025traininggenerallycuriousagent,
      title={Training a Generally Curious Agent}, 
      author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
      year={2025},
      eprint={2502.17543},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.17543}, 
}

Model Card Contact

Fahim Tajwar
