Model Card for meta-llama/Meta-Llama-3.1-8B-Instruct Fine-Tuned with PAPRIKA
This is a saved checkpoint from fine-tuning a meta-llama/Meta-Llama-3.1-8B-Instruct model, first with supervised fine-tuning and then with RPO, using the data and methodology described in our paper, "Training a Generally Curious Agent". In that work, we introduce PAPRIKA, a fine-tuning framework for teaching large language models (LLMs) strategic exploration.
Model Details
Model Description
This is the model card of a meta-llama/Meta-Llama-3.1-8B-Instruct model fine-tuned using PAPRIKA.
- Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct
Model Sources
- Repository: Official Code Release for the paper "Training a Generally Curious Agent"
- Paper: Training a Generally Curious Agent
- Project Website: Project Website
Training Details
Training Data
Our training dataset for supervised fine-tuning can be found here: SFT dataset
Similarly, the training dataset for preference fine-tuning can be found here: Preference learning dataset
Training Procedure
The linked Weights & Biases (Wandb) run shows the training loss per gradient step for both supervised fine-tuning and preference fine-tuning.
Training Hyperparameters
For supervised fine-tuning, we use the AdamW optimizer with a learning rate of 1e-6, a batch size of 32, and cosine annealing learning rate decay with a warmup ratio of 0.04; we train on a total of 17,181 trajectories.
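The learning rate schedule described above can be sketched as follows. This is a minimal illustration assuming linear warmup followed by cosine decay to zero (the behavior of Hugging Face's `get_cosine_schedule_with_warmup`); the card does not specify the exact scheduler implementation, so treat the details as assumptions.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float, warmup_ratio: float) -> float:
    """Cosine annealing with linear warmup.

    Linearly warms up to base_lr over warmup_ratio * total_steps steps,
    then follows a half-cosine decay from base_lr down to 0.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear warmup phase: scale base_lr by the fraction of warmup completed.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay phase: progress goes from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example for the SFT stage: 17,181 trajectories at batch size 32
# gives roughly 537 gradient steps per epoch (assuming one epoch).
total_steps = math.ceil(17_181 / 32)
peak_lr = lr_at_step(int(0.04 * total_steps), total_steps, 1e-6, 0.04)
```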
For preference fine-tuning, we use the RPO objective with the AdamW optimizer, a learning rate of 2e-7, a batch size of 32, and cosine annealing learning rate decay with a warmup ratio of 0.04; we train on a total of 5,260 (preferred, dispreferred) trajectory pairs.
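The card does not define the RPO objective; assuming it refers to the DPO loss regularized with an NLL term on the preferred trajectory (as in Pang et al., 2024, and TRL's `rpo_alpha` option), a per-pair loss can be sketched as below. The `beta` and `alpha` values are hypothetical placeholders, not values reported in this card.

```python
import math

def rpo_loss(logp_w_policy: float, logp_l_policy: float,
             logp_w_ref: float, logp_l_ref: float,
             num_tokens_w: int, beta: float = 0.1, alpha: float = 1.0) -> float:
    """Per-pair RPO loss sketch: DPO term plus an NLL regularizer on the
    preferred trajectory.

    logp_* are summed token log-probabilities of the preferred (w) and
    dispreferred (l) trajectories under the policy and reference models.
    """
    # DPO implicit-reward margin between preferred and dispreferred trajectories.
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    # -log sigmoid(margin): small when the policy prefers the chosen trajectory.
    dpo_term = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # Length-normalized negative log-likelihood of the preferred trajectory,
    # which keeps the policy anchored to the preferred data.
    nll_term = -logp_w_policy / num_tokens_w
    return dpo_term + alpha * nll_term
```

A larger margin in favor of the preferred trajectory lowers the DPO term, while the NLL term keeps the preferred trajectory likely under the policy.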
Hardware
This model was fine-tuned on 8 NVIDIA L40S GPUs.
Citation
BibTeX:
@misc{tajwar2025traininggenerallycuriousagent,
  title={Training a Generally Curious Agent},
  author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
  year={2025},
  eprint={2502.17543},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.17543},
}