TinyLlama-1.1B-DPO-Human-Like


library_name: transformers
model_name: TinyLlama-1.1B-DPO-HumanLike
tags:
- generated_from_trainer
- trl
- dpo
license: apache-2.0

Model Card for TinyLlama-1.1B-DPO-HumanLike

This model is a DPO-trained version of TinyLlama/TinyLlama-1.1B-Chat-v1.0. It was trained using TRL.

Quick start


import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="Josh1207/TinyLlama-1.1B-DPO-HumanLike", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...

Training procedure


from trl import DPOConfig

# Key hyperparameters reported for this run, mapped to TRL's DPOConfig
# argument names (batch_size is assumed to be the per-device batch size).
training_args = DPOConfig(
    loss_type="sigmoid",
    learning_rate=1e-5,
    beta=0.3,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
)

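A minimal sketch of how these arguments plug into TRL's DPOTrainer. The preference dataset name below is a placeholder, not the data actually used for this model; any dataset with "prompt", "chosen" and "rejected" columns follows the same pattern.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Placeholder dataset name; swap in a preference dataset with
# "prompt", "chosen" and "rejected" columns.
preference_data = load_dataset("your-preference-dataset", split="train")

trainer = DPOTrainer(
    model=model,                 # policy to optimize; a frozen copy serves as the reference
    args=training_args,          # the DPOConfig shown above
    train_dataset=preference_data,
    processing_class=tokenizer,
)
trainer.train()
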
This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
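For reference, the "sigmoid" loss named above scores each preference pair by the gap between the policy's and the reference model's log-probability ratios. A minimal sketch of the per-pair objective (not TRL's exact implementation):

import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.3):
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO sigmoid objective: -log sigmoid(reward margin), averaged over pairs.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()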

Framework versions

  • TRL: 0.19.1
  • Transformers: 4.53.1
  • PyTorch: 2.7.1
  • Datasets: 3.6.0
  • Tokenizers: 0.21.4.dev0

Citations

Cite DPO as:

@inproceedings{rafailov2023direct,
    title        = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
    author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
    year         = 2023,
    booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
    editor       = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}