Comma Epsilon v0.1

Comma Epsilon v0.1 is an experimental finetune of Comma v0.1 2T, post-trained on commercially licensed* instruction and preference data.

Sample usage

This model is compatible with Hugging Face transformers. Feel free to experiment with sampling parameters; the values below are close to arbitrary.

from transformers import AutoTokenizer, LlamaForCausalLM

model_name = 'numiros/Comma-Epsilon-v0.1'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name,
                                         torch_dtype="auto",
                                         device_map="auto")

def generate(messages):
    # Render the conversation with the chat template and move it to the model device
    gen_input = tokenizer.apply_chat_template(messages,
                                              add_generation_prompt=True,
                                              return_tensors="pt",
                                              return_dict=True).to(model.device)
    input_ids = gen_input['input_ids']
    attention_mask = gen_input['attention_mask']
    generated_ids = model.generate(input_ids=input_ids,
                                   attention_mask=attention_mask,
                                   max_new_tokens=750,
                                   temperature=0.3,
                                   min_p=0.1,
                                   repetition_penalty=1.2,
                                   do_sample=True,
                                   eos_token_id=tokenizer.eos_token_id,
                                   pad_token_id=tokenizer.pad_token_id)
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:],
                                skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
    return response

# A simple (but inefficient) chat example:

def get_prompt():
    # Read lines until the user types /end, then join them into one message
    lines = []
    while True:
        x = input('> ')
        if x == '/end':
            return '\n'.join(lines).strip()
        lines.append(x)

empty_messages = [
    # {'role': 'system', 'content': 'You are a helpful assistant.'} # not recommended
]

print('Type /end on a new line to send your message, or /clear to reset the conversation')
messages = list(empty_messages)
while True:
    prompt = get_prompt()
    if prompt == '/clear':
        messages = list(empty_messages)  # reset the conversation history
        continue
    messages.append({'role': 'user', 'content': prompt})
    response = generate(messages)
    print(response)
    messages.append({'role': 'assistant', 'content': response})

Recipe

Datasets

The following datasets were used for the run:

SFT data mixture 1

This data mixture comprised ~50M tokens across ~200k samples. The datasets below (other than Open Assistant v2) were sourced from https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses

Datasets used in full:

  • Dolly 15k
  • Open Assistant OctoPack
  • Open Assistant v2
  • Aya Dataset
  • StarCoder Self-Instruct
  • Joke Explanation

Datasets sampled (approximate token count of the sampled fraction in parentheses):

  • Flan Collection (Chain-of-Thought) (~12M tokens)
  • Flan Collection (Super-NaturalInstructions) (~12M tokens)
  • Tasksource Instruct (~10M tokens)
  • OIG (~5M tokens)
  • Flan Collection (Flan 2021) (~3M tokens)
  • CommitPackFT (~1M tokens)
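
For illustration, the token-budgeted sampling described above could look roughly like the helper below, written with the datasets library. This is a hypothetical sketch, not the actual preprocessing code; the "text" field name and the dataset id in the usage comment are assumptions.

from datasets import load_dataset

def sample_to_token_budget(dataset, tokenizer, budget_tokens, seed=42):
    # Shuffle deterministically, then keep examples until the budget is hit
    dataset = dataset.shuffle(seed=seed)
    kept, total = [], 0
    for example in dataset:
        n = len(tokenizer(example["text"])["input_ids"])  # assumed field name
        if total + n > budget_tokens:
            break
        kept.append(example)
        total += n
    return kept

# e.g. ~12M tokens from a Flan Chain-of-Thought split (placeholder dataset id):
# cot = sample_to_token_budget(load_dataset("some/flan-cot", split="train"),
#                              tokenizer, budget_tokens=12_000_000)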

SFT data mixture 2

Same as SFT data mixture 1, but with a different seed wherever sampling is done.

DPO data mixture

Effectively a ~10% sample of the HH-RLHF dataset, due to early stopping.

Training details

Training was done on 2 Nvidia RTX 3090s using axolotl and took about 24 hours in total (12 + 10 + 2 across the three phases).

It consisted of three phases: SFT-1, SFT-2, and DPO. All training used relatively high-rank LoRA adapters, with the model loaded in 8-bit precision. The SFT stages used Cut Cross-Entropy loss and Liger kernels, while DPO did not.
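
As a rough sketch, that setup corresponds to something like the following with transformers, bitsandbytes, peft, and liger-kernel. The base-model repo id is an assumption, Cut Cross-Entropy was presumably enabled through the training framework and is omitted here, and the actual run was configured through axolotl rather than written out like this.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from liger_kernel.transformers import apply_liger_kernel_to_llama
from peft import prepare_model_for_kbit_training

apply_liger_kernel_to_llama()  # patch the Llama module classes before loading

base_name = "common-pile/comma-v0.1-2t"  # assumption: base model repo id
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(
    base_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit base
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prep quantized model for training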

SFT-1

Chat-template tokens were added to the tokenizer in this phase, so the embeddings and LM head were trained here as well (see the sketch after the parameter list).

Parameters:

  • 1 epoch
  • Peak LR = 4e-5, cosine annealing
  • Optimizer: AdamW 8-bit
  • Global batch size = 8
  • Warmup for 3% of the training
  • LoRA rank = 256, alpha = 512
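
In peft terms, this phase corresponds roughly to the sketch below, continuing from the loading sketch above. The chat-template token strings and target modules are assumptions; the run itself was configured through axolotl.

from peft import LoraConfig, get_peft_model

# Assumed chat-template tokens; the real token strings may differ.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
model.resize_token_embeddings(len(tokenizer))  # add rows for the new tokens

lora_config = LoraConfig(
    r=256,
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumption
    # Train the embeddings and LM head in full so the new tokens are learned:
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora_config)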

SFT-2

Parameters:

  • 1 epoch
  • Peak LR = 2e-5, cosine annealing
  • Optimizer: AdamW 8-bit
  • Global batch size = 8
  • Warmup for 3% of the training
  • LoRA rank = 128, alpha = 256

DPO

Parameters:

  • 0.1 epochs
  • Peak LR = 1e-5, cosine annealing
  • Optimizer: AdamW 8-bit
  • Global batch size = 32
  • Warmup for the first 20 steps
  • LoRA rank = 128, alpha = 256
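
For reference, the DPO phase maps onto trl roughly as below. This is not the actual axolotl setup; trl's API differs between versions, the beta value is an assumption, and the raw HH-RLHF dialogues must first be converted into prompt/chosen/rejected records.

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

pairs = load_dataset("Anthropic/hh-rlhf", split="train[:10%]")  # ~10% sample
# ...convert the raw chosen/rejected dialogues to prompt/chosen/rejected...

dpo_args = DPOConfig(
    output_dir="comma-epsilon-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # 2 GPUs x 2 x 8 = global batch size 32
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    optim="adamw_bnb_8bit",
    beta=0.1,                        # assumption: beta is not stated above
)
trainer = DPOTrainer(model=model, args=dpo_args,
                     train_dataset=pairs, processing_class=tokenizer)
trainer.train()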

Limitations & Intended Use

This model is best thought of as a research artifact, not a polished product. It is essentially the result of a single training run under significant data and compute constraints. For any production use case, you should consider performing an additional layer of fine-tuning and alignment; without that, it is suitable only for research or non-serious purposes.

Due to compute constraints and a scarcity of high-quality data, there was a notable lack of experimentation (read: this was a one-shot run, so things might be off). I didn't scale training to the usual post-training scales, nor did I do any form of RL for math/coding/structured outputs/tool use. I also did not perform mid-training, a costly but effective technique used in many SOTA models. Consequently, this model might not perform up to your expectations.

The model has limited preference alignment from a small sample of the HH-RLHF dataset and may generate misaligned outputs from time to time. Furthermore, it was not trained with a system prompt due to a lack of useful data, which can reduce its steerability.

I have not performed any specific debiasing. The training data is sourced from broad internet and instructional datasets and will inevitably contain the biases present in that data. The model can and will generate text that reflects these societal biases. Handle with care and be aware of this when using it for any downstream task.

All limitations from the base model also apply here, and I strongly recommend reviewing its model card.

Footnotes and disclaimer

*This is not legal advice.
