How to structure the dataset for finetuning?

#19
by bkadezabek - opened

How should I structure the dataset to use Qwen2.5-Coder as the pretrained model and then finetune it for my specific use case? What would the JSONL file look like, and what columns would it have, e.g. "input", "output", or something else?

I am using this snippet with DataCollatorForCompletionOnlyLM and SFTTrainer from the Python trl package for supervised finetuning of Qwen2.5-Instruct models; it should work with the Coder ones as well.

from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

def formatting_prompts_func(batch):
    output_texts = []
    for i in range(len(batch['id'])):
        # res_key is provided via args to my script; it could be any column, basically the assistant response
        res = batch[res_key][i]
        text = f'''<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
{batch['prompt_messages'][i][0]['content']}<|im_end|>
<|im_start|>assistant
{res}<|im_end|>
'''
        output_texts.append(text)
    return output_texts

# Masks all tokens except the assistant response in the labels (sets them to -100), so loss is only computed on the assistant completion
collator = DataCollatorForCompletionOnlyLM('<|im_start|>assistant', tokenizer=tokenizer)

trainer = SFTTrainer(
    model_init=model_init,
    args=SFTConfig(**train_args.to_dict(), max_seq_length=7000),
    train_dataset=train_set,
    eval_dataset=test_set,
    processing_class=tokenizer,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)

As you can see, your dataset file can have any structure, since you can provide a formatting function for preprocessing; this is an easy way to preprocess SFT datasets.
Most trainer implementations / models use "input_ids" and "labels" fields, which are filled during preprocessing of the dataset and contain the token ids.
So you could also use the tokenizer to create those yourself. Depending on the use case, you will mask certain parts of the sequences so they are ignored during loss calculation (for example the prompt tokens in the labels), or strip the assistant response from the input ids for evaluation. Therefore you have to preprocess the train and eval sets separately.
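For illustration, here is a minimal sketch of that manual approach, assuming a JSONL dataset with hypothetical "prompt" and "response" columns (the column names are just an example, nothing requires them):

from transformers import AutoTokenizer

# Hypothetical JSONL row: {"prompt": "Write a function that ...", "response": "def foo(): ..."}
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-7B-Instruct')

def tokenize_example(example):
    # Everything up to and including the assistant header counts as prompt
    prompt = (
        '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n'
        f"<|im_start|>user\n{example['prompt']}<|im_end|>\n"
        '<|im_start|>assistant\n'
    )
    completion = f"{example['response']}<|im_end|>\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)['input_ids']
    completion_ids = tokenizer(completion, add_special_tokens=False)['input_ids']
    input_ids = prompt_ids + completion_ids
    # Mask the prompt tokens with -100 so only the completion contributes to the loss
    labels = [-100] * len(prompt_ids) + completion_ids
    return {'input_ids': input_ids, 'labels': labels}

# With a datasets.Dataset this could be applied as:
# train_set = train_set.map(tokenize_example, remove_columns=train_set.column_names)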

Another important part is max_seq_length: you can tokenize each sample of your dataset (including the assistant response) to dynamically get the maximum length, which will be used for padding.
If this number is too low, some sequences get cut off; if it is too big, you waste resources.
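For example, a quick sketch to get that number, assuming train_set is a datasets.Dataset with the columns used by formatting_prompts_func above:

# Tokenize every formatted sample (prompt + assistant response) and take the longest
texts = formatting_prompts_func(train_set[:])  # the whole dataset as one batch dict
lengths = [len(tokenizer(t)['input_ids']) for t in texts]
print('longest sample:', max(lengths), 'tokens')
# max(lengths), plus some headroom, is a reasonable value for max_seq_length in SFTConfig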

If you don't want to use basic supervised fine-tuning, you might have to dig deeper, since the preprocessing might differ.

Thank you for this detailed overview😊

@HondaVfr800 Hope I'm not wrong.
The code:

collator = DataCollatorForCompletionOnlyLM('<|im_start|>assistant', tokenizer=tokenizer)

...should be:

collator = DataCollatorForCompletionOnlyLM('<|im_start|>assistant\n', tokenizer=tokenizer)

There is a \n after assistant.

If the messages are:

messages = [                                                                                              
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Who are you"},
    {"role": "assistant", "content": "I'm Qwen."}
]                                                                                                

They will be encoded as:

<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho are you<|im_end|>\n<|im_start|>assistant\nI'm Qwen.<|im_end|>\n

Yes, you're right, there is a newline after the assistant. It would be interesting to see if it makes any noticeable difference during training, though I guess it doesn't.
If the newline is omitted from the response template, the newline token counts as part of the completion, so the model is also trained to generate it and the loss is calculated for that token.

from transformers import AutoTokenizer

messages = [
    {'role':'system','content':'Bla Bla System Message'},
    {'role':'user','content':'Bla Bla User Message'},
]

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

res = tokenizer.apply_chat_template(messages, skip_special_tokens=False, add_generation_prompt=True, tokenize=False)

print(res)

Produces:

<|im_start|>system
Bla Bla System Message<|im_end|>
<|im_start|>user
Bla Bla User Message<|im_end|>
<|im_start|>assistant
... Completion starting here
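
A quick way to check what each response template actually trains on (a sketch, using the example messages from above; exact token counts depend on the tokenizer):

from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

text = (
    '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n'
    "<|im_start|>user\nWho are you<|im_end|>\n"
    "<|im_start|>assistant\nI'm Qwen.<|im_end|>\n"
)
example = tokenizer(text)

for template in ('<|im_start|>assistant', '<|im_start|>assistant\n'):
    collator = DataCollatorForCompletionOnlyLM(template, tokenizer=tokenizer)
    batch = collator([example])
    # Label positions that are not -100 are the ones the loss is computed on
    kept = (batch['labels'] != -100).sum().item()
    print(repr(template), '-> loss computed on', kept, 'tokens')

Without the trailing newline in the template, the newline token should show up as one extra trained token.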