Formats for prompting the model using Hugging Face

#103
by javalenzuela - opened

I am relatively new to prompting LLMs using Hugging Face and, after analyzing documentation from different sources related to Llama 3.1 (instruct version), I have several doubts about the different ways of using the Hugging Face pipeline for this model. More specifically, I have found two different methods for prompting this model:

First method. Simply providing the pipeline with a list of messages

This example is based on the official model card from Hugging Face. In this case, the model
is simply provided with a list of messages, each of them being a dictionary with the properties role and content.

import os
import transformers
import torch
from huggingface_hub import login

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

print("### Hugging face login")
login(os.environ["HF_TOKEN"])


pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="cuda",
    return_full_text=True
)

messages = [
    {"role": "system", "content": "You are an expert in a specific field."},
    {"role": "user", "content": "This is a simple question"}
]

outputs = pipeline(
    messages,
    max_new_tokens=1024,
)

print(outputs)
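
With chat-style input and return_full_text=True, outputs[0]["generated_text"] should contain the whole conversation, i.e. the original messages with the assistant reply appended (this is my understanding of the pipeline output, so treat the snippet below as a sketch):

# The last message of the returned conversation holds the assistant reply.
print(outputs[0]["generated_text"][-1]["content"])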

Second method. Explicitly applying the Llama 3.1 prompt template using the model tokenizer

This example is based on the model card from the Meta documentation and
some tutorials that apply fine-tuning to the model. In this case, the list of messages is passed to the tokenizer, which returns
a string in the prompt format used by Llama 3.1. This string is then used as input to the model pipeline.

import os
import transformers
import torch
from huggingface_hub import login

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

print("### Hugging face login")
login(os.environ["HF_TOKEN"])

llama31_pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="cuda",
    return_full_text=False
)

messages = [
    {"role": "system", "content": "You are an expert in a specific field."},
    {"role": "user", "content": "This is a simple question"}
]

prompt = llama31_pipeline.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

print("### Printing formatted prompt:")
print(prompt)

prediction = llama31_pipeline(
        prompt,
        max_new_tokens=1024,
    )

print("### Printing prediction:")
print(prediction)

Questions

My question is whether there are any important differences between these two methods (or whether both of them are correct). Originally, I thought they were equivalent,
but after running both of them on a specific task, I noticed that method 1 performed better than
method 2 (maybe if I run several executions and compute the average, they will turn out the same).

Are these two methods equivalent? Do they have advantages or disadvantages?

Is there any way of knowing the inner workings of the pipeline? I was thinking that maybe in method 2 I am applying the prompt
template twice, and that is the cause of the slightly worse performance.
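
For example, one quick check I thought of (a sketch, assuming the tokenizer prepends <|begin_of_text|> by default) is decoding the token ids that the pipeline would produce from the already templated prompt:

# Sketch: tokenize the templated prompt the way the pipeline would by default
# (add_special_tokens=True) and decode the first few ids back to text.
# Two leading <|begin_of_text|> tokens would mean special tokens are being
# added on top of an already formatted prompt.
ids = llama31_pipeline.tokenizer(prompt)["input_ids"]
print(llama31_pipeline.tokenizer.decode(ids[:10]))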

Best regards and thank you in advance.

According to the source code in transformers.pipelines.text_generation.py, the difference between your methods 1 and 2 is very small, if not nonexistent.

First, look at TextGenerationPipeline.__call__(); you will see this code:

        if isinstance(
            text_inputs, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)
        ) and isinstance(text_inputs[0], (list, tuple, dict)):
            # We have one or more prompts in list-of-dicts format, so this is chat mode
            if isinstance(text_inputs[0], dict):
                return super().__call__(Chat(text_inputs), **kwargs)
            else:
                chats = [Chat(chat) for chat in text_inputs]  # 🐈 🐈 🐈
                return super().__call__(chats, **kwargs)
        else:
            return super().__call__(text_inputs, **kwargs)

tl;dr For method 1 it will call super().__call__(Chat(text_inputs), **kwargs), and for method 2 it will call super().__call__(text_inputs, **kwargs).

The key difference of wrapping with Chat() can be found in the preprocess() function of the same class:

        if isinstance(prompt_text, Chat):
            tokenizer_kwargs.pop("add_special_tokens", None)  # ignore add_special_tokens on chats
            inputs = self.tokenizer.apply_chat_template(
                prompt_text.messages,
                add_generation_prompt=True,
                return_dict=True,
                return_tensors=self.framework,
                **tokenizer_kwargs,
            )
        else:
            inputs = self.tokenizer(prefix + prompt_text, return_tensors=self.framework, **tokenizer_kwargs)

As you can see, method 1 eventually also calls tokenizer.apply_chat_template() to apply the chat template. The code above generates the input_ids (token ids) immediately, while in your method 2 code you generate the templated text first and the pipeline then converts it to input_ids, which is basically the same thing.
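
As a rough sanity check (a sketch, not taken from the pipeline source; it assumes the rendered text is re-tokenized without extra special tokens), you can compare the token ids produced by the two routes:

tok = llama31_pipeline.tokenizer

# Route A: apply_chat_template tokenizes directly (what the Chat() path does).
ids_direct = tok.apply_chat_template(messages, add_generation_prompt=True)

# Route B: render the template to text first, then tokenize that text.
# add_special_tokens=False because the template already contains <|begin_of_text|>.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids_via_text = tok(text, add_special_tokens=False)["input_ids"]

print(ids_direct == ids_via_text)  # expected: True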

You might want to pay attention to the argument differences in apply_chat_template(), like "return_dict" and "return_tensors". But generally, I would expect both methods to behave the same. My 2 cents.

Oh, one more thing I missed. In your method 2, when you call the pipeline with the templated prompt, you should add the argument add_special_tokens=False so that it won't add another, duplicated "<|begin_of_text|>" at the head of the prompt. This argument is True by default, so you need to manually set it to False in the pipeline() call.
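
A minimal sketch of that fix, reusing the variables from your method 2 code:

prediction = llama31_pipeline(
    prompt,
    max_new_tokens=1024,
    add_special_tokens=False,  # the templated prompt already starts with <|begin_of_text|>
)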

Thank you for your detailed response! I think that I will continue using method 1.

Best regards,
Juan Carlos Alonso
