How to perform batch inference for Llama-3.2-1B-Instruct
When using the model for inference, I tried passing in a batch of inputs, but after generation I noticed that some items in the batch were generated incorrectly. For certain prompts, a long run of blank spaces is generated before the answer, while other outputs do not start with "assistant" and go straight to the answer. Does the model actually support processing a batch in parallel? What is the correct way to perform batch inference?
@helmoz, in my personal experience with transformers (for FlanT5 in particular), the best solution was to activate padding and truncation:
https://github.com/nicolay-r/nlp-thirdgate/blob/fd797d35f7c0e446671566f33c1b015dfed9bd75/llm/transformers_flan_t5.py#L26-L31
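The general idea in that snippet is simply to turn both options on in the tokenizer call. A minimal sketch (names like `batch_of_prompts` and the `max_length` value are illustrative, not from the linked code):

```python
# Tokenize a whole batch at once; padding/truncation make all sequences the same length.
inputs = tokenizer(
    batch_of_prompts,     # list of prompt strings
    return_tensors="pt",
    padding=True,         # pad shorter prompts up to the longest one in the batch
    truncation=True,      # cut prompts that exceed max_length
    max_length=512,       # assumed limit, adjust to your model
)
```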
This is important because all sequences in a batch have to be padded to the same length to form a single tensor.
Because of that, the behavior might not be 100% identical to single-example inference, since the inputs now include padding.
Tweaking other parameters, such as skipping special tokens when decoding the output, may be specific to each transformer model.
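For Llama-3.2-1B-Instruct in particular, decoder-only models are padded on the left for batched generation (transformers warns if right padding is detected), and the blank runs you describe are a common symptom of padding on the wrong side. A minimal sketch of batched generation, assuming the standard chat template; the prompts, dtype, and `max_new_tokens` are just placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Left padding for decoder-only generation; Llama has no pad token by default,
# so reuse the EOS token as the pad token.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = ["What is the capital of France?", "Explain beam search in one sentence."]

# Build chat-formatted strings; the template already inserts the BOS token,
# so add_special_tokens=False below avoids doubling it.
chats = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

inputs = tokenizer(
    chats,
    return_tensors="pt",
    padding=True,
    truncation=True,
    add_special_tokens=False,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping padding/special tokens.
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Slicing off the prompt portion before decoding is what keeps the "assistant" header and any padding out of the final strings, so each batch item returns just its answer text.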