How to perform batch inference for Llama-3.2-1B-Instruct
When using the model for inference, I tried passing in a batch of inputs, but after generation I noticed that some items in the batch were generated incorrectly. For certain prompts, a long run of blank spaces is generated before the answer, while other outputs do not start with "assistant" and go straight to the answer. Does the model actually support processing a batch in parallel? What is the correct way to perform batch inference?
@helmoz, in my personal experience with transformers (for FlanT5 in particular), the best solution was to activate padding and truncation:
https://github.com/nicolay-r/nlp-thirdgate/blob/fd797d35f7c0e446671566f33c1b015dfed9bd75/llm/transformers_flan_t5.py#L26-L31
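The general idea in that snippet is simply to turn both options on in the tokenizer call. A minimal sketch (names like `batch_of_prompts` and the `max_length` value are illustrative, not from the linked code):

```python
# Tokenize a whole batch at once; padding/truncation make all sequences the same length.
inputs = tokenizer(
    batch_of_prompts,     # list of prompt strings
    return_tensors="pt",
    padding=True,         # pad shorter prompts up to the longest one in the batch
    truncation=True,      # cut prompts that exceed max_length
    max_length=512,       # assumed limit, adjust to your model
)
```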
This is important because all sequences in a batch have to be padded to the same length to form a single tensor.
Because of that, the behavior might not be 100% identical to single-example inference, since the inputs now include padding.
Tweaking other parameters, such as skipping special tokens when decoding the output, may be specific to each transformer model.
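For Llama-3.2-1B-Instruct in particular, decoder-only models are padded on the left for batched generation (transformers warns if right padding is detected), and the blank runs you describe are a common symptom of padding on the wrong side. A minimal sketch of batched generation, assuming the standard chat template; the prompts, dtype, and `max_new_tokens` are just placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Left padding for decoder-only generation; Llama has no pad token by default,
# so reuse the EOS token as the pad token.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = ["What is the capital of France?", "Explain beam search in one sentence."]

# Build chat-formatted strings; the template already inserts the BOS token,
# so add_special_tokens=False below avoids doubling it.
chats = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

inputs = tokenizer(
    chats,
    return_tensors="pt",
    padding=True,
    truncation=True,
    add_special_tokens=False,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping padding/special tokens.
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Slicing off the prompt portion before decoding is what keeps the "assistant" header and any padding out of the final strings, so each batch item returns just its answer text.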