Llama-3.1-8B generates way too long answers!
#36
by ayyylemao
I have this wrapper which I use for Llama-3-8b-instruct text generation, together with a prefix function as a format enforcer for JSON. This works very well: even though max_new_tokens is set to 2048, the answers it returns are usually concise.
Now with Llama-3.1-8b-instruct the answers it generates are usually very long, often running right up to max_new_tokens.
Did something change in the generation behaviour from 3.0 to 3.1?
import json
from typing import Dict

import torch
import transformers


class Llama3:
    def __init__(self, model_id: str, device: str) -> None:
        self.model_id = model_id
        self.pipeline = transformers.pipeline(
            "text-generation",
            model=model_id,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map=device,
        )
        # Stop on the regular EOS token and on Llama 3's end-of-turn token.
        self.terminators = [
            self.pipeline.tokenizer.eos_token_id,
            self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    def generate_answer(self, user: str, system: str, format_enforcer: Dict,
                        max_new_tokens=2048, temperature=0.5, top_p=0.9):
        schema = format_enforcer['schema']
        prefix_function = format_enforcer['prefix_function']
        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": f'{user}{json.dumps(schema)}'},
        ]
        outputs = self.pipeline(
            messages,
            max_new_tokens=max_new_tokens,
            eos_token_id=self.terminators,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,  # nucleus sampling threshold; must be in (0, 1]
            prefix_allowed_tokens_fn=prefix_function,
        )
        # The pipeline returns the full chat; the last message is the model's reply.
        return outputs[0]["generated_text"][-1]
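For context, this is roughly how the wrapper gets called. The model path, schema, and prompts are just illustrative placeholders, and I'm assuming lm-format-enforcer as the library that builds the prefix function; my actual setup differs only in details:

from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

llm = Llama3("meta-llama/Meta-Llama-3.1-8B-Instruct", device="cuda")

schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
prefix_fn = build_transformers_prefix_allowed_tokens_fn(
    llm.pipeline.tokenizer, JsonSchemaParser(schema)
)

answer = llm.generate_answer(
    user="Summarize the document. Respond using this JSON schema: ",
    system="You are a helpful assistant that answers strictly in JSON.",
    format_enforcer={"schema": schema, "prefix_function": prefix_fn},
)
print(answer)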
I think this is similar to the issue reported in other discussions on this repo.
The outputs keep repeating until max_new_tokens is reached.
I'm also seeing this issue, which is causing trouble for pipelines that need to deliver JSON-formatted outputs.
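A minimal sketch of what I'm experimenting with as a workaround (not verified as the root cause or a proper fix): also treating Llama 3.1's new <|eom_id|> special token as a terminator and adding a mild repetition_penalty to damp the runaway repetition. It reuses the pipeline, messages and prefix_function names from the wrapper above:

# Workaround sketch, assuming the 3.1 tokenizer defines <|eom_id|>.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    pipeline.tokenizer.convert_tokens_to_ids("<|eom_id|>"),  # new in 3.1
]

outputs = pipeline(
    messages,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.1,  # values > 1.0 discourage repeated tokens
    prefix_allowed_tokens_fn=prefix_function,
)
print(outputs[0]["generated_text"][-1])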