Llama-3.1-8B generates way too long answers!

#36, opened by ayyylemao

I have this wrapper that I use for Llama-3-8b-instruct text generation, together with a prefix function as a format enforcer for JSON. This works very well: even though I set max_new_tokens to 2048, the answers it gives back are usually concise.

Now with llama-3.1-8b-instruct the answers it generates are usually very long, often running right up to max_new_tokens.
Did something change in generation between 3.0 and 3.1?

    import json
    from typing import Dict

    import torch
    import transformers


    class Llama3:
        def __init__(self, model_id: str, device: str) -> None:
            self.model_id = model_id
            self.pipeline = transformers.pipeline(
                "text-generation",
                model=model_id,
                model_kwargs={"torch_dtype": torch.bfloat16},
                device_map=device,
            )
            # Stop on the regular EOS token or on Llama 3's end-of-turn token.
            self.terminators = [
                self.pipeline.tokenizer.eos_token_id,
                self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
            ]

        def generate_answer(self, user: str, system: str, format_enforcer: Dict,
                            max_new_tokens=2048, temperature=0.5, top_p=0.9):
            schema = format_enforcer['schema']
            prefix_function = format_enforcer['prefix_function']
            # The JSON schema is appended to the user prompt so the model sees the expected format.
            messages = [
                {"role": "system", "content": system},
                {"role": "user", "content": f'{user}{json.dumps(schema)}'},
            ]
            outputs = self.pipeline(
                messages,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.terminators,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,  # nucleus sampling cutoff; must lie in (0, 1]
                prefix_allowed_tokens_fn=prefix_function,
            )
            # With chat-style input, generated_text is the full message list;
            # the last entry is the newly generated assistant message.
            return outputs[0]["generated_text"][-1]
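
For context, this is roughly how the wrapper gets called. It is a minimal sketch only: the schema and prompts are invented, and it assumes the prefix function is built with the lm-format-enforcer library (the post only says "a prefix function as format enforcer for JSON", so the exact library is an assumption).

    # Illustrative usage only; schema and prompts are invented, and lm-format-enforcer
    # is an assumption about how the prefix function is built.
    from lmformatenforcer import JsonSchemaParser
    from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

    llm = Llama3("meta-llama/Meta-Llama-3.1-8B-Instruct", device="auto")

    schema = {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    }
    prefix_fn = build_transformers_prefix_allowed_tokens_fn(
        llm.pipeline.tokenizer, JsonSchemaParser(schema)
    )

    reply = llm.generate_answer(
        user="Answer the question below and reply as JSON matching this schema: ",
        system="You are a helpful assistant that replies only with valid JSON.",
        format_enforcer={"schema": schema, "prefix_function": prefix_fn},
    )
    print(reply["content"])  # the assistant message returned by the wrapper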

I think this is similar to the issue seen in other discussions on this repo: the outputs keep repeating until max_new_tokens is reached.
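
One thing worth ruling out (not a confirmed fix): the 3.1 Instruct models add an `<|eom_id|>` end-of-message token on top of `<|eot_id|>`, and their generation config lists several EOS ids, so a terminator list hardcoded for 3.0 may not cover every stop token the 3.1 chat format can emit. A sketch, reusing the `llm` wrapper instance from the usage example above:

    # Rebuild the terminator list from the 3.1 tokenizer instead of hardcoding it.
    tok = llm.pipeline.tokenizer  # `llm` is the Llama3 wrapper instance from above
    llm.terminators = [
        tok.eos_token_id,
        tok.convert_tokens_to_ids("<|eot_id|>"),
        tok.convert_tokens_to_ids("<|eom_id|>"),  # new in the 3.1 chat format
    ]

    # If the model still loops, a mild repetition_penalty (e.g. 1.1) is another
    # knob to try; the wrapper would need to pass it through to the pipeline call.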

I'm also seeing this issue, which is causing trouble for pipelines that need to deliver JSON-formatted outputs.
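
Until the root cause is pinned down, one workaround on the consuming side is to salvage the first complete JSON object from an over-long output and drop whatever repeats after it. A minimal sketch (it only helps when a full object was emitted before the looping started; a generation truncated mid-object will still fail):

    import json

    def extract_first_json(text: str) -> dict:
        """Return the first complete JSON object in `text`, ignoring any repeated
        or trailing output after it. Raises ValueError if none is found."""
        start = text.find("{")
        if start == -1:
            raise ValueError("no JSON object found in model output")
        obj, _end = json.JSONDecoder().raw_decode(text[start:])
        return obj

    # e.g. extract_first_json(reply["content"]) with the reply from the wrapper above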
