Irrelevant text generation while prompting
I am asking Phi-2 to explain photosynthesis using two methods:
Greedy decoding (normal method)
import torch
from transformers import AutoModelForCausalLM

base_model_id = 'microsoft/phi-2'
eval_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, load_in_8bit=True)

with torch.no_grad():
    raw_op_greedy = eval_model.generate(**tok_eval_prompt, max_new_tokens=500, repetition_penalty=1.15)
Nucleus Sampling
with torch.no_grad():
    rnp = eval_model.generate(**tok_eval_prompt, max_new_tokens=500, repetition_penalty=1.15, do_sample=True, top_p=0.90, num_return_sequences=3)
Outputs:
In all cases the model adds some irrelevant piece of text after the explanation. I was wondering what the reason could be: is it the max_new_tokens
parameter? Do we need to set it explicitly for every query, after guessing the length at which Phi-2 won't add redundant text?
My second question is regarding sampling support for generated text. I noticed the statement below in the model card; does this mean that, along with beam search, top-k or top-p sampling is somehow irrelevant to Phi-2, and that it works best with greedy decoding only?
In the generation function, our model currently does not support beam search (num_beams > 1).
I tried multiple flavours (code generation, chat mode, instruction-output); the sampling method gave the worst results. Are there any specific reasons, or am I doing something wrong with sampling or with some other parameter?
The instruct template you're using has a typo: "Instruct: " should be used instead of "Instruction".
More information is in the technical reports for Phi.
Then it's probably because none of the Phi models are instruction-tuned or trained for chat use cases, so they don't know when to stop generating.
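One way to work around that is a custom stopping criterion that cuts generation off as soon as the model starts fabricating a new turn. Here is a minimal sketch of my own (not something the model card prescribes); it assumes the tokenizer and tok_eval_prompt from the question are already defined, and the "\nInstruct:" marker is just an illustrative choice:

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnSubstring(StoppingCriteria):
    def __init__(self, tokenizer, stop_string, prompt_length):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.prompt_length = prompt_length  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the newly generated tokens and stop once the marker shows up.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_length:], skip_special_tokens=True)
        return self.stop_string in new_text

prompt_len = tok_eval_prompt["input_ids"].shape[1]
stop_list = StoppingCriteriaList([StopOnSubstring(tokenizer, "\nInstruct:", prompt_len)])
with torch.no_grad():
    out = eval_model.generate(**tok_eval_prompt, max_new_tokens=500, stopping_criteria=stop_list)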
Normally you need to prefix the prompt with "Instruct:" as well as append an output termination string. The model emits an "end of text" string which you can pick up when it finishes. This usually works very well, though on occasion it will keep going until it reaches the output token limit. As people have mentioned here, this is a base model :) Here's what I'm using:
def generate_llm_response(model, tokenizer, device, prompt, max_length):
    output_termination = "\nOutput:"
    total_input = f"Instruct:{prompt}{output_termination}"
    inputs = tokenizer(total_input, return_tensors="pt", return_attention_mask=True)
    inputs = inputs.to(device)
    eos_token_id = tokenizer.eos_token_id
    outputs = model.generate(**inputs, max_length=max_length, eos_token_id=eos_token_id)
    # Find the position of "Output:" and extract the text after it
    generated_text = tokenizer.batch_decode(outputs)[0]
    # Split the text at "Output:" and take the second part
    split_text = generated_text.split("Output:", 1)
    assistant_response = split_text[1].strip() if len(split_text) > 1 else ""
    assistant_response = assistant_response.replace("<|endoftext|>", "").strip()
    return assistant_response
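For example, a hypothetical call would look like this (the model and tokenizer variable names are just placeholders for whatever you loaded):

device = "cuda" if torch.cuda.is_available() else "cpu"
answer = generate_llm_response(eval_model, tokenizer, device, "Explain photosynthesis.", max_length=500)
print(answer)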
I have created a naive solution for this problem which removes the extra text at the bottom of the answer. Please check the code here: github.com/YodaGitMaster/medium-phi2-deploy-finetune-llm
If you know a more elegant way to do it, please write me a message; I'm really looking forward to it.
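The core idea boils down to something like the sketch below (a simplified version for illustration, not the exact code from the repo; the marker strings are just examples):

# Truncate the decoded output at the first marker that usually starts a fabricated new section.
STOP_MARKERS = ["\nInstruct:", "\nQuestion:", "\nExercise", "<|endoftext|>"]

def trim_trailing_text(generated_text):
    cut = len(generated_text)
    for marker in STOP_MARKERS:
        idx = generated_text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return generated_text[:cut].strip()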