Responses are incomplete and greetings are not handled

#150
by dev4sidra - opened

I have tried every possible way and changed the parameters, but the responses are still incomplete. In some cases it works, but for some queries it returns only half an answer. Greetings are not handled properly either; the model returns unrelated answers.

dev4sidra changed discussion status to closed
dev4sidra changed discussion status to open
Mistral AI_ org

Hi dev4sidra, how are you using the model?

I am facing the same issue as well.

Sample screenshot: [image attached]

Mistral AI_ org

I believe you are using the model for raw text completion rather than chat completion. I would recommend using it as described in the README:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")  # token ids with the chat template applied

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])  # note: the decoded text includes the prompt as well as the reply
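If you want to verify what the template actually feeds the model, apply_chat_template can also return the raw prompt string instead of token ids (an optional check, not part of the README snippet):

# Optional: inspect the prompt string the chat template builds.
# For this model, each user turn is wrapped in [INST] ... [/INST].
prompt_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt_text)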

This is how I got it working:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

def load_quantized_model(model_name: str):
    """
    Load the model with 4-bit NF4 quantization via bitsandbytes.

    :param model_name: Name or path of the model to be loaded.
    :return: Loaded quantized model.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config
    )

    return model

def initialize_tokenizer(model_name: str):
    """
    Initialize the tokenizer with the specified model_name.

    :param model_name: Name or path of the model for tokenizer initialization.
    :return: Initialized tokenizer.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.bos_token_id = 1  # Set beginning of sentence token id
    return tokenizer


model = load_quantized_model(model_name)

tokenizer = initialize_tokenizer(model_name)

# Define stop token ids (note: defined here but never passed to generate() below)
stop_token_ids = [0]

def generate_response(prompt):
    # Wrap the prompt in the instruction template the model was fine-tuned on.
    text = f"[INST] {prompt} [/INST]"
    encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    model_input = encoded.to(model.device)
    generated_ids = model.generate(**model_input, max_new_tokens=1000, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)
    # Strip the echoed prompt so only the model's reply is returned.
    return decoded[0].replace(text, '').strip()

# https://stackoverflow.com/questions/77803696/runtimeerror-cutlassf-no-kernel-found-to-launch-when-running-huggingface-tran
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

prompt = "How AI will replace Engineers"
response = generate_response(prompt)
print(response)
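Since the original complaint is half-finished answers, one quick diagnostic is to check whether generation stopped because the model emitted its end-of-sequence token or because it ran out of the max_new_tokens budget. A minimal sketch reusing the names from the code above (hit_token_budget is a hypothetical helper, not part of the original code):

def hit_token_budget(generated_ids, tokenizer):
    # If the last generated token is not EOS, the model was cut off by
    # max_new_tokens and the visible reply will look incomplete.
    return generated_ids[0, -1].item() != tokenizer.eos_token_id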

const response = await textGeneration({
    accessToken: apiKey,
    model: 'mistralai/Mistral-7B-Instruct-v0.2',
    inputs: inputText,
    parameters: {
        max_length: 1024,
        repetition_penalty: 1.03,
        temperature: 0.2, // Adjust for balance between creativity and relevance.
        top_p: 0.9, // Nucleus sampling: consider top 90% probability mass.
        top_k: 50, // Limits token choices to the top 50 most probable tokens.
    },
});

I am using it this way, but the responses are still incomplete. I have tried different ways of changing the parameters.
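One thing worth checking: as far as I can tell, the hosted text-generation endpoints control output length through max_new_tokens, and max_length may simply be ignored, which would leave a small default output cap in place and truncate replies. For comparison, a hedged Python sketch using huggingface_hub's InferenceClient (api_key and input_text stand in for your own values):

from huggingface_hub import InferenceClient

# api_key and input_text are placeholders for your own values.
client = InferenceClient(model="mistralai/Mistral-7B-Instruct-v0.2", token=api_key)

response = client.text_generation(
    f"[INST] {input_text} [/INST]",  # instruction template the model expects
    max_new_tokens=1024,  # output budget; raise this if replies get cut off
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.03,
)
print(response)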

Basically, I have a vector DB: each question retrieves the relevant data from the database, then I pass the query together with the retrieved results to the model, and it should generate a full response.
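For that kind of retrieval-augmented setup, the retrieved passages and the question generally need to be combined into a single [INST] ... [/INST] block before calling the model. A minimal sketch (build_rag_prompt, question, and context_chunks are hypothetical names, not from this thread):

def build_rag_prompt(question, context_chunks):
    # Join the vector-DB hits and the user question into one instruction
    # block so the model sees the context and the query together.
    context = "\n\n".join(context_chunks)
    return (
        "[INST] Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question} [/INST]"
    )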
