generation is not good

by jmjzz - opened Mar 30, 2024

Discussion

jmjzz

Mar 30, 2024

I think this 4-bit version is not working well: the generated text contains too many random or garbage tokens.

MLDataScientist

Mar 31, 2024

@jmjzz
Can you please share the script you used to run the model? With the script provided in this repo, I am getting: RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'.

johnrachwanpruna

Pruna AI org Apr 1, 2024

Can you give us an example of the input you used?

johnrachwanpruna

Pruna AI org Apr 1, 2024

edited Apr 1, 2024

The generation for the base prompt looks good to me, @jmjzz:

jmjzz

Apr 1, 2024

I see. I'm also using the base prompt given on the DBRX Hugging Face page. Did you make any modifications?

johnrachwanpruna

Pruna AI org Apr 1, 2024

This is the exact script I used, @jmjzz:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the 4-bit model (replace hf_YOUR_TOKEN with your own access token)
tokenizer = AutoTokenizer.from_pretrained("PrunaAI/dbrx-instruct-bnb-4bit", trust_remote_code=True, token="hf_YOUR_TOKEN")
model = AutoModelForCausalLM.from_pretrained("PrunaAI/dbrx-instruct-bnb-4bit", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, token="hf_YOUR_TOKEN")

# Wrap the prompt in the chat template and move the tokenized inputs to the GPU
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# Generate up to 200 new tokens and print the decoded sequence (prompt included)
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

jmjzz

Apr 1, 2024

@johnrachwanpruna I see, the generation looks good now. But loading the model takes about 30 minutes, which is significantly slower than loading Mixtral 8x7B.

johnrachwanpruna

Pruna AI org Apr 1, 2024

@jmjzz For me, running the code snippet I showed you takes only 30 seconds.
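To compare load times such as the 30 minutes and 30 seconds reported here, it can help to time the load step on its own. The following is only a minimal sketch using the standard library; `timed` is a hypothetical helper name, not part of transformers, and the workload in the example is a stand-in for the real load call.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn() once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; swap the lambda for the real model-loading call.
    _, seconds = timed(lambda: sum(range(1_000_000)))
    print(f"took {seconds:.2f} s")
```

In the script above, the lambda could wrap the `AutoModelForCausalLM.from_pretrained(...)` call. Note that a first run also downloads the weights if they are not cached yet, so timing a second run may give a fairer comparison.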

jmjzz

Apr 1, 2024

@johnrachwanpruna Thanks, I think I solved the problem. BTW, after running some evaluations, I feel the 4-bit DBRX is weaker than the default Mixtral 8x7B. Have you tried evaluating it on any benchmarks?

sharpenb

Pruna AI org Apr 2, 2024

We have not benchmarked the quantized models yet.

johnrachwanpruna changed discussion status to closed Apr 3, 2024
