I think this 4-bit version is not working well: the generations contain too many random or garbage tokens.

Can you please share the script you used to run the model? With the script provided in this repo, I get "Runtime error: LayerNormKernelImpl not implemented for Half".
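For anyone hitting the same message: in my experience this error usually shows up when half-precision (float16) tensors reach LayerNorm on the CPU, for example when no GPU is visible to PyTorch. That is an assumption about the cause here, not something confirmed in this thread. A minimal sketch of picking the device and dtype explicitly; `pick_device_and_dtype` is a hypothetical helper, not part of transformers or this repo:

```python
from typing import Tuple


def pick_device_and_dtype(cuda_available: bool) -> Tuple[str, str]:
    """Return a (device, dtype name) pair that avoids half precision on CPU.

    Hypothetical helper for illustration only.
    """
    if cuda_available:
        # On a GPU, bfloat16 matches the dtype used elsewhere in this thread.
        return ("cuda", "bfloat16")
    # On CPU, fall back to float32 so LayerNorm has a supported kernel.
    return ("cpu", "float32")


# Typical use (requires torch, shown as comments to keep the sketch standalone):
#   import torch
#   device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
#   torch_dtype = getattr(torch, dtype_name)
print(pick_device_and_dtype(False))
```

If this returns the CPU pair on your machine, the 4-bit checkpoint is unlikely to run as intended anyway, since bitsandbytes 4-bit loading generally expects a CUDA GPU.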

Can you give us an example of the input you used?

@jmjzz The generation for the base prompt looks good to me.


I see. I'm also using the base prompt given on the DBRX Hugging Face page. Did you make any modifications?

@jmjzz This is the exact script I used:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the 4-bit checkpoint (replace hf_YOUR_TOKEN with your own access token).
tokenizer = AutoTokenizer.from_pretrained("PrunaAI/dbrx-instruct-bnb-4bit", trust_remote_code=True, token="hf_YOUR_TOKEN")
model = AutoModelForCausalLM.from_pretrained("PrunaAI/dbrx-instruct-bnb-4bit", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, token="hf_YOUR_TOKEN")

# Build the chat-formatted prompt and move the tensors to the GPU.
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# Generate up to 200 new tokens and print the decoded text.
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

@johnrachwanpruna I see, the generation looks good now. But loading the model takes about 30 minutes, which is significantly slower than loading Mixtral 8x7B.

@jmjzz For me, running the code snippet I showed you takes only 30 seconds.
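To compare load times like these on equal terms, it can help to time each step separately. A rough sketch using only the standard library; `timed` is a hypothetical helper, and the `time.sleep` call is a stand-in for the real `from_pretrained` call. One possibility, not confirmed in this thread, is that a slow first run includes downloading the checkpoint while later runs read it from the local cache.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}

# Stand-in for the real work; in practice wrap the tokenizer load,
# the model load, and the generate call in separate `timed` blocks.
with timed("stand-in load", timings):
    time.sleep(0.05)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f} s")
```

Running the script twice and comparing the per-step numbers should show whether the time goes into fetching files or into loading the weights onto the GPU.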


@johnrachwanpruna Thanks, I think I solved the problem. BTW, after running some evaluations, I feel the 4-bit DBRX is weaker than the default Mixtral 8x7B. Have you tried evaluating it on any benchmarks?

We have not benchmarked the quantized models yet.

