Code doesn't run
I can't give you any advice without seeing your current code.
I use this code example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = "vilm/vulture-40B"

# Load the tokenizer and the model in bfloat16, spread across available devices
tokenizer = AutoTokenizer.from_pretrained(model)
m = AutoModelForCausalLM.from_pretrained(
    model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat-style prompt; move the tokenized inputs to the GPU
prompt = "A chat between a curious user and an artificial intelligence assistant.\n\nUSER:Thành phố Hồ Chí Minh nằm ở đâu?<|endoftext|>ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample up to 50 new tokens and decode the result
output = m.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Can you give us more details on your hardware configuration and the status of your CPU or GPU when you run the code?
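If it helps, something like this is enough to capture the basics from inside Python (just a sketch; adjust it to your environment):

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # Total VRAM per GPU and how much is currently allocated by this process
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB total")
        print(f"  allocated: {torch.cuda.memory_allocated(i) / 1e9:.1f} GB")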
Hi, I believe you will need more than 24GB of VRAM to do inference on a 40B model. Maybe try load_in_4bit=True; I do not know how much quantization will help, as I have not tried it yet!
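For reference, a minimal sketch of what that could look like with bitsandbytes 4-bit quantization (the BitsAndBytesConfig values below are common defaults, not something I have tested with this model):

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "vilm/vulture-40B"

# 4-bit NF4 quantization; roughly quarters the weight memory vs. bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
m = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # the repo ships custom (RW/Falcon-style) modeling code
)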
You don't understand what I mean. The model loads normally, but when it runs, generation freezes and nothing appears. Also, your code is missing the trust_remote_code=True parameter, and when I use your code with it, RWForCausalLM cannot use load_in_4bit. I don't know if something is mistaken on your side :((
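One way to tell whether generation is actually frozen or just very slow on a 40B model is to stream tokens as they are produced. A minimal sketch, assuming m, tokenizer, and inputs are already set up as in the code above:

from transformers import TextStreamer

# Prints each token to stdout as it is generated, so a slow-but-working
# run looks visibly different from a real hang.
streamer = TextStreamer(tokenizer, skip_prompt=True)

output = m.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
    streamer=streamer,
)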