Code doesn't run
I can't give you any advice without seeing your current code.
I use this code example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = "vilm/vulture-40B"

# Load the tokenizer and the model in bfloat16, spread across available devices
tokenizer = AutoTokenizer.from_pretrained(model)
m = AutoModelForCausalLM.from_pretrained(
    model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat-style prompt; move the tokenized inputs to the GPU
prompt = "A chat between a curious user and an artificial intelligence assistant.\n\nUSER:Thành phố Hồ Chí Minh nằm ở đâu?<|endoftext|>ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample up to 50 new tokens and decode the result
output = m.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Can you give us more details on your hardware configuration and the status of your CPU or GPU when you run the code?
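If it helps, something like this is enough to capture the basics from inside Python (just a sketch; adjust it to your environment):

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # Total VRAM per GPU and how much is currently allocated by this process
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB total")
        print(f"  allocated: {torch.cuda.memory_allocated(i) / 1e9:.1f} GB")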
Hi, I believe you will need more than 24GB of VRAM to do inference on a 40B model. Maybe try load_in_4bit=True; I do not know how much quantization will help, as I have not tried it yet!
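For reference, a minimal sketch of what that could look like with bitsandbytes 4-bit quantization (the BitsAndBytesConfig values below are common defaults, not something I have tested with this model):

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "vilm/vulture-40B"

# 4-bit NF4 quantization; roughly quarters the weight memory vs. bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
m = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # the repo ships custom (RW/Falcon-style) modeling code
)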
You don't understand what I mean. The model loads normally, but when it runs, generation freezes and nothing appears. Also, your code is missing the trust_remote_code=True parameter, and when I use your code with it, RWForCausalLM cannot use load_in_4bit. I don't know if something is mistaken on your side :((
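One way to tell whether generation is actually frozen or just very slow on a 40B model is to stream tokens as they are produced. A minimal sketch, assuming m, tokenizer, and inputs are already set up as in the code above:

from transformers import TextStreamer

# Prints each token to stdout as it is generated, so a slow-but-working
# run looks visibly different from a real hang.
streamer = TextStreamer(tokenizer, skip_prompt=True)

output = m.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
    streamer=streamer,
)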