Running the model in 8bit/4bit
#17
by ybelkada
As the model seems to support Accelerate loading, you can benefit from 8-bit / 4-bit inference out of the box. First install bitsandbytes with pip install --upgrade bitsandbytes, then run:
For 8bit:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", load_in_8bit=True, trust_remote_code=True)
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
For 4bit:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", load_in_4bit=True, trust_remote_code=True)
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
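On more recent transformers releases (roughly 4.30 and later), the same 4-bit load can also be expressed through a BitsAndBytesConfig object. A minimal sketch, assuming bitsandbytes and accelerate are installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization with fp16 compute; adjust to taste
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/baichuan-7B",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)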
This makes it possible to run the model on Google Colab, for example.
Cannot get it to work on the free Colab tier with the 8-bit code above; the system ran out of memory (12.7GB in total, OOM even with low_cpu_mem_usage=True). Can you please look into it? Thanks!
I believe you need to push sharded checkpoints somewhere on the Hub beforehand, otherwise the Colab will crash:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
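# load the full fp16 checkpoint (on a machine with enough memory) and re-upload it sharded into 2GB files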
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True, torch_dtype=torch.float16)
model.push_to_hub("baichuan-7b-sharded", max_shard_size="2GB")
Then use the sharded checkpoints on the Colab.
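On the Colab side, loading the re-sharded weights in 8-bit could then look something like this ("your-username/baichuan-7b-sharded" is just a placeholder for wherever you pushed the shards):

from transformers import AutoModelForCausalLM, AutoTokenizer

# the repo id below is a placeholder; point it at the sharded checkpoint you pushed
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/baichuan-7b-sharded",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)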