Running the model in 8bit/4bit

#17
by ybelkada - opened

As the model seems to support accelerate loading, you can benefit from 8-bit / 4-bit inference out of the box. First install bitsandbytes with pip install --upgrade bitsandbytes, then run:

For 8bit:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and quantize the model to 8-bit with bitsandbytes at load time
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", load_in_8bit=True, trust_remote_code=True)

# Run a short generation on GPU and decode the result
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
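If you want to sanity-check how much GPU memory the quantized model actually occupies, get_memory_footprint is a standard transformers helper; the line below simply reuses the model loaded in the snippet above:

# Rough memory check for the 8-bit model loaded above
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")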

For 4bit:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Same as above, but quantize the model to 4-bit at load time
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", load_in_4bit=True, trust_remote_code=True)

# Run a short generation on GPU and decode the result
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
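If you want finer control over the 4-bit settings, recent transformers versions also accept a BitsAndBytesConfig. The choices below (NF4 quantization, double quantization, float16 compute dtype) are illustrative examples, not values recommended in this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example 4-bit configuration; all values here are illustrative
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/baichuan-7B",
    quantization_config=bnb_config,
    trust_remote_code=True,
)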

This makes it possible to run the model on Google Colab, for example.

I cannot get the 8-bit code above to work on the Colab free tier; the system ran out of memory (12.7 GB RAM in total, OOM even with low_cpu_mem_usage=True). Can you please look into it? Thanks!

I believe you need to push sharded checkpoints somewhere on the Hub beforehand, otherwise the Colab will crash:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in float16 and push it back to the Hub split into 2GB shards,
# so that Colab's limited CPU RAM can load it shard by shard
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True, torch_dtype=torch.float16)
model.push_to_hub("baichuan-7b-sharded", max_shard_size="2GB")

Then use the sharded checkpoint on the Colab.
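For example, assuming the shards were pushed to a repo called your-username/baichuan-7b-sharded (the repo id here is a placeholder for wherever you pushed them), the Colab side would look roughly like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the sharded checkpoint in 8-bit; the 2GB shards keep peak CPU RAM low
# "your-username/baichuan-7b-sharded" is a placeholder, replace it with your own repo
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/baichuan-7b-sharded",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)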
