Running the model in 8bit/4bit
#17
by ybelkada
As the model seems to support Accelerate loading, you can benefit from 8-bit / 4-bit inference out of the box. First install bitsandbytes with pip install --upgrade bitsandbytes, then run:
For 8bit:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", load_in_8bit=True, trust_remote_code=True)
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
For 4bit:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", load_in_4bit=True, trust_remote_code=True)
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
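On more recent transformers releases (roughly 4.30 and later), the same 4-bit load can also be expressed through a BitsAndBytesConfig object. A minimal sketch, assuming bitsandbytes and accelerate are installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization with fp16 compute; adjust to taste
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/baichuan-7B",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)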
This makes it possible to run the model on Google Colab, for example.
Cannot get it to work on the free Colab tier with the 8-bit code above; the system ran out of memory (12.7GB in total, OOM even with low_cpu_mem_usage=True). Can you please look into it? Thanks!
I believe you need to push sharded checkpoints somewhere on the Hub beforehand, otherwise the Colab will crash:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
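# load the full fp16 checkpoint (on a machine with enough memory) and re-upload it sharded into 2GB files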
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True, torch_dtype=torch.float16)
model.push_to_hub("baichuan-7b-sharded", max_shard_size="2GB")
Then use the sharded checkpoints on the Colab.
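On the Colab side, loading the re-sharded weights in 8-bit could then look something like this ("your-username/baichuan-7b-sharded" is just a placeholder for wherever you pushed the shards):

from transformers import AutoModelForCausalLM, AutoTokenizer

# the repo id below is a placeholder; point it at the sharded checkpoint you pushed
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/baichuan-7b-sharded",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)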