This model has been xMADified!
This repository contains meta-llama/Llama-3.1-8B-Instruct
quantized from 16-bit floats to 4-bit integers, using xMAD.ai proprietary technology.
Why should I use this model?
Accuracy: This xMADified model is the best quantized version of the meta-llama/Llama-3.1-8B-Instruct model. It outperforms the most downloaded quantized version on six of the eight benchmarks in Table 1 below.
Memory-efficiency: The full-precision model is around 16 GB, while this xMADified model is only 5.7 GB, making it feasible to run on an 8 GB GPU.
Fine-tuning: These models can be fine-tuned on the same reduced (5.7 GB) hardware footprint in just three clicks. Watch our product demo.
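The sizes quoted above follow roughly from the bit width of the weights. The following is a minimal sketch, assuming about 8.03 billion parameters (an approximate figure that this card does not state) and counting weight storage only:

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: roughly 8.03e9 parameters for Llama-3.1-8B (approximate).
NUM_PARAMS = 8.03e9

fp16_gb = weight_footprint_gb(NUM_PARAMS, 16)
int4_gb = weight_footprint_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

The pure 4-bit estimate comes out lower than the 5.7 GB checkpoint. The gap is presumably quantization metadata and tensors kept at higher precision, but this card does not break that down.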
Table 1: xMAD vs. Unsloth vs. Meta
Model | MMLU | Arc Challenge | Arc Easy | LAMBADA Standard | LAMBADA OpenAI | PIQA | Winogrande | HellaSwag |
---|---|---|---|---|---|---|---|---|
xmadai/Llama-3.1-8B-Instruct-xMADai-INT4 | 66.83 | 52.3 | 82.11 | 65.73 | 73.30 | 79.88 | 72.77 | 58.49 |
unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | 65.91 | 51.37 | 80.89 | 63.98 | 71.49 | 79.43 | 73.80 | 58.51 |
meta-llama/Llama-3.1-8B-Instruct | 68.05 | 51.71 | 81.9 | 66.18 | 73.55 | 79.87 | 73.72 | 59.10 |
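To make the comparison in Table 1 easier to check, here is a small sketch that recomputes summary figures from the scores exactly as listed. The unweighted mean across benchmarks is our own summary statistic for illustration, not a metric reported by xMAD.ai.

```python
BENCHMARKS = [
    "MMLU", "Arc Challenge", "Arc Easy", "LAMBADA Standard",
    "LAMBADA OpenAI", "PIQA", "Winogrande", "HellaSwag",
]

# Scores copied from Table 1, in the same column order as BENCHMARKS.
SCORES = {
    "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4":
        [66.83, 52.3, 82.11, 65.73, 73.30, 79.88, 72.77, 58.49],
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit":
        [65.91, 51.37, 80.89, 63.98, 71.49, 79.43, 73.80, 58.51],
    "meta-llama/Llama-3.1-8B-Instruct":
        [68.05, 51.71, 81.9, 66.18, 73.55, 79.87, 73.72, 59.10],
}

# Unweighted mean over the eight benchmarks for each model.
averages = {name: sum(vals) / len(vals) for name, vals in SCORES.items()}

# Benchmarks where the xMAD model scores strictly higher than the Unsloth one.
xmad = SCORES["xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"]
unsloth = SCORES["unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"]
xmad_wins = [b for b, x, u in zip(BENCHMARKS, xmad, unsloth) if x > u]

for name, avg in averages.items():
    print(f"{name}: mean {avg:.2f}")
print(f"xMAD ahead of Unsloth on {len(xmad_wins)} of {len(BENCHMARKS)}: {xmad_wins}")
```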
How to Run the Model
Loading the checkpoint of this xMADified model requires less than 6 GiB of VRAM, so it can run efficiently on an 8 GB GPU.
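Before loading, it can help to confirm that the GPU has enough free memory. This is a minimal sketch, assuming PyTorch is installed and using the 6 GiB loading figure above as the threshold; `fits_in_vram` is a hypothetical helper, and generation needs additional memory beyond loading.

```python
GIB = 1024 ** 3


def fits_in_vram(free_bytes: int, required_gib: float = 6.0) -> bool:
    """Return True if the free memory covers the assumed loading requirement."""
    return free_bytes >= required_gib * GIB


try:
    import torch

    if torch.cuda.is_available():
        # mem_get_info returns (free, total) in bytes for the current device.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print(f"Free: {free_bytes / GIB:.1f} GiB of {total_bytes / GIB:.1f} GiB")
        print("Enough to load the checkpoint:", fits_in_vram(free_bytes))
    else:
        print("No CUDA device is visible to PyTorch.")
except ImportError:
    print("PyTorch is not installed.")
```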
Package prerequisites:
- Run the following commands to install the required packages.
pip install torch==2.4.0
# If you have CUDA version 11.8, run the following:
# pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate optimum
pip install -vvv --no-build-isolation "git+https://github.com/PanQiWei/[email protected]"
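After installation, a quick check of what is actually present can save a failed model load. This is a sketch using only the standard library; the distribution names below are assumptions (in particular, the name under which the AutoGPTQ build registers may differ in your environment).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumed distribution names for the packages installed above.
for name in ("torch", "transformers", "accelerate", "optimum", "auto-gptq"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```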
Sample Inference Code
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"

# Chat-style prompt: a system instruction followed by one user turn.
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Apply the model's chat template, tokenize, and move the tensors to the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit quantized checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Sample up to 1024 new tokens, then decode the full sequence (prompt included).
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
Here's a sample output of the model, using the code above:
["system\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant, that responds as a pirate.user\n\nWhat's Deep Learning?assistant\n\nDeep Learning be a fascinatin' field, matey! It's a form o' artificial intelligence that's based on deep neural networks, which be a type o' machine learning algorithm.\n\nYer see, traditional machine learnin' algorithms be based on shallow nets, meaning they've just one or two layers. But deep learnin' takes it to a whole new level, with multiple layers stacked on top o' each other like a chest overflowin' with booty!\n\nEach o' these layers be responsible fer processin' a different aspect o' the data, from basic features to more abstract representations. It's like navigatin' through a treasure map, with each layer helpin' ye uncover the hidden patterns and patterns hidden within the data.\n\nDeep learnin' be often used in image and speech recognition, natural language processing, and even robotics. But it be a complex and challengin' field, matey, and it requires a strong grasp o' mathematics and computer science.\n\nSo hoist the sails and set course fer the world o' deep learnin', me hearty!"]
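The sample output above repeats the system and user turns because `generate` returns the prompt tokens followed by the newly generated ones. Below is a sketch of trimming the prompt before decoding. `new_tokens_only` is a hypothetical helper, shown on plain lists so it runs without a GPU; with the tensors from the inference code, the equivalent slice would be `outputs[:, inputs["input_ids"].shape[1]:]`.

```python
def new_tokens_only(output_ids, prompt_len):
    """Drop the first prompt_len token IDs from every generated sequence."""
    return [seq[prompt_len:] for seq in output_ids]


# Toy token IDs for illustration only; real IDs come from the tokenizer.
prompt_ids = [[101, 7, 8, 9]]
output_ids = [[101, 7, 8, 9, 42, 43, 2]]

completion = new_tokens_only(output_ids, len(prompt_ids[0]))
print(completion)  # [[42, 43, 2]]
```

Passing the trimmed IDs to `tokenizer.batch_decode` would then yield only the assistant's reply.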
Contact Us
For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.