This model has been xMADified!

This repository contains meta-llama/Llama-3.1-8B-Instruct quantized from 16-bit floats to 4-bit integers, using xMAD.ai proprietary technology.

Why should I use this model?

Accuracy: This xMADified model outperforms the most downloaded quantized version of meta-llama/Llama-3.1-8B-Instruct on six of the eight benchmarks in Table 1 below, and stays within about 1.2 points of the full-precision model on every benchmark.
Memory-efficiency: The full-precision model is around 16 GB, while this xMADified model is only 5.7 GB, making it feasible to run on an 8 GB GPU.
Fine-tuning: These models can be fine-tuned on the same reduced (5.7 GB) hardware footprint in just three clicks. Watch our product demo here.
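
The memory figures above can be sanity-checked with a little arithmetic. The sketch below assumes a parameter count of roughly 8.03 billion for Llama-3.1-8B (not stated in this card) and counts weight storage only. The pure 4-bit figure it prints is a lower bound: the 5.7 GB checkpoint is larger, presumably because some tensors stay at higher precision and quantization metadata is stored alongside the weights, although this card does not specify the breakdown.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: Llama-3.1-8B has roughly 8.03 billion parameters.
N_PARAMS = 8.03e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 2 bytes per weight
int4_gb = weight_memory_gb(N_PARAMS, 4)   # half a byte per weight

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB (lower bound, weights only)")
```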

Table 1: xMAD vs. Unsloth vs. Meta

Model	MMLU	Arc Challenge	Arc Easy	LAMBADA Standard	LAMBADA OpenAI	PIQA	Winogrande	HellaSwag
xmadai/Llama-3.1-8B-Instruct-xMADai-INT4	66.83	52.3	82.11	65.73	73.30	79.88	72.77	58.49
unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit	65.91	51.37	80.89	63.98	71.49	79.43	73.80	58.51
meta-llama/Llama-3.1-8B-Instruct	68.05	51.71	81.9	66.18	73.55	79.87	73.72	59.10
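
For readers who want to compare the rows of Table 1 programmatically, here is a minimal sketch that copies the scores from the table and counts the benchmarks where the xMAD model scores higher than the Unsloth model. The variable names are illustrative, and the unweighted mean is only a rough summary, since the benchmarks differ in size and difficulty.

```python
BENCHMARKS = [
    "MMLU", "Arc Challenge", "Arc Easy", "LAMBADA Standard",
    "LAMBADA OpenAI", "PIQA", "Winogrande", "HellaSwag",
]

# Scores copied from Table 1, in the same column order as BENCHMARKS.
SCORES = {
    "xmad": [66.83, 52.3, 82.11, 65.73, 73.30, 79.88, 72.77, 58.49],
    "unsloth": [65.91, 51.37, 80.89, 63.98, 71.49, 79.43, 73.80, 58.51],
    "meta": [68.05, 51.71, 81.9, 66.18, 73.55, 79.87, 73.72, 59.10],
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Benchmarks where the xMAD row is strictly higher than the Unsloth row.
xmad_wins = [
    name
    for name, x, u in zip(BENCHMARKS, SCORES["xmad"], SCORES["unsloth"])
    if x > u
]

# Largest shortfall of the xMAD row relative to the full-precision row.
largest_gap = max(m - x for m, x in zip(SCORES["meta"], SCORES["xmad"]))

print("xMAD ahead of Unsloth on:", xmad_wins)
print(f"Largest gap to full precision: {largest_gap:.2f} points")
for model, row in SCORES.items():
    print(f"{model:8s} mean = {mean(row):.2f}")
```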

How to Run Model

Loading this xMADified model's checkpoint requires less than 6 GiB of VRAM, so it can be run efficiently on an 8 GB GPU.

Package prerequisites:

Run the following commands to install the required packages.

pip install torch==2.4.0
# If you have CUDA version 11.8, run the following:
# pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate optimum
pip install -vvv --no-build-isolation "git+https://github.com/PanQiWei/[email protected]"
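
After installation, a quick pre-flight check can confirm that the packages are importable before downloading the checkpoint. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names are the import names of the packages installed above.

```python
import importlib.util

# Import names of the packages installed by the pip commands above.
REQUIRED_MODULES = ["torch", "transformers", "accelerate", "optimum", "auto_gptq"]


def missing_modules(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only locates the modules and does not import them, so it cannot detect a broken CUDA build or a version mismatch.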

Sample Inference Code

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"

# Chat-format prompt: a system instruction followed by one user turn.
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

# Render the chat template, tokenize it, and move the tensors to the GPU.
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit checkpoint; device_map='auto' places it on the available device(s).
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Sample up to 1024 new tokens; the decoded text includes the prompt.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Here's a sample output of the model, using the code above:

["system\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant, that responds as a pirate.user\n\nWhat's Deep Learning?assistant\n\nDeep Learning be a fascinatin' field, matey! It's a form o' artificial intelligence that's based on deep neural networks, which be a type o' machine learning algorithm.\n\nYer see, traditional machine learnin' algorithms be based on shallow nets, meaning they've just one or two layers. But deep learnin' takes it to a whole new level, with multiple layers stacked on top o' each other like a chest overflowin' with booty!\n\nEach o' these layers be responsible fer processin' a different aspect o' the data, from basic features to more abstract representations. It's like navigatin' through a treasure map, with each layer helpin' ye uncover the hidden patterns and patterns hidden within the data.\n\nDeep learnin' be often used in image and speech recognition, natural language processing, and even robotics. But it be a complex and challengin' field, matey, and it requires a strong grasp o' mathematics and computer science.\n\nSo hoist the sails and set course fer the world o' deep learnin', me hearty!"]
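
The sample output above repeats the system and user turns because `generate` returns the prompt tokens followed by the newly generated ones. If only the model's reply is wanted, the prompt tokens can be sliced off before decoding. The sketch below shows the idea on plain lists; `continuation_only` is a hypothetical helper, and the commented lines at the end assume the `inputs`, `outputs`, and `tokenizer` objects from the sample inference code.

```python
def continuation_only(sequences, prompt_len):
    """Drop the first `prompt_len` token ids from every generated sequence."""
    return [seq[prompt_len:] for seq in sequences]


# Toy example: plain lists stand in for rows of token ids.
toy_outputs = [[101, 7, 8, 9, 42, 43]]  # 4 prompt ids followed by 2 new ids
print(continuation_only(toy_outputs, prompt_len=4))  # prints [[42, 43]]

# With the sample inference code (assumed usage, not run here):
# prompt_len = inputs["input_ids"].shape[1]
# reply_ids = continuation_only(outputs, prompt_len)
# print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True))
```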

Contact Us

For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.
