---
license: llama3.1
tags:
  - llama3.1
  - quantization
  - bitsandbytes
  - nlp
  - instruct
library_name: transformers
---

# 🚀 Quantized Llama-3.1-8B-Instruct Model

This is a 4-bit quantized version of the `meta-llama/Llama-3.1-8B-Instruct` model, optimized for efficient inference in resource-constrained environments such as Google Colab's NVIDIA T4 GPU.

## 🧠 Model Description

The model was quantized with the `bitsandbytes` library to reduce memory usage while preserving performance on instruction-following tasks.

## 🧮 Quantization Details

- **Base Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Quantization Method**: 4-bit (NormalFloat4, NF4) with double quantization
- **Compute Dtype**: float16
- **Library**: `bitsandbytes==0.43.3`
- **Framework**: `transformers==4.45.1`
- **Hardware**: NVIDIA T4 GPU (16 GB VRAM) in Google Colab
- **Date**: Quantized on June 20, 2025

## 📦 Files Included

- `README.md`: This file
- `config.json`, `pytorch_model.bin` (or sharded checkpoints): Model weights
- `special_tokens_map.json`, `tokenizer.json`, `tokenizer_config.json`: Tokenizer files

## Usage

To load the quantized model and run inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Define the quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # Replace with your Hugging Face repo ID
    quantization_config=quant_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")

# Create a text-generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Run inference
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_length=50, num_return_sequences=1)
print(output)
```

## Quantization Process

The model was quantized in Google Colab using the following script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login

# Log in to Hugging Face
login()  # Requires a Hugging Face token

# Define the quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the base model and quantize it on the fly
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Llama tokenizers ship without a pad token; fall back to the EOS token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Save the quantized model
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
```

## Requirements

- **Hardware**: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
- **Python**: 3.10+
- **Dependencies**:
  - `transformers==4.45.1`
  - `bitsandbytes==0.43.3`
  - `accelerate==0.33.0`
  - `torch` (with CUDA support)

## Notes

- The quantized model is stored at `/content/quantized_Llama-3.1-8B-Instruct` in the Colab environment.
- Because Colab storage is ephemeral, consider pushing the model to the Hugging Face Hub or saving it to Google Drive for persistence (see the sketch below).
- Access to the base model requires a Hugging Face token and approval from Meta AI.
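A minimal sketch of pushing the quantized checkpoint to the Hub, assuming the `model` and `tokenizer` objects from the quantization script above are still in memory; the repo ID `your-username/quantized_Llama-3.1-8B-Instruct` is a placeholder you should replace with your own:

```python
from huggingface_hub import login

# Authenticate with a Hugging Face token that has write access
login()

# Placeholder repo ID -- replace with your own namespace and repo name
repo_id = "your-username/quantized_Llama-3.1-8B-Instruct"

# Upload the quantized weights and the tokenizer files to the Hub
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```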
## License

This model inherits the license of the base model `meta-llama/Llama-3.1-8B-Instruct`. Refer to the original model card: [Meta AI Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

## Acknowledgments

- Created using Hugging Face Transformers and `bitsandbytes` for quantization.
- Quantized in Google Colab with a T4 GPU on June 20, 2025.