NoorNizar/Meta-Llama-3-8B-Instruct-WINT8

NoorNizar committed on Apr 21

Commit de4b46d (verified) · 1 Parent(s): 5859792

Update model card (via --mco)

Files changed (1)

README.md +59 -0

README.md ADDED

	@@ -0,0 +1,59 @@

+---
+library_name: transformers
+tags:
+- llmcompressor
+- quantization
+- wint8
+---
+# Meta-Llama-3-8B-Instruct-WINT8
+This model is an 8-bit quantized version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), produced using the [llmcompressor](https://github.com/neuralmagic/llmcompressor) library.
+## Quantization Details
+*   **Base Model:** [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
+*   **Quantization Library:** `llmcompressor`
+*   **Quantization Method:** Weight-only 8-bit int (WINT8)
+*   **Quantization Recipe:**
+```yaml
+quant_stage:
+      quant_modifiers:
+        QuantizationModifier:
+          ignore: [lm_head]
+          config_groups:
+            group_0:
+              weights: {num_bits: 8, type: int, symmetric: true, strategy: channel, dynamic: false}
+              targets: [Linear]
+```
+## Evaluation Results
+The following table compares the quantized model with the baseline (non-quantized) model on the winogrande benchmark.
+| Task       | Baseline Metric (10.0% Threshold) | Quantized Metric | Metric Type |
+|------------|-----------------------------------|------------------|-------------|
+| winogrande | 0.7577                            | 0.7616           | acc,none    |
+## How to Use
+You can load the quantized model and tokenizer using the `transformers` library:
+```python
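+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "NoorNizar/Meta-Llama-3-8B-Instruct-WINT8"
+# device_map="auto" places the weights on the available device(s); it requires the `accelerate` package.
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Example usage (replace with your specific task)
+prompt = "Hello, world!"
+# Tokenize the prompt and move the tensors to the model's device
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+# Generate up to 50 new tokens
+outputs = model.generate(**inputs, max_new_tokens=50)
+# Decode the generated token IDs back to text, dropping special tokens
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))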
+```
+## Disclaimer
+This model was quantized automatically using a script. Performance and behavior might differ slightly from the original base model.
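As a note on the recipe in the README above (`num_bits: 8`, `type: int`, `symmetric: true`, `strategy: channel`), the following is a minimal pure-Python sketch of what symmetric per-channel weight quantization generally computes. It is not llmcompressor's implementation. The function names `quantize_per_channel` and `dequantize`, the choice to treat each row of the matrix as one output channel, and the clamp range of [-127, 127] are all assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(
    weights: Matrix, num_bits: int = 8
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric quantization with one scale per row (assumed output channel).

    Hypothetical helper for illustration; not part of llmcompressor.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Guard against an all-zero channel to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Symmetric: zero-point is 0, so quantization is scale-and-round only.
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    """Map integer codes back to approximate float weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
    q, s = quantize_per_channel(w)
    print(q)
    print(dequantize(q, s))
```

Because each channel has its own scale, a row with small weights (the second row in the usage example) keeps roughly the same relative precision as a row with large weights. With round-to-nearest, the per-element reconstruction error is bounded by about half of that channel's scale.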