---
library_name: transformers
tags:
- llmcompressor
- quantization
- wint4
---

# Phi-4-mini-instruct-WINT4

This model is a 4-bit weight-only quantized version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct), produced with the [llmcompressor](https://github.com/neuralmagic/llmcompressor) library.

## Quantization Details

* **Base Model:** [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct)
* **Quantization Library:** `llmcompressor`
* **Quantization Method:** Weight-only 4-bit integer (WINT4), symmetric, per-channel
* **Quantization Recipe:**

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 4, type: int, symmetric: true, strategy: channel, dynamic: false}
          targets: [Linear]
```

## Evaluation Results

The following table shows the evaluation results compared to the baseline (non-quantized) model.

| Task       | Baseline Metric (10.0% threshold) | Quantized Metric | Metric Type |
|------------|-----------------------------------|------------------|-------------|
| winogrande | 0.7545                            | 0.6985           | acc,none    |

## How to Use

You can load the quantized model and tokenizer using the `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NoorNizar/Phi-4-mini-instruct-WINT4"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example usage (replace with your specific task)
prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Disclaimer

This model was quantized automatically by a script; its performance and behavior may differ from the original base model.
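
## Reproducing the Quantization

Below is a minimal sketch of how the recipe in "Quantization Details" can be applied with `llmcompressor`'s `oneshot` entry point. Import paths and save arguments vary across `llmcompressor` versions, so treat this as an illustration of the recipe rather than the exact script used to produce this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases expose this as llmcompressor.transformers.oneshot

BASE_MODEL_ID = "microsoft/Phi-4-mini-instruct"
OUTPUT_DIR = "Phi-4-mini-instruct-WINT4"

# The same recipe shown in "Quantization Details", passed as a YAML string.
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 4, type: int, symmetric: true, strategy: channel, dynamic: false}
          targets: [Linear]
"""

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

# Static, per-channel, weight-only quantization derives scales directly from
# the weights, so no calibration dataset is passed here.
oneshot(model=model, recipe=recipe)

# Save the checkpoint in compressed (quantized) format alongside the tokenizer.
model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```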
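
## Reproducing the Evaluation

The metric name `acc,none` matches the output format of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), so the sketch below assumes that harness was used; the exact evaluation configuration (few-shot setting, batch size, harness version) is an assumption and is not documented here.

```python
import lm_eval

# Evaluate the quantized checkpoint on winogrande via the Hugging Face backend.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=NoorNizar/Phi-4-mini-instruct-WINT4,dtype=auto",
    tasks=["winogrande"],
    batch_size="auto",
)
print(results["results"]["winogrande"]["acc,none"])
```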