---
library_name: vllm
language:
- ar
- de
- en
- es
- fr
- hi
- id
- it
- pt
- th
- tl
- vi
base_model:
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
pipeline_tag: image-text-to-text
tags:
- facebook
- meta
- pytorch
- llama
- llama4
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT4
license: other
license_name: llama4
---

# Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

## Model Overview
- **Model Architecture:** Llama4ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
- **Release Date:** 06/12/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [Llama-4-Maverick-17B-128E-Instruct](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct) to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 to 4, reducing GPU memory requirements by approximately 75%. Weight quantization also reduces disk size requirements by approximately 75%. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library was used for quantization.

## Deployment

This model can be deployed efficiently on vLLM, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]

# Format the request with the model's chat template before generation
formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(formatted_prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
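As a minimal sketch of OpenAI-compatible serving, the snippet below queries a locally served instance of this model. The serve command shown in the comment, the endpoint URL, port, and API key are assumptions and should be adapted to your deployment.

```python
from openai import OpenAI

# Assumes the model was first served with, for example:
#   vllm serve RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 --tensor-parallel-size 8
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```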
## Creation

**Creation details**

This model was created by applying a development version of [llm-compressor](https://github.com/vllm-project/llm-compressor). More details will be added once the code is merged into main.
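Since the exact recipe has not been published yet, the sketch below only illustrates what a weight-only INT4 (W4A16) quantization run with llm-compressor's `oneshot` API might look like. The modifier choice, calibration dataset, sample count, and output directory are assumptions, not the recipe used for this model, and a multimodal Llama 4 checkpoint would likely require additional modules to be excluded.

```python
# Import paths may differ slightly across llm-compressor versions
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

# Hypothetical weight-only INT4 recipe (W4A16); illustrative only
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
)

# One-shot calibration and quantization; dataset and settings are placeholders
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
)
```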
## Evaluation

The model was evaluated on the OpenLLM v1 leaderboard tasks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). More evaluations are underway.
**Evaluation details**

**OpenLLM v1**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto
```
### Accuracy

| | Recovery (%) | meta-llama/Llama-4-Maverick-17B-128E-Instruct | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16<br>(this model) |
| ---------------------------------------------- | :-----------: | :-------------------------------------------: | :-----------------------------------------------------------------: |
| ARC-Challenge<br>25-shot | 96.6 | 73.55 | 71.08 |
| GSM8k<br>5-shot | 99.7 | 93.18 | 92.87 |
| HellaSwag<br>10-shot | 99.6 | 87.27 | 86.95 |
| MMLU<br>5-shot | 99.8 | 85.98 | 85.78 |
| TruthfulQA<br>0-shot | 100.0 | 62.81 | 62.85 |
| WinoGrande<br>5-shot | 100.5 | 78.53 | 78.93 |
| **OpenLLM v1<br>Average Score** | **99.4** | **80.22** | **79.74** |
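The Recovery column reports the quantized model's score as a percentage of the baseline score (small discrepancies against the printed values come from rounding). A minimal sketch reproducing the column from the scores above:

```python
# Per-benchmark scores from the table above: (baseline, quantized)
scores = {
    "ARC-Challenge (25-shot)": (73.55, 71.08),
    "GSM8k (5-shot)": (93.18, 92.87),
    "HellaSwag (10-shot)": (87.27, 86.95),
    "MMLU (5-shot)": (85.98, 85.78),
    "TruthfulQA (0-shot)": (62.81, 62.85),
    "WinoGrande (5-shot)": (78.53, 78.93),
}

for task, (baseline, quantized) in scores.items():
    print(f"{task}: {100 * quantized / baseline:.1f}% recovery")

# Average score and overall recovery
baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quantized_avg = sum(q for _, q in scores.values()) / len(scores)
print(f"Average: {baseline_avg:.2f} -> {quantized_avg:.2f} "
      f"({100 * quantized_avg / baseline_avg:.1f}% recovery)")
```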