robgreenberg3 committed · verified
Commit ca93d41 · 1 Parent(s): 942e9fe

Update README.md

Files changed (1): README.md (+161 -2)
README.md CHANGED
@@ -11,7 +11,14 @@ base_model: ibm-granite/granite-3.1-8b-base
 library_name: transformers
 ---
 
-# granite-3.1-8b-base-quantized.w4a16
+<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+granite-3.1-8b-base-quantized.w4a16
+<img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+</h1>
+
+<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+<img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+</a>
 
 ## Model Overview
 - **Model Architecture:** granite-3.1-8b-base
@@ -22,7 +29,7 @@ library_name: transformers
 - **Activation quantization:** BF16
 - **Release Date:** 1/8/2025
 - **Version:** 1.0
-- **Model Developers:** Neural Magic
+- **Model Developers:** Neural Magic (Red Hat)
 
 Quantized version of [ibm-granite/granite-3.1-8b-base](https://huggingface.co/ibm-granite/granite-3.1-8b-base).
 It achieves an average score of 69.81 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 70.30.
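For context on how numbers like these are typically obtained, the sketch below runs an OpenLLM-v1-style evaluation through lm-evaluation-harness. The vLLM backend, task list, and default few-shot settings are assumptions for illustration, not the exact recipe behind the scores quoted above.

```python
# Illustrative sketch only: OpenLLM v1-style evaluation with lm-evaluation-harness.
# The backend, task list, and few-shot defaults are assumptions, not this card's recipe.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/granite-3.1-8b-base-quantized.w4a16,dtype=auto,max_model_len=4096",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande", "gsm8k"],
)
print(results["results"])
```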
@@ -62,6 +69,158 @@ print(generated_text)
 
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
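As a minimal illustration of that OpenAI-compatible mode, the sketch below queries a locally running vLLM server with the `openai` client; the base URL, placeholder API key, and served model name are assumptions rather than details from this card.

```python
# Illustrative sketch: query a vLLM OpenAI-compatible endpoint serving this model.
# Assumes `vllm serve` (or an equivalent deployment) is listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="RedHatAI/granite-3.1-8b-base-quantized.w4a16",
    prompt="The Moon orbits the Earth once every",
    max_tokens=32,
    temperature=0.0,
)
print(completion.choices[0].text)
```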
 
 
+<details>
+<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+```bash
+$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+  --name=vllm \
+  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+  vllm serve \
+  --tensor-parallel-size 8 \
+  --max-model-len 32768 \
+  --enforce-eager --model RedHatAI/granite-3.1-8b-base-quantized.w4a16
+```
+See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+</details>
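Once the container is up, a quick sanity check is to list the models the server exposes; the sketch below assumes the `-p 8000:8000` port mapping from the command above.

```python
# Illustrative sketch: confirm the inference server container is reachable and
# print the model IDs it exposes (assumes the host port mapping shown above).
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```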
+
+<details>
+<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+```bash
+# Download model from Red Hat Registry via docker
+# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3-1-8b-base-quantized-w4a16:1.5
+```
+
+```bash
+# Serve model via ilab
+ilab model serve --model-path ~/.cache/instructlab/models/granite-3-1-8b-base-quantized-w4a16
+
+# Chat with model
+ilab model chat --model ~/.cache/instructlab/models/granite-3-1-8b-base-quantized-w4a16
+```
+See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+</details>
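Beyond `ilab model chat`, the served model can also be called programmatically through its OpenAI-compatible API; the address and served model name in the sketch below are assumptions, so verify them against your `ilab` serve configuration.

```python
# Illustrative sketch: call the model served by `ilab model serve` over its
# OpenAI-compatible API. The address and model name are assumptions; check your
# serve configuration and the /v1/models endpoint before relying on them.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="no-key-required")

resp = client.completions.create(
    model="granite-3-1-8b-base-quantized-w4a16",  # hypothetical served name
    prompt="Quantization reduces model size by",
    max_tokens=32,
)
print(resp.choices[0].text)
```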
+
+<details>
+<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+```yaml
+# Setting up vllm server with ServingRuntime
+# Save as: vllm-servingruntime.yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+  annotations:
+    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  annotations:
+    prometheus.io/port: '8080'
+    prometheus.io/path: '/metrics'
+  multiModel: false
+  supportedModelFormats:
+    - autoSelect: true
+      name: vLLM
+  containers:
+    - name: kserve-container
+      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+      command:
+        - python
+        - -m
+        - vllm.entrypoints.openai.api_server
+      args:
+        - "--port=8080"
+        - "--model=/mnt/models"
+        - "--served-model-name={{.Name}}"
+      env:
+        - name: HF_HOME
+          value: /tmp/hf_home
+      ports:
+        - containerPort: 8080
+          protocol: TCP
+```
+
+```yaml
+# Attach model to vllm server. This is an NVIDIA template
+# Save as: inferenceservice.yaml
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  annotations:
+    openshift.io/display-name: granite-3.1-8b-base-quantized.w4a16 # OPTIONAL CHANGE
+    serving.kserve.io/deploymentMode: RawDeployment
+  name: granite-3.1-8b-base-quantized.w4a16 # specify model name. This value will be used to invoke the model in the payload
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  predictor:
+    maxReplicas: 1
+    minReplicas: 1
+    model:
+      modelFormat:
+        name: vLLM
+      name: ''
+      resources:
+        limits:
+          cpu: '2' # this is model specific
+          memory: 8Gi # this is model specific
+          nvidia.com/gpu: '1' # this is accelerator specific
+        requests: # same comment for this block
+          cpu: '1'
+          memory: 4Gi
+          nvidia.com/gpu: '1'
+      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+      storageUri: oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-base-quantized-w4a16:1.5
+    tolerations:
+      - effect: NoSchedule
+        key: nvidia.com/gpu
+        operator: Exists
+```
+
+```bash
+# make sure first to be in the project where you want to deploy the model
+# oc project <project-name>
+# apply both resources to run the model
+# Apply the ServingRuntime
+oc apply -f vllm-servingruntime.yaml
+# Apply the InferenceService
+oc apply -f inferenceservice.yaml
+```
+
+```bash
+# Replace <inference-service-name> and <cluster-ingress-domain> below:
+# - Run `oc get inferenceservice` to find your URL if unsure.
+# Call the server using curl:
+curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "granite-3.1-8b-base-quantized.w4a16",
+    "stream": true,
+    "stream_options": {
+      "include_usage": true
+    },
+    "max_tokens": 1,
+    "messages": [
+      {
+        "role": "user",
+        "content": "How can a bee fly when its wings are so small?"
+      }
+    ]
+  }'
+```
+
+See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+</details>
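The same request can also be issued from Python; the sketch below streams a chat completion with the `openai` client, reusing the placeholder route from the curl example above (replace it with the URL reported by `oc get inferenceservice`).

```python
# Illustrative sketch: stream a chat completion from the deployed InferenceService.
# Replace the placeholder host with the route reported by `oc get inferenceservice`.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<domain>/v1",
    api_key="not-used",
)

stream = client.chat.completions.create(
    model="granite-3.1-8b-base-quantized.w4a16",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```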
+
+
 ## Creation
 
 This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.