robgreenberg3 committed · verified
Commit ca93d41 · 1 Parent(s): 942e9fe

Update README.md

Files changed (1): README.md (+161 -2)
README.md CHANGED
@@ -11,7 +11,14 @@ base_model: ibm-granite/granite-3.1-8b-base
 library_name: transformers
 ---
 
-# granite-3.1-8b-base-quantized.w4a16
+<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+granite-3.1-8b-base-quantized.w4a16
+<img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+</h1>
+
+<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+<img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+</a>
 
 ## Model Overview
 - **Model Architecture:** granite-3.1-8b-base
@@ -22,7 +29,7 @@ library_name: transformers
 - **Activation quantization:** BF16
 - **Release Date:** 1/8/2025
 - **Version:** 1.0
-- **Model Developers:** Neural Magic
+- **Model Developers:** Neural Magic (Red Hat)
 
 Quantized version of [ibm-granite/granite-3.1-8b-base](https://huggingface.co/ibm-granite/granite-3.1-8b-base).
 It achieves an average score of 69.81 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 70.30.
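For context on how numbers like these are typically obtained, the sketch below runs an OpenLLM-v1-style evaluation through lm-evaluation-harness. The vLLM backend, task list, and default few-shot settings are assumptions for illustration, not the exact recipe behind the scores quoted above.

```python
# Illustrative sketch only: OpenLLM v1-style evaluation with lm-evaluation-harness.
# The backend, task list, and few-shot defaults are assumptions, not this card's recipe.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/granite-3.1-8b-base-quantized.w4a16,dtype=auto,max_model_len=4096",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande", "gsm8k"],
)
print(results["results"])
```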
@@ -62,6 +69,158 @@ print(generated_text)
 
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
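As a minimal illustration of that OpenAI-compatible mode, the sketch below queries a locally running vLLM server with the `openai` client; the base URL, placeholder API key, and served model name are assumptions rather than details from this card.

```python
# Illustrative sketch: query a vLLM OpenAI-compatible endpoint serving this model.
# Assumes `vllm serve` (or an equivalent deployment) is listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="RedHatAI/granite-3.1-8b-base-quantized.w4a16",
    prompt="The Moon orbits the Earth once every",
    max_tokens=32,
    temperature=0.0,
)
print(completion.choices[0].text)
```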
 
 
+<details>
+<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+```bash
+$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+  --name=vllm \
+  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+  vllm serve \
+  --tensor-parallel-size 8 \
+  --max-model-len 32768 \
+  --enforce-eager --model RedHatAI/granite-3.1-8b-base-quantized.w4a16
+```
+See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+</details>
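Once the container is up, a quick sanity check is to list the models the server exposes; the sketch below assumes the `-p 8000:8000` port mapping from the command above.

```python
# Illustrative sketch: confirm the inference server container is reachable and
# print the model IDs it exposes (assumes the host port mapping shown above).
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```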
+
+<details>
+<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+```bash
+# Download model from Red Hat Registry via docker
+# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3-1-8b-base-quantized-w4a16:1.5
+```
+
+```bash
+# Serve model via ilab
+ilab model serve --model-path ~/.cache/instructlab/models/granite-3-1-8b-base-quantized-w4a16
+
+# Chat with model
+ilab model chat --model ~/.cache/instructlab/models/granite-3-1-8b-base-quantized-w4a16
+```
+See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+</details>
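Beyond `ilab model chat`, the served model can also be called programmatically through its OpenAI-compatible API; the address and served model name in the sketch below are assumptions, so verify them against your `ilab` serve configuration.

```python
# Illustrative sketch: call the model served by `ilab model serve` over its
# OpenAI-compatible API. The address and model name are assumptions; check your
# serve configuration and the /v1/models endpoint before relying on them.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="no-key-required")

resp = client.completions.create(
    model="granite-3-1-8b-base-quantized-w4a16",  # hypothetical served name
    prompt="Quantization reduces model size by",
    max_tokens=32,
)
print(resp.choices[0].text)
```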
+
+<details>
+<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+```yaml
+# Setting up vllm server with ServingRuntime
+# Save as: vllm-servingruntime.yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+  annotations:
+    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  annotations:
+    prometheus.io/port: '8080'
+    prometheus.io/path: '/metrics'
+  multiModel: false
+  supportedModelFormats:
+    - autoSelect: true
+      name: vLLM
+  containers:
+    - name: kserve-container
+      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+      command:
+        - python
+        - -m
+        - vllm.entrypoints.openai.api_server
+      args:
+        - "--port=8080"
+        - "--model=/mnt/models"
+        - "--served-model-name={{.Name}}"
+      env:
+        - name: HF_HOME
+          value: /tmp/hf_home
+      ports:
+        - containerPort: 8080
+          protocol: TCP
+```
+
+```yaml
+# Attach model to vllm server. This is an NVIDIA template
+# Save as: inferenceservice.yaml
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  annotations:
+    openshift.io/display-name: granite-3.1-8b-base-quantized.w4a16 # OPTIONAL CHANGE
+    serving.kserve.io/deploymentMode: RawDeployment
+  name: granite-3.1-8b-base-quantized.w4a16 # specify model name. This value will be used to invoke the model in the payload
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  predictor:
+    maxReplicas: 1
+    minReplicas: 1
+    model:
+      modelFormat:
+        name: vLLM
+      name: ''
+      resources:
+        limits:
+          cpu: '2' # this is model specific
+          memory: 8Gi # this is model specific
+          nvidia.com/gpu: '1' # this is accelerator specific
+        requests: # same comment for this block
+          cpu: '1'
+          memory: 4Gi
+          nvidia.com/gpu: '1'
+      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+      storageUri: oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-base-quantized-w4a16:1.5
+    tolerations:
+      - effect: NoSchedule
+        key: nvidia.com/gpu
+        operator: Exists
+```
+
+```bash
+# make sure first to be in the project where you want to deploy the model
+# oc project <project-name>
+# apply both resources to run the model
+# Apply the ServingRuntime
+oc apply -f vllm-servingruntime.yaml
+# Apply the InferenceService
+oc apply -f inferenceservice.yaml
+```
+
+```bash
+# Replace <inference-service-name> and <cluster-ingress-domain> below:
+# - Run `oc get inferenceservice` to find your URL if unsure.
+# Call the server using curl:
+curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "granite-3.1-8b-base-quantized.w4a16",
+    "stream": true,
+    "stream_options": {
+      "include_usage": true
+    },
+    "max_tokens": 1,
+    "messages": [
+      {
+        "role": "user",
+        "content": "How can a bee fly when its wings are so small?"
+      }
+    ]
+  }'
+```
+
+See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+</details>
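The same request can also be issued from Python; the sketch below streams a chat completion with the `openai` client, reusing the placeholder route from the curl example above (replace it with the URL reported by `oc get inferenceservice`).

```python
# Illustrative sketch: stream a chat completion from the deployed InferenceService.
# Replace the placeholder host with the route reported by `oc get inferenceservice`.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<domain>/v1",
    api_key="not-used",
)

stream = client.chat.completions.create(
    model="granite-3.1-8b-base-quantized.w4a16",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```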
+
+
 ## Creation
 
 This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.