jennyyyi committed on
Commit 2c6e720 · verified · 1 Parent(s): 819d06a

Update README.md

Files changed (1)
  1. README.md +167 -2
README.md CHANGED
@@ -10,8 +10,14 @@ language:
  base_model: ibm-granite/granite-3.1-8b-instruct
  library_name: transformers
  ---
-
- # granite-3.1-8b-instruct-quantized.w4a16
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ granite-3.1-8b-instruct-quantized.w4a16
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>

  ## Model Overview
  - **Model Architecture:** granite-3.1-8b-instruct
@@ -62,6 +68,165 @@ print(generated_text)

  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

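As a quick illustration of the OpenAI-compatible serving mentioned above, the endpoint can be queried with the standard `openai` Python client. This is a minimal sketch, not part of the model card; the base URL and served model name are assumptions that depend on how the server was launched (here, a local `vllm serve RedHatAI/granite-3.1-8b-instruct-quantized.w4a16` on the default port):

```python
# Minimal sketch: query a vLLM OpenAI-compatible endpoint.
# Assumes a locally running server on the default port, e.g.:
#   vllm serve RedHatAI/granite-3.1-8b-instruct-quantized.w4a16
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumption: default vLLM host/port
    api_key="EMPTY",                      # vLLM accepts any key unless --api-key is set
)

response = client.chat.completions.create(
    model="RedHatAI/granite-3.1-8b-instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give a one-sentence summary of IBM Granite."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```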
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/granite-3.1-8b-instruct-quantized.w4a16
+ ```
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
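Once the container above is running, a quick way to confirm the server is up is to list the models it exposes. A small sketch, assuming the default port mapping (`-p 8000:8000`) from the `podman run` command above:

```python
# Minimal sketch: check that the Red Hat AI Inference Server container is serving.
# Assumes the -p 8000:8000 port mapping from the podman command above.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # should include the served Granite model ID
```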
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w4a16:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/granite-3-1-8b-instruct-quantized-w4a16 -- --trust-remote-code
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/granite-3-1-8b-instruct-quantized-w4a16
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Openshift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: granite-3-1-8b-instruct-quantized-w4a16 # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: granite-3-1-8b-instruct-quantized-w4a16 # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       args:
+         - '--trust-remote-code'
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct-quantized-w4a16:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # apply both resources to run model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "granite-3-1-8b-instruct-quantized-w4a16",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+
+ ```
+
+ See [Red Hat Openshift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
  ## Creation

  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.