Files changed (1)
README.md CHANGED (+159 -1)
@@ -12,7 +12,14 @@ tags:
  - int8
  ---
 
- # Qwen2.5-7B-Instruct-quantized.w8a8
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Qwen2.5-7B-Instruct-quantized.w8a8
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
 
  ## Model Overview
  - **Model Architecture:** Qwen2
@@ -70,6 +77,157 @@ print(generated_text)
 
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8
+ ```
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
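Once the container is running, the server exposes an OpenAI-compatible API on port 8000. The snippet below is a minimal sketch for exercising it, assuming the `openai` Python package is installed and the server is reachable on `localhost:8000`; the model name matches the `--model` value passed to `vllm serve` above.

```python
# Minimal sketch: query the OpenAI-compatible endpoint started by the podman command above.
# Assumes `pip install openai` and that the server is reachable on localhost:8000
# (see the -p 8000:8000 port mapping).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8",  # matches the --model flag above
    messages=[{"role": "user", "content": "Summarize what INT8 weight-and-activation quantization changes."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```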
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/qwen2-5-7b-instruct-quantized-w8a8:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/qwen2-5-7b-instruct-quantized-w8a8
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/qwen2-5-7b-instruct-quantized-w8a8
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Qwen2.5-7B-Instruct-quantized.w8a8 # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Qwen2.5-7B-Instruct-quantized.w8a8 # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-qwen2-5-7b-instruct-quantized-w8a8:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+ # apply both resources to run the model
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
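After applying both resources, it can help to confirm programmatically that the InferenceService has become Ready before sending traffic. The snippet below is a minimal sketch, assuming the `kubernetes` Python client is installed, a kubeconfig for the cluster is available, and `<project-name>` and `<inference-service-name>` are replaced with your own values.

```python
# Minimal sketch: poll the InferenceService until it reports Ready, then print its URL.
# Assumes `pip install kubernetes`, a working kubeconfig, and that the placeholders
# below are replaced with the namespace and metadata.name used in inferenceservice.yaml.
import time
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

NAMESPACE = "<project-name>"            # project the resources were applied to
NAME = "<inference-service-name>"       # metadata.name from inferenceservice.yaml

for _ in range(60):
    isvc = api.get_namespaced_custom_object(
        group="serving.kserve.io", version="v1beta1",
        namespace=NAMESPACE, plural="inferenceservices", name=NAME,
    )
    conditions = isvc.get("status", {}).get("conditions", [])
    if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
        print("InferenceService ready at:", isvc.get("status", {}).get("url"))
        break
    time.sleep(10)
else:
    print("InferenceService did not become Ready in time")
```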
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "Qwen2.5-7B-Instruct-quantized.w8a8",
+         "stream": true,
+         "stream_options": {
+             "include_usage": true
+         },
+         "max_tokens": 1,
+         "messages": [
+             {
+                 "role": "user",
+                 "content": "How can a bee fly when its wings are so small?"
+             }
+         ]
+     }'
+ ```
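The same request can also be issued from Python. This is a minimal sketch using the `openai` client against the route above; the placeholders still need to be filled in, and a real token should be supplied in place of `"EMPTY"` if your deployment enforces authentication.

```python
# Minimal sketch: the same streaming chat request via the `openai` client.
# Assumes `pip install openai`; replace the placeholder host with the route
# reported by `oc get inferenceservice`.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-quantized.w8a8",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```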
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
  ## Creation
 
  <details>