Files changed (1)
  1. README.md +159 -1
README.md CHANGED
@@ -12,7 +12,14 @@ tags:
  - int8
  ---
 
- # Qwen2.5-7B-Instruct-quantized.w4a16
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Qwen2.5-7B-Instruct-quantized.w4a16
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
 
  ## Model Overview
  - **Model Architecture:** Qwen2
@@ -68,6 +75,157 @@ print(generated_text)
 
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
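+ Once a server is running (for example via any of the deployment options below), the endpoint can be queried with any OpenAI-compatible client. The snippet below is a minimal sketch, assuming the server is reachable at http://localhost:8000/v1 and that the model is served under its default name:
+
+ ```python
+ from openai import OpenAI
+
+ # Point an OpenAI-compatible client at the local vLLM endpoint (assumed address).
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="RedHatAI/Qwen2.5-7B-Instruct-quantized.w4a16",
+     messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
+     max_tokens=256,
+ )
+ print(response.choices[0].message.content)
+ ```
+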
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Qwen2.5-7B-Instruct-quantized.w4a16
+ ```
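+
+ To confirm the container is up and serving the model, the OpenAI-compatible model list endpoint can be queried. This is a quick check, assuming port 8000 is published as in the command above:
+
+ ```python
+ import requests
+
+ # Ask the vLLM server (assumed on localhost:8000) which models it is serving.
+ resp = requests.get("http://localhost:8000/v1/models", timeout=10)
+ resp.raise_for_status()
+ print([m["id"] for m in resp.json()["data"]])
+ ```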
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/qwen2-5-7b-instruct-quantized-w4a16:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/qwen2-5-7b-instruct-quantized-w4a16
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/qwen2-5-7b-instruct-quantized-w4a16
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Set up the vLLM server with a ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach the model to the vLLM server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Qwen2.5-7B-Instruct-quantized.w4a16 # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Qwen2.5-7B-Instruct-quantized.w4a16 # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-qwen2-5-7b-instruct-quantized-w4a16:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
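+
+ The InferenceService can take a little while to pull the model and become ready. One way to check readiness programmatically is to read the resource's status conditions with the Kubernetes Python client; this is a sketch, and the namespace and name below are placeholders for the values used in your cluster:
+
+ ```python
+ from kubernetes import client, config
+
+ # Use the same credentials/context as `oc` (reads the local kubeconfig).
+ config.load_kube_config()
+
+ api = client.CustomObjectsApi()
+ # Placeholders: replace with your project and the metadata.name of the InferenceService.
+ isvc = api.get_namespaced_custom_object(
+     group="serving.kserve.io",
+     version="v1beta1",
+     namespace="<project-name>",
+     plural="inferenceservices",
+     name="<inference-service-name>",
+ )
+ for cond in isvc.get("status", {}).get("conditions", []):
+     print(cond.get("type"), cond.get("status"))
+ ```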
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Qwen2.5-7B-Instruct-quantized.w4a16",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
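+
+ Since the request above enables streaming, the same call can also be made from Python and consumed chunk by chunk. This is a sketch using the OpenAI client; the base URL is a placeholder for your InferenceService route:
+
+ ```python
+ from openai import OpenAI
+
+ # Placeholder route: take it from `oc get inferenceservice` for your deployment.
+ client = OpenAI(
+     base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
+     api_key="EMPTY",
+ )
+
+ stream = client.chat.completions.create(
+     model="Qwen2.5-7B-Instruct-quantized.w4a16",
+     messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
+     max_tokens=256,
+     stream=True,
+ )
+ for chunk in stream:
+     # Usage-only chunks at the end of the stream have no choices.
+     if chunk.choices and chunk.choices[0].delta.content:
+         print(chunk.choices[0].delta.content, end="", flush=True)
+ ```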
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
  ## Creation
 
  <details>