Files changed (1)
  1. README.md +160 -1
README.md CHANGED
@@ -19,7 +19,14 @@ tags:
  - int4
 ---
 
- # phi-4-quantized.w4a16
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+   phi-4-quantized.w4a16
+   <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+   <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
 
 ## Model Overview
 - **Model Architecture:** Phi3ForCausalLM
@@ -81,6 +88,158 @@ print(generated_text)
 
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
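+ Once a server is running (for example, via any of the deployment options below), the endpoint can also be queried with the `openai` Python client. The snippet below is a minimal sketch: the base URL, API key, and served model name depend on how the server was started, so adjust them to your deployment.
+
+ ```python
+ from openai import OpenAI
+
+ # Placeholder endpoint and model name; match them to your deployment
+ # (e.g., http://localhost:8000/v1 when serving locally on port 8000).
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="RedHatAI/phi-4-quantized.w4a16",
+     messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
+     max_tokens=256,
+ )
+ print(response.choices[0].message.content)
+ ```
+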
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" \
+   -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/phi-4-quantized.w4a16
+ ```
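+
+ Once the container is up, a quick smoke test against the OpenAI-compatible API (assuming the `-p 8000:8000` port mapping above) is to list the served models:
+
+ ```bash
+ # Should list RedHatAI/phi-4-quantized.w4a16 once the server has finished loading
+ curl -s http://localhost:8000/v1/models
+ ```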
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/phi-4-quantized-w4a16:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/phi-4-quantized-w4a16
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/phi-4-quantized-w4a16
+ ```
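+
+ The served model also exposes an OpenAI-compatible endpoint that can be called directly instead of `ilab model chat`. The address below is an assumption based on the typical default local serve address (commonly http://127.0.0.1:8000); check the `ilab model serve` output if yours differs:
+
+ ```bash
+ # List the models exposed by the local server (adjust host/port to your serve output)
+ curl -s http://127.0.0.1:8000/v1/models
+ ```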
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: phi-4-quantized.w4a16 # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: phi-4-quantized.w4a16 # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-phi-4-quantized-w4a16:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
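+
+ Optionally, verify that the InferenceService is ready before sending requests; this is a quick sanity check and assumes the resource names from the example manifests above:
+
+ ```bash
+ # READY should report True and URL shows the external address
+ oc get inferenceservice phi-4-quantized.w4a16
+
+ # Inspect the predictor pods if the service does not become ready
+ oc get pods -l serving.kserve.io/inferenceservice=phi-4-quantized.w4a16
+ ```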
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "phi-4-quantized.w4a16",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
  ## Creation
 
  <details>