robgreenberg3 committed
Commit ba32fbc · verified · 1 Parent(s): bb43537

Update README.md (#2)


- Update README.md (18f083bc013d2692c0200373d920154bf39c36b0)

Files changed (1)
  1. README.md +160 -1
README.md CHANGED
@@ -19,7 +19,14 @@ tags:
   - fp8
   ---
 
- # phi-4-FP8-dynamic
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ phi-4-FP8-dynamic
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
 
   ## Model Overview
   - **Model Architecture:** Phi3ForCausalLM
@@ -82,6 +89,158 @@ print(generated_text)
 
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
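+ For example, a minimal request against the chat completions endpoint (a sketch, assuming a local vLLM server on the default port 8000):
+
+ ```bash
+ # Query the OpenAI-compatible chat completions API of a locally running vLLM server
+ # (assumes the served model name defaults to the model path used at startup)
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "RedHatAI/phi-4-FP8-dynamic",
+     "messages": [{"role": "user", "content": "Give a one-sentence summary of FP8 quantization."}],
+     "max_tokens": 64
+   }'
+ ```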
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/phi-4-FP8-dynamic
+ ```
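+ Once the server reports that it is ready, a quick sanity check from another terminal (a sketch, assuming the port mapping above):
+
+ ```bash
+ # List the models exposed by the OpenAI-compatible endpoint
+ curl http://localhost:8000/v1/models
+ ```
+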
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/phi-4-fp8-dynamic:1.5
+ ```
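+
+ To confirm the download, the model should appear in the local model list (a quick check, assuming a standard InstructLab setup):
+
+ ```bash
+ # List locally available models and their on-disk locations
+ ilab model list
+ ```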
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/phi-4-fp8-dynamic
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/phi-4-fp8-dynamic
+ ```
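+
+ `ilab model serve` also exposes an OpenAI-compatible API; a minimal sanity check (a sketch, assuming the default listen address of `127.0.0.1:8000`):
+
+ ```bash
+ # Confirm the served model is reachable over the OpenAI-compatible API
+ curl http://127.0.0.1:8000/v1/models
+ ```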
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: phi-4-FP8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: phi-4-FP8-dynamic # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-phi-4-fp8-dynamic:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # apply both resources to run the model
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
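+
+ Before calling the endpoint, it can help to wait for the InferenceService to report a ready state and note its URL (a sketch; exact output columns depend on the KServe/RHOAI version):
+
+ ```bash
+ # Check readiness and the external URL of the InferenceService
+ oc get inferenceservice phi-4-FP8-dynamic
+
+ # Inspect the predictor pods if the service does not become ready
+ oc get pods -l serving.kserve.io/inferenceservice=phi-4-FP8-dynamic
+ ```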
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "phi-4-FP8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
  ## Creation
 
  <details>