robgreenberg3 and jennyyyi committed on
Commit e42433e · verified · 1 parent: d4e25af

Update README.md for model validation (#1)


- Update README.md for model validation (c00ffc938e8db0769f3590235fd40b49bd8cf339)
- Update README.md (f64129c1d85f76c7f9eaa6dc6aacb8a7f9cd802a)
- Update README.md (d289f5581a4b267d01115cf8facc6f6338efb4dd)


Co-authored-by: Jenny Y <[email protected]>

Files changed (1)
  1. README.md +197 -0
README.md CHANGED
@@ -25,6 +25,15 @@ license: other
  license_name: llama4
  ---
 
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Llama-4-Scout-17B-16E-Instruct
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+
 
  ## Model Information
  **Built with Llama**
@@ -96,6 +105,194 @@ These Llama 4 models mark the beginning of a new era for the Llama ecosystem. We
 
  2\. Llama 4 has been tested for image understanding up to 5 input images. If leveraging additional image understanding capabilities beyond this, developers are responsible for ensuring that risks in their deployments are mitigated, and should perform additional testing and tuning tailored to their specific applications.
 
+ ## Deployment
+
+ This model can be deployed efficiently using vLLM, Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct"
+ number_gpus = 4
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ # Format the request with the model's chat template
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+
+ outputs = llm.generate(prompt, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
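+ For example, a minimal sketch of OpenAI-compatible serving with `vllm serve` (default port 8000; the request body is illustrative):
+
+ ```bash
+ # Start an OpenAI-compatible server
+ vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 4
+
+ # In another shell, query the chat completions endpoint
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "RedHatAI/Llama-4-Scout-17B-16E-Instruct",
+     "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
+   }'
+ ```
+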
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct
+ ```
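+
+ Once the container is running, it exposes the same OpenAI-compatible API on the mapped port; a minimal smoke test, assuming the port mapping above (the request body is illustrative):
+
+ ```bash
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "RedHatAI/Llama-4-Scout-17B-16E-Instruct",
+     "messages": [{"role": "user", "content": "Hello!"}]
+   }'
+ ```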
+
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download the model from the Red Hat registry via docker
+ # Note: this downloads the model to ~/.cache/instructlab/models unless --model-dir is specified
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-scout-17b-16e-instruct:1.5
+ # (Example with an explicit release: docker://registry.redhat.io/rhelai1/granite-3.1-8b-lab-v2 --release 1.5)
+ ```
+
+ ```bash
+ # Serve the model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct
+
+ # Chat with the model
+ ilab model chat --model ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct
+ ```
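+
+ `ilab model serve` also exposes an OpenAI-compatible endpoint, by default on localhost port 8000; a minimal sketch of querying it directly, assuming that default (the model name in the payload is illustrative):
+
+ ```bash
+ curl http://127.0.0.1:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "llama-4-scout-17b-16e-instruct",
+     "messages": [{"role": "user", "content": "Hello!"}]
+   }'
+ ```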
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Llama-4-Scout-17B-16E-Instruct # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Llama-4-Scout-17B-16E-Instruct # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model first
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
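+
+ Before calling the endpoint, you can check that the deployment is ready (standard `oc` queries; names will match the resources applied above):
+
+ ```bash
+ # Confirm the InferenceService is ready and note its URL
+ oc get inferenceservice
+
+ # Watch the predictor pod start
+ oc get pods
+ ```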
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Llama-4-Scout-17B-16E-Instruct",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
  ## How to use with transformers
 
  Please make sure you have transformers `v4.51.0` installed, or upgrade using `pip install -U transformers`.