robgreenberg3 and jennyyyi committed
Commit fc14ce8 · verified · 1 Parent(s): f34f4a4

Update README.md (#1)


- Update README.md (40769ff93eddd12d117701bb7bb5c82a1804fefc)
- Update README.md (5b6430c21712f6e727dacd1d4519b1dfe7af4bba)


Co-authored-by: Jenny Y <[email protected]>

Files changed (1):
  1. README.md +196 -0
README.md CHANGED
@@ -25,6 +25,15 @@ license: other
 license_name: llama4
 ---
 
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Llama-4-Maverick-17B-128E-Instruct
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+
 
 ## Model Information
 **Built with Llama**
@@ -97,6 +106,193 @@ These Llama 4 models mark the beginning of a new era for the Llama ecosystem. We
 
 2\. Llama 4 has been tested for image understanding up to 5 input images. If leveraging additional image understanding capabilities beyond this, Developers are responsible for ensuring that their deployments are mitigated for risks and should perform additional testing and tuning tailored to their specific applications.
 
+ ## Deployment
+
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct"
+ number_gpus = 4
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Format the request as a chat prompt for this instruction-tuned model
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+
+ outputs = llm.generate(prompt, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
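For instance, once the model is launched with something like `vllm serve RedHatAI/Llama-4-Maverick-17B-128E-Instruct --tensor-parallel-size 4`, it can be queried with any OpenAI-compatible client. A minimal sketch, assuming the default endpoint `http://localhost:8000/v1` and the `openai` Python package; the base URL, API key, and served model name are deployment-specific assumptions:

```python
# Sketch: query a local `vllm serve` endpoint via its OpenAI-compatible API.
# The port (8000) and served model name below are assumptions; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```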
+
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Llama-4-Maverick-17B-128E-Instruct
+ ```
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
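The container above publishes port 8000, so the resulting endpoint can be queried like any other vLLM OpenAI-compatible server. A minimal sketch using plain HTTP, assuming the server is reachable at `http://localhost:8000` and that the served model name matches the one passed to `vllm serve` (check the server logs or `/v1/models` to confirm):

```python
# Sketch: POST a chat completion to the OpenAI-compatible endpoint published
# by `podman run -p 8000:8000`. Host and model name are assumptions.
import requests

payload = {
    "model": "RedHatAI/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    "max_tokens": 128,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```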
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-maverick-17b-128e-instruct:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-4-maverick-17b-128e-instruct
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/llama-4-maverick-17b-128e-instruct
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Llama-4-Maverick-17B-128E-Instruct # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Llama-4-Maverick-17B-128E-Instruct # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-maverick-17b-128e-instruct:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # apply both resources to run model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+   "model": "Llama-4-Maverick-17B-128E-Instruct",
+   "stream": true,
+   "stream_options": {
+     "include_usage": true
+   },
+   "max_tokens": 1,
+   "messages": [
+     {
+       "role": "user",
+       "content": "How can a bee fly when its wings are so small?"
+     }
+   ]
+ }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
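The same route can also be used from Python. A minimal sketch with the `openai` client, mirroring the streaming curl request above; the URL placeholders, any required auth token, and the served model name are deployment-specific assumptions:

```python
# Sketch: stream a chat completion from the OpenShift AI route above.
# Replace the URL placeholders and model name to match your InferenceService.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",  # or a token if the route is protected
)

stream = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```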
+ </details>
+
+
 ## How to use with transformers
 
 Please make sure you have transformers `v4.51.0` installed, or upgrade using `pip install -U transformers`.