Update README.md for model validation (#1)
- Update README.md for model validation (c00ffc938e8db0769f3590235fd40b49bd8cf339)
- Update README.md (f64129c1d85f76c7f9eaa6dc6aacb8a7f9cd802a)
- Update README.md (d289f5581a4b267d01115cf8facc6f6338efb4dd)
Co-authored-by: Jenny Y <[email protected]>
README.md CHANGED
@@ -25,6 +25,15 @@ license: other
license_name: llama4
---

<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Llama-4-Scout-17B-16E-Instruct
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

## Model Information
**Built with Llama**
@@ -96,6 +105,194 @@ These Llama 4 models mark the beginning of a new era for the Llama ecosystem. We

2\. Llama 4 has been tested for image understanding up to 5 input images. If leveraging additional image understanding capabilities beyond this, Developers are responsible for ensuring that their deployments are mitigated for risks and should perform additional testing and tuning tailored to their specific applications.
## Deployment

This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and OpenShift AI, as shown in the examples below.

Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language models."

# Format the request with the model's chat template so the instruct model
# receives properly structured input.
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(formatted_prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
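
Once a server is up (for example via `vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct`), it can be queried with the standard OpenAI Python client. The sketch below is illustrative only: the `base_url`, dummy `api_key`, and served model name are assumptions that depend on how the server was launched.

```python
# Illustrative sketch: query a running vLLM OpenAI-compatible endpoint.
# Assumes the server listens on localhost:8000 and serves this model id;
# adjust base_url, api_key, and model to match your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```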
<details>
|
138 |
+
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
|
139 |
+
|
140 |
+
```bash
|
141 |
+
$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
|
142 |
+
--ipc=host \
|
143 |
+
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
144 |
+
--env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
|
145 |
+
--name=vllm \
|
146 |
+
registry.access.redhat.com/rhaiis/rh-vllm-cuda \
|
147 |
+
vllm serve \
|
148 |
+
--tensor-parallel-size 8 \
|
149 |
+
--max-model-len 32768 \
|
150 |
+
--enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct
|
151 |
+
```
|
152 |
+
See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
|
153 |
+
</details>
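
Once the container above is running, a quick way to confirm the endpoint is live before sending prompts is to list the served models. This is a minimal sketch that assumes the port mapping shown above (8000 on localhost) and no API-key authentication on the endpoint:

```python
# Minimal sketch: confirm the Red Hat AI Inference Server endpoint is up.
# Assumes the podman command above exposed port 8000 on localhost.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=30)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```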

<details>
<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>

```bash
# Download model from Red Hat Registry via docker
# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-scout-17b-16e-instruct:1.5
# (Ex: docker://registry.redhat.io/rhelai1/granite-3.1-8b-lab-v2 --release 1.5)
```

```bash
# Serve model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct

# Chat with model
ilab model chat --model ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct
```

See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: Llama-4-Scout-17B-16E-Instruct # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: Llama-4-Scout-17B-16E-Instruct # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>

# apply both resources to run model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-4-Scout-17B-16E-Instruct",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```

See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>
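
Since the InferenceService above exposes an OpenAI-compatible endpoint, the same request can be issued from Python. The sketch below is a hypothetical illustration: the `base_url` placeholders mirror the curl example and must be replaced with your route, and it assumes no additional authentication is configured on the endpoint.

```python
# Illustrative sketch: stream a chat completion from the OpenShift AI endpoint.
# Replace the placeholder host with your InferenceService route; the api_key
# value is a dummy required by the client constructor, not by the server.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<domain>/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```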

## How to use with transformers

Please make sure you have transformers `v4.51.0` installed, or upgrade using `pip install -U transformers`.
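
As an illustrative starting point, a text-only generation with transformers could look like the sketch below. It is not the upstream model card snippet: the `device_map`, dtype, and generation settings are assumptions to adapt to your hardware, and `Llama4ForConditionalGeneration` with the processor's chat template is available from transformers v4.51.0 onward.

```python
# Minimal text-only sketch with transformers; settings are assumptions, adjust to your hardware.
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Give me a short introduction to large language models."}]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```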