robgreenberg3 jennyyyi committed on
Commit f166611 · verified
1 Parent(s): dc38470
Files changed (1)
  1. README.md +167 -2
README.md CHANGED
@@ -37,7 +37,15 @@ tags:
  - FP8
  ---
 
- # Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+
 
  ## Model Overview
  - **Model Architecture:** Mistral3ForConditionalGeneration
@@ -80,7 +88,7 @@ from vllm import LLM, SamplingParams
  from transformers import AutoProcessor
 
  model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic"
- number_gpus = 1
+ number_gpus = 4
 
  sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
  processor = AutoProcessor.from_pretrained(model_id)
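The hunk above only shows the top of the README's Python example. As a rough sketch of how such a snippet typically continues with vLLM's offline API (the messages, the `LLM(...)` construction, and the text-only prompt here are illustrative assumptions, not lines from the README):

```python
# Hypothetical continuation of the vLLM example shown in the hunk above;
# the exact prompt construction in the original file is not visible in this diff.
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to FP8 quantization."}]
# Render the chat template to a plain prompt string before handing it to vLLM.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# tensor_parallel_size follows the number_gpus value set earlier in the snippet.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

If the README follows the usual vLLM pattern, `number_gpus` feeds `tensor_parallel_size`, which is what the 1 → 4 change in this commit affects.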
@@ -99,6 +107,163 @@ print(generated_text)
 
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
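As a hypothetical illustration of the OpenAI-compatible serving mentioned above, using the `openai` client against a locally running `vllm serve` endpoint (the URL, port, and placeholder API key are assumptions, not taken from the README):

```python
# Hypothetical client call; host/port assume a local `vllm serve` instance on the default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic",
    messages=[{"role": "user", "content": "What does FP8-dynamic quantization change about this model?"}],
    temperature=0.7,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```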
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 1 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
+ ```
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
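Once the container above is running, a quick hypothetical readiness check; it assumes the `-p 8000:8000` port mapping from the command above and vLLM's standard `/v1/models` route:

```python
# Hypothetical readiness check against the container started above (localhost:8000 assumed).
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=30)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should include the served model name
```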
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/mistral-small-3-1-24b-instruct-2503-fp8-dynamic:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/mistral-small-3-1-24b-instruct-2503-fp8-dynamic
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/mistral-small-3-1-24b-instruct-2503-fp8-dynamic
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
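The note above says the model lands in `~/.cache/instructlab/models` unless `--model-dir` is passed; a tiny hypothetical check that the path `ilab model serve` expects is actually present:

```python
# Hypothetical sanity check that the ilab download landed where `ilab model serve` expects it.
from pathlib import Path

model_dir = Path.home() / ".cache/instructlab/models/mistral-small-3-1-24b-instruct-2503-fp8-dynamic"
print(model_dir, "exists:", model_dir.exists())
```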
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: mistral-small-3-1-24b-instruct-2503-fp8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: mistral-small-3-1-24b-instruct-2503-fp8-dynamic # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-mistral-small-3-1-24b-instruct-2503-fp8-dynamic:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # First, make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "mistral-small-3-1-24b-instruct-2503-fp8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
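A hypothetical Python equivalent of the curl call above, using the `openai` client; the URL placeholders carry over from that example and must be filled in, and the assumption that the route accepts unauthenticated requests mirrors the curl call:

```python
# Hypothetical streaming client against the InferenceService route defined above.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",  # assumption: the raw-deployment route does not require a token
)

stream = client.chat.completions.create(
    model="mistral-small-3-1-24b-instruct-2503-fp8-dynamic",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # The final usage chunk can arrive with an empty choices list, so guard before indexing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```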
+
+
  ## Creation
 
  <details>