Files changed (1)
  1. README.md +170 -1
README.md CHANGED
@@ -19,7 +19,14 @@ tags:
  - transformers
  ---

- # Model Card for Mistral-Small-24B-Instruct-2501
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Mistral-Small-24B-Instruct-2501
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>

Mistral Small 3 (2501) sets a new benchmark in the "small" Large Language Models category below 70B, boasting 24B parameters and achieving state-of-the-art capabilities comparable to larger models!
This model is an instruction-fine-tuned version of the base model: [Mistral-Small-24B-Base-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501).
@@ -113,6 +120,168 @@ The model can be used with the following frameworks;
- [`vllm`](https://github.com/vllm-project/vllm): See [here](#vllm)
- [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers)

+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Mistral-Small-24B-Instruct-2501
+ ```
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
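Once the container is running, the server exposes vLLM's OpenAI-compatible API on the published port. As a quick sanity check, here is a minimal sketch (not part of the upstream card) using the `openai` Python client, assuming the default `http://localhost:8000/v1` endpoint and the model name passed to `vllm serve` above:

```python
# Illustrative sanity check against the server started above.
# Assumes port 8000 is published as in the podman command and that the
# `openai` Python package is installed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no API key is required by default
)

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-24B-Instruct-2501",  # defaults to the path given to `vllm serve`
    messages=[{"role": "user", "content": "Summarize Mistral Small 3 in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```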
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/mistral-small-24b-instruct-2501:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/mistral-small-24b-instruct-2501 --gpu 1 -- --tokenizer-mode "mistral" --config-format "mistral" --load-format "mistral" --tool-call-parser "mistral" --enable-auto-tool-choice --limit-mm-per-prompt "image=10" --max-model-len 16384 --uvicorn-log-level "debug" --trust-remote-code
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/mistral-small-24b-instruct-2501
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
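Because the serve command above enables the Mistral tool-call parser and automatic tool choice, function calling can be exercised over the same OpenAI-compatible API. A hedged sketch, assuming the ilab server listens on its default `http://127.0.0.1:8000/v1` endpoint; the `get_weather` tool and the served model name below are illustrative placeholders, not part of the model card:

```python
# Illustrative only: `get_weather` is a made-up tool, and the endpoint/model
# name assume the `ilab model serve` defaults shown above.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mistral-small-24b-instruct-2501",  # served model name may differ on your setup
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
)
# With auto tool choice enabled, the model may reply with a tool call instead of text.
print(response.choices[0].message.tool_calls)
```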
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Openshift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Mistral-Small-24B-Instruct-2501 # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Mistral-Small-24B-Instruct-2501 # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       args:
+         - "--tokenizer-mode=mistral"
+         - "--config-format=mistral"
+         - "--load-format=mistral"
+         - "--tool-call-parser=mistral"
+         - "--enable-auto-tool-choice"
+         - "--limit-mm-per-prompt=image=10"
+         - "--max-model-len=16384"
+         - "--uvicorn-log-level=debug"
+         - "--trust-remote-code"
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-mistral-small-24b-instruct-2501:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Mistral-Small-24B-Instruct-2501",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat Openshift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
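The curl request above can also be issued from Python; a minimal sketch using the `openai` client, assuming the same placeholder route (replace `<inference-service-name>` and `<cluster-ingress-domain>` with your values) and streaming enabled as in the curl example:

```python
# Python equivalent of the curl call above; the base_url placeholders are
# assumptions and must be replaced with your actual KServe route.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="Mistral-Small-24B-Instruct-2501",  # must match the InferenceService name above
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```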
+
### vLLM

We recommend using this model with the [vLLM library](https://github.com/vllm-project/vllm)