robgreenberg3 committed on
Commit c1052cf · verified · 1 Parent(s): 44b25d0

Update README.md (#1)


- Update README.md (5aa833cccdf4843f1f438afad7760cc75c92ccb2)

Files changed (1): README.md (+182, −2)
README.md CHANGED

<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Mixtral-8x7B-v0.1
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

### Tokenization with `mistral-common`
 
```py
tokens = tokenizer.encode_chat_completion(completion_request).tokens
```
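
Only the last line of the tokenization example is visible in this diff. For context, a minimal self-contained sketch of the same `mistral-common` flow is shown below; the tokenizer version and the prompt text are illustrative assumptions rather than values taken from the card:

```python
# Hedged sketch of tokenizing a chat request with mistral-common.
# The tokenizer version (v1) and the prompt text are illustrative assumptions.
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.v1()

completion_request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain Machine Learning to me in a nutshell.")]
)

# Encode the chat request into model-ready token ids
tokens = tokenizer.encode_chat_completion(completion_request).tokens
print(len(tokens))
```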

## Deployment

This model can be deployed efficiently on vLLM, Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.

Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mixtral-8x7B-Instruct-v0.1"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]

# Apply the model's chat template so the instruct model receives a properly formatted request
formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(formatted_prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
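
As a quick illustration of that OpenAI-compatible mode (not part of the original card), the sketch below assumes a server started separately with `vllm serve RedHatAI/Mixtral-8x7B-Instruct-v0.1` listening on the default port 8000; the port and sampling settings are assumptions to adjust for your setup:

```python
# Hedged sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server was started separately, e.g.:
#   vllm serve RedHatAI/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 4
# and listens on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```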

<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/Mixtral-8x7B-Instruct-v0.1
```

See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>

```bash
# Download the model from the Red Hat registry via docker
# Note: this downloads the model to ~/.cache/instructlab/models unless --model-dir is specified
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1:1.4
```

```bash
# Serve the model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1

# Chat with the model
ilab model chat --model ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
```

See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up a vLLM server with a ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach the model to the vLLM server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: Mixtral-8x7B-Instruct-v0.1 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: mixtral-8x7b-instruct-v0-1 # specify the model name (lowercase, DNS-compliant); this value is used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-mixtral-8x7b-instruct-v0-1:1.4
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# Make sure you are in the project where you want to deploy the model
# oc project <project-name>

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.
# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mixtral-8x7b-instruct-v0-1",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```
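
The curl call above streams its response. As a rough illustration of consuming that stream from Python (not part of the original card), the sketch below points the `openai` client at the same KServe route; the hostname placeholders and served model name mirror the example above and must be adjusted to your cluster:

```python
# Hedged sketch: stream a chat completion from the deployed InferenceService.
# Replace the base_url host with your route (see `oc get inferenceservice`).
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",  # no key is required unless the route is fronted with auth
)

stream = client.chat.completions.create(
    model="mixtral-8x7b-instruct-v0-1",  # must match the served model name
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    max_tokens=256,
    stream=True,
)

# Print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```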

See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>
 
## Inference with `mistral_inference`

```py
from mistral_inference.transformer import Transformer