jennyyyi committed on
Commit 9811037 · verified · 1 Parent(s): e6a4ff7

Update README.md

Files changed (1)
  1. README.md +164 -2
README.md CHANGED
@@ -15,8 +15,14 @@ pipeline_tag: text-generation
15
  license: llama3.1
16
  base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
17
  ---
18
-
19
- # Meta-Llama-3.1-8B-Instruct-quantized.w8a8
20
 
21
  ## Model Overview
22
  - **Model Architecture:** Meta-Llama-3
@@ -82,6 +88,162 @@ print(generated_text)
82
 
83
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
84
85
 
86
  ## Creation
87
 
 
15
  license: llama3.1
16
  base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
17
  ---
18
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
19
+ Meta-Llama-3.1-8B-Instruct-quantized.w8a8
20
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
21
+ </h1>
22
+
23
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
24
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
25
+ </a>
26
 
27
  ## Model Overview
28
  - **Model Architecture:** Meta-Llama-3
 
88
 
89
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
90
 
91
+ <details>
92
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
93
+
94
+ ```bash
95
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
96
+ --ipc=host \
97
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
98
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
99
+ --name=vllm \
100
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
101
+ vllm serve \
102
+ --tensor-parallel-size 8 \
103
+ --max-model-len 32768 \
104
+ --enforce-eager --model RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
105
+ ```
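Once the container is running, it exposes vLLM's OpenAI-compatible API on the published port. A minimal smoke test, assuming the default `-p 8000:8000` mapping and the model name passed to `vllm serve` above:

```bash
# Query the OpenAI-compatible endpoint published by the container above.
# Assumes the default -p 8000:8000 mapping and the model name from `vllm serve`.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
    "messages": [{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    "max_tokens": 64
  }'
```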
106
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
107
+ </details>
108
+
109
+ <details>
110
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
111
+
112
+ ```bash
113
+ # Download model from Red Hat Registry via docker
114
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
115
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-1-8b-instruct-quantized-w8a8:1.5
116
+ ```
117
+
118
+ ```bash
119
+ # Serve model via ilab
120
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-3-1-8b-instruct-quantized-w8a8
121
+
122
+ # Chat with model
123
+ ilab model chat --model ~/.cache/instructlab/models/llama-3-1-8b-instruct-quantized-w8a8
124
+ ```
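`ilab model serve` also exposes an OpenAI-compatible API; a minimal sketch, assuming the default serve address of `127.0.0.1:8000` (adjust if your InstructLab config uses a different host or port):

```bash
# List the model id exposed by the local ilab server (assumes the default 127.0.0.1:8000 address)
curl http://127.0.0.1:8000/v1/models

# Send a chat completion using the model id returned above
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-id-from-above>", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```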
125
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
126
+ </details>
127
+
128
+ <details>
129
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
130
+
131
+ ```yaml
132
+ # Setting up vllm server with ServingRuntime
133
+ # Save as: vllm-servingruntime.yaml
134
+ apiVersion: serving.kserve.io/v1alpha1
135
+ kind: ServingRuntime
136
+ metadata:
137
+ name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
138
+ annotations:
139
+ openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
140
+ opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
141
+ labels:
142
+ opendatahub.io/dashboard: 'true'
143
+ spec:
144
+ annotations:
145
+ prometheus.io/port: '8080'
146
+ prometheus.io/path: '/metrics'
147
+ multiModel: false
148
+ supportedModelFormats:
149
+ - autoSelect: true
150
+ name: vLLM
151
+ containers:
152
+ - name: kserve-container
153
+ image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
154
+ command:
155
+ - python
156
+ - -m
157
+ - vllm.entrypoints.openai.api_server
158
+ args:
159
+ - "--port=8080"
160
+ - "--model=/mnt/models"
161
+ - "--served-model-name={{.Name}}"
162
+ env:
163
+ - name: HF_HOME
164
+ value: /tmp/hf_home
165
+ ports:
166
+ - containerPort: 8080
167
+ protocol: TCP
168
+ ```
169
+
170
+ ```yaml
171
+ # Attach model to vllm server. This is an NVIDIA template
172
+ # Save as: inferenceservice.yaml
173
+ apiVersion: serving.kserve.io/v1beta1
174
+ kind: InferenceService
175
+ metadata:
176
+ annotations:
177
+ openshift.io/display-name: llama-3-1-8b-instruct-quantized-w8a8 # OPTIONAL CHANGE
178
+ serving.kserve.io/deploymentMode: RawDeployment
179
+ name: llama-3-1-8b-instruct-quantized-w8a8 # specify model name. This value will be used to invoke the model in the payload
180
+ labels:
181
+ opendatahub.io/dashboard: 'true'
182
+ spec:
183
+ predictor:
184
+ maxReplicas: 1
185
+ minReplicas: 1
186
+ model:
187
+ modelFormat:
188
+ name: vLLM
189
+ name: ''
190
+ resources:
191
+ limits:
192
+ cpu: '2' # this is model specific
193
+ memory: 8Gi # this is model specific
194
+ nvidia.com/gpu: '1' # this is accelerator specific
195
+ requests: # same comment for this block
196
+ cpu: '1'
197
+ memory: 4Gi
198
+ nvidia.com/gpu: '1'
199
+ runtime: vllm-cuda-runtime # must match the ServingRuntime name above
200
+ storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-quantized-w8a8:1.5
201
+ tolerations:
202
+ - effect: NoSchedule
203
+ key: nvidia.com/gpu
204
+ operator: Exists
205
+ ```
206
+
207
+ ```bash
208
+ # Make sure you are in the project where you want to deploy the model
209
+ # oc project <project-name>
210
+
211
+ # apply both resources to run model
212
+
213
+ # Apply the ServingRuntime
214
+ oc apply -f vllm-servingruntime.yaml
215
+
216
+ # Apply the InferenceService
217
+ oc apply -f inferenceservice.yaml
218
+ ```
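Before calling the endpoint, it can help to confirm the InferenceService reports Ready and to read back its URL; a minimal sketch using standard KServe status fields (the resource name matches the InferenceService defined above):

```bash
# Check that the InferenceService is Ready
oc get inferenceservice llama-3-1-8b-instruct-quantized-w8a8

# Print the URL recorded in the status (if the service is exposed)
oc get inferenceservice llama-3-1-8b-instruct-quantized-w8a8 -o jsonpath='{.status.url}'
```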
219
+
220
+ ```bash
221
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
222
+ # - Run `oc get inferenceservice` to find your URL if unsure.
223
+
224
+ # Call the server using curl:
225
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
226
+ -H "Content-Type: application/json" \
227
+ -d '{
228
+ "model": "llama-3-1-8b-instruct-quantized-w8a8",
229
+ "stream": true,
230
+ "stream_options": {
231
+ "include_usage": true
232
+ },
233
+ "max_tokens": 1,
234
+ "messages": [
235
+ {
236
+ "role": "user",
237
+ "content": "How can a bee fly when its wings are so small?"
238
+ }
239
+ ]
240
+ }'
241
+
242
+ ```
243
+
244
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
245
+ </details>
246
+
247
 
248
  ## Creation
249