jennyyyi committed
Commit cd45c28 · verified · 1 Parent(s): b0b04ce

Update README.md

Files changed (1): README.md (+165 -2)

README.md CHANGED
@@ -15,8 +15,14 @@ pipeline_tag: text-generation
 license: llama3.1
 base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
 ---
-
-# Meta-Llama-3.1-8B-Instruct-FP8-dynamic
+<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+Meta-Llama-3.1-8B-Instruct-FP8-dynamic
+<img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+</h1>
+
+<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+<img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+</a>
 
 ## Model Overview
 - **Model Architecture:** Meta-Llama-3.1
@@ -77,6 +83,163 @@ print(generated_text)
 
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
+<details>
+<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+```bash
+$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+  --name=vllm \
+  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+  vllm serve \
+  --tensor-parallel-size 8 \
+  --max-model-len 32768 \
+  --enforce-eager --model RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic
+```
+See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+</details>
+
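Once the container above is running, it exposes vLLM's OpenAI-compatible API on the mapped port 8000. A minimal sketch with the `openai` Python client, assuming the server is reachable at `http://localhost:8000/v1` and that the served model name defaults to the path passed to `vllm serve` (no `--served-model-name` override):

```python
# Query the vLLM server started by the podman command above.
# Assumptions: endpoint on localhost:8000 (the -p 8000:8000 mapping) and
# served model name equal to the --model path.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of FP8 quantization."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```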
+<details>
+<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+```bash
+# Download model from Red Hat Registry via docker
+# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-1-8b-instruct-fp8-dynamic:1.5
+```
+
+```bash
+# Serve model via ilab
+ilab model serve --model-path ~/.cache/instructlab/models/llama-3-1-8b-instruct-fp8-dynamic
+
+# Chat with model
+ilab model chat --model ~/.cache/instructlab/models/llama-3-1-8b-instruct-fp8-dynamic
+```
+See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+</details>
+
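For programmatic access instead of `ilab model chat`, the served model also speaks the OpenAI-compatible API. A minimal sketch, assuming the default local serve address of `127.0.0.1:8000` (adjust to your InstructLab configuration) and discovering the served model id from `/v1/models` rather than hard-coding it:

```python
# Query the ilab-served endpoint programmatically (OpenAI-compatible API).
# Assumption: ilab model serve is listening on the default 127.0.0.1:8000;
# change base_url if your configuration uses a different host/port.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

model_id = client.models.list().data[0].id  # first (and only) served model
reply = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    max_tokens=64,
)
print(reply.choices[0].message.content)
```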
+<details>
+<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+```yaml
+# Setting up vllm server with ServingRuntime
+# Save as: vllm-servingruntime.yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+  annotations:
+    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  annotations:
+    prometheus.io/port: '8080'
+    prometheus.io/path: '/metrics'
+  multiModel: false
+  supportedModelFormats:
+    - autoSelect: true
+      name: vLLM
+  containers:
+    - name: kserve-container
+      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+      command:
+        - python
+        - -m
+        - vllm.entrypoints.openai.api_server
+      args:
+        - "--port=8080"
+        - "--model=/mnt/models"
+        - "--served-model-name={{.Name}}"
+      env:
+        - name: HF_HOME
+          value: /tmp/hf_home
+      ports:
+        - containerPort: 8080
+          protocol: TCP
+```
+
+```yaml
+# Attach model to vllm server. This is an NVIDIA template
+# Save as: inferenceservice.yaml
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  annotations:
+    openshift.io/display-name: llama-3-1-8b-instruct-fp8-dynamic # OPTIONAL CHANGE
+    serving.kserve.io/deploymentMode: RawDeployment
+  name: llama-3-1-8b-instruct-fp8-dynamic # specify model name. This value will be used to invoke the model in the payload
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  predictor:
+    maxReplicas: 1
+    minReplicas: 1
+    model:
+      modelFormat:
+        name: vLLM
+      name: ''
+      resources:
+        limits:
+          cpu: '2' # this is model specific
+          memory: 8Gi # this is model specific
+          nvidia.com/gpu: '1' # this is accelerator specific
+        requests: # same comment for this block
+          cpu: '1'
+          memory: 4Gi
+          nvidia.com/gpu: '1'
+      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+      storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5
+    tolerations:
+      - effect: NoSchedule
+        key: nvidia.com/gpu
+        operator: Exists
+```
+
+```bash
+# make sure first to be in the project where you want to deploy the model
+# oc project <project-name>
+
+# apply both resources to run model
+
+# Apply the ServingRuntime
+oc apply -f vllm-servingruntime.yaml
+
+# Apply the InferenceService
+oc apply -f inferenceservice.yaml
+```
+
+```bash
+# Replace <inference-service-name> and <cluster-ingress-domain> below:
+# - Run `oc get inferenceservice` to find your URL if unsure.
+
+# Call the server using curl:
+curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama-3-1-8b-instruct-fp8-dynamic",
+    "stream": true,
+    "stream_options": {
+      "include_usage": true
+    },
+    "max_tokens": 1,
+    "messages": [
+      {
+        "role": "user",
+        "content": "How can a bee fly when its wings are so small?"
+      }
+    ]
+  }'
+```
+
+See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+</details>
+
+
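The same chat-completions request can also be issued from Python once the InferenceService is reachable. A minimal sketch, using the placeholder route from the curl example above (substitute the real URL reported by `oc get inferenceservice`) and the model name set in the manifest:

```python
# Call the OpenShift AI / KServe endpoint with the OpenAI Python client.
# Assumptions: replace the placeholder URL with your actual route, and pass a
# token via api_key if the route enforces authentication.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="llama-3-1-8b-instruct-fp8-dynamic",  # matches the InferenceService name
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```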
 ## Creation
 
 This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.