robgreenberg3 and jennyyyi committed
Commit 2ac4ae9 · verified
1 Parent(s): 1576989

Update README.md (#2)


- Update README.md (de472da63cf6133c1e77a150bfecbe03266f041e)


Co-authored-by: Jenny Y <[email protected]>

Files changed (1)
  1. README.md +167 -2
README.md CHANGED
@@ -9,8 +9,14 @@ language:
  base_model: ibm-granite/granite-3.1-8b-instruct
  library_name: transformers
  ---
-
- # granite-3.1-8b-instruct-FP8-dynamic
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Granite-3.1-8b-instruct-FP8-dynamic
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>

  ## Model Overview
  - **Model Architecture:** granite-3.1-8b-instruct
@@ -61,6 +67,165 @@ print(generated_text)

  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 1 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/granite-3.1-8b-instruct-FP8-dynamic
+ ```
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
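+ The container above exposes vLLM's OpenAI-compatible API on port 8000, so a single chat-completion request is a quick way to verify the deployment. The call below is a minimal sketch: it assumes the default port mapping from the `podman run` command and that no authentication has been configured on the endpoint.
+
+ ```bash
+ # Minimal sketch: send one chat completion to the locally served model (assumes port 8000 as mapped above)
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "RedHatAI/granite-3.1-8b-instruct-FP8-dynamic",
+     "messages": [{"role": "user", "content": "What is FP8 quantization in one sentence?"}],
+     "max_tokens": 64
+   }'
+ ```
+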
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3-1-8b-instruct-fp8-dynamic:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/granite-3-1-8b-instruct-fp8-dynamic -- --trust-remote-code
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/granite-3-1-8b-instruct-fp8-dynamic
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
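+ Besides `ilab model chat`, the `ilab model serve` process exposes an OpenAI-compatible HTTP API, so the model can also be called programmatically. The check below is a sketch that assumes ilab's default serve address of 127.0.0.1:8000; once you know the served model name, the same `/v1/chat/completions` request shown for the Inference Server above works against this endpoint.
+
+ ```bash
+ # Sketch: list the models exposed by the local ilab server (assumes the default bind address 127.0.0.1:8000)
+ curl http://127.0.0.1:8000/v1/models
+ ```
+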
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: granite-3-1-8b-instruct-fp8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: granite-3-1-8b-instruct-fp8-dynamic # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       args:
+         - '--trust-remote-code'
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct-fp8-dynamic:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
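+ Before calling the endpoint, it can help to confirm that the InferenceService is ready and to look up the URL it was assigned. The commands below are a minimal sketch using standard `oc` output options; the resource name matches the InferenceService defined above.
+
+ ```bash
+ # Wait for the InferenceService to report READY=True
+ oc get inferenceservice granite-3-1-8b-instruct-fp8-dynamic
+
+ # Print the URL assigned to the service (populated in .status once the deployment is ready)
+ oc get inferenceservice granite-3-1-8b-instruct-fp8-dynamic -o jsonpath='{.status.url}'
+ ```
+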
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "granite-3-1-8b-instruct-fp8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
  ## Creation

  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.