robgreenberg3 committed · verified
Commit d9be03b · 1 Parent(s): 5565678

Update README.md

Files changed (1):
  1. README.md +162 -3

README.md CHANGED
@@ -5,7 +5,14 @@ tags:
5
  license: gemma
6
  ---
7
 
8
- # gemma-2-9b-it-FP8
9
 
10
  ## Model Overview
11
  - **Model Architecture:** Gemma 2
@@ -19,7 +26,7 @@ license: gemma
19
  - **Release Date:** 7/8/2024
20
  - **Version:** 1.0
21
  - **License(s):** [gemma](https://ai.google.dev/gemma/terms)
22
- - **Model Developers:** Neural Magic
23
 
24
  Quantized version of [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
25
  It achieves an average score of 73.49 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.23.
@@ -42,7 +49,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
42
  from vllm import LLM, SamplingParams
43
  from transformers import AutoTokenizer
44
 
45
- model_id = "neuralmagic/gemma-2-9b-it-FP8"
46
 
47
  sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
48
 
@@ -64,6 +71,158 @@ print(generated_text)
64
 
65
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
66
67
  ## Creation
68
 
69
  This model was created by applying [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py), as presented in the code snippet below.
 
5
  license: gemma
6
  ---
7
 
8
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
9
+ gemma-2-9b-it-FP8
10
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
11
+ </h1>
12
+
13
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
14
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
15
+ </a>
16
 
17
  ## Model Overview
18
  - **Model Architecture:** Gemma 2
 
26
  - **Release Date:** 7/8/2024
27
  - **Version:** 1.0
28
  - **License(s):** [gemma](https://ai.google.dev/gemma/terms)
29
+ - **Model Developers:** Neural Magic (Red Hat)
30
 
31
  Quantized version of [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
32
  It achieves an average score of 73.49 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.23.
 
49
  from vllm import LLM, SamplingParams
50
  from transformers import AutoTokenizer
51
 
52
+ model_id = "RedHatAI/gemma-2-9b-it-FP8"
53
 
54
  sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
55
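A minimal sketch of how these pieces typically fit together for offline generation with vLLM, assuming the chat-template call and the prompt text below (both illustrative, not the README's exact snippet):

```python
# Illustrative sketch: offline generation with the FP8 checkpoint via vLLM.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/gemma-2-9b-it-FP8"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# Build a chat-formatted prompt with the model's tokenizer (prompt is illustrative).
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the quantized model and generate.
llm = LLM(model=model_id)
outputs = llm.generate(prompt, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```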
 
 
71
 
72
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
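As a hedged illustration, assuming a server started with `vllm serve RedHatAI/gemma-2-9b-it-FP8` on the default port, the endpoint can be queried with the standard OpenAI client (host, port, and prompt below are assumptions):

```python
# Hypothetical client call against a local vLLM OpenAI-compatible server (assumed at localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/gemma-2-9b-it-FP8",
    messages=[{"role": "user", "content": "Summarize FP8 weight and activation quantization in one sentence."}],
    max_tokens=128,
    temperature=0.6,
)
print(response.choices[0].message.content)
```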
73
 
74
+ <details>
75
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
76
+
77
+ ```bash
78
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
79
+ --ipc=host \
80
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
81
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
82
+ --name=vllm \
83
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
84
+ vllm serve \
85
+ --tensor-parallel-size 8 \
86
+ --max-model-len 8192 \
87
+ --enforce-eager --model RedHatAI/gemma-2-9b-it-FP8
88
+ ```
89
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
90
+ </details>
91
+
92
+ <details>
93
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
94
+
95
+ ```bash
96
+ # Download model from Red Hat Registry via docker
97
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
98
+ ilab model download --repository docker://registry.redhat.io/rhelai1/gemma-2-9b-it-FP8:1.5
99
+ ```
100
+
101
+ ```bash
102
+ # Serve model via ilab
103
+ ilab model serve --model-path ~/.cache/instructlab/models/gemma-2-9b-it-FP8
104
+
105
+ # Chat with model
106
+ ilab model chat --model ~/.cache/instructlab/models/gemma-2-9b-it-FP8
107
+ ```
108
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
109
+ </details>
110
+
111
+ <details>
112
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
113
+
114
+ ```yaml
115
+ # Setting up vllm server with ServingRuntime
116
+ # Save as: vllm-servingruntime.yaml
117
+ apiVersion: serving.kserve.io/v1alpha1
118
+ kind: ServingRuntime
119
+ metadata:
120
+ name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
121
+ annotations:
122
+ openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
123
+ opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
124
+ labels:
125
+ opendatahub.io/dashboard: 'true'
126
+ spec:
127
+ annotations:
128
+ prometheus.io/port: '8080'
129
+ prometheus.io/path: '/metrics'
130
+ multiModel: false
131
+ supportedModelFormats:
132
+ - autoSelect: true
133
+ name: vLLM
134
+ containers:
135
+ - name: kserve-container
136
+ image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
137
+ command:
138
+ - python
139
+ - -m
140
+ - vllm.entrypoints.openai.api_server
141
+ args:
142
+ - "--port=8080"
143
+ - "--model=/mnt/models"
144
+ - "--served-model-name={{.Name}}"
145
+ env:
146
+ - name: HF_HOME
147
+ value: /tmp/hf_home
148
+ ports:
149
+ - containerPort: 8080
150
+ protocol: TCP
151
+ ```
152
+
153
+ ```yaml
154
+ # Attach model to vllm server. This is an NVIDIA template
155
+ # Save as: inferenceservice.yaml
156
+ apiVersion: serving.kserve.io/v1beta1
157
+ kind: InferenceService
158
+ metadata:
159
+ annotations:
160
+ openshift.io/display-name: gemma-2-9b-it-FP8 # OPTIONAL CHANGE
161
+ serving.kserve.io/deploymentMode: RawDeployment
162
+ name: gemma-2-9b-it-FP8 # specify model name. This value will be used to invoke the model in the payload
163
+ labels:
164
+ opendatahub.io/dashboard: 'true'
165
+ spec:
166
+ predictor:
167
+ maxReplicas: 1
168
+ minReplicas: 1
169
+ model:
170
+ modelFormat:
171
+ name: vLLM
172
+ name: ''
173
+ resources:
174
+ limits:
175
+ cpu: '2' # this is model specific
176
+ memory: 8Gi # this is model specific
177
+ nvidia.com/gpu: '1' # this is accelerator specific
178
+ requests: # same comment for this block
179
+ cpu: '1'
180
+ memory: 4Gi
181
+ nvidia.com/gpu: '1'
182
+ runtime: vllm-cuda-runtime # must match the ServingRuntime name above
183
+ storageUri: oci://registry.redhat.io/rhelai1/modelcar-gemma-2-9b-it-FP8:1.5
184
+ tolerations:
185
+ - effect: NoSchedule
186
+ key: nvidia.com/gpu
187
+ operator: Exists
188
+ ```
189
+
190
+ ```bash
191
+ # Make sure you are in the project where you want to deploy the model
192
+ # oc project <project-name>
193
+ # apply both resources to run model
194
+ # Apply the ServingRuntime
195
+ oc apply -f vllm-servingruntime.yaml
196
+ # Apply the InferenceService
197
+ oc apply -f inferenceservice.yaml
198
+ ```
199
+
200
+ ```bash
201
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
202
+ # - Run `oc get inferenceservice` to find your URL if unsure.
203
+ # Call the server using curl:
204
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
205
+ -H "Content-Type: application/json" \
206
+ -d '{
207
+ "model": "gemma-2-9b-it-FP8",
208
+ "stream": true,
209
+ "stream_options": {
210
+ "include_usage": true
211
+ },
212
+ "max_tokens": 1,
213
+ "messages": [
214
+ {
215
+ "role": "user",
216
+ "content": "How can a bee fly when its wings are so small?"
217
+ }
218
+ ]
219
+ }'
220
+ ```
221
+
222
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
223
+ </details>
224
+
225
+
226
  ## Creation
227
 
228
  This model was created by applying [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py), as presented in the code snippet below.
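
A minimal sketch of the AutoFP8 static-FP8 quantization flow described above, assuming the API and ultrachat calibration set from the linked example (paths, sample handling, and dataset identifier are illustrative):

```python
# Illustrative sketch of AutoFP8 static quantization with ultrachat calibration samples.
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "google/gemma-2-9b-it"
quantized_model_dir = "gemma-2-9b-it-FP8"

# Tokenize chat-formatted calibration prompts (dataset assumed from the linked example).
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
texts = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static, per-tensor FP8 quantization of weights and activations.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```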