robgreenberg3 committed · verified
Commit fa59a0d · 1 Parent(s): 20c0c31

Update README.md (#1)


- Update README.md (b9f87782987505b34352fe4a24887ba295527f53)

Files changed (1): README.md (+184 -0)
README.md CHANGED
@@ -9,6 +9,15 @@ base_model: google/gemma-2-9b
 
 
 # Gemma 2 model card
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+   gemma-2-9b-it
+   <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+   <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+
 
 **Model Page**: [Gemma](https://ai.google.dev/gemma/docs)
 
@@ -47,6 +56,181 @@ pip install -U transformers
 
 Then, copy the snippet from the section that is relevant for your usecase.
 
+ ## Deployment
+
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/gemma-2-9b-it"
+ number_gpus = 4
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+
+ outputs = llm.generate(prompt, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
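+ For example, after starting an OpenAI-compatible endpoint with `vllm serve RedHatAI/gemma-2-9b-it`, requests can be sent with the `openai` client. The snippet below is a minimal sketch that assumes the server is reachable at vLLM's default address (`http://localhost:8000/v1`) and that no API key is enforced:
+
+ ```python
+ from openai import OpenAI
+
+ # Sketch: assumes `vllm serve RedHatAI/gemma-2-9b-it` is running on the default port.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="RedHatAI/gemma-2-9b-it",
+     messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
+     temperature=0.7,
+     max_tokens=256,
+ )
+ print(response.choices[0].message.content)
+ ```
+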
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 8192 \
+   --enforce-eager --model RedHatAI/gemma-2-9b-it
+ ```
+
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
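+
+ As a quick smoke test, the OpenAI-compatible endpoint exposed by the container above can be probed from Python. This is a sketch that assumes the container is running and publishing port 8000 as shown; the returned model id is the name to use in chat requests:
+
+ ```python
+ import requests
+
+ # Sketch: list the models served by the container started above (adjust host/port if changed).
+ resp = requests.get("http://localhost:8000/v1/models", timeout=30)
+ resp.raise_for_status()
+ for model in resp.json()["data"]:
+     print(model["id"])
+ ```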
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/gemma-2-9b-it:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/gemma-2-9b-it
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/gemma-2-9b-it
+ ```
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: gemma-2-9b-it # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: gemma-2-9b-it # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-gemma-2-9b-it:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "gemma-2-9b-it",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
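+ The same endpoint can also be consumed from Python with the `openai` client. The snippet below is a sketch: the base URL is a placeholder for your InferenceService route (see `oc get inferenceservice`), and it simply iterates over the streamed response from the request above:
+
+ ```python
+ from openai import OpenAI
+
+ # Sketch: replace the placeholder base_url with your InferenceService route.
+ client = OpenAI(
+     base_url="https://<inference-service-name>-predictor-default.<domain>/v1",
+     api_key="EMPTY",
+ )
+
+ # Stream tokens as they are generated, mirroring the curl request above.
+ stream = client.chat.completions.create(
+     model="gemma-2-9b-it",
+     messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
+     stream=True,
+ )
+ for chunk in stream:
+     if chunk.choices and chunk.choices[0].delta.content:
+         print(chunk.choices[0].delta.content, end="")
+ ```
+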
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
  #### Running with the `pipeline` API
 
  ```python