robgreenberg3 and jennyyyi committed
Commit bd88524 · verified · 1 Parent(s): 4af938c

Update README.md (#1)


- Update README.md (1bc580a116d9dc2e648daa702e3a9d22e730c1f3)
- Update README.md (2a510a1442e19958c937f70a10b21d20bcc8a86c)


Co-authored-by: Jenny Y <[email protected]>

Files changed (1): README.md (+197, −2)
README.md CHANGED
@@ -10,8 +10,14 @@ base_model:
   - ibm-granite/granite-3.1-8b-base
  new_version: ibm-granite/granite-3.3-8b-instruct
  ---
-
- # Granite-3.1-8B-Instruct
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Granite-3.1-8B-Instruct
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>

  **Model Summary:**
  Granite-3.1-8B-Instruct is an 8B parameter long-context instruct model finetuned from Granite-3.1-8B-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
@@ -23,6 +29,195 @@ Granite-3.1-8B-Instruct is an 8B parameter long-context instruct model finetuned
  - **Release Date**: December 18th, 2024
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

+ ## Deployment
+
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/granite-3.1-8b-instruct"
+ number_gpus = 1
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ # Build the prompt with the model's chat template
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+
+ outputs = llm.generate(prompt, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
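+ For example, a minimal sketch of OpenAI-compatible serving for this model (the port and route below are vLLM defaults; adjust as needed):
+
+ ```bash
+ # Start an OpenAI-compatible server (listens on port 8000 by default)
+ vllm serve RedHatAI/granite-3.1-8b-instruct --tensor-parallel-size 1
+
+ # From another shell, send a chat completion request
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "RedHatAI/granite-3.1-8b-instruct",
+     "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
+   }'
+ ```
+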
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 1 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/granite-3.1-8b-instruct
+ ```
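+
+ Once the container is running, a quick sanity check that the server is up (assuming the port mapping above):
+
+ ```bash
+ # Lists the models the OpenAI-compatible server exposes
+ curl http://localhost:8000/v1/models
+ ```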
+ See the [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3-1-8b-instruct:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/granite-3-1-8b-instruct -- --trust-remote-code
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/granite-3-1-8b-instruct
+ ```
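+
+ If you prefer to script against the served model instead of the interactive chat, recent `ilab model serve` releases expose an OpenAI-compatible endpoint, by default on localhost port 8000 (an assumption; check your ilab configuration). A minimal sketch:
+
+ ```bash
+ # Endpoint and model name are assumptions; list served models via /v1/models if unsure
+ curl http://127.0.0.1:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "granite-3-1-8b-instruct",
+     "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
+   }'
+ ```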
+ See the [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: granite-3-1-8b-instruct # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: granite-3-1-8b-instruct # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       args:
+         - '--trust-remote-code'
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
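+
+ Before calling the endpoint, it can help to confirm the deployment is ready; a minimal check, assuming the resource names used above:
+
+ ```bash
+ # Watch the predictor pod come up
+ oc get pods
+
+ # The InferenceService reports a URL once it is ready
+ oc get inferenceservice granite-3-1-8b-instruct
+ ```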
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "granite-3-1-8b-instruct",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See the [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
  **Supported Languages:**
  English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.