base_model:
- ibm-granite/granite-3.1-8b-base
new_version: ibm-granite/granite-3.3-8b-instruct
---
<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Granite-3.1-8B-Instruct
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

**Model Summary:**
Granite-3.1-8B-Instruct is an 8B parameter long-context instruct model finetuned from Granite-3.1-8B-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
 
- **Release Date**: December 18th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
## Deployment

This model can be deployed efficiently on vLLM, Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.

Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/granite-3.1-8b-instruct"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Format the request with the model's chat template so the instruct model
# receives the structured prompt it was trained on
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
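As a quick illustration of that serving mode, the sketch below starts a local OpenAI-compatible server and queries it with curl; the port (vLLM's default, 8000) and the prompt are illustrative:

```bash
# Start an OpenAI-compatible server for this model (listens on port 8000 by default)
vllm serve RedHatAI/granite-3.1-8b-instruct --tensor-parallel-size 1

# From another shell, call the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/granite-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
  }'
```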

<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/granite-3.1-8b-instruct
```
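Once the container is running, a quick sanity check against the published port confirms the server is up; a minimal sketch, assuming the port mapping above (localhost:8000):

```bash
# List the served models; the response should include RedHatAI/granite-3.1-8b-instruct
curl http://localhost:8000/v1/models

# Minimal smoke-test completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/granite-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```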
See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>

```bash
# Download model from Red Hat Registry via docker
# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3-1-8b-instruct:1.5
```

```bash
# Serve model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/granite-3-1-8b-instruct

# Chat with model
ilab model chat --model ~/.cache/instructlab/models/granite-3-1-8b-instruct
```
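The served model can also be queried over its OpenAI-compatible API; a minimal sketch, assuming InstructLab's default serve address of http://127.0.0.1:8000 (the model name below is an assumption, so list the registered name first):

```bash
# List the model name the server registered
curl http://127.0.0.1:8000/v1/models

# Call the chat completions endpoint (use the model name returned above)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-3-1-8b-instruct",
    "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
  }'
```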
See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: RedHatAI/granite-3.1-8b-instruct # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: granite-3-1-8b-instruct # specify model name (must be a valid Kubernetes name). This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>

# apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-3-1-8b-instruct",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```
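If the endpoint is not reachable, the deployment can be checked with standard `oc` commands; a short sketch (the resource name matches the InferenceService above, and the pod name is a placeholder):

```bash
# Confirm the InferenceService is READY and note its URL
oc get inferenceservice granite-3-1-8b-instruct

# If it is not ready, inspect the predictor pod and its logs
oc get pods
oc logs <predictor-pod-name>
```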

See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>

**Supported Languages:**
English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.