Text Generation · Transformers · Safetensors · Finnish · llama · finnish · conversational · text-generation-inference
nielsr (HF Staff) committed
Commit 6059cd3 · verified · 1 Parent(s): ee9147f

Improve model card: Add `transformers` library, link paper, include abstract


This PR significantly enhances the model card for `Ahma-7B` by:

* **Adding `library_name: transformers` to the metadata**: This ensures the Hugging Face Hub correctly recognizes the model's compatible library, enabling the "how to use" button and providing relevant code snippets for users (a brief usage sketch follows this list).
* **Linking to the associated research paper**: The model card now explicitly references "[Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264)", which describes the training strategy and research behind the Ahma model. This link is added to the introductory section and updated in the "2-stage pretraining" section for clarity.
* **Including the paper abstract**: A dedicated "Paper Abstract" section has been added to provide users with immediate context about the research, its motivations, and key findings directly within the model card.
* **Removing `inference: false` from metadata**: This tag was contradictory, as the model card provides clear inference usage examples. Removing it clarifies that the model is indeed ready for direct inference.
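To make the effect of the `library_name` addition concrete, here is a minimal, hypothetical usage sketch (not part of this diff) showing how the model could be loaded with `transformers` and prompted with the Llama-style `[INST]`/`<<SYS>>` template used in the card's own usage example; the user prompt and the shortened system prompt below are illustrative placeholders.

```python
# Minimal sketch (illustrative, not part of this PR): load Ahma-7B via the
# `transformers` library advertised by the new `library_name` metadata.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Finnish-NLP/Ahma-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Llama-style prompt template from the model card's usage example.
# The system prompt is shortened here for illustration.
system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti."
user_prompt = "Kerro lyhyesti, mikä on ahma."  # placeholder question ("Briefly, what is a wolverine?")
prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{user_prompt.strip()} [/INST] "

# Tokenize and generate a short completion.
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```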

Files changed (1)
  1. README.md +92 -88
README.md CHANGED
@@ -1,24 +1,23 @@
  ---
- language:
- - fi
- license: apache-2.0
- tags:
- - finnish
- - llama
  datasets:
  - Finnish-NLP/CulturaX_fi_cleaned
  - Finnish-NLP/HPLT_1.2_fi_cleaned
  - Finnish-NLP/wikipedia_20231101_fi_cleaned
  - Finnish-NLP/Reddit_fi_2006_2022
  - intfloat/multilingual_cc_news
- inference: false
+ language:
+ - fi
+ license: apache-2.0
  pipeline_tag: text-generation
-
+ tags:
+ - finnish
+ - llama
+ library_name: transformers
  ---

  # Ahma-7B for Finnish

- Ahma-7B is 7B parameter decoder-only transformer model based on Meta's Llama (v1) architecture pretrained from scratch on Finnish language. Original Llama model architecture was introduced in
+ Ahma-7B is a 7B parameter decoder-only transformer model based on Meta's Llama (v1) architecture, pretrained from scratch on the Finnish language. Its development was informed by the research presented in the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264). The original Llama model architecture was introduced in
  [this paper](https://arxiv.org/abs/2302.13971)
  and first released at [this page](https://github.com/facebookresearch/llama).

@@ -26,17 +25,21 @@ What does Ahma mean? Ahma is the Finnish word for wolverine! In the Finnish Lapl

  There are two different sized base Ahma models both pretrained from scratch, Ahma-3B for 139B tokens and Ahma-7B for 149B tokens:

- | Model | Context length | Layers | Dim | Heads | Params |
+ | Model | Context length | Layers | Dim | Heads | Params |
  |:--------------------------------------------------------------------------------|:---------------|:-------|:-----|:------|:-------|
- | [Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) | 2048 | 26 | 3200 | 32 | 3.6B |
- | [Ahma-7B](https://huggingface.co/Finnish-NLP/Ahma-7B) | 2048 | 32 | 4096 | 32 | 7.0B |
+ | [Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) | 2048 | 26 | 3200 | 32 | 3.6B |
+ | [Ahma-7B](https://huggingface.co/Finnish-NLP/Ahma-7B) | 2048 | 32 | 4096 | 32 | 7.0B |

  And two instruct-tuned versions:

- | Model | Context length | Layers | Dim | Heads | Params |
+ | Model | Context length | Layers | Dim | Heads | Params |
  |:--------------------------------------------------------------------------------|:---------------|:-------|:-----|:------|:-------|
- | [Ahma-3B-Instruct](https://huggingface.co/Finnish-NLP/Ahma-3B-Instruct) | 2048 | 26 | 3200 | 32 | 3.6B |
- | [Ahma-7B-Instruct](https://huggingface.co/Finnish-NLP/Ahma-7B-Instruct) | 2048 | 32 | 4096 | 32 | 7.0B |
+ | [Ahma-3B-Instruct](https://huggingface.co/Finnish-NLP/Ahma-3B-Instruct) | 2048 | 26 | 3200 | 32 | 3.6B |
+ | [Ahma-7B-Instruct](https://huggingface.co/Finnish-NLP/Ahma-7B-Instruct) | 2048 | 32 | 4096 | 32 | 7.0B |
+
+ ## Paper Abstract
+
+ The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at this https URL .

  ## Intended uses & limitations

@@ -62,7 +65,11 @@ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti.


  def format_prompt(prompt: str) -> str:
- prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
+ prompt = f" [INST] <<SYS>>
+ {system_prompt.strip()}
+ <</SYS>>
+
+ {prompt.strip()} [/INST] "
  return prompt


@@ -146,29 +153,29 @@ Finally, 20,000 text examples from each of the CulturaX, Wikipedia, Yle, STT, Su
  The final training dataset had 23 billion words (calculated with regex "\w+") and the evaluation dataset had 23 million words. After tokenization, the training dataset had 41 billion tokens and the evaluation dataset had 40 million tokens. For the 2-stage pretraining, training datasets are divided as follows:

  The first stage:
- |Dataset | Words | Ratio |
+ |Dataset | Words | Ratio |
  |:-----------------------------|:------------|:-------------|
- |CulturaX | 12.820B | 59.88\% |
- |HPLT v1.2 | 5.034B | 23.51\% |
- |Suomi24 | 3.018B | 14.09\% |
- |Reddit | 0.141B | 0.66\% |
- |CC-News | 0.311B | 1.45\% |
- |FI news corpus | 0.004B | 0.02\% |
- |Project Lönnrot | 0.083B | 0.39\% |
- |**TOTAL** | **21.410B** | **100.0\%** |
+ |CulturaX | 12.820B | 59.88% |
+ |HPLT v1.2 | 5.034B | 23.51% |
+ |Suomi24 | 3.018B | 14.09% |
+ |Reddit | 0.141B | 0.66% |
+ |CC-News | 0.311B | 1.45% |
+ |FI news corpus | 0.004B | 0.02% |
+ |Project Lönnrot | 0.083B | 0.39% |
+ |**TOTAL** | **21.410B** | **100.0%** |


  The second stage:
- |Dataset | Words | Ratio |
+ |Dataset | Words | Ratio |
  |:--------------------------------------------------------------|:------------|:------------|
- |CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48\% |
- |Wikipedia | 0.095B | 2.34\% |
- |STT | 0.253B | 6.23\% |
- |Yle | 0.212B | 5.22\% |
- |Finnish parliament speeches | 0.021B | 0.52\% |
- |Finnish higher education public theses | 0.855B | 21.07\% |
- |Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14\% |
- |**TOTAL** | **4.059B** | **100.0\%** |
+ |CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48% |
+ |Wikipedia | 0.095B | 2.34% |
+ |STT | 0.253B | 6.23% |
+ |Yle | 0.212B | 5.22% |
+ |Finnish parliament speeches | 0.021B | 0.52% |
+ |Finnish higher education public theses | 0.855B | 21.07% |
+ |Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14% |
+ |**TOTAL** | **4.059B** | **100.0%** |

  ## Training procedure

@@ -183,7 +190,7 @@ The model was trained on TPUv4-32 VM, sponsored by the [Google TPU Research Clou

  The 2-stage pretraining approach was inspired by [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20) findings. For the first stage (79% of the entire training), we used noisier web-scraped datasets. For the second stage (21% of the entire training), we primarily used cleaner datasets and instruction-following datasets shuffled together, like in MiniCPM. The learning rate schedule for the 2-stage pretraining was Warmup-Stable-Decay (WSD). During the first stage, the learning rate schedule had a linear warmup for about 8 billion tokens to a peak learning rate of 1e-4 (note: with the Lion optimizer, the learning rate had to be about 10 times smaller than with the commonly used AdamW), followed by a stable phase where the rate of 1e-4 was kept constant. During the second stage, the learning rate schedule had a linear decay from 1e-4 to 6e-6 for the first 7 billion tokens, followed by a stable phase for the remaining tokens.

- In the first stage, the model was trained for 118 billion tokens, which is about three epochs of the first-stage training data, inspired by the findings of [this paper](https://arxiv.org/abs/2305.16264). In the second stage, the model was trained for 31 billion tokens, which is close to five epochs of the second-stage training data.
+ In the first stage, the model was trained for 118 billion tokens, which is about three epochs of the first-stage training data, inspired by the findings of [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264). In the second stage, the model was trained for 31 billion tokens, which is close to five epochs of the second-stage training data.

  Thanks to the WSD learning rate schedule, you can more easily experiment with different first-stage model checkpoints. For example, you could apply the second-stage training on an earlier checkpoint or continue pretraining further before the second stage. Model checkpoints were pushed to this repository every 100,000 training steps (approximately 13 billion tokens).

@@ -205,43 +212,41 @@ This Ahma 7B base model was primarily evaluated using [FIN-bench by TurkuNLP](ht

  0-shot results:

- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
  |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
- | Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
- | Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
- | Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
- | Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
- | General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
- | HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
- | Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
- | Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
- | Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
- | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
- | Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
- | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
- | **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |
-
+ | Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
+ | Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
+ | Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
+ | Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
+ | Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
+ | General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
+ | HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
+ | Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
+ | Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
+ | Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
+ | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
+ | Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
+ | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
+ | **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |

  3-shot results:

- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
  |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
- | Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
- | Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
- | Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
- | Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
- | General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
- | HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
- | Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
- | Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
- | Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
- | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
- | Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
- | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
- | **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |
-
+ | Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
+ | Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
+ | Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
+ | Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
+ | Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
+ | General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
+ | HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
+ | Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
+ | Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
+ | Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
+ | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
+ | Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
+ | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
+ | **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |

  As we can see, Ahma 7B base model has bad arithmetic performance but in non-arithmetic tasks it clearly outperforms same sized models like the FinGPT 8B and Viking 7B, especially in 0-shot usage. Ahma 7B base model is even on-par with the 5X larger Poro 34B model, in non-arithmetic tasks in 0-shot usage. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruct-following examples during the pretraining phase.

@@ -254,31 +259,31 @@ This Ahma 7B base model was also evaluated using [MTBench Finnish by LumiOpen](h

  Single-turn results:

- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
  |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
- | Coding | 1.00 | 1.00 | 1.70 | 1.10 |
- | Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
- | Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
- | Math | 3.00 | 3.20 | 3.90 | 2.90 |
- | Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
- | Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
- | STEM | 5.10 | 5.95 | 6.75 | 7.30 |
- | Writing | 6.60 | 9.00 | 7.10 | 8.80 |
- | **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |
+ | Coding | 1.00 | 1.00 | 1.70 | 1.10 |
+ | Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
+ | Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
+ | Math | 3.00 | 3.20 | 3.90 | 2.90 |
+ | Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
+ | Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
+ | STEM | 5.10 | 5.95 | 6.75 | 7.30 |
+ | Writing | 6.60 | 9.00 | 7.10 | 8.80 |
+ | **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |

  Multi-turn results:

- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
  |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
- | Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
- | Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
- | Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
- | Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
- | Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
- | Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
- | STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
- | Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
- | **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |
+ | Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
+ | Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
+ | Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
+ | Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
+ | Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
+ | Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
+ | STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
+ | Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
+ | **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |

  As we can see, Ahma 7B base model struggles with multi-turn examples, as expected, since it has only been pretrained with single-turn instruction following examples. In addition, coding performance was expectedly poor because the Ahma 7B model is not trained with code data. In single-turn setting, Ahma 7B beats both the Ahma 3B base and Instruct-tuned versions, demonstrating greater base capability to be further improved with Instruct-tuning.

@@ -294,5 +299,4 @@ This project would not have been possible without compute generously provided by

  Feel free to contact us for more details 🤗

-
  ![Ahma](ahma.jpg)
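As an optional follow-up once this PR is merged, a small, hypothetical check with the `huggingface_hub` client (assumed to be installed) could confirm that the Hub exposes the metadata added above:

```python
# Hypothetical verification snippet (not part of this PR): query the Hub API
# and confirm the model card metadata introduced in this change is picked up.
from huggingface_hub import model_info

info = model_info("Finnish-NLP/Ahma-7B")
print(info.library_name)   # expected: "transformers"
print(info.pipeline_tag)   # expected: "text-generation"
print(info.tags)           # should include "finnish" and "llama"
```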