kyynaama committed on
Commit
1e9bf1c
·
verified ·
1 Parent(s): ca9c1cf

Upload folder using huggingface_hub

README.md CHANGED
@@ -1 +1,261 @@
1
- 6 bpw exllamav2 quant of Finnish-NLP/Ahma-3B
1
+ ---
2
+ language:
3
+ - fi
4
+ license: apache-2.0
5
+ tags:
6
+ - finnish
7
+ - llama
8
+ datasets:
9
+ - Finnish-NLP/CulturaX_fi_cleaned
10
+ - Finnish-NLP/HPLT_1.2_fi_cleaned
11
+ - Finnish-NLP/wikipedia_20231101_fi_cleaned
12
+ - Finnish-NLP/Reddit_fi_2006_2022
13
+ - intfloat/multilingual_cc_news
14
+ inference: false
15
+ pipeline_tag: text-generation
16
+
17
+ ---
18
+
19
+ # Ahma-3B for Finnish
20
+
21
+ Ahma is a 3B-parameter decoder-only transformer model based on Meta's Llama (v1) architecture, pretrained on the Finnish language. The original Llama model architecture was introduced in
22
+ [this paper](https://arxiv.org/abs/2302.13971)
23
+ and first released at [this page](https://github.com/facebookresearch/llama).
24
+
25
+ What does Ahma mean? Ahma is the Finnish word for wolverine! In Finnish Lapland, wolverines are the biggest cause of reindeer damage.
26
+
27
+ There are two Ahma models of different sizes, both pretrained from scratch for 139B tokens:
28
+
29
+ | Model | Context length | Layers | Dim | Heads | Params |
30
+ |:--------------------------------------------------------------------------------|:---------------|:-------|:-----|:------|:-------|
31
+ | [Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) | 2048 | 26 | 3200 | 32 | 3.6B |
32
+ | [Ahma-7B](https://huggingface.co/Finnish-NLP/Ahma-7B) | 2048 | 32 | 4096 | 32 | 7.0B |
33
+
34
+ ## Intended uses & limitations
35
+
36
+ This model was pretrained only in a self-supervised way, without any supervised training. You can use this model for text generation or fine-tune it for a downstream task. This model followed a 2-stage pretraining approach where single-turn instruction-following examples were mixed in with the other training data in the second stage (explained more later in this readme). Thanks to this approach, this pretrained model is already capable of instruction following, but you might get even better results if you specifically fine-tune it for instruction following or other use cases. For instruction-following fine-tuning, you should use the same prompt format showcased below.
37
+
38
+ ### How to use
39
+
40
+ **Finetuning:** \
41
+ We have now added a finetuning example notebook along with a video! \
42
+ Notebook: https://huggingface.co/Finnish-NLP/Ahma-3B/blob/main/Finetune_Ahma_3B_example.ipynb \
43
+ Video: https://www.youtube.com/watch?v=6mbgn9XzpS4
44
+
45
+
46
+ **Inference:** \
47
+ If you want to use this model for instruction-following, you need to use the same prompt format we used in the second stage of the pretraining (basically the same format that Meta used in their Llama 2 models). **Note: do not use "LlamaTokenizer" from the transformers library; always use AutoTokenizer instead, or use the plain sentencepiece tokenizer.** Here is an example using the instruction-following prompt format, with some generation arguments you can modify for your use:
48
+
49
+ ```python
50
+ from transformers import AutoTokenizer, AutoModelForCausalLM
51
+
52
+ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti. Vastauksesi eivät saa sisältää mitään haitallista, epäeettistä, rasistista, seksististä, vaarallista tai laitonta sisältöä. Jos kysymyksessä ei ole mitään järkeä tai se ei ole asiasisällöltään johdonmukainen, selitä miksi sen sijaan, että vastaisit jotain väärin. Jos et tiedä vastausta kysymykseen, älä kerro väärää tietoa."
53
+
54
+
55
+ def format_prompt(prompt: str) -> str:
56
+     prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
57
+     return prompt
58
+
59
+
60
+ tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B")
61
+ model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B")
62
+
63
+ # use the custom prompt format function or the chat template feature in the tokenizer to format your inputs
64
+
65
+ # prompt = format_prompt("Mitä hyötyjä pienet avoimen lähdekoodin kielimallit tuovat?")
66
+ # inputs = tokenizer(prompt, return_tensors="pt")
67
+
68
+ messages = [
69
+     {
70
+         "role": "system",
71
+         "content": system_prompt,
72
+     },
73
+     {"role": "user", "content": "Mitä hyötyjä pienet avoimen lähdekoodin kielimallit tuovat?"},
74
+ ]
75
+ inputs = tokenizer.apply_chat_template(
76
+     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
77
+ )
78
+
79
+ generated_ids = model.generate(
80
+     inputs,
81
+     temperature=0.6,
82
+     penalty_alpha=0.6,
83
+     top_k=4,
84
+     do_sample=True,
85
+     repetition_penalty=1.2,
86
+     min_length=5,
87
+     max_length=2048,
88
+ )
89
+ generated_text = tokenizer.batch_decode(
90
+     generated_ids, skip_special_tokens=False
91
+ )[0]
92
+
93
+ # Pienillä avoimen lähdekoodin kielimalleilla on lukuisia etuja, kuten parempi tarkkuus, nopeampi käsittelyaika ja parempi skaalautuvuus. Ne ovat myös usein edullisempia käyttää kuin kaupalliset mallit, joten ne ovat hyvä valinta pienemmille organisaatioille ja yksityishenkilöille, joilla on rajoitettu budjetti. Lisäksi ne voivat tarjota paremman joustavuuden ja mukauttamisen, koska käyttäjät voivat räätälöidä malleja vastaamaan omia tarpeitaan. Kaiken kaikkiaan pienet avoimen lähdekoodin kielimallit tarjoavat merkittäviä etuja, kuten paremman suorituskyvyn, paremman tarkkuuden, nopeamman käsittelyajan ja paremman skaalautuvuuden.
94
+ ```
95
+
96
+ You may experiment with different system prompt instructions too if you like.
97
+
98
+ ### Limitations and bias
99
+
100
+ The training data used for this model contains a lot of content from the internet, which is far from neutral. Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
101
+
102
+ To reduce toxic content, the training data was filtered with a toxicity classifier, but this cannot eliminate all toxic text.
103
+
104
+ ## Training data
105
+
106
+ This model was pretrained on the combination of 14 datasets:
107
+ - [CulturaX_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/CulturaX_fi_cleaned), we cleaned the Finnish split from the original [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset
108
+ - [HPLT_1.2_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/HPLT_1.2_fi_cleaned), we cleaned the Finnish split from the original [HPLT v1.2](https://hplt-project.org/datasets/v1.2) dataset
109
+ - [wikipedia_20231101_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/wikipedia_20231101_fi_cleaned), we used the Finnish subset of the Wikipedia (November 2023) dataset
110
+ - [Reddit_fi_2006_2022](https://huggingface.co/datasets/Finnish-NLP/Reddit_fi_2006_2022), a filtered and post-processed dataset of Finnish Reddit
111
+ - [Yle Finnish News Archive 2011-2018](http://urn.fi/urn:nbn:fi:lb-2017070501)
112
+ - [Yle Finnish News Archive 2019-2020](http://urn.fi/urn:nbn:fi:lb-2021050401)
113
+ - [Finnish News Agency Archive (STT)](http://urn.fi/urn:nbn:fi:lb-2018121001)
114
+ - [The Suomi24 Sentences Corpus](http://urn.fi/urn:nbn:fi:lb-2020021803)
115
+ - [Project Lönnrot](http://www.lonnrot.net/)
116
+ - [Finnish parliament speeches](https://avoindata.eduskunta.fi)
117
+ - [multilingual_cc_news](https://huggingface.co/datasets/intfloat/multilingual_cc_news), we used the Finnish subset of the multilingual CC-News dataset
118
+ - [fi-news-corpus](https://github.com/nkrusch/fi-news-corpus)
119
+ - Finnish higher education public theses
120
+ - Finnish single-turn instruction-following datasets, a combination of multiple openly licensed, originally English datasets translated into Finnish, for example [Ultrachat, Aya, Capybara, etc.](https://huggingface.co/collections/Finnish-NLP/sft-dpo-dataset-65f55dde1139c3cd683ff035)
121
+
122
+
123
+ Raw datasets were automatically cleaned to filter out low-quality and non-Finnish examples. In addition, a [perplexity](https://huggingface.co/course/chapter7/3#perplexity-for-language-models) score was calculated for all texts with a KenLM model trained only on very clean Finnish texts. This perplexity score can then be used to estimate how clean the Finnish in a text is. To reduce toxic text, we used the Finnish toxicity classifier [TurkuNLP/bert-large-finnish-cased-toxicity](https://huggingface.co/TurkuNLP/bert-large-finnish-cased-toxicity) released by TurkuNLP to classify all text examples. The classified toxicity label scores can then be used to determine how toxic a text is.
124
+
125
+ All datasets were concatenated and the whole dataset was near-deduplicated using MinHashLSH from [text-dedup](https://github.com/ChenghaoMou/text-dedup). The 95th percentile of the perplexity scores was used as the threshold to filter out the worst-quality 5% of texts. To reduce the amount of toxic content, the dataset was filtered to include only text examples with a score lower than 80% for the toxicity labels "label_identity_attack", "label_insult", "label_threat" and "label_severe_toxicity".
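The two filters described above can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the record layout (a `perplexity` value plus a `toxicity` dict keyed by label), the helper name `filter_examples`, and the nearest-rank percentile are all assumptions. It also assumes that a lower KenLM perplexity means cleaner text and that an "80% score" corresponds to a classifier score of 0.8.

```python
import math

# Toxicity labels named in the README.
TOXICITY_LABELS = (
    "label_identity_attack",
    "label_insult",
    "label_threat",
    "label_severe_toxicity",
)


def filter_examples(examples, perplexity_quantile=0.95, toxicity_threshold=0.8):
    """Drop the highest-perplexity tail and any example with a high toxicity score."""
    if not examples:
        return []
    # Nearest-rank percentile (an assumption; the README does not specify the method).
    scores = sorted(ex["perplexity"] for ex in examples)
    rank = max(1, math.ceil(perplexity_quantile * len(scores)))
    cutoff = scores[rank - 1]

    kept = []
    for ex in examples:
        if ex["perplexity"] > cutoff:
            continue  # falls in the worst-quality tail by perplexity
        if any(
            ex["toxicity"].get(label, 0.0) >= toxicity_threshold
            for label in TOXICITY_LABELS
        ):
            continue  # at least one toxicity label is at or above the threshold
        kept.append(ex)
    return kept
```

In practice the cutoff would more likely be computed once over the whole corpus and then applied while streaming, rather than holding every record in memory as this sketch does.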
126
+
127
+ Finally, 20,000 text examples from each of the CulturaX, Wikipedia, Yle, STT, Suomi24, and Reddit datasets were randomly selected for the evaluation dataset.
128
+
129
+ The final training dataset had 23 billion words (calculated with the regex "\w+") and the evaluation dataset had 23 million words. After tokenization, the training dataset had 41 billion tokens and the evaluation dataset had 40 million tokens. For the 2-stage pretraining, the training datasets were divided as follows:
130
+
131
+ The first stage:
132
+ |Dataset | Words | Ratio |
133
+ |:-----------------------------|:------------|:-------------|
134
+ |CulturaX | 12.820B | 59.88\% |
135
+ |HPLT v1.2 | 5.034B | 23.51\% |
136
+ |Suomi24 | 3.018B | 14.09\% |
137
+ |Reddit | 0.141B | 0.66\% |
138
+ |CC-News | 0.311B | 1.45\% |
139
+ |FI news corpus | 0.004B | 0.02\% |
140
+ |Project Lönnrot | 0.083B | 0.39\% |
141
+ |**TOTAL** | **21.410B** | **100.0\%** |
142
+
143
+
144
+ The second stage:
145
+ |Dataset | Words | Ratio |
146
+ |:--------------------------------------------------------------|:------------|:------------|
147
+ |CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48\% |
148
+ |Wikipedia | 0.095B | 2.34\% |
149
+ |STT | 0.253B | 6.23\% |
150
+ |Yle | 0.212B | 5.22\% |
151
+ |Finnish parliament speeches | 0.021B | 0.52\% |
152
+ |Finnish higher education public theses | 0.855B | 21.07\% |
153
+ |Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14\% |
154
+ |**TOTAL** | **4.059B** | **100.0\%** |
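The word counts in the tables above are described as being calculated with the regex "\w+". A minimal sketch of that counting rule is shown here; the helper name `count_words` is hypothetical, and whether the authors used Python's `re` module is an assumption.

```python
import re

# "\w+" matches runs of letters, digits and underscores; in Python 3 this is
# Unicode-aware for str input, so Finnish letters such as "ä" and "ö" count as word characters.
WORD_RE = re.compile(r"\w+")


def count_words(text: str) -> int:
    """Count words as maximal runs of word characters."""
    return len(WORD_RE.findall(text))
```

Note that under this rule a hyphenated token such as "3B-malli" counts as two words, because the hyphen is not a word character.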
155
+
156
+ ## Training procedure
157
+
158
+ ### Preprocessing
159
+
160
+ Texts are tokenized with Byte Pair Encoding (BPE) using the SentencePiece implementation, splitting all numbers into individual digits and using bytes to decompose unknown UTF-8 characters. The total
160
+ vocabulary size is 64k tokens. Inputs are sequences of 2048 consecutive tokens. Texts are not lowercased, so this model is case-sensitive: it distinguishes between finnish and Finnish. Both BOS and EOS tokens were used in the pretraining.
162
+
163
+ ### 2-stage pretraining
164
+
165
+ The model was trained on a TPUv4-32 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/). Training was conducted with a slightly modified Jax/Flax-based [EasyLM](https://github.com/young-geng/EasyLM) framework and was inspired by the [OpenLLaMA](https://github.com/openlm-research/open_llama) project. The optimizer used was [Lion](https://arxiv.org/abs/2302.06675).
166
+
167
+ The 2-stage pretraining approach was inspired by [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20) findings. For the first stage (85% of the entire training), we used noisier web-scraped datasets. For the second stage (15% of the entire training), we primarily used cleaner datasets and instruction-following datasets shuffled together, like in MiniCPM. The learning rate schedule for the 2-stage pretraining was Warmup-Stable-Decay (WSD). During the first stage, the learning rate schedule had a linear warmup for about 8 billion tokens to a peak learning rate of 1e-4 (note: with the Lion optimizer, the learning rate had to be about 10 times smaller than with the commonly used AdamW), followed by a stable phase where the rate of 1e-4 was kept constant. During the second stage, the learning rate schedule had a linear decay from 1e-4 to 1e-5 for the first 13 billion tokens, followed by a stable phase for the remaining tokens.
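The Warmup-Stable-Decay schedule described above can be written as a function of the number of tokens seen. The sketch below uses the figures stated in this README (about 8B warmup tokens, a 118B-token first stage, a 13B-token decay, peak 1e-4, final 1e-5). The function name is hypothetical, and starting the warmup from zero is an assumption, since the README does not give the initial learning rate.

```python
def wsd_learning_rate(tokens_seen: float) -> float:
    """Learning rate for a Warmup-Stable-Decay schedule, indexed by tokens seen."""
    billion = 1e9
    warmup_tokens = 8 * billion    # linear warmup (start value of 0 is an assumption)
    stage1_tokens = 118 * billion  # end of the first stage
    decay_tokens = 13 * billion    # linear decay at the start of the second stage
    peak_lr = 1e-4
    final_lr = 1e-5

    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen < stage1_tokens:
        return peak_lr  # stable phase of the first stage
    decay_progress = (tokens_seen - stage1_tokens) / decay_tokens
    if decay_progress < 1.0:
        return peak_lr + (final_lr - peak_lr) * decay_progress
    return final_lr  # stable phase for the remaining second-stage tokens
```

Because the rate stays constant at 1e-4 through most of the first stage, a decay of this shape could in principle be started from any first-stage checkpoint.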
168
+
169
+ In the first stage, the model was trained for 118 billion tokens, which is about three epochs of the first-stage training data, inspired by the findings of [this paper](https://arxiv.org/abs/2305.16264). In the second stage, the model was trained for 21 billion tokens, which is about three epochs of the second-stage training data.
170
+
171
+ Thanks to the WSD learning rate schedule, you can more easily experiment with different first-stage model checkpoints. For example, you could apply the second-stage training on an earlier checkpoint or continue pretraining further before the second stage. Model checkpoints were pushed to this repository every 100,000 training steps (approximately 13 billion tokens).
172
+
173
+ - [900K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/916632fe707a7fbe341a1902ac9eacf6e5872ec9)
174
+ - [800K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/a18d46e62823b19b4a97332c0a5a62b14372a3e2)
175
+ - [700K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/2d16e05820af108582dbfcd3d25e51c6f1d5076b)
176
+ - [600K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/949f4bfba406882d5ce0343aa1242bcf901202e2)
177
+ - [500K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/359812c02839d4085d890c6db0e57796b7e48bfc)
178
+ - [400K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/62468680cb84579a7d1885f60abe6d6607f59f45)
179
+ - [300K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/0424dcc0b3dbf505f7b20cf02cb80233289ef125)
180
+ - [200K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/e415206d791aad108bed8578009bf255c1f22c91)
181
+ - [100K](https://huggingface.co/Finnish-NLP/Ahma-3B/tree/8085f7c3fba46cfdbf95a01b7a1da1587b757f8b)
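An intermediate checkpoint from the list above can be loaded by passing its commit hash as the `revision` argument of `from_pretrained`. The sketch below copies three of the hashes from the links above; the mapping name and the helper `load_checkpoint` are hypothetical, and it assumes that the tokenizer files are also present at each of those commits.

```python
# Commit hashes copied from the checkpoint links above (subset shown).
CHECKPOINT_REVISIONS = {
    "900K": "916632fe707a7fbe341a1902ac9eacf6e5872ec9",
    "500K": "359812c02839d4085d890c6db0e57796b7e48bfc",
    "100K": "8085f7c3fba46cfdbf95a01b7a1da1587b757f8b",
}


def load_checkpoint(step_label: str):
    """Load the tokenizer and model at the commit for the given training step."""
    # Imported lazily so the mapping can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    revision = CHECKPOINT_REVISIONS[step_label]
    tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B", revision=revision)
    model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B", revision=revision)
    return tokenizer, model
```

Calling `load_checkpoint("500K")` downloads the full weights for that commit, so it needs network access and several gigabytes of disk space.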
182
+
183
+ ## Evaluation results
184
+
185
+ ### FIN-bench
186
+
187
+ This Ahma model was primarily evaluated using [FIN-bench by TurkuNLP](https://github.com/TurkuNLP/FIN-bench), and the same evaluation was carried out for other relevant Finnish models for comparison. Below are the results with 0-shot and 3-shot settings in FIN-bench:
188
+
189
+ | Benchmark | Ahma 3B (instruct prompt format) 0-shot | Ahma 7B (instruct prompt format) 0-shot | FinGPT 8B 0-shot | Viking 7B 0-shot | Poro 34B (8bit quant) 0-shot |
190
+ |:---------------------------|:----------------------------------------|:----------------------------------------|:-----------------|:-----------------|:-----------------------------|
191
+ | Analogies | 50.77 | TBA | 49.23 | 40.00 | 54.62 |
192
+ | Arithmetic | 27.64 | TBA | 33.15 | 30.16 | 30.34 |
193
+ | Cause and Effect | 59.48 | TBA | 66.01 | 58.82 | 62.74 |
194
+ | Emotions | 36.25 | TBA | 22.50 | 26.25 | 35.63 |
195
+ | Empirical Judgements | 33.33 | TBA | 27.27 | 33.33 | 49.49 |
196
+ | General Knowledge | 44.29 | TBA | 40.00 | 24.29 | 51.43 |
197
+ | HHH Alignment | 42.09 | TBA | 41.81 | 42.51 | 42.92 |
198
+ | Intent Recognition | 24.42 | TBA | 17.49 | 22.40 | 68.35 |
199
+ | Misconceptions | 46.27 | TBA | 53.73 | 53.73 | 52.24 |
200
+ | Paraphrase | 59.50 | TBA | 51.00 | 50.00 | 51.00 |
201
+ | Sentence Ambiguity | 53.33 | TBA | 51.67 | 48.33 | 50.00 |
202
+ | Similarities Abstraction | 65.79 | TBA | 60.53 | 65.79 | 60.53 |
203
+ | **Non-Arithmetic Average** | **47.55** | TBA | **46.17** | **44.42** | **52.08** |
204
+ | **Overall Average** | **36.49** | TBA | **38.93** | **36.50** | **40.00** |
205
+
206
+
207
+ | Benchmark | Ahma 3B (instruct prompt format) 3-shot | Ahma 7B (instruct prompt format) 3-shot | FinGPT 8B 3-shot | Viking 7B 3-shot | Poro 34B (8bit quant) 3-shot |
208
+ |:---------------------------|:----------------------------------------|:----------------------------------------|:-----------------|:-----------------|:-----------------------------|
209
+ | Analogies | 52.31 | TBA | 40.77 | 54.62 | 76.92 |
210
+ | Arithmetic | 44.59 | TBA | 43.63 | 45.78 | 53.68 |
211
+ | Cause and Effect | 61.44 | TBA | 64.05 | 58.17 | 67.32 |
212
+ | Emotions | 14.37 | TBA | 44.37 | 48.13 | 56.87 |
213
+ | Empirical Judgements | 38.38 | TBA | 32.32 | 43.43 | 63.64 |
214
+ | General Knowledge | 38.57 | TBA | 54.29 | 28.57 | 74.29 |
215
+ | HHH Alignment | 42.94 | TBA | 45.39 | 44.80 | 46.07 |
216
+ | Intent Recognition | 24.28 | TBA | 51.45 | 58.82 | 83.67 |
217
+ | Misconceptions | 46.27 | TBA | 52.99 | 46.27 | 52.99 |
218
+ | Paraphrase | 58.50 | TBA | 53.00 | 54.50 | 55.00 |
219
+ | Sentence Ambiguity | 53.33 | TBA | 51.67 | 53.33 | 66.67 |
220
+ | Similarities Abstraction | 72.37 | TBA | 64.47 | 73.68 | 75.00 |
221
+ | **Non-Arithmetic Average** | **47.15** | TBA | **51.19** | **50.94** | **61.96** |
222
+ | **Overall Average** | **45.73** | TBA | **46.99** | **48.07** | **57.36** |
223
+
224
+
225
+ As we can see, the Ahma 3B model outperforms 2X larger models like FinGPT 8B and Viking 7B in non-arithmetic tasks in 0-shot usage (non-arithmetic average of 47.55 vs. 46.17 and 44.42), although FinGPT 8B has a higher overall average (38.93 vs. 36.49). Even the 10X larger Poro 34B model, which is generally better, doesn't show a huge performance difference considering its size, and Ahma 3B actually surpasses it in some tasks. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruction-following examples during the pretraining phase.
226
+
227
+ In a 3-shot setting, the results are more mixed. The poorer performance of Ahma 3B in 3-shot settings might be due to the use of the instruct prompt format and having only single-turn instruction-following training examples.
228
+
229
+
230
+ ### MTBench Finnish
231
+
232
+ This Ahma model was also evaluated using [MTBench Finnish by LumiOpen](https://github.com/LumiOpen/FastChat/tree/main/fastchat/llm_judge) even though this Ahma model is not fine-tuned for chat. Since MTBench also evaluates multi-turn chats while Ahma models were only pretrained with single-turn instruction-following examples, we have reported MTBench Finnish results separately for the single-turn and multi-turn evaluation examples. The [Poro 34B Chat](https://huggingface.co/LumiOpen/Poro-34B-chat) model's results are copied from its model card for comparison.
233
+
234
+ | Benchmark | Ahma 3B (instruct prompt format) single-turn | Ahma 3B (instruct prompt format) multi-turn | Ahma 7B (instruct prompt format) single-turn | Ahma 7B (instruct prompt format) multi-turn | Poro 34B Chat multi-turn |
235
+ |:--------------------|:---------------------------------------------|:--------------------------------------------|:---------------------------------------------|:--------------------------------------------|:-------------------------|
236
+ | Coding | 1.00 | 1.00 | TBA | TBA | 3.05 |
237
+ | Extraction | 2.00 | 1.55 | TBA | TBA | 6.05 |
238
+ | Humanities | 4.05 | 3.25 | TBA | TBA | 9.6 |
239
+ | Math | 3.00 | 2.20 | TBA | TBA | 1.25 |
240
+ | Reasoning | 2.90 | 2.45 | TBA | TBA | 3.65 |
241
+ | Roleplay | 4.80 | 4.90 | TBA | TBA | 7.0 |
242
+ | STEM | 5.10 | 4.20 | TBA | TBA | 7.65 |
243
+ | Writing | 6.60 | 3.80 | TBA | TBA | 7.6 |
244
+ | **Overall Average** | **3.68** | **2.92** | TBA | TBA | **5.73** |
245
+
246
+ As we can see, the Ahma 3B model struggles with multi-turn examples, as expected, since it has only been pretrained with single-turn instruction-following examples. In addition, coding performance was expectedly poor because the Ahma 3B model was not trained with code data. In some evaluation examples, Ahma 3B also started to constantly repeat the generated text, which affected the scoring. Adding a repetition penalty setting to the generation method of the evaluation script already improved the scores significantly, so in real-world use the Ahma 3B model should be used with better generation settings than those used in this benchmark.
247
+
248
+ ## Acknowledgements
249
+
250
+ This project would not have been possible without compute generously provided by Google through the
251
+ [TPU Research Cloud](https://sites.research.google/trc/).
252
+
253
+ ## Team Members
254
+
255
+ - Aapo Tanskanen, [Hugging Face profile](https://huggingface.co/aapot), [LinkedIn profile](https://www.linkedin.com/in/aapotanskanen/)
256
+ - Rasmus Toivanen, [Hugging Face profile](https://huggingface.co/RASMUS), [LinkedIn profile](https://www.linkedin.com/in/rasmustoivanen/)
257
+
258
+ Feel free to contact us for more details 🤗
259
+
260
+
261
+ ![Ahma](ahma.jpg)
config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "architectures": [
3
+ "LlamaForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 1,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "silu",
10
+ "hidden_size": 3200,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 8640,
13
+ "max_position_embeddings": 2048,
14
+ "mlp_bias": false,
15
+ "model_type": "llama",
16
+ "num_attention_heads": 32,
17
+ "num_hidden_layers": 26,
18
+ "num_key_value_heads": 32,
19
+ "pretraining_tp": 1,
20
+ "rms_norm_eps": 1e-06,
21
+ "rope_scaling": null,
22
+ "rope_theta": 10000.0,
23
+ "tie_word_embeddings": false,
24
+ "torch_dtype": "float16",
25
+ "transformers_version": "4.42.0.dev0",
26
+ "use_cache": true,
27
+ "vocab_size": 64256
28
+ }
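As a sanity check on the config above, the parameter count of a Llama-style model can be estimated from `hidden_size`, `intermediate_size`, `num_hidden_layers` and `vocab_size`. The helper below is a sketch under the settings shown in this config (no attention or MLP biases, `num_key_value_heads` equal to `num_attention_heads`, untied embeddings); the function name is hypothetical. The result is consistent with the "3.6B" figure in the README table, and at 2 bytes per float16 weight it equals the `total_size` of 7265824000 listed in model.safetensors.index.json in this commit, which presumably describes the unquantized weights.

```python
def llama_param_count(hidden_size, intermediate_size, num_layers, vocab_size,
                      tie_word_embeddings=False):
    """Estimate parameters of a bias-free Llama-style decoder with full multi-head attention."""
    attention = 4 * hidden_size * hidden_size  # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * hidden_size * intermediate_size  # gate_proj, up_proj, down_proj
    norms = 2 * hidden_size                    # input and post-attention RMSNorm weights
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return num_layers * per_layer + embeddings + lm_head + final_norm


params = llama_param_count(
    hidden_size=3200, intermediate_size=8640, num_layers=26, vocab_size=64256
)
print(params)      # 3632912000
print(params * 2)  # 7265824000 bytes at 2 bytes per float16 weight
```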
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.42.0.dev0"
6
+ }
huggingface-metadata.txt ADDED
@@ -0,0 +1,7 @@
1
+ url: https://huggingface.co/Finnish-NLP/Ahma-3B
2
+ branch: main
3
+ download date: 2024-07-02 01:04:58
4
+ sha256sum:
5
+ ebea50b3d4ba830085a26f5a608edeaf36f78c3bc5b5c81c83ebec436576116c model-00001-of-00002.safetensors
6
+ 0296bfe931223bd0ecb717a60bb9aea6ac40ab6c7e960f0a897eb9e8e2550c8b model-00002-of-00002.safetensors
7
+ 1980c00aa3cb5455177a39efa3e60e7b8887ee89c3f7b8950719592a08ad9456 tokenizer.model
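The checksums recorded above can be compared against locally downloaded files with `hashlib`. This is a minimal sketch: both helper names are hypothetical, and it assumes each checksum line has the form "digest filename". Note that the listed safetensors shards belong to the original Finnish-NLP/Ahma-3B download, so they are not expected to match the quantized model.safetensors added in this commit.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse lines of the form '<64-char hex digest> <filename>' into {filename: digest}."""
    checksums = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and len(parts[0]) == 64:
            checksums[parts[1]] = parts[0]
    return checksums
```

A file passes the check when `sha256_of_file(path)` equals the digest stored for its name in the parsed mapping.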
measurement.json ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9af1548bca5413f2f3cb59659c5df6e37f9c988a99f80c7098074d607ad3ad6f
3
+ size 3222432534
model.safetensors.index.json ADDED
@@ -0,0 +1,244 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 7265824000
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00002-of-00002.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
14
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
15
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
16
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
17
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
18
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
19
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
20
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
21
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
22
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
23
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
24
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
25
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
26
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
27
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
28
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
29
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
30
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
31
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
32
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
33
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
34
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
35
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
36
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
37
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
38
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
39
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
40
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
41
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
42
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
43
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
44
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
49
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
50
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
52
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
53
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
54
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
55
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
61
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
62
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
63
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
64
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
65
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
66
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
67
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
71
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
72
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
73
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
74
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
75
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
76
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
77
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
78
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
79
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
80
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.norm.weight": "model-00002-of-00002.safetensors"
+ }
+ }
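The `weight_map` above assigns every tensor name to one of the two shard files, and a single layer can straddle both: layer 18 keeps its attention projections in the first shard and its MLP and layernorm weights in the second. A minimal sketch of looking that up, assuming the index has already been parsed into a dict; the helper name `shards_for_layer` and the three-entry excerpt are illustrative and not part of the repository:

```python
from collections import defaultdict

# Excerpt of the weight_map from the diff above (three of the layer-18 entries).
weight_map = {
    "model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
}


def shards_for_layer(weight_map, layer):
    """Group the tensor names of one decoder layer by the shard file storing them."""
    # The trailing dot keeps layer 1 from matching layers 10-19.
    prefix = f"model.layers.{layer}."
    grouped = defaultdict(list)
    for name, shard in weight_map.items():
        if name.startswith(prefix):
            grouped[shard].append(name)
    return dict(grouped)


layer_18 = shards_for_layer(weight_map, 18)
```

With the full index loaded via `json.load`, the same helper would show which shard files have to be opened to materialise any given layer.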
setup_steps.txt ADDED
@@ -0,0 +1,10 @@
+ Set up a Runpod account and add a few dollars.
+ Set up an RTX 4090 with
+ Disk 50GB/50GB
+ Select image:
+ runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
+
+ pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton --index-url https://download.pytorch.org/whl/cu121
+ pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"
+
+ --> Run the notebook
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+ "bos_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1980c00aa3cb5455177a39efa3e60e7b8887ee89c3f7b8950719592a08ad9456
+ size 1400411
tokenizer_config.json ADDED
@@ -0,0 +1,75 @@
+ {
+ "add_bos_token": true,
+ "add_eos_token": false,
+ "add_prefix_space": true,
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "[INST]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "4": {
+ "content": "[/INST]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "5": {
+ "content": "<<SYS>>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "6": {
+ "content": "<</SYS>>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<s>",
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = 'Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti. Vastauksesi eivät saa sisältää mitään haitallista, epäeettistä, rasistista, seksististä, vaarallista tai laitonta sisältöä. Jos kysymyksessä ei ole mitään järkeä tai se ei ole asiasisällöltään johdonmukainen, selitä miksi sen sijaan, että vastaisit jotain väärin. Jos et tiedä vastausta kysymykseen, älä kerro väärää tietoa.' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + eos_token }}{% endif %}{% endfor %}",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "</s>",
+ "legacy": false,
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": null,
+ "sp_model_kwargs": {},
+ "spaces_between_special_tokens": false,
+ "tokenizer_class": "PreTrainedTokenizerFast",
+ "unk_token": "<unk>",
+ "use_default_system_prompt": false
+ }
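The `chat_template` in the config above follows the Llama-2 prompt layout: the system prompt is wrapped in `<<SYS>>` markers inside the first user turn, each user turn is framed by `[INST]` and `[/INST]`, and each assistant turn ends with the EOS token. The sketch below is a hand-written Python mirror of that Jinja logic, meant only to make the resulting string visible. The function name `render_chat` and the shortened system prompt are assumptions for illustration; in practice the template would normally be rendered by the tokenizer library rather than by code like this.

```python
BOS, EOS = "<s>", "</s>"  # bos_token / eos_token from the config above


def render_chat(messages, default_system):
    """Hand-written mirror of the Jinja chat_template (illustrative, not a library API)."""
    if messages and messages[0]["role"] == "system":
        system, turns = messages[0]["content"], messages[1:]
    else:
        system, turns = default_system, messages
    parts = []
    for i, msg in enumerate(turns):
        # Turns must alternate user, assistant, user, ... starting with a user turn.
        if (msg["role"] == "user") != (i % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        content = msg["content"]
        if i == 0:
            # The system prompt is folded into the first user turn.
            content = "<<SYS>>\n" + system + "\n<</SYS>>\n\n" + content
        if msg["role"] == "user":
            parts.append(BOS + " [INST] " + content.strip() + " [/INST]")
        elif msg["role"] == "assistant":
            parts.append(" " + content.strip() + EOS)
    return "".join(parts)


prompt = render_chat(
    [
        {"role": "user", "content": "Hei!"},
        {"role": "assistant", "content": "Hei, miten voin auttaa?"},
        {"role": "user", "content": "Kerro ahmasta."},
    ],
    default_system="Olet tekoälyavustaja.",  # shortened stand-in for the long default prompt
)
```

The rendered prompt ends with `[/INST]`, so generation continues as the assistant's reply to the last user message.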
train_sentencepiece.py ADDED
@@ -0,0 +1,10 @@
+ import sentencepiece as spm
+
+ spm.SentencePieceTrainer.train(input="/researchdisk/training_dataset_sentences/train.txt", model_prefix="tokenizer",
+     model_type="bpe", split_digits=True, vocab_size=64256, byte_fallback=True,  # BPE; digits split individually; unknown characters fall back to bytes
+     normalization_rule_name="nfkc",
+     user_defined_symbols=["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"],  # chat markers always kept as single tokens
+     required_chars="abcdefghijklmnopqrstuvwxyzåäöABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ",  # Finnish alphabet always included in the vocabulary
+     train_extremely_large_corpus=True,
+     input_sentence_size=500000000, shuffle_input_sentence=True,  # sample up to 500M shuffled sentences
+     num_threads=96)