mmosiolek committed · Commit d2eeee4 · 1 Parent(s): 685ce48

Update README.md

Files changed (1): README.md (+70 -1)

README.md CHANGED

---
license: apache-2.0
datasets:
- mmosiolek/pl_alpaca_data_cleaned
language:
- pl
tags:
- alpaca
- llama
- self-instruct
- causal language model
- llm
- gpt
- chat-gpt
---
# Polpaca: The Alpaca Speaks Polish

Dataset for the project: https://huggingface.co/datasets/mmosiolek/pl_alpaca_data_cleaned

[LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) is a state-of-the-art, foundational, open-source large language model designed to help engineers and researchers advance their work in NLP.
For example, Stanford researchers fine-tuned LLaMA to build an alternative to the famous ChatGPT: a model called [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html).
Unfortunately, LLaMA was trained on a dataset consisting mainly of English text, with only 4.5% of the data covering other languages.
In addition, the Alpaca instruction-tuning dataset contains only English instructions, so Alpaca simply doesn't work for other languages.

This repo makes [Alpaca-Lora-7B](https://huggingface.co/tloen/alpaca-lora-7b) speak Polish.

### Usage

```python
from transformers import LlamaTokenizer, LlamaForCausalLM
from peft import PeftModel
import bitsandbytes as bnb  # optional; only needed if the base model is loaded in 8-bit

base = "decapoda-research/llama-7b-hf"
finetuned = "mmosiolek/polpaca-lora-7b"

# Load the LLaMA tokenizer and configure padding for decoder-only generation.
tokenizer = LlamaTokenizer.from_pretrained(base)
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"

# Load the base model and attach the Polish LoRA adapter on top of it.
model = LlamaForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, finetuned).to("cuda")
```
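
The snippet above loads the base weights in full precision and moves everything onto a single GPU. Since it already imports bitsandbytes, here is a minimal, hedged sketch of the optional 8-bit loading path; `load_in_8bit` and `device_map` are standard `from_pretrained` arguments (requiring bitsandbytes and accelerate), not something this card prescribes or was necessarily tested with:

```python
# Optional sketch: load the base model in 8-bit to reduce GPU memory.
# Requires the bitsandbytes and accelerate packages.
model = LlamaForCausalLM.from_pretrained(
    base,
    load_in_8bit=True,   # quantize the linear layers with bitsandbytes
    device_map="auto",   # let accelerate place the layers on the available GPU(s)
)
model = PeftModel.from_pretrained(model, finetuned)  # no extra .to("cuda") needed here
```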

For output generation, use the following code:

```python
import torch
from transformers import GenerationConfig

GENERATION_CONFIG = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=128,
)

def run(instruction, model, tokenizer):
    # Tokenize the instruction and move the tensors to the GPU.
    encodings = tokenizer(instruction, padding=True, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **encodings,
        generation_config=GENERATION_CONFIG,
    )
    decoded = tokenizer.batch_decode(generated_ids)
    # Free the GPU memory held by the intermediate tensors.
    del encodings, generated_ids
    torch.cuda.empty_cache()
    # The generated answer follows the final newline of the decoded sequence.
    return decoded[0].split("\n")[-1]
```
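
A quick, hypothetical illustration of calling the helper (the Polish instruction below is only an example; no model output is reproduced here):

```python
# Hypothetical example call; the instruction text is illustrative only.
instruction = "Napisz krótki wiersz o wiośnie."  # "Write a short poem about spring."
print(run(instruction, model, tokenizer))
```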

### Example input/output