lemonilia committed on
Commit caa42e7
1 Parent(s): 7a839c7

Update README.md

Files changed (1)
  1. README.md +23 -31
README.md CHANGED
@@ -2,10 +2,10 @@
  license: apache-2.0
  ---
 
- # LimaRP-Mistral-7B-v0.1 (Alpaca, 8-bit LoRA adapter)
+ # LimaRP-Mistral-7B-v0.1 (Alpaca)
 
  This is a version of LimaRP for [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) with
- about 1900 training samples _up to_ 9k tokens length
+ about 2000 training samples _up to_ 9k tokens in length.
 
  For more details about LimaRP, see the model page for the [previously released v2 version for Llama-2](https://huggingface.co/lemonilia/limarp-llama2-v2).
  Most details written there apply for this version as well. Generally speaking, LimaRP is a longform-oriented, novel-style
@@ -13,11 +13,6 @@ roleplaying chat model intended to replicate the experience of 1-on-1 roleplay o
  IRC/Discord-style RP (aka "Markdown format") is not supported yet. The model does not include instruction tuning,
  only manually picked and slightly edited RP conversations with persona and scenario data.
 
- ## Known issues
- - Despite performing a few finetuning attempts, including one that followed almost the same procedure as in previous releases,
- Mistral-7B-v0.1 appears to have strange repetition issues.
- - Even though benchmarks tell a different story, in practice the model doesn't feel smarter during roleplay than Llama-2-13B.
-
  ## Prompt format
  Same as before. It uses the [extended Alpaca format](https://github.com/tatsu-lab/stanford_alpaca),
  with `### Input:` immediately preceding user inputs and `### Response:` immediately preceding
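
For reference, a prompt following the extended Alpaca layout described in the hunk above would look roughly like this. The persona/scenario field names and the exact rendering of the length modifier are illustrative assumptions, not text taken from this commit:

```
### Instruction:
Character's Persona: {description of the character}
User's Persona: {description of the user}
Scenario: {outline of the situation}
Play the role of Character, interacting with User in a roleplaying chat.

### Input:
User: {user message}

### Response: (length = medium)
Character: {model response}
```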
@@ -83,39 +78,33 @@ the modifier, the model will choose the most appropriate response length** (alth
  not necessarily be what the user desires).
 
  ## Suggested settings
- You can follow these instruction format settings in SillyTavern. Replace `tiny` with
+ You can follow these instruction format settings in SillyTavern. Replace `medium` with
  your desired response length:
 
- ![settings](https://files.catbox.moe/6lcz0u.png)
+ ![settings](https://files.catbox.moe/24c1w0.png)
 
  ## Text generation settings
- Mistral-7B-v0.1 appears to have repetition issues. A low temperature combined with a relatively high
- repetition penalty and low penalty range (about as long as the prior 2 messages) appears to help:
+ These settings could be a good general starting point:
 
- - TFS = 0.90~0.95
- - Temperature = 0.50~0.55
- - Repetition penalty = ~1.15
- - Repetition penalty range = ~512
+ - TFS = 0.92
+ - Temperature = 0.70
+ - Repetition penalty = ~1.1
+ - Repetition penalty range = ~2048
  - top-k = 0 (disabled)
  - top-p = 1 (disabled)
 
  ## Training procedure
  [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) was used for training
- on 2x NVidia A40 GPUs.
+ on 4x NVidia A40 GPUs.
 
  The A40 GPUs have been graciously provided by [Arc Compute](https://www.arccompute.io/).
 
- The model has been trained as an 8-bit LoRA adapter, and
- it's so large because a LoRA rank of 256 was also used. The reasoning was that this
- might have helped the model internalize any newly acquired information, making the
- training process closer to a full finetune. It's suggested to merge the adapter to
- the base Mistral-7B-v0.1 model.
-
  ### Training hyperparameters
- - learning_rate: 0.0005
- - lr_scheduler_type: cosine
+ - learning_rate: 0.0003
+ - lr_scheduler: constant_with_warmup
+ - noisy_embedding_alpha: 5
  - num_epochs: 2
- - sequence_len: 9000
+ - sequence_len: 8750
  - lora_r: 256
  - lora_alpha: 16
  - lora_dropout: 0.05
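
For convenience, here are the updated text generation settings from the hunk above collected into a Python dict. The key names are generic labels, not the exact parameter names of any particular backend; map them onto whatever frontend or API you use:

```python
# Suggested sampler values from the README diff above.
# Key names are generic labels; actual parameter names vary by backend.
SUGGESTED_SETTINGS = {
    "tfs": 0.92,                       # tail-free sampling
    "temperature": 0.70,
    "repetition_penalty": 1.1,         # approximate
    "repetition_penalty_range": 2048,  # approximate, in tokens
    "top_k": 0,                        # 0 = disabled
    "top_p": 1.0,                      # 1 = disabled
}
```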
@@ -125,12 +114,15 @@ the base Mistral-7B-v0.1 model.
  - tf32: True
  - load_in_8bit: True
  - adapter: lora
- - micro_batch_size: 2
- - gradient_accumulation_steps: 32
- - warmup_steps: 2
+ - micro_batch_size: 1
+ - gradient_accumulation_steps: 1
+ - warmup_steps: 10
  - optimizer: adamw_torch
+ - flash_attention: true
+ - sample_packing: true
+ - pad_to_sequence_len: true
 
- For the second pass, the `lora_model_dir` option was used to continue finetuning on the LoRA
- adapter obtained from the first pass.
-
- Using 2 GPUs, the effective global batch size would have been 128.
+ Using 4 GPUs, the effective global batch size would have been 4.
+
+ ### Training loss graph
+ ![Train loss](https://files.catbox.moe/0pj84w.png)
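
A quick sanity check on the two effective-global-batch-size figures in this diff; the usual formula is micro_batch_size × gradient_accumulation_steps × number of GPUs:

```python
# Effective global batch size = micro_batch_size * gradient_accumulation_steps * n_gpus
assert 2 * 32 * 2 == 128  # removed configuration: 2x A40
assert 1 * 1 * 4 == 4     # new configuration: 4x A40
```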
 
 
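The training-procedure paragraph removed in this commit suggested merging the LoRA adapter into the base Mistral-7B-v0.1 model. A minimal sketch of how such a merge is typically done with 🤗 Transformers and PEFT; the adapter path below is a placeholder, not a path from this commit:

```python
# Minimal sketch: merge a LoRA adapter into the base model with PEFT.
# "path/to/limarp-adapter" is a placeholder for wherever the adapter lives.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base, "path/to/limarp-adapter")
merged = model.merge_and_unload()  # folds the LoRA weights into the base weights
merged.save_pretrained("limarp-mistral-7b-merged")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.save_pretrained("limarp-mistral-7b-merged")
```

After `merge_and_unload()`, the low-rank matrices are baked into the base weights, so the merged checkpoint can be loaded without PEFT.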