Update README.md
README.md CHANGED
@@ -4,11 +4,10 @@ license: apache-2.0
 
 # LimaRP-Mistral-7B-v0.1 (Alpaca, 8-bit LoRA adapter)
 
-This is […]
-about 1800 training samples _up to_ 4k tokens length.
-[…]
-[…]
-[…] seem to have been filtered for content type.
+This is a version of LimaRP for [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) with
+about 1800 training samples _up to_ 4k tokens length. A 2-pass training procedure has been employed. The first pass includes
+finetuning on about 6800 stories within 4k tokens length and the second pass is LimaRP with changes introducing more effective
+control on response length.
 
 **Due to software limitations, finetuning didn't take advantage yet of the Sliding Window Attention (SWA) which would have allowed
 to use longer conversations in the training data. Thus, this version of LimaRP should be considered an _initial finetuning attempt_ and
@@ -100,7 +99,7 @@ generation settings may be:
 
 ## Training procedure
 [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) was used for training
-on a […]
+on a 4x NVidia A40 GPU cluster.
 
 The A40 GPU cluster has been graciously provided by [Arc Compute](https://www.arccompute.io/).
 
@@ -111,19 +110,25 @@ training process closer to a full finetune. It's suggested to merge the adapter
 the base Mistral-7B-v0.1 model.
 
 ### Training hyperparameters
-- learning_rate: 0.[…]
+- learning_rate: 0.0001
 - lr_scheduler_type: cosine
-- num_epochs: 2
+- num_epochs: 2 (1 for the first pass)
+- sequence_len: 4096
 - lora_r: 256
 - lora_alpha: 16
 - lora_dropout: 0.05
 - lora_target_linear: True
 - bf16: True
+- fp16: false
 - tf32: True
 - load_in_8bit: True
 - adapter: lora
-- micro_batch_size: […]
-- gradient_accumulation_steps: […]
+- micro_batch_size: 2
+- gradient_accumulation_steps: 1
+- warmup_steps: 40
 - optimizer: adamw_torch
 
-[…]
+For the second pass, the `lora_model_dir` option was used to continue finetuning on the LoRA
+adapter obtained from the first pass.
+
+Using 2 GPUs, the effective global batch size would have been 8.
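The hyperparameters listed in this hunk are Axolotl options. As a rough, hypothetical illustration of what the LoRA-specific settings amount to (not the actual training script, and assuming `lora_target_linear: True` expands to all of Mistral's linear projection layers), an equivalent PEFT setup might look like this:

```python
# Hypothetical sketch only: the LoRA hyperparameters from the hunk above expressed as a
# PEFT configuration. Assumes "lora_target_linear: True" means all of Mistral's linear
# projection layers; this is not the Axolotl training script used for the model.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # load_in_8bit: True
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # standard prep for k-bit LoRA training

lora_config = LoraConfig(
    r=256,              # lora_r: 256
    lora_alpha=16,      # lora_alpha: 16
    lora_dropout=0.05,  # lora_dropout: 0.05
    target_modules=[    # assumed expansion of lora_target_linear: True
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

For the second pass described above, Axolotl's `lora_model_dir` option simply points training at the adapter saved from the first pass, so the same LoRA weights continue updating rather than starting from scratch.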
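The context around the last hunk notes the card's suggestion to merge the adapter with the base Mistral-7B-v0.1 model. A minimal merge sketch using the PEFT library follows; the adapter path and output directory are placeholders, not official repository names.

```python
# Minimal sketch of merging this LoRA adapter into the base model with PEFT.
# The adapter path and output directory below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "path/to/limarp-mistral-7b-lora")  # placeholder adapter path

merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("LimaRP-Mistral-7B-merged")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.save_pretrained("LimaRP-Mistral-7B-merged")
```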