Update README.md
README.md CHANGED
@@ -4,11 +4,10 @@ license: apache-2.0
 
 # LimaRP-Mistral-7B-v0.1 (Alpaca, 8-bit LoRA adapter)
 
-This is […]
-about 1800 training samples _up to_ 4k tokens length.
-[…]
-[…]
-[…] seem to have been filtered for content type.
+This is a version of LimaRP for [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) with
+about 1800 training samples _up to_ 4k tokens length. A 2-pass training procedure has been employed. The first pass includes
+finetuning on about 6800 stories within 4k tokens length and the second pass is LimaRP with changes introducing more effective
+control on response length.
 
 **Due to software limitations, finetuning didn't take advantage yet of the Sliding Window Attention (SWA) which would have allowed
 to use longer conversations in the training data. Thus, this version of LimaRP should be considered an _initial finetuning attempt_ and
@@ -100,7 +99,7 @@ generation settings may be:
 
 ## Training procedure
 [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) was used for training
-on a […]
+on a 4x NVidia A40 GPU cluster.
 
 The A40 GPU cluster has been graciously provided by [Arc Compute](https://www.arccompute.io/).
 
@@ -111,19 +110,25 @@ training process closer to a full finetune. It's suggested to merge the adapter
 the base Mistral-7B-v0.1 model.
 
 ### Training hyperparameters
-- learning_rate: 0.[…]
+- learning_rate: 0.0001
 - lr_scheduler_type: cosine
-- num_epochs: 2
+- num_epochs: 2 (1 for the first pass)
+- sequence_len: 4096
 - lora_r: 256
 - lora_alpha: 16
 - lora_dropout: 0.05
 - lora_target_linear: True
 - bf16: True
+- fp16: false
 - tf32: True
 - load_in_8bit: True
 - adapter: lora
-- micro_batch_size: […]
-- gradient_accumulation_steps: […]
+- micro_batch_size: 2
+- gradient_accumulation_steps: 1
+- warmup_steps: 40
 - optimizer: adamw_torch
 
-[…]
+For the second pass, the `lora_model_dir` option was used to continue finetuning on the LoRA
+adapter obtained from the first pass.
+
+Using 2 GPUs, the effective global batch size would have been 8.
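The hyperparameters listed in this hunk are Axolotl options. As a rough, hypothetical illustration of what the LoRA-specific settings amount to (not the actual training script, and assuming `lora_target_linear: True` expands to all of Mistral's linear projection layers), an equivalent PEFT setup might look like this:

```python
# Hypothetical sketch only: the LoRA hyperparameters from the hunk above expressed as a
# PEFT configuration. Assumes "lora_target_linear: True" means all of Mistral's linear
# projection layers; this is not the Axolotl training script used for the model.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # load_in_8bit: True
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # standard prep for k-bit LoRA training

lora_config = LoraConfig(
    r=256,              # lora_r: 256
    lora_alpha=16,      # lora_alpha: 16
    lora_dropout=0.05,  # lora_dropout: 0.05
    target_modules=[    # assumed expansion of lora_target_linear: True
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

For the second pass described above, Axolotl's `lora_model_dir` option simply points training at the adapter saved from the first pass, so the same LoRA weights continue updating rather than starting from scratch.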
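The context around the last hunk notes the card's suggestion to merge the adapter with the base Mistral-7B-v0.1 model. A minimal merge sketch using the PEFT library follows; the adapter path and output directory are placeholders, not official repository names.

```python
# Minimal sketch of merging this LoRA adapter into the base model with PEFT.
# The adapter path and output directory below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "path/to/limarp-mistral-7b-lora")  # placeholder adapter path

merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("LimaRP-Mistral-7B-merged")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.save_pretrained("LimaRP-Mistral-7B-merged")
```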