---
library_name: transformers
datasets:
- Na0s/sft-ready-Text-Generation-Augmented-Data
language:
- en
base_model:
- mistralai/Mixtral-8x7B-Instruct-v0.1
pipeline_tag: text-generation
---


<a href="https://ibb.co/G5j5XNh"><img src="https://i.ibb.co/2kBkwHb/photo-model.webp" alt="photo-model" border="0"></a>



# Model Card

A LoRA fine-tuned version of [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) in which only the MoE gate/router modules are adapted; all other weights remain frozen.
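
The adapter can be loaded on top of the quantized base model with PEFT. The snippet below is a minimal sketch, not a verified recipe: the adapter repository id is a placeholder for this repo, and the prompt is only an example.

```python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Load the base model in 4-bit, mirroring the training setup below.
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=quantization_config)

# Attach the gate/router LoRA weights on top of the frozen base model.
# "Na0s/<this-adapter-repo>" is a placeholder for this repository's id.
model = PeftModel.from_pretrained(base_model, "Na0s/<this-adapter-repo>")

inputs = tokenizer("[INST] Write a short poem about routers. [/INST]", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```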



#### Training Hyperparameters

- **Training regime:**
  
```python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

# Load the base model in 4-bit to fit the 8x7B MoE in memory.
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    truncation=True,
    padding=True,
    padding_side="right",
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=quantization_config,
)
# Register a dedicated padding token for right-padded batches.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = prepare_model_for_kbit_training(model)

# LoRA is applied only to the MoE gate/router projections.
config = LoraConfig(
    r=4,
    lora_alpha=4,
    target_modules=["gate"],
    lora_dropout=0.1,
)

lora_model = get_peft_model(model, config)

lora_model.print_trainable_parameters()

dataset = load_dataset("Na0s/sft-ready-Text-Generation-Augmented-Data", split="train")

trainer = SFTTrainer(
    model=lora_model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        group_by_length=True,
        warmup_steps=5,
        bf16=True,
        max_steps=5000,
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        eval_strategy="no",
        do_eval=False,
        output_dir="./outputs",
        push_to_hub=True,
        remove_unused_columns=False,
    ),
)
```
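
Training is then launched from this trainer object; a short sketch follows (the save path simply mirrors `output_dir` above):

```python
# Run the 5,000-step LoRA fine-tune; with push_to_hub=True, checkpoints are also uploaded to the Hub.
trainer.train()

# Persist only the adapter weights (the gate/router LoRA deltas), not the full base model.
trainer.save_model("./outputs")
```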






#### Metrics and Results

Upcoming.

## Environmental Impact


Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).



## Technical Specifications 

### Model Architecture and Objective

The objective of fine-tuning this MoE-based transformer is to enable the expert pruning method detailed in the following paper: [A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts](https://arxiv.org/abs/2405.16646).
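
As a rough illustration of the idea (a generic sketch of router-statistics-based pruning, not the exact criterion from the paper), one can accumulate how much routing mass each expert receives from the fine-tuned gate and keep only the most-used experts in each MoE layer:

```python
import torch

def expert_routing_mass(gate_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Sum the routing probability mass each expert receives over a batch of tokens.

    gate_logits: (num_tokens, num_experts) raw outputs of one layer's gate/router.
    """
    probs = torch.softmax(gate_logits, dim=-1)
    top_vals, top_idx = probs.topk(top_k, dim=-1)  # Mixtral routes each token to its top-2 experts
    mass = torch.zeros(gate_logits.shape[-1])
    mass.index_add_(0, top_idx.reshape(-1), top_vals.reshape(-1))
    return mass

# Toy example: random gate logits for 1,000 tokens over Mixtral's 8 experts.
gate_logits = torch.randn(1000, 8)
mass = expert_routing_mass(gate_logits)
kept = sorted(mass.topk(6).indices.tolist())  # e.g. keep the 6 most-used experts, prune the other 2
print("experts kept:", kept)
```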