---
library_name: transformers
tags:
- mergekit
- block expansion
- progressive mistral
- arcee cpt
---

# Mistral-7B-Instruct-v0.2-expanded

This model was created with mergekit's passthrough method, which expands the blocks of the "mistralai/Mistral-7B-Instruct-v0.2" model. After every four original layers,
a copy of the last of those layers is inserted, so every 5th layer of the expanded model is new and the layer count grows from 32 to 40. The `o_proj` and `down_proj` parameters of the added layers are initialized to zero, mirroring the approach used in LLaMA Pro.
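
To illustrate why zero-initializing `o_proj` and `down_proj` leaves the network's output unchanged before fine-tuning, here is a minimal toy sketch in plain Python. It is not the Mistral implementation: `toy_block`, `matvec`, and the stand-in "attention" and "MLP" steps are hypothetical and only mimic the residual structure of a decoder block.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def toy_block(x, o_proj, down_proj):
    """Toy residual block: h = x + o_proj(mix(x)), out = h + down_proj(act(h)).

    The 'mix' and 'act' steps are arbitrary stand-ins for attention and the
    MLP interior (an assumption for illustration); only the residual
    structure matters here.
    """
    mixed = [2.0 * v + 1.0 for v in x]                 # stand-in for attention
    h = [a + b for a, b in zip(x, matvec(o_proj, mixed))]
    hidden = [max(0.0, v) * 3.0 for v in h]            # stand-in for the MLP interior
    return [a + b for a, b in zip(h, matvec(down_proj, hidden))]


dim = 4
zeros = [[0.0] * dim for _ in range(dim)]
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

x = [0.5, -1.0, 2.0, 3.5]
out_zero_init = toy_block(x, zeros, zeros)        # projections zeroed, as in the added layers
out_copied = toy_block(x, identity, identity)     # non-zero projections, for contrast

print(out_zero_init == x)  # True: the block passes its input through unchanged
print(out_copied == x)     # False: non-zero projections modify the input
```

With both output projections at zero, each residual branch adds nothing, so the added block starts out as an identity mapping and only begins to contribute once it is trained.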

### Note: this configuration has not been fine-tuned, so it is not meant to be used as-is. When fine-tuning, make sure that only every 5th layer is trainable, while all other layers remain frozen.

## 🧩 Configuration

```yaml
slices:
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [0, 4]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [3, 4]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
            
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [4, 8]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [7, 8]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
            
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [8, 12]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [11, 12]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
            
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [12, 16]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [15, 16]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
            
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [16, 20]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [19, 20]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
            
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [20, 24]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [23, 24]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
            
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [24, 28]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [27, 28]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
            
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [28, 32]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [31, 32]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

merge_method: passthrough
dtype: bfloat16
```
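
The configuration above repeats the same pair of slices eight times, so it can also be generated programmatically. The following is a minimal sketch, assuming the same group size of four layers; `build_expansion_config` and its parameters are hypothetical helper names, not part of mergekit, and the result is printed as JSON only to avoid an extra dependency.

```python
import json

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"


def build_expansion_config(model=BASE_MODEL, num_layers=32, group_size=4):
    """Build the passthrough config as a plain dict (hypothetical helper)."""
    slices = []
    for start in range(0, num_layers, group_size):
        end = start + group_size
        # Original layers [start, end), copied unchanged.
        slices.append({"sources": [{"model": model, "layer_range": [start, end]}]})
        # Duplicate of the last layer in the group, with o_proj and down_proj zeroed.
        slices.append({
            "sources": [{
                "model": model,
                "layer_range": [end - 1, end],
                "parameters": {
                    "scale": [
                        {"filter": "o_proj", "value": 0.0},
                        {"filter": "down_proj", "value": 0.0},
                        {"value": 1.0},
                    ]
                },
            }]
        })
    return {"slices": slices, "merge_method": "passthrough", "dtype": "bfloat16"}


config = build_expansion_config()
total_layers = sum(
    src["layer_range"][1] - src["layer_range"][0]
    for s in config["slices"]
    for src in s["sources"]
)
print(len(config["slices"]), total_layers)  # 16 slices, 40 layers in total
print(json.dumps(config["slices"][1], indent=2))
```

Dumping the dict with a YAML library (for example PyYAML's `yaml.safe_dump`) should give a file of the same shape as the hand-written configuration.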

## Function to freeze layers

```python
from transformers import AutoModelForCausalLM

def enable_grad_only_every_nth(model, n):
    """
    Enable gradients only for every nth decoder layer, counting from 1
    (0-based indices n-1, 2n-1, ...), so that only the newly added blocks are trained.
    The embeddings, the lm_head, and all remaining decoder layers are frozen, which
    preserves their pre-trained behavior while the new layers are adapted during
    fine-tuning. The final norm is not modified by this function.
    """

    # Freeze embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze lm_head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Enable gradients for every nth layer
    layers = model.model.layers  # Access the ModuleList containing the layers

    for index, layer in enumerate(layers):

        if (index + 1) % n == 0:  # Every nth layer counting from 1, i.e. 0-based indices n-1, 2n-1, ...
            for param in layer.parameters():
                param.requires_grad = True
        else:
            for param in layer.parameters():
                param.requires_grad = False

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Mistral-7B-Instruct-v0.2-expanded"
)
# Update layer gradients; n = 5 matches this model, where every 5th layer is newly added
n = 5
enable_grad_only_every_nth(model, n)
```
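
As a quick sanity check that does not require loading the model, the sketch below compares the layers left trainable by the `(index + 1) % n == 0` rule with the positions where the merge configuration inserts new layers. Both helper functions are hypothetical and assume 32 original layers expanded in groups of four.

```python
def inserted_layer_indices(num_original_layers=32, group_size=4):
    """0-based indices of the added layers in the expanded model."""
    indices = []
    position = 0
    for _ in range(0, num_original_layers, group_size):
        position += group_size    # skip the original layers of this group
        indices.append(position)  # the duplicated layer comes right after them
        position += 1
    return indices


def trainable_layer_indices(num_expanded_layers=40, n=5):
    """0-based indices left trainable by the rule `(index + 1) % n == 0`."""
    return [i for i in range(num_expanded_layers) if (i + 1) % n == 0]


added = inserted_layer_indices()
trainable = trainable_layer_indices()
print(added)               # [4, 9, 14, 19, 24, 29, 34, 39]
print(added == trainable)  # True: only the newly added layers are trained
```

If a different group size is used for the expansion, `n` should be the group size plus one so that the two lists still match.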