ChrisMcCormick committed
Commit c02b2f3 (verified)
1 Parent(s): 1845c59

Adding patching code

Files changed (1)
  1. README.md +90 -38
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: apache-2.0
  language:
  - en
  datasets:
@@ -14,9 +14,9 @@ tags:
  - output-subspace
  ---

- # Deepseek Tiny Mla O V0.1

- 6-layer DeepSeek-V3 with MLA + shared output latent space trained for research on shared subspaces in Transformer attention mechanisms.

  ## Model Description

@@ -43,62 +43,114 @@ This model is part of the [shared-subspaces](https://github.com/chrisjmccormick/
  ### Output Subspace Decomposition
  This model implements a shared output latent space where the attention output projection W^O is decomposed into:
  ```
- W^O = W^OB · W^OA
  ```
  Where W^OA are per-head projections to the latent space and W^OB is a shared projection back to the model dimension.

  ## Usage

  ```python
  import torch
  from transformers import DeepseekV3ForCausalLM, AutoTokenizer
-
- # Load model and tokenizer
- model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-mla-o-v0.1")
- tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-mla-o-v0.1")
-
- # For MLA-o model, apply the output subspace patch
- from patch_o_proj import patch_o_proj_implementation
- patch_o_proj_implementation(
-     model=model,
-     o_latent_dim=96,
-     variant="sequential_norm"
- )

  # Generate text
  inputs = tokenizer("The future of AI is", return_tensors="pt")
- outputs = model.generate(**inputs, max_length=50, temperature=0.7)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  ## Training Details

  - **Pre-training Dataset**: WikiText-103
- - **Fine-tuning Dataset**: SST-2 (GLUE)
  - **Optimizer**: AdamW
- - **Learning Rate**: 5e-4 (pre-training), 5e-5 (fine-tuning)
- - **Weight Decay**: 0.01 (pre-training), 0.05 (fine-tuning)
  - **Precision**: bfloat16
  - **Compilation**: torch.compile with inductor backend
- - **Training Steps**: 12,500 (pre-training), 1,500 (fine-tuning)

  ## Limitations

  - Small scale model (16M parameters) intended for research purposes
- - Trained on limited data compared to production models
- - May require custom loading code for output subspace variants
-
- ## Citation
-
- ```bibtex
- @misc{mccormick2025sharedsubspaces,
-     title={Shared Subspaces in Transformer Attention: Investigating Output Latent Spaces},
-     author={McCormick, Chris},
-     year={2025},
-     howpublished={\url{https://github.com/chrisjmccormick/shared-subspaces}}
- }
- ```
-
- ## License
-
- Apache 2.0
 
  ---
+ license: mit
  language:
  - en
  datasets:
 
  - output-subspace
  ---

+ # DeepSeek-Tiny with MLA-o V0.1

+ 6-layer DeepSeek-V3 with MLA + shared output latent space ("MLA-o") trained for research on shared subspaces in Transformer attention mechanisms.

  ## Model Description

  ### Output Subspace Decomposition
  This model implements a shared output latent space where the attention output projection W^O is decomposed into:
  ```
+ W^O = W^OA · W^OB
  ```
  Where W^OA are per-head projections to the latent space and W^OB is a shared projection back to the model dimension.
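
For a concrete sense of scale, here is a rough per-layer parameter count for the output projection, assuming the dimensions used in the loading code below (8 heads × 32 v_head_dim = 256, hidden size 256, output latent 96); these figures are illustrative, not taken from the model card:

```python
# Illustrative arithmetic only; dimensions assumed from the loading code below.
dense_w_o = 256 * 256                 # 65,536 weights in a standard, dense W^O
factored  = 256 * 96 + 96 * 256 + 96  # 49,248 weights: W^OA + W^OB + RMSNorm gain
print(dense_w_o, factored)
```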

  ## Usage

+ Rather than overwrite the entire attention layer, we simply patched the `o_proj` module with an `nn.Sequential`. It's an easy way to modify the model prior to pre-training, but it means `from_pretrained` can't load the decomposed weights into the stock architecture on its own, so they have to be copied in manually.
+
+ The code below applies the patch and then loads the decomposed weights manually.
+
  ```python
  import torch
+ import torch.nn as nn
  from transformers import DeepseekV3ForCausalLM, AutoTokenizer
+ from safetensors.torch import load_file
+ from huggingface_hub import hf_hub_download
+
+ def load_mla_o_model(repo_id="ChrisMcCormick/deepseek-tiny-mla-o-v0.1"):
+     """
+     Load the MLA-o model with output subspace decomposition
+     """
+
+     print("\n<<Ignore the 'weights not used' warning>>\n")
+
+     # Load base model (without decomposed weights)
+     model = DeepseekV3ForCausalLM.from_pretrained(repo_id)
+     tokenizer = AutoTokenizer.from_pretrained(repo_id)
+
+     print("\nPatching weights...\n")
+
+     # Download the safetensors file to get the decomposed weights
+     weights_path = hf_hub_download(repo_id=repo_id, filename="model.safetensors")
+     weights = load_file(weights_path)
+
+     # Apply output subspace decomposition to all attention layers
+     for layer_idx, layer in enumerate(model.model.layers):
+         attn = layer.self_attn
+
+         # Calculate dimensions
+         in_features = attn.num_heads * attn.v_head_dim  # 8 * 32 = 256
+         o_latent_dim = 96  # Output latent dimension
+         out_features = model.config.hidden_size  # 256
+         bias = bool(getattr(model.config, "attention_bias", False))
+
+         # Replace o_proj with sequential decomposition
+         attn.o_proj = nn.Sequential(
+             nn.Linear(in_features, o_latent_dim, bias=False),  # W^OA: 256 -> 96
+             nn.RMSNorm(o_latent_dim, eps=model.config.rms_norm_eps),  # Normalization
+             nn.Linear(o_latent_dim, out_features, bias=bias),  # W^OB: 96 -> 256
+         )
+
+         # Load the decomposed weights
+         layer_prefix = f"model.layers.{layer_idx}.self_attn.o_proj"
+
+         # Load W^OA weights (o_proj.0.weight)
+         w_oa_key = f"{layer_prefix}.0.weight"
+         if w_oa_key in weights:
+             attn.o_proj[0].weight.data = weights[w_oa_key]
+
+         # Load RMSNorm weights (o_proj.1.weight)
+         w_norm_key = f"{layer_prefix}.1.weight"
+         if w_norm_key in weights:
+             attn.o_proj[1].weight.data = weights[w_norm_key]
+
+         # Load W^OB weights (o_proj.2.weight)
+         w_ob_key = f"{layer_prefix}.2.weight"
+         if w_ob_key in weights:
+             attn.o_proj[2].weight.data = weights[w_ob_key]
+
+         # Load W^OB bias if it exists
+         w_ob_bias_key = f"{layer_prefix}.2.bias"
+         if w_ob_bias_key in weights and attn.o_proj[2].bias is not None:
+             attn.o_proj[2].bias.data = weights[w_ob_bias_key]
+
+     print("Model loaded and patched.")
+     return model, tokenizer
+
+ # Load the model
+ model, tokenizer = load_mla_o_model()

  # Generate text
  inputs = tokenizer("The future of AI is", return_tensors="pt")
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_length=50,
+         temperature=0.7,
+         do_sample=True,
+         pad_token_id=tokenizer.eos_token_id
+     )
+
+ print("Generated text:")
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+
  ```
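
As a quick sanity check (not part of the original card), you can print one of the patched layers and confirm that `o_proj` is now the three-stage `nn.Sequential`:

```python
# Optional: inspect the first decoder layer after patching. The printed module
# should be an nn.Sequential containing the 256->96 Linear, the RMSNorm, and
# the 96->256 Linear.
print(model.model.layers[0].self_attn.o_proj)
```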

  ## Training Details

  - **Pre-training Dataset**: WikiText-103
  - **Optimizer**: AdamW
+ - **Learning Rate**: 5e-4
+ - **Weight Decay**: 0.01
  - **Precision**: bfloat16
  - **Compilation**: torch.compile with inductor backend
+ - **Training Steps**: 12,500
+ - **Effective Batch Size**: 1,024
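
For reference, a minimal sketch of how these settings might map to code; this is an assumption for illustration, not the project's actual training script:

```python
# Hypothetical setup mirroring the hyperparameters listed above.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
compiled_model = torch.compile(model, backend="inductor")
# Pre-training then runs for 12,500 steps in bfloat16 with an effective batch size of 1,024.
```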
 
  ## Limitations

  - Small scale model (16M parameters) intended for research purposes
+ - Trained on limited data compared to production models