Safetensors · qwen3

Commit 9ae0062 by TroyDoesAI · verified · 1 parent: d191028

Update README.md

Files changed (1): README.md (+21, -0)
README.md CHANGED
@@ -78,6 +78,27 @@ down_proj: [5120, 25600] → [8192, 29568]
  - Group Query Attention (GQA) maintained with 8 KV heads
  - All interpolations preserve the mathematical properties of the original weights
 
+ ## Evaluation Results
+
+ To answer the question "is it smarter or dumber than the original?", the model was evaluated on the **IFEval** (Instruction Following Evaluation) benchmark and compared directly against its base model, `Qwen/Qwen3-32B`.
+
+ ### IFEval: Instruction Following Comparison
+
+ Evaluation was performed using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) in a 0-shot setting. The results show that while the raw interpolated model is not yet as capable as the highly polished base model, it has retained a significant portion of its instruction-following ability.
+
89
+ | Metric (Higher is Better) | 🥇 **Base Model (Qwen3-32B)** | **Embiggened Model (This Model)** | Performance Change |
90
+ | :--- | :---: | :---: | :---: |
91
+ | **Prompt-level Strict Accuracy** | **81.25%** | 68.75% | **-12.5 pts** |
92
+ | **Instruction-level Strict Accuracy**| **87.50%** | 75.00% | **-12.5 pts** |
93
+ | Prompt-level Loose Accuracy | **87.50%** | 68.75% | **-18.75 pts** |
94
+ | Instruction-level Loose Accuracy | **91.67%** | 75.00% | **-16.67 pts** |
95
+
96
+ ### Analysis of Results
97
+
98
+ * **Expected Performance Drop:** The drop in performance is an expected and normal consequence of the architectural expansion. The interpolation process, while structure-aware, cannot perfectly preserve the intricate balance of a fine-tuned model's weights.
99
+ * **Success in Retaining Capability:** The key takeaway is not the performance drop, but how much capability the model **retained**. Achieving ~85% of the original's strict accuracy (68.75% vs 81.25%) without any post-expansion training is a strong indicator of a successful architectural merge. The model remained coherent and functional.
100
+ * **Strong Foundation for Fine-Tuning:** These results establish a powerful baseline. The model is now a larger, coherent architecture that serves as an excellent starting point for further fine-tuning, which would likely recover and ultimately exceed the performance of the original 32B model.
+
  ## Usage
 
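The 0-shot IFEval run described in the added README section could be reproduced with a command along these lines. This is a sketch using the standard lm-evaluation-harness CLI; the exact dtype, batch size, and hardware used for the reported numbers are not stated in the commit, so those settings below are assumptions.

```shell
# Sketch: 0-shot IFEval via lm-evaluation-harness (assumes it is installed
# and that sufficient GPU memory is available for a 32B-class model).
# Swap the pretrained= value for this repo's model id to score the
# embiggened model instead of the base.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-32B \
  --tasks ifeval \
  --num_fewshot 0 \
  --batch_size auto
```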
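The "~85% retained" figure in the analysis follows directly from the strict prompt-level numbers in the table. A quick check of that arithmetic (illustrative only, not part of the repository):

```python
# Strict prompt-level accuracy from the IFEval table above.
base = 81.25        # Qwen/Qwen3-32B
embiggened = 68.75  # this model, with no post-expansion training

drop_pts = base - embiggened         # percentage-point drop
retention = embiggened / base * 100  # share of base capability retained

print(f"drop: {drop_pts:.2f} pts")     # 12.50 pts
print(f"retention: {retention:.1f}%")  # 84.6%, i.e. the "~85%" in the analysis
```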