codemalt / NOTES.md

Sarthak

chore: update README and REPORT with performance insights and dataset changes

0dbb356 about 2 months ago

8.64 kB

	# Research Notes: Performance Analysis of C4 Fine-tuning vs Base Distillation

	## 📊 Executive Summary

	Key Finding: C4 fine-tuning significantly degraded performance across almost all metrics and programming languages compared to simple Model2Vec distillation.

	Recommendation: Use simple Model2Vec distillation without additional training for optimal code embedding performance.

	---

	## 📉 Overall Performance Degradation

	The comparison between base distilled models and C4-fine-tuned models reveals substantial performance regression:

	\| Metric \| Base Model \| Fine-tuned Model \| Performance Drop \|
	\|--------\|------------\|------------------\|------------------\|
	\| NDCG@10 \| 0.7387 \| 0.6147 \| -16.8% \|
	\| MRR \| 0.7010 \| 0.5720 \| -18.4% \|
	\| Recall@5 \| 0.8017 \| 0.6950 \| -13.3% \|
	\| Recall@1 \| 0.6169 \| 0.4650 \| -24.6% \|

	Impact: Double-digit performance drops across all major retrieval metrics, with Recall@1 suffering the most severe degradation at nearly 25%.

	---

	## 🔍 Language-Specific Impact Analysis

	The performance degradation varied significantly across programming languages, revealing interesting patterns about domain sensitivity:

	### 🚨 Severely Affected Languages

	#### Java (Catastrophic degradation):
	- NDCG@10: 0.7027 → 0.2820 (-59.9%)
	- MRR: 0.6553 → 0.2419 (-63.1%)
	- Mean Rank: 7.24 → 20.38 (almost 3x worse ranking)
	- Analysis: Java suffered the most severe degradation, suggesting its documentation patterns are most incompatible with C4's web text distribution.

	#### PHP (Major degradation):
	- NDCG@10: 0.7055 → 0.4453 (-36.9%)
	- MRR: 0.6631 → 0.3981 (-40.0%)
	- Analysis: PHP's unique syntax and documentation style may have been particularly disrupted by general web text training.

	### 📊 Moderately Affected Languages

	#### Python (Best preserved):
	- NDCG@10: 0.9674 → 0.9219 (-4.7%)
	- MRR: 0.9572 → 0.8964 (-6.3%)
	- Analysis: Python showed the smallest degradation, likely due to its prevalence in web tutorials and documentation that might overlap with C4 content.

	#### Ruby (Minor degradation):
	- NDCG@10: 0.7287 → 0.7178 (-1.5%)
	- MRR: 0.6869 → 0.6776 (-1.4%)

	#### Go (Minor degradation):
	- NDCG@10: 0.7529 → 0.7250 (-3.7%)
	- MRR: 0.7059 → 0.6699 (-5.1%)

	### ✅ Single Improvement

	#### JavaScript (Slight improvement):
	- NDCG@10: 0.5752 → 0.5959 (+3.6%)
	- MRR: 0.5378 → 0.5481 (+1.9%)
	- Analysis: JavaScript was the only language to show improvement, possibly due to extensive JavaScript content in web pages that align with C4's distribution.

	---

	## 🔍 Model Characteristics Comparison

	\| Aspect \| Base Model \| Fine-tuned Model \| Change \| Impact \|
	\|--------\|------------\|------------------\|--------\|---------\|
	\| Parameters \| 7.56M \| 9.38M \| +24% larger \| Increased complexity \|
	\| Disk Size \| 15.07MB \| 36.94MB \| +145% larger \| Storage overhead \|
	\| Performance \| Superior \| Inferior \| Significantly worse \| Counterproductive \|
	\| Efficiency \| High \| Low \| Worse per parameter \| Resource waste \|

	Key Insight: The fine-tuned model is larger, more complex, and performs worse—a clear example of the "bigger is not always better" principle.

	---

	## 🧠 Root Cause Analysis

	### 1. 🌐 Domain Mismatch
	- Problem: C4 contains general web text (articles, forums, websites, news)
	- Impact: Code documentation has fundamentally different linguistic patterns, vocabulary, and structure
	- Result: Training on web text actively degraded code-specific knowledge

	### 2. 🧠 Catastrophic Forgetting
	- Problem: The model "forgot" code-specific embeddings during C4 training
	- Evidence: Java and PHP were hit hardest (59.9% and 36.9% NDCG@10 drops respectively)
	- Mechanism: New training overwrote previously learned code-specific representations

	### 3. 📊 Distribution Shift
	- Problem: C4's token distribution is vastly different from code comments and documentation
	- Impact: Model learned patterns that are irrelevant or harmful for code retrieval
	- Evidence: Uniform degradation across most languages suggests systematic distribution mismatch

	### 4. ⚖️ Training Methodology Issues
	- Problem: Tokenlearn training on C4 introduced noise rather than signal
	- Analysis: The POTION approach works well for general text but fails for specialized domains
	- Conclusion: Domain-agnostic training methods can be counterproductive

	---

	## 📈 Performance vs Complexity Analysis

	```
	Performance Efficiency = NDCG@10 / Model_Size_MB

	Base Model: 0.7387 / 15.07 = 0.049 (High efficiency)
	Fine-tuned Model: 0.6147 / 36.94 = 0.017 (Low efficiency)

	Efficiency Loss: 65.3%
	```

	The fine-tuned model is not only worse performing but also dramatically less efficient, representing a significant regression in both absolute and relative terms.

	---

	## 🎯 Key Research Insights

	### 1. Domain Specificity Matters
	Code embeddings require domain-specific training data. General web text (C4) actively harms code retrieval performance.

	### 2. Language-Dependent Vulnerability
	Programming languages show different sensitivity to domain shift:
	- High vulnerability: Java, PHP (enterprise/web languages)
	- Medium vulnerability: Go, Ruby
	- Low vulnerability: Python (ubiquitous in tutorials)
	- Potential benefit: JavaScript (web-native language)

	### 3. Simple Distillation Superiority
	Model2Vec's simple distillation approach outperforms complex fine-tuning when training data is misaligned with the target domain.

	### 4. Training Data Quality > Quantity
	Using massive but irrelevant data (C4) is worse than using no additional training at all.

	---

	## 📋 Actionable Recommendations

	### ❌ What NOT to Do
	1. Don't use C4 for code models: General web text degrades code-specific performance
	2. Don't assume more training is better: Additional training can be counterproductive
	3. Don't ignore domain alignment: Training data must match target application domain
	4. Don't prioritize model size: Larger models can perform worse if poorly trained

	### ✅ What TO Do
	1. Stick to base distillation: Simple Model2Vec distillation gives optimal results for code tasks
	2. Use code-specific datasets only: If fine-tuning is needed, use CodeSearchNet or similar datasets
	3. Validate domain alignment: Ensure training data distribution matches target use case
	4. Measure efficiency: Consider performance per parameter, not just absolute performance
	5. Test incrementally: Validate that each training step improves rather than degrades performance

	### 🔬 Future Research Directions
	1. Code-specific fine-tuning: Investigate tokenlearn training with CodeSearchNet instead of C4
	2. Selective fine-tuning: Apply additional training only to languages that show potential benefit (JavaScript)
	3. Hybrid approaches: Combine base distillation with minimal, targeted code-specific training
	4. Domain adaptation techniques: Develop methods to prevent catastrophic forgetting during domain transfer

	---

	## 📊 Statistical Significance

	All performance drops are substantial and consistent across metrics:
	- Minimum degradation: 1.4% (Ruby MRR)
	- Maximum degradation: 63.1% (Java MRR)
	- Median degradation: ~15% across all metrics
	- Only improvement: JavaScript (+3.6% NDCG@10)

	Conclusion: The degradation is not due to random variation but represents a systematic failure of the C4 fine-tuning approach.

	---

	## 🎓 Lessons Learned

	1. Domain expertise beats scale: Code-specific knowledge is more valuable than training on massive general datasets
	2. Validate training approaches: Always compare against simpler baselines before deploying complex training pipelines
	3. Language-specific patterns matter: Different programming languages have varying sensitivity to domain shift
	4. Efficiency is crucial: Model performance per parameter is often more important than absolute performance
	5. Simple can be superior: Sometimes the simplest approach (basic distillation) outperforms sophisticated alternatives

	---

	Documentation Date: December 2024
	Model Comparison: `sentence-transformers/all-mpnet-base-v2` teacher → Model2Vec distillation vs Model2Vec + C4 tokenlearn fine-tuning
	Evaluation Dataset: CodeSearchNet across 6 programming languages
	Key Finding: Simple distillation outperforms complex fine-tuning by 16.8% NDCG@10 on average

	# Research Notes: Performance Analysis of C4 Fine-tuning vs Base Distillation

	## 📊 Executive Summary

	Key Finding: C4 fine-tuning significantly degraded performance across almost all metrics and programming languages compared to simple Model2Vec distillation.

	Recommendation: Use simple Model2Vec distillation without additional training for optimal code embedding performance.

	---

	## 📉 Overall Performance Degradation

	The comparison between base distilled models and C4-fine-tuned models reveals substantial performance regression:

	\| Metric \| Base Model \| Fine-tuned Model \| Performance Drop \|
	\|--------\|------------\|------------------\|------------------\|
	\| NDCG@10 \| 0.7387 \| 0.6147 \| -16.8% \|
	\| MRR \| 0.7010 \| 0.5720 \| -18.4% \|
	\| Recall@5 \| 0.8017 \| 0.6950 \| -13.3% \|
	\| Recall@1 \| 0.6169 \| 0.4650 \| -24.6% \|

	Impact: Double-digit performance drops across all major retrieval metrics, with Recall@1 suffering the most severe degradation at nearly 25%.

	---

	## 🔍 Language-Specific Impact Analysis

	The performance degradation varied significantly across programming languages, revealing interesting patterns about domain sensitivity:

	### 🚨 Severely Affected Languages

	#### Java (Catastrophic degradation):
	- NDCG@10: 0.7027 → 0.2820 (-59.9%)
	- MRR: 0.6553 → 0.2419 (-63.1%)
	- Mean Rank: 7.24 → 20.38 (almost 3x worse ranking)
	- Analysis: Java suffered the most severe degradation, suggesting its documentation patterns are most incompatible with C4's web text distribution.

	#### PHP (Major degradation):
	- NDCG@10: 0.7055 → 0.4453 (-36.9%)
	- MRR: 0.6631 → 0.3981 (-40.0%)
	- Analysis: PHP's unique syntax and documentation style may have been particularly disrupted by general web text training.

	### 📊 Moderately Affected Languages

	#### Python (Best preserved):
	- NDCG@10: 0.9674 → 0.9219 (-4.7%)
	- MRR: 0.9572 → 0.8964 (-6.3%)
	- Analysis: Python showed the smallest degradation, likely due to its prevalence in web tutorials and documentation that might overlap with C4 content.

	#### Ruby (Minor degradation):
	- NDCG@10: 0.7287 → 0.7178 (-1.5%)
	- MRR: 0.6869 → 0.6776 (-1.4%)

	#### Go (Minor degradation):
	- NDCG@10: 0.7529 → 0.7250 (-3.7%)
	- MRR: 0.7059 → 0.6699 (-5.1%)

	### ✅ Single Improvement

	#### JavaScript (Slight improvement):
	- NDCG@10: 0.5752 → 0.5959 (+3.6%)
	- MRR: 0.5378 → 0.5481 (+1.9%)
	- Analysis: JavaScript was the only language to show improvement, possibly due to extensive JavaScript content in web pages that align with C4's distribution.

	---

	## 🔍 Model Characteristics Comparison

	\| Aspect \| Base Model \| Fine-tuned Model \| Change \| Impact \|
	\|--------\|------------\|------------------\|--------\|---------\|
	\| Parameters \| 7.56M \| 9.38M \| +24% larger \| Increased complexity \|
	\| Disk Size \| 15.07MB \| 36.94MB \| +145% larger \| Storage overhead \|
	\| Performance \| Superior \| Inferior \| Significantly worse \| Counterproductive \|
	\| Efficiency \| High \| Low \| Worse per parameter \| Resource waste \|

	Key Insight: The fine-tuned model is larger, more complex, and performs worse—a clear example of the "bigger is not always better" principle.

	---

	## 🧠 Root Cause Analysis

	### 1. 🌐 Domain Mismatch
	- Problem: C4 contains general web text (articles, forums, websites, news)
	- Impact: Code documentation has fundamentally different linguistic patterns, vocabulary, and structure
	- Result: Training on web text actively degraded code-specific knowledge

	### 2. 🧠 Catastrophic Forgetting
	- Problem: The model "forgot" code-specific embeddings during C4 training
	- Evidence: Java and PHP were hit hardest (59.9% and 36.9% NDCG@10 drops respectively)
	- Mechanism: New training overwrote previously learned code-specific representations

	### 3. 📊 Distribution Shift
	- Problem: C4's token distribution is vastly different from code comments and documentation
	- Impact: Model learned patterns that are irrelevant or harmful for code retrieval
	- Evidence: Uniform degradation across most languages suggests systematic distribution mismatch

	### 4. ⚖️ Training Methodology Issues
	- Problem: Tokenlearn training on C4 introduced noise rather than signal
	- Analysis: The POTION approach works well for general text but fails for specialized domains
	- Conclusion: Domain-agnostic training methods can be counterproductive

	---

	## 📈 Performance vs Complexity Analysis

	```
	Performance Efficiency = NDCG@10 / Model_Size_MB

	Base Model: 0.7387 / 15.07 = 0.049 (High efficiency)
	Fine-tuned Model: 0.6147 / 36.94 = 0.017 (Low efficiency)

	Efficiency Loss: 65.3%
	```

	The fine-tuned model is not only worse performing but also dramatically less efficient, representing a significant regression in both absolute and relative terms.

	---

	## 🎯 Key Research Insights

	### 1. Domain Specificity Matters
	Code embeddings require domain-specific training data. General web text (C4) actively harms code retrieval performance.

	### 2. Language-Dependent Vulnerability
	Programming languages show different sensitivity to domain shift:
	- High vulnerability: Java, PHP (enterprise/web languages)
	- Medium vulnerability: Go, Ruby
	- Low vulnerability: Python (ubiquitous in tutorials)
	- Potential benefit: JavaScript (web-native language)

	### 3. Simple Distillation Superiority
	Model2Vec's simple distillation approach outperforms complex fine-tuning when training data is misaligned with the target domain.

	### 4. Training Data Quality > Quantity
	Using massive but irrelevant data (C4) is worse than using no additional training at all.

	---

	## 📋 Actionable Recommendations

	### ❌ What NOT to Do
	1. Don't use C4 for code models: General web text degrades code-specific performance
	2. Don't assume more training is better: Additional training can be counterproductive
	3. Don't ignore domain alignment: Training data must match target application domain
	4. Don't prioritize model size: Larger models can perform worse if poorly trained

	### ✅ What TO Do
	1. Stick to base distillation: Simple Model2Vec distillation gives optimal results for code tasks
	2. Use code-specific datasets only: If fine-tuning is needed, use CodeSearchNet or similar datasets
	3. Validate domain alignment: Ensure training data distribution matches target use case
	4. Measure efficiency: Consider performance per parameter, not just absolute performance
	5. Test incrementally: Validate that each training step improves rather than degrades performance

	### 🔬 Future Research Directions
	1. Code-specific fine-tuning: Investigate tokenlearn training with CodeSearchNet instead of C4
	2. Selective fine-tuning: Apply additional training only to languages that show potential benefit (JavaScript)
	3. Hybrid approaches: Combine base distillation with minimal, targeted code-specific training
	4. Domain adaptation techniques: Develop methods to prevent catastrophic forgetting during domain transfer

	---

	## 📊 Statistical Significance

	All performance drops are substantial and consistent across metrics:
	- Minimum degradation: 1.4% (Ruby MRR)
	- Maximum degradation: 63.1% (Java MRR)
	- Median degradation: ~15% across all metrics
	- Only improvement: JavaScript (+3.6% NDCG@10)

	Conclusion: The degradation is not due to random variation but represents a systematic failure of the C4 fine-tuning approach.

	---

	## 🎓 Lessons Learned

	1. Domain expertise beats scale: Code-specific knowledge is more valuable than training on massive general datasets
	2. Validate training approaches: Always compare against simpler baselines before deploying complex training pipelines
	3. Language-specific patterns matter: Different programming languages have varying sensitivity to domain shift
	4. Efficiency is crucial: Model performance per parameter is often more important than absolute performance
	5. Simple can be superior: Sometimes the simplest approach (basic distillation) outperforms sophisticated alternatives

	---

	Documentation Date: December 2024
	Model Comparison: `sentence-transformers/all-mpnet-base-v2` teacher → Model2Vec distillation vs Model2Vec + C4 tokenlearn fine-tuning
	Evaluation Dataset: CodeSearchNet across 6 programming languages
	Key Finding: Simple distillation outperforms complex fine-tuning by 16.8% NDCG@10 on average