
Research Notes: Performance Analysis of C4 Fine-tuning vs Base Distillation

📊 Executive Summary

Key Finding: C4 fine-tuning significantly degraded performance across almost all metrics and programming languages compared to simple Model2Vec distillation.

Recommendation: Use simple Model2Vec distillation without additional training for optimal code embedding performance.
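
The recommended baseline is reproducible in a few lines. The snippet below is a minimal sketch assuming the `model2vec` `distill` API; the output path and PCA dimensionality are illustrative choices, not this repo's exact configuration.

```python
# Minimal sketch of the recommended baseline: plain Model2Vec distillation
# from the teacher model, with no further training. The save path and
# pca_dims value are illustrative assumptions, not this repo's settings.
from model2vec.distill import distill

# Distill static token embeddings from the sentence-transformer teacher.
model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",
    pca_dims=256,
)

# Save and use like any static embedder.
model.save_pretrained("codemalt-base-distilled")
embeddings = model.encode(["def add(a, b):\n    return a + b"])
print(embeddings.shape)
```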


📉 Overall Performance Degradation

The comparison between base distilled models and C4-fine-tuned models reveals substantial performance regression:

| Metric | Base Model | Fine-tuned Model | Performance Drop |
|--------|------------|------------------|------------------|
| NDCG@10 | 0.7387 | 0.6147 | -16.8% |
| MRR | 0.7010 | 0.5720 | -18.4% |
| Recall@5 | 0.8017 | 0.6950 | -13.3% |
| Recall@1 | 0.6169 | 0.4650 | -24.6% |

Impact: Double-digit performance drops across all major retrieval metrics, with Recall@1 suffering the most severe degradation at nearly 25%.
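
For reference, the metrics in the table above can be computed from the 1-based rank at which the gold snippet is retrieved for each query. The helper below is a minimal sketch (not the project's evaluation code) assuming exactly one relevant document per query, in which case NDCG@k reduces to 1/log2(rank + 1).

```python
import math

def retrieval_metrics(ranks, k=10):
    """ranks: 1-based rank of the single gold document for each query."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    recall_at_k = sum(r <= k for r in ranks) / n
    # With one relevant document of gain 1, IDCG@k = 1, so NDCG@k = 1/log2(rank + 1).
    ndcg_at_k = sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / n
    return {"MRR": mrr, f"Recall@{k}": recall_at_k, f"NDCG@{k}": ndcg_at_k}

# Example: three queries whose gold snippets ranked 1st, 3rd, and 12th.
print(retrieval_metrics([1, 3, 12], k=10))
```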


πŸ” Language-Specific Impact Analysis

The performance degradation varied significantly across programming languages, revealing interesting patterns about domain sensitivity:

🚨 Severely Affected Languages

Java (Catastrophic degradation):

  • NDCG@10: 0.7027 → 0.2820 (-59.9%)
  • MRR: 0.6553 → 0.2419 (-63.1%)
  • Mean Rank: 7.24 → 20.38 (almost 3x worse ranking)
  • Analysis: Java suffered the most severe degradation, suggesting its documentation patterns are most incompatible with C4's web text distribution.

PHP (Major degradation):

  • NDCG@10: 0.7055 → 0.4453 (-36.9%)
  • MRR: 0.6631 → 0.3981 (-40.0%)
  • Analysis: PHP's unique syntax and documentation style may have been particularly disrupted by general web text training.

📊 Moderately Affected Languages

Python (Best preserved):

  • NDCG@10: 0.9674 → 0.9219 (-4.7%)
  • MRR: 0.9572 → 0.8964 (-6.3%)
  • Analysis: Python retained by far the highest absolute scores and degraded only modestly, likely because Python is so prevalent in web tutorials and documentation whose text overlaps with C4's content.

Ruby (Minor degradation):

  • NDCG@10: 0.7287 → 0.7178 (-1.5%)
  • MRR: 0.6869 → 0.6776 (-1.4%)

Go (Minor degradation):

  • NDCG@10: 0.7529 → 0.7250 (-3.7%)
  • MRR: 0.7059 → 0.6699 (-5.1%)

✅ Single Improvement

JavaScript (Slight improvement):

  • NDCG@10: 0.5752 → 0.5959 (+3.6%)
  • MRR: 0.5378 → 0.5481 (+1.9%)
  • Analysis: JavaScript was the only language to improve, possibly because web pages contain so much JavaScript-related text that C4's distribution partially aligns with its documentation.

πŸ” Model Characteristics Comparison

| Aspect | Base Model | Fine-tuned Model | Change | Impact |
|--------|------------|------------------|--------|--------|
| Parameters | 7.56M | 9.38M | +24% larger | Increased complexity |
| Disk Size | 15.07 MB | 36.94 MB | +145% larger | Storage overhead |
| Performance | Superior | Inferior | Significantly worse | Counterproductive |
| Efficiency | High | Low | Worse per parameter | Resource waste |

Key Insight: The fine-tuned model is larger, more complex, and performs worse: a clear example of the "bigger is not always better" principle.
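
A quick way to verify the disk-size column is to sum the saved model directories; the directory names below are illustrative placeholders for the two saved models, not paths from this repo.

```python
import os

def dir_size_mb(path):
    """Total size of all files under `path`, in MB."""
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    return total / 1e6

# Hypothetical output directories for the two models under comparison.
for model_dir in ("codemalt-base-distilled", "codemalt-c4-finetuned"):
    print(model_dir, f"{dir_size_mb(model_dir):.2f} MB")
```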


🧠 Root Cause Analysis

1. 🌐 Domain Mismatch

  • Problem: C4 contains general web text (articles, forums, websites, news)
  • Impact: Code documentation has fundamentally different linguistic patterns, vocabulary, and structure
  • Result: Training on web text actively degraded code-specific knowledge

2. 🧠 Catastrophic Forgetting

  • Problem: The model "forgot" code-specific embeddings during C4 training
  • Evidence: Java and PHP were hit hardest (59.9% and 36.9% NDCG@10 drops respectively)
  • Mechanism: New training overwrote previously learned code-specific representations

3. 📊 Distribution Shift

  • Problem: C4's token distribution is vastly different from code comments and documentation
  • Impact: Model learned patterns that are irrelevant or harmful for code retrieval
  • Evidence: Broad degradation across most languages points to a systematic distribution mismatch (a rough vocabulary-overlap check is sketched after this section)

4. ⚖️ Training Methodology Issues

  • Problem: Tokenlearn training on C4 introduced noise rather than signal
  • Analysis: The POTION approach works well for general text but fails for specialized domains
  • Conclusion: Domain-agnostic training methods can be counterproductive
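
The distribution-shift point above can be eyeballed with a crude vocabulary-overlap check. The sketch below uses naive regex tokenization and toy stand-ins for a C4 sentence and a docstring, not actual samples from either corpus.

```python
import re

def vocab(text):
    """Lowercased word-level vocabulary via naive regex tokenization."""
    return set(re.findall(r"[a-z_]+", text.lower()))

# Toy stand-ins for a C4 web-text sentence and a CodeSearchNet docstring.
web_text = "The city council voted on Tuesday to approve the new transit budget."
code_doc = "Return the index of the first element matching the predicate, or -1 if none match."

v_web, v_code = vocab(web_text), vocab(code_doc)
jaccard = len(v_web & v_code) / len(v_web | v_code)
print(f"Jaccard vocabulary overlap: {jaccard:.2f}")
```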

📈 Performance vs Complexity Analysis

Performance Efficiency = NDCG@10 / Model_Size_MB

  • Base Model: 0.7387 / 15.07 = 0.049 (High efficiency)
  • Fine-tuned Model: 0.6147 / 36.94 = 0.017 (Low efficiency)
  • Efficiency Loss: 65.3%

The fine-tuned model is not only worse performing but also dramatically less efficient, representing a significant regression in both absolute and relative terms.
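
The same calculation, spelled out (values taken from the tables above; the exact loss figure depends on rounding):

```python
# Efficiency = NDCG@10 per MB on disk, for each model.
base_ndcg, base_mb = 0.7387, 15.07
ft_ndcg, ft_mb = 0.6147, 36.94

eff_base = base_ndcg / base_mb   # ~0.049 NDCG@10 per MB
eff_ft = ft_ndcg / ft_mb         # ~0.017 NDCG@10 per MB
loss = 1 - eff_ft / eff_base     # ~0.66, i.e. roughly the 65% loss reported above
print(f"{eff_base:.3f} {eff_ft:.3f} {loss:.1%}")
```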


🎯 Key Research Insights

1. Domain Specificity Matters

Code embeddings require domain-specific training data. General web text (C4) actively harms code retrieval performance.

2. Language-Dependent Vulnerability

Programming languages show different sensitivity to domain shift:

  • High vulnerability: Java, PHP (enterprise/web languages)
  • Medium vulnerability: Go, Ruby
  • Low vulnerability: Python (ubiquitous in tutorials)
  • Potential benefit: JavaScript (web-native language)

3. Simple Distillation Superiority

Model2Vec's simple distillation approach outperforms complex fine-tuning when training data is misaligned with the target domain.

4. Training Data Quality > Quantity

Using massive but irrelevant data (C4) is worse than using no additional training at all.


📋 Actionable Recommendations

❌ What NOT to Do

  1. Don't use C4 for code models: General web text degrades code-specific performance
  2. Don't assume more training is better: Additional training can be counterproductive
  3. Don't ignore domain alignment: Training data must match target application domain
  4. Don't prioritize model size: Larger models can perform worse if poorly trained

✅ What TO Do

  1. Stick to base distillation: Simple Model2Vec distillation gives optimal results for code tasks
  2. Use code-specific datasets only: If fine-tuning is needed, use CodeSearchNet or a similar code-domain dataset (see the loading sketch after this list)
  3. Validate domain alignment: Ensure training data distribution matches target use case
  4. Measure efficiency: Consider performance per parameter, not just absolute performance
  5. Test incrementally: Validate that each training step improves rather than degrades performance
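
For recommendation 2, in-domain (docstring, code) pairs can be pulled directly. The sketch below assumes the Hugging Face `code_search_net` dataset and its published field names; verify both against the current hub schema before relying on them.

```python
from datasets import load_dataset

# Load a slice of Python examples; other configs cover go, java, javascript, php, ruby.
ds = load_dataset("code_search_net", "python", split="train")

# Pair each docstring (the query) with its function body (the document).
# Field names assumed from the dataset card; check ds.column_names if they differ.
pairs = [
    (ex["func_documentation_string"], ex["func_code_string"])
    for ex in ds.select(range(1000))
]
print(pairs[0][0][:80])
```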

🔬 Future Research Directions

  1. Code-specific fine-tuning: Investigate tokenlearn training with CodeSearchNet instead of C4
  2. Selective fine-tuning: Apply additional training only to languages that show potential benefit (JavaScript)
  3. Hybrid approaches: Combine base distillation with minimal, targeted code-specific training
  4. Domain adaptation techniques: Develop methods to prevent catastrophic forgetting during domain transfer

📊 Statistical Significance

The performance drops are consistent across metrics and, for most languages, substantial:

  • Minimum degradation: 1.4% (Ruby MRR)
  • Maximum degradation: 63.1% (Java MRR)
  • Median degradation: ~15% across all metrics
  • Only improvement: JavaScript (+3.6% NDCG@10)

Conclusion: The degradation is too large and too consistent to be random variation; it reflects a systematic failure of the C4 fine-tuning approach.


🎓 Lessons Learned

  1. Domain expertise beats scale: Code-specific knowledge is more valuable than training on massive general datasets
  2. Validate training approaches: Always compare against simpler baselines before deploying complex training pipelines
  3. Language-specific patterns matter: Different programming languages have varying sensitivity to domain shift
  4. Efficiency is crucial: Model performance per parameter is often more important than absolute performance
  5. Simple can be superior: Sometimes the simplest approach (basic distillation) outperforms sophisticated alternatives

Documentation Date: December 2024
Model Comparison: sentence-transformers/all-mpnet-base-v2 teacher → Model2Vec distillation vs Model2Vec + C4 tokenlearn fine-tuning
Evaluation Dataset: CodeSearchNet across 6 programming languages
Key Finding: C4 fine-tuning reduces average NDCG@10 by 16.8% relative to simple distillation