# Code-Specialized Model2Vec Distillation Analysis
## 🎯 Executive Summary
This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation.
### Evaluated Models Overview
**Simplified Distillation Models:** 14
**Peer Comparison Models:** 19
**Total Models Analyzed:** 33
### Best Performing Simplified Model: code_model2vec_all_mpnet_base_v2
**Overall CodeSearchNet Performance:**
- **NDCG@10**: 0.7387
- **Mean Reciprocal Rank (MRR)**: 0.7010
- **Recall@5**: 0.8017
- **Mean Rank**: 6.4
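Once distilled, the model loads like any other Model2Vec static model. The sketch below is illustrative, assuming the model has been saved locally under the directory name used in this report; the query and code snippet are made up for the example.

```python
import numpy as np
from model2vec import StaticModel

# Load the best-performing distilled model (local path is illustrative).
model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")

# Embed a natural-language query and a candidate code snippet.
query_vec, code_vec = model.encode([
    "read a JSON file and return its contents as a dict",
    "def load_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
])

# Rank candidates by cosine similarity between query and code embeddings.
score = float(np.dot(query_vec, code_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(code_vec)))
print(f"cosine similarity: {score:.3f}")
```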
## πŸ“Š Comprehensive Model Comparison
### All Simplified Distillation Models Performance
| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.7387 | 0.7010 | 0.8017 | πŸ₯‡ Best |
| code_model2vec_all_MiniLM_L6_v2 | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 0.7385 | 0.7049 | 0.7910 | πŸ₯ˆ 2nd |
| code_model2vec_jina_embeddings_v2_base_code | [jina-embeddings-v2-base-code](https://huggingface.co/jina-embeddings-v2-base-code) | 0.7381 | 0.6996 | 0.8130 | πŸ₯‰ 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) | 0.2779 | 0.2534 | 0.3136 | #14 |
### πŸ“Š Model Specifications Analysis
All distilled models share a 256-dimensional static embedding space; vocabulary size, and therefore parameter count and disk footprint, is inherited from each teacher's tokenizer:
| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |
![Model Specifications](analysis_charts/model_specifications.png)
*Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.*
#### Key Insights from Model Specifications:
- **Vocabulary Size**: Vocabulary sizes vary widely with the teacher tokenizer, from 29,525 to 249,999 tokens (avg: 101,594)
- **Parameter Efficiency**: Models range from 7.6M to 64.0M parameters (avg: 26.0M)
- **Storage Efficiency**: Disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- **Embedding Dimensions**: Consistent 256 dimensions across all models (optimized for efficiency)
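The disk sizes above follow directly from the embedding matrix shape: vocabulary size × embedding dimension × bytes per weight. A quick sanity check, assuming float16 storage (2 bytes per weight), which matches the reported sizes for the standard distilled models (the fine-tuned variant stores additional data):

```python
# Embedding matrix footprint ≈ vocab × dim × bytes/weight (float16 assumed).
vocab, dim, bytes_per_weight = 29_528, 256, 2  # all_mpnet_base_v2 row above
size_mb = vocab * dim * bytes_per_weight / 1024**2
params_m = vocab * dim / 1e6
print(f"{size_mb:.1f} MB, {params_m:.1f}M parameters")  # ≈ 14.4 MB, 7.6M — matching the table
```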
### Key Findings
- **Best Distilled Model**: code_model2vec_all_mpnet_base_v2 (NDCG@10: 0.7387)
- **Least Effective Distilled Model**: code_model2vec_codebert_base (NDCG@10: 0.2779)
- **Performance Range**: 62.4% relative difference in NDCG@10 between the best and worst models
- **Average Performance**: 0.5248 NDCG@10 across the 14 simplified distillation models
## 🎯 Language Performance Radar Charts
### Best Model vs Peer Models Comparison
![Comparative Radar Chart](analysis_charts/comparative_radar.png)
*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.*
### Individual Model Performance by Language
#### code_model2vec_all_mpnet_base_v2 (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.7387
![code_model2vec_all_mpnet_base_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2.png)
#### code_model2vec_all_MiniLM_L6_v2 (Teacher: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) - NDCG@10: 0.7385
![code_model2vec_all_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_MiniLM_L6_v2.png)
#### code_model2vec_jina_embeddings_v2_base_code (Teacher: [jina-embeddings-v2-base-code](https://huggingface.co/jina-embeddings-v2-base-code)) - NDCG@10: 0.7381
![code_model2vec_jina_embeddings_v2_base_code Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v2_base_code.png)
#### code_model2vec_paraphrase_MiniLM_L6_v2 (Teacher: [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)) - NDCG@10: 0.7013
![code_model2vec_paraphrase_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_paraphrase_MiniLM_L6_v2.png)
#### code_model2vec_Reason_ModernColBERT (Teacher: [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)) - NDCG@10: 0.6598
![code_model2vec_Reason_ModernColBERT Radar Chart](analysis_charts/radar_code_model2vec_Reason_ModernColBERT.png)
#### code_model2vec_all_mpnet_base_v2_fine_tuned (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.6147
![code_model2vec_all_mpnet_base_v2_fine_tuned Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2_fine_tuned.png)
#### code_model2vec_bge_m3 (Teacher: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)) - NDCG@10: 0.4863
![code_model2vec_bge_m3 Radar Chart](analysis_charts/radar_code_model2vec_bge_m3.png)
#### code_model2vec_jina_embeddings_v3 (Teacher: [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)) - NDCG@10: 0.4755
![code_model2vec_jina_embeddings_v3 Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v3.png)
#### code_model2vec_nomic_embed_text_v2_moe (Teacher: [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe)) - NDCG@10: 0.4532
![code_model2vec_nomic_embed_text_v2_moe Radar Chart](analysis_charts/radar_code_model2vec_nomic_embed_text_v2_moe.png)
#### code_model2vec_gte_Qwen2_1.5B_instruct (Teacher: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)) - NDCG@10: 0.4238
![code_model2vec_gte_Qwen2_1.5B_instruct Radar Chart](analysis_charts/radar_code_model2vec_gte_Qwen2_15B_instruct.png)
#### code_model2vec_Qodo_Embed_1_1.5B (Teacher: [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B)) - NDCG@10: 0.4101
![code_model2vec_Qodo_Embed_1_1.5B Radar Chart](analysis_charts/radar_code_model2vec_Qodo_Embed_1_15B.png)
#### code_model2vec_graphcodebert_base (Teacher: [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)) - NDCG@10: 0.3420
![code_model2vec_graphcodebert_base Radar Chart](analysis_charts/radar_code_model2vec_graphcodebert_base.png)
#### code_model2vec_Linq_Embed_Mistral (Teacher: [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)) - NDCG@10: 0.2868
![code_model2vec_Linq_Embed_Mistral Radar Chart](analysis_charts/radar_code_model2vec_Linq_Embed_Mistral.png)
#### code_model2vec_codebert_base (Teacher: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)) - NDCG@10: 0.2779
![code_model2vec_codebert_base Radar Chart](analysis_charts/radar_code_model2vec_codebert_base.png)
## πŸ† Peer Model Comparison
![Peer Comparison](analysis_charts/peer_comparison.png)
*Comparison with established code-specialized embedding models using actual evaluation results.*
### Complete Model Ranking
| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | **πŸ”₯ Simplified Distillation** | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | **πŸ”₯ Simplified Distillation** | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | **πŸ”₯ Simplified Distillation** | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | **πŸ”₯ Simplified Distillation** | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | **πŸ”₯ Simplified Distillation** | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | **πŸŽ“ Fine-tuned Distillation** | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | **πŸ”₯ Simplified Distillation** | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | **πŸ”₯ Simplified Distillation** | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | **πŸ”₯ Simplified Distillation** | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | **πŸ”₯ Simplified Distillation** | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | **πŸ”₯ Simplified Distillation** | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | **πŸ”₯ Simplified Distillation** | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | **πŸ”₯ Simplified Distillation** | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | **πŸ”₯ Simplified Distillation** | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |
## πŸ“ˆ Performance Analysis
### Multi-Model Comparison Charts
![Model Comparison](analysis_charts/model_comparison.png)
*Comprehensive comparison across all evaluation metrics.*
### Language Performance Analysis
![Language Heatmap](analysis_charts/language_heatmap.png)
*Performance heatmap showing how different models perform across programming languages.*
### Efficiency Analysis
![Efficiency Analysis](analysis_charts/efficiency_analysis.png)
*Performance vs model size analysis showing the efficiency benefits of distillation.*
## ⚑ Operational Performance Analysis
![Benchmark Performance](analysis_charts/benchmark_performance.png)
*Comprehensive performance benchmarking across multiple operational metrics.*
### Performance Scaling Analysis
![Batch Size Scaling](analysis_charts/batch_size_scaling.png)
*How performance scales with different batch sizes for optimal throughput.*
![Memory Scaling](analysis_charts/memory_scaling.png)
*Memory usage patterns across different batch sizes.*
## πŸ” Language-Specific Analysis
### Performance by Programming Language
| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|--------------------|--------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |
## 🎯 Conclusions and Recommendations
### Teacher Model Analysis
Based on the evaluation results across all simplified distillation models:
1. **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387), with sentence-transformers/all-MiniLM-L6-v2 an effective tie (NDCG@10: 0.7385 and the highest MRR at 0.7049)
2. **Least Effective Teacher**: microsoft/codebert-base (NDCG@10: 0.2779)
3. **Teacher Model Impact**: The choice of teacher changes NDCG@10 by up to 62.4% (relative) between the best and worst teachers
### Recommendations
- **For Production**: Use sentence-transformers/all-mpnet-base-v2 or all-MiniLM-L6-v2 as the teacher; the two are effectively tied on CodeSearchNet, with the MiniLM teacher offering a smaller starting model
- **For Efficiency**: Model2Vec distillation provides a large size reduction (7.6M–64M parameters, 14–122MB on disk) while retaining competitive retrieval performance
- **For Code Tasks**: In this distillation setting, strong general-purpose teachers outperform the code-specialized teachers tested (CodeBERT, GraphCodeBERT); code specialization alone does not guarantee a better static student
## πŸ“„ Methodology
### Evaluation Protocol
- **Dataset**: CodeSearchNet test sets for 6 programming languages
- **Metrics**: NDCG@k, MRR, Recall@k following CodeSearchNet methodology
- **Query Format**: Natural language documentation strings
- **Corpus Format**: Function code strings
- **Evaluation**: Retrieval of correct code for each documentation query
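Because each query has exactly one relevant item, all three reported metrics reduce to simple functions of the rank of the correct snippet. The sketch below is illustrative rather than the actual evaluation code: with a single relevant document, IDCG = 1, so NDCG@k collapses to 1/log2(rank + 1).

```python
import math

def retrieval_metrics(ranks: list[int], k: int = 10) -> dict[str, float]:
    """Single-relevant-item metrics; `ranks` holds the 1-based rank of the
    correct code snippet for each documentation query."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    recall_at_k = sum(r <= k for r in ranks) / n
    # With one relevant item per query, IDCG = 1 and NDCG@k = 1 / log2(rank + 1).
    ndcg_at_k = sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / n
    return {"MRR": mrr, f"Recall@{k}": recall_at_k, f"NDCG@{k}": ndcg_at_k}

# Example: ranks of the correct snippet for five queries (k=10 as in the report).
print(retrieval_metrics([1, 2, 5, 12, 3], k=10))
```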
### Teacher Models Tested
- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (proven baseline)
- [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (general purpose)
- [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) (paraphrase model)
- [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) (code-specialized)
- [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) (graph-aware code model)
- [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (instruction model)
- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (multilingual model)
- [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) (modern embedding model)
- [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) (mixture of experts)
- [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) (code-specialized)
- [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (ColBERT architecture)
- [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) (Mistral-based)
- [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) (code-specialized BGE)
- [Salesforce/SFR-Embedding-Code-2B_R](https://huggingface.co/Salesforce/SFR-Embedding-Code-2B_R) (large code model)
### Distillation Method
- **Technique**: Model2Vec static embedding generation
- **Parameters**: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- **Training Data**: CodeSearchNet comment-code pairs
- **Languages**: Python, JavaScript, Java, PHP, Ruby, Go
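A minimal sketch of this distillation step with the model2vec library is shown below. The teacher name and output path are examples from this report; the keyword arguments follow the public `distill` API, but exact parameter names (e.g. `apply_zipf` vs. an explicit SIF coefficient argument in newer releases) vary between model2vec versions.

```python
from model2vec.distill import distill

# Distill the teacher into a static embedding model (Model2Vec).
static_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",  # teacher model
    pca_dims=256,      # PCA dims = 256, as reported above
    apply_zipf=True,   # Zipf/SIF-style frequency weighting enabled
)

# Save for later use as code_model2vec_all_mpnet_base_v2.
static_model.save_pretrained("code_model2vec_all_mpnet_base_v2")
```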
---
*Report generated on 2025-06-01 08:04:06 using automated analysis pipeline.*
*For questions about methodology or results, please refer to the CodeSearchNet documentation.*