|
# Code-Specialized Model2Vec Distillation Analysis |
|
|
|
## Executive Summary
|
|
|
This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation. |
|
|
|
### Evaluated Models Overview |
|
|
|
**Simplified Distillation Models:** 14 |
|
**Peer Comparison Models:** 19 |
|
**Total Models Analyzed:** 33 |
|
|
|
### Best Performing Simplified Model: code_model2vec_all_mpnet_base_v2 |
|
|
|
**Overall CodeSearchNet Performance:** |
|
- **NDCG@10**: 0.7387 |
|
- **Mean Reciprocal Rank (MRR)**: 0.7010 |
|
- **Recall@5**: 0.8017 |
|
- **Mean Rank**: 6.4 |
|
|
|
## Comprehensive Model Comparison
|
|
|
### All Simplified Distillation Models Performance |
|
|
|
| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.7387 | 0.7010 | 0.8017 | Best |
| code_model2vec_all_MiniLM_L6_v2 | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 0.7385 | 0.7049 | 0.7910 | 2nd |
| code_model2vec_jina_embeddings_v2_base_code | [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code) | 0.7381 | 0.6996 | 0.8130 | 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) | 0.2779 | 0.2534 | 0.3136 | #14 |
|
|
|
|
|
### Model Specifications Analysis
|
|
|
Our distilled models share a 256-dimensional embedding space, while vocabulary size and parameter count track each teacher's tokenizer:
|
|
|
| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |
|
|
|
|
|
 |
|
|
|
*Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.* |
|
|
|
#### Key Insights from Model Specifications: |
|
|
|
|
|
- **Vocabulary Size**: Vocabulary sizes follow each teacher's tokenizer, ranging from 29,525 to 249,999 tokens (avg: 101,594)
- **Parameter Efficiency**: Models range from 7.6M to 64.0M parameters (avg: 26.0M)
- **Storage Efficiency**: Disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- **Embedding Dimensions**: Consistent 256 dimensions across all models (optimized for efficiency)
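Most of the reported disk figures are consistent with storage being dominated by a single vocab × 256 embedding matrix in float16 (an assumption; the report does not state the stored dtype, and the fine-tuned variant is closer to a float32 matrix):

```python
def embedding_disk_size_mb(vocab_size: int, dims: int = 256, bytes_per_value: int = 2) -> float:
    """Approximate on-disk size of a static embedding table in MiB."""
    return vocab_size * dims * bytes_per_value / (1024 * 1024)

# Smallest and largest vocabularies from the table above:
small = embedding_disk_size_mb(29_528)   # all_mpnet_base_v2
large = embedding_disk_size_mb(249_999)  # bge_m3
print(f"{small:.1f}MB, {large:.1f}MB")  # 14.4MB, 122.1MB
```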
|
|
|
|
|
### Key Findings |
|
|
|
|
|
- **Best Distilled Model**: code_model2vec_all_mpnet_base_v2 (teacher: sentence-transformers/all-mpnet-base-v2, NDCG@10: 0.7387)
- **Least Effective Distilled Model**: code_model2vec_codebert_base (teacher: microsoft/codebert-base, NDCG@10: 0.2779)
- **Performance Range**: 62.4% relative difference in NDCG@10 between best and worst
- **Average Performance**: 0.5248 NDCG@10 across all 14 distilled models
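The aggregate figures above follow directly from the NDCG@10 column of the model table:

```python
# NDCG@10 of the 14 simplified distillation models, best to worst.
scores = [0.7387, 0.7385, 0.7381, 0.7013, 0.6598, 0.6147, 0.4863,
          0.4755, 0.4532, 0.4238, 0.4101, 0.3420, 0.2868, 0.2779]

rel_gap = (max(scores) - min(scores)) / max(scores) * 100  # relative best-worst gap
average = sum(scores) / len(scores)
print(f"relative gap: {rel_gap:.1f}%, mean NDCG@10: {average:.4f}")
```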
|
|
|
|
|
## Language Performance Radar Charts
|
|
|
### Best Model vs Peer Models Comparison |
|
|
|
 |
|
|
|
*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.* |
|
|
|
### Individual Model Performance by Language |
|
|
|
#### code_model2vec_all_mpnet_base_v2 (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.7387 |
|
|
|
 |
|
|
|
#### code_model2vec_all_MiniLM_L6_v2 (Teacher: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) - NDCG@10: 0.7385 |
|
|
|
 |
|
|
|
#### code_model2vec_jina_embeddings_v2_base_code (Teacher: [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code)) - NDCG@10: 0.7381
|
|
|
 |
|
|
|
#### code_model2vec_paraphrase_MiniLM_L6_v2 (Teacher: [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)) - NDCG@10: 0.7013 |
|
|
|
 |
|
|
|
#### code_model2vec_Reason_ModernColBERT (Teacher: [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)) - NDCG@10: 0.6598 |
|
|
|
 |
|
|
|
#### code_model2vec_all_mpnet_base_v2_fine_tuned (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.6147 |
|
|
|
 |
|
|
|
#### code_model2vec_bge_m3 (Teacher: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)) - NDCG@10: 0.4863 |
|
|
|
 |
|
|
|
#### code_model2vec_jina_embeddings_v3 (Teacher: [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)) - NDCG@10: 0.4755 |
|
|
|
 |
|
|
|
#### code_model2vec_nomic_embed_text_v2_moe (Teacher: [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe)) - NDCG@10: 0.4532 |
|
|
|
 |
|
|
|
#### code_model2vec_gte_Qwen2_1.5B_instruct (Teacher: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)) - NDCG@10: 0.4238 |
|
|
|
 |
|
|
|
#### code_model2vec_Qodo_Embed_1_1.5B (Teacher: [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B)) - NDCG@10: 0.4101 |
|
|
|
 |
|
|
|
#### code_model2vec_graphcodebert_base (Teacher: [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)) - NDCG@10: 0.3420
|
|
|
 |
|
|
|
#### code_model2vec_Linq_Embed_Mistral (Teacher: [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)) - NDCG@10: 0.2868 |
|
|
|
 |
|
|
|
#### code_model2vec_codebert_base (Teacher: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)) - NDCG@10: 0.2779 |
|
|
|
 |
|
|
|
|
|
|
|
## Peer Model Comparison
|
|
|
 |
|
|
|
*Comparison with established code-specialized embedding models using actual evaluation results.* |
|
|
|
### Complete Model Ranking |
|
|
|
| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jinaai/jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | **Simplified Distillation** | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | **Simplified Distillation** | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | **Simplified Distillation** | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | **Simplified Distillation** | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | **Simplified Distillation** | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | **Fine-tuned Distillation** | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | **Simplified Distillation** | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | **Simplified Distillation** | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | **Simplified Distillation** | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | **Simplified Distillation** | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | **Simplified Distillation** | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | **Simplified Distillation** | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | **Simplified Distillation** | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | **Simplified Distillation** | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |
|
|
|
|
|
## Performance Analysis
|
|
|
### Multi-Model Comparison Charts |
|
|
|
 |
|
|
|
*Comprehensive comparison across all evaluation metrics.* |
|
|
|
### Language Performance Analysis |
|
|
|
 |
|
|
|
*Performance heatmap showing how different models perform across programming languages.* |
|
|
|
### Efficiency Analysis |
|
|
|
 |
|
|
|
*Performance vs model size analysis showing the efficiency benefits of distillation.* |
|
|
|
|
|
|
|
## Operational Performance Analysis
|
|
|
 |
|
|
|
*Comprehensive performance benchmarking across multiple operational metrics.* |
|
|
|
### Performance Scaling Analysis |
|
|
|
 |
|
|
|
*How performance scales with different batch sizes for optimal throughput.* |
|
|
|
 |
|
|
|
*Memory usage patterns across different batch sizes.* |
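The batch-size sweep behind these charts can be reproduced in miniature. Below is a toy benchmark, where a random lookup table stands in for the distilled embedding matrix and query encoding is mean pooling over token vectors (shapes, sizes, and batch sizes are illustrative, not the pipeline's actual settings):

```python
import time

import numpy as np

rng = np.random.default_rng(0)
table = rng.standard_normal((50_000, 256)).astype(np.float32)  # stand-in embedding table
queries = rng.integers(0, 50_000, size=(4096, 32))             # 4096 queries, 32 token ids each

for batch_size in (32, 256, 2048):
    start = time.perf_counter()
    for i in range(0, len(queries), batch_size):
        pooled = table[queries[i:i + batch_size]].mean(axis=1)  # mean-pool token vectors
    elapsed = time.perf_counter() - start
    print(f"batch {batch_size:>4}: {len(queries) / elapsed:,.0f} queries/s")
```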
|
|
|
|
|
|
|
## Language-Specific Analysis
|
|
|
### Performance by Programming Language |
|
|
|
| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|---------------------|---------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |
|
|
|
|
|
## Conclusions and Recommendations
|
|
|
### Teacher Model Analysis |
|
|
|
Based on the evaluation results across all simplified distillation models: |
|
|
|
|
|
1. **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387)
2. **Least Effective Teacher**: microsoft/codebert-base (NDCG@10: 0.2779)
3. **Teacher Model Impact**: The choice of teacher model shifts NDCG@10 by up to 62.4% (relative)
|
|
|
### Recommendations |
|
|
|
- **For Production**: Use sentence-transformers/all-mpnet-base-v2 as the teacher for the best overall score (NDCG@10: 0.7387); sentence-transformers/all-MiniLM-L6-v2 performs virtually identically (0.7385) while distilling from a much smaller teacher
- **For Efficiency**: Model2Vec distillation provides significant size reduction with competitive retrieval performance
- **For Code Tasks**: In this evaluation, general-purpose sentence-embedding teachers consistently outperformed code-specialized teachers such as CodeBERT and GraphCodeBERT
|
|
|
|
|
## Methodology
|
|
|
### Evaluation Protocol |
|
- **Dataset**: CodeSearchNet test sets for 6 programming languages |
|
- **Metrics**: NDCG@k, MRR, Recall@k following CodeSearchNet methodology |
|
- **Query Format**: Natural language documentation strings |
|
- **Corpus Format**: Function code strings |
|
- **Evaluation**: Retrieval of correct code for each documentation query |
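With a single relevant function per docstring query, each metric reduces to a function of the correct item's rank in the similarity-sorted corpus. A minimal sketch with toy similarity scores (the function name and values are illustrative, not from the actual pipeline):

```python
import math

def metrics_for_query(scores: list[float], correct_idx: int, k: int = 10) -> dict:
    """NDCG@k, MRR, and Recall@5 when exactly one corpus item is relevant."""
    rank = sum(s > scores[correct_idx] for s in scores) + 1  # 1-based rank of the correct item
    return {
        "ndcg@10": 1.0 / math.log2(rank + 1) if rank <= k else 0.0,  # single relevant doc
        "mrr": 1.0 / rank,
        "recall@5": 1.0 if rank <= 5 else 0.0,
    }

# Toy similarities of one docstring query against a 6-function corpus;
# the paired function is at index 1 and is ranked second.
scores = [0.31, 0.84, 0.90, 0.12, 0.55, 0.47]
result = metrics_for_query(scores, correct_idx=1)
print(result)  # mrr = 0.5, recall@5 = 1.0, ndcg@10 = 1/log2(3) ≈ 0.631
```

Corpus-level numbers in this report are means of these per-query values.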
|
|
|
### Teacher Models Tested |
|
- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (proven baseline) |
|
- [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (general purpose) |
|
- [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) (paraphrase model) |
|
- [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) (code-specialized) |
|
- [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) (graph-aware code model) |
|
- [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (instruction model) |
|
- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (multilingual model) |
|
- [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) (modern embedding model) |
|
- [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) (mixture of experts) |
|
- [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) (code-specialized) |
|
- [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (ColBERT architecture) |
|
- [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) (Mistral-based) |
|
- [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) (code-specialized BGE) |
|
- [Salesforce/SFR-Embedding-Code-2B_R](https://huggingface.co/Salesforce/SFR-Embedding-Code-2B_R) (large code model) |
|
|
|
### Distillation Method |
|
- **Technique**: Model2Vec static embedding generation |
|
- **Parameters**: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True |
|
- **Training Data**: CodeSearchNet comment-code pairs |
|
- **Languages**: Python, JavaScript, Java, PHP, Ruby, Go |
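The core transform above (PCA to 256 dimensions followed by frequency down-weighting) can be sketched in numpy. This is a toy illustration under stated assumptions, not the model2vec implementation: the teacher token-embedding matrix is random, and token frequencies are approximated by a Zipf distribution over vocabulary rank, as in SIF weighting:

```python
import numpy as np

def distill_static_embeddings(token_emb: np.ndarray, pca_dims: int = 256,
                              sif_coefficient: float = 1e-3) -> np.ndarray:
    """PCA-reduce teacher token embeddings, then down-weight frequent tokens."""
    centered = token_emb - token_emb.mean(axis=0)
    # PCA via SVD: project onto the top `pca_dims` principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:pca_dims].T
    # Zipf-approximated token probabilities by rank (rank 1 = most frequent),
    # turned into SIF weights w_i = a / (a + p_i).
    ranks = np.arange(1, len(token_emb) + 1)
    probs = (1.0 / ranks) / (1.0 / ranks).sum()
    weights = sif_coefficient / (sif_coefficient + probs)
    return reduced * weights[:, None]

rng = np.random.default_rng(0)
teacher = rng.standard_normal((1000, 768))  # stand-in for teacher token embeddings
static = distill_static_embeddings(teacher)
print(static.shape)  # (1000, 256)
```

At inference time, a sentence embedding is then just the mean of its tokens' rows in this static table, which is what makes the distilled models so fast.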
|
|
|
--- |
|
|
|
*Report generated on 2025-06-01 08:04:06 using automated analysis pipeline.* |
|
*For questions about methodology or results, please refer to the CodeSearchNet documentation.* |
|
|