# Code-Specialized Model2Vec Distillation Analysis

## 🎯 Executive Summary

This report compares 14 Model2Vec static embedding models, each distilled from a different teacher configuration, and evaluates them on CodeSearchNet code retrieval alongside 19 peer models.

### Evaluated Models Overview

**Simplified Distillation Models:** 14
**Peer Comparison Models:** 19
**Total Models Analyzed:** 33

### Best Performing Simplified Model: code_model2vec_all_mpnet_base_v2

**Overall CodeSearchNet Performance:**
- **NDCG@10**: 0.7387
- **Mean Reciprocal Rank (MRR)**: 0.7010
- **Recall@5**: 0.8017
- **Mean Rank**: 6.4

## 📊 Comprehensive Model Comparison

### All Simplified Distillation Models Performance

| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.7387 | 0.7010 | 0.8017 | 🥇 Best |
| code_model2vec_all_MiniLM_L6_v2 | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 0.7385 | 0.7049 | 0.7910 | 🥈 2nd |
| code_model2vec_jina_embeddings_v2_base_code | [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code) | 0.7381 | 0.6996 | 0.8130 | 🥉 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) | 0.2779 | 0.2534 | 0.3136 | #14 |


### 📊 Model Specifications Analysis

All distilled models share a 256-dimensional embedding space. Vocabulary size differs by teacher, and parameter count and disk size scale with it:

| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |


![Model Specifications](analysis_charts/model_specifications.png)

*Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.*

#### Key Insights from Model Specifications:


- **Vocabulary Size**: Vocabulary sizes range from 29,525 to 249,999 tokens (avg: 101,594), a roughly 8.5x spread across teachers
- **Parameter Efficiency**: Models range from 7.6M to 64.0M parameters (avg: 26.0M)
- **Storage Efficiency**: Disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- **Embedding Dimensions**: Consistent 256 dimensions across all models (optimized for efficiency)
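
The parameter counts and disk sizes in the table can be reproduced from vocabulary size and embedding dimension alone. The sketch below does so under assumptions the report does not state: that each model stores a single `vocab_size × 256` embedding matrix, that disk sizes are in MiB, and that weights take 2 bytes (float16), or 4 bytes for the fine-tuned model. The helper names are illustrative.

```python
# Reproduce the "Parameters" and "Disk Size" columns from vocabulary size.
# Assumptions (inferred from the numbers, not stated in the report):
#   - a model is a single vocab_size x EMBEDDING_DIM matrix
#   - sizes are reported in MiB (2**20 bytes)
#   - weights are float16 (2 bytes), except the fine-tuned model (4 bytes)

EMBEDDING_DIM = 256
MIB = 2**20


def parameter_count(vocab_size: int, dim: int = EMBEDDING_DIM) -> int:
    """Number of weights in a vocab_size x dim embedding matrix."""
    return vocab_size * dim


def disk_size_mib(vocab_size: int, bytes_per_weight: int = 2) -> float:
    """Approximate on-disk size of the embedding matrix in MiB."""
    return parameter_count(vocab_size) * bytes_per_weight / MIB


# (vocabulary size, assumed bytes per weight), vocabulary sizes from the table
models = {
    "all_mpnet_base_v2": (29_528, 2),
    "bge_m3": (249_999, 2),
    "all_mpnet_base_v2_fine_tuned": (36_624, 4),
}

for name, (vocab, nbytes) in models.items():
    params_m = parameter_count(vocab) / 1e6
    size = disk_size_mib(vocab, nbytes)
    print(f"{name}: {params_m:.1f}M parameters, {size:.1f} MiB")
```

If these assumptions hold, the larger disk size of the fine-tuned model relative to its parameter count is explained by wider weights rather than by extra parameters.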


### Key Findings


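- **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2, distilled as code_model2vec_all_mpnet_base_v2 (NDCG@10: 0.7387)
- **Least Effective Teacher**: microsoft/codebert-base, distilled as code_model2vec_codebert_base (NDCG@10: 0.2779)
- **Performance Range**: The worst model scores 62.4% lower than the best in relative terms (0.4608 absolute NDCG@10)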
- **Average Performance**: 0.5248 NDCG@10
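
The range and average above can be recomputed from the NDCG@10 column of the performance table. This is a small check, assuming the 62.4% figure is the gap between best and worst expressed relative to the best score; the variable names are illustrative.

```python
from statistics import mean

# NDCG@10 of the 14 distilled models, copied from the performance table
ndcg_at_10 = [
    0.7387, 0.7385, 0.7381, 0.7013, 0.6598, 0.6147, 0.4863,
    0.4755, 0.4532, 0.4238, 0.4101, 0.3420, 0.2868, 0.2779,
]

best, worst = max(ndcg_at_10), min(ndcg_at_10)

# Gap between best and worst, relative to the best score
relative_gap = (best - worst) / best
average = mean(ndcg_at_10)

print(f"relative gap: {relative_gap:.1%}")
print(f"average NDCG@10: {average:.4f}")
```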


## 🎯 Language Performance Radar Charts

### Best Model vs Peer Models Comparison

![Comparative Radar Chart](analysis_charts/comparative_radar.png)

*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.*

### Individual Model Performance by Language

#### code_model2vec_all_mpnet_base_v2 (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.7387

![code_model2vec_all_mpnet_base_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2.png)

#### code_model2vec_all_MiniLM_L6_v2 (Teacher: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) - NDCG@10: 0.7385

![code_model2vec_all_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_MiniLM_L6_v2.png)

#### code_model2vec_jina_embeddings_v2_base_code (Teacher: [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code)) - NDCG@10: 0.7381

![code_model2vec_jina_embeddings_v2_base_code Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v2_base_code.png)

#### code_model2vec_paraphrase_MiniLM_L6_v2 (Teacher: [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)) - NDCG@10: 0.7013

![code_model2vec_paraphrase_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_paraphrase_MiniLM_L6_v2.png)

#### code_model2vec_Reason_ModernColBERT (Teacher: [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)) - NDCG@10: 0.6598

![code_model2vec_Reason_ModernColBERT Radar Chart](analysis_charts/radar_code_model2vec_Reason_ModernColBERT.png)

#### code_model2vec_all_mpnet_base_v2_fine_tuned (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.6147

![code_model2vec_all_mpnet_base_v2_fine_tuned Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2_fine_tuned.png)

#### code_model2vec_bge_m3 (Teacher: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)) - NDCG@10: 0.4863

![code_model2vec_bge_m3 Radar Chart](analysis_charts/radar_code_model2vec_bge_m3.png)

#### code_model2vec_jina_embeddings_v3 (Teacher: [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)) - NDCG@10: 0.4755

![code_model2vec_jina_embeddings_v3 Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v3.png)

#### code_model2vec_nomic_embed_text_v2_moe (Teacher: [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe)) - NDCG@10: 0.4532

![code_model2vec_nomic_embed_text_v2_moe Radar Chart](analysis_charts/radar_code_model2vec_nomic_embed_text_v2_moe.png)

#### code_model2vec_gte_Qwen2_1.5B_instruct (Teacher: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)) - NDCG@10: 0.4238

![code_model2vec_gte_Qwen2_1.5B_instruct Radar Chart](analysis_charts/radar_code_model2vec_gte_Qwen2_15B_instruct.png)

#### code_model2vec_Qodo_Embed_1_1.5B (Teacher: [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B)) - NDCG@10: 0.4101

![code_model2vec_Qodo_Embed_1_1.5B Radar Chart](analysis_charts/radar_code_model2vec_Qodo_Embed_1_15B.png)

#### code_model2vec_graphcodebert_base (Teacher: [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)) - NDCG@10: 0.3420

![code_model2vec_graphcodebert_base Radar Chart](analysis_charts/radar_code_model2vec_graphcodebert_base.png)

#### code_model2vec_Linq_Embed_Mistral (Teacher: [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)) - NDCG@10: 0.2868

![code_model2vec_Linq_Embed_Mistral Radar Chart](analysis_charts/radar_code_model2vec_Linq_Embed_Mistral.png)

#### code_model2vec_codebert_base (Teacher: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)) - NDCG@10: 0.2779

![code_model2vec_codebert_base Radar Chart](analysis_charts/radar_code_model2vec_codebert_base.png)



## 🏆 Peer Model Comparison

![Peer Comparison](analysis_charts/peer_comparison.png)

*Comparison with established general-purpose, code-specific, and Model2Vec embedding models, using actual evaluation results.*

### Complete Model Ranking

| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | **🔥 Simplified Distillation** | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | **🔥 Simplified Distillation** | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | **🔥 Simplified Distillation** | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | **🔥 Simplified Distillation** | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | **🔥 Simplified Distillation** | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | **🎓 Fine-tuned Distillation** | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | **🔥 Simplified Distillation** | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | **🔥 Simplified Distillation** | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | **🔥 Simplified Distillation** | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | **🔥 Simplified Distillation** | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | **🔥 Simplified Distillation** | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | **🔥 Simplified Distillation** | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | **🔥 Simplified Distillation** | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | **🔥 Simplified Distillation** | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |
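
One way to read the ranking is to ask how much of each teacher's NDCG@10 survives distillation. The sketch below computes that ratio for a few teacher and distilled pairs copied from the table; the `retention` measure is an illustrative summary, not a metric produced by the report's pipeline.

```python
# NDCG@10 copied from the ranking table: teacher -> (teacher score, distilled score)
scores = {
    "sentence-transformers/all-mpnet-base-v2": (0.9477, 0.7387),
    "sentence-transformers/all-MiniLM-L6-v2": (0.9255, 0.7385),
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct": (0.9729, 0.4238),
    "Linq-AI-Research/Linq-Embed-Mistral": (0.9080, 0.2868),
    "microsoft/codebert-base": (0.1051, 0.2779),
}

# Fraction of the teacher's NDCG@10 that the distilled model reaches
retention = {
    teacher: distilled / original
    for teacher, (original, distilled) in scores.items()
}

for teacher, ratio in sorted(retention.items(), key=lambda item: item[1], reverse=True):
    print(f"{teacher}: {ratio:.0%} of teacher NDCG@10")
```

A ratio above 1, as for microsoft/codebert-base, means the distilled model outscored its teacher on this benchmark; the strongest teacher in the table retains well under half of its score.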


## 📈 Performance Analysis

### Multi-Model Comparison Charts

![Model Comparison](analysis_charts/model_comparison.png)

*Comprehensive comparison across all evaluation metrics.*

### Language Performance Analysis

![Language Heatmap](analysis_charts/language_heatmap.png)

*Performance heatmap showing how different models perform across programming languages.*

### Efficiency Analysis

![Efficiency Analysis](analysis_charts/efficiency_analysis.png)

*Performance vs model size analysis showing the efficiency benefits of distillation.*



## ⚡ Operational Performance Analysis

![Benchmark Performance](analysis_charts/benchmark_performance.png)

*Comprehensive performance benchmarking across multiple operational metrics.*

### Performance Scaling Analysis

![Batch Size Scaling](analysis_charts/batch_size_scaling.png)

*How performance scales with different batch sizes for optimal throughput.*

![Memory Scaling](analysis_charts/memory_scaling.png)

*Memory usage patterns across different batch sizes.*



## 🔍 Language-Specific Analysis

### Performance by Programming Language

| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|--------------------|--------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |


## 🎯 Conclusions and Recommendations

### Teacher Model Analysis

Based on the evaluation results across all simplified distillation models:


1. **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387), narrowly ahead of sentence-transformers/all-MiniLM-L6-v2 (0.7385)
2. **Least Effective Teacher**: microsoft/codebert-base (NDCG@10: 0.2779)
3. **Teacher Model Impact**: NDCG@10 falls by 62.4% (relative) from the best teacher to the worst

### Recommendations

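- **For Production**: Use sentence-transformers/all-mpnet-base-v2 as the teacher for the best NDCG@10 (0.7387); sentence-transformers/all-MiniLM-L6-v2 is effectively tied (0.7385) and has the highest MRR (0.7049)
- **For Efficiency**: The best distilled model has 7.6M parameters (14.4MB on disk) and retains about 78% of its teacher's NDCG@10 (0.7387 vs 0.9477)
- **For Code Tasks**: Code specialization of the teacher did not predict distilled quality in these results: the two best distillations come from general-purpose sentence-transformers teachers, while the CodeBERT and GraphCodeBERT distillations rank 14th and 12th of 14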


## 📄 Methodology

### Evaluation Protocol
- **Dataset**: CodeSearchNet test sets for 6 programming languages
- **Metrics**: NDCG@10, MRR, Recall@5, and mean rank, following CodeSearchNet methodology
- **Query Format**: Natural language documentation strings
- **Corpus Format**: Function code strings
- **Evaluation**: Retrieval of correct code for each documentation query
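
The metrics above can be sketched for the case where each documentation query has exactly one correct code snippet, which is an assumption about the setup rather than something the report spells out. In that case every metric depends only on the 1-based rank of the correct snippet. The function names are illustrative and do not come from the evaluation pipeline.

```python
import math
from statistics import mean


def ndcg_at_k(rank: int, k: int = 10) -> float:
    """NDCG@k for a query with one relevant document at 1-based `rank`.

    The ideal DCG is 1.0 (relevant document at rank 1), so NDCG equals DCG.
    """
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def reciprocal_rank(rank: int) -> float:
    """Reciprocal of the rank of the correct document."""
    return 1.0 / rank


def recall_at_k(rank: int, k: int = 5) -> float:
    """1.0 if the correct document appears in the top k, else 0.0."""
    return 1.0 if rank <= k else 0.0


def evaluate(ranks: list[int]) -> dict[str, float]:
    """Average the per-query metrics over all queries."""
    return {
        "ndcg@10": mean(ndcg_at_k(r) for r in ranks),
        "mrr": mean(reciprocal_rank(r) for r in ranks),
        "recall@5": mean(recall_at_k(r) for r in ranks),
        "mean_rank": mean(ranks),
    }


# Hypothetical ranks of the correct snippet for four queries
print(evaluate([1, 2, 4, 12]))
```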

### Teacher Models Tested
- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (proven baseline)
- [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (general purpose)
- [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) (paraphrase model)
- [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) (code-specialized)
- [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) (graph-aware code model)
- [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (instruction model)
- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (multilingual model)
- [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) (modern embedding model)
- [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) (mixture of experts)
- [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) (code-specialized)
- [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (ColBERT architecture)
- [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) (Mistral-based)
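- [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) (code-specialized BGE; no distilled result reported in this document)
- [Salesforce/SFR-Embedding-Code-2B_R](https://huggingface.co/Salesforce/SFR-Embedding-Code-2B_R) (large code model; no distilled result reported in this document)
- [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code) (code-specialized; appears in the results tables)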

### Distillation Method
- **Technique**: Model2Vec static embedding generation
- **Parameters**: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- **Training Data**: CodeSearchNet comment-code pairs
- **Languages**: Python, JavaScript, Java, PHP, Ruby, Go
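
To make the SIF and Zipf parameters concrete: a static model has no corpus frequencies, so token probability can be approximated from vocabulary rank using Zipf's law, and each token vector is then scaled by `a / (a + p)` with `a = 1e-3`. The sketch below illustrates only that weighting step in pure Python, assuming the vocabulary is ordered from most to least frequent. `zipf_sif_weights` and `apply_weights` are hypothetical helpers, not Model2Vec APIs, and the library's exact formula may differ. The PCA step is not shown.

```python
def zipf_sif_weights(vocab_size: int, sif_coefficient: float = 1e-3) -> list[float]:
    """Per-token SIF weights using Zipf-estimated token probabilities.

    Assumes token index 0 is the most frequent token.
    """
    # Zipf's law: probability of the token at rank r is proportional to 1 / r
    inverse_ranks = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(inverse_ranks)
    probabilities = [value / total for value in inverse_ranks]
    # SIF weighting: frequent tokens (large p) are down-weighted
    return [sif_coefficient / (sif_coefficient + p) for p in probabilities]


def apply_weights(embeddings: list[list[float]], weights: list[float]) -> list[list[float]]:
    """Scale each token vector by its weight."""
    return [[w * x for x in row] for row, w in zip(embeddings, weights)]


weights = zipf_sif_weights(vocab_size=1000)
# The most frequent token receives the smallest weight, the rarest the largest
print(f"rank 1: {weights[0]:.4f}, rank 1000: {weights[-1]:.4f}")
```

Under this scheme, frequent tokens contribute less to a mean-pooled embedding than rare, more informative ones.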

---

*Report generated on 2025-06-01 08:04:06 using automated analysis pipeline.*
*For questions about methodology or results, please refer to the CodeSearchNet documentation.*