# Code-Specialized Model2Vec Distillation Analysis
## 🎯 Executive Summary
This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation.
### Evaluated Models Overview
**Simplified Distillation Models:** 14
**Peer Comparison Models:** 19
**Total Models Analyzed:** 33
### Best Performing Simplified Model: code_model2vec_all_mpnet_base_v2
**Overall CodeSearchNet Performance:**
- **NDCG@10**: 0.7387
- **Mean Reciprocal Rank (MRR)**: 0.7010
- **Recall@5**: 0.8017
- **Mean Rank**: 6.4
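Once distilled, the model loads like any other Model2Vec static model. The sketch below is illustrative, assuming the model has been saved locally under the directory name used in this report; the query and code snippet are made up for the example.

```python
import numpy as np
from model2vec import StaticModel

# Load the best-performing distilled model (local path is illustrative).
model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")

# Embed a natural-language query and a candidate code snippet.
query_vec, code_vec = model.encode([
    "read a JSON file and return its contents as a dict",
    "def load_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
])

# Rank candidates by cosine similarity between query and code embeddings.
score = float(np.dot(query_vec, code_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(code_vec)))
print(f"cosine similarity: {score:.3f}")
```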
## πŸ“Š Comprehensive Model Comparison
### All Simplified Distillation Models Performance
| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.7387 | 0.7010 | 0.8017 | πŸ₯‡ Best |
| code_model2vec_all_MiniLM_L6_v2 | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 0.7385 | 0.7049 | 0.7910 | πŸ₯ˆ 2nd |
| code_model2vec_jina_embeddings_v2_base_code | [jina-embeddings-v2-base-code](https://huggingface.co/jina-embeddings-v2-base-code) | 0.7381 | 0.6996 | 0.8130 | πŸ₯‰ 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) | 0.2779 | 0.2534 | 0.3136 | #14 |
### πŸ“Š Model Specifications Analysis
All distilled models share a 256-dimensional static embedding space; vocabulary size, and therefore parameter count and disk footprint, is inherited from each teacher's tokenizer:
| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |
![Model Specifications](analysis_charts/model_specifications.png)
*Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.*
#### Key Insights from Model Specifications:
- **Vocabulary Size**: Vocabulary sizes vary widely with the teacher tokenizer, from 29,525 to 249,999 tokens (avg: 101,594)
- **Parameter Efficiency**: Models range from 7.6M to 64.0M parameters (avg: 26.0M)
- **Storage Efficiency**: Disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- **Embedding Dimensions**: Consistent 256 dimensions across all models (optimized for efficiency)
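The disk sizes above follow directly from the embedding matrix shape: vocabulary size × embedding dimension × bytes per weight. A quick sanity check, assuming float16 storage (2 bytes per weight), which matches the reported sizes for the standard distilled models (the fine-tuned variant stores additional data):

```python
# Embedding matrix footprint ≈ vocab × dim × bytes/weight (float16 assumed).
vocab, dim, bytes_per_weight = 29_528, 256, 2  # all_mpnet_base_v2 row above
size_mb = vocab * dim * bytes_per_weight / 1024**2
params_m = vocab * dim / 1e6
print(f"{size_mb:.1f} MB, {params_m:.1f}M parameters")  # ≈ 14.4 MB, 7.6M — matching the table
```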
### Key Findings
- **Best Distilled Model**: code_model2vec_all_mpnet_base_v2 (NDCG@10: 0.7387)
- **Least Effective Distilled Model**: code_model2vec_codebert_base (NDCG@10: 0.2779)
- **Performance Range**: 62.4% relative difference in NDCG@10 between the best and worst models
- **Average Performance**: 0.5248 NDCG@10 across the 14 simplified distillation models
## 🎯 Language Performance Radar Charts
### Best Model vs Peer Models Comparison
![Comparative Radar Chart](analysis_charts/comparative_radar.png)
*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.*
### Individual Model Performance by Language
#### code_model2vec_all_mpnet_base_v2 (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.7387
![code_model2vec_all_mpnet_base_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2.png)
#### code_model2vec_all_MiniLM_L6_v2 (Teacher: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) - NDCG@10: 0.7385
![code_model2vec_all_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_MiniLM_L6_v2.png)
#### code_model2vec_jina_embeddings_v2_base_code (Teacher: [jina-embeddings-v2-base-code](https://huggingface.co/jina-embeddings-v2-base-code)) - NDCG@10: 0.7381
![code_model2vec_jina_embeddings_v2_base_code Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v2_base_code.png)
#### code_model2vec_paraphrase_MiniLM_L6_v2 (Teacher: [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)) - NDCG@10: 0.7013
![code_model2vec_paraphrase_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_paraphrase_MiniLM_L6_v2.png)
#### code_model2vec_Reason_ModernColBERT (Teacher: [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)) - NDCG@10: 0.6598
![code_model2vec_Reason_ModernColBERT Radar Chart](analysis_charts/radar_code_model2vec_Reason_ModernColBERT.png)
#### code_model2vec_all_mpnet_base_v2_fine_tuned (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.6147
![code_model2vec_all_mpnet_base_v2_fine_tuned Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2_fine_tuned.png)
#### code_model2vec_bge_m3 (Teacher: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)) - NDCG@10: 0.4863
![code_model2vec_bge_m3 Radar Chart](analysis_charts/radar_code_model2vec_bge_m3.png)
#### code_model2vec_jina_embeddings_v3 (Teacher: [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)) - NDCG@10: 0.4755
![code_model2vec_jina_embeddings_v3 Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v3.png)
#### code_model2vec_nomic_embed_text_v2_moe (Teacher: [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe)) - NDCG@10: 0.4532
![code_model2vec_nomic_embed_text_v2_moe Radar Chart](analysis_charts/radar_code_model2vec_nomic_embed_text_v2_moe.png)
#### code_model2vec_gte_Qwen2_1.5B_instruct (Teacher: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)) - NDCG@10: 0.4238
![code_model2vec_gte_Qwen2_1.5B_instruct Radar Chart](analysis_charts/radar_code_model2vec_gte_Qwen2_15B_instruct.png)
#### code_model2vec_Qodo_Embed_1_1.5B (Teacher: [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B)) - NDCG@10: 0.4101
![code_model2vec_Qodo_Embed_1_1.5B Radar Chart](analysis_charts/radar_code_model2vec_Qodo_Embed_1_15B.png)
#### code_model2vec_graphcodebert_base (Teacher: [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)) - NDCG@10: 0.3420
![code_model2vec_graphcodebert_base Radar Chart](analysis_charts/radar_code_model2vec_graphcodebert_base.png)
#### code_model2vec_Linq_Embed_Mistral (Teacher: [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)) - NDCG@10: 0.2868
![code_model2vec_Linq_Embed_Mistral Radar Chart](analysis_charts/radar_code_model2vec_Linq_Embed_Mistral.png)
#### code_model2vec_codebert_base (Teacher: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)) - NDCG@10: 0.2779
![code_model2vec_codebert_base Radar Chart](analysis_charts/radar_code_model2vec_codebert_base.png)
## πŸ† Peer Model Comparison
![Peer Comparison](analysis_charts/peer_comparison.png)
*Comparison with established code-specialized embedding models using actual evaluation results.*
### Complete Model Ranking
| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | **πŸ”₯ Simplified Distillation** | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | **πŸ”₯ Simplified Distillation** | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | **πŸ”₯ Simplified Distillation** | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | **πŸ”₯ Simplified Distillation** | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | **πŸ”₯ Simplified Distillation** | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | **πŸŽ“ Fine-tuned Distillation** | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | **πŸ”₯ Simplified Distillation** | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | **πŸ”₯ Simplified Distillation** | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | **πŸ”₯ Simplified Distillation** | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | **πŸ”₯ Simplified Distillation** | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | **πŸ”₯ Simplified Distillation** | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | **πŸ”₯ Simplified Distillation** | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | **πŸ”₯ Simplified Distillation** | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | **πŸ”₯ Simplified Distillation** | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |
## πŸ“ˆ Performance Analysis
### Multi-Model Comparison Charts
![Model Comparison](analysis_charts/model_comparison.png)
*Comprehensive comparison across all evaluation metrics.*
### Language Performance Analysis
![Language Heatmap](analysis_charts/language_heatmap.png)
*Performance heatmap showing how different models perform across programming languages.*
### Efficiency Analysis
![Efficiency Analysis](analysis_charts/efficiency_analysis.png)
*Performance vs model size analysis showing the efficiency benefits of distillation.*
## ⚑ Operational Performance Analysis
![Benchmark Performance](analysis_charts/benchmark_performance.png)
*Comprehensive performance benchmarking across multiple operational metrics.*
### Performance Scaling Analysis
![Batch Size Scaling](analysis_charts/batch_size_scaling.png)
*How performance scales with different batch sizes for optimal throughput.*
![Memory Scaling](analysis_charts/memory_scaling.png)
*Memory usage patterns across different batch sizes.*
## πŸ” Language-Specific Analysis
### Performance by Programming Language
| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|--------------------|--------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |
## 🎯 Conclusions and Recommendations
### Teacher Model Analysis
Based on the evaluation results across all simplified distillation models:
1. **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387), with sentence-transformers/all-MiniLM-L6-v2 an effective tie (NDCG@10: 0.7385 and the highest MRR at 0.7049)
2. **Least Effective Teacher**: microsoft/codebert-base (NDCG@10: 0.2779)
3. **Teacher Model Impact**: The choice of teacher changes NDCG@10 by up to 62.4% (relative) between the best and worst teachers
### Recommendations
- **For Production**: Use sentence-transformers/all-mpnet-base-v2 or all-MiniLM-L6-v2 as the teacher; the two are effectively tied on CodeSearchNet, with the MiniLM teacher offering a smaller starting model
- **For Efficiency**: Model2Vec distillation provides a large size reduction (7.6M–64M parameters, 14–122MB on disk) while retaining competitive retrieval performance
- **For Code Tasks**: In this distillation setting, strong general-purpose teachers outperform the code-specialized teachers tested (CodeBERT, GraphCodeBERT); code specialization alone does not guarantee a better static student
## πŸ“„ Methodology
### Evaluation Protocol
- **Dataset**: CodeSearchNet test sets for 6 programming languages
- **Metrics**: NDCG@k, MRR, Recall@k following CodeSearchNet methodology
- **Query Format**: Natural language documentation strings
- **Corpus Format**: Function code strings
- **Evaluation**: Retrieval of correct code for each documentation query
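Because each query has exactly one relevant item, all three reported metrics reduce to simple functions of the rank of the correct snippet. The sketch below is illustrative rather than the actual evaluation code: with a single relevant document, IDCG = 1, so NDCG@k collapses to 1/log2(rank + 1).

```python
import math

def retrieval_metrics(ranks: list[int], k: int = 10) -> dict[str, float]:
    """Single-relevant-item metrics; `ranks` holds the 1-based rank of the
    correct code snippet for each documentation query."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    recall_at_k = sum(r <= k for r in ranks) / n
    # With one relevant item per query, IDCG = 1 and NDCG@k = 1 / log2(rank + 1).
    ndcg_at_k = sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / n
    return {"MRR": mrr, f"Recall@{k}": recall_at_k, f"NDCG@{k}": ndcg_at_k}

# Example: ranks of the correct snippet for five queries (k=10 as in the report).
print(retrieval_metrics([1, 2, 5, 12, 3], k=10))
```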
### Teacher Models Tested
- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (proven baseline)
- [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (general purpose)
- [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) (paraphrase model)
- [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) (code-specialized)
- [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) (graph-aware code model)
- [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (instruction model)
- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (multilingual model)
- [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) (modern embedding model)
- [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) (mixture of experts)
- [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) (code-specialized)
- [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (ColBERT architecture)
- [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) (Mistral-based)
- [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) (code-specialized BGE)
- [Salesforce/SFR-Embedding-Code-2B_R](https://huggingface.co/Salesforce/SFR-Embedding-Code-2B_R) (large code model)
### Distillation Method
- **Technique**: Model2Vec static embedding generation
- **Parameters**: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- **Training Data**: CodeSearchNet comment-code pairs
- **Languages**: Python, JavaScript, Java, PHP, Ruby, Go
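A minimal sketch of this distillation step with the model2vec library is shown below. The teacher name and output path are examples from this report; the keyword arguments follow the public `distill` API, but exact parameter names (e.g. `apply_zipf` vs. an explicit SIF coefficient argument in newer releases) vary between model2vec versions.

```python
from model2vec.distill import distill

# Distill the teacher into a static embedding model (Model2Vec).
static_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",  # teacher model
    pca_dims=256,      # PCA dims = 256, as reported above
    apply_zipf=True,   # Zipf/SIF-style frequency weighting enabled
)

# Save for later use as code_model2vec_all_mpnet_base_v2.
static_model.save_pretrained("code_model2vec_all_mpnet_base_v2")
```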
---
*Report generated on 2025-06-01 08:04:06 using automated analysis pipeline.*
*For questions about methodology or results, please refer to the CodeSearchNet documentation.*