---
base_model: sentence-transformers/all-mpnet-base-v2
library_name: distiller
license: apache-2.0
license_name: apache-2.0
license_link: LICENSE
model_name: codemalt
tags:
- code-search
- code-embeddings
- model2vec
- distillation
- sentence-transformers
- static-embeddings
- tokenlearn
datasets:
- code-search-net/code_search_net
- sentence-transformers/codesearchnet
metrics:
- ndcg@10
- mrr
- recall@5
language:
- code
pipeline_tag: feature-extraction
---

# CodeMalt

**CodeMalt** is a high-performance, code-specialized static embedding model created through Model2Vec distillation of `sentence-transformers/all-mpnet-base-v2`. This model achieves **73.87% NDCG@10** on CodeSearchNet benchmarks while being **14x smaller** and **15,021x faster** than the original teacher model.

## 🏆 Performance Highlights

- **NDCG@10**: 0.7387 (best among all distilled models)
- **Mean Reciprocal Rank (MRR)**: 0.7010
- **Recall@5**: 0.8017
- **Model Size**: 7.6M parameters (vs 109M original)
- **Inference Speed**: 15,021x faster than the teacher model
- **Memory Usage**: <1GB RAM (vs 8+ GB VRAM for the original)

## 📊 CodeSearchNet Performance by Language

| Language | NDCG@10 | MRR | Recall@5 |
|----------|---------|-----|----------|
| **Python** | 0.7899 | 0.7501 | 0.8421 |
| **JavaScript** | 0.7234 | 0.6801 | 0.7895 |
| **Java** | 0.7456 | 0.7089 | 0.8123 |
| **PHP** | 0.7198 | 0.6856 | 0.7834 |
| **Ruby** | 0.7312 | 0.6934 | 0.7912 |
| **Go** | 0.7223 | 0.6876 | 0.7913 |

## 🔧 Model Details

- **Teacher Model**: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- **Distillation Method**: Model2Vec + Tokenlearn training on CodeSearchNet
- **Architecture**: Static embeddings (no neural network inference required)
- **Embedding Dimensions**: 256
- **Training Data**: CodeSearchNet code-comment pairs across 6 programming languages
- **Optimization**: PCA dimensionality reduction + SIF weighting + Zipf regularization
- **Vocabulary Size**: 29,528
- **Parameters**: 7.6M
- **Size**: 14.4MB
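## 💻 Usage

Because this is a Model2Vec static embedding model, it can be loaded with the `model2vec` package's `StaticModel` class; no neural network forward pass is needed at inference time. Below is a minimal sketch (the model path is a placeholder; substitute the published repository id or a local copy of the distilled model):

```python
import numpy as np
from model2vec import StaticModel

# Placeholder path: replace with the published repository id or a local model directory.
model = StaticModel.from_pretrained("path/to/codemalt")

query = "read a file line by line"
snippets = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "function add(a, b) { return a + b; }",
]

# Static embeddings: token lookup + pooling, no GPU required.
query_emb = model.encode([query])   # shape (1, 256)
code_embs = model.encode(snippets)  # shape (2, 256)

# Rank candidate snippets by cosine similarity to the query.
sims = (query_emb @ code_embs.T) / (
    np.linalg.norm(query_emb, axis=1, keepdims=True) * np.linalg.norm(code_embs, axis=1)
)
print(sims)
```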
## 🎯 Distiller: Code-Specialized Embedding Toolkit

**Distiller** is an independent toolkit built upon [Model2Vec](https://github.com/MinishLab/model2vec) and [Tokenlearn](https://github.com/MinishLab/tokenlearn) for creating code-specialized static embeddings. This package provides a complete pipeline for distilling, training, and evaluating efficient embedding models optimized for code-related tasks.

> **Note**: This is an independent research project that builds upon the Model2Vec framework. We are not affiliated with the MinishLab Model2Vec team, but acknowledge their excellent foundational work.

> [!IMPORTANT]
> Check out the comprehensive [REPORT.md](REPORT.md) file generated by this toolkit for detailed performance analysis, model comparisons, and evaluation results across different programming languages.

> [!WARNING]
> **Research Finding**: See [NOTES.md](NOTES.md) for critical analysis showing that C4 fine-tuning significantly degraded performance (-16.8% NDCG@10) compared to simple Model2Vec distillation. **Recommendation**: Use basic distillation without additional training for optimal code embedding performance.

The **distiller** package provides a complete pipeline for:

1. **Distilling code-specialized embeddings** from large sentence transformer models using Model2Vec
2. **Comprehensive evaluation** on CodeSearchNet benchmarks across 6 programming languages
3. **Performance benchmarking** (speed, memory, model size analysis)
4. **Advanced training** with Tokenlearn for enhanced code understanding
5. **Analysis and reporting** with visualizations and comparison charts
6. **Cloud-scale processing** with Beam support for distributed execution

### Key Benefits

- **🚀 Performance**: Up to 500x faster inference with 50x smaller models
- **📊 Code-Optimized**: Specialized for code search, classification, and similarity tasks
- **🔬 Comprehensive**: Full evaluation pipeline with CodeSearchNet metrics
- **☁️ Scalable**: Local and cloud execution with Beam support
- **📈 Analytical**: Rich reporting with performance charts and comparisons

## 🚀 Quick Start

### Installation

```bash
# Install with all dependencies
pip install model2vec[train] torch transformers datasets sentence-transformers
pip install typer pydantic plotly matplotlib seaborn

# Install the distiller package (assuming local development)
pip install -e .
```

### Basic Usage

```bash
# Simple distillation of a teacher model
distiller distill

# Distillation with advanced CodeSearchNet training
distiller distill --train

# Evaluate distilled models on CodeSearchNet
distiller evaluate

# Generate comprehensive analysis report
distiller analyze
```

### Python API

```python
from distiller import distill, evaluate, analyze

# Distill a specific model
results = distill.run_local_distillation(
    teacher_models=["microsoft/codebert-base"],
    enable_training=True,  # Include CodeSearchNet fine-tuning
    pca_dims=256,
)

# Evaluate on CodeSearchNet
evaluation_results = evaluate.run_evaluation(
    models=["."],
    max_queries=1000,
    languages=["python", "javascript", "java", "go", "php", "ruby"],
)

# Generate analysis report
analyze.main(
    results_dir="./code_model2vec/evaluation_results",
    model_name="code_model2vec_distilled_models",
    output="ANALYSIS_REPORT.md",
)
```

## 📋 Features

### 🔬 Distillation Engine

- **Multiple Teacher Models**: Support for 15+ pre-configured teacher models, including:
  - Code-specialized: `microsoft/codebert-base`, `BAAI/bge-code-v1`, `Salesforce/SFR-Embedding-Code-2B_R`
  - General-purpose: `sentence-transformers/all-mpnet-base-v2`, `BAAI/bge-m3`
  - Instruction-tuned: `Alibaba-NLP/gte-Qwen2-1.5B-instruct`
- **Advanced Training Pipeline**: Optional Tokenlearn-based training following the POTION approach:
  1. Model2Vec distillation (basic static embeddings)
  2. Feature extraction using sentence transformers
  3. Tokenlearn training on CodeSearchNet data
  4. Post-training re-regularization (PCA + SIF weighting)
- **Robust Model Handling**: Automatic compatibility checks and specialized handling for problematic models
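Step 1 of this pipeline corresponds to a plain Model2Vec distillation call. For reference, here is a minimal sketch using the public `model2vec` API directly; the `pca_dims` value mirrors the toolkit default of 256, while the output path is only illustrative and the internals of `distiller distill` may differ:

```python
from model2vec.distill import distill

# Step 1: distill a teacher sentence transformer into static embeddings.
# pca_dims mirrors the distiller default; other distiller internals may differ.
m2v_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",
    pca_dims=256,
)

# Save in Model2Vec format (illustrative path following the directory layout below).
m2v_model.save_pretrained("code_model2vec/base/code_model2vec_all-mpnet-base-v2")
```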
### 📊 Evaluation Framework

- **CodeSearchNet Evaluation**: Standard code search benchmarks across 6 programming languages
- **Retrieval Metrics**: NDCG@k, MRR, Recall@k, Mean/Median Rank
- **Performance Benchmarking**:
  - Model size analysis (disk usage, parameters, memory footprint)
  - Inference speed testing (various batch sizes and text lengths)
  - CPU vs GPU performance comparison
  - Memory scaling analysis
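To make these metrics concrete, the sketch below computes NDCG@k, MRR, and Recall@5 for the standard CodeSearchNet setting in which each query (a docstring) has exactly one relevant code snippet; the actual `evaluate` module may implement this differently:

```python
import numpy as np


def retrieval_metrics(query_embs: np.ndarray, code_embs: np.ndarray, k: int = 10) -> dict:
    """Assumes query i's single relevant snippet is code snippet i."""
    # Cosine similarity between every query and every candidate snippet.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    sims = q @ c.T

    # Rank (1 = best) of the correct snippet for each query.
    order = np.argsort(-sims, axis=1)
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(len(order))])

    # With a single relevant item, IDCG = 1, so NDCG@k reduces to 1 / log2(rank + 1).
    ndcg_k = float(np.where(ranks <= k, 1.0 / np.log2(ranks + 1), 0.0).mean())
    mrr = float((1.0 / ranks).mean())
    recall_5 = float((ranks <= 5).mean())
    return {f"ndcg@{k}": ndcg_k, "mrr": mrr, "recall@5": recall_5}
```

Feeding it `model.encode(docstrings)` and `model.encode(code_snippets)` yields scores of the same form as the per-language table above.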
### 📈 Analysis & Reporting

- **Comprehensive Reports**: Automated generation of analysis reports with:
  - Performance comparison tables
  - Language-specific radar charts
  - Efficiency analysis (performance vs model size)
  - Peer model comparisons
- **Rich Visualizations**: Plotly and Matplotlib charts including:
  - Multi-model performance heatmaps
  - Batch size scaling curves
  - Memory usage patterns
  - Model efficiency scatter plots

### ☁️ Cloud Integration

- **Beam Support**: Distributed execution on Beam cloud infrastructure
- **Volume Management**: Persistent storage with checkpoint support
- **Resource Optimization**: GPU-optimized configurations (A100-40G default)
- **Automatic Syncing**: Seamless model and result synchronization

## 🛠️ CLI Reference

### `distiller distill`

Distill teacher models into efficient static embeddings.

```bash
distiller distill [OPTIONS]

Options:
  --use-beam              Use Beam cloud for distillation
  --train                 Enable advanced training (CodeSearchNet fine-tuning)
  --teacher-models TEXT   Specific teacher models to distill (can be repeated)
  --pca-dims INTEGER      PCA dimensions (default: 256)
  --clear-cache           Clear HuggingFace cache for problematic models
```

**Examples:**

```bash
# Basic distillation of all default models
distiller distill

# Train specific models with advanced CodeSearchNet fine-tuning
distiller distill --train --teacher-models microsoft/codebert-base --teacher-models BAAI/bge-code-v1

# Use Beam cloud with custom PCA dimensions
distiller distill --use-beam --train --pca-dims 512
```

### `distiller evaluate`

Evaluate models on CodeSearchNet benchmarks with performance analysis.

```bash
distiller evaluate [OPTIONS]

Options:
  --use-beam              Use Beam cloud for evaluation
  --skip-third-party      Skip third-party models evaluation
  --skip-benchmark        Skip performance benchmarking
  --max-queries INTEGER   Maximum queries per language (default: 100)
```

**Examples:**

```bash
# Comprehensive evaluation with benchmarking
distiller evaluate --max-queries 1000

# Quick evaluation without performance benchmarks
distiller evaluate --skip-benchmark --max-queries 100

# Cloud-based evaluation
distiller evaluate --use-beam --max-queries 500
```

### `distiller analyze`

Generate comprehensive analysis reports with visualizations.

```bash
distiller analyze [OPTIONS]

Options:
  --results-dir PATH   Results directory (default: code_model2vec/evaluation_results)
  --model-name TEXT    Model name for analysis (default: gte_qwen2_m2v_code (Ours))
  --output PATH        Output report file (default: REPORT.md)
  --export-csv PATH    Export results to CSV file
```

**Examples:**

```bash
# Generate standard analysis report
distiller analyze

# Custom analysis with CSV export
distiller analyze --model-name "my_distilled_model" --output custom_report.md --export-csv results.csv

# Analyze specific results directory
distiller analyze --results-dir ./custom_results --output analysis.md
```

## 📁 Directory Structure

The distiller uses a standardized directory structure:

```
code_model2vec/
├── base/                    # Basic distilled models (Step 1)
│   └── code_model2vec_{teacher_name}/
├── final/                   # Final models (copied from base or after training)
│   └── code_model2vec_{teacher_name}[_fine_tuned]/
├── evaluation_results/      # CodeSearchNet evaluation results
│   └── comprehensive_eval_{model}.json
├── benchmark_results/       # Performance benchmark results
├── analysis_results/        # Analysis reports and charts
│   └── charts/
├── checkpoints/             # Training checkpoints
└── cache/                   # Temporary cache files
```

## ⚙️ Configuration

### Teacher Models

Default supported teacher models (configured in `config.py`):

```python
TEACHER_MODELS = [
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",      # Instruction-tuned
    "BAAI/bge-m3",                              # Multilingual
    "jinaai/jina-embeddings-v3",                # Modern architecture
    "microsoft/codebert-base",                  # Code-specialized
    "microsoft/graphcodebert-base",             # Graph-aware code
    "sentence-transformers/all-mpnet-base-v2",  # General-purpose
    # ... and more
]
```

### Distillation Parameters

```python
# Model2Vec distillation settings
optimal_pca_dims: int = 256
sif_coefficient: float = 1e-3
apply_zipf: bool = True

# Tokenlearn training settings (when --train is enabled)
tokenlearn_dataset: str = "sentence-transformers/codesearchnet"
tokenlearn_text_key: str = "code"  # Use code field for training
```

### Evaluation Settings

```python
# CodeSearchNet evaluation
evaluation_languages = ["python", "java", "javascript", "php", "ruby", "go"]
max_queries_per_language: int = 1000
evaluation_metrics = ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "recall@1", "recall@5", "recall@10"]
```

## 📄 License

This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

This independent research project builds upon several excellent open-source foundations:

- [Model2Vec](https://github.com/MinishLab/model2vec) by MinishLab - Core static embedding distillation framework
- [Tokenlearn](https://github.com/MinishLab/tokenlearn) by MinishLab - Advanced token-level training methodology
- [CodeSearchNet](https://github.com/github/CodeSearchNet) by GitHub - Code search benchmark dataset and evaluation framework
- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) by UKP Lab - Teacher model ecosystem and training framework
- [Beam](https://beam.cloud) - Distributed cloud computing infrastructure
- [Transformers](https://github.com/huggingface/transformers) by Hugging Face - Model loading and tokenization utilities

**Note**: While this toolkit leverages Model2Vec and Tokenlearn, it is an independent research contribution and is not officially associated with or endorsed by the MinishLab team.