pretty_name: Repository Learning Models
tags:
- code-review
- contrastive-learning
- sentence-transformers
- lora
- fine-tuned
- nextcoder
- faiss-index
- pytorch
- transformers
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: microsoft/NextCoder-7B
inference: true
Model Card for Repository Learning Models
This model card describes a multi-modal AI system for context-aware code review that combines contrastive learning, fine-tuning, and semantic indexing to understand repository-specific patterns and provide code review assistance.
Model Details
Model Description
The Repository Learning Models consist of three specialized components that work together to provide context-aware code review assistance:
- Contrastive Learning Model: A fine-tuned SentenceTransformer that learns semantic relationships between code files based on Git change patterns
- Fine-Tuned Review Model: A LoRA-adapted NextCoder-7B model specialized for generating repository-specific code review comments
- Semantic Index: A FAISS-powered search system with LLM-generated function descriptions for rapid code navigation
- Developed by: Milos Kotlar
- Model type: Multi-modal (Text Generation + Embedding + Retrieval)
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: microsoft/NextCoder-7B (for review generation), sentence-transformers/all-MiniLM-L6-v2 (for embeddings)
Model Sources
- Repository: https://github.com/kotlarmilos/repository-learning
- Demo: https://huggingface.co/spaces/kotlarmilos/repository-learning, https://huggingface.co/spaces/kotlarmilos/repository-grounding
- Dataset: https://huggingface.co/datasets/kotlarmilos/repository-learning
Uses
Direct Use
The models are designed for:
- Automated Code Review: Generate contextual review comments for pull requests
- Anomaly Detection: Identify unusual file change patterns that may indicate architectural issues
- Code Search: Find relevant functions and documentation using semantic similarity (see the sketch after this list)
- Team Onboarding: Help new developers understand repository patterns and conventions
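As an illustration of the Code Search use above, the sketch below indexes a handful of function descriptions with FAISS and retrieves them by semantic similarity. The descriptions are placeholders and the base all-MiniLM-L6-v2 checkpoint stands in for the fine-tuned contrastive embedding model; in practice you would query the released semantic index rather than building one from scratch.

```python
# Minimal semantic code-search sketch. The function descriptions are
# placeholders and the base checkpoint stands in for the fine-tuned
# contrastive embedding model.
import faiss
from sentence_transformers import SentenceTransformer

descriptions = [
    "Parses a pull request diff and extracts the changed function names",
    "Computes the moving average of a numeric time series",
    "Validates user input before writing it to the database",
]

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = encoder.encode(descriptions, convert_to_numpy=True)
faiss.normalize_L2(embeddings)  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = encoder.encode(["Where is the diff parsing logic?"], convert_to_numpy=True)
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)  # top-2 matches
for score, idx in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {descriptions[idx]}")
```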
Downstream Use
The models can be integrated into:
- CI/CD Pipelines: GitHub Actions, Azure DevOps, Jenkins workflows
- IDE Extensions: VS Code, IntelliJ plugins for real-time review assistance
- Code Review Tools: Integration with GitHub, GitLab, Bitbucket review interfaces
- Documentation Systems: Automatic code documentation and explanation generation
Out-of-Scope Use
The models are not intended for:
- Security Vulnerability Detection: While they may catch some issues, dedicated security tools should be used
- Performance Analysis: Models don't analyze runtime performance or optimization
- Cross-Language Translation: The models are optimized for review within a single programming language, not for translating code between languages
- Legal or Compliance Review: Cannot assess licensing or regulatory compliance issues
Bias, Risks, and Limitations
Technical Limitations
- Repository Specificity: Models are trained on specific open-source repositories and may not generalize to very different codebases or proprietary patterns
- Language Coverage: Primary focus on 7 major programming languages (Python, JavaScript, TypeScript, Java, C++, C#, C)
- Context Window: The fine-tuned review model accepts at most 2048 input tokens
- Temporal Bias: Training data represents patterns from 2024-2025 timeframe
Social and Ethical Considerations
- Review Style Bias: Models learn from existing human review patterns, potentially perpetuating team-specific biases or exclusionary language
- Open Source Bias: Training primarily on open-source repositories may not reflect enterprise development patterns
- Developer Experience Bias: May favor review styles from experienced developers, potentially alienating junior developers
Recommendations
- Human Oversight: Use AI suggestions as guidance, not replacement for human code review
- Bias Monitoring: Regularly evaluate generated reviews for inclusive language and fair treatment across developer experience levels
- Continuous Updating: Retrain models periodically on recent repository activity to maintain relevance
- Domain Adaptation: Fine-tune on organization-specific data when deploying in enterprise environments
How to Get Started with the Model
```bash
# Install dependencies
pip install transformers sentence-transformers faiss-cpu torch
```

```python
# Load the fine-tuned review model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("kotlarmilos/repository-learning-models")
model = AutoModelForCausalLM.from_pretrained(
    "kotlarmilos/repository-learning-models",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a code review
prompt = """Code diff:
+def calculate_average(numbers):
+    return sum(numbers) / len(numbers)

Please write a code review comment:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
review = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(review)
```
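The contrastive embedding component can be used in a similar way. The sketch below is illustrative only: the model path is an assumption (point it at wherever the contrastive SentenceTransformer weights are published), and the file-path inputs are placeholders, since whether the embedder expects paths or file contents depends on how the component was exported.

```python
# Score how related two files are according to the contrastive embedding model.
# NOTE: the model path is an assumption; point it at the published contrastive
# SentenceTransformer artifact. File paths are used as inputs purely for
# illustration, and the embedder may instead expect file contents.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("path/to/contrastive-embedding-model")

file_a = "src/parser/diff_parser.py"
file_b = "tests/test_diff_parser.py"
emb = embedder.encode([file_a, file_b], convert_to_tensor=True)

similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"Co-change similarity: {similarity:.3f}")
# Unusually low similarity between files that are changed together in a pull
# request can serve as a signal for the anomaly-detection use case above.
```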
Training Details
Training Data
Training data consists of curated datasets from 15 high-quality open-source repositories:
- Source: GitHub repositories with >100 stars and active development
- Linked Dataset: kotlarmilos/repository-learning
Training Procedure
Preprocessing
- GitHub Data Collection: GraphQL/REST API extraction of PRs, diffs, and review comments
- Conversation Structuring: Chronological ordering of review discussions with context
- Code Analysis: Tree-sitter AST parsing for function extraction across several programming languages
- Quality Filtering: Removal of non-constructive comments, bot interactions, and duplicate content
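A hedged sketch of such a quality filter is shown below; the bot-name suffixes, minimum comment length, and exact-duplicate check are illustrative assumptions rather than the pipeline's actual rules.

```python
# Illustrative quality filter for review comments. The bot suffixes, minimum
# length, and duplicate check are assumptions, not the pipeline's exact rules.
def filter_review_comments(comments):
    """comments: list of dicts with 'author' and 'body' keys (assumed schema)."""
    bot_suffixes = ("[bot]", "-bot")
    seen_bodies = set()
    kept = []
    for comment in comments:
        body = comment["body"].strip()
        if comment["author"].lower().endswith(bot_suffixes):
            continue  # drop bot interactions
        if len(body) < 10:
            continue  # drop very short, non-constructive comments
        if body in seen_bodies:
            continue  # drop duplicate content
        seen_bodies.add(body)
        kept.append(comment)
    return kept
```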
Training Hyperparameters
Contrastive Learning Model:
- Base Model: sentence-transformers/all-MiniLM-L6-v2
- Batch Size: 32
- Epochs: 10
- Loss Function: ContrastiveLoss
- Max Pairs: 35,000 (positive/negative)
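A minimal sketch of this setup with the sentence-transformers training API is shown below; the two file-path pairs are toy placeholders, and only the base model, batch size, epoch count, and loss function above are taken from the actual configuration.

```python
# Contrastive training sketch using the sentence-transformers fit API.
# The two pairs below are toy placeholders; batch size, epochs, and the
# ContrastiveLoss objective come from the configuration listed above.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["src/diff_parser.py", "tests/test_diff_parser.py"], label=1.0),  # co-changed
    InputExample(texts=["src/diff_parser.py", "docs/logo.svg"], label=0.0),              # unrelated
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=10)
```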
Fine-Tuned Review Model:
- Base Model: microsoft/NextCoder-7B
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Rank: 8, Alpha: 16, Dropout: 0.05
- Target Modules: ["q_proj", "v_proj"]
- Quantization: 4-bit NF4 with BitsAndBytes
- Learning Rate: 1e-4 with cosine decay
- Batch Size: 4 with 8 gradient accumulation steps
- Epochs: 3
- Training regime: bf16 mixed precision
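A hedged sketch of the corresponding PEFT and bitsandbytes configuration is shown below. It mirrors the hyperparameters listed above; the Trainer setup, dataset, and prompt formatting are not part of this card and are only summarized in a trailing comment.

```python
# LoRA + 4-bit NF4 setup mirroring the hyperparameters listed above. The
# Trainer, dataset, and prompt formatting are not shown; the closing comment
# summarizes the remaining listed training settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/NextCoder-7B")
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/NextCoder-7B",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# Training then runs for 3 epochs with learning rate 1e-4 and cosine decay,
# per-device batch size 4, gradient accumulation 8, and bf16 mixed precision.
```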
Speeds, Sizes, Times
- Training Time: 2 hours on an H100 GPU for the complete pipeline
Technical Specifications
Model Architecture and Objective
Multi-Modal Architecture:
- Embedding Component: SentenceTransformer with contrastive learning objective
- Generation Component: Transformer decoder with causal language modeling
- Retrieval Component: FAISS vector index with dense embeddings
Training Objectives:
- Contrastive learning: Maximize similarity of co-changed files, minimize similarity of unrelated files
- Instruction following: Generate helpful review comments given code diff context
- Semantic indexing: Create searchable representations of code functions
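To make the contrastive objective concrete, the sketch below shows one way to derive positive and negative file pairs from per-commit change sets; the input format and negative-sampling scheme are illustrative assumptions, not the exact pipeline.

```python
# Derive contrastive training pairs from Git co-change data. The commit data
# format and the negative-sampling scheme are illustrative assumptions.
import random
from itertools import combinations

def build_cochange_pairs(commits, all_files, num_negatives=1, seed=0):
    """commits: list of lists, each holding the files changed together in one
    commit. Returns (file_a, file_b, label) tuples: 1 = co-changed, 0 = unrelated."""
    rng = random.Random(seed)
    pairs = []
    for changed_files in commits:
        # Positive pairs: every combination of files touched by the same commit.
        for file_a, file_b in combinations(sorted(set(changed_files)), 2):
            pairs.append((file_a, file_b, 1))
            # Negative pairs: pair the file with a randomly sampled file that
            # was not part of this commit.
            for _ in range(num_negatives):
                candidate = rng.choice(all_files)
                if candidate not in changed_files:
                    pairs.append((file_a, candidate, 0))
    return pairs

# Example usage with toy data:
commits = [["src/parser.py", "tests/test_parser.py"], ["docs/index.md"]]
all_files = ["src/parser.py", "tests/test_parser.py", "docs/index.md", "src/cli.py"]
print(build_cochange_pairs(commits, all_files))
```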
Usage Examples
For additional usage examples, see https://github.com/kotlarmilos/repository-learning.