---
pretty_name: Repository Learning Models
tags:
  - code-review
  - contrastive-learning
  - sentence-transformers
  - lora
  - fine-tuned
  - nextcoder
  - faiss-index
  - pytorch
  - transformers
license: mit
language:
  - en
library_name: transformers
pipeline_tag: text-generation
base_model: microsoft/NextCoder-7B
inference: true
---

# Model Card for Repository Learning Models

This model card describes a multi-modal AI system for context-aware code review that combines contrastive learning, fine-tuning, and semantic indexing to understand repository-specific patterns and provide code review assistance.

## Model Details

### Model Description

The Repository Learning Models consist of three specialized components that work together to provide context-aware code review assistance:

1. **Contrastive Learning Model:** A fine-tuned SentenceTransformer that learns semantic relationships between code files based on Git change patterns (see the example after this list)
2. **Fine-Tuned Review Model:** A LoRA-adapted NextCoder-7B model specialized for generating repository-specific code review comments
3. **Semantic Index:** A FAISS-powered search system with LLM-generated function descriptions for rapid code navigation

- **Developed by:** Milos Kotlar
- **Model type:** Multi-modal (text generation + embedding + retrieval)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** microsoft/NextCoder-7B (review generation) and sentence-transformers/all-MiniLM-L6-v2 (embeddings)
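To illustrate what the contrastive embedding component learns, the sketch below scores how related two code snippets are. It is a minimal example, assuming the embedding component loads as a standard SentenceTransformer; the base model and the file snippets shown here are illustrative stand-ins:

```python
from sentence_transformers import SentenceTransformer, util

# Base embedding model shown for illustration; a fine-tuned checkpoint
# is loaded the same way from its repository path.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical snippets; files that historically change together should
# embed close to each other after contrastive fine-tuning.
src = "def register_routes(app):\n    app.add_route('/users', users_handler)"
test = "def test_register_routes():\n    assert '/users' in app.routes"

similarity = util.cos_sim(embedder.encode(src), embedder.encode(test))
print(f"co-change similarity: {similarity.item():.3f}")
```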

### Model Sources

- **Repository:** https://github.com/kotlarmilos/repository-learning

## Uses

### Direct Use

The models are designed for:

- **Automated Code Review:** Generate contextual review comments for pull requests
- **Anomaly Detection:** Identify unusual file change patterns that may indicate architectural issues
- **Code Search:** Find relevant functions and documentation using semantic similarity
- **Team Onboarding:** Help new developers understand repository patterns and conventions

### Downstream Use

The models can be integrated into:

- **CI/CD Pipelines:** GitHub Actions, Azure DevOps, and Jenkins workflows
- **IDE Extensions:** VS Code and IntelliJ plugins for real-time review assistance
- **Code Review Tools:** Integration with GitHub, GitLab, and Bitbucket review interfaces
- **Documentation Systems:** Automatic code documentation and explanation generation

### Out-of-Scope Use

The models are not intended for:

- **Security Vulnerability Detection:** While the models may catch some issues, dedicated security tools should be used
- **Performance Analysis:** The models don't analyze runtime performance or optimization
- **Cross-Language Translation:** Optimized for reviewing within single programming languages
- **Legal or Compliance Review:** Cannot assess licensing or regulatory compliance issues

## Bias, Risks, and Limitations

### Technical Limitations

- **Repository Specificity:** Models are trained on specific open-source repositories and may not generalize to very different codebases or proprietary patterns
- **Language Coverage:** Primary focus on 7 major programming languages (Python, JavaScript, TypeScript, Java, C++, C#, C)
- **Context Window:** Fine-tuned model limited to 2048 input tokens
- **Temporal Bias:** Training data represents patterns from the 2024-2025 timeframe

### Social and Ethical Considerations

- **Review Style Bias:** Models learn from existing human review patterns, potentially perpetuating team-specific biases or exclusionary language
- **Open Source Bias:** Training primarily on open-source repositories may not reflect enterprise development patterns
- **Developer Experience Bias:** May favor review styles of experienced developers, potentially alienating junior developers

### Recommendations

- **Human Oversight:** Use AI suggestions as guidance, not as a replacement for human code review
- **Bias Monitoring:** Regularly evaluate generated reviews for inclusive language and fair treatment across developer experience levels
- **Continuous Updating:** Retrain models periodically on recent repository activity to maintain relevance
- **Domain Adaptation:** Fine-tune on organization-specific data when deploying in enterprise environments

## How to Get Started with the Model

```bash
# Install dependencies (accelerate is needed for device_map="auto")
pip install transformers sentence-transformers faiss-cpu torch accelerate
```

```python
# Load the fine-tuned review model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("kotlarmilos/repository-learning-models")
model = AutoModelForCausalLM.from_pretrained(
    "kotlarmilos/repository-learning-models",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a code review
prompt = """Code diff:
+def calculate_average(numbers):
+    return sum(numbers) / len(numbers)

Please write a code review comment:"""

# Move inputs to the model's device; enable sampling so temperature takes effect.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
review = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(review)
```

## Training Details

### Training Data

Training data consists of curated datasets from 15 high-quality open-source repositories:

- **Source:** GitHub repositories with >100 stars and active development
- **Linked dataset:** kotlarmilos/repository-learning

### Training Procedure

#### Preprocessing

1. **GitHub Data Collection:** GraphQL/REST API extraction of PRs, diffs, and review comments
2. **Conversation Structuring:** Chronological ordering of review discussions with context
3. **Code Analysis:** Tree-sitter AST parsing for function extraction across several programming languages (see the sketch after this list)
4. **Quality Filtering:** Removal of non-constructive comments, bot interactions, and duplicate content
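To make step 3 concrete, here is a minimal sketch of tree-sitter-based function extraction. It assumes the `tree-sitter` and `tree-sitter-python` packages (py-tree-sitter >= 0.22, where a language can be passed to the `Parser` constructor); it is not the project's actual extraction code:

```python
# pip install tree-sitter tree-sitter-python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))

source = b"def calculate_average(numbers):\n    return sum(numbers) / len(numbers)\n"
tree = parser.parse(source)

# Collect top-level function definitions from the syntax tree.
for node in tree.root_node.children:
    if node.type == "function_definition":
        name = node.child_by_field_name("name")
        snippet = source[node.start_byte:node.end_byte].decode()
        print(name.text.decode(), "->", snippet.splitlines()[0])
```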

#### Training Hyperparameters

**Contrastive Learning Model:**

- **Base Model:** sentence-transformers/all-MiniLM-L6-v2
- **Batch Size:** 32
- **Epochs:** 10
- **Loss Function:** ContrastiveLoss
- **Max Pairs:** 35,000 (positive/negative)
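These hyperparameters correspond to a standard sentence-transformers contrastive setup. The following is a minimal sketch, assuming co-changed file pairs are labeled 1 and unrelated pairs 0; the pair texts are hypothetical placeholders for content drawn from the linked dataset:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# label=1: files observed changing together in Git history; label=0: unrelated.
train_examples = [
    InputExample(texts=["src/parser.py contents ...", "tests/test_parser.py contents ..."], label=1),
    InputExample(texts=["src/parser.py contents ...", "docs/changelog.md contents ..."], label=0),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(loader, loss)], epochs=10)
model.save("contrastive-cochange-model")
```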

**Fine-Tuned Review Model:**

- **Base Model:** microsoft/NextCoder-7B
- **Training Method:** LoRA (Low-Rank Adaptation)
- **LoRA Rank:** 8, Alpha: 16, Dropout: 0.05
- **Target Modules:** ["q_proj", "v_proj"]
- **Quantization:** 4-bit NF4 with BitsAndBytes
- **Learning Rate:** 1e-4 with cosine decay
- **Batch Size:** 4 with 8 gradient accumulation steps
- **Epochs:** 3
- **Training Regime:** bf16 mixed precision
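The hyperparameters above map directly onto a PEFT + BitsAndBytes configuration. The following sketch shows that mapping; it is an approximation of the setup, not the project's actual training script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with bf16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/NextCoder-7B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention query/value projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```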

#### Speeds, Sizes, Times

- **Training Time:** 2 hours on an H100 GPU for the complete pipeline

## Technical Specifications

### Model Architecture and Objective

**Multi-Modal Architecture:**

1. **Embedding Component:** SentenceTransformer with a contrastive learning objective
2. **Generation Component:** Transformer decoder with causal language modeling
3. **Retrieval Component:** FAISS vector index over dense embeddings (see the sketch below)

**Training Objectives:**

- **Contrastive learning:** Maximize similarity of co-changed files, minimize similarity of unrelated files
- **Instruction following:** Generate helpful review comments given code diff context
- **Semantic indexing:** Create searchable representations of code functions
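To illustrate the retrieval component, here is a minimal sketch of building and querying a FAISS index over function descriptions. The descriptions are hypothetical stand-ins for the LLM-generated ones, and the base encoder is used in place of the fine-tuned checkpoint:

```python
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Stand-ins for LLM-generated function descriptions.
descriptions = [
    "Computes the arithmetic mean of a list of numbers.",
    "Parses a unified diff and yields the changed hunks.",
    "Registers HTTP routes on the application object.",
]
embeddings = encoder.encode(descriptions, normalize_embeddings=True).astype("float32")

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = encoder.encode(["function that averages numbers"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 1)
print(descriptions[ids[0][0]], float(scores[0][0]))
```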

## Usage Examples

For end-to-end usage examples and the full pipeline, see https://github.com/kotlarmilos/repository-learning.