---
pretty_name: Repository Learning Models
tags:
  - code-review
  - contrastive-learning
  - sentence-transformers
  - lora
  - fine-tuned
  - nextcoder
  - faiss-index
  - pytorch
  - transformers
license: mit
language:
  - en
library_name: transformers
pipeline_tag: text-generation
base_model: microsoft/NextCoder-7B
inference: true
---

# Model Card for Repository Learning Models

This model card describes a multi-modal AI system for context-aware code review that combines contrastive learning, fine-tuning, and semantic indexing to understand repository-specific patterns and provide code review assistance.

## Model Details

### Model Description

The Repository Learning Models consist of three specialized components that work together to provide context-aware code review assistance:

1. **Contrastive Learning Model:** A fine-tuned SentenceTransformer that learns semantic relationships between code files based on Git change patterns (see the example after this list)
2. **Fine-Tuned Review Model:** A LoRA-adapted NextCoder-7B model specialized for generating repository-specific code review comments
3. **Semantic Index:** A FAISS-powered search system with LLM-generated function descriptions for rapid code navigation

- **Developed by:** Milos Kotlar
- **Model type:** Multi-modal (text generation + embedding + retrieval)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** microsoft/NextCoder-7B (review generation) and sentence-transformers/all-MiniLM-L6-v2 (embeddings)
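To illustrate what the contrastive embedding component learns, the sketch below scores how related two code snippets are. It is a minimal example, assuming the embedding component loads as a standard SentenceTransformer; the base model and the file snippets shown here are illustrative stand-ins:

```python
from sentence_transformers import SentenceTransformer, util

# Base embedding model shown for illustration; a fine-tuned checkpoint
# is loaded the same way from its repository path.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical snippets; files that historically change together should
# embed close to each other after contrastive fine-tuning.
src = "def register_routes(app):\n    app.add_route('/users', users_handler)"
test = "def test_register_routes():\n    assert '/users' in app.routes"

similarity = util.cos_sim(embedder.encode(src), embedder.encode(test))
print(f"co-change similarity: {similarity.item():.3f}")
```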

### Model Sources

- **Repository:** https://github.com/kotlarmilos/repository-learning

## Uses

### Direct Use

The models are designed for:

- **Automated Code Review:** Generate contextual review comments for pull requests
- **Anomaly Detection:** Identify unusual file change patterns that may indicate architectural issues
- **Code Search:** Find relevant functions and documentation using semantic similarity
- **Team Onboarding:** Help new developers understand repository patterns and conventions

### Downstream Use

The models can be integrated into:

- **CI/CD Pipelines:** GitHub Actions, Azure DevOps, and Jenkins workflows
- **IDE Extensions:** VS Code and IntelliJ plugins for real-time review assistance
- **Code Review Tools:** Integration with GitHub, GitLab, and Bitbucket review interfaces
- **Documentation Systems:** Automatic code documentation and explanation generation

### Out-of-Scope Use

The models are not intended for:

- **Security Vulnerability Detection:** While the models may catch some issues, dedicated security tools should be used
- **Performance Analysis:** The models don't analyze runtime performance or optimization
- **Cross-Language Translation:** Optimized for reviewing within single programming languages
- **Legal or Compliance Review:** Cannot assess licensing or regulatory compliance issues

## Bias, Risks, and Limitations

### Technical Limitations

- **Repository Specificity:** Models are trained on specific open-source repositories and may not generalize to very different codebases or proprietary patterns
- **Language Coverage:** Primary focus on 7 major programming languages (Python, JavaScript, TypeScript, Java, C++, C#, C)
- **Context Window:** Fine-tuned model limited to 2048 input tokens
- **Temporal Bias:** Training data represents patterns from the 2024-2025 timeframe

### Social and Ethical Considerations

- **Review Style Bias:** Models learn from existing human review patterns, potentially perpetuating team-specific biases or exclusionary language
- **Open Source Bias:** Training primarily on open-source repositories may not reflect enterprise development patterns
- **Developer Experience Bias:** May favor review styles of experienced developers, potentially alienating junior developers

### Recommendations

- **Human Oversight:** Use AI suggestions as guidance, not as a replacement for human code review
- **Bias Monitoring:** Regularly evaluate generated reviews for inclusive language and fair treatment across developer experience levels
- **Continuous Updating:** Retrain models periodically on recent repository activity to maintain relevance
- **Domain Adaptation:** Fine-tune on organization-specific data when deploying in enterprise environments

## How to Get Started with the Model

```bash
# Install dependencies (accelerate is needed for device_map="auto")
pip install transformers sentence-transformers faiss-cpu torch accelerate
```

```python
# Load the fine-tuned review model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("kotlarmilos/repository-learning-models")
model = AutoModelForCausalLM.from_pretrained(
    "kotlarmilos/repository-learning-models",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a code review
prompt = """Code diff:
+def calculate_average(numbers):
+    return sum(numbers) / len(numbers)

Please write a code review comment:"""

# Move inputs to the model's device; enable sampling so temperature takes effect.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
review = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(review)
```

## Training Details

### Training Data

Training data consists of curated datasets from 15 high-quality open-source repositories:

- **Source:** GitHub repositories with >100 stars and active development
- **Linked dataset:** kotlarmilos/repository-learning

### Training Procedure

#### Preprocessing

1. **GitHub Data Collection:** GraphQL/REST API extraction of PRs, diffs, and review comments
2. **Conversation Structuring:** Chronological ordering of review discussions with context
3. **Code Analysis:** Tree-sitter AST parsing for function extraction across several programming languages (see the sketch after this list)
4. **Quality Filtering:** Removal of non-constructive comments, bot interactions, and duplicate content
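To make step 3 concrete, here is a minimal sketch of tree-sitter-based function extraction. It assumes the `tree-sitter` and `tree-sitter-python` packages (py-tree-sitter >= 0.22, where a language can be passed to the `Parser` constructor); it is not the project's actual extraction code:

```python
# pip install tree-sitter tree-sitter-python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))

source = b"def calculate_average(numbers):\n    return sum(numbers) / len(numbers)\n"
tree = parser.parse(source)

# Collect top-level function definitions from the syntax tree.
for node in tree.root_node.children:
    if node.type == "function_definition":
        name = node.child_by_field_name("name")
        snippet = source[node.start_byte:node.end_byte].decode()
        print(name.text.decode(), "->", snippet.splitlines()[0])
```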

#### Training Hyperparameters

**Contrastive Learning Model:**

- **Base Model:** sentence-transformers/all-MiniLM-L6-v2
- **Batch Size:** 32
- **Epochs:** 10
- **Loss Function:** ContrastiveLoss
- **Max Pairs:** 35,000 (positive/negative)
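These hyperparameters correspond to a standard sentence-transformers contrastive setup. The following is a minimal sketch, assuming co-changed file pairs are labeled 1 and unrelated pairs 0; the pair texts are hypothetical placeholders for content drawn from the linked dataset:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# label=1: files observed changing together in Git history; label=0: unrelated.
train_examples = [
    InputExample(texts=["src/parser.py contents ...", "tests/test_parser.py contents ..."], label=1),
    InputExample(texts=["src/parser.py contents ...", "docs/changelog.md contents ..."], label=0),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(loader, loss)], epochs=10)
model.save("contrastive-cochange-model")
```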

**Fine-Tuned Review Model:**

- **Base Model:** microsoft/NextCoder-7B
- **Training Method:** LoRA (Low-Rank Adaptation)
- **LoRA Rank:** 8, Alpha: 16, Dropout: 0.05
- **Target Modules:** ["q_proj", "v_proj"]
- **Quantization:** 4-bit NF4 with BitsAndBytes
- **Learning Rate:** 1e-4 with cosine decay
- **Batch Size:** 4 with 8 gradient accumulation steps
- **Epochs:** 3
- **Training Regime:** bf16 mixed precision
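The hyperparameters above map directly onto a PEFT + BitsAndBytes configuration. The following sketch shows that mapping; it is an approximation of the setup, not the project's actual training script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with bf16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/NextCoder-7B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention query/value projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```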

#### Speeds, Sizes, Times

- **Training Time:** 2 hours on an H100 GPU for the complete pipeline

## Technical Specifications

### Model Architecture and Objective

**Multi-Modal Architecture:**

1. **Embedding Component:** SentenceTransformer with a contrastive learning objective
2. **Generation Component:** Transformer decoder with causal language modeling
3. **Retrieval Component:** FAISS vector index over dense embeddings (see the sketch below)

**Training Objectives:**

- **Contrastive learning:** Maximize similarity of co-changed files, minimize similarity of unrelated files
- **Instruction following:** Generate helpful review comments given code diff context
- **Semantic indexing:** Create searchable representations of code functions
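To illustrate the retrieval component, here is a minimal sketch of building and querying a FAISS index over function descriptions. The descriptions are hypothetical stand-ins for the LLM-generated ones, and the base encoder is used in place of the fine-tuned checkpoint:

```python
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Stand-ins for LLM-generated function descriptions.
descriptions = [
    "Computes the arithmetic mean of a list of numbers.",
    "Parses a unified diff and yields the changed hunks.",
    "Registers HTTP routes on the application object.",
]
embeddings = encoder.encode(descriptions, normalize_embeddings=True).astype("float32")

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = encoder.encode(["function that averages numbers"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 1)
print(descriptions[ids[0][0]], float(scores[0][0]))
```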

## Usage Examples

For end-to-end usage examples and the full pipeline, see https://github.com/kotlarmilos/repository-learning.