# Repository Explorer - Vectorization Feature

## Overview

The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the code sections most relevant to each user query.

## How It Works

### 1. **Content Chunking**

- Repository content is split into overlapping chunks (~500 lines each, with a 50-line overlap)
- Each chunk retains metadata (repo ID, line numbers, chunk index)

### 2. **Embedding Creation**

- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
- Creates a vector embedding for each chunk
- Embeddings capture the semantic meaning of the code content

### 3. **Semantic Search**

- When you ask a question, the system retrieves the 3 most relevant chunks
- Uses cosine similarity to rank chunks by relevance
- Returns both similarity scores and line-number references

### 4. **Enhanced Responses**

- The chatbot combines the general repository analysis with the most relevant code sections
- Provides specific code examples and implementation details
- References exact line numbers for better context
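The sketch below shows how such a chunk-embed-search pipeline can be wired together with `sentence-transformers` and NumPy, using the parameters listed above (500-line chunks, 50-line overlap, top 3 results). The function names (`chunk_content`, `build_index`, `search`) and constants are illustrative, not the repo's actual API:

```python
# Illustrative sketch of the chunk -> embed -> search pipeline described above.
# All names here are hypothetical; the repo's real implementation may differ.
from sentence_transformers import SentenceTransformer
import numpy as np

CHUNK_SIZE = 500  # lines per chunk
OVERLAP = 50      # lines shared between consecutive chunks
TOP_K = 3         # chunks returned per query

def chunk_content(text: str, repo_id: str):
    """Split repository text into overlapping line-based chunks with metadata."""
    lines = text.splitlines()
    chunks = []
    step = CHUNK_SIZE - OVERLAP
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + CHUNK_SIZE, len(lines))
        chunks.append({
            "repo_id": repo_id,
            "chunk_index": index,
            "start_line": start + 1,  # 1-based line numbers for display
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
    return chunks

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks):
    """Embed every chunk; returns a (num_chunks, 384) array of unit vectors."""
    embeddings = model.encode(
        [c["text"] for c in chunks], normalize_embeddings=True
    )
    return np.asarray(embeddings)

def search(query: str, chunks, embeddings, top_k: int = TOP_K):
    """Rank chunks by cosine similarity to the query embedding."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), chunks[i]) for i in best]
```

Because the embeddings are normalized, cosine similarity reduces to a dot product, which keeps the in-memory search fast even without a dedicated vector database.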
## Installation

The vectorization feature requires additional dependencies:

```bash
pip install sentence-transformers numpy
```

These are already included in the updated `requirements.txt`.

## Testing

Run the test script to verify everything is working:

```bash
python test_vectorization.py
```

This will test:

- ✅ Dependencies import correctly
- ✅ SentenceTransformer model loads
- ✅ Embedding creation works
- ✅ Similarity calculations function
- ✅ Integration with the repo explorer

## Features

### ✅ **What's Included**

- **Simple setup**: Uses a lightweight, fast embedding model
- **Automatic chunking**: Content splitting with overlap to preserve context
- **Semantic search**: Finds relevant code based on meaning, not just keywords
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
- **Memory efficient**: In-memory storage suited to single-repository exploration
- **Clear feedback**: Status messages show when vectorization is active

### 🔍 **How to Use**

1. Load any repository in the Repository Explorer tab
2. Look for "Vector embeddings created" in the status message
3. Ask questions - the chatbot will automatically use vector search
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores

### 📊 **Example Output**

When you ask "How do I use this repository?", you might get:

```
=== MOST RELEVANT CODE SECTIONS ===

--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...

--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
def main():
    """Main usage example"""
...actual code from those lines...
```

## Technical Details

- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- **Chunk size**: 500 lines with a 50-line overlap
- **Search**: Top 3 most similar chunks per query
- **Storage**: In-memory (cleared when a new repository is loaded)
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails

## Benefits

1. **Better Context**: Finds relevant code sections even for natural-language queries
2. **Specific Examples**: Provides actual code snippets related to your question
3. **Line References**: Shows exactly where the information comes from
4. **Semantic Understanding**: Captures intent rather than relying on keyword matching
5. **Fast Setup**: The lightweight model downloads quickly on first use

## Limitations

- **Single Repository**: The vector store is cleared when a new repository is loaded
- **Memory Usage**: Keeps all embeddings in memory (suitable for the exploration use case)
- **Model Size**: ~80 MB one-time download for the embedding model
- **No Persistence**: Vectors are recreated each time a repository is loaded

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.