# Repository Explorer - Vectorization Feature

## Overview

The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the code sections most relevant to each user query.

## How It Works

### 1. **Content Chunking**

- Repository content is split into overlapping chunks (~500 lines each, with a 50-line overlap)
- Each chunk retains metadata (repo ID, line numbers, chunk index)

### 2. **Embedding Creation**

- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
- Creates a vector embedding for each chunk
- Embeddings capture the semantic meaning of the code content

### 3. **Semantic Search**

- When you ask a question, the system retrieves the 3 most relevant chunks
- Uses cosine similarity to rank chunks by relevance
- Returns both similarity scores and line-number references

### 4. **Enhanced Responses**

- The chatbot combines the general repository analysis with the most relevant code sections
- Provides specific code examples and implementation details
- References exact line numbers for better context
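The sketch below shows how such a chunk-embed-search pipeline can be wired together with `sentence-transformers` and NumPy, using the parameters listed above (500-line chunks, 50-line overlap, top 3 results). The function names (`chunk_content`, `build_index`, `search`) and constants are illustrative, not the repo's actual API:

```python
# Illustrative sketch of the chunk -> embed -> search pipeline described above.
# All names here are hypothetical; the repo's real implementation may differ.
from sentence_transformers import SentenceTransformer
import numpy as np

CHUNK_SIZE = 500  # lines per chunk
OVERLAP = 50      # lines shared between consecutive chunks
TOP_K = 3         # chunks returned per query

def chunk_content(text: str, repo_id: str):
    """Split repository text into overlapping line-based chunks with metadata."""
    lines = text.splitlines()
    chunks = []
    step = CHUNK_SIZE - OVERLAP
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + CHUNK_SIZE, len(lines))
        chunks.append({
            "repo_id": repo_id,
            "chunk_index": index,
            "start_line": start + 1,  # 1-based line numbers for display
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
    return chunks

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks):
    """Embed every chunk; returns a (num_chunks, 384) array of unit vectors."""
    embeddings = model.encode(
        [c["text"] for c in chunks], normalize_embeddings=True
    )
    return np.asarray(embeddings)

def search(query: str, chunks, embeddings, top_k: int = TOP_K):
    """Rank chunks by cosine similarity to the query embedding."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), chunks[i]) for i in best]
```

Because the embeddings are normalized, cosine similarity reduces to a dot product, which keeps the in-memory search fast even without a dedicated vector database.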
## Installation

The vectorization feature requires additional dependencies:

```bash
pip install sentence-transformers numpy
```

These are already included in the updated `requirements.txt`.

## Testing

Run the test script to verify everything is working:

```bash
python test_vectorization.py
```

This will test:

- ✅ Dependencies import correctly
- ✅ SentenceTransformer model loads
- ✅ Embedding creation works
- ✅ Similarity calculations function
- ✅ Integration with the repo explorer

## Features

### ✅ **What's Included**

- **Simple setup**: Uses a lightweight, fast embedding model
- **Automatic chunking**: Content splitting with overlap to preserve context
- **Semantic search**: Finds relevant code based on meaning, not just keywords
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
- **Memory efficient**: In-memory storage suited to single-repository exploration
- **Clear feedback**: Status messages show when vectorization is active

### 🔍 **How to Use**

1. Load any repository in the Repository Explorer tab
2. Look for "Vector embeddings created" in the status message
3. Ask questions - the chatbot will automatically use vector search
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores

### 📊 **Example Output**

When you ask "How do I use this repository?", you might get:

```
=== MOST RELEVANT CODE SECTIONS ===

--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...

--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
def main():
    """Main usage example"""
...actual code from those lines...
```

## Technical Details

- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- **Chunk size**: 500 lines with a 50-line overlap
- **Search**: Top 3 most similar chunks per query
- **Storage**: In-memory (cleared when a new repository is loaded)
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails

## Benefits

1. **Better Context**: Finds relevant code sections even for natural-language queries
2. **Specific Examples**: Provides actual code snippets related to your question
3. **Line References**: Shows exactly where the information comes from
4. **Semantic Understanding**: Captures intent rather than relying on keyword matching
5. **Fast Setup**: The lightweight model downloads quickly on first use

## Limitations

- **Single Repository**: The vector store is cleared when a new repository is loaded
- **Memory Usage**: Keeps all embeddings in memory (suitable for the exploration use case)
- **Model Size**: ~80 MB one-time download for the embedding model
- **No Persistence**: Vectors are recreated each time a repository is loaded

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.