CodeMind

A CLI tool for intelligent document analysis and commit message generation. It uses EmbeddingGemma-300m for embeddings, FAISS for vector storage, and Phi-2 for text generation.

Features

  • Document Indexing: Embed and index documents for semantic search
  • Semantic Search: Find relevant documents using natural language queries
  • Smart Commit Messages: Generate meaningful commit messages from staged git changes
  • RAG (Retrieval-Augmented Generation): Answer questions using indexed document context

Setup

Prerequisites

  • Windows 11
  • Conda (Miniconda or Anaconda)
  • Git

Installation

  1. Create a Conda environment:

    conda create -n codemind python=3.9
    conda activate codemind
    
  2. Clone the repository:

    git clone https://github.com/devjas1/codemind.git
    cd codemind
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Download models:

    Embedding Model (EmbeddingGemma-300m):

    • Download from Hugging Face: google/embeddinggemma-300m
    • Place in ./models/embeddinggemma-300m/ directory

    Generation Model (Phi-2 GGUF):

    • Download the quantized Phi-2 model: phi-2.Q4_0.gguf
    • Place in ./models/ directory
    • Phi-2 is released by Microsoft; quantized GGUF conversions (including Q4_0) are available from community repositories on Hugging Face (one way to download both models is sketched below)
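
If you have the huggingface_hub package installed (pip install huggingface_hub), the sketch below is one way to fetch both models. Note that google/embeddinggemma-300m is a gated model, so accept its license on Hugging Face and authenticate first (huggingface-cli login); the Phi-2 repository named here is a community GGUF conversion, not an official Microsoft release.

from huggingface_hub import hf_hub_download, snapshot_download

# Gated model: requires an authenticated Hugging Face account
snapshot_download(
    "google/embeddinggemma-300m",
    local_dir="./models/embeddinggemma-300m",
)

# One community GGUF conversion of Phi-2; substitute your preferred source
hf_hub_download(
    "TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_0.gguf",
    local_dir="./models",
)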

Directory Structure

CodeMind/
├── cli.py                      # Main CLI entry point
├── config.yaml                 # Configuration file
├── requirements.txt            # Python dependencies
├── models/                     # Model storage
│   ├── embeddinggemma-300m/    # Embedding model directory
│   └── phi-2.Q4_0.gguf         # Phi-2 quantized model file
├── src/                        # Core modules
│   ├── config_loader.py        # Configuration management
│   ├── embedder.py             # Document embedding
│   ├── retriever.py            # Semantic search
│   ├── generator.py            # Text generation
│   └── diff_analyzer.py        # Git diff analysis
├── docs/                       # Documentation
└── vector_cache/               # FAISS index storage (auto-created)

Usage

Initialize Document Index

Index documents from a directory for semantic search:

python cli.py init ./docs/

This will:

  • Embed all documents in the specified directory
  • Create a FAISS index in vector_cache/
  • Save metadata for retrieval
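
This mirrors the standard sentence-transformers + FAISS pattern; a minimal sketch, assuming src/embedder.py follows it (function and path names here are illustrative, not the module's actual API):

from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")
docs = [p.read_text(encoding="utf-8") for p in Path("./docs").rglob("*.md")]

# Normalized embeddings + an inner-product index give cosine-similarity search
embeddings = model.encode(docs, normalize_embeddings=True)  # shape (n_docs, 768)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

Path("vector_cache").mkdir(exist_ok=True)
faiss.write_index(index, "vector_cache/docs.index")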

Semantic Search

Search for relevant documents using natural language:

python cli.py search "how to configure the model"

Returns ranked results with similarity scores.
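
The query path is the mirror image; a sketch under the same assumptions:

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")
index = faiss.read_index("vector_cache/docs.index")

# Embed the query with the same model, then take the top_k nearest documents
query = model.encode(["how to configure the model"], normalize_embeddings=True)
scores, ids = index.search(query, 5)  # inner-product scores and document ids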

Ask Questions (RAG)

Get answers based on your indexed documents:

python cli.py ask "What are the configuration options?"

Uses retrieval-augmented generation to provide contextual answers.
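
Conceptually, ask feeds the retrieved documents to Phi-2 as context. A minimal sketch with llama-cpp-python; the prompt template is an assumption, not the project's actual one:

from llama_cpp import Llama

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)

# Stand-in for the top_k document texts returned by the FAISS search above
retrieved_chunks = ["Configuration lives in config.yaml ..."]

context = "\n\n".join(retrieved_chunks)
prompt = f"Context:\n{context}\n\nQuestion: What are the configuration options?\nAnswer:"
answer = llm(prompt, max_tokens=512)["choices"][0]["text"]
print(answer)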

Git Commit Message Generation

Generate intelligent commit messages from staged changes:

# Preview commit message without applying
python cli.py commit --preview

# Show staged files and analysis without generating message
python cli.py commit --dry-run

# Generate and apply commit message
python cli.py commit --apply
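
A hypothetical sketch of the preview path: read the staged diff with git, then ask Phi-2 for a single conventional-style subject line (the prompt wording is assumed):

import subprocess

from llama_cpp import Llama

# Staged changes only, matching what `git commit` would record
diff = subprocess.run(
    ["git", "diff", "--cached"], capture_output=True, text=True, check=True
).stdout

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)
prompt = (
    "Write a conventional commit message (imperative mood, max 72 characters) "
    f"for this diff:\n{diff}\nCommit message:"
)  # long diffs would need truncation to fit within n_ctx
message = llm(prompt, max_tokens=64, stop=["\n"])["choices"][0]["text"].strip()
print(message)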

Start API Server (Future Feature)

python cli.py serve --port 8000

Note: API server functionality is planned for future releases.

Configuration

Edit config.yaml to customize behavior:

embedding:
  model_path: "./models/embeddinggemma-300m"
  dim: 768
  truncate_to: 128

generator:
  model_path: "./models/phi-2.Q4_0.gguf"
  quantization: "Q4_0"
  max_tokens: 512
  n_ctx: 2048

retrieval:
  vector_store: "faiss"
  top_k: 5
  similarity_threshold: 0.75

commit:
  tone: "imperative"
  style: "conventional"
  max_length: 72

logging:
  verbose: true
  telemetry: false

Configuration Options

  • embedding.model_path: Path to the EmbeddingGemma-300m model
  • generator.model_path: Path to the Phi-2 GGUF model file
  • retrieval.top_k: Number of documents to retrieve for context
  • retrieval.similarity_threshold: Minimum similarity score for results
  • generator.max_tokens: Maximum tokens for generation
  • generator.n_ctx: Context window size for Phi-2
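
A minimal sketch of reading these settings with PyYAML (yaml.safe_load is the standard call; the accessor shape is an assumption, see src/config_loader.py for the real implementation):

import yaml

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

top_k = config["retrieval"]["top_k"]                     # 5
threshold = config["retrieval"]["similarity_threshold"]  # 0.75
model_path = config["generator"]["model_path"]           # "./models/phi-2.Q4_0.gguf"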

Dependencies

  • sentence-transformers>=2.2.2 - Document embedding
  • faiss-cpu>=1.7.4 - Vector similarity search
  • llama-cpp-python>=0.2.23 - Phi-2 model inference (Windows compatible)
  • typer>=0.9.0 - CLI framework
  • PyYAML>=6.0 - Configuration file parsing

Troubleshooting

Model Loading Issues

If you encounter model loading errors:

  1. Embedding Model: Ensure embeddinggemma-300m is a directory containing all model files
  2. Phi-2 Model: Ensure phi-2.Q4_0.gguf is a single GGUF file
  3. Paths: All paths in config.yaml should be relative to the project root

Memory Issues

For systems with limited RAM:

  • Use Q4_0 quantization for Phi-2 (already configured)
  • Reduce n_ctx in config.yaml if needed (see the example below)
  • Process documents in smaller batches
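
For example, halving the context window in config.yaml shrinks the KV cache that llama.cpp allocates:

generator:
  n_ctx: 1024  # halved from the default 2048 to reduce memory use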

Windows-Specific Issues

  • Ensure your llama-cpp-python version installs cleanly on Windows (some releases ship prebuilt wheels; others build from source and need a C++ toolchain)
  • Use PowerShell or Command Prompt for CLI commands
  • Check file path separators in configuration (forward slashes, as in the examples above, work in both Python and YAML)

Development

To test the modules:

python -c "from src import *; print('All modules imported successfully')"

To run in development mode:

python cli.py --help

License

[Insert your license information here]

Contributing

[Insert contribution guidelines here]