# CodeMind

A CLI tool for intelligent document analysis and commit message generation, using EmbeddingGemma-300m for embeddings, FAISS for vector storage, and Phi-2 for text generation.

## Features

- **Document Indexing**: Embed and index documents for semantic search
- **Semantic Search**: Find relevant documents using natural language queries
- **Smart Commit Messages**: Generate meaningful commit messages from staged git changes
- **RAG (Retrieval-Augmented Generation)**: Answer questions using indexed document context

## Setup

### Prerequisites

- Windows 11
- Conda (Miniconda or Anaconda)
- Git

### Installation

1. **Create a Conda environment:**

   ```bash
   conda create -n codemind python=3.9
   conda activate codemind
   ```

2. **Clone the repository:**

   ```bash
   git clone https://github.com/devjas1/codemind.git
   cd codemind
   ```

3. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

4. **Download models:**

   **Embedding Model (EmbeddingGemma-300m):**
   - Download from Hugging Face: `google/embeddinggemma-300m`
   - Place in the `./models/embeddinggemma-300m/` directory

   **Generation Model (Phi-2 GGUF):**
   - Download a quantized Phi-2 build such as `phi-2.Q4_0.gguf`, e.g. from [TheBloke/phi-2-GGUF](https://huggingface.co/TheBloke/phi-2-GGUF) or a similar quantized release
   - Place it in the `./models/` directory

### Directory Structure

```
CodeMind/
├── cli.py                    # Main CLI entry point
├── config.yaml               # Configuration file
├── requirements.txt          # Python dependencies
├── models/                   # Model storage
│   ├── embeddinggemma-300m/  # Embedding model directory
│   └── phi-2.Q4_0.gguf       # Phi-2 quantized model file
├── src/                      # Core modules
│   ├── config_loader.py      # Configuration management
│   ├── embedder.py           # Document embedding
│   ├── retriever.py          # Semantic search
│   ├── generator.py          # Text generation
│   └── diff_analyzer.py      # Git diff analysis
├── docs/                     # Documentation
└── vector_cache/             # FAISS index storage (auto-created)
```

## Usage

### Initialize Document Index

Index documents from a directory for semantic search:

```bash
python cli.py init ./docs/
```

This will:

- Embed all documents in the specified directory
- Create a FAISS index in `vector_cache/`
- Save metadata for retrieval

### Semantic Search

Search for relevant documents using natural language:

```bash
python cli.py search "how to configure the model"
```

Returns ranked results with similarity scores.

### Ask Questions (RAG)

Get answers based on your indexed documents:

```bash
python cli.py ask "What are the configuration options?"
```

Uses retrieval-augmented generation to provide contextual answers. The two sketches below illustrate how this flow typically fits together.
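Conceptually, `init` and `search` follow the usual embed-then-index pattern. The sketch below illustrates that flow with `sentence-transformers` and FAISS; it is not CodeMind's actual `src/embedder.py`/`src/retriever.py` code, and the function names and Markdown-only file glob are assumptions:

```python
# Minimal sketch of the embed-and-index flow (illustrative only; not the
# actual src/embedder.py / src/retriever.py implementation).
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")  # path from config.yaml

def build_index(doc_dir: str, index_path: str = "vector_cache/docs.faiss"):
    """Embed every Markdown file under doc_dir and store the vectors in FAISS."""
    texts = [p.read_text(encoding="utf-8") for p in Path(doc_dir).rglob("*.md")]
    vecs = model.encode(texts, normalize_embeddings=True).astype("float32")
    index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine on unit vectors
    index.add(vecs)
    Path(index_path).parent.mkdir(parents=True, exist_ok=True)
    faiss.write_index(index, index_path)
    return texts, index

def search(query: str, texts: list[str], index, top_k: int = 5):
    """Return the top_k most similar documents as (score, text) pairs."""
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, top_k)
    return [(float(s), texts[i]) for s, i in zip(scores[0], ids[0])]
```

In a full implementation, results scoring below `retrieval.similarity_threshold` from `config.yaml` (0.75 by default) would be filtered out before display.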
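`ask` then feeds the retrieved passages to Phi-2 via `llama-cpp-python`. Building on the `search` helper from the previous sketch, a minimal RAG flow might look like this; the prompt template is an assumption, while `model_path`, `n_ctx`, and `max_tokens` mirror the values in `config.yaml`:

```python
# Minimal RAG sketch for `ask`, reusing search() from the previous example.
# The prompt template below is an assumption, not CodeMind's actual prompt.
from llama_cpp import Llama

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048, verbose=False)

def ask(question: str, texts: list[str], index, top_k: int = 5) -> str:
    """Answer a question using the top_k retrieved documents as context."""
    hits = search(question, texts, index, top_k)
    context = "\n\n".join(doc for _, doc in hits)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=512)
    return out["choices"][0]["text"].strip()
```

In practice the assembled context would also need to be truncated so that prompt plus answer fit within the 2048-token `n_ctx` window.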
### Git Commit Message Generation

Generate intelligent commit messages from staged changes:

```bash
# Preview commit message without applying
python cli.py commit --preview

# Show staged files and analysis without generating message
python cli.py commit --dry-run

# Generate and apply commit message
python cli.py commit --apply
```

### Start API Server (Future Feature)

```bash
python cli.py serve --port 8000
```

_Note: API server functionality is planned for future releases._

## Configuration

Edit `config.yaml` to customize behavior:

```yaml
embedding:
  model_path: "./models/embeddinggemma-300m"
  dim: 768
  truncate_to: 128

generator:
  model_path: "./models/phi-2.Q4_0.gguf"
  quantization: "Q4_0"
  max_tokens: 512
  n_ctx: 2048

retrieval:
  vector_store: "faiss"
  top_k: 5
  similarity_threshold: 0.75

commit:
  tone: "imperative"
  style: "conventional"
  max_length: 72

logging:
  verbose: true
  telemetry: false
```

### Configuration Options

- **embedding.model_path**: Path to the EmbeddingGemma-300m model
- **generator.model_path**: Path to the Phi-2 GGUF model file
- **retrieval.top_k**: Number of documents to retrieve for context
- **retrieval.similarity_threshold**: Minimum similarity score for results
- **generator.max_tokens**: Maximum tokens for generation
- **generator.n_ctx**: Context window size for Phi-2

## Dependencies

- `sentence-transformers>=2.2.2` - Document embedding
- `faiss-cpu>=1.7.4` - Vector similarity search
- `llama-cpp-python>=0.2.23` - Phi-2 model inference (Windows compatible)
- `typer>=0.9.0` - CLI framework
- `PyYAML>=6.0` - Configuration file parsing

## Troubleshooting

### Model Loading Issues

If you encounter model loading errors:

1. **Embedding Model**: Ensure `embeddinggemma-300m` is a directory containing all model files
2. **Phi-2 Model**: Ensure `phi-2.Q4_0.gguf` is a single GGUF file
3. **Paths**: All paths in `config.yaml` should be relative to the project root

### Memory Issues

For systems with limited RAM:

- Use Q4_0 quantization for Phi-2 (already configured)
- Reduce `n_ctx` in `config.yaml` if needed
- Process documents in smaller batches

### Windows-Specific Issues

- Ensure your `llama-cpp-python` version supports Windows
- Use PowerShell or Command Prompt for CLI commands
- Check file path separators in configuration

## Development

To test the modules:

```bash
python -c "from src import *; print('All modules imported successfully')"
```

To run in development mode:

```bash
python cli.py --help
```

## Contributing

Contributions to CodeMind are welcome! Please feel free to submit pull requests, create issues, or suggest new features.

## License

This project is licensed under the terms of the LICENSE file included in the repository.

© 2025 CodeMind. All rights reserved.