# CodeMind
A CLI tool for intelligent document analysis and commit message generation using EmbeddingGemma-300m for embeddings, FAISS for vector storage, and Phi-2 for text generation.
## Features
- **Document Indexing**: Embed and index documents for semantic search
- **Semantic Search**: Find relevant documents using natural language queries
- **Smart Commit Messages**: Generate meaningful commit messages from staged git changes
- **RAG (Retrieval-Augmented Generation)**: Answer questions using indexed document context
## Setup
### Prerequisites
- Windows 11
- Conda (Anaconda or Miniconda)
- Git
### Installation
1. **Create a Conda environment:**
```bash
conda create -n codemind python=3.9
conda activate codemind
```
2. **Clone the repository:**
```bash
git clone https://github.com/devjas1/codemind.git
cd codemind
```
3. **Install dependencies:**
```bash
pip install -r requirements.txt
```
4. **Download models:**
**Embedding Model (EmbeddingGemma-300m):**
- Download from Hugging Face: `google/embeddinggemma-300m`
- Place in `./models/embeddinggemma-300m/` directory
**Generation Model (Phi-2 GGUF):**
- Download the quantized Phi-2 model: `phi-2.Q4_0.gguf`
- Place in `./models/` directory
- Download from: [Microsoft Phi-2 GGUF](https://huggingface.co/microsoft/phi-2-gguf) or similar quantized versions
### Directory Structure
```
CodeMind/
β”œβ”€β”€ cli.py                      # Main CLI entry point
β”œβ”€β”€ config.yaml                 # Configuration file
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ models/                     # Model storage
β”‚   β”œβ”€β”€ embeddinggemma-300m/    # Embedding model directory
β”‚   └── phi-2.Q4_0.gguf         # Phi-2 quantized model file
β”œβ”€β”€ src/                        # Core modules
β”‚   β”œβ”€β”€ config_loader.py        # Configuration management
β”‚   β”œβ”€β”€ embedder.py             # Document embedding
β”‚   β”œβ”€β”€ retriever.py            # Semantic search
β”‚   β”œβ”€β”€ generator.py            # Text generation
β”‚   └── diff_analyzer.py        # Git diff analysis
β”œβ”€β”€ docs/                       # Documentation
└── vector_cache/               # FAISS index storage (auto-created)
```
## Usage
### Initialize Document Index
Index documents from a directory for semantic search:
```bash
python cli.py init ./docs/
```
This will:
- Embed all documents in the specified directory
- Create a FAISS index in `vector_cache/`
- Save metadata for retrieval
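Under the hood, this step corresponds roughly to the sketch below. It is a minimal illustration assuming `sentence-transformers` and `faiss-cpu`, not the actual `src/embedder.py` code; the glob pattern and index filename are assumptions based on the defaults above:
```python
# Minimal indexing sketch; the real logic lives in src/embedder.py and may differ.
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")

# Read each document in the target directory and embed its contents.
paths = sorted(Path("./docs").rglob("*.md"))
texts = [p.read_text(encoding="utf-8") for p in paths]
vectors = model.encode(texts, normalize_embeddings=True)

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

Path("vector_cache").mkdir(exist_ok=True)
faiss.write_index(index, "vector_cache/docs.index")
```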
### Semantic Search
Search for relevant documents using natural language:
```bash
python cli.py search "how to configure the model"
```
Returns ranked results with similarity scores.
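Conceptually, a search embeds the query with the same model and runs a nearest-neighbor lookup against the FAISS index. A hedged sketch (the index filename and `k=5` mirror the defaults above; the real code is in `src/retriever.py`):
```python
# Hypothetical search sketch: embed the query, then query the FAISS index.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")
index = faiss.read_index("vector_cache/docs.index")

query = model.encode(["how to configure the model"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=5)
for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  doc #{doc_id}")
```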
### Ask Questions (RAG)
Get answers based on your indexed documents:
```bash
python cli.py ask "What are the configuration options?"
```
Uses retrieval-augmented generation to provide contextual answers.
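In outline, the `ask` flow retrieves the top-k passages and passes them to Phi-2 as context. A minimal sketch with `llama-cpp-python`; the prompt format and variable names here are assumptions, not the project's actual prompt (see `src/generator.py`):
```python
# Hypothetical RAG sketch combining retrieved passages with Phi-2 generation.
from llama_cpp import Llama

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)

# `chunks` would come from the FAISS search shown above.
chunks = ["...top-k retrieved passages..."]
prompt = (
    "Context:\n" + "\n\n".join(chunks) +
    "\n\nQuestion: What are the configuration options?\nAnswer:"
)
result = llm(prompt, max_tokens=512)
print(result["choices"][0]["text"].strip())
```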
### Git Commit Message Generation
Generate intelligent commit messages from staged changes:
```bash
# Preview commit message without applying
python cli.py commit --preview
# Show staged files and analysis without generating message
python cli.py commit --dry-run
# Generate and apply commit message
python cli.py commit --apply
```
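Conceptually, generating the message means reading the staged diff and prompting Phi-2 with it. A minimal sketch under that assumption; the prompt wording below is invented, and the real diff analysis lives in `src/diff_analyzer.py`:
```python
# Hypothetical commit-message sketch; the CLI's actual prompt may differ.
import subprocess

from llama_cpp import Llama

# Collect the staged changes exactly as `git diff --cached` reports them.
diff = subprocess.run(
    ["git", "diff", "--cached"], capture_output=True, text=True, check=True
).stdout

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)
prompt = (
    "Write an imperative, conventional-commit style subject line "
    "(72 characters max) for this diff:\n" + diff + "\nSubject:"
)
message = llm(prompt, max_tokens=64)["choices"][0]["text"].strip()
print(message)  # --apply would then run: git commit -m "<message>"
```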
### Start API Server (Future Feature)
```bash
python cli.py serve --port 8000
```
_Note: API server functionality is planned for future releases._
## Configuration
Edit `config.yaml` to customize behavior:
```yaml
embedding:
  model_path: "./models/embeddinggemma-300m"
  dim: 768
  truncate_to: 128

generator:
  model_path: "./models/phi-2.Q4_0.gguf"
  quantization: "Q4_0"
  max_tokens: 512
  n_ctx: 2048

retrieval:
  vector_store: "faiss"
  top_k: 5
  similarity_threshold: 0.75

commit:
  tone: "imperative"
  style: "conventional"
  max_length: 72

logging:
  verbose: true
  telemetry: false
```
### Configuration Options
- **embedding.model_path**: Path to the EmbeddingGemma-300m model
- **generator.model_path**: Path to the Phi-2 GGUF model file
- **retrieval.top_k**: Number of documents to retrieve for context
- **retrieval.similarity_threshold**: Minimum similarity score for results
- **generator.max_tokens**: Maximum tokens for generation
- **generator.n_ctx**: Context window size for Phi-2
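Reading these values in code is straightforward with PyYAML. A minimal sketch; the project's real loader is `src/config_loader.py`, whose API may differ:
```python
# Hypothetical config-loading sketch using PyYAML.
import yaml

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config["retrieval"]["top_k"])        # 5
print(config["generator"]["model_path"])   # ./models/phi-2.Q4_0.gguf
```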
## Dependencies
- `sentence-transformers>=2.2.2` - Document embedding
- `faiss-cpu>=1.7.4` - Vector similarity search
- `llama-cpp-python>=0.2.23` - Phi-2 model inference (Windows compatible)
- `typer>=0.9.0` - CLI framework
- `PyYAML>=6.0` - Configuration file parsing
## Troubleshooting
### Model Loading Issues
If you encounter model loading errors:
1. **Embedding Model**: Ensure `embeddinggemma-300m` is a directory containing all model files
2. **Phi-2 Model**: Ensure `phi-2.Q4_0.gguf` is a single GGUF file
3. **Paths**: All paths in `config.yaml` should be relative to the project root
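A quick way to confirm both models load from the configured paths is a short smoke test like the one below (not part of the CLI, just a diagnostic sketch):
```python
# Smoke test: load both models directly and report basic info.
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("./models/embeddinggemma-300m")  # directory of model files
print("embedding dim:", embedder.get_sentence_embedding_dimension())

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)  # single GGUF file
print("Phi-2 loaded")
```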
### Memory Issues
For systems with limited RAM:
- Use Q4_0 quantization for Phi-2 (already configured)
- Reduce `n_ctx` in config.yaml if needed
- Process documents in smaller batches
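For the last point, `sentence-transformers` lets you cap the encoding batch size directly. An illustrative snippet; the batch size value and placeholder texts are assumptions:
```python
# Hypothetical batched-embedding snippet to limit peak memory use.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")
texts = ["first document ...", "second document ..."]  # placeholder contents
vectors = model.encode(texts, batch_size=8, normalize_embeddings=True)
```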
### Windows-Specific Issues
- Ensure `llama-cpp-python` version supports Windows
- Use PowerShell or Command Prompt for CLI commands
- Check file path separators in configuration
## Development
To verify that the core modules import correctly:
```bash
python -c "from src import *; print('All modules imported successfully')"
```
To run in development mode:
```bash
python cli.py --help
```
## Contributing
Contributions to CodeMind are welcome! Please feel free to submit pull requests, create issues, or suggest new features.
## License
This project is licensed under the terms described in the `LICENSE` file included in the repository.
Β© 2025 CodeMind. All rights reserved.