# CodeMind
A CLI tool for intelligent document analysis and commit message generation using EmbeddingGemma-300m for embeddings, FAISS for vector storage, and Phi-2 for text generation.
## Features

- **Document Indexing**: Embed and index documents for semantic search
- **Semantic Search**: Find relevant documents using natural language queries
- **Smart Commit Messages**: Generate meaningful commit messages from staged git changes
- **RAG (Retrieval-Augmented Generation)**: Answer questions using indexed document context
## Setup

### Prerequisites

- Windows 11
- Conda
- Git
### Installation

1. Create a Conda environment:

   ```bash
   conda create -n codemind python=3.9
   conda activate codemind
   ```

2. Clone the repository:

   ```bash
   git clone https://github.com/devjas1/codemind.git
   cd codemind
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. Download models:

   **Embedding model (EmbeddingGemma-300m):**
   - Download `google/embeddinggemma-300m` from Hugging Face
   - Place it in the `./models/embeddinggemma-300m/` directory

   **Generation model (Phi-2 GGUF):**
   - Download the quantized Phi-2 model `phi-2.Q4_0.gguf` from Hugging Face (Microsoft Phi-2 GGUF or a similar quantized build)
   - Place it in the `./models/` directory
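If you prefer scripting the downloads, here is a minimal sketch using `huggingface_hub` (the GGUF repository id below is an assumption; substitute whichever quantized Phi-2 repo you use, and note that gated repos may require `huggingface-cli login`):

```python
# Sketch: fetch both models with huggingface_hub (pip install huggingface_hub).
from huggingface_hub import snapshot_download, hf_hub_download

# Embedding model: a directory of model files (may require huggingface-cli login)
snapshot_download(
    repo_id="google/embeddinggemma-300m",
    local_dir="./models/embeddinggemma-300m",
)

# Generation model: a single GGUF file (repo id is an assumed example)
hf_hub_download(
    repo_id="TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_0.gguf",
    local_dir="./models",
)
```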
## Directory Structure

```text
CodeMind/
├── cli.py                    # Main CLI entry point
├── config.yaml               # Configuration file
├── requirements.txt          # Python dependencies
├── models/                   # Model storage
│   ├── embeddinggemma-300m/  # Embedding model directory
│   └── phi-2.Q4_0.gguf       # Phi-2 quantized model file
├── src/                      # Core modules
│   ├── config_loader.py      # Configuration management
│   ├── embedder.py           # Document embedding
│   ├── retriever.py          # Semantic search
│   ├── generator.py          # Text generation
│   └── diff_analyzer.py      # Git diff analysis
├── docs/                     # Documentation
└── vector_cache/             # FAISS index storage (auto-created)
```
## Usage

### Initialize Document Index

Index documents from a directory for semantic search:

```bash
python cli.py init ./docs/
```

This will:

- Embed all documents in the specified directory
- Create a FAISS index in `vector_cache/`
- Save metadata for retrieval
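Under the hood, initialization boils down to embedding each document and adding the vectors to a FAISS index. A minimal sketch of that flow (file globbing, index filename, and variable names are illustrative, not CodeMind's actual internals):

```python
# Sketch: embed documents and build a FAISS index (illustrative, not CodeMind's exact code).
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")

docs = [p.read_text(encoding="utf-8") for p in Path("./docs").glob("*.md")]
embeddings = model.encode(docs, normalize_embeddings=True)  # shape: (n_docs, 768)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
index.add(embeddings)

Path("vector_cache").mkdir(exist_ok=True)
faiss.write_index(index, "vector_cache/docs.index")
```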
### Semantic Search

Search for relevant documents using natural language:

```bash
python cli.py search "how to configure the model"
```

Returns ranked results with similarity scores.
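Conceptually, a search embeds the query with the same model and asks the index for its nearest vectors. A hedged sketch (the index filename matches the indexing sketch above and is illustrative):

```python
# Sketch: embed a query and search the saved FAISS index (illustrative).
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")
index = faiss.read_index("vector_cache/docs.index")

query_vec = model.encode(["how to configure the model"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 5)  # retrieval.top_k
for score, doc_id in zip(scores[0], ids[0]):
    if score >= 0.75:  # retrieval.similarity_threshold
        print(f"{score:.3f}  doc #{doc_id}")
```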
### Ask Questions (RAG)

Get answers based on your indexed documents:

```bash
python cli.py ask "What are the configuration options?"
```

Uses retrieval-augmented generation to provide contextual answers.
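RAG stitches the two models together: retrieve the top-scoring documents, pack them into a prompt, and let Phi-2 answer from that context. A minimal sketch with `llama-cpp-python` (the prompt template is an assumption, not CodeMind's actual one; `retrieved_docs` stands in for the texts returned by the search step):

```python
# Sketch: retrieval-augmented answering with llama-cpp-python (illustrative prompt).
from llama_cpp import Llama

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)

retrieved_docs = ["...text of the top-ranked documents from the search step..."]
question = "What are the configuration options?"
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n\n".join(retrieved_docs)
    + f"\n\nQuestion: {question}\nAnswer:"
)

out = llm(prompt, max_tokens=512)  # generator.max_tokens
print(out["choices"][0]["text"].strip())
```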
### Git Commit Message Generation

Generate intelligent commit messages from staged changes:

```bash
# Preview commit message without applying
python cli.py commit --preview

# Show staged files and analysis without generating a message
python cli.py commit --dry-run

# Generate and apply commit message
python cli.py commit --apply
```
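The commit flow reduces to reading the staged diff and prompting the generator with it. A hedged sketch of the idea (the prompt wording is an assumption, not CodeMind's actual template):

```python
# Sketch: draft a commit message from the staged diff (illustrative flow).
import subprocess

from llama_cpp import Llama

diff = subprocess.run(
    ["git", "diff", "--staged"], capture_output=True, text=True, check=True
).stdout

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)
prompt = (
    "Write an imperative, conventional-style commit message "
    "(subject under 72 characters) for this diff:\n\n"
    f"{diff[:4000]}\n\nCommit message:"  # truncate so the prompt fits in n_ctx
)

message = llm(prompt, max_tokens=80)["choices"][0]["text"].strip()
print(message)  # with --apply, the tool would then run: git commit -m "<message>"
```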
### Start API Server (Future Feature)

```bash
python cli.py serve --port 8000
```

Note: API server functionality is planned for future releases.
## Configuration

Edit `config.yaml` to customize behavior:

```yaml
embedding:
  model_path: "./models/embeddinggemma-300m"
  dim: 768
  truncate_to: 128

generator:
  model_path: "./models/phi-2.Q4_0.gguf"
  quantization: "Q4_0"
  max_tokens: 512
  n_ctx: 2048

retrieval:
  vector_store: "faiss"
  top_k: 5
  similarity_threshold: 0.75

commit:
  tone: "imperative"
  style: "conventional"
  max_length: 72

logging:
  verbose: true
  telemetry: false
```
### Configuration Options

- `embedding.model_path`: Path to the EmbeddingGemma-300m model directory
- `generator.model_path`: Path to the Phi-2 GGUF model file
- `generator.max_tokens`: Maximum number of tokens to generate
- `generator.n_ctx`: Context window size for Phi-2
- `retrieval.top_k`: Number of documents to retrieve for context
- `retrieval.similarity_threshold`: Minimum similarity score for results
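Loading and querying this file is a small job with PyYAML; a sketch of what a config loader might look like (the helper name is illustrative, not necessarily what `src/config_loader.py` exposes):

```python
# Sketch: load config.yaml with PyYAML (helper name is illustrative).
import yaml

def load_config(path: str = "config.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

cfg = load_config()
print(cfg["retrieval"]["top_k"])       # 5
print(cfg["generator"]["model_path"])  # ./models/phi-2.Q4_0.gguf
```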
## Dependencies

- `sentence-transformers>=2.2.2`: Document embedding
- `faiss-cpu>=1.7.4`: Vector similarity search
- `llama-cpp-python>=0.2.23`: Phi-2 model inference (Windows compatible)
- `typer>=0.9.0`: CLI framework
- `PyYAML>=6.0`: Configuration file parsing
## Troubleshooting

### Model Loading Issues

If you encounter model loading errors:

- **Embedding model**: Ensure `embeddinggemma-300m` is a directory containing all model files
- **Phi-2 model**: Ensure `phi-2.Q4_0.gguf` is a single GGUF file
- **Paths**: All paths in `config.yaml` should be relative to the project root
### Memory Issues

For systems with limited RAM:

- Use Q4_0 quantization for Phi-2 (already configured)
- Reduce `n_ctx` in `config.yaml` if needed
- Process documents in smaller batches
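The memory llama.cpp allocates for the KV cache grows with the context window, so lowering `n_ctx` is the quickest lever. An illustrative override, mirroring the `generator.n_ctx` setting:

```python
# Sketch: load Phi-2 with a smaller context window to reduce memory use.
from llama_cpp import Llama

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=1024)  # down from 2048
```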
Windows-Specific Issues
- Ensure
llama-cpp-python
version supports Windows - Use PowerShell or Command Prompt for CLI commands
- Check file path separators in configuration
## Development

To verify that the modules import cleanly:

```bash
python -c "from src import *; print('All modules imported successfully')"
```

To run in development mode:

```bash
python cli.py --help
```
## Contributing

Contributions to CodeMind are welcome! Please feel free to submit pull requests, create issues, or suggest new features.

## License

This project is licensed under the terms of the LICENSE file included in the repository.

© 2025 CodeMind. All rights reserved.