# CodeMind

A CLI tool for intelligent document analysis and commit message generation using EmbeddingGemma-300m for embeddings, FAISS for vector storage, and Phi-2 for text generation.

## Features

- **Document Indexing**: Embed and index documents for semantic search
- **Semantic Search**: Find relevant documents using natural language queries
- **Smart Commit Messages**: Generate meaningful commit messages from staged git changes
- **RAG (Retrieval-Augmented Generation)**: Answer questions using indexed document context

## Setup

### Prerequisites

- Windows 11
- Conda (Miniconda or Anaconda)
- Git

### Installation

1. **Create a Conda environment:**

   ```bash
   conda create -n codemind python=3.9
   conda activate codemind
   ```

2. **Clone the repository:**

   ```bash
   git clone https://github.com/devjas1/codemind.git
   cd codemind
   ```

3. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

4. **Download models:**

   **Embedding Model (EmbeddingGemma-300m):**

   - Download from Hugging Face: `google/embeddinggemma-300m`
   - Place in `./models/embeddinggemma-300m/` directory

   **Generation Model (Phi-2 GGUF):**

   - Download the quantized Phi-2 model: `phi-2.Q4_0.gguf`
   - Place in `./models/` directory
   - Download from a community-quantized repository on Hugging Face, e.g. [TheBloke/phi-2-GGUF](https://huggingface.co/TheBloke/phi-2-GGUF), or a similar quantized distribution
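
   Before the first run, it can help to verify that both models landed in the expected locations. The helper below is a hypothetical pre-flight check, not part of the CodeMind CLI; the paths mirror the directory structure described in this README:

   ```python
   from pathlib import Path

   def missing_models(root="."):
       """Return the expected model paths that do not exist yet."""
       root = Path(root)
       expected = [
           root / "models" / "embeddinggemma-300m",  # directory of model files
           root / "models" / "phi-2.Q4_0.gguf",      # single GGUF file
       ]
       return [str(p) for p in expected if not p.exists()]

   if __name__ == "__main__":
       missing = missing_models()
       if missing:
           print("Missing models:", ", ".join(missing))
       else:
           print("All models found.")
   ```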

### Directory Structure

```
CodeMind/
├── cli.py                      # Main CLI entry point
├── config.yaml                 # Configuration file
├── requirements.txt            # Python dependencies
├── models/                     # Model storage
│   ├── embeddinggemma-300m/    # Embedding model directory
│   └── phi-2.Q4_0.gguf         # Phi-2 quantized model file
├── src/                        # Core modules
│   ├── config_loader.py        # Configuration management
│   ├── embedder.py             # Document embedding
│   ├── retriever.py            # Semantic search
│   ├── generator.py            # Text generation
│   └── diff_analyzer.py        # Git diff analysis
├── docs/                       # Documentation
└── vector_cache/               # FAISS index storage (auto-created)
```

## Usage

### Initialize Document Index

Index documents from a directory for semantic search:

```bash
python cli.py init ./docs/
```

This will:

- Embed all documents in the specified directory
- Create a FAISS index in `vector_cache/`
- Save metadata for retrieval
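
Conceptually, `init` is an embed-and-index loop followed by inner-product search. The sketch below imitates that flow with plain NumPy; the hash-based `embed` function is a toy stand-in for EmbeddingGemma, and the in-memory matrix plays the role of a FAISS `IndexFlatIP`, so treat it as an illustration rather than the real pipeline:

```python
import numpy as np

def embed(texts, dim=768):
    # Toy stand-in for EmbeddingGemma: deterministic pseudo-embeddings,
    # L2-normalized so an inner product equals cosine similarity.
    vecs = np.stack([
        np.random.default_rng(abs(hash(t)) % (2**32)).standard_normal(dim)
        for t in texts
    ])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

docs = ["configure the model", "install dependencies", "run the server"]
index = embed(docs)  # analogous to building a flat inner-product index

def search(query, top_k=2):
    # Rank documents by inner product with the query embedding.
    scores = index @ embed([query])[0]
    order = np.argsort(-scores)[:top_k]
    return [(docs[i], float(scores[i])) for i in order]
```

An exact-match query returns itself with a similarity of 1.0, which is the shape of the ranked results the `search` command prints.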

### Semantic Search

Search for relevant documents using natural language:

```bash
python cli.py search "how to configure the model"
```

Returns ranked results with similarity scores.

### Ask Questions (RAG)

Get answers based on your indexed documents:

```bash
python cli.py ask "What are the configuration options?"
```

Uses retrieval-augmented generation to provide contextual answers.
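
Under the hood, `ask` follows the standard RAG recipe: retrieve the top-k chunks, splice them into a prompt, and hand that prompt to Phi-2. A minimal sketch of the prompt-assembly step (the template wording and character budget here are illustrative, not CodeMind's actual prompt):

```python
def build_rag_prompt(question, chunks, max_chars=4000):
    # Concatenate retrieved chunks until the context budget is spent,
    # then frame the question for the generator.
    context, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        context.append(chunk)
        used += len(chunk)
    return (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context) +
        "\n\nQuestion: " + question + "\nAnswer:"
    )
```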

### Git Commit Message Generation

Generate intelligent commit messages from staged changes:

```bash
# Preview commit message without applying
python cli.py commit --preview

# Show staged files and analysis without generating message
python cli.py commit --dry-run

# Generate and apply commit message
python cli.py commit --apply
```
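
To give a flavor of what the diff analysis feeds into, here is a crude path-based heuristic for choosing a Conventional Commits type and capping the subject line. CodeMind's real generator uses Phi-2 for this; the fallback logic below is purely hypothetical:

```python
def guess_commit_type(staged_paths):
    # Path-based heuristic in the spirit of conventional commits.
    if all(p.startswith("docs/") or p.endswith(".md") for p in staged_paths):
        return "docs"
    if any("test" in p for p in staged_paths):
        return "test"
    return "feat"

def subject_line(commit_type, summary, max_length=72):
    # Enforce the 72-character cap from the commit section of config.yaml.
    return f"{commit_type}: {summary}"[:max_length]
```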

### Start API Server (Future Feature)

```bash
python cli.py serve --port 8000
```

_Note: API server functionality is planned for future releases._

## Configuration

Edit `config.yaml` to customize behavior:

```yaml
embedding:
  model_path: "./models/embeddinggemma-300m"
  dim: 768
  truncate_to: 128

generator:
  model_path: "./models/phi-2.Q4_0.gguf"
  quantization: "Q4_0"
  max_tokens: 512
  n_ctx: 2048

retrieval:
  vector_store: "faiss"
  top_k: 5
  similarity_threshold: 0.75

commit:
  tone: "imperative"
  style: "conventional"
  max_length: 72

logging:
  verbose: true
  telemetry: false
```

### Configuration Options

- **embedding.model_path**: Path to the EmbeddingGemma-300m model directory
- **generator.model_path**: Path to the Phi-2 GGUF model file
- **generator.max_tokens**: Maximum number of tokens to generate
- **generator.n_ctx**: Context window size for Phi-2
- **retrieval.top_k**: Number of documents to retrieve for context
- **retrieval.similarity_threshold**: Minimum similarity score for results
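
A loader along these lines can merge `config.yaml` over built-in defaults. This is a sketch of the idea, assuming `src/config_loader.py` does something similar; the function name and the choice of defaults are illustrative (and taken from the values shown above):

```python
import yaml

DEFAULTS = {
    "retrieval": {"vector_store": "faiss", "top_k": 5, "similarity_threshold": 0.75},
    "generator": {"max_tokens": 512, "n_ctx": 2048},
}

def load_config(path="config.yaml"):
    # Shallow-merge user settings over defaults, section by section.
    # (Sections absent from DEFAULTS are omitted in this simplified sketch.)
    try:
        with open(path, "r", encoding="utf-8") as f:
            user = yaml.safe_load(f) or {}
    except FileNotFoundError:
        user = {}
    return {
        section: {**values, **user.get(section, {})}
        for section, values in DEFAULTS.items()
    }
```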

## Dependencies

- `sentence-transformers>=2.2.2` - Document embedding
- `faiss-cpu>=1.7.4` - Vector similarity search
- `llama-cpp-python>=0.2.23` - Phi-2 model inference (Windows compatible)
- `typer>=0.9.0` - CLI framework
- `PyYAML>=6.0` - Configuration file parsing

## Troubleshooting

### Model Loading Issues

If you encounter model loading errors:

1. **Embedding Model**: Ensure `embeddinggemma-300m` is a directory containing all model files
2. **Phi-2 Model**: Ensure `phi-2.Q4_0.gguf` is a single GGUF file
3. **Paths**: All paths in `config.yaml` should be relative to the project root

### Memory Issues

For systems with limited RAM:

- Use Q4_0 quantization for Phi-2 (already configured)
- Reduce `n_ctx` in `config.yaml` if needed
- Process documents in smaller batches
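
"Smaller batches" can be as simple as chunking the document list before embedding, so only one batch of vectors is held in memory at a time. A generic helper (illustrative, not part of the CodeMind API):

```python
def batched(items, batch_size=8):
    # Yield fixed-size slices of the input list; the final slice
    # may be shorter when len(items) is not a multiple of batch_size.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```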

### Windows-Specific Issues

- Ensure `llama-cpp-python` version supports Windows
- Use PowerShell or Command Prompt for CLI commands
- Check file path separators in configuration

## Development

To test the modules:

```bash
python -c "from src import *; print('All modules imported successfully')"
```

To run in development mode:

```bash
python cli.py --help
```

## License

[Insert your license information here]

## Contributing

[Insert contribution guidelines here]