---

title: CodeMind
emoji: πŸ”§
colorFrom: purple
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
short_description: AI-powered development assistant CLI Tool
---


# CodeMind

**CodeMind** is an AI-powered development assistant for intelligent document analysis and commit message generation that runs entirely on your local machine. It leverages modern machine learning models to help you understand your codebase through semantic search and to generate meaningful commit messages with locally hosted language models, ensuring complete privacy and no cloud dependencies.

- **Efficient Knowledge Retrieval**: Makes searching and querying documentation more powerful by using semantic embeddings rather than keyword search.
- **Smarter Git Workflow**: Automates the creation of meaningful commit messages by analyzing git diffs and using an LLM to summarize changes.
- **AI-Powered Documentation**: Enables you to ask questions about your project, using your own docs/context rather than just generic answers.

## Features

- **Document Embedding** (using [EmbeddingGemma-300m](https://huggingface.co/google/embeddinggemma-300m))
- **Semantic Search** (using [FAISS](https://github.com/facebookresearch/faiss) for vector similarity search)
- **Commit Message Generation** (using [Phi-2](https://huggingface.co/microsoft/phi-2-gguf) for text generation): Automatically generate descriptive commit messages based on your changes
- **Retrieval-Augmented Generation (RAG)**: Answers questions using indexed document context
- **Local Processing**: All AI processing happens on your machine with no data sent to cloud services
- **Flexible Configuration**: Customize models and parameters to suit your specific needs
- **FAISS Integration**: Efficient vector similarity search for fast retrieval
- **Multiple Model Support**: Compatible with GGUF and SentenceTransformers models

## Prerequisites

- **Python 3.8 or higher**
- **8GB+ RAM** recommended (for running language models)
- **4GB+ disk space** for model files
- **Git** for repository cloning

### Platform Recommendations

- **Linux** (Recommended for best compatibility)
- **macOS** (Good compatibility)
- **Windows** (May require additional setup for some dependencies)

## Installation

### 1. Clone the Repository

```bash

git clone https://github.com/devjas1/codemind.git

cd codemind

```

### 2. Set Up Python Environment

Create and activate a virtual environment:

```bash



# Create virtual environment

python -m venv venv



# Activate on macOS/Linux

source venv/bin/activate



# Activate on Windows

venv\Scripts\activate

```

### 3. Install Dependencies

```bash

pip install -r requirements.txt

```

**Note**: If you encounter installation errors related to C++/PyTorch/FAISS:

- Ensure you have Python development tools installed
- Linux/macOS are preferred for FAISS compatibility
- On Windows, you may need to install Visual Studio Build Tools

## Model Setup

### Directory Structure

Create the following directory structure for model files:

```text

models/

  β”œβ”€β”€ phi-2.Q4_0.gguf              # For commit message generation (Phi-2 model)

  └── embeddinggemma-300m/         # For document embedding (EmbeddingGemma model)

      └── [model files here]

```

### Downloading Models

1. **Phi-2 Model** (for commit message generation):

   - Download `phi-2.Q4_0.gguf` from a trusted source
   - Place it in the `models/` directory

2. **EmbeddingGemma Model** (for document embedding):

   - Download the EmbeddingGemma-300m model files
   - Place all files in the `models/embeddinggemma-300m/` directory

> **Note**: The specific process for obtaining these models may vary. Check the documentation in each model folder for detailed instructions.

## Configuration

Edit the `config.yaml` file to match your local setup:

```yaml

# Model configuration for commit message generation

generator:

  model_path: "./models/phi-2.Q4_0.gguf"

  quantization: "Q4_0"

  max_tokens: 512

  n_ctx: 2048



# Model configuration for document embedding

embedding:

  model_path: "./models/embeddinggemma-300m"



# Retrieval configuration for semantic search

retrieval:

  vector_store: "faiss"

  top_k: 5 # Number of results to return

  similarity_threshold: 0.7 # Minimum similarity score (0.0 to 1.0)

```

### Configuration Tips

- Adjust `top_k` to control how many results are returned for each query
- Modify `similarity_threshold` to filter results by relevance
- Ensure all file paths are correct for your system
- For larger codebases, you may need to increase `max_tokens`
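
If you script against the config, a small validation helper catches bad values early instead of failing deep inside retrieval. A minimal sketch (`validate_config` is illustrative, not part of CodeMind's API):

```python
def validate_config(cfg: dict) -> dict:
    """Sanity-check the retrieval settings loaded from config.yaml."""
    r = cfg["retrieval"]
    if not (0.0 <= r["similarity_threshold"] <= 1.0):
        raise ValueError("similarity_threshold must be between 0.0 and 1.0")
    if r["top_k"] < 1:
        raise ValueError("top_k must be at least 1")
    return cfg
```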

## Indexing Documents

To enable semantic search over your documentation or codebase, you need to create a FAISS index:

```bash

# Basic usage

python src/embedder.py path/to/your/documents config.yaml



# Example with docs directory

python src/embedder.py ./docs config.yaml



# Example with specific code directory

python src/embedder.py ./src config.yaml

```

This process:

1. Reads all documents from the specified directory
2. Generates embeddings using the configured model
3. Creates a FAISS index in the `vector_cache/` directory
4. Enables fast semantic search capabilities

> **Note**: The indexing process may take several minutes depending on the size of your codebase and your hardware capabilities.
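
Conceptually, the index maps each document chunk to a vector and answers queries by nearest-neighbor search. FAISS does this at scale with optimized index structures, but the core idea fits in a few lines of plain Python (toy vectors stand in for real embeddings here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(index, query_vec, top_k=5):
    """Brute-force nearest-neighbor search over (doc, vector) pairs.
    FAISS's flat indexes do essentially this, just heavily optimized."""
    scored = [(cosine(vec, query_vec), doc) for doc, vec in index]
    return sorted(scored, reverse=True)[:top_k]
```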

## Usage

### Command Line Interface

Run the main CLI interface:

```bash

python cli.py

```

### Available Commands

#### Get Help

```bash

python cli.py --help

```

#### Ask Questions About Your Codebase

```bash

python cli.py ask "How does this repository work?"

python cli.py ask "Where is the main configuration handled?"

python cli.py ask "Show me examples of API usage"

```
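
Under the hood, `ask` is a retrieval-augmented generation loop: retrieve the chunks most similar to the question, splice them into a prompt, and hand the prompt to the local model. A simplified sketch of the prompt-assembly step (function and prompt wording are illustrative, not CodeMind's internals):

```python
def build_rag_prompt(question, scored_chunks, similarity_threshold=0.7):
    """Keep only chunks above the threshold, then wrap them around the question."""
    context = "\n\n".join(
        text for score, text in scored_chunks if score >= similarity_threshold
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```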

#### Generate Commit Messages

```bash

# Preview a generated commit message

python cli.py commit --preview



# Generate commit message without preview

python cli.py commit

```
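
The commit flow reads the staged diff (`git diff --staged`) and asks the local model to summarize it. A sketch of the prompt-construction step, with an illustrative truncation guard so oversized diffs still fit the model's context window (`n_ctx`); the exact prompt CodeMind uses may differ:

```python
def build_commit_prompt(diff_text, max_chars=4000):
    """Truncate very large diffs, then wrap them in a summarization prompt."""
    if len(diff_text) > max_chars:
        diff_text = diff_text[:max_chars] + "\n[diff truncated]"
    return (
        "Write a concise, imperative-mood commit message for this diff:\n\n"
        + diff_text
    )
```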

#### API Server (Placeholder)

```bash

python cli.py serve --port 8000

```

> **Note**: The API server functionality is not yet implemented. This command will display: "API server functionality not implemented yet."

### Advanced Usage

For more advanced usage, you can modify the configuration to:

- Use different models for specific tasks
- Adjust the context window size for larger documents
- Customize the similarity threshold for retrieval
- Use different vector stores (though FAISS is currently the only supported option)

## Troubleshooting

### Common Issues

#### Model Errors

**Problem**: Model files not found or inaccessible  
**Solution**:

- Verify model files are in the correct locations
- Check file permissions
- Ensure the paths in `config.yaml` are correct

#### FAISS Errors

**Problem**: "No FAISS index found" error  
**Solution**:

- Run the embedder script to create the index
- Ensure the `vector_cache/` directory has write permissions

```bash

python src/embedder.py path/to/documents config.yaml

```

#### SentenceTransformers Issues

**Problem**: Compatibility errors with SentenceTransformers  
**Solution**:

- Check that the model format is compatible with SentenceTransformers
- Verify the version in requirements.txt
- Ensure all model files are present in the model directory

#### Performance Issues

**Problem**: Slow response times  
**Solution**:

- Ensure you have adequate RAM
- Consider using smaller quantized models
- Close other memory-intensive applications

#### Platform-Specific Issues

**Windows-specific issues**:

- FAISS may require additional compilation
- Path separators may need adjustment in configuration

**macOS/Linux**:

- Generally fewer compatibility issues
- Ensure you have write permissions for all directories

### Validation Checklist

- All model files present in correct directories
- FAISS index built in `vector_cache/`
- `config.yaml` paths match your local setup
- Python environment activated
- All dependencies installed
- Adequate disk space available
- Sufficient RAM available
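
The file-system items on this checklist can be automated with a small preflight helper; a sketch (the paths mirror the layout above, adjust for your setup):

```python
import os

def preflight(paths=("models/phi-2.Q4_0.gguf",
                     "models/embeddinggemma-300m",
                     "vector_cache",
                     "config.yaml")):
    """Return the subset of expected paths that are missing."""
    return [p for p in paths if not os.path.exists(p)]
```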

### Getting Detailed Error Information

For specific errors, run commands with verbose output:

```bash

# Add debug flags if available

python cli.py --verbose ask "Your question"

```

## Project Structure

```text

codemind/

β”œβ”€β”€ models/                 # AI model files

β”‚   β”œβ”€β”€ phi-2.Q4_0.gguf    # Phi-2 model for generation

β”‚   └── embeddinggemma-300m/ # Embedding model

β”‚       └── [model files]

β”œβ”€β”€ src/                   # Source code

β”‚   └── embedder.py        # Document embedding script

β”œβ”€β”€ vector_cache/          # FAISS vector store (auto-generated)

β”œβ”€β”€ config.yaml           # Configuration file

β”œβ”€β”€ requirements.txt      # Python dependencies

β”œβ”€β”€ cli.py               # Command-line interface

└── README.md            # This file

```

## FAQ

### Q: Can I use different models?

> **A**: Yes, you can use any GGUF-compatible model for generation and any SentenceTransformers-compatible model for embeddings. Update the paths in `config.yaml` accordingly.

### Q: How much RAM do I need?

> **A**: For the Phi-2 Q4_0 model, 8GB RAM is recommended. Larger models will require more memory.

### Q: Can I index multiple directories?

> **A**: Yes, you can run the embedder script multiple times with different directories, or combine your documents into one directory before indexing.

### Q: Is my data sent to the cloud?

> **A**: No, all processing happens locally on your machine. No code or data is sent to external services.

### Q: How often should I re-index my documents?

> **A**: Re-index whenever your documentation or codebase changes significantly to keep search results relevant.

## Support

If you encounter issues:

1. Check the troubleshooting section above
2. Verify all model files are in correct locations
3. Confirm Python and library versions match requirements
4. Ensure proper directory permissions

For specific errors, please include the full traceback when seeking assistance.

## Contributing

Contributions to CodeMind are welcome! Please feel free to submit pull requests, create issues, or suggest new features.

## License

This project is licensed under the terms of the LICENSE file included in the repository.

© 2025 CodeMind. All rights reserved.