Spaces: dev-jas/CodeMind
devjas1 committed 7 days ago

Commit

d38cc5e

1 Parent(s): 1878de6

(UPDATE): expand .gitattributes to include additional file types for LFS tracking

(UPDATE): enhance README with detailed project description and features; refactor embed_documents function for improved error handling and encoding support

Files changed (9)

.gitattributes +37 -0
README.md +9 -5
src/__pycache__.py +0 -0
src/__pycache__/config_loader.cpython-310.pyc +0 -0
src/__pycache__/diff_analyzer.cpython-310.pyc +0 -0
src/__pycache__/embedder.cpython-310.pyc +0 -0
src/__pycache__/generator.cpython-310.pyc +0 -0
src/__pycache__/retriever.cpython-310.pyc +0 -0
src/embedder.py +57 -19

.gitattributes CHANGED

@@ -1,2 +1,39 @@
 *.gguf filter=lfs diff=lfs merge=lfs -text
 C:/Users/xJB6x/Projects/CodeMind/models/embeddinggemma-300m/* filter=lfs diff=lfs merge=lfs -text
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
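
To sanity-check which files the added patterns would route through LFS, the basename globs can be approximated with Python's `fnmatch` module. This is only a rough sketch: Git's attribute matching has its own rules (for example `**` and directory-relative patterns), so `git check-attr filter -- <path>` remains the authoritative check. The pattern subset and the candidate file names below are illustrative.

```python
import os
from fnmatch import fnmatchcase

# A subset of the basename patterns tracked in this .gitattributes.
LFS_PATTERNS = ["*.gguf", "*.bin", "*.pkl", "*.safetensors", "*.tar.*", "*tfevents*"]


def probably_lfs_tracked(path: str) -> bool:
    """Approximate check: does the file's basename match any LFS glob?"""
    name = os.path.basename(path)
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


if __name__ == "__main__":
    candidates = [
        "vector_cache/faiss_index.bin",
        "vector_cache/metadata.pkl",
        "src/embedder.py",
    ]
    for candidate in candidates:
        print(candidate, probably_lfs_tracked(candidate))
```

Under this approximation, the index and metadata files that `src/embedder.py` writes to `vector_cache/` match `*.bin` and `*.pkl`, so they would be stored through LFS if committed.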

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
 title: CodeMind
-emoji: 🏆
 colorFrom: purple
 colorTo: indigo
 sdk: static
@@ -9,14 +9,18 @@ license: apache-2.0
 short_description: AI-powered development assistant CLI Tool
 ---
-## CodeMind: Local AI Development Assistant
-CodeMind is an AI-powered development assistant that runs entirely on your local machine. It helps you understand your codebase through semantic search and generates meaningful commit messages using locally hosted language models, ensuring complete privacy and no cloud dependencies.
 ## Features
-- **Semantic Code Search**: Find relevant code and documentation using AI-powered semantic search
-- **Commit Message Generation**: Automatically generate descriptive commit messages based on your changes
 - **Local Processing**: All AI processing happens on your machine with no data sent to cloud services
 - **Flexible Configuration**: Customize models and parameters to suit your specific needs
 - **FAISS Integration**: Efficient vector similarity search for fast retrieval

 ---
 title: CodeMind
+emoji: 🔧
 colorFrom: purple
 colorTo: indigo
 sdk: static
 short_description: AI-powered development assistant CLI Tool
 ---
+**CodeMind** is a AI-powered development assistant that runs entirely on your local machine for intelligent document analysis and commit message generation. It leverages modern machine learning models for: helping you understand your codebase through semantic search and generates meaningful commit messages using locally hosted language models, ensuring complete privacy and no cloud dependencies.
+- **Efficient Knowledge Retrieval**: Makes searching and querying documentation more powerful by using semantic embeddings rather than keyword search.
+- **Smarter Git Workflow**: Automates the creation of meaningful commit messages by analyzing git diffs and using an LLM to summarize changes.
+- **AI-Powered Documentation**: Enables you to ask questions about your project, using your own docs/context rather than just generic answers.
 ## Features
+- **Document Embedding** (using [EmbeddingGemma-300m](https://huggingface.co/google/embeddinggemma-300m))
+- **Semantic Search** (using [FAISS](https://github.com/facebookresearch/faiss) for vector similarity search)
+- **Commit Message Generation** (using [Phi-2](https://huggingface.co/microsoft/phi-2-gguf) for text generation): Automatically generate descriptive commit messages based on your changes
+- **Retrieval-Augmented Generation (RAG)**: Answers questions using indexed document context
 - **Local Processing**: All AI processing happens on your machine with no data sent to cloud services
 - **Flexible Configuration**: Customize models and parameters to suit your specific needs
 - **FAISS Integration**: Efficient vector similarity search for fast retrieval
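
The semantic search feature described in the README comes down to ranking documents by how similar their embedding vectors are to a query vector. A minimal pure-Python sketch of that ranking step follows. The three-dimensional vectors and file names are made up for illustration; a real run would take vectors from the embedding model and use FAISS rather than a Python loop.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query: List[float], docs: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy embeddings (hypothetical); real ones have hundreds of dimensions.
    toy_docs = {
        "setup.md": [1.0, 0.0, 0.0],
        "api.md": [0.0, 1.0, 0.0],
        "faq.md": [1.0, 1.0, 0.0],
    }
    for name, score in rank_documents([1.0, 0.1, 0.0], toy_docs):
        print(f"{name}: {score:.3f}")
```

Normalizing every vector to unit length first makes the plain inner product equal to this cosine score, which is why the embedder pairs `faiss.normalize_L2` with `IndexFlatIP`.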

src/__pycache__.py ADDED

File without changes

src/__pycache__/config_loader.cpython-310.pyc DELETED

Binary file (763 Bytes)

src/__pycache__/diff_analyzer.cpython-310.pyc DELETED

Binary file (1.1 kB)

src/__pycache__/embedder.cpython-310.pyc DELETED

Binary file (924 Bytes)

src/__pycache__/generator.cpython-310.pyc DELETED

Binary file (1.28 kB)

src/__pycache__/retriever.cpython-310.pyc DELETED

Binary file (647 Bytes)

src/embedder.py CHANGED

@@ -2,26 +2,30 @@
 This script handles document embedding using EmbeddingGemma.
 This is the entry point for indexing documents.
 """
 import os
 import pickle
-import faiss
-import numpy as np
-from sentence_transformers import SentenceTransformer
-def embed_documents(path: str, config: dict):
     """
     Embed documents from a directory and save to FAISS index.
     Args:
         path (str): Path to the directory containing the documents to embed.
         config (dict): Configuration dictionary.
     """
     try:
         model = SentenceTransformer(config["embedding"]["model_path"])
-        print(f"Initalized embedding model: {config['embedding']['model_path']}")
-    except ValueError as e:
         print(f"Error initializing embedding model: {e}")
         return []
@@ -34,38 +38,72 @@ def embed_documents(path: str, config: dict):
         fpath = os.path.join(path, fname)
         if os.path.isfile(fpath):
             try:
-                with open(fpath, "r", encoding="utf-8") as f:
-                    text = f.read()
-                    if text.strip():  # Only process non-empty files
-                        emb = model.encode(text)
-                        embeddings.append(emb)
-                        texts.append(text)
-                        filenames.append(fname)
             except Exception as e:
-                print(f"Error reading file {fpath}: {e}")
     if not embeddings:
         print("No documents were successfully embedded.")
         return []
     # Create FAISS index
     dimension = embeddings[0].shape[0]
     index = faiss.IndexFlatIP(dimension)
-    # Normalize embeddings for cosine similarity
     embeddings_matrix = np.array(embeddings).astype("float32")
-    faiss.normalize_L2(embeddings_matrix)
-    # Add embeddings to index
     index.add(embeddings_matrix)
     # Save FAISS index and metadata
     os.makedirs("vector_cache", exist_ok=True)
     faiss.write_index(index, "vector_cache/faiss_index.bin")
     with open("vector_cache/metadata.pkl", "wb") as f:
         pickle.dump({"texts": texts, "filenames": filenames}, f)
-    print(f"Saved FAISS index to vector_cache/ with {len(embeddings)} documents.")
     print(f"Total embeddings created: {len(embeddings)}")
     return list(zip(filenames, embeddings))

 This script handles document embedding using EmbeddingGemma.
 This is the entry point for indexing documents.
 """
+from sentence_transformers import SentenceTransformer
+import numpy as np
+import faiss
 import os
 import pickle
+from typing import List, Tuple
+def embed_documents(path: str, config: dict) -> List[Tuple[str, np.ndarray]]:
     """
     Embed documents from a directory and save to FAISS index.
     Args:
         path (str): Path to the directory containing the documents to embed.
         config (dict): Configuration dictionary.
+    Returns:
+        List of tuples containing (filename, embedding)
     """
     try:
         model = SentenceTransformer(config["embedding"]["model_path"])
+        print(
+            f"Initialized embedding model: {config['embedding']['model_path']}")
+    except Exception as e:  # Changed to catch broader exception
         print(f"Error initializing embedding model: {e}")
         return []
         fpath = os.path.join(path, fname)
         if os.path.isfile(fpath):
             try:
+                # Try different encodings to handle various file types
+                for encoding in ['utf-8', 'latin-1', 'cp1252']:
+                    try:
+                        with open(fpath, "r", encoding=encoding) as f:
+                            text = f.read()
+                        break
+                    except UnicodeDecodeError:
+                        continue
+                else:
+                    print(
+                        f"Could not decode file {fpath} with common encodings")
+                    continue
+                if text.strip():  # Only process non-empty files
+                    emb = model.encode(text)
+                    # Ensure all embeddings have the same dimension
+                    if embeddings and emb.shape[0] != embeddings[0].shape[0]:
+                        print(f"Dimension mismatch in file {fname}, skipping")
+                        continue
+                    embeddings.append(emb)
+                    texts.append(text)
+                    filenames.append(fname)
             except Exception as e:
+                print(f"Error processing file {fpath}: {e}")
     if not embeddings:
         print("No documents were successfully embedded.")
         return []
+    print("Embedder script started", flush=True)
+    print(f"Documents in path: {os.listdir(path)}")
+    print(f"Successfully processed {len(embeddings)} documents")
     # Create FAISS index
     dimension = embeddings[0].shape[0]
     index = faiss.IndexFlatIP(dimension)
+    # Convert to numpy array and normalize
     embeddings_matrix = np.array(embeddings).astype("float32")
+    faiss.normalize_L2(embeddings_matrix)  # Normalize for cosine similarity
+    # Add normalized embeddings to index
     index.add(embeddings_matrix)
     # Save FAISS index and metadata
     os.makedirs("vector_cache", exist_ok=True)
     faiss.write_index(index, "vector_cache/faiss_index.bin")
+    # Save metadata
     with open("vector_cache/metadata.pkl", "wb") as f:
         pickle.dump({"texts": texts, "filenames": filenames}, f)
+    print(
+        f"Saved FAISS index to vector_cache/ with {len(embeddings)} documents.")
     print(f"Total embeddings created: {len(embeddings)}")
     return list(zip(filenames, embeddings))
+# Example usage
+if __name__ == "__main__":
+    config = {
+        "embedding": {
+            "model_path": "sentence-transformers/all-MiniLM-L6-v2"  # Example model
+        }
+    }
+    result = embed_documents("./docs", config)
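
The encoding fallback added to `embed_documents` can be isolated into a small helper. One caveat: `latin-1` maps every byte value to a character, so decoding with it never raises `UnicodeDecodeError`. In the loop as committed, the `cp1252` attempt and the `else` branch are therefore never reached. The sketch below is one possible variant, not the author's code: the helper name `read_text_with_fallback` is hypothetical, and it tries `cp1252` before `latin-1` so that both encodings get a chance.

```python
from typing import Optional, Tuple


def read_text_with_fallback(
    fpath: str,
    encodings: Tuple[str, ...] = ("utf-8", "cp1252", "latin-1"),
) -> Optional[Tuple[str, str]]:
    """Return (text, encoding_used), or None if no encoding could decode the file."""
    for encoding in encodings:
        try:
            with open(fpath, "r", encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            # This encoding cannot represent the file's bytes; try the next one.
            continue
    return None
```

Returning the encoding alongside the text makes it easy to log which files needed a fallback, since a successful `latin-1` decode only means the bytes were accepted, not that the text is correct.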