update readme - adding course instructions
- CLAUDE.md +45 -26
- Dockerfile +1 -1
- README.md +1 -1
- scripts/custom_retriever.py +5 -12
- scripts/main.py +5 -4
- scripts/setup.py +3 -35
CLAUDE.md
CHANGED

````diff
@@ -1,28 +1,33 @@
 # AI Tutor App Instructions for Claude
 
 ## Project Overview
+
 This is an AI tutor application that uses RAG (Retrieval Augmented Generation) to provide accurate responses about AI concepts by searching through multiple documentation sources. The application has a Gradio UI and uses ChromaDB for vector storage.
 
 ## Key Repositories and URLs
+
+- Repository on GitHub: https://github.com/towardsai/ai-tutor-app
 - Live demo: https://huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot
 - Vector database: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db
-- Private JSONL repo: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data
+- Private JSONL repo (the raw document data): https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data
 
 ## Architecture Overview
+
 - Frontend: Gradio-based UI in `scripts/main.py`
 - Retrieval: Custom retriever using ChromaDB vector stores
 - Embedding: Cohere embeddings for vector search
-- LLM:
+- LLM: GPT-4o
 - Storage: Individual JSONL files per source + combined file for retrieval
 
 ## Data Update Workflows
 
 ### 1. Adding a New Course
+
 ```bash
-python data/scraping_scripts/add_course_workflow.py --course [COURSE_NAME]
+uv run -m data.scraping_scripts.add_course_workflow --course [COURSE_NAME]
 ```
+
+- This requires the course to be already configured in `process_md_files.py` under `SOURCE_CONFIGS`
 - The workflow will pause for manual URL addition after processing markdown files
 - Only new content will have context added by default (efficient)
 - Use `--process-all-context` if you need to regenerate context for all documents
@@ -30,9 +35,11 @@ python data/scraping_scripts/add_course_workflow.py --course [COURSE_NAME]
 - Use `--skip-data-upload` if you don't want to upload data files
 
 ### 2. Updating Documentation from GitHub
+
 ```bash
-python data/scraping_scripts/update_docs_workflow.py
+uv run -m data.scraping_scripts.update_docs_workflow --sources [SOURCE1] [SOURCE2] ...
 ```
+
 - Updates all supported documentation sources (or specify specific ones with `--sources`)
 - Downloads fresh documentation from GitHub repositories
 - Only new content will have context added by default (efficient)
@@ -41,20 +48,23 @@ python data/scraping_scripts/update_docs_workflow.py
 - Use `--skip-data-upload` if you don't want to upload data files
 
 ### 3. Data File Management
+
 ```bash
 # Upload both JSONL and PKL files to private HuggingFace repository
-python data/scraping_scripts/upload_data_to_hf.py
+uv run -m data.scraping_scripts.upload_data_to_hf
 ```
 
 ## Data Flow and File Relationships
 
 ### Document Processing Pipeline
+
 1. **Markdown Files** → `process_md_files.py` → **Individual JSONL files** (e.g., `transformers_data.jsonl`)
 2. Individual JSONL files → `combine_all_sources()` → `all_sources_data.jsonl`
 3. `all_sources_data.jsonl` → `add_context_to_nodes.py` → `all_sources_contextual_nodes.pkl`
 4. `all_sources_contextual_nodes.pkl` → `create_vector_stores.py` → ChromaDB vector stores
 
 ### Important Files and Their Purpose
+
 - `all_sources_data.jsonl` - Combined raw document data without context
 - Source-specific JSONL files (e.g., `transformers_data.jsonl`) - Raw data for individual sources
 - `all_sources_contextual_nodes.pkl` - Processed nodes with added context
@@ -64,32 +74,37 @@ python data/scraping_scripts/upload_data_to_hf.py
 ## Configuration Details
 
 ### Adding a New Course Source
+
 1. Update `SOURCE_CONFIGS` in `process_md_files.py`:
+
+```python
+"new_course": {
+    "base_url": "",
+    "input_directory": "data/new_course",
+    "output_file": "data/new_course_data.jsonl",
+    "source_name": "new_course",
+    "use_include_list": False,
+    "included_dirs": [],
+    "excluded_dirs": [],
+    "excluded_root_files": [],
+    "included_root_files": [],
+    "url_extension": "",
+},
+```
 
 2. Update UI configurations in:
+
+- `setup.py`: Add to `AVAILABLE_SOURCES` and `AVAILABLE_SOURCES_UI`
+- `main.py`: Add mapping in `source_mapping` dictionary
 
 ## Deployment and Publishing
 
 ### GitHub Actions Workflow
+
 The application is automatically deployed to HuggingFace Spaces when changes are pushed to the main branch (excluding documentation and scraping scripts).
 
 ### Manual Deployment
+
 ```bash
 git push --force https://$HF_USERNAME:[email protected]/spaces/towardsai-tutors/ai-tutor-chatbot main:main
 ```
@@ -97,25 +112,27 @@ git push --force https://$HF_USERNAME:[email protected]/spaces/towardsai-
 ## Development Environment Setup
 
 ### Required Environment Variables
+
 - `OPENAI_API_KEY` - For LLM processing
 - `COHERE_API_KEY` - For embeddings
 - `HF_TOKEN` - For HuggingFace uploads
 - `GITHUB_TOKEN` - For accessing documentation via the GitHub API
 
 ### Running the Application Locally
+
 ```bash
 # Install dependencies
+uv sync
 
 # Start the Gradio UI
-python scripts/main.py
+uv run -m scripts.main
 ```
 
 ## Important Notes
 
 1. When adding new courses, make sure to:
 - Place markdown files exported from Notion in the appropriate directory
-- Add URLs manually from the live course platform
+- Add URLs manually from the live course platform
 - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
 - Configure the course in `process_md_files.py`
 - Verify it appears in the UI after deployment
@@ -132,11 +149,13 @@ python scripts/main.py
 ## Technical Details for Debugging
 
 ### Node Removal Logic
+
 - When adding context, the workflow now removes existing nodes for sources being updated
 - This prevents duplication of content in the vector database
 - The source of each node is extracted from either `node.source_node.metadata` or `node.metadata`
 
 ### Performance Considerations
+
 - Context addition is the most time-consuming step (uses OpenAI API)
 - The new default behavior only processes new content
 - For large updates, consider running in batches
````
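Step 2 of "Adding a New Course Source" in the updated CLAUDE.md only names the structures to touch: `AVAILABLE_SOURCES` and `AVAILABLE_SOURCES_UI` in `scripts/setup.py`, and the `source_mapping` dictionary in `scripts/main.py`. The sketch below is illustrative only; it assumes the first two are parallel lists of source names and UI labels and that `source_mapping` maps a UI label to a source name. Every concrete entry shown is invented for the example, not copied from the repository.

```python
# Hypothetical sketch of the UI-side additions for a new course.

# scripts/setup.py (assumed shape: parallel lists)
AVAILABLE_SOURCES_UI = [
    "Transformers Docs",  # invented existing label
    "New Course",         # label the Gradio UI would show for the new course
]
AVAILABLE_SOURCES = [
    "transformers",       # invented existing source name
    "new_course",         # must match "source_name" in SOURCE_CONFIGS
]

# scripts/main.py (assumed shape: UI label -> source name)
source_mapping = {
    "Transformers Docs": "transformers",
    "New Course": "new_course",
}
```

Whatever the real shapes are, the point of the instructions is that the same `source_name` string has to appear consistently in `SOURCE_CONFIGS`, the setup lists, and the mapping the UI uses.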
Dockerfile
CHANGED

````diff
@@ -18,4 +18,4 @@ RUN chown -R user:user /app
 USER user
 
 EXPOSE 7860
-CMD ["uv", "run", "scripts
+CMD ["uv", "run", "-m", "scripts.main"]
````
README.md
CHANGED

````diff
@@ -33,7 +33,7 @@ The Gradio demo is deployed on Hugging Face Spaces at: [AI Tutor Chatbot on Hugg
 3. Run:
 
 ```bash
-uv run scripts
+uv run -m scripts.main
 ```
 
 Starts the Gradio AI Tutor interface.
````
scripts/custom_retriever.py
CHANGED

````diff
@@ -11,18 +11,11 @@ from dotenv import load_dotenv
 from llama_index.core import Document, QueryBundle
 from llama_index.core.async_utils import run_async_tasks
 from llama_index.core.callbacks import CBEventType, EventPayload
-from llama_index.core.retrievers import (
-    BaseRetriever,
-    KeywordTableSimpleRetriever,
-    VectorIndexRetriever,
-)
-from llama_index.core.schema import MetadataMode, NodeWithScore, QueryBundle, TextNode
-from llama_index.core.vector_stores import (
-    FilterCondition,
-    FilterOperator,
-    MetadataFilter,
-    MetadataFilters,
-)
+from llama_index.core.retrievers import (BaseRetriever,
+                                          KeywordTableSimpleRetriever,
+                                          VectorIndexRetriever)
+from llama_index.core.schema import (MetadataMode, NodeWithScore, QueryBundle,
+                                      TextNode)
 from llama_index.postprocessor.cohere_rerank import CohereRerank
 from llama_index.postprocessor.cohere_rerank.base import CohereRerank
 
````
scripts/main.py
CHANGED

````diff
@@ -2,7 +2,6 @@ import pdb
 
 import gradio as gr
 import logfire
-from custom_retriever import CustomRetriever
 from llama_index.agent.openai import OpenAIAgent
 from llama_index.core.llms import MessageRole
 from llama_index.core.memory import ChatSummaryMemoryBuffer
@@ -10,9 +9,11 @@ from llama_index.core.tools import RetrieverTool, ToolMetadata
 from llama_index.core.vector_stores import (FilterCondition, FilterOperator,
                                             MetadataFilter, MetadataFilters)
 from llama_index.llms.openai import OpenAI
+
+from .custom_retriever import CustomRetriever
+from .prompts import system_message_openai_agent
+from .setup import (AVAILABLE_SOURCES, AVAILABLE_SOURCES_UI, CONCURRENCY_COUNT,
+                    custom_retriever_all_sources)
 
 
 def update_query_engine_tools(selected_sources) -> list[RetrieverTool]:
````
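The imports here change from top-level (`from custom_retriever import CustomRetriever`) to package-relative (`from .custom_retriever import ...`), which is why the Dockerfile, README, and CLAUDE.md in this same commit all switch to the module-style invocation `uv run -m scripts.main`. A minimal, self-contained sketch of that behavior follows; the `mini_pkg` package is hypothetical and not part of this repository, it just mirrors the relative-import pattern.

```python
# mini_pkg/__init__.py  (empty file; marks the directory as a package)

# mini_pkg/helper.py
GREETING = "hello from the package"

# mini_pkg/app.py
from .helper import GREETING  # relative import, like scripts/main.py after this change

if __name__ == "__main__":
    print(GREETING)
```

Run as a plain script, `python mini_pkg/app.py` fails with `ImportError: attempted relative import with no known parent package`, because the file is executed as a top-level module. Run as a module, `python -m mini_pkg.app` (or `uv run -m mini_pkg.app`) works, since Python then resolves `app` as part of the `mini_pkg` package.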
scripts/setup.py
CHANGED

````diff
@@ -6,14 +6,15 @@ import pickle
 
 import chromadb
 import logfire
-from custom_retriever import CustomRetriever
 from dotenv import load_dotenv
 from llama_index.core import Document, VectorStoreIndex
 from llama_index.core.node_parser import SentenceSplitter
 from llama_index.core.retrievers import VectorIndexRetriever
 from llama_index.embeddings.cohere import CohereEmbedding
 from llama_index.vector_stores.chroma import ChromaVectorStore
+
+from .custom_retriever import CustomRetriever
+from .utils import init_mongo_db
 
 load_dotenv()
 
@@ -35,39 +36,6 @@ if not os.path.exists("data/chroma-db-all_sources"):
     logfire.info(f"Downloaded vector database to 'data/chroma-db-all_sources'")
 
 
-def create_docs(input_file: str) -> list[Document]:
-    with open(input_file, "r") as f:
-        documents = []
-        for line in f:
-            data = json.loads(line)
-            documents.append(
-                Document(
-                    doc_id=data["doc_id"],
-                    text=data["content"],
-                    metadata={  # type: ignore
-                        "url": data["url"],
-                        "title": data["name"],
-                        "tokens": data["tokens"],
-                        "retrieve_doc": data["retrieve_doc"],
-                        "source": data["source"],
-                    },
-                    excluded_llm_metadata_keys=[
-                        "title",
-                        "tokens",
-                        "retrieve_doc",
-                        "source",
-                    ],
-                    excluded_embed_metadata_keys=[
-                        "url",
-                        "tokens",
-                        "retrieve_doc",
-                        "source",
-                    ],
-                )
-            )
-        return documents
-
-
 def setup_database(db_collection, dict_file_name) -> CustomRetriever:
     db = chromadb.PersistentClient(path=f"data/{db_collection}")
     chroma_collection = db.get_or_create_collection(db_collection)
````
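The removed `create_docs` helper is still a useful record of the JSONL shape it consumed (presumably files like `all_sources_data.jsonl` and the per-source `*_data.jsonl` files described in CLAUDE.md): one JSON object per line with `doc_id`, `content`, `url`, `name`, `tokens`, `retrieve_doc`, and `source` fields. A sketch of one such record, with field names taken from the removed code and all values invented:

```python
import json

# Every value below is made up for illustration; only the keys come from the
# create_docs helper that this commit removes.
record = {
    "doc_id": "transformers_0001",
    "content": "Full markdown text of one documentation section...",
    "url": "https://huggingface.co/docs/transformers/index",
    "name": "Transformers docs - index",  # create_docs mapped this to the "title" metadata
    "tokens": 512,
    "retrieve_doc": True,
    "source": "transformers",
}
print(json.dumps(record))  # one line like this per document in a *_data.jsonl file
```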