omarsol committed
Commit ad01081 · 1 Parent(s): f46a350

update readme - adding course instructions

Files changed (6)
  1. CLAUDE.md +45 -26
  2. Dockerfile +1 -1
  3. README.md +1 -1
  4. scripts/custom_retriever.py +5 -12
  5. scripts/main.py +5 -4
  6. scripts/setup.py +3 -35
CLAUDE.md CHANGED
@@ -1,28 +1,33 @@
 # AI Tutor App Instructions for Claude

 ## Project Overview
+
 This is an AI tutor application that uses RAG (Retrieval Augmented Generation) to provide accurate responses about AI concepts by searching through multiple documentation sources. The application has a Gradio UI and uses ChromaDB for vector storage.

 ## Key Repositories and URLs
-- Main code: https://github.com/towardsai/ai-tutor-app
+
+- Repository on GitHub: https://github.com/towardsai/ai-tutor-app
 - Live demo: https://huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot
 - Vector database: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db
-- Private JSONL repo: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data
+- Private JSONL repo (the raw document data): https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data

 ## Architecture Overview
+
 - Frontend: Gradio-based UI in `scripts/main.py`
 - Retrieval: Custom retriever using ChromaDB vector stores
 - Embedding: Cohere embeddings for vector search
-- LLM: OpenAI models (GPT-4o, etc.) for context addition and responses
+- LLM: GPT-4o
 - Storage: Individual JSONL files per source + combined file for retrieval

 ## Data Update Workflows

 ### 1. Adding a New Course
+
 ```bash
-python data/scraping_scripts/add_course_workflow.py --course [COURSE_NAME]
+uv run -m data.scraping_scripts.add_course_workflow --course [COURSE_NAME]
 ```
-- This requires the course to be configured in `process_md_files.py` under `SOURCE_CONFIGS`
+
+- This requires the course to be already configured in `process_md_files.py` under `SOURCE_CONFIGS`
 - The workflow will pause for manual URL addition after processing markdown files
 - Only new content will have context added by default (efficient)
 - Use `--process-all-context` if you need to regenerate context for all documents
@@ -30,9 +35,11 @@ python data/scraping_scripts/add_course_workflow.py --course [COURSE_NAME]
 - Use `--skip-data-upload` if you don't want to upload data files

 ### 2. Updating Documentation from GitHub
+
 ```bash
-python data/scraping_scripts/update_docs_workflow.py
+uv run -m data.scraping_scripts.update_docs_workflow --sources [SOURCE1] [SOURCE2] ...
 ```
+
 - Updates all supported documentation sources (or specify specific ones with `--sources`)
 - Downloads fresh documentation from GitHub repositories
 - Only new content will have context added by default (efficient)
@@ -41,20 +48,23 @@ python data/scraping_scripts/update_docs_workflow.py
 - Use `--skip-data-upload` if you don't want to upload data files

 ### 3. Data File Management
+
 ```bash
 # Upload both JSONL and PKL files to private HuggingFace repository
-python data/scraping_scripts/upload_data_to_hf.py
+uv run -m data.scraping_scripts.upload_data_to_hf
 ```

 ## Data Flow and File Relationships

 ### Document Processing Pipeline
+
 1. **Markdown Files** → `process_md_files.py` → **Individual JSONL files** (e.g., `transformers_data.jsonl`)
 2. Individual JSONL files → `combine_all_sources()` → `all_sources_data.jsonl`
 3. `all_sources_data.jsonl` → `add_context_to_nodes.py` → `all_sources_contextual_nodes.pkl`
 4. `all_sources_contextual_nodes.pkl` → `create_vector_stores.py` → ChromaDB vector stores

 ### Important Files and Their Purpose
+
 - `all_sources_data.jsonl` - Combined raw document data without context
 - Source-specific JSONL files (e.g., `transformers_data.jsonl`) - Raw data for individual sources
 - `all_sources_contextual_nodes.pkl` - Processed nodes with added context
@@ -64,32 +74,37 @@ python data/scraping_scripts/upload_data_to_hf.py
 ## Configuration Details

 ### Adding a New Course Source
+
 1. Update `SOURCE_CONFIGS` in `process_md_files.py`:
-```python
-"new_course": {
-    "base_url": "",
-    "input_directory": "data/new_course",
-    "output_file": "data/new_course_data.jsonl",
-    "source_name": "new_course",
-    "use_include_list": False,
-    "included_dirs": [],
-    "excluded_dirs": [],
-    "excluded_root_files": [],
-    "included_root_files": [],
-    "url_extension": "",
-},
-```
+
+```python
+"new_course": {
+    "base_url": "",
+    "input_directory": "data/new_course",
+    "output_file": "data/new_course_data.jsonl",
+    "source_name": "new_course",
+    "use_include_list": False,
+    "included_dirs": [],
+    "excluded_dirs": [],
+    "excluded_root_files": [],
+    "included_root_files": [],
+    "url_extension": "",
+},
+```

 2. Update UI configurations in:
-- `setup.py`: Add to `AVAILABLE_SOURCES` and `AVAILABLE_SOURCES_UI`
-- `main.py`: Add mapping in `source_mapping` dictionary
+
+- `setup.py`: Add to `AVAILABLE_SOURCES` and `AVAILABLE_SOURCES_UI`
+- `main.py`: Add mapping in `source_mapping` dictionary

 ## Deployment and Publishing

 ### GitHub Actions Workflow
+
 The application is automatically deployed to HuggingFace Spaces when changes are pushed to the main branch (excluding documentation and scraping scripts).

 ### Manual Deployment
+
 ```bash
 git push --force https://$HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot main:main
 ```
@@ -97,25 +112,27 @@ git push --force https://$HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/towardsai-
 ## Development Environment Setup

 ### Required Environment Variables
+
 - `OPENAI_API_KEY` - For LLM processing
 - `COHERE_API_KEY` - For embeddings
 - `HF_TOKEN` - For HuggingFace uploads
 - `GITHUB_TOKEN` - For accessing documentation via the GitHub API

 ### Running the Application Locally
+
 ```bash
 # Install dependencies
-pip install -r requirements.txt
+uv sync

 # Start the Gradio UI
-python scripts/main.py
+uv run -m scripts.main
 ```

 ## Important Notes

 1. When adding new courses, make sure to:
 - Place markdown files exported from Notion in the appropriate directory
-- Add URLs manually from the live course platform
+- Add URLs manually from the live course platform
 - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
 - Configure the course in `process_md_files.py`
 - Verify it appears in the UI after deployment
@@ -132,11 +149,13 @@ python scripts/main.py
 ## Technical Details for Debugging

 ### Node Removal Logic
+
 - When adding context, the workflow now removes existing nodes for sources being updated
 - This prevents duplication of content in the vector database
 - The source of each node is extracted from either `node.source_node.metadata` or `node.metadata`

 ### Performance Considerations
+
 - Context addition is the most time-consuming step (uses OpenAI API)
 - The new default behavior only processes new content
 - For large updates, consider running in batches
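The pipeline described above stores each document as one JSON object per line. A minimal sketch of loading `all_sources_data.jsonl` into LlamaIndex `Document` objects, assuming the field names used by the `create_docs` helper removed from `scripts/setup.py` further down in this commit (`doc_id`, `content`, `url`, `name`, `source`):

```python
import json

from llama_index.core import Document


def load_documents(jsonl_path: str) -> list[Document]:
    """Sketch: read one JSON object per line from a file like all_sources_data.jsonl."""
    documents = []
    with open(jsonl_path, "r") as f:
        for line in f:
            record = json.loads(line)
            documents.append(
                Document(
                    doc_id=record["doc_id"],
                    text=record["content"],
                    metadata={
                        "url": record["url"],
                        "title": record["name"],
                        "source": record["source"],
                    },
                )
            )
    return documents
```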
Dockerfile CHANGED
@@ -18,4 +18,4 @@ RUN chown -R user:user /app
 USER user

 EXPOSE 7860
-CMD ["uv", "run", "scripts/main.py"]
+CMD ["uv", "run", "-m", "scripts.main"]
README.md CHANGED
@@ -33,7 +33,7 @@ The Gradio demo is deployed on Hugging Face Spaces at: [AI Tutor Chatbot on Hugg
 3. Run:

 ```bash
-uv run scripts/main.py
+uv run -m scripts.main
 ```

 Starts the Gradio AI Tutor interface.
scripts/custom_retriever.py CHANGED
@@ -11,18 +11,11 @@ from dotenv import load_dotenv
 from llama_index.core import Document, QueryBundle
 from llama_index.core.async_utils import run_async_tasks
 from llama_index.core.callbacks import CBEventType, EventPayload
-from llama_index.core.retrievers import (
-    BaseRetriever,
-    KeywordTableSimpleRetriever,
-    VectorIndexRetriever,
-)
-from llama_index.core.schema import MetadataMode, NodeWithScore, QueryBundle, TextNode
-from llama_index.core.vector_stores import (
-    FilterCondition,
-    FilterOperator,
-    MetadataFilter,
-    MetadataFilters,
-)
+from llama_index.core.retrievers import (BaseRetriever,
+                                          KeywordTableSimpleRetriever,
+                                          VectorIndexRetriever)
+from llama_index.core.schema import (MetadataMode, NodeWithScore, QueryBundle,
+                                      TextNode)
 from llama_index.postprocessor.cohere_rerank import CohereRerank
 from llama_index.postprocessor.cohere_rerank.base import CohereRerank

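For context on the regrouped imports: `BaseRetriever` is the LlamaIndex base class that a class like `CustomRetriever` extends by implementing `_retrieve`. A stripped-down sketch of that subclassing pattern (illustrative only, not the `CustomRetriever` defined in this file):

```python
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever, VectorIndexRetriever
from llama_index.core.schema import NodeWithScore


class MinimalRetriever(BaseRetriever):
    """Illustrative only -- not the CustomRetriever implemented in this repo."""

    def __init__(self, vector_retriever: VectorIndexRetriever) -> None:
        self._vector_retriever = vector_retriever
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> list[NodeWithScore]:
        # Delegate to the wrapped vector retriever; a fuller version could
        # merge keyword hits and rerank the results (e.g., with CohereRerank).
        return self._vector_retriever.retrieve(query_bundle)
```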
scripts/main.py CHANGED
@@ -2,7 +2,6 @@ import pdb

 import gradio as gr
 import logfire
-from custom_retriever import CustomRetriever
 from llama_index.agent.openai import OpenAIAgent
 from llama_index.core.llms import MessageRole
 from llama_index.core.memory import ChatSummaryMemoryBuffer
@@ -10,9 +9,11 @@ from llama_index.core.tools import RetrieverTool, ToolMetadata
 from llama_index.core.vector_stores import (FilterCondition, FilterOperator,
                                              MetadataFilter, MetadataFilters)
 from llama_index.llms.openai import OpenAI
-from prompts import system_message_openai_agent
-from setup import (AVAILABLE_SOURCES, AVAILABLE_SOURCES_UI, CONCURRENCY_COUNT,
-                   custom_retriever_all_sources)
+
+from .custom_retriever import CustomRetriever
+from .prompts import system_message_openai_agent
+from .setup import (AVAILABLE_SOURCES, AVAILABLE_SOURCES_UI, CONCURRENCY_COUNT,
+                    custom_retriever_all_sources)


 def update_query_engine_tools(selected_sources) -> list[RetrieverTool]:
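`update_query_engine_tools(selected_sources)` (unchanged here) returns `RetrieverTool` objects that the `OpenAIAgent` imported above can call. A hypothetical sketch of wrapping one retriever as such a tool; the tool name and description are placeholders, not the app's actual values:

```python
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.tools import RetrieverTool, ToolMetadata


def build_retriever_tool(retriever: BaseRetriever, source: str) -> RetrieverTool:
    # Wrap a retriever so an agent can invoke it as a tool scoped to `source`.
    return RetrieverTool(
        retriever=retriever,
        metadata=ToolMetadata(
            name=f"{source}_info",  # placeholder naming scheme
            description=f"Retrieves documentation chunks about {source}.",
        ),
    )
```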
scripts/setup.py CHANGED
@@ -6,14 +6,15 @@ import pickle

 import chromadb
 import logfire
-from custom_retriever import CustomRetriever
 from dotenv import load_dotenv
 from llama_index.core import Document, VectorStoreIndex
 from llama_index.core.node_parser import SentenceSplitter
 from llama_index.core.retrievers import VectorIndexRetriever
 from llama_index.embeddings.cohere import CohereEmbedding
 from llama_index.vector_stores.chroma import ChromaVectorStore
-from utils import init_mongo_db
+
+from .custom_retriever import CustomRetriever
+from .utils import init_mongo_db

 load_dotenv()

@@ -35,39 +36,6 @@ if not os.path.exists("data/chroma-db-all_sources"):
     logfire.info(f"Downloaded vector database to 'data/chroma-db-all_sources'")


-def create_docs(input_file: str) -> list[Document]:
-    with open(input_file, "r") as f:
-        documents = []
-        for line in f:
-            data = json.loads(line)
-            documents.append(
-                Document(
-                    doc_id=data["doc_id"],
-                    text=data["content"],
-                    metadata={  # type: ignore
-                        "url": data["url"],
-                        "title": data["name"],
-                        "tokens": data["tokens"],
-                        "retrieve_doc": data["retrieve_doc"],
-                        "source": data["source"],
-                    },
-                    excluded_llm_metadata_keys=[
-                        "title",
-                        "tokens",
-                        "retrieve_doc",
-                        "source",
-                    ],
-                    excluded_embed_metadata_keys=[
-                        "url",
-                        "tokens",
-                        "retrieve_doc",
-                        "source",
-                    ],
-                )
-            )
-    return documents
-
-
 def setup_database(db_collection, dict_file_name) -> CustomRetriever:
     db = chromadb.PersistentClient(path=f"data/{db_collection}")
     chroma_collection = db.get_or_create_collection(db_collection)
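For reference, `setup_database()` (context lines above) wires the persisted Chroma collection into a LlamaIndex retriever. A rough sketch of that wiring pattern, with the path and collection name taken from the download step above; the embedding defaults and `similarity_top_k` value are assumptions, not the app's settings:

```python
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Open the collection persisted under data/chroma-db-all_sources; the
# collection shares its name with the directory, as in setup_database().
db = chromadb.PersistentClient(path="data/chroma-db-all_sources")
collection = db.get_or_create_collection("chroma-db-all_sources")

# Expose the existing vectors to LlamaIndex and build a retriever over them.
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=CohereEmbedding(),  # assumes COHERE_API_KEY is set; model left at its default
)
retriever = index.as_retriever(similarity_top_k=5)  # top-k chosen arbitrarily for the sketch
```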