omarsol committed
Commit f46a350 · 1 Parent(s): 186ce5a

Update docs/readme files
.env.example CHANGED
@@ -1,2 +1,7 @@
  OPENAI_API_KEY=...
- COHERE_API_KEY=...

+ # To run AI Tutor Gradio UI
  OPENAI_API_KEY=...
+ COHERE_API_KEY=...
+
+ # To update documentation/add new course
+ HF_TOKEN=...  # create at https://huggingface.co/settings/tokens to read or write to private dataset repos, e.g. https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data/tree/main
+ GITHUB_TOKEN=...  # create at https://github.com/settings/tokens to use the GitHub API
README.md CHANGED
@@ -25,6 +25,7 @@ The Gradio demo is deployed on Hugging Face Spaces at: [AI Tutor Chatbot on Hugg
  ```

  2. Configure environment variables:
  ```bash
  cp .env.example .env # then edit values
  ```
@@ -40,5 +41,5 @@ The Gradio demo is deployed on Hugging Face Spaces at: [AI Tutor Chatbot on Hugg
  ### Updating Data Sources

  For adding new courses or updating documentation:
- - See the detailed instructions in [data/scraping_scripts/README.md](./data/scraping_scripts/README.md)

  ```

  2. Configure environment variables:
+
  ```bash
  cp .env.example .env # then edit values
  ```

  ### Updating Data Sources

  For adding new courses or updating documentation:

+ - See the detailed instructions in [data/scraping_scripts/README.md](./data/scraping_scripts/README.md)
data/scraping_scripts/README.md CHANGED
@@ -1,45 +1,127 @@
- # AI Tutor App Data Workflows
-
- This directory contains scripts for managing the AI Tutor App's data pipeline.
-
- ## Workflow Scripts
-
- ### 1. Adding a New Course
-
- To add a new course to the AI Tutor:

  ```bash
- python add_course_workflow.py --course [COURSE_NAME]
  ```

- This will guide you through the complete process:
-
- 1. Process markdown files from the Notion export
- 2. Prompt you to manually add URLs to the course content
- 3. Merge the course data into the main dataset
- 4. Add contextual information to document nodes
- 5. Create vector stores
- 6. Upload databases to HuggingFace
- 7. Update UI configuration
-
- **Requirements before running:**
-
- - The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS`
- - Course markdown files must be placed in the directory specified in the configuration
- - You must have access to the live course platform to add URLs `https://academy.towardsai.net/enrollments`
-
- ### 2. Updating Documentation via GitHub API

  To update library documentation from GitHub repositories:

  ```bash
- python update_docs_workflow.py
  ```

  This will update all supported documentation sources. You can also specify specific sources:

  ```bash
- python update_docs_workflow.py --sources transformers peft
  ```

  The workflow includes:

@@ -88,8 +170,3 @@ If you need to run specific steps individually:
  - If the PKL file exists, the `--new-context-only` flag will only process new content
  - You must have proper HuggingFace credentials with access to the private repository

- 6. Make sure you have the required environment variables set:
- - `OPENAI_API_KEY` for LLM processing
- - `COHERE_API_KEY` for embeddings
- - `HF_TOKEN` for HuggingFace uploads
- - `GITHUB_TOKEN` for accessing documentation via the GitHub API

+ # Python scripts for adding new courses and updating documentation
+
+ ## First workflow: Adding a New Course
+
+ Make sure you have the required environment variables set:
+
+ - `OPENAI_API_KEY` for LLM processing
+ - `COHERE_API_KEY` for embeddings
+ - `HF_TOKEN` for HuggingFace uploads and downloads
+ - `GITHUB_TOKEN` for accessing files via the GitHub API
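+
+ As a quick sanity check before running anything, a minimal sketch (it assumes the variables are already exported in your shell or loaded from `.env`):
+
+ ```python
+ import os
+
+ # Names taken from the list above.
+ required = ["OPENAI_API_KEY", "COHERE_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"]
+ missing = [name for name in required if not os.environ.get(name)]
+ if missing:
+     raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
+ print("All required environment variables are set.")
+ ```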
+
+ ## 1. Prepare the course data
+
+ 0. You must have access to the live course you want to add:
+
+    - [academy.towardsai.net](https://academy.towardsai.net/courses/take/agent-engineering/multimedia/67469692-lesson-1-part-1-the-ai-engineer-the-agent-landscape)
+
+ 1. In Notion, navigate to the main course page that contains the live lessons:
+
+    - e.g. [Notion page](https://www.notion.so/seldonia/AI-for-Business-Professionals-190f9b6f42708087863df100e9b4b556)
+
+ 2. Click on the three dots in the top right corner and select "Export"
+
+ 3. Select these options:
+    - Export format: "Markdown & CSV"
+    - Include databases: current view
+    - Include content: Everything
+    - Include subpages: yes
+    - Create folders for subpages: yes
+
+ 4. Click on "Export"
+
+ 5. Once the export is complete, unzip the file.
+
+ 6. Move the unzipped folder into the `ai-tutor-app/data` directory
+
+ 7. Rename the folder to the course name.
+    - e.g. `ai_for_business_professionals`
+
+ 8. Open `data/scraping_scripts/process_md_files.py` and locate the `SOURCE_CONFIGS` dictionary.
+
+ 9. Add the new course to the `SOURCE_CONFIGS` dictionary.
+
+    Example:
+
+    ```python
+    "ai_for_business_professionals": {
+        "base_url": "",
+        "input_directory": "data/ai_for_business_professionals",  # Relative path to the directory that contains the Markdown files
+        "output_file": "data/ai_for_business_professionals_data.jsonl",  # The output file that will be created by the script
+        "source_name": "ai_for_business_professionals",
+        "use_include_list": False,
+        "included_dirs": [],
+        "excluded_dirs": [],
+        "excluded_root_files": [],
+        "included_root_files": [],
+        "url_extension": "",
+    },
+    ```
+
+ - The most important fields are:
+   - `input_directory`: the relative path to the directory that contains the Markdown files
+   - `output_file`: the name of the output file that will be created by the script
+   - `source_name`: the name of the course (keep underscores, no spaces)
+ - The other fields can stay as empty lists or empty strings.
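+
+ To double-check the new entry, a minimal sketch (an illustration only, not one of the repo's scripts; it assumes you run it from the repository root and that `process_md_files.py` imports without side effects):
+
+ ```python
+ from pathlib import Path
+
+ from data.scraping_scripts.process_md_files import SOURCE_CONFIGS
+
+ cfg = SOURCE_CONFIGS["ai_for_business_professionals"]
+ # The Markdown folder must exist before running the workflow.
+ assert Path(cfg["input_directory"]).is_dir(), f"Missing folder: {cfg['input_directory']}"
+ print(f"OK: {cfg['source_name']} -> {cfg['output_file']}")
+ ```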
+
+ ## 2. Run the add_course_workflow.py script

  ```bash
+ uv run -m data.scraping_scripts.add_course_workflow --course [COURSE_NAME]
  ```

+ Example:
+
+ ```bash
+ uv run -m data.scraping_scripts.add_course_workflow --course ai_for_business_professionals
+ ```
+
+ This script will guide you through the complete process. It will:
+
+ 1. Extract the markdown content from each of the lessons and create a new JSONL file for the course
+ 2. Download the JSONL files from the other courses
+ 3. Prompt you to manually add URLs to the course content, inside the newly created JSONL file
+ 4. Merge the course data into the main dataset
+ 5. Add contextual information to document nodes
+ 6. Create vector stores
+ 7. Upload databases to HuggingFace
+ 8. Update UI configuration

+ ## 3. Add URLs to the course content

+ - After the script has processed the markdown files, it will prompt you to manually add URLs to the course content.
+ - Answer "no" to the question "Have you added all the URLs?"
+ - Open the `data/ai_for_business_professionals_data.jsonl` file in a text editor and open the live course page in the browser.
+ - If the JSON looks split into multiple wrapped lines in VS Code / Cursor, you can toggle word wrap off.
+   - macOS: press ⌥ Option + Z
+   - Windows/Linux: press Alt + Z
+
+ - For each lesson in the course [academy.towardsai.net](https://academy.towardsai.net/courses/take/agent-engineering/multimedia/67469692-lesson-1-part-1-the-ai-engineer-the-agent-landscape), copy the URL and add it to the `url` field of the corresponding lesson in the JSONL file.
+
+ **Note:** While you do this, it is also the time to clean up the `.jsonl` file: remove any lines/lessons that should not be added to the RAG chatbot, for example "Course Overview", "Course Structure", "Course Outline", "Quiz", "Assignments", etc. A simple approach is to add URLs for all the lessons you want to keep and, once done, remove every JSON line that still has an empty `url` field, as in the sketch below.
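+
+ A minimal sketch of that final cleanup step (not one of the repo's scripts; it assumes each line of the file is a JSON object with a `url` field, as produced above):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ path = Path("data/ai_for_business_professionals_data.jsonl")
+
+ # Keep only the lessons that were given a URL; drop everything else.
+ records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
+ kept = [r for r in records if r.get("url")]
+ print(f"Keeping {len(kept)} of {len(records)} lessons")
+ path.write_text("".join(json.dumps(r) + "\n" for r in kept))
+ ```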
+
+ ## 4. Once done, run the script again and answer "yes" to "Have you added all the URLs?"
+
+ ```bash
+ uv run -m data.scraping_scripts.add_course_workflow --course ai_for_business_professionals
+ ```

+ ----

+ ## Second workflow: Updating Documentation via GitHub API

  To update library documentation from GitHub repositories:

  ```bash
+ uv run -m data.scraping_scripts.update_docs_workflow
  ```

  This will update all supported documentation sources. You can also specify specific sources:

  ```bash
+ uv run -m data.scraping_scripts.update_docs_workflow --sources transformers peft langchain
  ```

  The workflow includes:

  - If the PKL file exists, the `--new-context-only` flag will only process new content
  - You must have proper HuggingFace credentials with access to the private repository

data/scraping_scripts/add_course_workflow.py CHANGED
@@ -13,7 +13,7 @@ This script guides you through the complete process of adding a new course to th
  7. Update UI configuration

  Usage:
- python add_course_workflow.py --course [COURSE_NAME]

  Additional flags to run specific steps (if you want to restart from a specific point):
  --skip-process-md Skip the markdown processing step

  7. Update UI configuration

  Usage:
+ uv run python -m data.scraping_scripts.add_course_workflow --course [COURSE_NAME]

  Additional flags to run specific steps (if you want to restart from a specific point):
  --skip-process-md Skip the markdown processing step
data/scraping_scripts/process_md_files.py CHANGED
@@ -170,6 +170,19 @@ SOURCE_CONFIGS = {
      "included_root_files": [],
      "url_extension": "",
  },
  }


      "included_root_files": [],
      "url_extension": "",
  },
+ "ai_for_business_professionals": {
+     "base_url": "",
+     "input_directory": "data/ai_for_business_professionals",  # Path to the directory that contains the Markdown files
+     "output_file": "data/ai_for_business_professionals_data.jsonl",  # The output file created by the script
+     "source_name": "ai_for_business_professionals",
+     "use_include_list": False,
+     "included_dirs": [],
+     "excluded_dirs": [],
+     "excluded_root_files": [],
+     "included_root_files": [],
+     "url_extension": "",
+ },
+
  }