Update docs/readme files

- .env.example +6 -1
- README.md +2 -1
- data/scraping_scripts/README.md +103 -26
- data/scraping_scripts/add_course_workflow.py +1 -1
- data/scraping_scripts/process_md_files.py +13 -0

.env.example
CHANGED
@@ -1,2 +1,7 @@
+# To run AI Tutor Gradio UI
 OPENAI_API_KEY=...
-COHERE_API_KEY=...
+COHERE_API_KEY=...
+
+# To update documentation/add new course
+HF_TOKEN=... https://huggingface.co/settings/tokens to read or write to private DATASET repos e.g. https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data/tree/main
+GITHUB_TOKEN=... https://github.com/settings/tokens to use the GitHub API

README.md
CHANGED
@@ -25,6 +25,7 @@
 ```
 
 2. Configure environment variables:
+
 ```bash
 cp .env.example .env # then edit values
 ```
@@ -40,5 +41,5 @@
 ### Updating Data Sources
 
 For adding new courses or updating documentation:
-- See the detailed instructions in [data/scraping_scripts/README.md](./data/scraping_scripts/README.md)
 
+- See the detailed instructions in [data/scraping_scripts/README.md](./data/scraping_scripts/README.md)

data/scraping_scripts/README.md
CHANGED

# Python scripts for adding new courses and updating documentation

## First workflow: Adding a New Course

Make sure you have the required environment variables set:

- `OPENAI_API_KEY` for LLM processing
- `COHERE_API_KEY` for embeddings
- `HF_TOKEN` for HuggingFace uploads and downloads
- `GITHUB_TOKEN` for accessing files via the GitHub API
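
Before running either workflow, it can help to confirm those variables are actually exported. A minimal, hypothetical pre-flight check (not part of the repository) could look like:

```python
import os

# Hypothetical pre-flight check: confirm the four variables listed above are set
# before running the add-course or update-docs workflows.
REQUIRED_VARS = ["OPENAI_API_KEY", "COHERE_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")
```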

## 1. Prepare the course data

0. You must have access to the live course you want to add:

   - [academy.towardsai.net](https://academy.towardsai.net/courses/take/agent-engineering/multimedia/67469692-lesson-1-part-1-the-ai-engineer-the-agent-landscape)

1. In Notion, navigate to the main course page that contains the live lessons:

   - e.g. [Notion page](https://www.notion.so/seldonia/AI-for-Business-Professionals-190f9b6f42708087863df100e9b4b556)

2. Click on the three dots in the top right corner and select "Export"

3. Select these options:
   - Export format: "Markdown & CSV"
   - Include databases: current view
   - Include content: Everything
   - Include subpages: yes
   - Create folders for subpages: yes

4. Click on "Export"

5. Once the export is complete, unzip the file.

6. Move the unzipped folder into the `ai-tutor-app/data` directory

7. Rename the folder to the course name.
   - e.g. `ai_for_business_professionals`

8. Open the `data/scraping_scripts/process_md_files.py` file and locate the `SOURCE_CONFIGS` dictionary.

9. Add the new course to the `SOURCE_CONFIGS` dictionary, for example:

```python
"ai_for_business_professionals": {
    "base_url": "",
    "input_directory": "data/ai_for_business_professionals",  # Relative path to the directory that contains the Markdown files
    "output_file": "data/ai_for_business_professionals_data.jsonl",  # The output file that will be created by the script
    "source_name": "ai_for_business_professionals",
    "use_include_list": False,
    "included_dirs": [],
    "excluded_dirs": [],
    "excluded_root_files": [],
    "included_root_files": [],
    "url_extension": "",
},
```

- The most important fields are:
  - `input_directory`: the relative path to the directory that contains the Markdown files
  - `output_file`: the name of the output file that will be created by the script
  - `source_name`: the name of the course (keep underscores, no spaces)
- The other fields can stay as empty lists or empty strings.
66 |
+
|
67 |
+
## 2. Run the add_course_workflow.py script
|
68 |
|
69 |
```bash
|
70 |
+
uv run -m data.scraping_scripts.add_course_workflow --course [COURSE_NAME]
|
71 |
```
|
72 |
|
73 |
+
example:
|
74 |
+
|
75 |
+
```bash
|
76 |
+
uv run -m data.scraping_scripts.add_course_workflow --course ai_for_business_professionals
|
77 |
+
```
|
78 |
+
|
79 |
+
This script will guide you through the complete process, it will:
|
80 |
+
|
81 |
+
1. Extract the markdown content from each of the lessons and create a new JSONL file for the course
|
82 |
+
2. Download the JSONL files from the other courses
|
83 |
+
3. Prompt you to manually add URLs to the course content, inside the newly created JSONL file
|
84 |
+
4. Merge the course data into the main dataset
|
85 |
+
5. Add contextual information to document nodes
|
86 |
+
6. Create vector stores
|
87 |
+
7. Upload databases to HuggingFace
|
88 |
+
8. Update UI configuration
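
To make step 4 in the list above concrete: merging is conceptually just concatenating JSONL files. The workflow script handles this for you, so the following is only an illustration, and the file names in it are made up.

```python
import json
from pathlib import Path

# Illustrative only: "merging" JSONL datasets is conceptually just concatenating records.
inputs = [
    Path("data/ai_for_business_professionals_data.jsonl"),
    Path("data/another_course_data.jsonl"),  # hypothetical second source
]

with open("data/merged_data.jsonl", "w", encoding="utf-8") as out:  # hypothetical output name
    for path in inputs:
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                out.write(json.dumps(json.loads(line)) + "\n")
```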

## 3. Add URLs to the course content

- After the script has processed the markdown files, it will prompt you to manually add URLs to the course content.
- Answer "no" to the question "Have you added all the URLs?"
- Open the `data/ai_for_business_professionals_data.jsonl` file in a text editor and open the live course page in the browser.
- If the JSON looks split into multiple wrapped lines in VS Code / Cursor, you can toggle word wrap off.
  - macOS: press ⌥ Option + Z
  - Windows/Linux: press Alt + Z

- For each lesson in the course on [academy.towardsai.net](https://academy.towardsai.net/courses/take/agent-engineering/multimedia/67469692-lesson-1-part-1-the-ai-engineer-the-agent-landscape), copy the URL and add it to the `url` field of the corresponding lesson in the JSONL file.

**Note:** While you do this, now is also the time to clean up the .jsonl file: remove any lines/lessons that should not be added to the RAG chatbot, for example "Course Overview", "Course Structure", "Course Outline", "Quiz", "Assignments", etc. A simple approach is to add URLs for all the lessons you want to keep and, when done, remove all the JSON lines that still have an empty `url` field.
+
## 4. Once done, run the script again and answer "yes" to the question "Have you added all the URLs?"
|
106 |
+
|
107 |
+
```bash
|
108 |
+
uv run -m data.scraping_scripts.add_course_workflow --course ai_for_business_professionals
|
109 |
+
```
|
110 |
|
111 |
+
----
|
|
|
|
|
112 |
|
113 |
+
## Second workflow: Updating Documentation via GitHub API
|
114 |
|
115 |
To update library documentation from GitHub repositories:
|
116 |
|
117 |
```bash
|
118 |
+
uv run -m data.scraping_scripts.update_docs_workflow
|
119 |
```
|
120 |
|
121 |
This will update all supported documentation sources. You can also specify specific sources:
|
122 |
|
123 |
```bash
|
124 |
+
uv run -m data.scraping_scripts.update_docs_workflow --sources transformers peft langchain
|
125 |
```
|
126 |
|
127 |
The workflow includes:

@@ -88,8 +170,3 @@ If you need to run specific steps individually:
 - If the PKL file exists, the `--new-context-only` flag will only process new content
 - You must have proper HuggingFace credentials with access to the private repository
 
-6. Make sure you have the required environment variables set:
-   - `OPENAI_API_KEY` for LLM processing
-   - `COHERE_API_KEY` for embeddings
-   - `HF_TOKEN` for HuggingFace uploads
-   - `GITHUB_TOKEN` for accessing documentation via the GitHub API

data/scraping_scripts/add_course_workflow.py
CHANGED
@@ -13,7 +13,7 @@
 7. Update UI configuration
 
 Usage:
-    python add_course_workflow
+    uv run python -m data.scraping_scripts.add_course_workflow --course [COURSE_NAME]
 
 Additional flags to run specific steps (if you want to restart from a specific point):
     --skip-process-md    Skip the markdown processing step

data/scraping_scripts/process_md_files.py
CHANGED
@@ -170,6 +170,19 @@ SOURCE_CONFIGS = {
         "included_root_files": [],
         "url_extension": "",
     },
+    "ai_for_business_professionals": {
+        "base_url": "",
+        "input_directory": "data/ai_for_business_professionals",  # Path to the directory that contains the Markdown files
+        "output_file": "data/ai_for_business_professionals_data.jsonl",  # The output file that will be created by the script
+        "source_name": "ai_for_business_professionals",
+        "use_include_list": False,
+        "included_dirs": [],
+        "excluded_dirs": [],
+        "excluded_root_files": [],
+        "included_root_files": [],
+        "url_extension": "",
+    },
+
 }
 
 
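
After editing `SOURCE_CONFIGS`, a quick way to confirm the new entry is picked up is a one-off check like the one below, assuming the module imports cleanly when run from the repository root:

```python
# Quick sanity check that the new course entry is registered in SOURCE_CONFIGS.
from data.scraping_scripts.process_md_files import SOURCE_CONFIGS

config = SOURCE_CONFIGS["ai_for_business_professionals"]
assert config["source_name"] == "ai_for_business_professionals"
print(f"Markdown read from: {config['input_directory']}")
print(f"JSONL written to:   {config['output_file']}")
```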