# Python scripts for adding new courses and updating documentation

## First workflow: Adding a New Course

Make sure you have the required environment variables set:

- `OPENAI_API_KEY` for LLM processing
- `COHERE_API_KEY` for embeddings
- `HF_TOKEN` for HuggingFace uploads and downloads
- `GITHUB_TOKEN` for accessing files via the GitHub API
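
A quick way to set these for the current shell session (the values below are placeholders for your own keys):

```bash
export OPENAI_API_KEY="sk-..."   # LLM processing
export COHERE_API_KEY="..."      # embeddings
export HF_TOKEN="hf_..."         # HuggingFace uploads and downloads
export GITHUB_TOKEN="ghp_..."    # GitHub API access
```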

## 1. Prepare the course data

0. You must have access to the live course you want to add:

   - [academy.towardsai.net](https://academy.towardsai.net/courses/take/agent-engineering/multimedia/67469692-lesson-1-part-1-the-ai-engineer-the-agent-landscape)

1. In Notion, navigate to the main course page that contains the live lessons:

   - e.g. [Notion page](https://www.notion.so/seldonia/AI-for-Business-Professionals-190f9b6f42708087863df100e9b4b556)

2. Click on the three dots in the top right corner and select "Export"

3. Select these options:
   - Export format: "Markdown & CSV"
   - Include databases: current view
   - Include content: Everything
   - Include subpages: yes
   - Create folders for subpages: yes

4. Click on "Export"

5. Once the export is complete, unzip the file.

6. Move the unzipped folder into the `ai-tutor-app/data` directory

7. Rename the folder to the course name.
   - e.g. `ai_for_business_professionals`

8. Open the `data/scraping_scripts/process_md_files.py` Python file and locate the `SOURCE_CONFIGS` dictionary.

9. Add the new course to the `SOURCE_CONFIGS` dictionary.

   example:

   ```python
      "ai_for_business_professionals": {
         "base_url": "",
         "input_directory": "data/ai_for_business_professionals",  # Relative path to the directory that contains the Markdown files
         "output_file": "data/ai_for_business_professionals_data.jsonl", # The output file that will be created by the script
         "source_name": "ai_for_business_professionals",
         "use_include_list": False,
         "included_dirs": [],
         "excluded_dirs": [],
         "excluded_root_files": [],
         "included_root_files": [],
         "url_extension": "",
      },
   ```

   - The most important fields are:
      - `input_directory`: the relative path to the directory that contains the Markdown files
      - `output_file`: the name of the output file that the script will create
      - `source_name`: the name of the course (use underscores, no spaces)
   - The other fields can stay as empty lists or empty strings.
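
Before running the workflow, it can help to sanity-check that the new config entry points at real files. This is an illustrative sketch, not a script from the repository; the course name and path match the example above:

```python
from pathlib import Path

course = "ai_for_business_professionals"
input_dir = Path(f"data/{course}")
assert input_dir.is_dir(), f"{input_dir} does not exist - check the folder name"

# The Notion export should contain at least one Markdown file per lesson.
md_files = list(input_dir.rglob("*.md"))
assert md_files, f"No .md files found under {input_dir}"
print(f"Found {len(md_files)} Markdown files for '{course}'")
```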

## 2. Run the add_course_workflow.py script

```bash
uv run -m data.scraping_scripts.add_course_workflow --course [COURSE_NAME]
```

example:

```bash
uv run -m data.scraping_scripts.add_course_workflow --course ai_for_business_professionals
```

This script guides you through the complete process. It will:

1. Extract the Markdown content from each lesson and create a new JSONL file for the course
2. Download the JSONL files for the other courses
3. Prompt you to manually add URLs to the course content, inside the newly created JSONL file
4. Merge the course data into the main dataset
5. Add contextual information to document nodes
6. Create vector stores
7. Upload databases to HuggingFace
8. Update the UI configuration

## 3. Add URLs to the course content

- After the script has processed the markdown files, it will prompt you to manually add URLs to the course content.
- Answer "no" to the question "Have you added all the URLs?"
- Open the `data/ai_for_business_professionals_data.jsonl` file in a text editor and open the live course page in the browser.
- If the JSON looks split into multiple wrapped lines in VS Code / Cursor, you can toggle word wrap off.
  - macOS: press ⌥ Option + Z
  - Windows/Linux: press Alt + Z

- For each lesson in the course [academy.towardsai.net](https://academy.towardsai.net/courses/take/agent-engineering/multimedia/67469692-lesson-1-part-1-the-ai-engineer-the-agent-landscape), copy the URL, and add it to the `url` field in the JSONL file, in the corresponding lesson.

**Note:** While you do this, now is also the time to clean up the `.jsonl` file: remove any lines/lessons that should not be added to the RAG chatbot,
e.g. "Course Overview", "Course Structure", "Course Outline", "Quiz", "Assignments", etc.
A simple approach is to add the URLs for all the lessons you do want in the RAG chatbot and, when done, remove every JSON line that still has an empty `url` field, as in the sketch below.
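
A minimal sketch of that cleanup, assuming each line of the file is a standalone JSON object with a `url` field (the file name matches the example above; adjust it to your course):

```python
import json
from pathlib import Path

path = Path("data/ai_for_business_professionals_data.jsonl")

kept = []
for line in path.read_text(encoding="utf-8").splitlines():
    if not line.strip():
        continue
    record = json.loads(line)
    # Drop lessons that never received a URL ("Quiz", "Course Outline", etc.).
    if (record.get("url") or "").strip():
        kept.append(line)

path.write_text("\n".join(kept) + "\n", encoding="utf-8")
print(f"Kept {len(kept)} lessons that have URLs")
```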

## 4. Once done, run the script again and answer "yes" to "Have you added all the URLs?"

```bash
uv run -m data.scraping_scripts.add_course_workflow --course ai_for_business_professionals
```

----

## Second workflow: Updating Documentation via GitHub API

To update library documentation from GitHub repositories:

```bash
uv run -m data.scraping_scripts.update_docs_workflow
```

This will update all supported documentation sources. You can also restrict the run to specific sources:

```bash
uv run -m data.scraping_scripts.update_docs_workflow --sources transformers peft langchain
```

The workflow includes:

1. Downloading documentation from GitHub using the API
2. Processing markdown files to create JSONL data
3. Adding contextual information to document nodes
4. Creating vector stores
5. Uploading vector db and new JSONL files to HuggingFace

## Individual Components

If you need to run specific steps individually:

- **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
- **Process Markdown**: `process_md_files.py`
- **Add Context**: `add_context_to_nodes.py`
- **Create Vector Stores**: `create_vector_stores.py`
- **Upload Chroma Vector Store to HuggingFace**: `upload_dbs_to_hf.py`
- **Upload JSONL files to HuggingFace**: `upload_data_to_hf.py`
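
The exact command-line arguments of each script are not documented here, so the invocation below is an assumption modeled on the workflow commands above; check each script's own argument parsing before relying on it:

```bash
# Assumed module-style invocation, mirroring the workflow scripts.
uv run -m data.scraping_scripts.process_md_files
uv run -m data.scraping_scripts.create_vector_stores
```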

## Tips for New Team Members

1. To update the AI Tutor with new content:
   - For new courses, use `add_course_workflow.py`
   - For updated documentation, use `update_docs_workflow.py`

2. When adding URLs to course content:
   - Get the URLs from the live course platform
   - Add them to the generated JSONL file in the `url` field
   - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
   - Make sure every document has a valid URL

3. By default, only new content will have context added, to save time and resources:
   - Use `--process-all-context` only if you need to regenerate context for all documents.
   - Use `--skip-data-upload` if you don't want to upload data files to the private HuggingFace repo (they are uploaded by default).
   - An example invocation combining these flags is shown after this list.

4. When adding a new course, verify that it appears in the Gradio UI:
   - The workflow automatically updates `main.py` and `setup.py` to include the new source
   - Check that the new source appears in the dropdown menu in the UI
   - Make sure it's properly included in the default selected sources
   - Restart the Gradio app to see the changes

5. First-time setup or missing files:
   - Both workflows automatically check for and download the required data files:
     - `all_sources_data.jsonl` - Contains the raw document data
     - `all_sources_contextual_nodes.pkl` - Contains the processed nodes with added context
   - If the PKL file exists, the `--new-context-only` flag will only process new content
   - You must have proper HuggingFace credentials with access to the private repository; a quick way to verify this is sketched below
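
As an illustration of tip 3, a re-run that regenerates context for every document but skips the data upload could look like the command below. The flags come from the tip above; that `add_course_workflow` accepts both is an assumption, so check the script's argument parser:

```bash
uv run -m data.scraping_scripts.add_course_workflow \
  --course ai_for_business_professionals \
  --process-all-context \
  --skip-data-upload
```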
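
For tip 5, a quick way to confirm that your HuggingFace credentials work before starting a long run, using the standard `huggingface_hub` client (the private repository itself is project-specific and not named here):

```python
import os

from huggingface_hub import HfApi

# whoami() raises an authentication error if HF_TOKEN is missing or invalid.
api = HfApi(token=os.environ.get("HF_TOKEN"))
user = api.whoami()
print(f"Authenticated to HuggingFace as: {user['name']}")
```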