Update README for scraping scripts
data/scraping_scripts/README.md

```bash
python add_course_workflow.py --course [COURSE_NAME]
```

This will guide you through the complete process:

1. Process markdown files from the Notion export
2. Prompt you to manually add URLs to the course content
3. Merge the course data into the main dataset
4. Add contextual information to document nodes
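
For example, with a hypothetical course key `intro_to_rag` (the name is purely illustrative; use whatever key you have configured):

```bash
python add_course_workflow.py --course intro_to_rag
```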

- The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS`
- Course markdown files must be placed in the directory specified in the configuration
- You must have access to the live course platform (`https://academy.towardsai.net/enrollments`) to add URLs
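
The authoritative schema for `SOURCE_CONFIGS` lives in `process_md_files.py`; the sketch below is only a guess at what an entry might look like, and every field name in it is an assumption:

```python
# Hypothetical SOURCE_CONFIGS entry -- field names are illustrative
# assumptions; check process_md_files.py for the real schema.
SOURCE_CONFIGS = {
    "intro_to_rag": {  # course key passed via --course
        "input_dir": "data/courses/intro_to_rag/",  # Notion markdown export location (assumed)
        "output_file": "data/intro_to_rag.jsonl",   # per-course JSONL output (assumed)
    },
}
```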

### 2. Updating Documentation via GitHub API

The workflow includes:

2. Processing markdown files to create JSONL data
3. Adding contextual information to document nodes
4. Creating vector stores
5. Uploading the vector DB and new JSONL files to HuggingFace
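
The final upload step maps onto the two upload scripts listed under "Individual Components" below; a minimal sketch, assuming both run with their default arguments:

```bash
# Upload the Chroma vector stores, then the main JSONL file, to HuggingFace
python upload_dbs_to_hf.py
python upload_jsonl_to_hf.py
```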

## Individual Components

If you need to run specific steps individually:

- **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
- **Process Markdown**: `process_md_files.py`
- **Add Context**: `add_context_to_nodes.py`
- **Create Vector Stores**: `create_vector_stores.py`
- **Upload Chroma Vector Store to HuggingFace**: `upload_dbs_to_hf.py`
- **Upload JSONL files to HuggingFace**: `upload_jsonl_to_hf.py`
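
A minimal sketch of chaining the components end to end in the order the workflows use them, assuming each script runs with default arguments (the real scripts may require flags, so check each one before running):

```bash
# Full pipeline, one script per step (default arguments assumed)
python github_to_markdown_ai_docs.py  # fetch docs from GitHub as markdown
python process_md_files.py            # process markdown into JSONL data
python add_context_to_nodes.py        # add contextual info to document nodes
python create_vector_stores.py        # build the vector stores
python upload_dbs_to_hf.py            # upload the Chroma vector store to HuggingFace
python upload_jsonl_to_hf.py          # upload the main JSONL file to HuggingFace
```
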
## Tips for New Team Members