omarsol committed
Commit 0933e3e · Parent(s): 0b1b256

Update README for scraping scripts

Files changed (1): data/scraping_scripts/README.md (+6, -15)
data/scraping_scripts/README.md CHANGED
@@ -14,7 +14,7 @@ python add_course_workflow.py --course [COURSE_NAME]
 
 This will guide you through the complete process:
 
-1. Process markdown files from Notion exports
+1. Process markdown files from the Notion export
 2. Prompt you to manually add URLs to the course content
 3. Merge the course data into the main dataset
 4. Add contextual information to document nodes
@@ -26,7 +26,7 @@ This will guide you through the complete process:
 
 - The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS`
 - Course markdown files must be placed in the directory specified in the configuration
-- You must have access to the live course platform to add URLs
+- You must have access to the live course platform to add URLs `https://academy.towardsai.net/enrollments`
 
 ### 2. Updating Documentation via GitHub API
 
@@ -48,17 +48,7 @@ The workflow includes:
 2. Processing markdown files to create JSONL data
 3. Adding contextual information to document nodes
 4. Creating vector stores
-5. Uploading databases to HuggingFace
-
-### 3. Uploading JSONL to HuggingFace
-
-To upload the main JSONL file to a private HuggingFace repository:
-
-```bash
-python upload_jsonl_to_hf.py
-```
-
-This is useful for sharing the latest data with team members.
+5. Uploading vector db and new JSONL files to HuggingFace
 
 ## Individual Components
 
@@ -66,9 +56,10 @@ If you need to run specific steps individually:
 
 - **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
 - **Process Markdown**: `process_md_files.py`
-- **Add Context**: `add_context_to_nodes.py`
+- **Add Context**: `add_context_to_nodes.py`
 - **Create Vector Stores**: `create_vector_stores.py`
-- **Upload to HuggingFace**: `upload_dbs_to_hf.py`
+- **Upload to Chroma Vector Store to HuggingFace**: `upload_dbs_to_hf.py`
+- **Upload JSONL files to HuggingFace**: `upload_jsonl_to_hf.py`
 
 ## Tips for New Team Members
 
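For context on step 3 of the workflow ("Merge the course data into the main dataset"), a minimal sketch of what merging course JSONL records into a main JSONL file can look like. This is an illustration only, not the repository's implementation: the file names and the `doc_id` deduplication key are assumptions.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def merge_jsonl(main_path: Path, course_path: Path) -> int:
    """Merge course records into the main JSONL dataset.

    Keeps the first occurrence of each record, keyed by 'doc_id'
    (an assumed field name), and rewrites the main file in place.
    Returns the number of records in the merged dataset.
    """
    seen = set()
    records = []
    for path in (main_path, course_path):
        if not path.exists():
            continue
        for line in path.read_text().splitlines():
            rec = json.loads(line)
            key = rec.get("doc_id")
            if key in seen:
                continue  # skip duplicates already present in the main file
            seen.add(key)
            records.append(rec)
    main_path.write_text("\n".join(json.dumps(r) for r in records) + "\n")
    return len(records)

# Demo with throwaway files (hypothetical names, not from the repo)
with TemporaryDirectory() as d:
    main = Path(d) / "all_sources_data.jsonl"
    course = Path(d) / "course_data.jsonl"
    main.write_text('{"doc_id": "a", "content": "existing"}\n')
    course.write_text(
        '{"doc_id": "a", "content": "duplicate"}\n'
        '{"doc_id": "b", "content": "new"}\n'
    )
    n = merge_jsonl(main, course)
    print(n)  # → 2 (the duplicate "a" record is skipped)
```

A real merge would also need to decide whether course records should *replace* same-key records in the main dataset rather than being skipped; that choice depends on which side is considered authoritative.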