Update web.py
web.py CHANGED

@@ -476,25 +476,38 @@ def web_data():
             P("""
                 We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
             """),
+            Details(
+                Summary("6 url domains that are removed from the blocklist"),
+                DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
+            ),
 
-            DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
-
-            DV(
+            Details(
+                Summary("Sample documents whose urls are blocked by the refined url blocklist"),
+                DV(
                 "data/bad_url_doc.jsonl",
                 3,
                 "Sample documents whose urls are blocked by the refined url blocklist",
+                ),
             ),
+
             H5("1.3.2 Excluded High Quality Sources"),
             P("""
                 To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
             """),
-            DVS(
-                non_web_urls,
-                "curated url domains that are excluded from our dataset",
-            ),
 
+            Details(
+                Summary("curated url domains that are excluded from our dataset"),
+                DVS(
+                    non_web_urls,
+                    "curated url domains that are excluded from our dataset",
+                ),
+            ),
 
-
+            Details(
+                Summary("Sample documents whose urls are in our curated url domain list"),
+                DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+            ),
+
 
             H3("2. Line-Level Removal"),
             P("""
@@ -510,11 +523,17 @@ def web_data():
                 of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
                 documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
             """),
-            DV(
+
+            Details(
+                Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
+                DV(
                 "data/sample_terminal_punc.json",
                 0,
                 "Sample documents with lines that are removed by the rule of terminal punctuation",
+                ),
             ),
+
+
             H4('2.1 Word "Javascript"'),
             P("""
                 In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
@@ -523,10 +542,13 @@ def web_data():
                 propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
                 The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
             """),
-            DV(
-                "data/sample_java.jsonl",
-                0,
-                "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
+            Details(
+                Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
+                DV(
+                    "data/sample_java.jsonl",
+                    0,
+                    "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
+                ),
             ),
             H4("2.2 Other Rules from RefinedWeb"),
             P("""
@@ -536,10 +558,13 @@ def web_data():
                 - The line matches the pattern “r'^\\d+\\s+likes$'”,
                 - The line contains only one word.
             """),
-            DV(
-                "data/sample_refinedweb_line.json",
-                0,
-                "Sample documents with lines that are removed by the RefinedWeb rules",
+            Details(
+                Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
+                DV(
+                    "data/sample_refinedweb_line.json",
+                    0,
+                    "Sample documents with lines that are removed by the RefinedWeb rules",
+                ),
             ),
             H4("2.3 Toxic Lines"),
             P("""
@@ -549,10 +574,14 @@ def web_data():
                 line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
                 the bad words from English but also consider the bad words from other languages.
             """),
-            DVS(
-                json.load(open("data/toxic_lines.json")),
-                "Sample documents with toxic lines",
+            Details(
+                Summary("Sample documents with toxic lines"),
+                DVS(
+                    json.load(open("data/toxic_lines.json")),
+                    "Sample documents with toxic lines",
+                ),
             ),
+
             H3("3. Document-Level Filtering"),
             P("""
                 In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
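The page above wires up sample viewers for two of the line-level rules it describes: the refined "javascript" rule (section 2.1) and the RefinedWeb rules (section 2.2). A minimal sketch of what such a line filter could look like follows; the function name, the substring-based keyword matching, and the sample lines are assumptions for illustration, not the project's actual pipeline code:

```python
import re

# Per section 2.1, "javascript" alone is not enough to drop a line; one of
# these co-occurring keywords must also be present (matching details assumed).
JS_KEYWORDS = ("enable", "disable", "require", "activate", "browser")
# RefinedWeb-style counter lines such as "12 likes" (section 2.2).
LIKES_RE = re.compile(r"^\d+\s+likes$")

def drop_line(line: str) -> bool:
    """Return True if a line should be removed by the line-level rules."""
    lowered = line.strip().lower()
    # Refined C4 javascript rule: keyword must co-occur with "javascript".
    if "javascript" in lowered and any(k in lowered for k in JS_KEYWORDS):
        return True
    # RefinedWeb rules: "N likes" lines and one-word lines.
    if LIKES_RE.match(lowered):
        return True
    if lowered and len(lowered.split()) == 1:
        return True
    return False

doc = [
    "Please enable JavaScript to view the comments.",
    "JavaScript closures capture variables by reference.",
    "12 likes",
    "Menu",
    "This paragraph has several words and survives the filter.",
]
kept = [line for line in doc if not drop_line(line)]
```

Note how the second line is kept even though it mentions "JavaScript": that is exactly the false positive the added keyword is meant to avoid.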
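The positional toxic-line rule of section 2.3 (a flagged line is removed only when it sits in the first 3 or last 3 lines of a document) could be sketched like this; the bad-word entries are placeholders, not the project's actual multilingual lists:

```python
# Placeholder multilingual bad-word set; the real pipeline loads its word
# lists (English plus other languages) from data files.
BAD_WORDS = {"badword", "schimpfwort"}

def remove_toxic_lines(lines: list[str], margin: int = 3) -> list[str]:
    """Drop lines containing a bad word, but only when the line is within
    the first `margin` or last `margin` lines of the document."""
    n = len(lines)
    kept = []
    for i, line in enumerate(lines):
        at_edge = i < margin or i >= n - margin
        tokens = set(line.lower().split())
        if at_edge and tokens & BAD_WORDS:
            continue  # toxic line at the document boundary: remove it
        kept.append(line)
    return kept
```

The edge restriction reflects the observation in the page that boilerplate toxicity tends to cluster at the start and end of scraped documents, so a bad word in the middle of a long document does not by itself remove the line.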