omwdataset

victormiller committed on Sep 26, 2024

Commit

073687e

verified

1 Parent(s): 1e4c4a2

Update web.py

Files changed (1)

web.py CHANGED

@@ -442,15 +442,13 @@ def web_data():
         After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
         This step removes over 60% of the whole data.
         """),
-        Details(
-                Summary("Sample documents that are classified as non-English"),
-                DV("data/sample_non_en.json", 3),
-            ),
-        Details(
-                Summary("Sample documents that are classified as English but with score less than 0.65"),
-                DV("data/sample_en_low.json", 3),
-            ),
+        DV("data/sample_non_en.json", 3, "Sample documents that are classified as non-English"),
+        DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
         H4("1.3 URL Filtering"),
         P("""
@@ -483,10 +481,9 @@ def web_data():
             "curated url domains that are excluded from our dataset",
         ),
-        Details(
-                Summary("Sample documents whose urls are in our curated url domain list"),
-                DV("data/sample_url_exclusion.json", 0),
-            ),
+        DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
         H3("2. Line-Level Removal"),
         P("""
         Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
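
Note on the change: each `Details(Summary(...), DV(path, n))` wrapper is replaced by a single `DV(path, n, title)` call, which suggests `DV` now takes the summary text as a third argument and builds the collapsible block itself. The definition of `DV` is not part of this commit, so the following is only a minimal sketch under that assumption. `render_viewer`, the string-based `Details` and `Summary` stand-ins, and the optional `title` default are all hypothetical and are not taken from the repository.

```python
from html import escape
from typing import Optional


# Hypothetical stand-ins for the UI helpers used in web.py. The real
# Details/Summary come from the app's UI library and are not shown in the commit.
def Summary(text: str) -> str:
    return f"<summary>{escape(text)}</summary>"


def Details(*children: str) -> str:
    return "<details>" + "".join(children) + "</details>"


def render_viewer(path: str, index: int) -> str:
    # Placeholder for whatever the real DV renders for a JSON sample file.
    return f'<div class="viewer" data-src="{escape(path)}" data-index="{index}"></div>'


def DV(path: str, index: int, title: Optional[str] = None) -> str:
    """Assumed shape of the updated helper (not the repository's code).

    With a title, the viewer is wrapped in a collapsible <details> block;
    without one, the bare viewer is returned, as in the old two-argument call.
    """
    viewer = render_viewer(path, index)
    if title is None:
        return viewer
    return Details(Summary(title), viewer)


html = DV(
    "data/sample_non_en.json",
    3,
    "Sample documents that are classified as non-English",
)
print(html)
```

If `DV` works this way, call sites stay on one line and the collapsible markup lives in a single place, which matches the direction of the diff.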