Update web.py
--- a/web.py
+++ b/web.py
@@ -442,15 +442,13 @@ def web_data():
 After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
 This step removes over 60% of the whole data.
 """),
-Details(
-Summary("Sample documents that are classified as non-English"),
-DV("data/sample_non_en.json", 3),
-),
 
-
-
-
-
+
+DV("data/sample_non_en.json", 3, "Sample documents that are classified as non-English"),
+
+
+DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
+
 
 H4("1.3 URL Filtering"),
 P("""
@@ -483,10 +481,9 @@ def web_data():
 "curated url domains that are excluded from our dataset",
 ),
 
-
-
-
-),
+
+DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+
 H3("2. Line-Level Removal"),
 P("""
 Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
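
For context, the three sample files referenced in this diff correspond to three filter outcomes: non-English, English with a score below 0.65, and URL in the curated exclusion list. The following is a minimal sketch of how such bucketing might look, not the code used by web.py or the actual pipeline. The function names `language_bucket` and `url_excluded`, the `blocked_domains` set, and the subdomain-matching rule are all hypothetical. The sketch assumes the language identifier returns a fastText-style label such as `__label__en` together with a confidence score, and that English documents scoring at or above the threshold are kept.

```python
from urllib.parse import urlparse

LANG_THRESHOLD = 0.65  # threshold quoted in the page text


def language_bucket(label: str, score: float, threshold: float = LANG_THRESHOLD) -> str:
    """Map a (label, score) prediction to a hypothetical bucket name."""
    # fastText language-ID labels carry a "__label__" prefix, e.g. "__label__en".
    lang = label.removeprefix("__label__")
    if lang != "en":
        return "non_en"   # assumed to match data/sample_non_en.json
    if score < threshold:
        return "en_low"   # assumed to match data/sample_en_low.json
    return "en_keep"      # assumption: kept for later filtering steps


def url_excluded(url: str, blocked_domains: set[str]) -> bool:
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: a blocked domain also blocks its subdomains.
    return any(host == d or host.endswith("." + d) for d in blocked_domains)
```

A document would then be dropped if `language_bucket(...)` returns anything other than `"en_keep"`, or if `url_excluded(...)` is true for its source URL.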