Update web.py
web.py CHANGED

@@ -476,25 +476,38 @@ def web_data():
             P("""
                 We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
             """),
+            Details(
+                Summary("6 url domains that are removed from the blocklist"),
+                DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
+            ),
 
-            DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
-
-            DV(
+            Details(
+                Summary("Sample documents whose urls are blocked by the refined url blocklist"),
+                DV(
                 "data/bad_url_doc.jsonl",
                 3,
                 "Sample documents whose urls are blocked by the refined url blocklist",
+                ),
             ),
+
             H5("1.3.2 Excluded High Quality Sources"),
             P("""
                 To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
             """),
-            DVS(
-                non_web_urls,
-                "curated url domains that are excluded from our dataset",
-            ),
 
+            Details(
+                Summary("curated url domains that are excluded from our dataset"),
+                DVS(
+                    non_web_urls,
+                    "curated url domains that are excluded from our dataset",
+                ),
+            ),
 
-
+            Details(
+                Summary("Sample documents whose urls are in our curated url domain list"),
+                DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+            ),
+
 
             H3("2. Line-Level Removal"),
             P("""
@@ -510,11 +523,17 @@ def web_data():
                 of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
                 documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
             """),
-            DV(
+
+            Details(
+                Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
+                DV(
                 "data/sample_terminal_punc.json",
                 0,
                 "Sample documents with lines that are removed by the rule of terminal punctuation",
+                ),
             ),
+
+
             H4('2.1 Word "Javascript"'),
             P("""
                 In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
@@ -523,10 +542,13 @@ def web_data():
                 propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
                 The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
             """),
-            DV(
-                "data/sample_java.jsonl",
-                0,
-                "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
+            Details(
+                Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
+                DV(
+                    "data/sample_java.jsonl",
+                    0,
+                    "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
+                ),
             ),
             H4("2.2 Other Rules from RefinedWeb"),
             P("""
@@ -536,10 +558,13 @@ def web_data():
                 - The line matches the pattern “r'^\\d+\\s+likes$'”,
                 - The line contains only one word.
             """),
-            DV(
-                "data/sample_refinedweb_line.json",
-                0,
-                "Sample documents with lines that are removed by the RefinedWeb rules",
+            Details(
+                Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
+                DV(
+                    "data/sample_refinedweb_line.json",
+                    0,
+                    "Sample documents with lines that are removed by the RefinedWeb rules",
+                ),
             ),
             H4("2.3 Toxic Lines"),
             P("""
@@ -549,10 +574,14 @@ def web_data():
                 line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
                 the bad words from English but also consider the bad words from other languages.
             """),
-            DVS(
-                json.load(open("data/toxic_lines.json")),
-                "Sample documents with toxic lines",
+            Details(
+                Summary("Sample documents with toxic lines"),
+                DVS(
+                    json.load(open("data/toxic_lines.json")),
+                    "Sample documents with toxic lines",
+                ),
             ),
+
             H3("3. Document-Level Filtering"),
             P("""
                 In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
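The page above wires up sample viewers for two of the line-level rules it describes: the refined "javascript" rule (section 2.1) and the RefinedWeb rules (section 2.2). A minimal sketch of what such a line filter could look like follows; the function name, the substring-based keyword matching, and the sample lines are assumptions for illustration, not the project's actual pipeline code:

```python
import re

# Per section 2.1, "javascript" alone is not enough to drop a line; one of
# these co-occurring keywords must also be present (matching details assumed).
JS_KEYWORDS = ("enable", "disable", "require", "activate", "browser")
# RefinedWeb-style counter lines such as "12 likes" (section 2.2).
LIKES_RE = re.compile(r"^\d+\s+likes$")

def drop_line(line: str) -> bool:
    """Return True if a line should be removed by the line-level rules."""
    lowered = line.strip().lower()
    # Refined C4 javascript rule: keyword must co-occur with "javascript".
    if "javascript" in lowered and any(k in lowered for k in JS_KEYWORDS):
        return True
    # RefinedWeb rules: "N likes" lines and one-word lines.
    if LIKES_RE.match(lowered):
        return True
    if lowered and len(lowered.split()) == 1:
        return True
    return False

doc = [
    "Please enable JavaScript to view the comments.",
    "JavaScript closures capture variables by reference.",
    "12 likes",
    "Menu",
    "This paragraph has several words and survives the filter.",
]
kept = [line for line in doc if not drop_line(line)]
```

Note how the second line is kept even though it mentions "JavaScript": that is exactly the false positive the added keyword is meant to avoid.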
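The positional toxic-line rule of section 2.3 (a flagged line is removed only when it sits in the first 3 or last 3 lines of a document) could be sketched like this; the bad-word entries are placeholders, not the project's actual multilingual lists:

```python
# Placeholder multilingual bad-word set; the real pipeline loads its word
# lists (English plus other languages) from data files.
BAD_WORDS = {"badword", "schimpfwort"}

def remove_toxic_lines(lines: list[str], margin: int = 3) -> list[str]:
    """Drop lines containing a bad word, but only when the line is within
    the first `margin` or last `margin` lines of the document."""
    n = len(lines)
    kept = []
    for i, line in enumerate(lines):
        at_edge = i < margin or i >= n - margin
        tokens = set(line.lower().split())
        if at_edge and tokens & BAD_WORDS:
            continue  # toxic line at the document boundary: remove it
        kept.append(line)
    return kept
```

The edge restriction reflects the observation in the page that boilerplate toxicity tends to cluster at the start and end of scraped documents, so a bad word in the middle of a long document does not by itself remove the line.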