omwdataset

victormiller committed on Oct 4, 2024

Commit b9b2095 (verified) · 1 Parent(s): 117a05e

Update web.py

Files changed (1)

web.py +20 -47

@@ -254,46 +254,14 @@ def web_data():
             Li("Local Deduplication", style = "margin-bottom: 5px"),
             Li("Each section is complete with code and comparisons to Dolma, DataTrove, and/or RedPajama-V-2", style = "margin-bottom: 5px"),
         ),
-        ),
-        Div(
-            H2("Common Crawl Data Processing Summary"),
-            P(
-                "To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Starting from ",
-                A("Common Crawl", href="https://commoncrawl.org/"),
-                ", our process can be summarized as five main steps: document preparation, line-level removal, document-level filtering, deduplication and PII removal.",
-            ),
-            style="margin-top: 20px;",
-        ),
-        Div(
-            Ul(
-                Li(
-                    A(
-                        "Raw Documentation",
-                        href="https://drive.google.com/drive/folders/1mIJ-Zx8tRhohFdj4ByMToNz1u_9Saa8W?usp=drive_link",
-                    )
-                ),
-                Li(
-                    A(
-                        "Github link of Web Data Pipeline",
-                        href="https://github.com/CIAI-LLM/WebDataProcessing.git",
-                    )
-                ),
-            ),
-            style="""
-            background-color: #d4edda; /* Light green background */
-            border: 1px solid #c3e6cb; /* Green border */
-            border-radius: 5px;
-            padding: 15px 15px 0px 15px;
-            margin-bottom: 15px
-        """,
         ),
         id="section1",),
         Section(
         H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
         P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
         table_div_filter_data,
-        P("The table below provides a comparison of the quality filters that have been applied to each dataset."),
         table_div_qf_filter_data,
         P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
         Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
@@ -408,7 +376,7 @@ def web_data():
         """),
         Details(
-            Summary("24 URL domains with more than 4k matches"),
             Div (
                 DVS(urls_high_matches, "24 URL domains with more than 4k matches"),
                 style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; "  # Styling for the DV2 part
@@ -425,7 +393,7 @@ def web_data():
         We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
         """),
         Details(
-            Summary("6 url domains that are removed from the blocklist"),
             Div (
                 DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
                 style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; "  # Styling for the DV2 part
@@ -439,7 +407,7 @@ def web_data():
         ),
         Details(
-            Summary("Sample documents whose urls are blocked by the refined url blocklist"),
             Div(
                 DV(
             "data/bad_url_doc.jsonl",
@@ -460,7 +428,7 @@ def web_data():
         """),
         Details(
-            Summary("curated url domains that are excluded from our dataset"),
             Div (
                 DVS(
                 non_web_urls,
@@ -477,7 +445,7 @@ def web_data():
         ),
         Details(
-            Summary("Sample documents whose urls are in our curated url domain list"),
             Div (
                 DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
                 style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; "  # Styling for the DV2 part
@@ -539,7 +507,7 @@ def web_data():
         The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
         """),
         Details(
-            Summary("Javascript Examples Filtered by C4 but Kept in TxT360"),
             Div (
                 DV(
                 "data/sample_java.jsonl",
@@ -589,7 +557,7 @@ def web_data():
         the bad words from English but also consider the bad words from other languages.
         """),
         Details(
-            Summary("Sample documents with toxic lines"),
             Div (
                 DVS(
                 json.load(open("data/toxic_lines.json")),
@@ -611,7 +579,7 @@ def web_data():
         In this section, we introduce each quality signal used to filter out low-quality documents.
         """),
         Details(
-            Summary("Overview of all the quality signals that are used for filtering"),
             Div (
                 DVS(
                 json.load(open("data/all_signals.json")),
@@ -732,7 +700,6 @@ def web_data():
         We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
         ensures consistency with the overall document character count calculation.
         """),
-        H3("TxT360 Implementation"),
         Details(
             Summary("TxT360 Implementation"),
             Div(
@@ -1153,9 +1120,6 @@ def web_data():
             margin-bottom: 15px
             """,
         ),
-        H5(
-            "Sample Documents Filtered by the Fraction of Characters in Duplicated N-grams (n=5,...,10)"
-        ),
         Details(
             Summary("Documents Filtered by Duplicated n-Grams (n=5,...,10)"),
             Div(
@@ -1300,13 +1264,22 @@ def web_data():
             Li("the words that contain at least one alphabetic character are less than 80% of the whole words", style = "margin-bottom: 5px"),
             Li("it contains less than two of the stop words (the, be, to, of, and, that, have, with)", style = "margin-bottom: 5px"),
         ),
-        H3("Word Count"),
         Details(
             Summary("Implementations from Dolma"),
             D_code("""
             words = text.split()
             word_count = len(words)
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from RedPajama-V2"),

             Li("Local Deduplication", style = "margin-bottom: 5px"),
             Li("Each section is complete with code and comparisons to Dolma, DataTrove, and/or RedPajama-V-2", style = "margin-bottom: 5px"),
         ),
+        P("To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Below is a comprehensive list of the datasets we reviewed and a comparison of the filters we applied."),
         ),
         id="section1",),
         Section(
         H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
         P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
         table_div_filter_data,
+        P("The table below provides a comparison of the quality filters that have been applied to each dataset. Of note, TxT360 does not use any machine learning (ML) based filters. ML filters are a useful and efficient filtering process that should be considered for any filtering project. However, we are leaving that option to TxT360's end users."),
         table_div_qf_filter_data,
         P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
         Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
         """),
         Details(
+            Summary("List of 24 URLs with 4k+ Matches"),
             Div (
                 DVS(urls_high_matches, "24 URL domains with more than 4k matches"),
                 style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; "  # Styling for the DV2 part
         We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
         """),
         Details(
+            Summary("6 URLs Manually Removed from the Blocklist"),
             Div (
                 DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
                 style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; "  # Styling for the DV2 part
         ),
         Details(
+            Summary("Blocked Document Examples from the URL Blocklist"),
             Div(
                 DV(
             "data/bad_url_doc.jsonl",
         """),
         Details(
+            Summary("TxT360 Excluded URLs"),
             Div (
                 DVS(
                 non_web_urls,
         ),
         Details(
+            Summary("TxT360 Excluded URLs Example Documents"),
             Div (
                 DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
                 style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; "  # Styling for the DV2 part
         The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
         """),
         Details(
+            Summary("Javascript Documents Filtered by C4 but Kept in TxT360"),
             Div (
                 DV(
                 "data/sample_java.jsonl",
         the bad words from English but also consider the bad words from other languages.
         """),
         Details(
+            Summary("Toxic Line Examples (WARNING: MAY CONTAIN OFFENSIVE MATERIAL)"),
             Div (
                 DVS(
                 json.load(open("data/toxic_lines.json")),
         In this section, we introduce each quality signal used to filter out low-quality documents.
         """),
         Details(
+            Summary("Quality Signals Used For Filtering"),
             Div (
                 DVS(
                 json.load(open("data/all_signals.json")),
         We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
         ensures consistency with the overall document character count calculation.
         """),
         Details(
             Summary("TxT360 Implementation"),
             Div(
             margin-bottom: 15px
             """,
         ),
         Details(
             Summary("Documents Filtered by Duplicated n-Grams (n=5,...,10)"),
             Div(
             Li("the words that contain at least one alphabetic character are less than 80% of the whole words", style = "margin-bottom: 5px"),
             Li("it contains less than two of the stop words (the, be, to, of, and, that, have, with)", style = "margin-bottom: 5px"),
         ),
+        H3("Word Count Filters"),
         Details(
+            Div(
             Summary("Implementations from Dolma"),
             D_code("""
             words = text.split()
             word_count = len(words)
             """, block="block", language="python"),
+            style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; "  # Styling for the DV2 part
+            ),
+            style="""
+            background-color: #EAFFF1; /* Light green background */
+            padding: 15px;
+            border-radius: 12px;
+            margin-bottom: 15px
+            """,
         ),
         Details(
             Summary("Implementations from RedPajama-V2"),
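
For readers following the word-level rules quoted in this diff (the alphabetic-word ratio, the stop-word rule, and the Dolma word count), here is a minimal sketch of how those checks could be computed. It is not the TxT360, Dolma, or RedPajama-V2 implementation: the function names and the `STOP_WORDS` constant are hypothetical, and two details are assumptions, namely that stop words are counted as distinct words and that surrounding punctuation is stripped before matching.

```python
import string

# Stop words exactly as listed in the diff; the constant name is an assumption.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def word_count(text: str) -> int:
    """Count whitespace-separated tokens, mirroring the Dolma snippet in the diff."""
    return len(text.split())


def alpha_word_fraction(text: str) -> float:
    """Fraction of words that contain at least one alphabetic character."""
    words = text.split()
    if not words:
        return 0.0
    with_alpha = sum(1 for w in words if any(c.isalpha() for c in w))
    return with_alpha / len(words)


def distinct_stop_words(text: str) -> int:
    """Number of distinct stop words present (assumption: case-insensitive,
    with leading and trailing punctuation stripped from each token)."""
    tokens = {w.strip(string.punctuation).lower() for w in text.split()}
    return len(STOP_WORDS & tokens)


def fails_word_checks(text: str) -> bool:
    """True if either rule from the list would flag the document."""
    return alpha_word_fraction(text) < 0.8 or distinct_stop_words(text) < 2
```

A document made only of numbers, such as `"12 34 56 78"`, is flagged by both rules, while ordinary English prose normally passes. Counting distinct stop words rather than total occurrences is one possible reading of "less than two of the stop words"; the source does not say which is intended.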