We still have data.

Published September 8, 2025

Within FinePDF's dataset card, there's a claim which appears to be disingenuous:

As we run out of web pages to process...

We are clearly not running out of web pages to process for 4 simple reasons:

Not everything is... On CommonCrawl

Each crawl done at CommonCrawl is limited by seed URLs, cost and time.

As detailed in this 2016 Google Groups thread, there are a few things we can glean:

  1. Seed URLs come from Blekko (supposedly defunct, acquired by IBM) and moz.com. Both are good sources to start from, but they may be showing their age.
  2. CommonCrawl does not guarantee a complete crawl of a particular domain.
  3. There's a 65% overlap in CommonCrawl from one snapshot to the next.
  4. There's a monthly compute budget of roughly 12 days across 100 VMs*

*A 2016 figure. The most recent figure I can source is 20 nodes of EC2 spot instances.

As such, it isn't really possible for CommonCrawl to crawl a web domain in depth. You can verify this for yourself against the public index, as sketched below.
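If you want to check a domain's coverage yourself, CommonCrawl's public CDX index API lists every capture of a URL pattern in a given snapshot. Here's a minimal sketch in Python; the snapshot ID and domain are placeholders, and result paging for large domains is skipped for brevity:

```python
# Minimal sketch: count how many pages of a domain one CommonCrawl snapshot
# actually captured, via the public CDX index API.
# Note: the API returns HTTP 404 if there are no captures at all, which
# this sketch doesn't handle.
import json
import urllib.parse
import urllib.request

SNAPSHOT = "CC-MAIN-2024-33"   # placeholder; pick any ID from https://index.commoncrawl.org/
DOMAIN = "example.com"         # placeholder domain to check

query = urllib.parse.urlencode({
    "url": f"{DOMAIN}/*",      # everything under the domain
    "output": "json",          # one JSON record per line
})
url = f"https://index.commoncrawl.org/{SNAPSHOT}-index?{query}"

with urllib.request.urlopen(url) as resp:
    records = [json.loads(line) for line in resp.read().splitlines() if line]

print(f"{len(records)} captures of {DOMAIN} in {SNAPSHOT}")
```

Compare that count against the number of pages you know the site actually has; for most deep domains the gap is large.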

And besides...

Not everything is... Indexed

Some websites choose not to list themselves on Google, or get shadowbanned from Google search results, whether for paywalls, adult content, DMCA takedowns, vibes, or any other reason one might prefer to pick.

Others spread through word of mouth or on semi-public channels like Discord, or potentially social media.

Not everything is... Accessible with a Generic Crawler

I'm sure most folks are aware of Cloudflare's Turnstile, Akamai's Bot Protection or Anubis's proof-of-work challenges. These challenges block the regular scraper from accessing content, but won't stop a more determined user who knows the details of a website from crawling it anyway.
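As a rough illustration, this is what a naive fetch looks like from the crawler's side. The status codes and body markers below are heuristic assumptions (providers change them regularly), not an official detection method:

```python
# Minimal sketch: heuristically detect whether a plain HTTP fetch is being
# served a bot challenge instead of the real page.
import urllib.error
import urllib.request

# Assumed markers -- not an exhaustive or official list.
CHALLENGE_MARKERS = (
    "/cdn-cgi/challenge-platform/",  # Cloudflare challenge script path
    "Just a moment",                 # Cloudflare interstitial title
)

def looks_blocked(url: str) -> bool:
    """Return True if a generic fetch appears to hit a bot challenge."""
    req = urllib.request.Request(url, headers={"User-Agent": "generic-crawler/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            status = resp.status
            body = resp.read(65536).decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        status = err.code
        body = err.read(65536).decode("utf-8", errors="replace")
    # A challenge usually shows up as 403/429/503 and/or a marker in the HTML.
    return status in (403, 429, 503) or any(m in body for m in CHALLENGE_MARKERS)

print(looks_blocked("https://example.com/"))
```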

Not everything is... On the Open Net

A fair number of websites don't even appear on regular TLDs like .moe or .org. Some require a specific browser (Tor Browser) to access .onion domains, or are distributed in other unique ways like I2P or IPFS.
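For completeness, here's what merely reaching a .onion page involves for a Python crawler: routing both the connection and hostname resolution through a local Tor SOCKS proxy. A minimal sketch, assuming a Tor daemon on the default port 9050 and `requests` installed with SOCKS support (`pip install requests[socks]`); the onion address is a placeholder:

```python
# Minimal sketch: fetching a .onion page through a local Tor SOCKS proxy.
# A generic crawler with no Tor access cannot resolve these domains at all.
import requests

TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h: resolve hostnames via Tor, required for .onion
ONION_URL = "http://someplaceholderaddress.onion/"  # placeholder, not a real service

resp = requests.get(
    ONION_URL,
    proxies={"http": TOR_PROXY, "https": TOR_PROXY},
    timeout=60,
)
print(resp.status_code, len(resp.text))
```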

While some onion search engines have been developed, there hasn't been a massive push within the AI community to mine them for data, though I believe that's for good reason.

Most onion links I've heard of point to content that is shady or outright illegal across a lot of jurisdictions, and that's content you should not let a language model know about.

Are we so back?

We are starting to run out of low-hanging-fruit data. Data cleaning is becoming increasingly important as we try to weed the nonsense in CommonCrawl down to text a human would actually want to read.

Plus, synthetic data has gotten reasonably good at capturing human preferences, well enough to be used in pretrained models. Yet there is merit in taking another look at raw web pages, as they still provide better variety.

In short: we are not running out of web text, only of easily filtered web pages. The SoTA model is going to be determined by depth, not breadth.

KaraKaraWitch is a Dataset Curator at featherless.ai. This article is not an official view of featherless.ai and should be taken as an opinion.
