The Anthropic Ruling: Why AI Training Just Got Legal (But Piracy Didn't)

Community Article · Published June 24, 2025

It is a rare thing for a federal court to praise the virtues of plagiarism. Yet, in the age of artificial intelligence, what counts as theft and what as innovation is up for grabs. The latest ruling in the battle between authors and Anthropic is a case in point.

This week, Judge William Alsup of the Northern District of California handed Anthropic a victory. The company’s use of copyrighted books to train its Claude language model was, he ruled, fair use. The judge was unmoved by the authors’ pleas that their words were being devoured for profit. He wrote:

Like any reader aspiring to be a writer, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them - but to turn a hard corner and create something different.

The decision is a landmark for Silicon Valley, which has long argued that machine learning is more akin to study than to theft.

Yet the ruling is hardly an unalloyed triumph for the tech industry. Judge Alsup was scathing about Anthropic’s other habit: downloading millions of pirated books to stock its digital library. That, he said, is plain infringement. The company now faces a trial, and potentially billions in damages, for its cavalier approach to copyright. The message is clear: innovation doesn’t give you a free pass to pirate.

This is the first major ruling in a generative AI copyright case to address fair use in detail. While a district court decision isn't binding precedent, it makes the opinion a key reference point for the dozens of other AI copyright lawsuits working through the courts right now.

What Actually Happened

The ruling splits into three parts, and they're all fascinating:

  • ✅ Training AI models = Fair Use. "The purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative," the court said - like a human reader, the model studies works, learns patterns, then creates something new. Judge Alsup specifically rejected the argument that what humans do when reading and memorizing is meaningfully different from what computers do when training an LLM. Since Claude doesn't output exact copies, or even "one author's identifiable expressive style," this transformative use is legal.

  • ✅ Converting print to digital = Fair Use. Anthropic bought millions of physical books, scanned them into its digital library, then destroyed the originals. The judge said this format conversion for storage and searchability was fine - you're just changing the container, not duplicating the content.

  • ❌ Downloading pirated books = Not Fair Use. Here's where it could get expensive. Anthropic co-founder Ben Mann downloaded the entire Books3 dataset (196,640 pirated books) in early 2021. The company then grabbed 5 million more from LibGen in June 2021, and another 2 million from PiLiMi in July 2022 - over 7 million pirated books in total.

The judge was particularly pointed about this, noting that Anthropic itself had argued: "You can't just bless yourself by saying I have a research purpose and, therefore, go and take any textbook you want." The judge agreed, writing that he "doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use."

Courts analyze four factors to decide if fair use applies:

  1. Purpose & character - Is the new use transformative or just copying?
  2. Nature of the work - Was the original creative (like a novel) or factual?
  3. Amount taken - How much of the original work was used?
  4. Market impact - Does the new use hurt sales of the original?

What Happens Next

Anthropic now faces a trial specifically over the pirated books, with potential damages that could hit billions. Here's the math: the statutory minimum for copyright infringement is $750 per work, Wired notes. With more than 7 million pirated books, that's over $5 billion in potential damages - and that's the floor, not the ceiling.
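The arithmetic above can be sketched in a few lines. The book counts are those reported in the article, and the $750 figure is the statutory minimum per infringed work; this is back-of-the-envelope exposure, not a legal damages estimate (actual awards depend on how many works are registered, proven infringed, and how willfulness is assessed):

```python
# Back-of-the-envelope statutory-damages exposure for the pirated-library claim.
# Book counts are from the article; $750 is the statutory minimum per work.

BOOKS3 = 196_640      # Books3 dataset, downloaded early 2021
LIBGEN = 5_000_000    # LibGen, June 2021
PILIMI = 2_000_000    # PiLiMi, July 2022

total_books = BOOKS3 + LIBGEN + PILIMI
min_per_work = 750    # statutory minimum per infringed work, in USD

minimum_exposure = total_books * min_per_work

print(f"Pirated books: {total_books:,}")             # 7,196,640
print(f"Minimum exposure: ${minimum_exposure:,}")    # $5,397,480,000
```

Even at the statutory floor, the total clears $5 billion, which is why the upcoming trial matters so much despite the fair-use win on training.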

But the training use being ruled fair use is a massive win for the AI industry overall. This precedent will likely influence dozens of other AI copyright cases working through the courts. Other companies like Meta are facing similar lawsuits over their use of pirated content from LibGen and other sources.

The question now is whether this creates a real market for AI training licenses, or if most companies will just stick to freely available content. Given how much Anthropic spent buying physical books to scan, there's clearly value in high-quality training data.

Important caveat: This ruling doesn't address whether AI model outputs can infringe copyright - that's a separate legal question that's still being fought in other cases. This is specifically about the training process.

The Bigger Picture

The stakes could hardly be higher. Generative AI, built on oceans of human creativity, now powers everything from search engines to chatbots. Authors, artists and publishers are rightly alarmed that their work is being used to train systems that may one day outcompete them.

One passage from the decision cuts to the heart of the issue:

Over time, Anthropic came to value most highly for its data mixes books like the ones Authors had written, and it valued them because of the creative expressions they contained. Claude's customers wanted Claude to write as accurately and as compellingly as Authors. So, it was best to train the LLMs underlying Claude on works just like the ones Authors had written, with well-curated facts, well-organized analyses, and captivating fictional narratives — above all with 'good writing' of the kind 'an editor would approve of.'

The arguments for the AI firms are not trivial. Progress depends on the free flow of knowledge. Yet the counterargument is equally compelling: if creators cannot control—or profit from—how their work is used, the well of new writing may dry up.
