According to TechCrunch, Adobe has been hit with a proposed class-action lawsuit filed on behalf of Oregon author Elizabeth Lyon, accusing the company of using pirated books to train its AI. The suit, originally reported by Reuters, claims Adobe’s SlimLM language model was trained on a dataset called SlimPajama-627B, which is derived from the RedPajama dataset. That dataset, in turn, contains the infamous “Books3” collection of 191,000 books. Lyon alleges her copyrighted guidebooks were included in this pirated dataset without her consent. The lawsuit follows similar legal action against Apple in September and Salesforce in October, both of which also cited the RedPajama dataset. In a related major settlement, Anthropic agreed in September to pay $1.5 billion to authors over similar claims regarding its Claude chatbot.
The Books3 Problem Isn’t Going Away
Here’s the thing: this lawsuit against Adobe feels less like a shocking revelation and more like the next domino to fall. The Books3 dataset is the industry’s worst-kept secret and its biggest legal liability. It’s basically a massive shadow library that became foundational fuel for a whole generation of AI models. Companies like Cerebras, which released the SlimPajama dataset Adobe used, positioned these collections as “open-source” and “deduplicated.” But that technical language is now slamming headfirst into copyright law. The core argument from authors is simple: cleaning and shuffling pirated content doesn’t make it legal to use. And the courts are starting to listen.
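To make that argument concrete: deduplication, the technical step touted in these dataset releases, only removes repeated copies; it never alters what the surviving texts say. Here’s a minimal sketch of the idea in Python, using exact-match hashing for illustration (SlimPajama’s actual pipeline reportedly used fuzzier, MinHash-style near-duplicate detection, and the tiny corpus below is invented):

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates by content hash; keep the first occurrence."""
    seen = set()
    unique = []
    for doc in documents:
        # Collapse whitespace so trivially reformatted copies hash the same.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)  # the surviving text is the original, verbatim
    return unique

corpus = [
    "Chapter 1. It was a dark and stormy night...",
    "Chapter 1.  It was a dark and  stormy night...",  # a reformatted copy
    "A completely different document.",
]
print(len(dedupe(corpus)))  # 2: fewer copies, same underlying text
```

Notice what the sketch does and doesn’t do: it changes which documents appear, not their contents. That’s exactly why authors argue the processing doesn’t launder the copyright.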
A Legal Playbook Is Forming
Look at the sequence. First came the $1.5 billion Anthropic settlement, a massive warning shot. Then the lawsuits against Apple and Salesforce. Now Adobe. A clear pattern is emerging: plaintiffs are tracing AI models back to these specific, tainted datasets. Adobe’s own research paper on SlimLM openly cites its training data, which makes the chain of evidence pretty straightforward for lawyers. This isn’t a speculative claim anymore; it’s a well-worn legal path. The tech industry’s old “move fast and break things” mantra is colliding with a very established, very slow-moving legal system that takes copyright incredibly seriously.
What This Means For The AI Race
So where does this leave companies? In a really tough spot. If you’re a giant like Adobe, you have deep pockets, but you’re also a huge target. The pressure to license content cleanly is going to skyrocket, which will benefit publishers and rights holders but could also slow down innovation and entrench the biggest players who can afford the licensing fees. Smaller startups that trained on these open but legally dubious datasets could be completely sunk. We’re probably heading for a two-tier AI landscape: one tier of models built on expensive, licensed data, and another shadow tier built on the old, risky stuff. The irony for Adobe? SlimLM is a small language model optimized for document tasks; it isn’t even the company’s flagship generative AI product, like Firefly. But if the training data is poisoned, the size of the model doesn’t matter in court.
The Billion-Dollar Question
Basically, the entire premise of how we’ve built modern AI is on trial. Every company that raced to release an AI feature in the last two years has to be sweating a little and reviewing its data provenance. The Anthropic settlement set a terrifying precedent for potential damages. Will Adobe fight this, or will it seek a quick settlement to make the case go away? And more importantly, can the industry even function if it has to scrub every bit of training data for copyright issues? The lawsuits are becoming commonplace because the alleged infringement was, for a long time, commonplace. Untangling that mess is going to be the defining business and legal challenge for AI over the next decade. It’s no longer a technical problem. It’s a survival one.
