On May 5, five of the world's largest publishing houses, Hachette, Macmillan, McGraw Hill, Elsevier, and Cengage, joined author Scott Turow in filing a class action against Meta and CEO Mark Zuckerberg in the Southern District of New York. The complaint alleges Meta torrented over 267 terabytes of pirated material, the equivalent of hundreds of millions of publications, to train its Llama large language models, and that Zuckerberg personally authorized abandoning licensing negotiations in favor of using pirated datasets. This is a case worth watching carefully, for reasons beyond its headline.
The past year has not been short on AI copyright litigation. Among many other pending cases, the Anthropic settlement with authors in September 2025 underscored that training on copyrighted books and retaining pirated copies are significantly different legal issues. However, courts are still in the process of drawing a clear line on fair use in the context of AI training. This new suit, Publishers v. Meta, will be one of the highest-profile cases testing this alleged fact pattern of using pirated materials as a training dataset. I expect this case to move fast and draw significant attention from judges, practitioners, and AI developers alike.
Here is what makes this complaint more legally significant than its predecessors: plaintiffs also allege that Meta removed copyright management information from the pirated works, violating Section 1202(b) of the Digital Millennium Copyright Act.
Copyright management information, or CMI, is the copyright-identifying information conveyed in connection with a work: the author’s name, the title, the copyright notice, and the terms of use. Under Section 1202(b), intentionally removing CMI is unlawful where the party doing so knows, or should know, that the removal will facilitate infringement. In this case, the complaint cites internal Meta communications showing employees removing copyright paragraphs from the beginning and end of documents, and filtering out lines containing the words “ISBN,” “Copyright,” “(C),” and “All rights reserved” from the training corpus.
In early 2025, a federal district court ruled that Meta must defend against a Section 1202(b) claim in a separate Llama-related lawsuit, a significant threshold ruling. The publishers’ complaint is built on top of that foundation, with additional evidence. I believe Section 1202(b) will become one of the most powerful tools available to copyright holders in AI-related disputes. Unlike a direct infringement claim, a Section 1202(b) claim does not require proof that your specific work was copied and reproduced. It potentially reaches any training pipeline that removed copyright-bearing text during content normalization, which describes standard practice in nearly every large-scale training run.
Here is the technical reality that makes this legally complicated: removing copyright-bearing text as part of data cleaning is completely standard in AI preprocessing. When teams prepare training corpora from sourced documents, copyright notices, copyright paragraphs, and lines containing copyright identifiers are routinely filtered out as noise. The technical team treats them as boilerplate clutter. CMI, as defined by the DMCA, is that exact text. The technical team is not doing something unusual. They are doing their jobs.
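For readers who have not seen such a filter, the following is a minimal Python sketch of the kind of line-level cleaning step described above. It is an illustration only, not a reconstruction of any company's actual pipeline: the function name, the sample text, and the choice of case-insensitive matching are assumptions. The marker strings are the ones quoted earlier from the complaint.

```python
# Marker strings quoted in the complaint's description of the filtering.
# Matching them case-insensitively is an assumption made for this sketch.
CMI_MARKERS = ("ISBN", "Copyright", "(C)", "All rights reserved")


def drop_marker_lines(text, markers=CMI_MARKERS):
    """Return text with every line that contains a marker removed.

    Hypothetical helper: it treats matching lines as noise, which is
    exactly how copyright-identifying text ends up stripped.
    """
    lowered = [m.lower() for m in markers]
    kept = [
        line
        for line in text.splitlines()
        if not any(m in line.lower() for m in lowered)
    ]
    return "\n".join(kept)


# Invented sample document with a typical copyright page.
sample = (
    "A Novel\n"
    "Copyright 2020 Jane Doe\n"
    "All rights reserved.\n"
    "ISBN 000-0-00-000000-0\n"
    "Chapter One\n"
    "It was a quiet morning."
)

cleaned = drop_marker_lines(sample)
print(cleaned)
```

Nothing in a step like this looks like a legal decision to the engineer writing it, yet the three lines it discards are the author's name, the copyright notice, and the identifier for the work.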
The legal problem is that this normal technical practice can constitute a DMCA violation, even where the legal team has secured what it believes to be proper rights to use the data for training. Rights clearance and CMI preservation are two separate obligations, and the gap between them is rarely visible unless legal and technical teams are working from the same checklist. That gap is exactly the kind of fact pattern that produces significant, entirely unintentional liability.
This creates a dual advisory challenge. For counsel representing copyright holders, the question is whether copyright-identifying text is actually present and intact in the formats most likely to be scraped, because a 1202(b) claim requires proof that CMI existed in the work before it was removed. For counsel advising companies building training datasets, the question is whether the data preparation process addresses CMI obligations separately from licensing, because acquiring the right to use content does not automatically authorize removing the copyright-identifying text within it. Both sides of that advisory relationship require legal teams to be technically fluent in how these pipelines actually work.
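Both questions turn on whether CMI was present in a document and what the pipeline did with it, which is something a pipeline can record. The sketch below, again in Python, shows one hypothetical way to make that visible: instead of silently discarding matching lines, it sets them aside alongside the cleaned body. All names here (MARKERS, CleanedDocument, clean_with_cmi_record) are invented for illustration, the marker list is an assumption, and whether keeping such a record would satisfy Section 1202(b) is a legal question this sketch does not answer.

```python
from dataclasses import dataclass, field

# Assumed, non-exhaustive markers for lines likely to carry CMI.
MARKERS = ("isbn", "copyright", "(c)", "©", "all rights reserved")


@dataclass
class CleanedDocument:
    """Cleaned body text plus the CMI-bearing lines set aside from it."""
    body: str
    cmi_lines: list = field(default_factory=list)


def is_cmi_line(line):
    """Heuristic check: does this line contain any assumed CMI marker?"""
    lowered = line.lower()
    return any(marker in lowered for marker in MARKERS)


def clean_with_cmi_record(text):
    """Split text into body lines and a retained record of CMI lines."""
    body, cmi = [], []
    for line in text.splitlines():
        (cmi if is_cmi_line(line) else body).append(line)
    return CleanedDocument(body="\n".join(body), cmi_lines=cmi)


# Invented sample input.
doc = clean_with_cmi_record("Title\n© 2021 Example Press\nFirst paragraph.")
print(doc.body)
print(doc.cmi_lines)
```

A record like cmi_lines gives legal and technical teams a shared artifact to review: an empty list suggests no CMI was detected in the source, while a populated one shows what the cleaning step touched.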
Publishers v. Meta may be the moment practitioners realize that advising clients in the age of AI requires understanding both what the contract says and what the pipeline does.
Disclaimer: This article does not constitute legal advice, does not create an attorney-client relationship, and is intended for informational purposes only.
- Senior Counsel
Marcus Burnside focuses his practice on intellectual property for both domestic and foreign clients. With knowledge of both mechanical and electrical engineering, Marcus is able to assist clients in a broad range of technologies ...



