Background
A group of authors — including Stewart O’Nan, Abdi Nazemian, Brian Keene, Rebecca Makkai, and Jason Reynolds — sued Databricks (formerly MosaicML), alleging the company used the Books3 dataset, an openly pirated collection of approximately 196,000 copyrighted books, to train its MPT and DBRX large language models. The case is one of several consolidated copyright actions challenging the use of copyrighted works to train AI systems, alongside similar suits against OpenAI, Meta, and Anthropic.
In an earlier ruling in August 2025, Judge Breyer dismissed the authors’ initial DBRX-related claims as “too vague” but allowed the MPT claims to proceed and invited the plaintiffs to amend their complaint with more specific allegations. The authors filed a Second Amended Consolidated Complaint in January 2026, adding new factual details gleaned from discovery — including statements from Databricks employees — to tie their copyrighted works more directly to the DBRX models. Databricks moved to dismiss the DBRX claims again and separately moved to strike all DBRX-related allegations.
The Court’s Holding
Judge Breyer denied both motions. On the direct copyright infringement claim, the court found that the amended complaint “sufficiently meets” the requirements for stating a plausible claim. The plaintiffs now “directly tie their infringed works to DBRX” through a combination of employee statements and factual allegations about Databricks’ development process.
Databricks argued the infringement allegations were “too attenuated from DBRX as a product” because the plaintiffs do not allege copying in DBRX’s final training dataset — only during early development stages. The court rejected this argument, holding that “properly determining the degree of attenuation would require evidentiary considerations outside of the pleadings” and that it is “not appropriate to litigate factual disputes on a motion to dismiss.” The court acknowledged that “Defendants may ultimately prevail on this issue, but for now, Plaintiffs’ allegations are sufficient.”
The court also denied the motion to strike DBRX-related allegations, finding that they “implicate the merits of plaintiffs’ claims” and should not be eliminated at this procedural stage. The order does not address fair use, which Databricks has not yet raised as a defense in this litigation.
Key Takeaways
- AI training copyright suits move past the pleading stage: This ruling joins a growing body of decisions allowing copyright claims against AI companies to survive motions to dismiss. Courts are increasingly finding that allegations of using pirated datasets like Books3 to train language models state plausible infringement claims.
- Discovery is doing the work: The plaintiffs cured their earlier deficiencies by using information from discovery — including internal employee statements — to connect their works to the DBRX models. This pattern is likely to recur in similar AI copyright suits as discovery progresses.
- The “attenuation” defense lives to fight another day: Databricks’ argument that copying during early development is too attenuated from the final product was not rejected on the merits — just deferred as a factual question unsuitable for resolution at the pleading stage. This defense will likely be central at summary judgment.
- Fair use not yet in play: Unlike the Meta and Anthropic cases, Databricks has not yet raised fair use as a defense. When it does, the analysis will likely intersect with the attenuation question — for example, whether copying confined to early development stages weighs differently in a fair use analysis than copying reflected in the final training dataset.
Why It Matters
The Mosaic LLM litigation is one of the most closely watched AI copyright cases in the country. As the AI industry relies heavily on large datasets to train foundation models, the legal question of whether that training constitutes copyright infringement — or fair use — will shape the business models of every major AI company. This ruling means Databricks will face discovery and potentially trial on the core question of whether ingesting copyrighted books to build language models violates authors’ rights. For authors and publishers, the decision validates the viability of copyright claims against AI companies that trained on pirated content. For AI companies, it underscores the legal risks of using datasets of uncertain provenance and the importance of documenting the development chain from training data to final model.
Download the full opinion (PDF)