Nazemian v. NVIDIA: Court Allows AI Training Copyright Claims to Proceed, Applies Cox Framework to Dataset Scripts

Case: Nazemian et al. v. NVIDIA Corporation
Court: U.S. District Court, Northern District of California
Date Decided: May 5, 2026
Docket No.: 4:24-cv-01454-JST
Judge: Jon S. Tigar
Topics: AI Training, Copyright Infringement, Contributory Liability, Shadow Libraries, Cox v. Sony

Background

Authors Abdi Nazemian, Brian Keene, and Stewart O’Nan filed a proposed class action against NVIDIA alleging that the company trained its large language models, including Megatron 345M and the NeMo Megatron family, on unauthorized copies of their copyrighted books. The authors claimed NVIDIA sourced training data from “shadow libraries,” illegal piracy repositories such as Bibliotik and Pirate Library Mirror, via the Books3 dataset, a component of the larger corpus known as The Pile, which contained approximately 196,640 pirated books.

The plaintiffs alleged NVIDIA not only used these pirated materials directly but also provided scripts enabling its customers to download The Pile dataset and train their own models, thereby facilitating further infringement through the BitTorrent protocol. NVIDIA moved to dismiss the first consolidated amended complaint on multiple grounds, arguing that the plaintiffs had failed to connect specific models to specific copyrighted works and that the Supreme Court’s recent Cox Communications v. Sony Music decision foreclosed contributory liability.

The Court’s Holding

Judge Tigar denied the vast majority of NVIDIA’s motion, allowing claims for direct copyright infringement and contributory infringement to proceed. The court dismissed only the vicarious infringement claim, granting leave to amend.

On direct infringement, the court rejected NVIDIA’s attempt to use its own website’s model card, which listed only certain training data sources, as proof that Megatron 345M was not trained on Books3. Judge Tigar denied NVIDIA’s request for judicial notice, holding that a website screenshot is not automatically subject to judicial notice and that the card’s listing of some data sources does not foreclose training on others. The court similarly preserved the plaintiffs’ allegations about Pirate Library Mirror, Bibliotik, and the BitTorrent protocol, finding them sufficiently pleaded at the motion-to-dismiss stage.

On contributory infringement, the court applied the Supreme Court’s 2025 Cox Communications v. Sony Music framework and found it supported — not defeated — the plaintiffs’ claims. Judge Tigar distinguished the case from Cox itself, finding that NVIDIA’s dataset download scripts “have no other purpose than to speed up the process of infringement, unlike the digital video recorder systems at issue in Sony Corp. or the internet service provided in Cox.” This satisfied both the “inducement” and “service tailored to infringement” theories of contributory liability.

Only the vicarious infringement claim fell. The court found the plaintiffs had not shown that NVIDIA had the legal right to control third-party users’ independent decisions to access The Pile, nor that the availability of pirated content served as a specific “draw” bringing customers to NVIDIA’s platform. The dismissal came with 21 days’ leave to amend.

Key Takeaways

  • First application of Cox in an AI training case. The court used the Supreme Court’s recent secondary-liability framework to analyze AI dataset tools — and concluded that purpose-built download scripts are unlike general-purpose internet services. Tools with “no other purpose than to speed up the process of infringement” can support contributory liability.
  • Allegations of unlawful data acquisition carry a complaint past the pleading stage. The court also refused to resolve NVIDIA’s fair use defense on the motion to dismiss, finding fair use presents “a mixed question of law and fact not suited for resolution on a Rule 12(b)(6) motion.”
  • AI companies’ own model cards are not judicial-notice shields. Companies cannot defeat copyright claims simply by pointing to their own website descriptions of training data, as these may be incomplete.
  • Vicarious liability requires more than tool design. Controlling one’s own AI platform does not automatically confer the legal right to control how third parties use downloaded datasets.

Why It Matters

This ruling is one of the most significant AI copyright decisions to date. It confirms that AI companies face real litigation risk when their training pipelines touch pirated datasets — even when the piracy occurs through intermediary “shadow libraries” rather than direct copying. The court’s application of the Cox framework to dataset download scripts creates a new analytical lens: purpose-built AI training tools that automate access to infringing datasets may face contributory liability, while general-purpose platforms remain protected.

For the AI industry, the decision signals that early-stage motions to dismiss will not easily dispose of training-data copyright claims. Companies that provided tools or scripts facilitating dataset downloads during the AI training boom may face heightened scrutiny. For authors and creators, the ruling validates the theory that AI training on pirated books can be challenged not only as direct infringement but as contributory infringement through the distribution of purpose-built tools.

Full Opinion


Download the full opinion (PDF)
