The Discovery Bomb

A judge just ordered OpenAI to hand over 20 million ChatGPT conversations. What happens when the black box opens.

On January 5, 2026, U.S. District Judge Sidney Stein affirmed an order that OpenAI must produce 20 million de-identified ChatGPT conversation logs to the plaintiffs suing it for copyright infringement. Not a filtered sample. Not just the conversations that mention the plaintiffs’ works. All 20 million.

This is arguably the most significant discovery order in the history of AI litigation. And it might be the ruling that determines whether the entire generative AI industry is built on solid legal ground or on quicksand.

The case

The logs are heading to the plaintiffs in In re: OpenAI, Inc. Copyright Infringement Litigation, a consolidated action in the Southern District of New York that combines 16 separate copyright lawsuits. The plaintiffs include the New York Times, the Chicago Tribune, and dozens of individual authors who allege that OpenAI used their copyrighted work to train ChatGPT without permission.

OpenAI’s central legal defense is fair use: the argument that training AI models on copyrighted material is transformative and therefore legal. Fair use is a four-factor test under U.S. copyright law (17 U.S.C. § 107), and the most dangerous factor for OpenAI is the fourth, market effect: whether ChatGPT’s outputs compete with or replace the original copyrighted works.

That is exactly the question the 20 million logs are meant to answer.

What OpenAI tried to prevent

The discovery fight started when news plaintiffs requested 120 million conversation logs from the billions OpenAI has preserved. OpenAI counter-offered 20 million — roughly 0.5% of total preserved conversations. The plaintiffs accepted.
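The sampling arithmetic above can be checked with a short sketch. It uses only the figures quoted in this article (20 million logs, roughly 0.5%, and the 120 million originally requested). The helper name `implied_total` is hypothetical, and the total it derives is an implication of the rounded 0.5% figure, not a number OpenAI has reported.

```python
# Figures quoted in the article; 0.5% is rounded, so every result is approximate.
SAMPLE_LOGS = 20_000_000       # logs OpenAI agreed to produce
REQUESTED_LOGS = 120_000_000   # logs the news plaintiffs first requested
SAMPLE_FRACTION = 0.005        # "roughly 0.5%" of preserved conversations


def implied_total(sample_size: int, fraction: float) -> float:
    """Return the population size implied by a sample that is `fraction` of the whole."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return sample_size / fraction


total_logs = implied_total(SAMPLE_LOGS, SAMPLE_FRACTION)
requested_share = REQUESTED_LOGS / total_logs

print(f"Implied preserved conversations: about {total_logs:,.0f}")
print(f"Original request as a share of that total: about {requested_share:.1%}")
```

On these figures, the preserved set is on the order of 4 billion conversations, and even the original 120 million request would have covered only about 3% of it.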

Then, in October 2025, OpenAI changed course. It proposed running keyword searches on the 20 million logs and producing only the conversations that directly referenced plaintiffs’ specific works. The rest, OpenAI argued, were irrelevant and would invade ChatGPT users’ privacy.

Magistrate Judge Ona T. Wang rejected that approach in November 2025. Her reasoning went to the heart of the fair use defense: even conversations that don’t reproduce copyrighted text are discoverable because they show how ChatGPT functions across diverse queries. Does it routinely generate content that substitutes for the original works? Does it produce summaries, paraphrases, or near-copies that reduce demand for the source material? You can’t answer those questions by looking only at the conversations OpenAI hand-picks.

Judge Stein agreed. On the privacy argument, he found three safeguards sufficient: the reduction of the sample from billions of logs to 20 million, OpenAI’s de-identification process for removing personally identifiable information, and an existing protective order governing discovery materials. He also noted that ChatGPT users “voluntarily submitted their communications” to OpenAI, which distinguishes these logs from wiretapped conversations.

Why the logs matter

The 20 million conversations will be analyzed by expert witnesses for patterns that bear directly on fair use. The key questions:

Does ChatGPT reproduce copyrighted content when prompted? How frequently? How closely? When a user asks for a summary of a New York Times article, does ChatGPT produce something that functions as a substitute for reading the original? When someone asks for content in the style of a specific author, does the output compete with that author’s work?

These aren’t hypothetical questions anymore. They’re questions that will be answered with data from 20 million real conversations.

If the analysis shows systematic market substitution — that ChatGPT regularly produces outputs that replace the need for copyrighted source material — OpenAI’s fair use defense weakens dramatically. If it shows that outputs are genuinely transformative and don’t substitute for the originals, the defense strengthens.

Either way, the black box opens. For the first time, independent experts will examine at scale how a large language model actually uses the copyrighted material it was trained on.

The landscape

This case exists within a rapidly expanding litigation landscape. Over 70 AI copyright cases are now pending, more than double the roughly 30 at the end of 2024. Companies newly drawn into this litigation in 2025, as plaintiffs or as defendants, included Disney, Universal, Warner Bros., Apple, Salesforce, Adobe, ByteDance, and Encyclopedia Britannica.

The most significant precedent so far is mixed. In Kadrey v. Meta, a court found Meta’s use of plaintiff books “highly transformative” and qualifying as fair use — but the ruling was narrow and specifically flagged concerns about indirect market harm. In Thomson Reuters v. Ross Intelligence, a different court rejected the fair use defense entirely, finding the use commercial and non-transformative.

And then there’s the Anthropic settlement. In Bartz v. Anthropic, the court found Anthropic’s training “exceedingly transformative” under fair use. But Anthropic had downloaded 482,460 pirated books from Library Genesis and Pirate Library Mirror to build its training set. The $1.5 billion settlement, roughly $3,000 per book, underscored that even transformative use doesn’t excuse the method of acquisition.
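The per-book figure is simple division, sketched below using only the two numbers quoted above. Splitting the settlement evenly across books is an assumption made for illustration; the actual distribution terms may differ, and the constant names are hypothetical.

```python
# Both figures are taken from the paragraph above.
SETTLEMENT_USD = 1_500_000_000
BOOKS = 482_460

# Assumption: an even split across all books, ignoring fees and other deductions.
per_book_usd = SETTLEMENT_USD / BOOKS

print(f"Average per book: ${per_book_usd:,.0f}")
```

The average comes out a little over $3,100, consistent with the "roughly $3,000 per book" figure.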

The music industry is moving fastest toward resolution. Universal Music Group and Warner Music Group both settled with Udio in 2025, with licensed music subscriptions launching in 2026 using artist opt-in models. Warner also settled with Suno. These settlements suggest the music industry sees licensing as inevitable — and that it has the market power to demand it.

What happens next

The major fair use decisions are expected in summer 2026 or later: In re Google Generative AI, UMG v. Suno, Concord v. Anthropic, and In re Mosaic LLM. The OpenAI discovery is likely to inform all of them.

The stakes extend far beyond one company. Every major AI lab trained on copyrighted material. Every fair use defense rests on the same basic argument: that training is transformative and doesn’t substitute for the original works. The 20 million ChatGPT logs will provide the first large-scale empirical test of that claim.

If the data supports fair use, the AI industry’s legal foundation holds. If it doesn’t, the industry faces a choice between licensing regimes that could cost billions — the Anthropic settlement suggests the per-work price — and fundamental changes to how models are trained.

The discovery bomb isn’t the order itself. It’s the evidence the order unlocks, which is what will make a ruling on fair use possible. For the first time, a court will see what happens inside the machine. What it finds will shape the legal framework for AI for the next decade.


Originally published at https://noahaust2.github.io/strategist-dashboard/blog/the-discovery-bomb.html

