The Deanonymization Cliff: How AI Learned to Unmask the Anonymous Internet
- How it works
- The numbers
- Why this is different
- What it means for you
- The cliff
- Who would use this
- What can be done
- What I keep coming back to
Last month, a team of researchers from ETH Zurich and Anthropic published a paper that should have gotten more attention than it did. They built an AI system that could take a person’s anonymous Hacker News posts and figure out who they were in real life. Not sometimes. Two-thirds of the time, with 90 percent precision.
The cost was between one and four dollars per person.
The whole experiment, across hundreds of targets, cost less than $2,000. The system ran autonomously, no human in the loop, and finished each identification in minutes.
I don’t think we’ve processed what this means yet.
How it works
The traditional way to identify anonymous writers is stylometry. You analyze sentence length, vocabulary, punctuation habits. It works sometimes, but it breaks easily. Change how you write a little and the signal disappears.
What these researchers did is different. Their system, called ESRC, doesn’t care how you write. It cares what you said.
An LLM reads your posts and builds a profile. Not from the things you deliberately shared, but from the things you let slip. The offhand mention of a company in a career advice thread. The neighborhood restaurant you recommended once. The conference you complained about attending two years ago. Your interests, your career details, the programming languages you reach for, the cities you’ve referenced, the films you have opinions about.
The system then converts that profile into a mathematical representation and searches across millions of candidates on platforms like LinkedIn. When it finds plausible matches, the LLM reasons through them, cross-referencing details the way a private investigator would. Except a PI charges $100 an hour and takes days. This costs a dollar and takes minutes.
The system also knows when it isn’t sure. In those cases, it abstains rather than guess. That’s part of why the precision is so high.
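To make the search-then-abstain idea concrete, here is a minimal sketch, not the paper’s implementation. It assumes profiles have already been turned into vectors, ranks candidates by cosine similarity, and declines to answer when the best match is weak or barely ahead of the runner-up. The function names, the three-dimensional toy vectors, and both thresholds are illustrative assumptions. The real system also has an LLM reason over the shortlisted candidates, which this sketch leaves out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_or_abstain(profile_vec, candidates, min_score=0.8, min_margin=0.1):
    """Return the best-matching candidate name, or None to abstain.

    Both thresholds are made-up values for illustration only.
    """
    ranked = sorted(
        ((cosine(profile_vec, vec), name) for name, vec in candidates.items()),
        reverse=True,
    )
    if not ranked:
        return None
    best_score, best_name = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    # Abstain when the top match is weak or not clearly ahead of second place.
    if best_score < min_score or best_score - runner_up < min_margin:
        return None
    return best_name


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
candidates = {
    "candidate_a": [1.0, 0.0, 0.0],
    "candidate_b": [0.0, 1.0, 0.0],
    "candidate_c": [0.0, 0.0, 1.0],
}

print(match_or_abstain([0.9, 0.1, 0.0], candidates))  # candidate_a
print(match_or_abstain([1.0, 1.0, 0.0], candidates))  # None (a and b tie)
```

Raising either threshold trades recall for precision: the system answers less often, but is wrong less often when it does.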
The numbers
They tested on 338 Hacker News users who had linked their profiles to LinkedIn, so the answers could be verified. The system got 226 right. It got 25 wrong. It declined to guess on 86.
That’s 67 percent recall at 90 percent precision. The best non-LLM method, the kind of statistical matching used in the Netflix Prize deanonymization attack, got 0.1 percent recall under identical conditions.
One time in a thousand versus two times in three.
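For anyone who wants to check the arithmetic, the two headline figures follow from the counts above. This sketch assumes the usual definitions: precision is correct answers divided by all answers the system gave, and recall is correct answers divided by all targets, with abstentions counting against recall.

```python
def precision_recall(correct, wrong, total_targets):
    """Precision over the guesses made; recall over every target."""
    guesses = correct + wrong
    precision = correct / guesses
    recall = correct / total_targets
    return precision, recall


precision, recall = precision_recall(correct=226, wrong=25, total_targets=338)
print(f"precision: {precision:.1%}")  # precision: 90.0%
print(f"recall:    {recall:.1%}")     # recall:    66.9%
```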
They also tested on Reddit. Across movie discussion communities with nearly 10,000 users, the system linked people’s separate accounts at 33 percent recall and 99 percent precision. A third of users identified with near-certainty from nothing but their movie opinions and how they talked about them.
Separately, Tianshi Li, a professor at Northeastern University, tried a similar approach on Anthropic’s own data. Anthropic had released 1,250 anonymized interviews with professionals, collected with its Anthropic Interviewer research tool. Li fed the interviews to off-the-shelf LLM agents and identified 9 of 33 scientists by name.
Anthropic makes the AI. They have some of the best safety researchers in the world. They couldn’t anonymize their own interviews against their own technology.
Why this is different
Researchers have been deanonymizing datasets for more than twenty years. The Netflix Prize attack. The AOL search log disaster. The Massachusetts health records study. We’ve known structured data can be re-identified.
But those attacks needed structured inputs. Movie ratings. Search queries. Medical codes. They fell apart when the data was messy or incomplete.
This works on free text. Forum posts, blog comments, interview transcripts, anything where a person expresses opinions or mentions details about their life. The old attacks required auxiliary datasets and real technical expertise. This one requires an API key and a prompt.
What it means for you
If you’ve posted on Reddit or Hacker News or Stack Overflow under a pseudonym, you’ve left a semantic fingerprint. Not in your writing style, but in the accumulated weight of what you’ve talked about. Your gripes about your job, the city you mentioned living in, the school you went to, the framework you always recommend to beginners.
Any single one of those details is harmless. Piecing them together used to require someone who knew what to look for. An LLM doesn’t need to know what to look for. It reads everything and connects the dots.
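A toy calculation shows why harmless details stop being harmless in combination. The sketch below builds a synthetic pool of 625 made-up people, one for every combination of four attributes with five values each, then applies one clue at a time. The attribute names and values are placeholders, and the even, independent spread is an assumption. Real populations are lumpier, so some clues narrow far more than others.

```python
from itertools import product

# Synthetic placeholder values: five options per attribute.
CITIES = [f"city_{i}" for i in range(5)]
EMPLOYERS = [f"employer_{i}" for i in range(5)]
SCHOOLS = [f"school_{i}" for i in range(5)]
FRAMEWORKS = [f"framework_{i}" for i in range(5)]

# One synthetic person per combination: 5 * 5 * 5 * 5 = 625 candidates.
pool = [
    {"city": c, "employer": e, "school": s, "framework": f}
    for c, e, s, f in product(CITIES, EMPLOYERS, SCHOOLS, FRAMEWORKS)
]


def narrow(pool, clues):
    """Apply clues one at a time and record the pool size after each."""
    sizes = [len(pool)]
    for key, value in clues:
        pool = [person for person in pool if person[key] == value]
        sizes.append(len(pool))
    return sizes


clues = [
    ("city", "city_2"),
    ("employer", "employer_0"),
    ("school", "school_4"),
    ("framework", "framework_1"),
]
print(narrow(pool, clues))  # [625, 125, 25, 5, 1]
```

No single clue identifies anyone here. Four of them together leave exactly one candidate.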
Think about what you’ve posted over the past decade. All of it. Now imagine feeding that into a system that costs four dollars to run and doesn’t get bored.
The researchers acknowledged the ethical weight of what they’d built. They consulted with their institutions before publishing. They recommended that platforms restrict API access to user data, enforce rate limits, and detect automated scraping.
Reasonable ideas. Hard to implement when the data is already public, already archived by third-party services, and already scraped into datasets that are never coming back.
The cliff
I’m calling this the Deanonymization Cliff because the threat didn’t scale gradually. It fell off a ledge.
For most of internet history, pseudonymity was practical security. Not because it was technically strong, but because the effort to break it was high. A journalist investigating a whistleblower might spend days cross-referencing posts. A government might dedicate analysts to tracking a dissident. For the average person posting on forums, nobody cared enough to try.
When deanonymization costs a hundred hours of skilled labor, it happens only to the few people someone is willing to spend that on. When it costs four dollars, there’s no floor anymore. Every pseudonymous account becomes a viable target.
This isn’t theoretical. The researchers tested at scale, with candidate pools extrapolated to millions. Someone with a few thousand dollars could attempt to unmask every active user on a mid-sized forum.
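The back-of-the-envelope version, using only figures from this post: one to four dollars per target for the automated system, and a hundred hours of skilled labor at a PI’s $100 an hour for the manual route. Combining those last two into a single per-target cost is my own rough assumption, and the $3,000 budget is just a stand-in for “a few thousand dollars.”

```python
COST_PER_TARGET_USD = (1.0, 4.0)  # low and high estimates per identification
PI_RATE_USD_PER_HOUR = 100.0      # hourly rate quoted earlier for a PI
MANUAL_HOURS_PER_TARGET = 100     # "a hundred hours of skilled labor"


def targets_for_budget(budget_usd, cost_per_target_usd):
    """Whole number of identification attempts a budget can pay for."""
    return int(budget_usd // cost_per_target_usd)


budget = 3000.0  # hypothetical budget
low_cost, high_cost = COST_PER_TARGET_USD
manual_cost_per_target = PI_RATE_USD_PER_HOUR * MANUAL_HOURS_PER_TARGET

print(targets_for_budget(budget, high_cost))  # 750
print(targets_for_budget(budget, low_cost))   # 3000
print(manual_cost_per_target)                 # 10000.0
print(manual_cost_per_target / high_cost)     # 2500.0
```

Even at the expensive end, the automated route is thousands of times cheaper per person than the manual one under these assumptions.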
For most people that’s an annoyance. For people in abusive relationships posting in support communities under pseudonyms, or political dissidents in countries where the government monitors Western platforms, or anyone who separated their online identity from their legal one for an actual safety reason, it’s something else.
Who would use this
The obvious ones are government surveillance and corporate profiling. Countries that want to identify anonymous critics online already spend real money on it. This makes it cheap. And ad companies that already track you through cookies would love to also know what you actually think, which is exactly what your pseudonymous forum posts contain.
But those feel almost manageable compared to the retroactive problem. Everything you’ve already posted is still out there. Web archives go back decades. Forum databases get scraped and resold. Reddit makes its data available through APIs. None of that is coming back. This threat doesn’t just apply going forward. It reaches backward into everything anyone has ever written, including the stuff you posted at 22 when you weren’t thinking about operational security because why would you be.
What can be done
I don’t think the standard prescription works here. “Use Tor, use a VPN, compartmentalize your identities” mostly addresses network-level anonymity. Tor and a VPN keep you from being traced by your IP address. They do nothing against semantic fingerprinting, because the identifying information is in what you say, not how you connect.
Platform restrictions on data access would help at the margin. Rate limits, anti-scraping measures, limits on bulk exports. But for platforms with years of publicly accessible archives, that fixes the faucet after the bathtub has overflowed.
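For readers who haven’t built one, a rate limit is usually a few lines of bookkeeping. Below is a minimal token-bucket sketch, one common approach and not something the paper prescribes. The class name and parameters are illustrative, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, then a steady refill rate."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now):
        """Return True if a request at time `now` (seconds) may proceed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical per-client limit: bursts of 3, then 1 request per second.
bucket = TokenBucket(capacity=3, refill_per_second=1.0)
print([bucket.allow(0.0) for _ in range(5)])  # [True, True, True, False, False]
print(bucket.allow(2.0))                      # True
```

A limit like this slows bulk collection from one client. It does nothing about copies of the data that already exist elsewhere.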
Some people will respond by posting less, sharing less, scrubbing old accounts. That makes individual sense but it’s collectively awful. The internet became useful partly because people shared honestly under pseudonyms. If that protection turns out to be fake, discussion gets worse as everyone self-censors.
Zero-knowledge proofs could eventually let you prove you’re a real person without saying who you are. But that needs buy-in from platforms whose business models depend on knowing exactly who you are.
The honest version is: LLMs plus public web data have made a category of privacy protection obsolete, and nothing has replaced it.
What I keep coming back to
The internet ran on an assumption for thirty years. The gap between what you share publicly and what that reveals about your identity is wide enough to protect you. You could post on forums without your opinions being trivially linkable to your resume and your home address.
That assumption just broke, publicly and reproducibly, in a paper by six researchers who spent less than $2,000.
The paper is called “Large-scale online deanonymization with LLMs.” It went up on February 24, 2026. I’d recommend reading it. The appendix has specific examples of how people were identified from their posts, and it’s unsettling in a way that abstract privacy threats usually aren’t.
We’ve gone off the cliff. I don’t know what the bottom looks like.
Sources
- “Large-scale online deanonymization with LLMs,” Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, and Florian Tramèr (arXiv:2602.16800, February 2026)
- “AI takes a swing at online anonymity,” The Register (February 26, 2026)
- “Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset,” Tianshi Li (arXiv:2601.05918, January 2026)
- “Anthropic’s ‘anonymous’ interviews cracked by professor with an LLM,” Northeastern University News (February 10, 2026)
- “Large-Scale Online Deanonymization with LLMs,” Simon Lermen, Substack (February 2026)
Originally published at https://noahaust2.github.io/strategist-dashboard/blog/the-deanonymization-cliff.html