The data black hole at the center of AI

By Dwarkesh Patel, June 19, 2026

"We see these AIs as a galaxy glittering with capabilities, but at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data."

Current AI advancements are fueled primarily by increasing the volume and quality of data, alongside scaling compute, rather than by improving sample efficiency. Humans are far more sample efficient, learning complex tasks from far less data than AI models require. Despite this inefficiency, AI’s continuous operation and the potential to amortize training costs across billions of sessions make it economically viable for automating tasks.

AI progress has relied on wider data distributions and scaling compute, rather than improvements in training sample efficiency.
Reinforcement learning (RL) acts as a synthetic data generation method, using compute to find ‘good’ data.
Models need a reasonable prior, plus vast amounts of human expert data, to become competent in specific domains.
The data industry for expert labels and RL environments is generating billions in revenue.
Open models lag state-of-the-art closed models by only a small margin, suggesting that data, not hyperparameters or architectural tweaks, is the main driver of progress.
Frontier AI models are trained on trillions of tokens, up to roughly a million times the ~200 million tokens a human sees from birth to adulthood.
Humans are far more sample efficient than current AI models, with differences of 3-4 orders of magnitude in tasks like learning to drive.
The argument that evolution acts as pre-training for humans is countered by the size of AI models and by the fact that neural connections must be learned over a lifetime.
Even with pre-training, AI requires extensive data for marginal capabilities, unlike humans who learn new skills more efficiently.
The immense sensory data humans process may not be the primary driver of general intelligence.
Scaling laws suggest increased parameters could improve sample efficiency, but the gap between human and AI efficiency remains vast.
Sample efficiency might not be necessary for automating white-collar work or AI research, as AI’s continuous operation and continuous training offset inefficiencies.
The future of AI may involve automated AI researchers solving the sample efficiency problem, leading to human-like intelligence.

Continue reading: https://www.dwarkesh.com/p/the-sample-efficiency-black-hole-2
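
The token-count comparison in the key points above can be checked with quick arithmetic. The sketch below is illustrative only: the ~200 million human figure comes from the summary, while the three training-set sizes are hypothetical values chosen to show how the ratio moves, not numbers reported for any particular model.

```python
# Rough check of the "model tokens vs. human tokens" comparison.
# HUMAN_TOKENS is the ~200 million figure quoted in the summary above.
HUMAN_TOKENS = 200_000_000


def data_ratio(training_tokens: int, human_tokens: int = HUMAN_TOKENS) -> float:
    """Return how many times more tokens a model sees than a human does."""
    return training_tokens / human_tokens


# Hypothetical training-set sizes (assumptions for illustration only).
TRILLION = 10**12
scenarios = {
    "1 trillion": 1 * TRILLION,
    "20 trillion": 20 * TRILLION,
    "200 trillion": 200 * TRILLION,
}

for label, tokens in scenarios.items():
    print(f"{label:>12} tokens -> {data_ratio(tokens):,.0f}x a human's exposure")
```

Under these assumptions, a 1-trillion-token run is about 5,000 times a human's exposure, and the million-fold figure corresponds to a training set of roughly 200 trillion tokens.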

Reference: https://foxvector.com/articles/8b56370c-994e-4f27-ba32-b7c39ae03d98
