The next big breakthrough will be AIs learning on the job

By Dwarkesh Patel, June 26, 2026

Labs are throwing away the most valuable data.

Current AI training focuses on large numbers of verifiable tasks in simulated environments, an approach that labs are betting will lead to Artificial General Intelligence (AGI). The article argues instead that the real advance lies in ‘continual learning,’ where AIs learn from real-world deployment, much as humans build expertise over time. It proposes techniques such as On-Policy Self-Distillation (OPSD) and ‘dreaming’ (simulated practice) to enable this ongoing improvement, which could transform AI capabilities by 2027.

The current bet for AGI is training AIs on millions of verifiable tasks in diverse RL environments, aiming to instill general problem-solving skills.
Scaling training with more compute is seen as a way to overcome limitations such as data inefficiency and the lack of continual learning, much as scale drove earlier LLM progress.
AIs are sample-inefficient during training, but that cost is amortized over billions of user sessions, and what improves is the model's in-session capability.
On this view, continual learning, defined as updating model weights after deployment, may become unnecessary if in-context learning over extended context windows becomes powerful enough.
Progress in computer use has been slower than in coding and math due to a lack of ‘grindable’ environments (replayable simulators), not just verifiability.
Real-world domains like building businesses or winning court cases are difficult to simulate, requiring sample efficiency for AGI to learn effectively.
The article questions whether reinforcement learning (RL) in simulated environments will generalize to complex, real-world problems.
A significant amount of compute is spent on inference without improving the model, representing a ‘waste’ of valuable real-world learning opportunities.
The article argues that continual learning requires updating model weights, not just expanding context windows, to achieve scalable, human-like learning.
On-Policy Self-Distillation (OPSD) and ‘dreaming’ are proposed as methods to distill session-specific learning back into model weights.
OPSD offers advantages over reinforcement learning by not requiring external rewards and by providing denser supervision, and over supervised fine-tuning by focusing on essential insights rather than rote memorization.
Dreaming involves an AI building its own simulations to practice and rehearse skills, greatly multiplying its learning opportunities.
By 2027, AI could be competent enough for real-world deployment, continuously learning from user interactions and experience, with advancements in OPSD and dreaming.

Continue reading: https://www.dwarkesh.com/p/the-next-paradigm
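
The contrast between OPSD and reward-based training can be made concrete with a toy calculation. The sketch below is not taken from the article: it assumes one common way to set up self-distillation, in which the same model acts as a "teacher" when it can see what was learned in a session and as a "student" when it cannot, and the student is pulled toward the teacher at every token of its own sampled response. The function names, the three-token vocabulary, the probabilities, and the choice of reverse KL divergence are all illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(student_dists, teacher_dists):
    """Return (mean loss, per-token losses) for one sampled response.

    Each argument holds one next-token distribution per position of a
    response that the student itself sampled (hence "on-policy").
    Assumption: the loss is the reverse KL, KL(student || teacher).
    """
    per_token = [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token), per_token


# Hypothetical distributions over a 3-token vocabulary at 2 positions.
# Student: the model without the session's lessons in context.
student = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]
# Teacher: the same model with those lessons in context (assumed setup).
teacher = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1]]

mean_loss, per_token = self_distillation_loss(student, teacher)
print("per-token loss:", per_token)
print("mean loss:", mean_loss)
```

In this toy case the training signal is one number per token rather than one reward per episode, which is one way to read the "denser supervision" point. No external reward appears anywhere; the only target is the model's own better-informed distribution. A real system would compute these distributions from model logits and backpropagate the loss through the student.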

Reference: https://foxvector.com/articles/0935112f-ccff-4fc6-87ed-de9df28b5a76
