An interesting framing: the Internet was the perfect product complement to next-token prediction. The essay then poses the question of whether there is a perfect product complement to RL.
In the author's final comment, he mentions "imagine we sacrifice diversity for a bit. You can use RL at home for your product metric...". Can you elaborate on this, based on the earlier discussion of different attempts to find or create new data sources for RL?
The author's "sacrifice diversity for a bit" comment points to a key tension in finding the dual of reinforcement learning. Let me unpack it using the essay's framework.
The Diversity Problem
The essay establishes that the internet's power comes from its diversity - it contains knowledge across all difficulty levels, languages, cultures, and domains. This diversity is what makes next-token prediction so effective at learning general intelligence.
But for RL, we're stuck with much narrower reward signals (contrasted in the code sketch after this list):
- Human preferences: Noisy, subjective, hard to collect at scale
- Verifiable rewards: Limited to narrow domains (math, coding, games)
- Applications: Each confined to specific domains (robotics, trading, etc.)
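To make the contrast concrete, here is a minimal, purely illustrative Python sketch; the function names and the Gaussian noise model are assumptions for the example, not anything from the essay:

```python
import random

def verifiable_reward(passed_tests: int, total_tests: int) -> float:
    """Verifiable regime: deterministic and cheap to check,
    but only defined where a checker exists (math, code, games)."""
    return passed_tests / total_tests

def preference_reward(true_quality: float, noise_sd: float = 0.3) -> float:
    """Preference regime: applies to any output, but the label is a
    noisy, subjective proxy for the quality we actually care about."""
    noisy = true_quality + random.gauss(0.0, noise_sd)
    return min(1.0, max(0.0, noisy))  # clamp to [0, 1]
```

The asymmetry is the point: the clean signal doesn't generalize, and the general signal isn't clean.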
The "Sacrifice Diversity" Strategy
The author suggests that instead of holding out for a perfectly diverse RL substrate, we might need to accept domain-specific applications first. His examples (the vending machine case is sketched in code after this list):
- Video games: Rich reward signals, but limited to game mechanics
- Vending machine optimization: Clear profit metrics, but narrow scope
- Product engagement metrics: User behavior signals, but potentially biased toward engagement over intelligence
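As a toy illustration of the vending-machine case (the linear demand curve and all numbers below are invented for the example, not taken from the essay):

```python
def demand(price: float) -> int:
    """Toy linear demand curve: an assumption for the example."""
    return max(0, int(100 - 40 * price))

def profit_reward(price: float, unit_cost: float = 0.50) -> float:
    """A clear, dense profit metric: ideal as an RL reward within this
    domain, but it teaches nothing beyond this one pricing problem."""
    return (price - unit_cost) * demand(price)

# Even a brute-force sweep shows what the signal rewards:
best_profit, best_price = max(
    (profit_reward(p / 100), p / 100) for p in range(50, 250)
)
print(f"optimal price ~ ${best_price:.2f}, profit ~ {best_profit:.1f}")
```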
The Scaling Challenge
The key insight is that these narrow applications could work individually, but the challenge is "how to convert this into a diverse reward signal that scales into a groundbreaking paradigm shift."
This echoes the internet's evolution - it started with specific use cases (email, bulletin boards, websites) but became a unified platform for human knowledge exchange. The author is suggesting we might need to:
- Accept narrow RL applications in the short term
- Find ways to unify or generalize these reward signals (a naive version is sketched after this list)
- Scale up to planetary diversity like the internet did
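One naive way to picture step 2 is a task mixture: sample a domain, then score the policy's output with that domain's own reward. This is a hypothetical sketch of what unification could look like, not the author's proposal; every reward function here is a stand-in:

```python
import random
from typing import Callable

# Hypothetical registry of narrow, domain-specific reward functions.
REWARD_FNS: dict[str, Callable[[str], float]] = {
    "math": lambda ans: float(ans.strip() == "42"),      # verifiable check
    "vending": lambda plan: min(1.0, len(plan) / 80),    # stand-in profit proxy
    "game": lambda move: float(move in {"up", "down"}),  # stand-in game score
}

def sample_task() -> str:
    """Uniform mixture over domains; real diversity would need far more."""
    return random.choice(list(REWARD_FNS))

def mixed_reward(domain: str, output: str) -> float:
    """Score one policy against whichever narrow reward the task demands."""
    return REWARD_FNS[domain](output)
```

The open question in the thread is whether any hand-built mixture like this ever approaches the internet's organic diversity, or whether unification has to emerge from somewhere else entirely.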
The Meta-Question
The deeper question is whether multiple narrow RL domains can somehow be combined or whether we need a fundamentally different substrate - something as naturally diverse and curriculum-rich as the internet, but designed for reinforcement learning rather than next-token prediction.
The author seems optimistic this is possible but acknowledges we're "far from discovering what the correct dual for reinforcement learning is."