An interesting framing: the Internet was the perfect product complement to next-token prediction. It poses the question of whether there is a perfect product complement to RL.
The TLDR of the scaling RL to 10^26 FLOPs post is a proposal to find a way to do next-token prediction on the Internet using RL. How do Morris's arguments compare to Lu's around finding the dual of RL?
This is a fascinating comparison! Both Morris and Lu are wrestling with the same fundamental challenge: how to scale reinforcement learning beyond its current narrow domains. But they approach it from different angles.
Lu's Approach: The Internet as Next-Token Prediction's Dual
Lu's thesis is that the internet was the perfect product complement to next-token prediction because:
- Diversity: Contains vast knowledge across all human domains
- Natural curriculum: From elementary to frontier science
- Product-market fit: People actually want to use it
- Economic viability: Cheap enough for mass adoption
He argues we need to find RL's equivalent - a substrate that provides diverse, scalable rewards the way the internet provides diverse, scalable text.
Morris's Proposal: Unify RL with Next-Token Prediction
Morris essentially proposes that the internet IS the dual of RL - we just need to use it differently. His "Reasoning via Next-Token Prediction" (RNTP) framework would:
- Let models generate `<think>` tokens (unsupervised reasoning)
- Use next-token prediction accuracy on web text as the reward signal (see the sketch below)
- Train RL to improve reasoning that leads to better next-token prediction
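To make that reward signal concrete, here is a minimal Python sketch of how an RNTP-style reward could be computed. This is an illustration under assumptions, not Morris's actual implementation: `generate_thoughts` and `next_token_logprob` are hypothetical stand-ins for the model's sampling and scoring interfaces, and the reward is taken to be the log-likelihood of the true web-text continuation after the `<think>` block.

```python
from typing import Callable, List

def rntp_reward(
    context: List[str],
    continuation: List[str],
    generate_thoughts: Callable[[List[str]], List[str]],    # hypothetical: samples <think>...</think> tokens
    next_token_logprob: Callable[[List[str], str], float],  # hypothetical: returns log p(token | prefix)
) -> float:
    """Sketch of an RNTP-style reward: next-token prediction quality *after* thinking.

    The <think> tokens are sampled freely (no supervision on the reasoning itself);
    the only learning signal is how well the model predicts the true continuation
    of ordinary web text once the thoughts are in context.
    """
    thoughts = generate_thoughts(context)        # unsupervised reasoning tokens
    prefix = list(context) + thoughts
    total_logprob = 0.0
    for token in continuation:
        total_logprob += next_token_logprob(prefix, token)  # score the true next token
        prefix.append(token)                                 # teacher-force the real text
    return total_logprob  # higher = better prediction after reasoning; feed this to any RL update
```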
This is clever because it leverages all the properties Lu identifies:
- Diversity: All of web text, not just math/code
- Natural curriculum: The internet's existing difficulty gradients
- Product-market fit: Already proven with pretraining
- Economic viability: Reuses existing infrastructure
Key Differences
Lu is still searching for RL's dual - he lists possibilities like robotics, recommendation systems, trading, etc., but finds them all lacking in some dimension.
Morris argues we already found it - it's the same internet, just accessed through RL instead of supervised learning. Rather than searching for new reward sources, he proposes using next-token prediction itself as the reward.
The Meta-Question
This raises a deeper question: Is Morris's approach actually finding RL's dual, or is it just a clever way to do better next-token prediction?
If reasoning tokens genuinely discover new patterns that pure supervised learning couldn't find, then Morris may have solved Lu's puzzle. But if the `<think>` tokens just become a more expensive way to do what transformers already do implicitly, then we're still searching for RL's true complement.
The beauty of Morris's proposal is that it's testable - and if it works, it suggests that sometimes the "dual" of a paradigm isn't a new substrate, but a new way of interfacing with the same substrate.
IIUC, Morris's idea is that if RL with base LLMs showed us that a model can learn what good math/code-flavored reasoning looks like on its own by checking its outputs against the correct math/code answers, then perhaps it can also learn what good next-token-flavored reasoning looks like on its own by checking its output against the correct next token in the Internet dataset.
If there are any inaccuracies or misunderstandings, please point them out!
That's exactly right! You've captured the core insight perfectly.
Morris is essentially saying: "If we can teach models to reason about math by rewarding them when their final answers are correct, why can't we teach them to reason about everything by rewarding them when their next-token predictions are correct?"
The key parallel is:
Current RLVR (Math/Code):
- Model generates `<think>` tokens (unsupervised reasoning)
- Model produces final answer
- Check if answer matches ground truth → reward/penalty
- Model learns what reasoning patterns lead to correct answers
Morris's RNTP (General Web Text):
- Model generates `<think>` tokens (unsupervised reasoning)
- Model predicts next token(s)
- Check if prediction matches actual next token → reward/penalty (both checks are sketched below)
- Model learns what reasoning patterns lead to correct predictions
Your framing highlights why this is so elegant - it's the same learning mechanism, just applied to a much broader domain. Instead of being limited to domains with verifiable answers (math, code), we can potentially learn reasoning for any domain represented in web text.
The implicit bet is that there exists useful reasoning about language, context, and knowledge that models aren't currently discovering through pure next-token prediction, but could discover through this RL approach. Whether that bet pays off is the empirical question!
You've also implicitly identified the key risk: what if there isn't much "reasoning" to be done for most next-token predictions? Math problems benefit from explicit reasoning because they're genuinely hard to solve without it. But many tokens might be predictable enough that added reasoning doesn't help, making this an expensive way to do what we already do.