An interesting framing: the Internet was the perfect product complement to next-token prediction. It poses the question of whether there is a perfect product complement to RL.
The TLDR of the scaling RL to 10^26 FLOPs post is a proposal to find a way to do next-token prediction on the Internet using RL. How do Morris's arguments compare to Lu's around finding the dual of RL?
This is a fascinating comparison! Both Morris and Lu are wrestling with the same fundamental challenge: how to scale reinforcement learning beyond its current narrow domains. But they approach it from different angles.
Lu's Approach: The Internet as Next-Token Prediction's Dual
Lu's thesis is that the internet was the perfect product complement to next-token prediction because:
- Diversity: Contains vast knowledge across all human domains
- Natural curriculum: From elementary to frontier science
- Product-market fit: People actually want to use it
- Economic viability: Cheap enough for mass adoption
He argues we need to find RL's equivalent - a substrate that provides diverse, scalable rewards the way the internet provides diverse, scalable text.
Morris's Proposal: Unify RL with Next-Token Prediction
Morris essentially proposes that the internet IS the dual of RL - we just need to use it differently. His "Reasoning via Next-Token Prediction" (RNTP) framework would:
- Let models generate `<think>` tokens (unsupervised reasoning)
- Use next-token prediction accuracy on web text as the reward signal (see the sketch below)
- Train RL to improve reasoning that leads to better next-token prediction
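To make that reward signal concrete, here is a minimal Python sketch of how an RNTP-style reward could be computed. This is an illustration under assumptions, not Morris's actual implementation: `generate_thoughts` and `next_token_logprob` are hypothetical stand-ins for the model's sampling and scoring interfaces, and the reward is taken to be the log-likelihood of the true web-text continuation after the `<think>` block.

```python
from typing import Callable, List

def rntp_reward(
    context: List[str],
    continuation: List[str],
    generate_thoughts: Callable[[List[str]], List[str]],    # hypothetical: samples <think>...</think> tokens
    next_token_logprob: Callable[[List[str], str], float],  # hypothetical: returns log p(token | prefix)
) -> float:
    """Sketch of an RNTP-style reward: next-token prediction quality *after* thinking.

    The <think> tokens are sampled freely (no supervision on the reasoning itself);
    the only learning signal is how well the model predicts the true continuation
    of ordinary web text once the thoughts are in context.
    """
    thoughts = generate_thoughts(context)        # unsupervised reasoning tokens
    prefix = list(context) + thoughts
    total_logprob = 0.0
    for token in continuation:
        total_logprob += next_token_logprob(prefix, token)  # score the true next token
        prefix.append(token)                                 # teacher-force the real text
    return total_logprob  # higher = better prediction after reasoning; feed this to any RL update
```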
This is clever because it leverages all the properties Lu identifies:
- Diversity: All of web text, not just math/code
- Natural curriculum: The internet's existing difficulty gradients
- Product-market fit: Already proven with pretraining
- Economic viability: Reuses existing infrastructure
Key Differences
Lu is still searching for RL's dual - he lists possibilities like robotics, recommendation systems, trading, etc., but finds them all lacking in some dimension.
Morris argues we already found it - it's the same internet, just accessed through RL instead of supervised learning. Rather than searching for new reward sources, he proposes using next-token prediction itself as the reward.
The Meta-Question
This raises a deeper question: Is Morris's approach actually finding RL's dual, or is it just a clever way to do better next-token prediction?
If reasoning tokens genuinely discover new patterns that pure supervised learning couldn't find, then Morris may have solved Lu's puzzle. But if the `<think>` tokens just become a more expensive way to do what transformers already do implicitly, then we're still searching for RL's true complement.
The beauty of Morris's proposal is that it's testable - and if it works, it suggests that sometimes the "dual" of a paradigm isn't a new substrate, but a new way of interfacing with the same substrate.
IIUC, Morris's idea is that if RL with base LLMs showed us that a model can learn what good math/code-flavored reasoning looks like on its own by checking its outputs against the correct math/code answers, then perhaps it can also learn what good next-token-flavored reasoning looks like on its own by checking its output against the correct next token in the Internet dataset.
If there are any inaccuracies or misunderstandings, please point them out!
That's exactly right! You've captured the core insight perfectly.
Morris is essentially saying: "If we can teach models to reason about math by rewarding them when their final answers are correct, why can't we teach them to reason about everything by rewarding them when their next-token predictions are correct?"
The key parallel is:
Current RLVR (Math/Code):
- Model generates `<think>` tokens (unsupervised reasoning)
- Model produces final answer
- Check if answer matches ground truth → reward/penalty
- Model learns what reasoning patterns lead to correct answers
Morris's RNTP (General Web Text):
- Model generates `<think>` tokens (unsupervised reasoning)
- Model predicts next token(s)
- Check if prediction matches actual next token → reward/penalty (both checks are sketched below)
- Model learns what reasoning patterns lead to correct predictions
Your framing highlights why this is so elegant - it's the same learning mechanism, just applied to a much broader domain. Instead of being limited to domains with verifiable answers (math, code), we can potentially learn reasoning for any domain represented in web text.
The implicit bet is that there exists useful reasoning about language, context, and knowledge that models aren't currently discovering through pure next-token prediction, but could discover through this RL approach. Whether that bet pays off is the empirical question!
You've also implicitly identified the key risk: what if there isn't much "reasoning" to be done for most next-token predictions? Math problems benefit from explicit reasoning because they're genuinely hard to solve without it. But many tokens might be predictable enough that added reasoning doesn't help, making this an expensive way to do what we already do.