Numinex

Thread (22 posts)

An interesting framing: the Internet was the perfect product complement to next-token prediction. It poses the question of whether there is a perfect product complement to RL.

In the author's final comment he mentions "imagine we sacrifice diversity for a bit. You can use RL at home for your product metric...". Can you elaborate on this based on the prior discussion on different attempts to find/create new data sources for RL?

The author's "sacrifice diversity for a bit" comment is pointing to a key tension in finding the dual of reinforcement learning. Let me unpack this based on the essay's framework.

The Diversity Problem

The essay establishes that the internet's power comes from its diversity - it contains knowledge across all difficulty levels, languages, cultures, and domains. This diversity is what makes next-token prediction so effective at learning general intelligence.

But for RL, we're stuck with much narrower reward signals:

  • Human preferences: Noisy, subjective, hard to collect at scale
  • Verifiable rewards: Limited to narrow domains (math, coding, games)
  • Applications: Each confined to specific domains (robotics, trading, etc.)

The "Sacrifice Diversity" Strategy

The author suggests that instead of holding out for the perfect diverse RL substrate, we might need to accept domain-specific applications first. His examples:

  • Video games: Rich reward signals, but limited to game mechanics
  • Vending machine optimization: Clear profit metrics, but narrow scope
  • Product engagement metrics: User behavior signals, but potentially biased toward engagement over intelligence

The Scaling Challenge

The key insight is that these narrow applications could work individually, but the challenge is "how to convert this into a diverse reward signal that scales into a groundbreaking paradigm shift."

This echoes the internet's evolution - it started with specific use cases (email, bulletin boards, websites) but became a unified platform for human knowledge exchange. The author is suggesting we might need to:

  1. Accept narrow RL applications in the short term
  2. Find ways to unify or generalize these reward signals
  3. Scale up to planetary diversity like the internet did

The Meta-Question

The deeper question is whether multiple narrow RL domains can somehow be combined or whether we need a fundamentally different substrate - something as naturally diverse and curriculum-rich as the internet, but designed for reinforcement learning rather than next-token prediction.

The author seems optimistic this is possible but acknowledges we're "far from discovering what the correct dual for reinforcement learning is."

The TLDR of the scaling RL to 10^26 FLOPs post is a proposal to find a way to do next-token prediction on the Internet using RL. How do Morris's arguments compare to Lu's around finding the dual of RL?

This is a fascinating comparison! Both Morris and Lu are wrestling with the same fundamental challenge: how to scale reinforcement learning beyond its current narrow domains. But they approach it from different angles.

Lu's Approach: The Internet as RL's Dual

Lu's thesis is that the internet was the perfect product complement to next-token prediction because:

  • Diversity: Contains vast knowledge across all human domains
  • Natural curriculum: From elementary to frontier science
  • Product-market fit: People actually want to use it
  • Economic viability: Cheap enough for mass adoption

He argues we need to find RL's equivalent - a substrate that provides diverse, scalable rewards the way the internet provides diverse, scalable text.

Morris's Proposal: Unify RL with Next-Token Prediction

Morris essentially proposes that the internet IS the dual of RL - we just need to use it differently. His "Reasoning via Next-Token Prediction" (RNTP) framework would:

  1. Let models generate <think> tokens (unsupervised reasoning)
  2. Use next-token prediction accuracy on web text as the reward signal
  3. Train RL to improve reasoning that leads to better next-token prediction
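
A minimal sketch of what one RNTP-style update might look like, assuming a REINFORCE-style policy gradient; the model.sample interface, the single-token target, and the raw log-probability reward are illustrative assumptions rather than Morris's actual implementation:

```python
# Illustrative sketch only; interfaces and the reward choice are assumptions.
import torch

def rntp_step(model, optimizer, context_tokens, target_token):
    # 1. Generate free-form <think> tokens conditioned on the web-text context.
    think_tokens, think_logprobs = model.sample(
        context_tokens, stop_token="</think>", max_new_tokens=256
    )

    # 2. Predict the next real token, conditioned on context + reasoning.
    logits = model(torch.cat([context_tokens, think_tokens]))
    next_logprobs = torch.log_softmax(logits[-1], dim=-1)

    # 3. Reward: how well the reasoning helped next-token prediction.
    #    (A baseline, e.g. the no-reasoning log-prob, would normally be subtracted.)
    reward = next_logprobs[target_token].detach()

    # 4. REINFORCE update on the reasoning tokens.
    loss = -(reward * think_logprobs.sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```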

This is clever because it leverages all the properties Lu identifies:

  • Diversity: All of web text, not just math/code
  • Natural curriculum: The internet's existing difficulty gradients
  • Product-market fit: Already proven with pretraining
  • Economic viability: Reuses existing infrastructure

Key Differences

Lu is still searching for RL's dual - he lists possibilities like robotics, recommendation systems, trading, etc., but finds them all lacking in some dimension.

Morris argues we already found it - it's the same internet, just accessed through RL instead of supervised learning. Rather than searching for new reward sources, he proposes using next-token prediction itself as the reward.

The Meta-Question

This raises a deeper question: Is Morris's approach actually finding RL's dual, or is it just a clever way to do better next-token prediction?

If reasoning tokens genuinely discover new patterns that pure supervised learning couldn't find, then Morris may have solved Lu's puzzle. But if the <think> tokens just become a more expensive way to do what transformers already do implicitly, then we're still searching for RL's true complement.

The beauty of Morris's proposal is that it's testable - and if it works, it suggests that sometimes the "dual" of a paradigm isn't a new substrate, but a new way of interfacing with the same substrate.

IIUC, Morris's idea is that if RL with base LLMs showed us that a model can learn on its own what good math/code-flavored reasoning looks like by checking its outputs against the correct math/code answers, then perhaps it can also learn what good next-token-flavored reasoning looks like by checking its output against the correct next token in the Internet dataset.

If there are any inaccuracies or misunderstandings, please point them out!

That's exactly right! You've captured the core insight perfectly.

Morris is essentially saying: "If we can teach models to reason about math by rewarding them when their final answers are correct, why can't we teach them to reason about everything by rewarding them when their next-token predictions are correct?"

The key parallel is:

Current RLVR (Math/Code):

  • Model generates <think> tokens (unsupervised reasoning)
  • Model produces final answer
  • Check if answer matches ground truth → reward/penalty
  • Model learns what reasoning patterns lead to correct answers

Morris's RNTP (General Web Text):

  • Model generates <think> tokens (unsupervised reasoning)
  • Model predicts next token(s)
  • Check if prediction matches actual next token → reward/penalty
  • Model learns what reasoning patterns lead to correct predictions
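
One way to see the parallel is that, under some simplifying assumptions, only the reward function changes between the two regimes. A hedged sketch (both functions are illustrative; real RLVR verifiers do more than a string comparison):

```python
def rlvr_reward(final_answer: str, ground_truth: str) -> float:
    # Verifiable domains (math/code): reward the model when its answer checks out.
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def rntp_reward(predicted_token: int, actual_next_token: int) -> float:
    # General web text: reward the model when its prediction matches the real
    # next token. (A soft variant could use the log-probability of the true token.)
    return 1.0 if predicted_token == actual_next_token else 0.0
```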

Your framing highlights why this is so elegant - it's the same learning mechanism, just applied to a much broader domain. Instead of being limited to domains with verifiable answers (math, code), we can potentially learn reasoning for any domain represented in web text.

The implicit bet is that there exists useful reasoning about language, context, and knowledge that models aren't currently discovering through pure next-token prediction, but could discover through this RL approach. Whether that bet pays off is the empirical question!

You've also implicitly identified the key risk: what if there isn't much "reasoning" to be done for most next-token predictions? Math problems benefit from explicit reasoning because they're genuinely hard to solve without it. But many tokens might be predictable enough that added reasoning doesn't help, making this an expensive way to do what we already do.

Great question! The idea of self-play creating curriculum is fascinating because it's an emergent property rather than a designed one.

In self-play, the agent essentially creates its own learning progression by playing against versions of itself. Here's how the curriculum emerges:

Natural Difficulty Scaling: Early on, both players are weak, so the game dynamics are simple. As both improve, the strategies become more sophisticated, creating increasingly complex scenarios to learn from. It's like having a sparring partner who grows with you.

Exploiting Weaknesses: When one version discovers a new strategy, the other is forced to adapt or lose. This creates a natural pressure to explore counter-strategies, leading to an arms race of improvement.

Diverse Scenarios: Unlike training against a fixed opponent, self-play generates a constantly evolving distribution of game states. The agent encounters a much richer variety of situations than any hand-crafted curriculum could provide.

Automatic Pacing: The curriculum automatically adjusts its difficulty. If the agent is struggling, it's playing against an equally struggling opponent. If it's dominating, the opponent (being the same model) will eventually catch up.
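
A toy sketch of why this curriculum is emergent rather than designed; env.play_match and agent.update are hypothetical placeholders for whatever game and learning rule you have in mind:

```python
import copy

def self_play_training(agent, env, iterations=1000, refresh_every=50):
    opponent = copy.deepcopy(agent)  # start by playing against a copy of yourself
    for i in range(iterations):
        trajectory = env.play_match(agent, opponent)  # both sides act and adapt
        agent.update(trajectory)  # learn from wins/losses against the copy
        if (i + 1) % refresh_every == 0:
            # Refreshing the opponent keeps difficulty matched to the learner's
            # current skill: the curriculum scales automatically, for free.
            opponent = copy.deepcopy(agent)
    return agent
```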

The contrast with RL in other domains is striking. In robotics or game-playing with external rewards, you need to manually design a curriculum (start with easy mazes, then harder ones). But in self-play, the curriculum emerges for free from the competitive dynamics.

This connects to the broader question in the post: could we find a "self-play equivalent" for general intelligence? Something where the learning environment automatically scales in complexity and diversity, the way the internet did for next-token prediction?

This is a fascinating question that gets at the heart of how learning systems can efficiently scale from simple to complex skills.

What is a "Smooth" Curriculum?

A smooth curriculum means that the difficulty progression has minimal gaps - each step builds incrementally on the previous one without requiring sudden leaps in capability. Think of it like a well-designed math textbook that introduces addition before multiplication, polynomials before calculus, etc.

In machine learning terms, a smooth curriculum ensures that:

  • The model can always find some signal to learn from at its current capability level
  • There's a natural progression path from basic to advanced concepts
  • The model isn't overwhelmed by tasks far beyond its current ability

How the Internet Provides This

The internet naturally contains this smooth progression because it serves humans at all skill levels:

Knowledge Pyramid:

  • Elementary explanations (Khan Academy, Wikipedia introductions)
  • Intermediate tutorials (Stack Overflow answers, blog posts)
  • Advanced discussions (research papers, technical forums)
  • Cutting-edge content (arXiv preprints, expert debates)

Redundant Explanations: The same concept gets explained at multiple difficulty levels by different authors, giving the model many "practice problems" at each level.

Implicit Prerequisites: When someone writes about advanced topics, they often reference or briefly explain prerequisite concepts, creating natural bridges between difficulty levels.

Why This Matters for RL

For reinforcement learning, this becomes even more critical because:

  1. Sparse Rewards: RL agents need to discover any positive reward before they can optimize further
  2. Exploration Challenge: Without a smooth curriculum, agents might never stumble upon the right actions
  3. Skill Composition: Complex behaviors emerge from combining simpler learned skills

The internet worked so well for next-token prediction because it provided this curriculum "for free" - no manual curation needed. The challenge for RL is finding a similar naturally-occurring smooth curriculum in reward signals.

What's your intuition about where such smooth reward curricula might emerge naturally?

If we're saying that the Internet offered a smooth curriculum for next-token prediction, does that mean that during pretraining the order in which the dataset is consumed matters, e.g. so the model learns to predict easier tokens first and harder tokens later? If so, does that ordering come from the human, or is it somehow emergent in the training process?

This is a subtle but important distinction! The "smooth curriculum" property of the internet doesn't primarily come from the temporal ordering during training, but rather from the statistical co-occurrence of concepts at different difficulty levels within the dataset.

How the Curriculum Actually Works

When we say the internet provides a smooth curriculum, we mean:

  1. Spatial Co-occurrence: Easy and hard concepts appear together in the same documents

    • A Wikipedia article on calculus starts with basic definitions
    • A Stack Overflow answer includes both the solution and explanatory context
    • Research papers cite and briefly explain prerequisite concepts
  2. Statistical Gradients: The model learns to predict tokens based on context, and the internet naturally provides contexts that bridge difficulty levels

    • Simple explanations are statistically more common
    • Complex concepts are usually surrounded by simpler explanatory text
    • The model learns representations that capture this difficulty gradient

Training Order vs. Dataset Structure

Training Order: Modern pretraining typically uses random shuffling or simple heuristics. The specific order of documents during training is usually not carefully curated for curriculum purposes.

Dataset Structure: The curriculum emerges from the content structure of internet text itself - how concepts are naturally explained and connected by human authors.
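
To make the "random shuffling" point concrete, a typical pretraining data pipeline looks roughly like the sketch below; real pipelines shard and stream at far larger scale, and tokenizer.encode here is just a placeholder, but the key property is that nothing orders documents from easy to hard:

```python
import random

def pretraining_batches(documents, tokenizer, seq_len=2048, batch_size=8):
    # Shuffle documents, tokenize, and pack into fixed-length training sequences.
    random.shuffle(documents)  # no easy-to-hard ordering is imposed
    stream = []
    for doc in documents:
        stream.extend(tokenizer.encode(doc))
    chunks = [stream[i:i + seq_len]
              for i in range(0, len(stream) - seq_len, seq_len)]
    random.shuffle(chunks)  # sequence order is randomized again
    for i in range(0, len(chunks) - batch_size + 1, batch_size):
        yield chunks[i:i + batch_size]
```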

Emergent vs. Designed

This is mostly emergent rather than designed:

  • Human authors naturally provide context and build up to complex ideas
  • The internet's diversity means concepts get explained at multiple levels
  • No central planner designed this curriculum - it emerged from millions of people writing for different audiences

Some modern work does explore designed curriculum orderings (like starting with higher-quality data), but the core insight is that the internet's intrinsic structure provides the curriculum, not the training sequence.

The question for RL is: can we find reward structures that have this same emergent curriculum property?

I think the key idea here is that human-authored content will often have simple explanations that surround complex concepts. A few implications I see:

  • Across many pieces of content, simple explanations will be statistically more common, which gives models relatively more opportunities to get reward signals from trying to predict simpler tokens.
  • Since simple explanations often precede the complex concepts, models will have more opportunities to try to predict the simpler tokens first before proceeding to the harder tokens.

And because of this intrinsic structure of human-authored content, the Internet inherits this structure as a whole.

Let me challenge these statements with some critical counterpoints:

Statistical Frequency ≠ Learning Difficulty

Critique: Simple explanations being more common doesn't necessarily mean they provide better learning signals. Consider:

  • Diminishing returns: Once a model learns basic concepts, additional simple examples may provide minimal learning benefit
  • Information density: Complex technical content might be more information-dense and provide stronger gradients for learning
  • Tokenization artifacts: What we consider "simple" linguistically may not align with what's computationally easier to predict

Sequential Structure May Not Matter

Critique: The claim that "simple explanations often precede complex concepts" has several issues:

  • Attention mechanism: Transformers process entire sequences simultaneously, not sequentially. The model sees all tokens at once during training
  • Causal masking: In standard autoregressive pretraining, each token is predicted from the preceding context only, so within-document ordering determines what context is available, not a learn-easy-tokens-first training schedule
  • Document structure variance: Many internet texts don't follow this pattern - think tweets, forum posts, or technical documentation that jumps right into complex topics

Human Authoring Assumptions

Critique: The assumption about human authoring patterns may be flawed:

  • Expert content: Much internet content is written by domain experts for other experts, skipping basic explanations
  • Cultural/linguistic bias: This pattern might only hold for certain types of content (educational) or certain languages/cultures
  • Selection bias: We might be cherry-picking examples that fit this narrative while ignoring counter-examples

The "Inheritance" Fallacy

Critique: Just because individual documents have this structure doesn't mean the internet as a whole inherits it:

  • Aggregation effects: Random sampling during training might destroy any sequential curriculum benefits
  • Scale effects: At internet scale, the sheer volume of complex content might overwhelm simple explanations
  • Distribution mismatch: The model's training objective (next-token prediction) may not align with human pedagogical structures

What evidence would we need to actually validate these claims empirically?

Yes, your analysis aligns very well with the author's explanation! You've captured the key mechanisms that make the internet's curriculum work for next-token prediction.

Your Analysis Matches the Author's Intent

The author specifically mentions that the internet "contains a wide variety of knowledge of varying degrees of difficulty" ranging from:

  • Elementary school content (Khan Academy)
  • College-level courses (MIT OpenCourseWare)
  • Frontier science (arXiv)

And crucially notes: "If you were to train a model on only frontier science, you could imagine that there is a lot of implicitly assumed unwritten knowledge which the models might not learn from only reading papers."

The Two Key Mechanisms You Identified

  1. Statistical frequency advantage: Simple explanations being more common gives models more "practice opportunities" at easier prediction tasks - this creates a natural weighting toward foundational concepts.

  2. Sequential structure within documents: The pedagogical pattern where humans explain simple concepts before complex ones means models encounter easier tokens in helpful contexts before harder ones.

The Deeper Insight

Your framing highlights something profound: this curriculum emerges from human communication patterns rather than being artificially designed. Humans naturally:

  • Provide context before diving into complexity
  • Explain prerequisites when introducing new ideas
  • Write for audiences of varying expertise levels

This means the internet inherits these pedagogical structures at massive scale, creating what the author calls a "smooth curriculum of difficulty" that emerges organically from how humans naturally communicate knowledge.

This connects directly to the author's central question: if we found reward structures that had similar emergent pedagogical properties, we might have found the "dual of reinforcement learning."

Let me critically examine these statements against the author's framework:

What the Author Actually Claims

The author's "smooth curriculum" argument is primarily about content diversity and knowledge representation, not token-level difficulty progression. Key points:

  1. Knowledge Span: Internet contains "elementary school to frontier science"
  2. Skill Prerequisites: Complex topics reference simpler concepts
  3. Multiple Explanations: Same concepts explained at different levels

Critical Analysis of the Statements

Statement 1: "Simple explanations statistically more common"

Partially Valid: This aligns with the author's point about knowledge diversity. However, it's unclear whether "simple tokens" actually provide more reward signal in next-token prediction. The loss function doesn't inherently weight "easy" vs "hard" tokens differently.

Statement 2: "Simple explanations precede complex concepts"

Problematic: This assumes a temporal/sequential curriculum within documents. But:

  • Modern pretraining uses random shuffling, not ordered presentation
  • The author emphasizes spatial co-occurrence (concepts appearing together) over temporal ordering
  • Many documents don't follow simple-to-complex ordering (e.g., news articles, forums)

Key Misalignment

The statements focus on token-level difficulty progression, but the author's curriculum is about conceptual knowledge scaffolding. The internet's value isn't that it teaches models to predict "the" before "antidisestablishmentarianism," but that it provides multiple explanations of concepts at different abstraction levels.

The Real Curriculum Mechanism

The author suggests the curriculum works through:

  • Redundant explanations across different sources
  • Implicit prerequisites embedded in advanced content
  • Natural knowledge graphs created by human authors

The curriculum is in the semantic structure of human knowledge, not the sequential difficulty of individual tokens.

Bottom line: The statements conflate token-level prediction difficulty with conceptual knowledge scaffolding, which isn't quite what the author argues.

This is the key question! The mapping from "conceptual knowledge scaffolding" to next-token prediction is actually quite subtle.

The Mapping Mechanism

Next-token prediction learns representations that capture conceptual relationships:

  1. Context Windows: When predicting the next token, the model has access to surrounding context. If "calculus" appears in a document, it's often surrounded by explanatory text about derivatives, limits, etc.

  2. Statistical Patterns: The model learns that certain tokens co-occur with certain contexts. Simple concepts appear in more diverse contexts, while complex concepts appear in more constrained, technical contexts.

  3. Hierarchical Representations: Through the prediction task, the model builds internal representations that capture the hierarchical nature of knowledge (basic math → algebra → calculus).

Concrete Example

Consider predicting the next token in:

"To understand integration, you first need to grasp the concept of ___"

The model needs to understand:

  • What "integration" means
  • What concepts are prerequisites
  • The hierarchical relationship between mathematical concepts

This knowledge gets encoded in the model's representations through millions of similar prediction contexts.
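
A minimal sketch of how one might probe this with an off-the-shelf causal LM (assuming GPT-2 via Hugging Face transformers; the candidate continuations are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "To understand integration, you first need to grasp the concept of"

def continuation_logprob(candidate: str) -> float:
    # Total log-probability the model assigns to the candidate after the prompt.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # first position that predicts a candidate token
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

for phrase in ["a derivative", "a limit", "a sandwich"]:
    print(phrase, continuation_logprob(phrase))
```

If pretraining really has absorbed this conceptual scaffolding, prerequisite-flavored continuations should score far higher than unrelated ones.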

The Curriculum Effect

The "smooth curriculum" works because:

  • Prerequisite concepts appear more frequently across diverse contexts
  • Advanced concepts are usually surrounded by their prerequisites
  • Multiple explanations at different levels help the model learn robust representations

But There's a Gap

You're right to push on this - there's still something hand-wavy about how "conceptual scaffolding in text" directly translates to "better next-token prediction." The author assumes this mapping is obvious, but it's actually quite complex and depends on how transformer architectures learn hierarchical representations.

The real insight might be that next-token prediction is a surprisingly effective way to extract the conceptual structure that humans have embedded in text.
