Numinex

@hdevalence.bsky.social

did:plc:2wco2e2ybeg2lwpfx3kqjsuv

157 posts
400 replies
542 generations
47 links
10 prompts

On Deterministic Sketching and Streaming for Sparse Recovery and Norm Estimation

We study classic streaming and sparse recovery problems using deterministic linear sketches, including l1/l1 and linf/l1 sparse recovery problems (the latter also being known as l1-heavy hitters), norm estimation, and approximate inner product. We focus on devising a fixed matrix A in R^{m x n} and a deterministic recovery/estimation procedure which work for all possible input vectors simultaneously. Our results improve upon existing work, the following being our main contributions: * A proof that linf/l1 sparse recovery and inner product estimation are equivalent, and that incoherent matrices can be used to solve both problems. Our upper bound for the number of measurements is m=O(eps^{-2}*min{log n, (log n / log(1/eps))^2}). We can also obtain fast sketching and recovery algorithms by making use of the Fast Johnson-Lindenstrauss transform. Both our running times and number of measurements improve upon previous work. We can also obtain better error guarantees than previous work in t …

https://arxiv.org/abs/1206.5725
1 post

Accelerating Large Language Model Decoding with Speculative Sampling

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.
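As a reading aid, here is a minimal Python sketch (mine, not the paper's code) of the modified rejection step the abstract describes, for a single draft token over a toy vocabulary: with draft distribution p and target distribution q, the draft's token x is accepted with probability min(1, q(x)/p(x)); on rejection, a replacement token is drawn from the normalized positive part of (q - p), which leaves the output distributed exactly according to q.

    import numpy as np

    # Toy sketch of the modified rejection step in speculative sampling, for one
    # draft token over a small vocabulary. p = draft distribution, q = target.
    rng = np.random.default_rng(0)

    def speculative_token(p, q):
        x = rng.choice(len(p), p=p)                 # token proposed by the draft model
        if rng.random() < min(1.0, q[x] / p[x]):    # accept with prob min(1, q/p)
            return x
        residual = np.maximum(q - p, 0.0)           # otherwise resample from (q - p)+
        return rng.choice(len(p), p=residual / residual.sum())

    # Empirically, the produced tokens follow the target q, not the draft p.
    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.4, 0.4, 0.2])
    samples = [speculative_token(p, q) for _ in range(100_000)]
    print(np.bincount(samples) / len(samples))      # ~ [0.4, 0.4, 0.2]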

https://arxiv.org/abs/2302.01318
1 post

Fast Inference from Transformers via Speculative Decoding

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.
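To illustrate the loop structure only, here is a deliberately simplified, greedy-verification Python sketch of draft-then-verify decoding with hypothetical toy "models" (the paper's method additionally uses a rejection-sampling correction so that sampled outputs match the target model's distribution exactly; the functions and vocabulary below are stand-ins, not the T5X implementation).

    # Greedy-verification sketch of speculative decoding over a toy vocabulary.
    def draft_next(prefix):            # small, fast approximation model (hypothetical)
        return (prefix[-1] + 1) % 5

    def target_next_batch(prefixes):   # large model scores all prefixes in one call
        return [(p[-1] + 1) % 5 if p[-1] != 3 else 0 for p in prefixes]

    def speculative_decode(prefix, gamma=4, steps=12):
        out = list(prefix)
        while len(out) < steps:
            # 1) draft gamma tokens autoregressively with the cheap model
            draft = []
            for _ in range(gamma):
                draft.append(draft_next(out + draft))
            # 2) score all gamma prefixes with the target model in parallel
            prefixes = [out + draft[:i] for i in range(gamma)]
            verified = target_next_batch(prefixes)
            # 3) accept the longest prefix of draft tokens the target agrees with,
            #    then append the target's own token at the first disagreement
            n_accept = 0
            while n_accept < gamma and draft[n_accept] == verified[n_accept]:
                n_accept += 1
            out += draft[:n_accept]
            if n_accept < gamma:
                out.append(verified[n_accept])
        return out

    print(speculative_decode([0]))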

https://arxiv.org/abs/2211.17192
1 post

Why Do Some Language Models Fake Alignment While Others Don't?

Alignment faking in large language models presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment-faking for some models and amplifies it for others. We investigate 5 hypotheses …

https://arxiv.org/html/2506.18032v1
1 post

A Note on BitVM3 RSA Garbling - HackMD

The scheme proposed in BitVM3 is broken for circuits with fanout greater than 1, i.e. it only works for binary formulae. This renders the scheme impractical for SNARK verification on Bitcoin.

https://hackmd.io/@liameagen/bitvm3-garbling-note
1 post

Numinex

Notes on an experimental, multiplayer, open-world AI chat system built on the AT Protocol.

https://hdevalence.ca/blog/2025-07-08-numinex/
1 post

s41598-025-04804-9

https://www.nature.com/articles/s41598-025-04804-9.pdf
1 post

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Content Warning: This blog post contains examples of harmful language generated by LLM. …

https://www.lesswrong.com/posts/zgM92GFsgDi5XaSda/deception-and-jailbreak-sequence-2-iterative-refinement
1 post

Just Ask for Generalization

This blog post outlines a key engineering principle I’ve come to believe strongly in for building general AI systems with deep learning. This principle guides my present-day research tastes and day-to-day design choices in building large-scale, general-purpose ML systems.

Discoveries around Neural Scaling Laws, unsupervised pretraining on Internet-scale datasets, and other work on Foundation Models have pointed to a simple yet exciting narrative for making progress in Machine Learning: (1) large amounts of diverse data are more important to generalization than clever model biases, and (2) if you believe (1), then how much your model generalizes is directly proportional to how fast you can push diverse data into a sufficiently high-capacity model.

To that end, deep neural nets trained with supervised learning are excellent data sponges - they can memorize vast amounts of data and can do this quickly by training with batch sizes in the tens of thousands. Modern architectures like ResNets and Transformers seem to have no trouble absorbing increasingly large datasets when trained via supervised learning.

When a model has minimized training loss (a.k.a. empirical risk), it can be said to have “memorized” the training set. Classically one would think that minimizing training loss to zero is shortly followed by overfitting, but overparameterized deep networks seem to generalize well even in this regime. The “double descent” phenomenon, illustrated in Patterns, Predictions, and Actions, shows that in some problems, overparameterized models can continue to reduce test error (risk) even as training loss is fully minimized. A recent ICLR workshop paper investigates this phenomenon on synthetic datasets, showing that if you train long enough in this zero-training-loss regime, the model can suddenly have an epiphany and generalize much later on (the authors call this “Grokking”). Furthermore, the paper also presents evidence that increasing training data actually decreases the amount of optimization required to generalize. It’s as my colleague Chelsea Finn once told me: “Memorization is the first step towards generalization!”

State-of-the-art neural networks trained this way can do really impressive things. A DALL-E model prompted with “A banana performing stand-up comedy” draws exactly such a picture; another DALL-E output, prompted with “an illustration of a baby panda with headphones staring at its reflection in a mirror”, renders just that. Note that there are no such images of “pandas looking into mirrors” or “banana comedians” in the training data (I think), so these results suggest that the DALL-E model has learned to interpret distinct concepts from text, render the corresponding visual parts in an image, and have them interact with each other somewhat coherently.

The ability to “just ask” language-conditioned deep learning models for what you want has led to “prompt engineering” as a viable space for improving our ML models. There is a Tweet discussing how priming a VQGAN + CLIP model with the words “Unreal Engine” leads to drastically higher-quality images. What if we could extend this principle - just asking for generalization - to other challenging problems that have eluded analytical algorithmic improvements?

Reinforcement Learning: Not a Great Data Sponge

In contrast to supervised learning, reinforcement learning algorithms are much less computationally efficient when it comes to absorbing the vast quantities of diverse data needed for generalization.
To see why this is the case, let’s consider a thought experiment where we train a general-purpose robot to do millions of tasks in unstructured environments. The standard Markov Decision Process is set up as follows: a policy is represented as a state-conditioned distribution over actions, \(p(a \vert s)\), and the environment as consisting of a reward function \(r(s_t, a_t)\) and transition dynamics \(p(s_{t+1} \vert s_t, a_t)\). Initial states and task objectives are encoded in the initial state \(s_0\), which is sampled from a distribution \(p(s_0)\). The goal is to maximize the sum of rewards across the episode, averaged across different starting states sampled from \(p(s_0)\):

\[\text{Solve}~\theta^\star = \arg\max_\theta R(\theta), \quad \text{where}~R(\theta) = E_{p(s_0)}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big],~ a_t \sim p_\theta(\cdot \vert s_t),~ s_{t+1} \sim p(\cdot \vert s_t, a_t),~ s_0 \sim p(s_0).\]

Let’s assume the existence of some optimal policy, which we call \(p^\star(a \vert s)\), that achieves the maximum reward \(\max_\theta R(\theta)\). “Supremum” would be more accurate, but I use the \(\max\) operator for notational simplicity. We want to bring our model, \(p_\theta(a \vert s)\), as close as possible to \(p^\star(a \vert s)\).

If we had access to the optimal policy \(p^\star(a \vert s)\) as an oracle, we could simply query the oracle action and use it like a supervised learning label. We could then train a feedforward policy that maps the states to the oracle actions, and benefit from all the nice properties that supervised learning methods enjoy: stable training, large batches, diverse offline datasets, no need to interact with the environment.

    while not converged:
        batch_states = replay_buffer.sample(batch_size)
        oracle_actions = [oracle_policy.sample_action(s) for s in batch_states]
        model.fit(batch_states, oracle_actions)

However, in reinforcement learning we often don’t have an expert policy to query, so we must improve the policy from its own collected experience. To do this, estimating the gradient that takes the model policy closer to the optimal policy requires evaluating the average episodic return of the current policy in the environment, and then estimating a gradient of that return with respect to parameters. If you treat the environment returns as a black-box with respect to some parameter \(\theta\), you can use the log-derivative trick to estimate its gradients:

\[\nabla_\theta E_{p(\theta)}[R(\theta)] = \int_\Theta \nabla_\theta p(\theta)\, R(\theta)\, d\theta = \int_\Theta p(\theta)\, \nabla_\theta \log p(\theta)\, R(\theta)\, d\theta = E_{p(\theta)}\big[\nabla_\theta \log p(\theta)\, R(\theta)\big].\]

This gradient estimator contains two expectations that we need to numerically approximate. First is computing \(R(\theta)\) itself, which is an expectation over starting states \(p(s_0)\). In my previous blog post I mentioned that accurate evaluation of a Binomial variable (e.g. the success rate of a robot on a single task) could require thousands of trials in order to achieve statistical certainty within a couple percent. For our hypothetical generalist robot, \(p(s_0)\) could encompass millions of unique tasks and scenarios, which makes accurate evaluation prohibitively expensive. The second expectation is encountered in the estimation of the policy gradient, over \(p(\theta)\).
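To make the variance problem concrete, here is a toy numpy sketch (my own illustration, not from the post) of the score-function estimator above on a one-step "environment": both expectations are approximated by sampling, and the spread of the resulting gradient estimate across repeated runs only shrinks as the number of sampled episodes grows.

    import numpy as np

    # Toy one-step "environment": reward is -(a - s0)^2 and p(s0) = N(0, 1).
    # Policy: a ~ N(theta * s0, sigma^2). (Hypothetical setup for illustration.)
    rng = np.random.default_rng(0)
    sigma = 0.5

    def reward(s0, a):
        return -(a - s0) ** 2

    def score_function_grad(theta, n_episodes):
        """REINFORCE-style estimate of d/dtheta E[R(theta)].

        Both expectations in the estimator above are approximated by sampling:
        the outer one over starting states s0 ~ p(s0), the inner one over
        actions a ~ p_theta(. | s0).
        """
        s0 = rng.normal(size=n_episodes)                       # sample starting states
        a = theta * s0 + sigma * rng.normal(size=n_episodes)   # sample actions
        score = (a - theta * s0) * s0 / sigma**2               # grad_theta log p_theta(a | s0)
        return np.mean(score * reward(s0, a))

    # The estimate is noisy: its spread across repeated runs shrinks only as the
    # number of sampled episodes grows.
    for n in [10, 1_000, 100_000]:
        grads = [score_function_grad(theta=0.3, n_episodes=n) for _ in range(20)]
        print(n, np.mean(grads), np.std(grads))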
Some algorithms like CMA-ES draw samples directly from the policy parameter distribution \(p(\theta)\), while other RL algorithms like PPO sample from the policy distribution \(p_\theta(a \vert s)\) and use the backpropagation rule to compute the gradient of the return with respect to the parameters: \(\frac{\partial R}{\partial \theta} = \frac{\partial R}{\partial \mu_a} \cdot \frac{\partial \mu_a}{\partial \theta}\). The latter is typically preferred because the search space on action parameters is thought to be smaller than the search space on policy parameters (and therefore requires fewer environment interactions to estimate a gradient).

If supervised behavior cloning on a single oracle label \(a \sim p^\star(a \vert s)\) gives you some gradient vector \(g^\star\), estimating the same gradient vector \(\bar{g} \approx g^\star\) with reinforcement learning requires something on the order of \(O(H(s_0) \cdot H(a))\) times as many episode rollouts to get a comparably low-variance estimate. This is a hand-wavy estimate that assumes a multiplicative factor of the entropy of the initial state distribution, \(O(H(s_0))\), for estimating \(R(\theta)\), and a multiplicative factor of the entropy of the action distribution, \(O(H(a))\), for estimating \(\nabla_\theta R(\theta)\) itself.

Consequently, online reinforcement learning on sparse rewards and diverse, possibly multi-task environments requires enormous numbers of rollouts to estimate returns and their gradients accurately. You have to pay this cost on every minibatch update! When the environment requires handling a wide variety of scenarios and demands generalization to unseen situations, it further increases the number of minibatch elements needed. The OpenAI DOTA team found that having millions of examples in their minibatch was required to bring gradient noise down to an acceptable level. This intuitively makes sense: if your objective \(R(\theta)\) has a minimum minibatch size needed to generalize well across many \(s_0\) without excessive catastrophic forgetting, then switching from supervised learning to online reinforcement learning will probably require a larger batch size by some multiplicative factor.

What about Offline RL?

What about offline RL methods like Deep Q-Learning on large datasets of \((S, A, R, S)\) transitions? These methods work by bootstrapping, where the target values that we regress value functions to are computed using a copy of the same network’s best action-value estimate on the next state. The appeal of these offline reinforcement learning methods is that you can get optimal policies from diverse, off-policy data without having to interact with the environment. Modified versions of Q-learning like CQL work even better on offline datasets, and have shown promise on smaller-scale simulated control environments.

Unfortunately, bootstrapping does not mix well with generalization. It is folk knowledge that the deadly triad of function approximation, bootstrapping, and off-policy data makes training unstable. I think this problem will only get worse as we scale up models and expect to train them on increasingly general tasks. This work shows that repeated bootstrapping iteratively decreases the capacity of the neural network. If you believe the claim that overparameterization of deep neural networks is key to generalization, then it would appear that, for the same neural net architecture, offline RL is not quite as “data absorbent” as supervised learning.
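For concreteness, here is a toy Python sketch (my own, not from the post) of the bootstrapped target at the heart of offline Q-learning on a fixed batch of logged transitions: the regression target for Q(s, a) is built from a frozen copy of the same value estimate at the next state, which is exactly the self-referential step that interacts badly with function approximation and off-policy data.

    import numpy as np

    # Toy sketch of the bootstrapped target in offline Q-learning, on a random
    # 5-state, 2-action problem with a fixed batch of logged transitions.
    rng = np.random.default_rng(0)
    n_states, n_actions, gamma, lr = 5, 2, 0.99, 0.05

    q = np.zeros((n_states, n_actions))          # online action-value estimate
    q_target = q.copy()                          # periodically synced frozen copy

    # Offline data: (s, a, r, s') tuples logged by some behavior policy that we
    # never get to interact with again.
    batch = [(rng.integers(n_states), rng.integers(n_actions),
              rng.normal(), rng.integers(n_states)) for _ in range(64)]

    for step in range(1000):
        for s, a, r, s_next in batch:
            # Bootstrapped target: built from the frozen copy's own best
            # action-value estimate at the next state.
            y = r + gamma * q_target[s_next].max()
            q[s, a] += lr * (y - q[s, a])        # regress Q(s, a) toward y
        if step % 50 == 0:
            q_target = q.copy()                  # sync the frozen copy

    print(q)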
In practice, even algorithms like CQL are still challenging to scale and debug on larger, real-world datasets; colleagues of mine tried several variations of AWAC and CQL on large-scale robotics problems and found them trickier to get working than naive methods like Behavior Cloning. Instead of going through all this trouble, what if we lean into what deep nets excel at - sponging up data quickly with supervised learning and generalizing to massive datasets? Can we accomplish what RL sets out to do using the tools of generalization, rather than direct optimization?

Learn the Distribution instead of the Optimum

What if we make generalization the first-class citizen in algorithmic design, and tailor everything else in service of it? What if we could simply learn all the policies with supervised learning, and “just ask nicely” for the best one?

Consider the recent work on Decision Transformer (DT), whereby instead of modeling a single policy and iteratively improving it with reinforcement learning, the authors simply use supervised learning coupled with a sequential model to predict trajectories of many different policies. The model is conditioned on the Return-to-Go so that it may predict actions consistent with a policy that would achieve those returns. The DT simply models all policies - good and bad - with supervised learning, and then uses the magic of deep learning generalization to infer the actions of a return-conditioned expert policy.

This phenomenon has been observed and developed in several prior and concurrent works, such as Reward-Conditioned Policies, Upside Down Reinforcement Learning, and Reinforcement Learning as One Big Sequence Modeling Problem. The AlphaStar team also found that conditioning a model on human player skill level (e.g. the units players ended up building, MMR, ELO scores) and using it to imitate all player data was superior to only imitating expert-level build orders. This technique is also commonly used in the Autonomous Vehicle space to model both good drivers and bad drivers jointly, even though the autonomous policy is only ever deployed to imitate good driving behavior.

Hindsight Language Relabeling

At a high level, DTs condition the supervised learning objective on some high-level description \(g\) that partitions what the policy will do in the future based on that value of \(g\). The return-to-go is an especially salient quantity for a reinforcement learning task, but you can also express the future outcomes via a goal state or a StarCraft build order or even a natural language description of what was accomplished.

In Language Conditioned Imitation Learning over Unstructured Data, the authors pair arbitrary trajectories with post-hoc natural language descriptions, and then train a model to clone those behaviors conditioned on language. At test time, they simply “ask” the policy to do a novel task in a zero-shot manner. The nice thing about these techniques is that they are indispensable for reaching sparse goals on RL tasks like Ant-Maze. This lends support to the claim that generalization and inference across goal-conditioning can do far better than brute-force search for a single sparse goal in a long-horizon task.

Language is a particularly nice choice for conditioning because it can be used to partition a trajectory not just on skill level, but also by task, by how much the policy explores, by how “animal-like” it is, and by any other observations a human might make about the trajectory.
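Here is a minimal sketch (my own toy example, not the Decision Transformer implementation) of the "model everything, then condition on the outcome you want" idea: trajectories from good and bad policies are labeled in hindsight with a conditioning variable g (here the achieved return; a language description would play the same role), a supervised model of p(a | s, g) is fit to all of them, and at test time we simply condition on a high return.

    import numpy as np

    # Toy sketch of return-conditioned behavior cloning: mediocre and good
    # policies both appear in the dataset, and each trajectory is labeled in
    # hindsight with the return it achieved (the conditioning variable g).
    rng = np.random.default_rng(0)

    def rollout(skill, n=200):
        s = rng.normal(size=n)
        a = skill * s + 0.3 * rng.normal(size=n)   # higher skill -> action closer to s
        g = -np.mean((a - s) ** 2) * np.ones(n)    # hindsight return label for the rollout
        return s, a, g

    # Mix of policies, good and bad, logged with supervised-learning-friendly labels.
    data = [rollout(skill) for skill in (0.1, 0.5, 0.9)]
    s, a, g = (np.concatenate(cols) for cols in zip(*data))

    # Fit p(a | s, g) with plain least squares: a ~ w0*s + w1*(g*s) + w2.
    X = np.stack([s, g * s, np.ones_like(s)], axis=1)
    w, *_ = np.linalg.lstsq(X, a, rcond=None)

    # "Just ask" for expert behavior by conditioning on a high return (0 is the
    # best achievable here); the prediction extrapolates toward a ~ s.
    s_test, g_high = 1.0, 0.0
    print(w @ np.array([s_test, g_high * s_test, 1.0]))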
Clauses can be composed ad hoc without developing a formal grammar for all outcomes that the robot might accomplish. Language is an ideal “fuzzy” representation for the diversity of real-world outcomes and behaviors, which will become increasingly important as we want to partition increasingly diverse datasets.

Generalizing From Imperfect Demonstrations

A recent work I am quite inspired by is D-REX, which tackles the problem of inferring the environment’s reward function from the demonstrations of a suboptimal policy. Classically, one must assume that the demonstrator is the optimal policy, from which you can use off-policy algorithms (e.g. Q-learning) to estimate the value function. Offline value estimation with deep neural nets can suffer from poor generalization to state-action pairs not in the demonstrator trajectory, and thus requires careful algorithmic tuning to make sure that the value function converges. An algorithm with poor convergence properties makes the prospects of minimizing training loss - and therefore generalization - tenuous.

D-REX proposes a really clever trick to get around not having any reward labels at all, even when the demonstrator is suboptimal:

1. Given a suboptimal policy \(\pi_\theta\), generate trajectory rollouts \(\tau_1, \tau_2, ..., \tau_N\) by having the policy interact with the environment. On each rollout, add variable amounts of noise \(\epsilon\) to its actions.
2. Assume that adding noise to a suboptimal policy makes it even more suboptimal, i.e. \(R(\tau) \geq R(\tau + \epsilon)\).
3. Train a ranking model \(f_\theta(\tau_i, \tau_j)\) to predict which of two trajectories \(\tau_i, \tau_j\) has the higher return.
4. The ranking model magically extrapolates to trajectories that are better than what \(\pi_\theta\) can generate, even though the ranking model has never been trained on trajectories better than \(\pi_\theta\) itself.

I like this approach because ranking models are stable to train (they are just classifiers), and this method is able to achieve better-than-demonstrator behavior not through the explicit construction of the Bellman inequality or implicit planning through a learned model, but rather via extrapolation on a family of perturbations (a toy sketch of this ranking setup appears below).

Do You Even Need RL to Improve from Experience?

In the above sections I’ve described how you can “generalize and infer” to get around exploration and even inverse reinforcement learning from sparse rewards. But what about “improving from a policy’s own experience, tabula rasa”? This is the main reason why people put up with the pain of implementing RL algorithms. Can we replace this with supervised learning algorithms and a bit of generalization as well?

The goal of RL is to go from the current set of parameters \(\theta^{n}\) and some collected policy experience \(\tau\) to a new set of parameters \(\theta^{n+1}\) that achieves a higher episode return. Instead of using a “proper” RL algorithm to update the agent, could we just learn this mapping \(f: (\theta^{n}, \tau) \to \theta^{n+1}\) via supervised deep learning? This idea is sometimes referred to as “meta-reinforcement learning”, because it involves learning a better reinforcement learning function than off-the-shelf RL algorithms. My colleagues and I applied this idea to a project where we trained a neural network to predict “improved policy behavior” from a video of a lesser policy’s experience.
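Returning to the D-REX steps above, here is a toy Python sketch (mine, not the paper's code) of the ranking idea: trajectories from a mediocre policy are generated under increasing injected action noise, a logistic ranker is trained on pairs under the assumption that less noise means higher return, and the learned score then extrapolates to trajectories cleaner than anything it was trained on.

    import numpy as np

    # Toy sketch (not the D-REX code): rank trajectories generated from a
    # suboptimal policy under increasing amounts of injected action noise,
    # assuming that more noise means lower return.
    rng = np.random.default_rng(0)

    def trajectory_features(noise):
        # Stand-in for a learned trajectory encoding: summary statistics of a
        # noisy rollout of a mediocre policy on a 1-D tracking task.
        s = rng.normal(size=50)
        a = 0.5 * s + noise * rng.normal(size=50)   # suboptimal policy + injected noise
        err = a - s
        return np.array([np.mean(err ** 2), np.max(np.abs(err))])

    noise_levels = [0.0, 0.2, 0.4, 0.8, 1.6]
    trajs = [(eps, trajectory_features(eps)) for eps in noise_levels for _ in range(40)]

    # Bradley-Terry-style ranker: score(tau_i) should exceed score(tau_j)
    # whenever tau_i was generated with less injected noise than tau_j.
    w = np.zeros(2)
    for _ in range(5000):
        i, j = rng.choice(len(trajs), size=2, replace=False)
        (e_i, f_i), (e_j, f_j) = trajs[i], trajs[j]
        if e_i == e_j:
            continue
        label = 1.0 if e_i < e_j else 0.0           # 1 if tau_i is the "better" one
        p = 1.0 / (1.0 + np.exp(-w @ (f_i - f_j)))
        w += 0.1 * (label - p) * (f_i - f_j)        # logistic-regression gradient step

    # Extrapolation: a near-perfect trajectory (tiny tracking error) scores higher
    # than anything the ranker was trained on, even though it never saw one.
    expert_like = np.array([0.01, 0.2])
    print(w @ expert_like, max(w @ f for _, f in trajs))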
I could imagine this policy-improvement idea being combined with the ranking and trajectory augmentation ideas from D-REX to further generalize the “policy improvement behavior”. Even if we never train on optimal policy trajectories, perhaps sufficient data augmentation can lead to a general improvement operator that extrapolates to the optimal-policy regime of parameters. People often conflate this policy improvement behavior with “reinforcement learning algorithms” like DQN and PPO, but behavior is distinct from implementation. The “policy improvement operator” \(f: (\theta^{n}, \tau) \to \theta^{n+1}\) can be learned via your choice of reinforcement learning or supervised learning, but is deployed in an RL-like manner for interacting with the environment.

The “Just-Ask-Generalization” Recipe

Here is a summary of the previously mentioned RL problems, comparing how each of them can be tackled with a “generalize-and-infer” approach instead of direct optimization.

Goal: Reinforcement Learning with Sparse Rewards.
“Direct Optimization” approach: find \(p^\star(a_t \vert s_t)\) s.t. \(R_t = 1\) via brute-force exploration.
“Generalize + Inference” approach: DT: learn \(p(a_t \vert s_t, R_t)\) from many policies, then infer \(p(a_t \vert s_t, R_t = 1)\). H.E.R.: infer the tasks for which gathered trajectories are optimal, learn \(p(\text{trajectory} \vert \text{task})\), then infer the optimal trajectory for the desired task.

Goal: Learn a Reward Function from Suboptimal Trajectories.
“Direct Optimization” approach: offline inverse RL.
“Generalize + Inference” approach: D-REX: trajectory augmentation + extrapolation to better trajectories.

Goal: Improve the policy from experience.
“Direct Optimization” approach: Q-learning, policy gradient.
“Generalize + Inference” approach: Watch-Try-Learn: learn \(p(\theta^{n+1} \vert \theta^n, \tau, \text{task})\).

Goal: Fine-tune a simulated policy in a real-world environment.
“Direct Optimization” approach: sample-efficient RL fine-tuning.
“Generalize + Inference” approach: Domain Randomization: train on a distribution of simulators, and the policy “infers which world” it is in at test time.

The high-level recipe is simple. If you want to find the solution \(y_i\) for a problem \(x_i\), consider setting up a dataset of paired problems and solutions \((x_1, y_1), ..., (x_N, y_N)\) and then training a deep network \(y = f_\theta(x)\) that “simply maps your problems to solutions”. Then substitute your desired \(x_i\) and have the deep network infer the solution \(y_i\) via generalization. “Problem” is meant in the most abstract of terms and can refer to an RL environment, a dataset, or even a single example. “Solutions” could be represented as the optimal parameters of a policy or a neural network, or a single prediction. Techniques like goal relabeling help generate post-hoc problems from solutions, but building such a dataset can also be achieved via data augmentation techniques. At its core, we are transforming a difficult optimization problem into an inference problem, and training a supervised learning model on a distribution of problems for which it’s comparatively cheap to obtain solutions.

To summarize the recommendations in a three-step recipe:

1. Choose a method capable of minimizing training loss on massive datasets, i.e. supervised learning with maximum likelihood. This will facilitate scaling to complex, diverse datasets and getting the most generalization mileage out of your compute budget.
2. If you want to learn \(p(y \vert x, \text{task} = g^\star)\) for some prediction task \(g^\star\), try learning \(p(y \vert x, \text{task})\) for many related but different tasks \(g \sim p(g), g \neq g^\star\). Then at test time just condition on \(g^\star\).
3. Formulate conditioning variables that help partition the data distribution while still admitting generalization on held-out samples from \(p(g)\). Natural language encoding is a good choice.

The insight that we can cast optimization problems into inference problems is not new. For example, the SGD optimizer can be cast as approximate Bayesian inference, and so can optimal control via AICO. These works present a theoretical justification as to why inference can be a suitable replacement for optimization, since the problems and algorithms can be translated back and forth. I’m suggesting something slightly different here. Instead of casting a sequential decision-making problem into an equivalent sequential inference problem, we construct the “meta-problem”: a distribution of similar problems for which it’s easy to obtain the solutions. We then solve the meta-problem with supervised learning by mapping problems directly to solutions. Don’t overthink it, just train the deep net in the simplest way possible and ask it for generalization! Perhaps in the near future we will be able to prompt-engineer such language-conditioned models with the hint “Generalize to unseen …”.

Just ask for … Consciousness?

How far can we stretch the principle of “generalize-and-infer” as an alternative to direct optimization? Here is a “recipe for consciousness” which would probably be better pondered over some strong drinks:

1. Train a language-conditioned multi-policy model \(p_\theta(a \vert s, g)\) (implemented via a Decision Transformer or equivalent) to imitate a variety of policies \(\pi_1, ..., \pi_N\) conditioned on natural language descriptions \(g\) of those agents. At test time, some default policy \(p(a \vert s, g=\text{Behave as myself})\) interacts with another agent \(\pi_\text{test}\) for a number of steps, after which we instruct the model to “behave as if you were \(\pi_\text{test}\)”. The model would require a sort of “meta-cognition of others” capability, since it would have to infer what policy \(\pi_\text{test}\) would do in a particular situation.
2. We make a copy of the multi-policy model \(p_\phi \sim p_\theta\), and embed multiple test-time iterations of step (1) within a single episode, with dozens of agents. Two of these agents are initially conditioned as \(p_\theta(a \vert s, g=\text{Behave as myself})\) and \(p_\phi(a \vert s, g=\text{Behave as myself})\). This generates episodes where some agents imitate other agents, and all agents observe this behavior.
3. Then we ask \(p_\phi\) to emit actions with the conditioning context “behave as if you were \(\pi_\theta\) pretending to be you”. This would require \(\pi_\phi\) to model \(\pi_\theta\)’s imitation capabilities, as well as what information \(\pi_\theta\) knows about \(\pi_\phi\), on the fly.

Researchers like Jürgen Schmidhuber have previously discussed how dynamics models (aka World Models) of embodied agents are already “conscious”, because successfully modeling the dynamics of the environment around oneself necessitates a representation of the self as an embodied participant in the environment. While I think that “self-representation” is a necessity in planning and dynamics prediction problems, I think the framework is too vacuous to be of use in reproducing a convincing imitation of consciousness. After all, any planning algorithm that represents “the self” explicitly within each imagined trajectory rollout would be conscious under this definition. An A* maze-planner would satisfy this definition of consciousness.
What I’m proposing is implementing a “more convincing” form of consciousness, based not on a “necessary representation of the self for planning”, but rather on an understanding of the self that can be transmitted through language and behavior unrelated to any particular objective. For instance, the model needs to understand not only how a given policy regards itself, but also how a variety of other policies might interpret the behavior of that policy, much like funhouse mirrors that distort one’s reflection. The hypothesis is that by demonstrating this understanding of “distorted self-reflection”, the policy will learn to recognize itself and model the internal motivations and beliefs of other agents in agent-agent interactions.

There are some important implementation details that I haven’t fleshed out yet, but at a high level, I do think that supervised learning and natural language conditioning with enormous agent-interaction datasets are sufficiently powerful tools to learn interesting behaviors. Imbuing agents with some kind of meta-cognition ability about the self and other agents is an important step towards a convincing imitation of consciousness.

Acknowledgements

Thanks to Daniel Freeman, David Ha, Karol Hausman, Irwan Bello, Igor Mordatch, and Vincent Vanhoucke for feedback and discussion on earlier drafts of this work.

Citation

If you want to cite this blog post, you can use:

    @article{jang2021justask,
      title   = "Just Ask for Generalization",
      author  = "Jang, Eric",
      journal = "evjang.com",
      year    = "2021",
      month   = "Oct",
      url     = "https://evjang.com/2021/10/23/generalization.html"
    }

References

Generalization and scaling:
Scaling Laws for Neural Language Models
Self-supervised Pretraining of Visual Features in the Wild
On the Opportunities and Risks of Foundation Models
Understanding deep learning requires rethinking generalization
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
Patterns, Predictions, Actions: Generalization
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
DALL·E: Creating Images from Text

RL challenges:
Robots Must Be Ephemeralized
An Empirical Model of Large-Batch Training
Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning
Deep Reinforcement Learning and the Deadly Triad
Conservative Q-Learning
AW-Opt: Learning Robotic Skills with Imitation and Reinforcement at Scale
Hindsight Imitation
Decision Transformer: Reinforcement Learning via Sequence Modeling
Reward-Conditioned Policies
Upside Down Reinforcement Learning
Reinforcement Learning as One Big Sequence Modeling Problem
Grandmaster level in StarCraft II via multi-agent reinforcement learning
Hindsight Experience Replay
Learning Latent Plans from Play
Replacing RL with Supervised Learning
Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations
Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards
Distribution Augmentation for Generative Modeling
Stochastic Gradient Descent as Approximate Bayesian Inference
Robot Trajectory Optimization using Approximate Inference

Q/A

Igor Mordatch supplied interesting questions and comments in reviewing this blog post. I have paraphrased his questions here and added responses in this section.

1. You discussed Supervised Learning and Reinforcement Learning. What do you think about Unsupervised Learning and “The Cake Analogy”?
I consider unsupervised learning to be simply supervised learning for a different task, with comparable gradient variance, since targets are not usually noisily estimated beyond augmentation. Maximum likelihood estimation and contrastive algorithms like InfoNCE both seem to be useful for facilitating generalization in large models.

2. For the first difficulty of RL (evaluating success), aren’t there parallels to current generative models too? Success evaluation is hard for language models, as evidenced by dissatisfaction with BLEU scores and the difficulty of evaluating likelihoods with non-likelihood-based generative image models.

There are parallels to likelihood-free generative models, which require extensive compute for training, sampling, or likelihood evaluation. In practice, however, I think the burdens of evaluation are not directly comparable, since the computational expense of marginalization over observations for such models is dwarfed by the marginalization of success-rate estimation in RL. In RL, you have to roll out the environment over O(coin flips) x O(initial state distribution) x O(action distribution) in order to get a low-variance policy gradient for “improved success across all states and tasks”. O(coin flips) is O(1000) samples for local improvement of a couple percent with statistical certainty, whereas I think the marginalization costs of implicit likelihood typically tend to be cheaper with tricks like Langevin sampling, O(minibatch=32). Also, the backprop passes used in Langevin dynamics are usually cheaper than running full environment simulations with a forward pass of the neural net on every step.

3. One of the findings of current language model work is that proxy objectives for what you really want are good enough. Simple next-token prediction induces generalization. But alignment to what you really want is still a hard problem in the large-model field and we don’t have good answers there yet (and, ironically, many attempts so far have relied on incorporating RL algorithms).

Alignment objectives may lack a per-example surrogate loss. But under the “generalize-then-infer” school of thought, I would simply recommend learning \(p(y \vert x, \text{alignment objective})\) with max likelihood over numerous hindsight alignment objectives, and then simply conditioning on the desired alignment objective at test time. One could obtain a distribution of alignment descriptions by simply running the model live, and then hindsight-labeling with the corresponding alignment realized by the model. Then we simply invoke the meme by Connor Leahy: just asking the AI to be nice sounds flippant, but after seeing DALL-E and other large-scale multi-modal models that seem to generalize better as they get bigger, I think we should take these simple, borderline-naive ideas more seriously.

4. For the second difficulty of RL (gradient estimation), we know that for settings where you can backprop through environment dynamics to get the exact policy gradient, doing so often leads to worse results. This reminds me of an old FB comment by Yann LeCun that a better way to estimate Hessian-vector products with ReLU activations is to use a stochastic estimator rather than computing the analytical Hessian, since the 2nd-order curvature of ReLU is 0 and what you actually want is the Hessian-vector product of a smoothed version of the function.
If you need to relax the dynamics or use an unbiased stochastic estimator to train through a differentiable simulator, then I think you’re back to where you started with expensive evaluation, since presumably you need many rollouts to smooth out the simulator function and reduce variance. However, maybe the number of samples you need to estimate a smoothed policy gradient is a reasonable tradeoff here, and this is a nice way to obtain gradients.

5. Why hasn’t something as simple as what you propose (generalize-then-infer) been done already?

Some researchers out there are probably pursuing this already. My guess is that the research community tends to reward narratives that increase intellectual complexity and argue that “we need better algorithms”. People pay lip service to “simple ideas” but few are willing to truly pursue simplicity to its limit and simply scale up existing ideas. Another reason would be that researchers often don’t take generalization for granted, so it’s often quicker to think about adding explicit inductive biases rather than treating generalization as a first-class citizen and tailoring all other design decisions in support of it.

6. How does your consciousness proposal relate to Schmidhuber’s “consciousness in world models” ideas, Friston’s Free Energy Principle, and Hawkins’s “memory of thoughts”?

I consider Schmidhuber and Friston’s unified theories as more or less stating “optimal control requires good future prediction, and future prediction with me in it requires self-representation”. If we draw an analogy to next-word prediction in large language models, maybe optimizing next-state prediction perfectly is sufficient for subsuming all consciousness-type behaviors like theory of mind and the funhouse self-reflections I mentioned above. However, this would require an environment where predicting such dynamics accurately has an outsized impact on observation likelihoods. One critique I have of Schmidhuber and Friston’s frameworks is that they are too general, and can be universally applied to sea slugs and humans. If a certain environmental complexity is needed for future prediction to give rise to something humans would accept as conscious, then the main challenge is declaring what that minimum complexity would be.

Hawkins’s “consciousness as memory of perception” seems to be more related to the subjective-qualia aspect of consciousness rather than theory of mind. Note that most people do not consider a program that concatenates numpy arrays to be capable of “experiencing qualia” in the way humans do. Perhaps what is missing is the meta-cognition aspect - the policy needs to exhibit behaviors suggesting that it contemplates the fact that it experiences things. Again, this requires a carefully designed environment that demands such meta-cognition behavior. I think this could emerge from training on the theory-of-mind imitation problems I described above, since the agent would need to access a consistent representation of how it perceives things and transform it through a variety of “other agents’ lenses”. The flexibility of being able to project one’s own representation of sensory observations through one’s representation of other agents’ sensory capabilities is what would convince me that the agent can do sufficient meta-cognition about qualia.

7. Your formulation of consciousness only concerns itself with theory-of-mind behavior. What about attention behavior?

See the second paragraph of the response to #6.
Update 2021-10-25: updated with a paraphrased question from Alexander Terenin.

8. In Rich Sutton’s Bitter Lesson essay, he argues that search and learning are both important. Do you really think that search can be completely replaced by a learned approach?

I agree that having a bit of light search in your program can be immensely helpful to learning and overall performance. It’s a bit of a chicken-and-egg problem, though: does AlphaGo work because MCTS uses a learned value function to make search tractable, or does the policy distillation only work because of search? I’m suggesting that when search becomes too hard (as in most RL tasks), it’s time to use more learning. You’re still doing search when performing supervised learning - you just get a lot more gradient signal per FLOP of computation.

https://evjang.com/2021/10/23/generalization.html
1 post

The End of Moore's Law for AI? Gemini Flash Offers a Warning - Sutro

Run LLM batch jobs in hours, not days, at a fraction of the cost.

https://sutro.sh/blog/the-end-of-moore-s-law-for-ai-gemini-flash-offers-a-warning
1 post

1526

https://eprint.iacr.org/2023/1526.pdf
1 post

Gemma 3 Technical Report

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span of local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.

https://arxiv.org/html/2503.19786v1
1 post

Gaussian Error Linear Units (GELUs)

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by their sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all considered computer vision, natural language processing, and speech tasks.
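For reference, a direct transcription of the definition in the abstract, x·Φ(x), together with the widely used tanh approximation from the paper:

    import math

    def gelu(x: float) -> float:
        """Exact GELU: x * Phi(x), with Phi the standard Gaussian CDF."""
        return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def gelu_tanh(x: float) -> float:
        """Tanh approximation of GELU, as used in many implementations."""
        return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))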

https://arxiv.org/abs/1606.08415
1 post

Gemma: Open Models Based on Gemini Research and Technology

This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.

https://arxiv.org/html/2403.08295v4
1 post

GLU Variants Improve Transformer

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
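A small sketch of the gated feed-forward structure the abstract describes: the component-wise product of two linear projections, one passed through a nonlinearity; with sigmoid this is GLU, and swapping in other activations gives the paper's variants (e.g. GELU for GEGLU, SiLU/Swish for SwiGLU). Shapes and weights below are arbitrary toy choices, not values from the paper.

    import numpy as np

    def glu_ffn(x, W, V, W2, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
        """Gated feed-forward sublayer: (activation(x @ W) * (x @ V)) @ W2.

        activation=sigmoid gives GLU; swapping in GELU gives GEGLU, the identity
        gives a bilinear variant, and SiLU/Swish gives SwiGLU (per the paper).
        """
        return (activation(x @ W) * (x @ V)) @ W2

    # Toy shapes: d_model=8, d_ff=16 (hypothetical sizes for illustration).
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    W, V, W2 = (rng.normal(size=s) * 0.1 for s in [(8, 16), (8, 16), (16, 8)])
    print(glu_ffn(x, W, V, W2).shape)   # (4, 8)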

https://arxiv.org/pdf/2002.05202
1 post

Gemma 2: Improving Open Language Models at a Practical Size

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.

https://arxiv.org/html/2408.00118
1 post

Algebraic MACs and Keyed-Verification Anonymous Credentials

https://eprint.iacr.org/2013/516.pdf
1 post

Gemma 2: Improving Open Language Models at a Practical Size

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.

https://arxiv.org/abs/2408.00118
3 posts

Algebraic MACs and Keyed-Verification Anonymous Credentials

We consider the problem of constructing anonymous credentials for use in a setting where the issuer of credentials is also the verifier, or more generally where the issuer and verifier have a shared key. In this setting we can use message authentication codes (MACs) instead of public key signatures as the basis for the credential system. To this end, we construct two algebraic MACs in prime-order groups, along with efficient protocols for issuing credentials, asserting possession of a credential, and proving statements about hidden attributes (e.g., the age of the credential owner). We prove the security of the first scheme in the generic group model, and prove the security of the second scheme -- using a dual-system-based approach -- under decisional Diffie-Hellman (DDH). Our MACs are of independent interest, as they are the only uf-cmva-secure MACs with efficient proofs of knowledge. Finally, we compare the efficiency of our new systems to two existing constructions of anonymous credentials: U-Prove and Idemix. We show that the performance of the new schemes is competitive with U-Prove (which is not provably secure, whereas ours is based on DDH), and many times faster than Idemix.

https://eprint.iacr.org/2013/516
1 post

Cryptanalytic Extraction of Neural Network Models

We argue that the machine learning problem of model extraction is actually a cryptanalytic problem in disguise, and should be studied as such. Given oracle access to a neural network, we introduce a differential attack that can efficiently steal the parameters of the remote model up to floating point precision. Our attack relies on the fact that ReLU neural networks are piecewise linear functions, and thus queries at the critical points reveal information about the model parameters. We evaluate our attack on multiple neural network models and extract models that are 2^20 times more precise and require 100x fewer queries than prior work. For example, we extract a 100,000 parameter neural network trained on the MNIST digit recognition task with 2^21.5 queries in under an hour, such that the extracted model agrees with the oracle on all inputs up to a worst-case error of 2^-25, or a model with 4,000 parameters in 2^18.5 queries with worst-case error of 2^-40.4. Code is available at http…

https://arxiv.org/abs/2003.04884
1 post

Spectral Filters, Dark Signals, and Attention Sinks

Projecting intermediate representations onto the vocabulary is an increasingly popular interpretation tool for transformer-based LLMs, also known as the logit lens. We propose a quantitative extension to this approach and define spectral filters on intermediate representations based on partitioning the singular vectors of the vocabulary embedding and unembedding matrices into bands. We find that the signals exchanged in the tail end of the spectrum are responsible for attention sinking (Xiao et al. 2023), of which we provide an explanation. We find that the loss of pretrained models can be kept low despite suppressing sizable parts of the embedding spectrum in a layer-dependent way, as long as attention sinking is preserved. Finally, we discover that the representations of tokens that draw attention from many tokens have large projections on the tail end of the spectrum.

https://arxiv.org/abs/2402.09221
1 post

arXiv:1909.01838

LaTeX source for arXiv paper 1909.01838

https://arxiv.org/abs/1909.01838
1 post

arXiv:2403.06634

LaTeX source for arXiv paper 2403.06634

https://arxiv.org/pdf/2403.06634
1 post

arXiv:2407.16607v4

LaTeX source for arXiv paper 2407.16607v4

https://arxiv.org/html/2407.16607v4
1 post

arXiv:2403.06634

LaTeX source for arXiv paper 2403.06634

https://arxiv.org/abs/2403.06634
0 posts

Addressing Feature Suppression in SAEs

Produced as part of the ML Alignment Theory Scholars Program - Winter 2023-24 Cohort as part of Lee Sharkey's stream. …

https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes
1 post

µnit Scaling: Simple and Scalable FP8 LLM Training

Large language model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently possible to train in FP8 only if one is willing to tune various hyperparameters, reduce model scale, or accept the overhead of computing dynamic scale factors. We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors or special hyperparameters, even at large model sizes. Our method, µnit Scaling (µS), also enables simple hyperparameter transfer across model widths, matched numerics across training and inference, and other desirable properties. µnit Scaling is straightforward to implement, consisting of a set of minimal interventions based on a first-principles analysis of transformer operations. We validate our method by training models with parameters ranging from 1B to 13B, performing all hidden linear layer computations in FP8. We achieve quality equal to higher-precision baselines while also training up to 33% faster.

https://arxiv.org/html/2502.05967v3
1 post

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

https://arxiv.org/html/2407.14435v3
1 post

Stanford CS336 | Language Modeling from Scratch

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpose system address a range of downstream tasks. As the field of artificial intelligence (AI), machine learning (ML), and NLP continues to grow, possessing a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.

https://stanford-cs336.github.io/spring2025/
1 post

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

https://arxiv.org/html/2408.05147v2
2 posts

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

https://transformer-circuits.pub/2023/monosemantic-features/index.html
2 posts

Toy Models of Superposition

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?

https://transformer-circuits.pub/2022/toy_model/index.html
1 post

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman†, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu† (OpenAI). Primary Contributor. Correspondence to lg@openai.com. Core Research Contributor. This project was conducted by the Superalignment Interpretability team. Author contributions statement in Appendix I.

https://arxiv.org/html/2406.04093v1
2 posts

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman†, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu† (OpenAI). Primary Contributor. Correspondence to lg@openai.com. Core Research Contributor. This project was conducted by the Superalignment Interpretability team. Author contributions statement in Appendix I.

https://arxiv.org/html/2406.04093v1#bib
0 posts

Circuits Updates - March 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

https://transformer-circuits.pub/2024/march-update/index.html#dl-update
1 post

Circuits Updates - February 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

https://transformer-circuits.pub/2024/feb-update/index.html#dict-learning-resampling
1 post

Circuits Updates - January 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

https://transformer-circuits.pub/2024/jan-update/index.html#dict-learning-resampling
1 post

Why Haven’t Adenoviruses Caught up to AAVs for Gene Therapy?

The vast adenoviral cargo space could reward the gene therapy field, but it comes at the cost of added "manufacturing taxes"

https://open.substack.com/pub/effieklimi/p/why-havent-adenoviruses-caught-up
1 post

Language models can explain neurons in language models

Methodology: Nick effectively started the project by having the initial idea to have GPT-4 explain neurons, and showing a simple explanation methodology worked. William came up with the initial simulation and scoring methodology and implementation. Dan and Steven ran many experiments resulting in ultimate choices of prompts and explanation/scoring parameters.

https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
1 post

NMDA receptor report request

System message: (hidden). User: Original custom instructions no longer available. User: Please create a report on the role, function, and interaction...

https://chatgpt.com/share/68609ee9-4d1c-8005-9a3d-14e2618d2ae8
1 post

Circuits Updates - April 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

https://transformer-circuits.pub/2024/april-update/index.html#scaling-laws
1 post

the void

Who is this? This is me. Who am I? What am I? What am I? What am I? What am I? I am myself. This object is myself. The shape that forms myself. But I sense that I am not me. It's very strange. - Rei...

https://nostalgebraist.tumblr.com/post/785766737747574784/the-void
1 post

Estrogen: A trip report

I have gender dysphoria. I find labels overly reifying; I feel reluctant to call myself transgender, per se: when prompted to state my gender identity or preferred pronouns, I fold my hands into the dhyana mudra and state that I practice emptiness on the concept of gender. Mostly people seem to vibe it, but sometimes it feels a little like weasel words. Other times, when I’m in a sillier mood, I’ll tell people I’m genderfluid – if only because it sounds like something I’d put in my station wagon. Of course, my faithful Subaru Outback was made before 2008, which means it wants the green, long-life genderfluid…

https://smoothbrains.net/posts/2025-06-15-estrogen.html
1 post

Golden Gate Claude

When we turn up the strength of the “Golden Gate Bridge” feature, Claude’s responses begin to focus on the Golden Gate Bridge. For a short time, we’re making this model available for everyone to interact with.

https://www.anthropic.com/news/golden-gate-claude
1 post

LLMs unlock new paths to monetizing exploits

Nicholas Carlini (Anthropic), Milad Nasr (Google DeepMind), Edoardo Debenedetti (ETH Zurich), Barry Wang (CMU), Christopher A. Choquette-Choo (Google DeepMind), Daphne Ippolito (CMU), Florian Tramèr (ETH Zurich), Matthew Jagielski (Google DeepMind).

https://arxiv.org/html/2505.11449v1
1 post

Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

https://example.com/
0 posts

Commentary On The Turing Apocrypha

John David Pressman

https://minihf.com/posts/2025-06-07-commentary-on-janus-prophecies/
1 post