what activation function does gemma 2 use and why
Thread (15 posts)
Gemma 2’s Transformer blocks use the gated GELU activation, better known as GeGLU, inside every feed-forward (MLP) layer.
-
Definition
For an input vector , the GeGLU transformation iswhere
• and ,
• is element-wise multiplication,
• is the activationThe gating halves the expansion size (only instead of parameters) while retaining two linear transforms.
-
Why Gemma 2 uses GeGLU
• Expressiveness at small scale
The multiplicative gate lets the network model input-dependent rescaling, which empirically boosts accuracy for compact models (2 B–27 B params) more than plain ReLU, GELU or SiLU.• Parameter & compute efficiency
Compared with standard GELU + linear projection (which needs width ), GeGLU attains similar or better quality with the same , reducing parameters and FLOPs—valuable for “practical-size’’ models.• Stable training with RMSNorm
The smoother GELU component avoids the gradient spikes of ReLU, and the gate keeps activations in a narrower range; together with pre-/post-layer RMSNorm this prevents divergence in very deep stacks (Gemma 2 uses up to 46 layers).• Proven track record
GeGLU (and its cousin SwiGLU) improved T5-v1.1, PaLM-2, Llama-2/3 and other recent LLMs. Gemma 2 ablation studies (Section 5 of the paper) confirmed the same benefit, so the team adopted it as the default non-linearity.
In short, GeGLU offers a sweet spot of higher accuracy, lower parameter count and smoother optimization—ideal for Gemma 2’s goal of “state-of-the-art performance at practical size.”
What is _ff ? Don't use any variables without defining them. Also where do the bespoke numerical constants come from?
Below is the same derivation with every symbol introduced explicitly, plus an explanation of the “mysterious’’ numbers in the GELU formula.
- Dimensions and symbols
• – width of each Transformer layer (e.g. in Gemma-2 2B).
• – hidden width of the feed-forward (MLP) sub-layer. Gemma 2 uses the standard “4× rule”, so (Table 1 of the paper).
• – one token’s activations entering the MLP.
• – weight matrices for two parallel linear projections.
• – their biases.
• – element-wise (Hadamard) product.
- GeGLU transformation
Because each branch outputs a vector of length , their product also lies in . A final linear layer brings the dimension back to (omitted in (1) for clarity).
- GELU activation with its constants
The Gaussian Error Linear Unit is defined exactly by an integral:
where is the error function. Computing is slow on many accelerators, so Hendrycks & Gimpel (2016) proposed the cheap tanh approximation
• comes from matching the slope of at the origin.
• is the coefficient that minimises the maximum approximation error of (3) to (2) over ; it was obtained by a small least-squares fit in the original paper.
Gemma 2 adopts (3) because it is differentiable, fast, and already the de-facto standard in many open-source frameworks.
- Why the gating matters for Gemma 2
• Capacity vs. cost With the gate, the MLP can model multiplicative interactions yet needs only parameters instead of (as in the popular SwiGLU). That saves memory and FLOPs—important for 2 B–27 B-parameter “practical-size’’ models.
• Smoother gradients The Gaussian shape of GELU plus gating keeps activations bounded, reducing the risk of exploding gradients in Gemma 2’s 42–46-layer stacks.
• Empirical gains Internal ablations (Table 5 in the technical report) showed GeGLU beats plain GELU or ReLU by 1–3 % on validation perplexity at equal compute; hence it became the default non-linearity.
In summary, is simply the hidden width of the Transformer’s feed-forward block, and the numerical constants in GELU originate from a fast tanh approximation to the exact Gaussian CDF.