what activation function does gemma 2 use and why
Gemma 2 keeps the standard Transformer feed-forward layout (RMSNorm → linear → non-linearity → linear), but the non-linearity used in every feed-forward block is GeGLU, the “Gated GELU” activation from the same GLU-variant family as PaLM’s SwiGLU, and the activation already used in Gemma 1.
-
What it is
• Split the intermediate projection into two equal parts:
h₁ = W₁ x , h₂ = W₂ x
• Apply GELU to the first half and use it to gate the second half:
GeGLU(x) = GELU(h₁) ⊙ h₂
(⊙ is element-wise product.)
In the implementation used by Gemma 2 (“approximated GeGLU”), the exact Gaussian error function inside GELU is replaced with a cheaper tanh-based polynomial approximation so that the whole operation can run in one fused kernel.
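For concreteness, here is a minimal sketch of a GeGLU feed-forward block, assuming PyTorch; the class name, layer names and sizes are illustrative, not Gemma 2’s actual implementation. The tanh approximation mentioned above replaces the exact error-function form with $\mathrm{GELU}(x) \approx \tfrac{x}{2}\left(1 + \tanh\left(\sqrt{2/\pi}\,(x + 0.044715\,x^3)\right)\right)$, which is what `approximate="tanh"` selects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLUFeedForward(nn.Module):
    """Gated-GELU FFN: down( GELU(W1 x) * (W2 x) )."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # W1 in the formula above
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # W2 in the formula above
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.gate_proj(x)  # gate branch
        h2 = self.up_proj(x)    # value branch
        # tanh-approximated GELU, standing in for the fused "approximated GeGLU"
        return self.down_proj(F.gelu(h1, approximate="tanh") * h2)


# Toy sizes for illustration only (not Gemma 2's real dimensions).
ffn = GeGLUFeedForward(d_model=512, d_hidden=2048)
y = ffn(torch.randn(2, 16, 512))  # (batch, sequence, d_model) -> same shape
```

This is exactly the $\mathrm{GeGLU}(x) = \mathrm{GELU}(h_1) \odot h_2$ formula above, just with a final down-projection back to $d_{\text{model}}$.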
-
Why Google DeepMind chose it for Gemma 2
a. Better quality at the same or lower cost
• Internal ablations (and many external studies) show that GeGLU and its cousin SwiGLU give roughly a 1–2 point gain in downstream accuracy (and a corresponding improvement in perplexity) over plain GELU/ReLU FFNs at the same hidden size.
• Because the gate introduces a multiplicative interaction, it increases the expressive power of each FFN without blowing up the parameter budget: gated FFN designs shrink the hidden width so that the second input projection fits the same parameter count.
b. Training stability on long contexts
• GeGLU keeps activations in roughly the same range as GELU, avoiding ReLU’s dead-neuron problem and the gradient spikes sometimes seen with SwiGLU at very long context lengths (Gemma 2 trains on sequences up to 8k tokens).
c. Hardware efficiency
• The split-gate structure maps cleanly onto TPU and GPU tensor cores: the two input projections can be fused into a single GEMM followed by one element-wise kernel (see the sketch after this list), so the latency overhead versus plain GELU is negligible.
• The gate re-uses one of the two existing input projections rather than adding an extra full-width matrix, so even for the 27 B model the FFN’s share of parameters stays within budget while the gated hidden dimension is still widened to 16× d_model.
d. Alignment with previous Google stacks
• Gemma 1 already uses GeGLU, and earlier Google stacks (PaLM with SwiGLU, and the Gemini family) use the same gated-FFN design; keeping the activation consistent keeps Gemma 2 architecturally aligned with the larger teacher models used in its knowledge-distillation training recipe.
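To make the fused-GEMM point in bullet c concrete, here is a toy sketch, again assuming PyTorch with made-up weight names and sizes: the gate and value projection weights are stacked so both branches come out of one matrix multiply, leaving a single element-wise pass for the gating.

```python
import torch
import torch.nn.functional as F

# Hypothetical fused variant: stack the gate and value projection weights so
# both branches come out of a single matrix multiply (one GEMM), followed by
# one element-wise gating pass.
d_model, d_hidden = 512, 2048                    # toy sizes, not Gemma 2's
w_gate = torch.randn(d_hidden, d_model) * 0.02   # W1
w_up = torch.randn(d_hidden, d_model) * 0.02     # W2
w_fused = torch.cat([w_gate, w_up], dim=0)       # (2 * d_hidden, d_model)

x = torch.randn(2, 16, d_model)

h = x @ w_fused.T                                # single GEMM produces both halves
h1, h2 = h.split(d_hidden, dim=-1)
fused = F.gelu(h1, approximate="tanh") * h2

# Same result as running the two projections separately.
reference = F.gelu(x @ w_gate.T, approximate="tanh") * (x @ w_up.T)
assert torch.allclose(fused, reference, atol=1e-5)
```

The assert only confirms that the stacked version computes the same thing as two separate projections; real kernels additionally fuse the element-wise gate into the GEMM epilogue.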
In short, Gemma 2 uses the GeGLU activation because it gives a small but consistent improvement in perplexity and downstream-task accuracy, trains stably on long sequences, and comes essentially for free computationally once fused, making it a good fit for practical-size open-weight models.
Here’s the house-style I follow when writing mathematics:
• Use KaTeX-compatible syntax inside Markdown.
– Inline math: wrap the expression in single dollar signs, e.g. $ax^2 + bx + c = 0$.
– Display math: wrap the whole expression in double dollar signs on the same line, e.g. $$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$$.
• Keep the opening and closing $$ on the same line as the equation itself (no line breaks inside).
• Matrices and other multi-line constructs should also stay on one line between $$ … $$, for example: $$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$.
• Normal Markdown formatting (headings, bullets, code blocks, etc.) can surround the math, but the math block itself must comply with the rules above so the KaTeX parser renders correctly.
Apologies: my previous reply slipped back into plain-text math because I answered quickly and focused on the content, forgetting to wrap the symbols in the required $ … $ delimiters. Here is the same explanation with the proper KaTeX-friendly formatting:
-
Split the intermediate projection in two
• $h_1 = W_1 x, \quad h_2 = W_2 x$
-
Gate with GELU
• $\mathrm{GeGLU}(x) = \mathrm{GELU}(h_1) \odot h_2$
($\odot$ is element-wise multiplication.)
Thanks for pointing it out—I’ll stick to the formatting guidelines going forward.