
Gemma 2 keeps the now-standard Transformer feed-forward layout (RMS-norm → linear → non-linearity → linear), but the non-linearity used in every feed-forward block is GeGLU, the “Gated GELU” activation introduced in Shazeer’s “GLU Variants Improve Transformer” (2020) and already used in Gemma 1.

  1. What it is
    • Compute two parallel projections of the input (equivalently, split one wide intermediate projection into two halves):
    h₁ = W₁x ,  h₂ = W₂x
    • Apply GELU to the first projection and use it to gate the second:
    GeGLU(x) = GELU(h₁) ⊙ h₂
    (⊙ is the element-wise product.)

    In the implementation used by Gemma 2 (“approximated GeGLU”), the exact erf-based GELU is replaced with the cheaper tanh approximation, so the whole operation can run in one fused kernel.
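
As a concrete illustration, here is a minimal PyTorch sketch of such a gated FFN block (module and parameter names are illustrative, not Gemma 2's actual implementation; `approximate="tanh"` selects the tanh form of GELU mentioned above):

```python
# Minimal GeGLU feed-forward sketch (illustrative names, not Gemma 2's real modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)       # gate projection (h1)
        self.w2 = nn.Linear(d_model, d_ff, bias=False)       # value projection (h2)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)   # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.w1(x)
        h2 = self.w2(x)
        # "Approximated GeGLU": tanh-approximated GELU gates the value path.
        return self.w_down(F.gelu(h1, approximate="tanh") * h2)

x = torch.randn(2, 16, 512)                        # (batch, seq, d_model)
ffn = GeGLUFeedForward(d_model=512, d_ff=2048)
print(ffn(x).shape)                                # torch.Size([2, 16, 512])
```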

  2. Why Google DeepMind chose it for Gemma 2

    a. Better quality at the same or lower cost
    • Internal ablations (and many external studies) show that GeGLU and its cousin SwiGLU give roughly 1–2-point gains in downstream accuracy (and corresponding perplexity improvements) over plain GELU/ReLU FFNs at the same hidden size.
    • Because the gate introduces a multiplicative interaction, it increases the expressive power of each FFN; the extra gate projection is usually paid for by narrowing the hidden dimension, so the overall parameter budget stays essentially unchanged.

    b. Training stability on long contexts
    • GeGLU keeps activations in roughly the same range as plain GELU, avoiding the dead-neuron problem of ReLU and the gradient spikes sometimes reported with SwiGLU at very long context lengths (Gemma 2 trains with an 8K-token context).

    c. Hardware efficiency
    • The split-gate structure maps cleanly to TPU and GPU tensor cores: the gate and value projections can be packed into a single GEMM followed by one element-wise kernel (see the sketch after this list), so the latency overhead versus plain GELU is negligible.
    • Sharing one input GEMM between the gate and the value path, rather than running a separate matmul for the gate, keeps the overhead low; for the 27 B model the FFN’s share of parameters stays within budget while the intermediate projections still widen to 16× d_model (counting both gate and value halves).

    d. Alignment with previous Google stacks
    • Gemma 1 already uses approximated GeGLU, and gated FFN variants (e.g. SwiGLU in PaLM) are standard across Google’s other model stacks; keeping the same activation simplifies weight-space transfer (needed for Gemma 2’s knowledge-distillation training recipe).
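
To make the hardware point (c) concrete, here is a minimal sketch of the fused formulation, with the gate and value projections packed into one wide GEMM and split afterwards (illustrative names and shapes, not Gemma 2's actual kernels):

```python
# Fused gated FFN sketch: one wide input GEMM, then a single element-wise gate.
# Names/shapes are illustrative; this is not Gemma 2's actual kernel code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedGeGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # One projection of width 2*d_ff replaces the separate W1/W2 matmuls.
        self.w_in = nn.Linear(d_model, 2 * d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1, h2 = self.w_in(x).chunk(2, dim=-1)           # split the wide GEMM output
        return self.w_down(F.gelu(h1, approximate="tanh") * h2)

# Same parameter count as the two-matrix version: (2*d_ff + d_ff) * d_model.
d_model, d_ff = 512, 2048
fused = FusedGeGLUFeedForward(d_model, d_ff)
print(sum(p.numel() for p in fused.parameters()))        # 3 * 512 * 2048 = 3145728
```

Mathematically this is identical to the two-matrix version shown earlier; fusing simply trades two GEMMs for one wider one, which maps better onto TPU/GPU matrix units and leaves only a single element-wise kernel for the gate.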

In short, Gemma 2 uses the GeGLU activation because it gives a small but consistent improvement in perplexity and downstream task accuracy, trains stably on long sequences, and comes essentially “for free” computationally once fused, making it a good fit for practical-size open-weight models.
