Gemma-2 uses the (approximated) GeGLU activation in every feed-forward (MLP) block.
- Definition
Let $x \in \mathbb{R}^{d_{\text{model}}}$ be the block input. The first linear layer projects $x$ to a vector of size $2d_{\text{ff}}$, which is split into two halves $u$ and $v$ (each of size $d_{\text{ff}}$):

$$[u,\; v] = x W_{\text{in}}, \qquad W_{\text{in}} \in \mathbb{R}^{d_{\text{model}} \times 2 d_{\text{ff}}}$$

$$\mathrm{GeGLU}(x) = \mathrm{GELU}(u) \odot v$$

where $\odot$ is element-wise multiplication and GELU is the Gaussian Error Linear Unit, used here in its tanh approximation:

$$\mathrm{GELU}(z) \approx \tfrac{1}{2}\, z \left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\bigl(z + 0.044715\, z^{3}\bigr)\right)\right)$$

The MLP block then applies a second linear layer $W_{\text{out}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ to project the gated result back to the model width.
(The “approximated” qualifier refers to using the tanh-based approximation of GELU shown above, which avoids the exact error-function call.)
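To make the shapes concrete, here is a minimal NumPy sketch of the block as defined above. The names (`gelu_tanh`, `geglu_ffn`, `w_in`, `w_out`) and toy dimensions are my own illustrative assumptions, not Gemma-2's actual code; real implementations usually keep two separate gate/up matrices instead of one fused `w_in`, which is mathematically equivalent.

```python
import numpy as np

def gelu_tanh(z: np.ndarray) -> np.ndarray:
    # Tanh approximation of GELU (the "approximated" variant referenced above).
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def geglu_ffn(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    # x:     (d_model,)           block input
    # w_in:  (d_model, 2 * d_ff)  fused up-projection producing gate and value halves
    # w_out: (d_ff, d_model)      down-projection back to the model width
    u, v = np.split(x @ w_in, 2)  # gate half u and value half v, each (d_ff,)
    h = gelu_tanh(u) * v          # GeGLU: GELU(u) elementwise-multiplied by v
    return h @ w_out

# Toy dimensions for illustration only (not Gemma-2's real sizes).
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(d_model,))
w_in = 0.02 * rng.normal(size=(d_model, 2 * d_ff))
w_out = 0.02 * rng.normal(size=(d_ff, d_model))
print(geglu_ffn(x, w_in, w_out).shape)  # -> (8,)
```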
- Why GeGLU was chosen
• Gating improves expressiveness. Multiplying one branch by a nonlinear gate lets the MLP model input-dependent feature selection, which simple ReLU or GELU alone cannot do.
• Higher parameter efficiency. With a gate the projected dimension can be kept large (Gemma-2 uses $d_{\text{ff}}$ several times larger than $d_{\text{model}}$) without destabilising training, giving more capacity per parameter than a plain linear-GELU block.
• Empirical gains. Prior work (e.g. LLaMA and PaLM) showed that GeGLU or its sibling SwiGLU yields better perplexity for the same compute; the Gemma-2 ablations confirmed the same trend.
• Stable gradients. GELU’s smoothness plus the gating term reduces activation variance compared to ReLU, helping very deep networks (Gemma-2 has up to 46 layers) train reliably.
• Fast to compute. The tanh approximation uses only elementary ops (multiply, add, tanh) that vectorise well on TPUs/GPUs, so GeGLU adds negligible latency compared with ReLU or GELU alone, while tracking the exact erf-based GELU closely (see the numeric check after this list).
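As a quick sanity check on that last point, the following sketch (my own, using `math.erf` for the exact form) compares the tanh approximation against the exact erf-based GELU on a dense grid:

```python
import math
import numpy as np

def gelu_exact(z: np.ndarray) -> np.ndarray:
    # Exact GELU: z * Phi(z), with the Gaussian CDF written via the error function.
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))

def gelu_tanh(z: np.ndarray) -> np.ndarray:
    # Tanh approximation used by the GeGLU definition above.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-6.0, 6.0, 10001)
max_err = np.max(np.abs(gelu_exact(z) - gelu_tanh(z)))
print(max_err)  # a few times 1e-4 on this range
```

The maximum deviation stays on the order of $10^{-4}$ (below $10^{-3}$), negligible next to typical activation magnitudes.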
In short, Gemma-2 keeps the strong empirical performance of recent gated activations while remaining computationally lightweight, hence the choice of approximated GeGLU.