
Gemma 2’s Transformer blocks use the gated GELU activation, better known as GeGLU, inside every feed-forward (MLP) layer.

  1. Definition
    For an input vector $x \in \mathbb{R}^{d}$, the GeGLU transformation is

    $$\operatorname{GeGLU}(x) \;=\; \bigl(xW_1 + b_1\bigr)\;\odot\;\operatorname{GELU}\!\bigl(xW_2 + b_2\bigr),$$

    where
    $W_1, W_2 \in \mathbb{R}^{d \times d_{\text{ff}}/2}$ and $b_1, b_2 \in \mathbb{R}^{d_{\text{ff}}/2}$,
    $\odot$ is element-wise multiplication, and
    $\operatorname{GELU}$ is the activation

    $$\operatorname{GELU}(z)=\tfrac{1}{2}\,z\Bigl(1+\tanh\!\bigl(\sqrt{2/\pi}\,(z+0.044715\,z^{3})\bigr)\Bigr).$$

    The gating halves the expansion size (only $d_{\text{ff}}$ instead of $2d_{\text{ff}}$ of hidden width) while retaining two linear transforms. (A minimal code sketch follows this list.)

  2. Why Gemma 2 uses GeGLU

    • Expressiveness at small scale
    The multiplicative gate lets the network model input-dependent rescaling, which empirically boosts accuracy for compact models (2 B–27 B params) more than plain ReLU, GELU or SiLU.

    • Parameter & compute efficiency
    Compared with standard GELU + linear projection (which needs width $2d_{\text{ff}}$), GeGLU attains similar or better quality with the same $d_{\text{ff}}$, reducing parameters and FLOPs, which is valuable for "practical-size" models.

    • Stable training with RMSNorm
    The smoother GELU component avoids the gradient spikes of ReLU, and the gate keeps activations in a narrower range; together with pre-/post-layer RMSNorm this prevents divergence in very deep stacks (Gemma 2 uses up to 46 layers).

    • Proven track record
    GeGLU (and its cousin SwiGLU) improved T5-v1.1, PaLM-2, Llama-2/3 and other recent LLMs. Gemma 2 ablation studies (Section 5 of the paper) confirmed the same benefit, so the team adopted it as the default non-linearity.
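
As a concrete illustration of the definition in item 1, here is a minimal NumPy sketch of the GeGLU computation. The toy dimensions and variable names are mine, and this is not Gemma 2's actual implementation (which also projects back down to the model width):

```python
import numpy as np

def gelu_tanh(z):
    # tanh approximation of GELU, exactly the formula quoted above
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def geglu(x, W1, b1, W2, b2):
    # GeGLU(x) = (x W1 + b1) ⊙ GELU(x W2 + b2)
    return (x @ W1 + b1) * gelu_tanh(x @ W2 + b2)

# Toy sizes (not Gemma 2's real ones): d = 8, d_ff = 32, so each branch is d_ff/2 = 16 wide.
d, d_ff = 8, 32
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d_ff // 2))
W2 = rng.standard_normal((d, d_ff // 2))
b1 = np.zeros(d_ff // 2)
b2 = np.zeros(d_ff // 2)

print(geglu(x, W1, b1, W2, b2).shape)  # (16,) i.e. d_ff/2
```

Note the two parallel projections: the first branch carries the signal, while the second, after GELU, rescales it element-wise.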

In short, GeGLU offers a sweet spot of higher accuracy, lower parameter count and smoother optimization—ideal for Gemma 2’s goal of “state-of-the-art performance at practical size.”


Below is the same derivation with every symbol introduced explicitly, plus an explanation of the "mysterious" numbers in the GELU formula.

  1. Dimensions and symbols

$d_{\text{model}}$ – width of each Transformer layer (e.g. $2304$ in Gemma-2 2B).
$d_{\text{ff}}$ – hidden width of the feed-forward (MLP) sub-layer. Gemma 2 uses the standard “4× rule”, so $d_{\text{ff}}\approx 4\,d_{\text{model}}$ (Table 1 of the paper).
$x \in \mathbb{R}^{\,d_{\text{model}}}$ – one token’s activations entering the MLP.
$W_1, W_2 \in \mathbb{R}^{\,d_{\text{model}}\times d_{\text{ff}}/2}$ – weight matrices for two parallel linear projections.
$b_1, b_2 \in \mathbb{R}^{\,d_{\text{ff}}/2}$ – their biases.
$\odot$ – element-wise (Hadamard) product.

  2. GeGLU transformation

$$\operatorname{GeGLU}(x)=\bigl(xW_1+b_1\bigr)\;\odot\;\operatorname{GELU}\!\bigl(xW_2+b_2\bigr) \tag{1}$$

Because each branch outputs a vector of length $d_{\text{ff}}/2$, their product also lies in $\mathbb{R}^{d_{\text{ff}}/2}$. A final linear layer brings the dimension back to $d_{\text{model}}$ (omitted in (1) for clarity).
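
To make the shape bookkeeping concrete, here is a sketch of the full feed-forward sub-layer including the final down-projection mentioned above. The widths follow the text ($d_{\text{model}}=2304$, $d_{\text{ff}}\approx 4\,d_{\text{model}}$, each branch $d_{\text{ff}}/2$ wide); the weight names and the plain-NumPy style are my own illustration, not Gemma 2's actual code:

```python
import numpy as np

d_model = 2304           # layer width quoted above for Gemma-2 2B
d_ff = 4 * d_model       # "4x rule" -> 9216
half = d_ff // 2         # each GeGLU branch is d_ff/2 = 4608 wide

def gelu_tanh(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn_geglu(x, W1, b1, W2, b2, W_out, b_out):
    h = (x @ W1 + b1) * gelu_tanh(x @ W2 + b2)   # equation (1), shape (half,)
    return h @ W_out + b_out                     # final projection back to (d_model,)

rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
W1 = 0.02 * rng.standard_normal((d_model, half))
W2 = 0.02 * rng.standard_normal((d_model, half))
W_out = 0.02 * rng.standard_normal((half, d_model))
b1, b2, b_out = np.zeros(half), np.zeros(half), np.zeros(d_model)

print(ffn_geglu(x, W1, b1, W2, b2, W_out, b_out).shape)  # (2304,)
```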

  3. GELU activation with its constants

The Gaussian Error Linear Unit is defined exactly by an integral:

$$\operatorname{GELU}_{\text{exact}}(z)=\tfrac{1}{2}\,z\bigl(1+\operatorname{erf}(z/\sqrt{2})\bigr), \tag{2}$$

where $\operatorname{erf}$ is the error function. Computing $\operatorname{erf}$ is slow on many accelerators, so Hendrycks & Gimpel (2016) proposed the cheap tanh approximation

$$\operatorname{GELU}_{\text{approx}}(z)=\tfrac{1}{2}\,z\Bigl(1+\tanh\!\bigl(\sqrt{2/\pi}\,\bigl(z+0.044715\,z^{3}\bigr)\bigr)\Bigr). \tag{3}$$

$\sqrt{2/\pi}\approx 0.79788456$ comes from matching the slope of $\operatorname{erf}$ at the origin.
$0.044715$ is the coefficient chosen to minimise the approximation error of (3) relative to (2) over $z\in\mathbb{R}$; it was obtained by a small numerical fit in the original paper.

Gemma 2 adopts (3) because it is differentiable, fast, and already the de-facto standard in many open-source frameworks.
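
As a quick numerical sanity check of how close (3) is to (2), here is a short script using the standard library's math.erf for the exact form (my choice for the demo, not anything Gemma-specific). A comment also spells out the first-order Taylor match that fixes $\sqrt{2/\pi}$:

```python
import math
import numpy as np

def gelu_exact(z):
    # equation (2): exact GELU via the error function
    return 0.5 * z * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))

def gelu_tanh(z):
    # equation (3): tanh approximation with the constants discussed above
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

# Near z = 0 both reduce to 0.5*z*(1 + sqrt(2/pi)*z) to first order, because
# erf(z/sqrt(2)) ≈ sqrt(2/pi)*z and tanh(a*z) ≈ a*z; matching slopes gives a = sqrt(2/pi).
z = np.linspace(-6.0, 6.0, 10001)
err = np.max(np.abs(gelu_exact(z) - gelu_tanh(z)))
print(f"max |exact - tanh| on [-6, 6]: {err:.1e}")  # small (well below 1e-2)
```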

  4. Why the gating matters for Gemma 2

• Capacity vs. cost: With the gate, the MLP can model multiplicative interactions yet needs only $d_{\text{ff}}$ of hidden width instead of $2d_{\text{ff}}$ (the same trick the popular SwiGLU uses). That saves memory and FLOPs, which matters for 2 B–27 B-parameter "practical-size" models (a back-of-the-envelope count follows this list).
• Smoother gradients: The Gaussian shape of GELU plus gating keeps activations bounded, reducing the risk of exploding gradients in Gemma 2’s 42–46-layer stacks.
• Empirical gains: Internal ablations (Table 5 in the technical report) showed GeGLU beats plain GELU or ReLU by 1–3 % on validation perplexity at equal compute; hence it became the default non-linearity.
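
For the capacity-vs-cost point, here is a back-of-the-envelope count of the up-projection parameters under the text's convention (each gated branch $d_{\text{ff}}/2$ wide, Gemma-2 2B widths, biases ignored). This is my own illustration of the $d_{\text{ff}}$-vs-$2d_{\text{ff}}$ claim, not a figure from the report:

```python
d_model = 2304
d_ff = 4 * d_model                     # 9216, per the "4x rule" above

# Gated MLP with halved branches (as in the text): two d_model x (d_ff/2) projections.
gated_up = 2 * d_model * (d_ff // 2)   # = d_model * d_ff = 21,233,664

# The same gating with full-width branches would need two d_model x d_ff projections.
full_up = 2 * d_model * d_ff           # = 2 * d_model * d_ff = 42,467,328

print(gated_up, full_up)               # halving each branch halves the up-projection cost
```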

In summary, $d_{\text{ff}}$ is simply the hidden width of the Transformer’s feed-forward block, and the numerical constants in GELU originate from a fast tanh approximation to the exact Gaussian CDF.
