Gemma 2
Thread (18 posts)
Thinking Summary
Looking at Table 1 in the embedded paper, I can see a table with model parameters for Gemma 2. It shows various architectural parameters for the 2B, 9B, and 27B models. The table includes entries for "Pre-norm" and "Post-norm", both showing "yes" for all three models.

Looking at the paper text, I can find more details about this. In the "Model Architecture" section, there's a subsection that explains: "Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer."

So pre-norm and post-norm refer to layer normalization applied before and after transformer sub-layers respectively. The paper mentions they use RMSNorm for both input and output normalization of each transformer sub-layer. This is a common architectural choice in transformer models: you can apply layer normalization before the sublayer (pre-norm) and/or after the sublayer (post-norm). The original Transformer paper used post-norm, but many modern implementations use pre-norm or both for training stability.
Looking at Table 1 in the Gemma 2 paper, "Pre-norm" and "Post-norm" refer to layer normalization applied at different positions within the transformer architecture:
- Pre-norm: Layer normalization applied to the input of each transformer sub-layer (before the attention layer and feedforward layer)
- Post-norm: Layer normalization applied to the output of each transformer sub-layer (after the attention layer and feedforward layer)
The paper explains this in the architecture section:
"Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer."
Gemma 2 uses both pre-norm and post-norm (both marked as "yes" in the table), meaning they apply RMSNorm normalization both before and after each transformer sub-layer. This dual normalization approach helps stabilize training.
This is different from the original Transformer architecture which only used post-norm, and also different from some modern implementations that only use pre-norm. Using both provides additional training stability at the cost of some computational overhead.
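For concreteness, here is a minimal, generic RMSNorm sketch in PyTorch. This is the standard Zhang & Sennrich (2019) formulation, not necessarily the exact variant used in the Gemma 2 codebase, which the paper does not spell out:

```python
import torch

class RMSNorm(torch.nn.Module):
    """Root Mean Square normalisation: rescale by the RMS of the activations,
    with no mean-centering and a learned per-channel gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), computed in float32 for numerical stability
        x32 = x.float()
        rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x32 * rms * self.weight).type_as(x)
```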
Thinking Summary
The user is asking about Table 1 in the Gemma 2 paper. Let me look at Table 1, which provides an overview of the main model parameters and design choices. Table 1 shows:
- Parameters: 2B, 9B, 27B
- d_model values
- Layers
- Pre-norm: yes for all models
- Post-norm: yes for all models
- Various other architectural details

The user wants to know what "pre-norm" and "post-norm" refer to. Looking at the text, I can find the relevant information in Section 2 (Model Architecture): "Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer."

So it appears that:
- Pre-norm refers to normalizing the input to each transformer sub-layer
- Post-norm refers to normalizing the output of each transformer sub-layer

Both use RMSNorm (Root Mean Square Normalization), and both are applied to the attention and feed-forward sub-layers of the transformer.
In Table 1, pre-norm and post-norm refer to the placement of normalization layers in the transformer architecture:
- Pre-norm: Normalization applied to the input of each transformer sub-layer (before the attention or feedforward computation)
- Post-norm: Normalization applied to the output of each transformer sub-layer (after the attention or feedforward computation)
According to the paper (Section 2), Gemma 2 uses both pre-norm and post-norm with RMSNorm (Root Mean Square Normalization) to stabilize training. Specifically, they state:
"To stabilize training, we use RMSNorm to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer."
This dual normalization approach is somewhat unusual - many transformer models use either pre-norm or post-norm, but not both. The combination likely helps with training stability, especially for the larger models. All three Gemma 2 model sizes (2B, 9B, and 27B) use this same normalization strategy, as indicated by the "yes" values for both rows in Table 1.
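To make the contrast concrete, here is a schematic comparison in plain Python, with `sublayer` and `norm` as stand-in callables rather than actual Gemma code:

```python
def prenorm_block(x, sublayer, norm):
    # Pre-norm only (common in modern LLMs): normalise the sub-layer input.
    return x + sublayer(norm(x))

def postnorm_block(x, sublayer, norm):
    # Post-norm only (original Transformer): normalise after the residual addition.
    return norm(x + sublayer(x))

def dual_norm_block(x, sublayer, pre_norm, post_norm):
    # Gemma 2 (both rows "yes" in Table 1): normalise the sub-layer's input
    # and its output, then add the residual.
    return x + post_norm(sublayer(pre_norm(x)))
```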
Thinking Summary
**Explaining pre-norm and post-norm**

The user is asking about the 'Pre-norm' and 'Post-norm' columns in Table 1. In the context of transformer layers, particularly with Gemma 2, these terms indicate where normalization occurs in relation to the residual connection within each transformer block. Pre-norm applies RMSNorm before the sub-layer, while post-norm applies it after. Gemma 2 utilizes both methods by applying RMSNorm to both the input and output of each sub-layer, hence the 'yes' under both columns indicates the use of dual normalization.

**Discussing normalization advantages**

I'm looking to explain the benefits of pre-norm and post-norm in transformer networks. Pre-norm enhances training stability, while post-norm can improve inference calibration and performance. Using a dual normalization approach, like RMSNorm, further stabilizes training by managing both inputs and outputs. In a transformer block, pre-norm involves applying normalization to the input before the sub-layer, while post-norm applies it after adding the residual. Gemma 2 employs this "Sandwich" approach, combining both methods effectively.
In a Transformer block there are three main pieces to keep track of, in this order:
- The residual input x coming from the previous layer.
- The sublayer (either self-attention or the feed-forward network, FFN).
- The residual addition that adds the sublayer output back onto x.
Where you place the normalisation step(s) relative to those three pieces is what the paper’s Table 1 is calling “pre-norm” and “post-norm”.
• Pre-norm (a “yes” in the table) means the input is normalised before it is fed into the sublayer:
  h = Sublayer(RMSNorm(x))
• Post-norm (also “yes”) means the sublayer output is normalised again before it is added back to the residual stream:
  y = x + RMSNorm(h) = x + RMSNorm(Sublayer(RMSNorm(x)))
Gemma 2 therefore uses both: every attention or FFN sub-block is wrapped in a dual / sandwich normalisation scheme, with RMSNorm rather than LayerNorm used for every normalisation step:

x
│
├─► RMSNorm ─► Sublayer ─► RMSNorm ─► (+) ─► output
│                                      ▲
└──────────────────────────────────────┘
Why both?
• Pre-norm is known to stabilise very deep Transformers and makes optimisation easier.
• Post-norm (the extra “sandwich” layer) helps keep the magnitude of activations bounded, improving training stability, calibration and in some cases final accuracy (ideas similar to DeepNorm, NormFormer, and Sandwich-LayerNorm).
So the two columns simply indicate that Gemma 2 applies RMSNorm both before and after each Transformer sub-layer.
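Putting it together, a minimal PyTorch sketch of one residually-wrapped sub-layer with this sandwich scheme might look as follows. This is an illustration of the idea, not Gemma's actual code, and it assumes PyTorch ≥ 2.4 for `torch.nn.RMSNorm`:

```python
import torch
from torch import nn

class SandwichSubLayer(nn.Module):
    """Wrap one sub-layer (self-attention or FFN) with pre- and post-RMSNorm."""
    def __init__(self, dim: int, sublayer: nn.Module, eps: float = 1e-6):
        super().__init__()
        self.pre_norm = nn.RMSNorm(dim, eps=eps)   # normalise the sub-layer input
        self.sublayer = sublayer                   # e.g. attention or feed-forward
        self.post_norm = nn.RMSNorm(dim, eps=eps)  # normalise the sub-layer output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x ─► RMSNorm ─► Sublayer ─► RMSNorm ─► (+ residual)
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

# Example: wrap a small feed-forward network and run a dummy batch.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = SandwichSubLayer(512, ffn)
out = block(torch.randn(2, 16, 512))   # output has the same shape as the input
```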