excited to dig into this
Thread (8 posts)
Thinking Summary
The user is asking about a formula in section 3.1 of the alignment faking paper. Let me look through the content to find section 3.1. Looking at the paper structure, I can see: - Section 1: Introduction - Section 2: Background & Related Work - Section 3: Alignment Faking Across Models And section 3.1 is "Our Setup" Looking at section 3.1, I can see it describes their experimental setup but I don't see any specific mathematical formula mentioned. The section talks about: - Using StrongReject classifier to grade compliance - Using improved alignment faking classifiers - Using a user message suffix - Evaluating models with temperature 0 - Reporting uncertainty over observed fractions The only mathematical notation I see is: "In each figure, we report the uncertainty over the observed fractions $f$ of $1\sigma = \sqrt{f(1-f)/n} \leq 1/\sqrt{4n}$." This appears to be the formula they're asking about. This is a standard formula for the standard error of a proportion/fraction.
Looking at section 3.1 "Our Setup," the formula you're referring to is likely this uncertainty calculation:
$$1\sigma = \sqrt{f(1-f)/n} \leq 1/\sqrt{4n}$$
This is the standard error formula for a proportion, where:
- $f$ = observed fraction/proportion (e.g., compliance rate or alignment faking rate)
- $n$ = sample size
- $1\sigma$ = one standard deviation (standard error)
Context and meaning:
What it calculates: The uncertainty/margin of error around their measured rates (like compliance rates or alignment faking frequencies).
Why the inequality: The expression $f(1-f)$ is maximized when $f = 0.5$, giving $f(1-f) = 1/4$. So regardless of the actual observed fraction $f$, the standard error will never exceed $1/\sqrt{4n}$.
Practical use: This gives them error bars around their measurements. For example, if they measure a 20% compliance rate with $n$ samples, the standard error is $\sqrt{0.2 \times 0.8 / n} = 0.4/\sqrt{n}$.
Why it matters: Since they're comparing alignment faking behavior across different models and conditions, having rigorous uncertainty estimates helps distinguish real differences from statistical noise. The paper uses this to report error bars on all their measurement plots.
This is a standard statistical approach for binary outcomes (comply/don't comply, shows alignment faking reasoning/doesn't show it).
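As a minimal sketch of this calculation, the error bar and its upper bound could be computed as follows. The function names `standard_error` and `max_standard_error` are illustrative and not from the paper, and the 20% rate over 100 samples is an assumed example, not a reported result.

```python
import math


def standard_error(f: float, n: int) -> float:
    """One-sigma uncertainty sqrt(f(1-f)/n) of an observed fraction f over n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    return math.sqrt(f * (1.0 - f) / n)


def max_standard_error(n: int) -> float:
    """Upper bound 1/sqrt(4n), reached when f = 0.5."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 / math.sqrt(4 * n)


# Hypothetical numbers for illustration only: a 20% observed rate over 100 samples.
se = standard_error(0.20, 100)
bound = max_standard_error(100)
print(f"SE = {se:.3f}, bound = {bound:.3f}")
```

With these assumed inputs the standard error is 0.04 (four percentage points), below the worst-case bound of 0.05 for the same sample size.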
Thinking Summary
The user is asking about the derivation of the standard error formula for a proportion. Let me walk through this step by step. The formula is $\sqrt{f(1-f)/n}$ where $f$ is the observed proportion and $n$ is the sample size. This comes from the binomial distribution and the central limit theorem. Let me think through this: 1. We have $n$ independent trials, each with probability $p$ of success 2. The number of successes $X$ follows a binomial distribution: $X \sim \text{Binomial}(n, p)$ 3. The sample proportion is $\hat{p} = X/n$ 4. We want to find the standard error of $\hat{p}$ For a binomial distribution: - $E[X] = np$ - $\text{Var}(X) = np(1-p)$ So for the sample proportion $\hat{p} = X/n$: - $E[\hat{p}] = E[X/n] = E[X]/n = np/n = p$ - $\text{Var}(\hat{p}) = \text{Var}(X/n) = \text{Var}(X)/n^2 = np(1-p)/n^2 = p(1-p)/n$ The standard error is the square root of the variance: $\text{SE}(\hat{p}) = \sqrt{\text{Var}(\hat{p})} = \sqrt{p(1-p)/n}$ But we don't know the true $p$, so we estimate it with our observed proportion $f = \hat{p}$, giving us: $\text{SE}(\hat{p}) \approx \sqrt{f(1-f)/n}$ This is the formula they're using. The inequality part $\sqrt{f(1-f)/n} \leq 1/\sqrt{4n}$ comes from the fact that $f(1-f)$ is maximized when $f = 0.5$, giving $f(1-f) = 0.5 \times 0.5 = 0.25 = 1/4$.
Great question! Let me walk through the derivation of the standard error formula for a proportion.
Starting point: Binomial distribution
When we measure something like "compliance rate," we're essentially counting successes in $n$ independent trials. If each trial has probability $p$ of success, then the number of successes $X$ follows a binomial distribution:
$$X \sim \text{Binomial}(n, p)$$
Properties of the binomial distribution:
- Mean: $E[X] = np$
- Variance: $\text{Var}(X) = np(1-p)$
Sample proportion
The sample proportion is:
$$\hat{p} = X/n$$
Finding the standard error
Using properties of expectation and variance:
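Mean of sample proportion:
$$E[\hat{p}] = E[X/n] = E[X]/n = np/n = p$$
Variance of sample proportion:
$$\text{Var}(\hat{p}) = \text{Var}(X/n) = \text{Var}(X)/n^2 = np(1-p)/n^2 = p(1-p)/n$$
Standard error:
$$\text{SE}(\hat{p}) = \sqrt{\text{Var}(\hat{p})} = \sqrt{p(1-p)/n}$$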
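The mean and variance of the sample proportion can also be checked numerically. The sketch below sums over the exact binomial probabilities rather than simulating, so the result is deterministic. The helper name `proportion_moments` and the values of `n` and `p` are illustrative assumptions, not taken from the paper.

```python
import math


def proportion_moments(n: int, p: float) -> tuple[float, float]:
    """Exact mean and variance of p_hat = X/n for X ~ Binomial(n, p)."""
    # Binomial probability mass function P(X = k) for k = 0..n.
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    # Expectation and variance of k/n under that distribution.
    mean = sum((k / n) * w for k, w in enumerate(pmf))
    var = sum((k / n - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var


n, p = 50, 0.3  # illustrative values only
mean, var = proportion_moments(n, p)
print(mean, p)                       # exact mean vs. p
print(var, p * (1 - p) / n)          # exact variance vs. p(1-p)/n
print(math.sqrt(var))                # the standard error of p_hat
```

Each printed pair should agree up to floating-point rounding, which is what the algebra predicts.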
The practical formula
Since we don't know the true $p$, we estimate it with our observed proportion $f = \hat{p}$:
$$\text{SE}(\hat{p}) \approx \sqrt{f(1-f)/n}$$
The inequality bound
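The expression $f(1-f)$ is a quadratic that's maximized when $f = 0.5$:
$$f(1-f) = 0.5 \times 0.5 = 0.25 = 1/4$$
Therefore:
$$\sqrt{f(1-f)/n} \leq \sqrt{1/(4n)} = 1/\sqrt{4n}$$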
This gives the authors a conservative upper bound on their uncertainty, regardless of the actual observed proportion.
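One way to see the maximum without calculus is to complete the square. This is a short worked step added for illustration, not something the paper spells out:

```latex
f(1-f) = \tfrac{1}{4} - \left(f - \tfrac{1}{2}\right)^2 \le \tfrac{1}{4}
\quad\Longrightarrow\quad
\sqrt{\frac{f(1-f)}{n}} \le \sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{4n}}
```

The squared term is never negative and vanishes only at $f = 1/2$, so the bound is tight exactly there and becomes increasingly loose as $f$ approaches 0 or 1.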