Multiple Hypothesis Testing: Bonferroni and FDR
Modern experimentation platforms rarely test a single metric. Recommendation systems track click-through rate, watch time, retention, and revenue simultaneously. Advertising experiments segment results across countries, devices, creatives, and user cohorts. The statistical problem is that every additional test increases the probability of false discoveries.
1. Why Multiple Testing Becomes Dangerous
Suppose you test 20 independent metrics with significance level α = 0.05. Even if all null hypotheses are true — meaning none of your treatments actually work — the probability of observing at least one false positive is:
1 − (1 − 0.05)^20 ≈ 0.64
That means there's roughly a 64% chance of reporting at least one statistically significant result purely by noise.
This is the multiple testing problem. The more hypotheses you test, the more likely you are to find something that looks significant — even when nothing real is happening.
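To see how quickly the risk grows, here is a minimal Python sketch that evaluates the same formula for several test counts. It assumes independent tests with every null hypothesis true; the function name `familywise_error_rate` is illustrative only.

```python
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across m independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 20, 100):
    rate = familywise_error_rate(m)
    print(f"m = {m:>3}: P(at least one false positive) = {rate:.2f}")
```

With m = 20 this reproduces the 0.64 figure above.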
2. Two Error Rates Worth Knowing
Before diving into corrections, it helps to distinguish two different ways of measuring "how wrong you might be."
Family-Wise Error Rate (FWER): the probability of making at least one false positive across all tests. This is the strict standard. It asks: what's the chance I report anything wrong?
False Discovery Rate (FDR): the expected proportion of false positives among all significant results. This is a softer standard. It asks: of everything I call significant, what fraction is actually noise?
| | FWER | FDR |
|---|---|---|
| Question it answers | "Did I make any mistakes?" | "What fraction of my findings are wrong?" |
| Standard | Strict | More lenient |
| Best suited for | Small number of critical tests | Large-scale testing, many metrics |
| Common correction | Bonferroni | Benjamini-Hochberg |
The right standard depends on your context. In a clinical trial where a single false positive could harm patients, FWER control is essential. In a product experiment tracking 20 metrics, FDR control often makes more practical sense.
3. Bonferroni Correction — The Conservative Approach
How it works
The Bonferroni correction controls FWER by dividing your significance threshold by the number of tests:
α_adjusted = α / m
Where m is the number of hypotheses being tested.
If you're testing 20 metrics at α = 0.05, each individual test must pass a threshold of:
α_adjusted = 0.05 / 20 = 0.0025
Only results with p < 0.0025 are called significant. This guarantees that the probability of any false positive across all 20 tests stays at or below 5%, whether or not the tests are independent.
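A minimal sketch of this rule in Python follows. The function name `bonferroni_reject` and the p-values are hypothetical, chosen only to illustrate the threshold.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if p is below alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # each test gets an equal share of the alpha budget
    return [p < threshold for p in p_values]


# Hypothetical p-values for four metrics; the threshold is 0.05 / 4 = 0.0125.
print(bonferroni_reject([0.001, 0.012, 0.04, 0.20]))
# [True, True, False, False]
```

Note that 0.04 would pass an uncorrected 0.05 threshold but fails once the budget is split four ways.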
The intuition
Think of your total α budget (0.05) as a fixed allowance. Bonferroni says: split it equally across all tests. Each test gets a smaller slice, so each individual threshold is stricter.
When Bonferroni works well
- You have a small number of pre-specified hypotheses (2–5)
- The cost of any false positive is high — a wrong decision has serious consequences
- Tests are independent or nearly so (the correction stays valid under correlation, but typically becomes more conservative)
The problem with Bonferroni
Bonferroni is conservative, often too conservative. Because it guards against the worst case (any false positive at all), it misses many real effects, especially when:
- You're testing many metrics simultaneously
- Metrics are correlated (as they often are in product experiments)
- Effect sizes are small but real
In large-scale experimentation, Bonferroni can make it nearly impossible to detect real improvements. You pay for a tightly bounded chance of any false positive with a lot of missed true positives.
4. False Discovery Rate — A More Practical Standard
The core shift
Rather than asking "did I make any mistakes?", FDR control asks: "among everything I call significant, what fraction is wrong?"
Formally, FDR is defined as:
FDR = E[ false positives / total significant results ]
(with the ratio taken as 0 when nothing is called significant)
If you call 10 results significant while controlling FDR at 0.10, you expect on average about 1 of those 10 to be a false positive. You're accepting a small, bounded rate of error in exchange for much higher sensitivity.
The Benjamini-Hochberg Procedure
The most widely used FDR control method is the Benjamini-Hochberg (BH) procedure, introduced in 1995. Here's how it works:
- Run all m tests and collect their p-values
- Rank p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(m)
- Find the largest k such that:
p(k) ≤ (k / m) · q
Where q is your target FDR level (e.g., 0.10).
- Reject all hypotheses from p(1) through p(k)
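The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return one boolean per hypothesis, in input order: True if BH rejects it."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p(k) <= (k / m) * q; 0 means nothing is rejected.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked 1 through k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values, deliberately unsorted.
p_values = [0.30, 0.004, 0.045, 0.012, 0.80]
print(benjamini_hochberg(p_values, q=0.10))
# [False, True, True, True, False]
```

Returning the decisions in input order rather than sorted order makes it easier to map them back to metric names.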
A worked example
Suppose you test 10 metrics and collect these p-values (sorted):
| Rank (k) | Metric | p-value | BH threshold (k/10 × 0.10) | Reject? |
|---|---|---|---|---|
| 1 | CTR | 0.001 | 0.010 | ✅ |
| 2 | Revenue | 0.008 | 0.020 | ✅ |
| 3 | Retention | 0.019 | 0.030 | ✅ |
| 4 | Session length | 0.043 | 0.040 | ❌ |
| 5 | Bounce rate | 0.065 | 0.050 | ❌ |
| ... | ... | ... | ... | ❌ |
At rank 4, the p-value (0.043) exceeds its BH threshold (0.040), and no later rank meets its threshold either, so the largest k satisfying the condition is 3. We reject the first 3 hypotheses and declare CTR, Revenue, and Retention significant, controlling FDR at 10%.
Note: once we find the largest k where the condition holds, all hypotheses ranked 1 through k are rejected, even if some intermediate p-values exceed their own thresholds. It's a step-up procedure, not a per-test cutoff.
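The table above does not show this behaviour, because its failures happen to be contiguous. A small sketch with hypothetical p-values (unrelated to the experiment above) makes it visible: rank 1 misses its own threshold but is still rejected, because rank 3 passes.

```python
q = 0.10
sorted_p = [0.030, 0.040, 0.060, 0.500]  # hypothetical p-values, already sorted
m = len(sorted_p)

# Compare each p(k) with its own threshold (k / m) * q.
passes = [p <= (k / m) * q for k, p in enumerate(sorted_p, start=1)]
# Largest rank that passes; every rank up to it is rejected.
k_max = max((k for k, ok in enumerate(passes, start=1) if ok), default=0)
rejected = [k <= k_max for k in range(1, m + 1)]

print(passes)    # [False, True, True, False]
print(rejected)  # [True, True, True, False]
```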
5. Bonferroni vs. Benjamini-Hochberg — Side by Side
| | Bonferroni | Benjamini-Hochberg |
|---|---|---|
| Controls | FWER | FDR |
| Error standard | Probability of any false positive ≤ α | Expected proportion of false positives ≤ q |
| Correction formula | α / m per test | Ranked p-value comparison |
| Conservatism | High | Moderate |
| Statistical power | Low (many missed effects) | Higher |
| Best for | Few critical tests | Many simultaneous tests |
| Assumption | Valid under any dependence between tests | Valid under independence and positive dependence; the Benjamini-Yekutieli variant covers arbitrary dependence |
6. Which Standard Should You Use in Practice?
This is where context matters more than methodology.
Use Bonferroni (FWER) when:
- You have a small, pre-specified set of primary metrics (1–5)
- A single false positive has serious downstream consequences — a wrong product decision, a misleading headline metric, a regulated claim
- You want a simple, defensible correction that everyone understands
Use Benjamini-Hochberg (FDR) when:
- You're testing many metrics simultaneously (10+)
- You're doing exploratory analysis — scanning for signals across a large metric taxonomy
- You can tolerate a small known rate of false positives in exchange for higher sensitivity
- You're running post-hoc subgroup analyses across many segments
A practical heuristic for product experimentation:
- Define 1–3 primary metrics upfront and apply Bonferroni (or no correction if there's only one)
- Treat all secondary and exploratory metrics as FDR-controlled, with q = 0.10 or 0.20
- Pre-register the distinction between primary and secondary before the experiment starts
This hybrid approach preserves rigor on what matters most while staying sensitive to signals across a broader metric set.
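One way to sketch the hybrid in Python is shown below. Everything named here (`analyze_experiment`, the metric names, the p-values) is hypothetical. The sketch also assumes the two families are corrected separately, so it makes no joint guarantee across primary and secondary metrics.

```python
def bonferroni(p_values, alpha):
    """True where p is below alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def benjamini_hochberg(p_values, q):
    """True where the BH step-up procedure rejects, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = max(
        (k for k, i in enumerate(order, start=1) if p_values[i] <= k / m * q),
        default=0,
    )
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


def analyze_experiment(primary, secondary, alpha=0.05, q=0.10):
    """primary and secondary map metric name -> p-value.
    Returns metric name -> (family, significant?)."""
    decisions = {}
    for family, metrics, correct, level in (
        ("primary", primary, bonferroni, alpha),
        ("secondary", secondary, benjamini_hochberg, q),
    ):
        names = list(metrics)
        flags = correct([metrics[n] for n in names], level)
        for name, flag in zip(names, flags):
            decisions[name] = (family, flag)
    return decisions


# Hypothetical pre-registered metric families and p-values.
primary = {"retention": 0.012, "revenue": 0.030}
secondary = {"ctr": 0.004, "session_length": 0.045, "bounce_rate": 0.30}
for metric, (family, significant) in analyze_experiment(primary, secondary).items():
    print(f"{metric:<15} {family:<10} significant={significant}")
```

Here revenue (p = 0.030) fails the primary threshold of 0.05 / 2 = 0.025, while session length (p = 0.045) passes under the more lenient FDR control applied to secondary metrics.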
7. A Common Mistake: Correcting After You've Already Looked
Multiple testing corrections only work if you commit to them before you analyze results. If you run 20 tests, see which ones look interesting, and then apply a correction to just those — you've already introduced selection bias that no correction can fix.
The discipline is in the design:
- Decide which metrics you're testing before the experiment runs
- Specify whether they're primary or secondary
- Apply the appropriate correction at analysis time
Post-hoc corrections applied to a cherry-picked subset of results are not multiple testing corrections. They're rationalization.
Takeaway
Multiple testing isn't just a theoretical concern — it's a practical problem that affects every team running experiments across more than one metric. The question isn't whether to correct, but which standard applies.
One sentence summary:
Bonferroni asks "did I get anything wrong?" — FDR asks "how much of what I found is wrong?" Choose based on how much a false positive actually costs you.