Multiple Hypothesis Testing: Bonferroni and FDR

by DataMarvin

14 hours ago

Modern experimentation platforms rarely test a single metric. Recommendation systems track click-through rate, watch time, retention, and revenue simultaneously. Advertising experiments segment results across countries, devices, creatives, and user cohorts. The statistical problem is that every additional test increases the probability of false discoveries.

1. Why Multiple Testing Becomes Dangerous

Suppose you test 20 independent metrics at significance level α = 0.05. Even if every null hypothesis is true (the treatment has no real effect on any metric), the probability of observing at least one false positive is:

1 − (1 − 0.05)^20 ≈ 0.64

That means there's roughly a 64% chance of reporting at least one statistically significant result purely by noise.
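The same calculation works for any number of tests. Here is a minimal Python sketch, assuming the tests are independent and every null hypothesis is true. The function name `familywise_error_rate` is illustrative, not taken from any library.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


# How quickly the risk grows with the number of tests at alpha = 0.05.
for m in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, m)
    print(f"m = {m:>3}: P(at least one false positive) = {rate:.2f}")
```

With correlated metrics the exact number changes, but the direction does not: more tests means more chances for noise to cross the threshold.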

This is the multiple testing problem. The more hypotheses you test, the more likely you are to find something that looks significant — even when nothing real is happening.

2. Two Error Rates Worth Knowing

Before diving into corrections, it helps to distinguish two different ways of measuring "how wrong you might be."

Family-Wise Error Rate (FWER): The probability of making at least one false positive across all tests. This is the strict standard. It asks: what's the chance I report anything wrong?

False Discovery Rate (FDR): The expected proportion of false positives among all significant results. This is a softer standard. It asks: of everything I call significant, what fraction is actually noise?

	FWER	FDR
Question it answers	"Did I make any mistakes?"	"What fraction of my findings are wrong?"
Standard	Strict	More lenient
Best suited for	Small number of critical tests	Large-scale testing, many metrics
Common correction	Bonferroni	Benjamini-Hochberg

The right standard depends on your context. In a clinical trial where a single false positive could harm patients, FWER control is essential. In a product experiment tracking 20 metrics, FDR control often makes more practical sense.

3. Bonferroni Correction — The Conservative Approach

How it works

The Bonferroni correction controls FWER by dividing your significance threshold by the number of tests:

α_adjusted = α / m

Where m is the number of hypotheses being tested.

If you're testing 20 metrics at α = 0.05, each individual test must pass a threshold of:

α_adjusted = 0.05 / 20 = 0.0025

Only results with p < 0.0025 are called significant. This guarantees that the probability of at least one false positive across all 20 tests is at most 5%, whether or not the tests are independent.
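In code, the correction is a one-line threshold change. This is a minimal sketch rather than a reference implementation: the function name `bonferroni_reject` and the p-values are hypothetical, and it uses `<=` at the boundary, which makes no practical difference for continuous p-values.

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return one reject decision per p-value using the threshold alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # every test gets an equal share of alpha
    return [p <= threshold for p in p_values]


# Hypothetical p-values for 20 metrics: three small ones and 17 that are clearly noise.
p_values = [0.001, 0.004, 0.03] + [0.5] * 17
decisions = bonferroni_reject(p_values, alpha=0.05)
print(sum(decisions), "of", len(p_values), "metrics pass the 0.0025 threshold")
```

Only the first p-value (0.001) survives. The 0.004 and 0.03 results would have passed an uncorrected 0.05 threshold but do not pass 0.0025.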

The intuition

Think of your total α budget (0.05) as a fixed allowance. Bonferroni says: split it equally across all tests. Each test gets a smaller slice, so each individual threshold is stricter.

When Bonferroni works well

You have a small number of pre-specified hypotheses (2–5)
The cost of any false positive is high — a wrong decision has serious consequences
Tests are independent or nearly so (the correction stays valid under correlation, but becomes more conservative than necessary)

The problem with Bonferroni

Bonferroni is conservative, often too conservative. Because it guards against the worst case (any false positive at all), it misses many true effects, especially when:

You're testing many metrics simultaneously
Metrics are correlated (as they often are in product experiments)
Effect sizes are small but real

In large-scale experimentation, Bonferroni can make it nearly impossible to detect real improvements. You pay for a low chance of any false positive with a lot of missed true positives.

4. False Discovery Rate — A More Practical Standard

The core shift

Rather than asking "did I make any mistakes?", FDR control asks: "among everything I call significant, what fraction is wrong?"

Formally, FDR is defined as:

FDR = E[ false positives / total significant results ]

If you call 10 results significant and FDR = 0.10, you expect roughly 1 of those 10 to be a false positive. You're accepting a small, known rate of error — in exchange for much higher sensitivity.
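To make the ratio inside that expectation concrete, here is a small sketch that computes it for a single experiment. It assumes you know which null hypotheses are true, which is only the case in simulations. The function name `false_discovery_proportion` is illustrative.

```python
def false_discovery_proportion(rejected: list[bool], null_is_true: list[bool]) -> float:
    """Fraction of rejections that are false positives in one experiment.

    Defined as 0 when nothing is rejected. FDR is the expected value of
    this quantity over repeated experiments.
    """
    total_rejections = sum(rejected)
    false_positives = sum(r and n for r, n in zip(rejected, null_is_true))
    return false_positives / max(total_rejections, 1)


# Ten significant results, one of which comes from a true null hypothesis.
rejected = [True] * 10
null_is_true = [True] + [False] * 9
print(false_discovery_proportion(rejected, null_is_true))
```

In any single experiment this proportion can land above or below the target. FDR control is a statement about its average.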

The Benjamini-Hochberg Procedure

The most widely used FDR control method is the Benjamini-Hochberg (BH) procedure, introduced in 1995. Here's how it works:

Run all m tests and collect their p-values
Rank p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(m)
Find the largest k such that:

p(k) ≤ (k / m) · q

Where q is your target FDR level (e.g., 0.10).

Reject all hypotheses from p(1) through p(k)
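The four steps above translate fairly directly into code. The following is a minimal pure-Python sketch, not a reference implementation. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.10) -> list[bool]:
    """Return reject decisions in the original order of p_values."""
    m = len(p_values)
    # Indices of the hypotheses, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank

    # Reject every hypothesis ranked 1 through largest_k.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical, unsorted p-values for five metrics.
p_values = [0.30, 0.002, 0.04, 0.011, 0.75]
print(benjamini_hochberg(p_values, q=0.10))
```

With q = 0.10 and m = 5 the thresholds are 0.02, 0.04, 0.06, 0.08, and 0.10. The three smallest p-values (0.002, 0.011, 0.04) meet theirs, so those three hypotheses are rejected.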

A worked example

Suppose you test 10 metrics and collect these p-values (sorted):

Rank (k)	Metric	p-value	BH threshold (k/10 × 0.10)	Reject?
1	CTR	0.001	0.010	✅
2	Revenue	0.008	0.020	✅
3	Retention	0.019	0.030	✅
4	Session length	0.043	0.040	❌
5	Bounce rate	0.065	0.050	❌
...	...	...	...	❌

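Rank 3 is the largest rank whose p-value sits at or below its BH threshold: at rank 4 the p-value (0.043) exceeds the threshold (0.040), and every later rank fails as well. So we reject the first 3 hypotheses and declare CTR, Revenue, and Retention significant, controlling FDR at 10%.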

Note: once we find the largest k where the condition holds, all hypotheses ranked 1 through k are rejected, even if some intermediate p-values exceed their own thresholds. It's a step-up procedure, not a per-test cutoff, so you cannot stop scanning at the first rank that fails.
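The difference between "largest k" and "stop at the first failure" is easy to get wrong. The sketch below contrasts the two on hypothetical p-values chosen so that rank 2 misses its threshold while rank 3 meets it. Both function names are illustrative.

```python
def bh_rejection_count(sorted_p: list[float], q: float) -> int:
    """Largest rank k with p(k) <= (k / m) * q, as BH prescribes."""
    m = len(sorted_p)
    passing = [k for k, p in enumerate(sorted_p, start=1) if p <= k / m * q]
    return max(passing, default=0)


def first_failure_count(sorted_p: list[float], q: float) -> int:
    """Stops at the first rank that misses its threshold. This is not BH."""
    m = len(sorted_p)
    count = 0
    for k, p in enumerate(sorted_p, start=1):
        if p > k / m * q:
            break
        count = k
    return count


# Hypothetical sorted p-values; thresholds at q = 0.10 are 0.02, 0.04, 0.06, 0.08, 0.10.
sorted_p = [0.001, 0.045, 0.05, 0.9, 0.95]
print("BH rejects:", bh_rejection_count(sorted_p, q=0.10))
print("Stop-at-first-failure rejects:", first_failure_count(sorted_p, q=0.10))
```

BH rejects three hypotheses here, including the rank-2 result whose p-value (0.045) is above its own threshold (0.04). Stopping at the first failure would reject only one.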

5. Bonferroni vs. Benjamini-Hochberg — Side by Side

	Bonferroni	Benjamini-Hochberg
Controls	FWER	FDR
Error standard	At most α probability of any false positive	Bounded expected proportion of false positives
Correction formula	α / m per test	Ranked p-value comparison
Conservatism	High	Moderate
Statistical power	Low (many missed effects)	Higher
Best for	Few critical tests	Many simultaneous tests
Assumption	Valid under any dependence between tests	Valid under independence or positive dependence; the Benjamini-Yekutieli variant covers arbitrary dependence
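A small simulation can make the power difference in the table tangible. This is a sketch under invented assumptions: 100 metrics, 20 of them with a true effect (z-scores centered at 3), 80 pure noise, and both methods run at the same level of 0.05. All names and numbers are hypothetical.

```python
import random
from statistics import NormalDist


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal null."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bonferroni_count(p_values: list[float], alpha: float) -> int:
    m = len(p_values)
    return sum(p <= alpha / m for p in p_values)


def bh_count(p_values: list[float], q: float) -> int:
    m = len(p_values)
    ranked = sorted(p_values)
    passing = [k for k, p in enumerate(ranked, start=1) if p <= k / m * q]
    return max(passing, default=0)


rng = random.Random(0)  # fixed seed so the run is repeatable
z_scores = [rng.gauss(3.0, 1.0) for _ in range(20)]   # metrics with a real effect
z_scores += [rng.gauss(0.0, 1.0) for _ in range(80)]  # metrics with no effect
p_values = [two_sided_p(z) for z in z_scores]

n_bonf = bonferroni_count(p_values, alpha=0.05)
n_bh = bh_count(p_values, q=0.05)
print(f"Bonferroni rejections: {n_bonf}, Benjamini-Hochberg rejections: {n_bh}")
```

When both run at the same level, any p-value that clears the Bonferroni threshold α / m also clears its BH threshold, so BH never rejects fewer hypotheses than Bonferroni on the same data.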

6. Which Standard Should You Use in Practice?

This is where context matters more than methodology.

Use Bonferroni (FWER) when:

You have a small, pre-specified set of primary metrics (1–5)
A single false positive has serious downstream consequences — a wrong product decision, a misleading headline metric, a regulated claim
You want a simple, defensible correction that everyone understands

Use Benjamini-Hochberg (FDR) when:

You're testing many metrics simultaneously (10+)
You're doing exploratory analysis — scanning for signals across a large metric taxonomy
You can tolerate a small known rate of false positives in exchange for higher sensitivity
You're running post-hoc subgroup analyses across many segments

A practical heuristic for product experimentation:

Define 1–3 primary metrics upfront and apply Bonferroni (or no correction if there's only one)
Treat all secondary and exploratory metrics as FDR-controlled, with q = 0.10 or 0.20
Pre-register the distinction between primary and secondary before the experiment starts

This hybrid approach preserves rigor on what matters most while staying sensitive to signals across a broader metric set.
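One way to sketch the hybrid heuristic in code is shown below. The metric names, p-values, and function names are all hypothetical, and the split between primary and secondary metrics is assumed to have been fixed before the experiment started.

```python
def bonferroni_reject(p_by_metric: dict[str, float], alpha: float) -> dict[str, bool]:
    """Bonferroni decisions keyed by metric name."""
    m = len(p_by_metric)
    return {name: p <= alpha / m for name, p in p_by_metric.items()}


def benjamini_hochberg_reject(p_by_metric: dict[str, float], q: float) -> dict[str, bool]:
    """Benjamini-Hochberg decisions keyed by metric name."""
    m = len(p_by_metric)
    ranked = sorted(p_by_metric.items(), key=lambda item: item[1])
    largest_k = 0
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= k / m * q:
            largest_k = k
    rejected_names = {name for name, _ in ranked[:largest_k]}
    return {name: name in rejected_names for name in p_by_metric}


# Hypothetical analysis plan and results.
primary = {"ctr": 0.012, "revenue": 0.030}
secondary = {
    "session_length": 0.004,
    "bounce_rate": 0.035,
    "scroll_depth": 0.21,
    "shares": 0.048,
    "search_usage": 0.64,
}

primary_decisions = bonferroni_reject(primary, alpha=0.05)              # FWER on what matters most
secondary_decisions = benjamini_hochberg_reject(secondary, q=0.10)      # FDR on the rest
print(primary_decisions)
print(secondary_decisions)
```

Here the primary threshold is 0.05 / 2 = 0.025, so only `ctr` passes. Among the secondary metrics, `session_length`, `bounce_rate`, and `shares` are flagged at q = 0.10.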

7. A Common Mistake: Correcting After You've Already Looked

Multiple testing corrections only work if you commit to them before you analyze results. If you run 20 tests, see which ones look interesting, and then apply a correction to just those — you've already introduced selection bias that no correction can fix.

The discipline is in the design:

Decide which metrics you're testing before the experiment runs
Specify whether they're primary or secondary
Apply the appropriate correction at analysis time

Post-hoc corrections applied to a cherry-picked subset of results are not multiple testing corrections. They're rationalization.

Takeaway

Multiple testing isn't just a theoretical concern — it's a practical problem that affects every team running experiments across more than one metric. The question isn't whether to correct, but which standard applies.

One sentence summary:

Bonferroni asks "did I get anything wrong?" — FDR asks "how much of what I found is wrong?" Choose based on how much a false positive actually costs you.