
Multiple Hypothesis Testing: Bonferroni and FDR

by DataMarvin
14 hours ago

Modern experimentation platforms rarely test a single metric. Recommendation systems track click-through rate, watch time, retention, and revenue simultaneously. Advertising experiments segment results across countries, devices, creatives, and user cohorts. The statistical problem is that every additional test increases the probability of false discoveries.


1. Why Multiple Testing Becomes Dangerous

Suppose you test 20 independent metrics with significance level α = 0.05. Even if all null hypotheses are true — meaning none of your treatments actually work — the probability of observing at least one false positive is:

1 − (1 − 0.05)^20 ≈ 0.64

That means there's roughly a 64% chance of reporting at least one statistically significant result purely from noise.
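The calculation above can be reproduced for any number of tests. The following is a minimal Python sketch, assuming independent tests with every null hypothesis true; the function name `family_wise_error_rate` is illustrative and not taken from any library.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent
    tests, assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    # Show how the chance of at least one false positive grows with m.
    for m in (1, 5, 20, 100):
        print(f"m = {m:>3}: FWER = {family_wise_error_rate(0.05, m):.3f}")
```

With correlated metrics the exact figure differs, but the direction is the same: more tests mean more opportunities for a false positive.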

This is the multiple testing problem. The more hypotheses you test, the more likely you are to find something that looks significant — even when nothing real is happening.


2. Two Error Rates Worth Knowing

Before diving into corrections, it helps to distinguish two different ways of measuring "how wrong you might be."

Family-Wise Error Rate (FWER): The probability of making at least one false positive across all tests. This is the strict standard. It asks: what's the chance I report anything wrong?

False Discovery Rate (FDR): The expected proportion of false positives among all significant results. This is a softer standard. It asks: of everything I call significant, what fraction is actually noise?

| | FWER | FDR |
|---|---|---|
| Question it answers | "Did I make any mistakes?" | "What fraction of my findings are wrong?" |
| Standard | Strict | More lenient |
| Best suited for | Small number of critical tests | Large-scale testing, many metrics |
| Common correction | Bonferroni | Benjamini-Hochberg |

The right standard depends on your context. In a clinical trial where a single false positive could harm patients, FWER control is essential. In a product experiment tracking 20 metrics, FDR control often makes more practical sense.


3. Bonferroni Correction — The Conservative Approach

How it works

The Bonferroni correction controls FWER by dividing your significance threshold by the number of tests:

α_adjusted = α / m

Where m is the number of hypotheses being tested.

If you're testing 20 metrics at α = 0.05, each individual test must pass a threshold of:

α_adjusted = 0.05 / 20 = 0.0025

Only results with p < 0.0025 are called significant. This guarantees that the probability of any false positive across all 20 tests is at most 5%, whether or not the tests are independent.
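The rule is simple enough to express in a few lines. This is a minimal sketch, not a reference implementation: the function name `bonferroni_reject` and the p-values in the usage example are hypothetical, chosen only to illustrate the threshold.

```python
from typing import Dict


def bonferroni_reject(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Flag a hypothesis as significant when its p-value is below alpha / m."""
    m = len(p_values)
    if m == 0:
        return {}
    threshold = alpha / m  # each test gets an equal share of the alpha budget
    return {name: p < threshold for name, p in p_values.items()}


if __name__ == "__main__":
    # Hypothetical p-values for 20 metrics, for illustration only.
    p_values = {f"metric_{i}": 0.04 for i in range(18)}
    p_values["metric_a"] = 0.001
    p_values["metric_b"] = 0.003

    decisions = bonferroni_reject(p_values, alpha=0.05)
    significant = [name for name, keep in decisions.items() if keep]
    print(significant)
```

In this illustration only the metric with p = 0.001 clears the 0.0025 threshold; values such as 0.003 or 0.04, which would pass an uncorrected 0.05 cutoff, do not.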

The intuition

Think of your total α budget (0.05) as a fixed allowance. Bonferroni says: split it equally across all tests. Each test gets a smaller slice, so each individual threshold is stricter.

When Bonferroni works well

  • You have a small number of pre-specified hypotheses (2–5)
  • The cost of any false positive is high — a wrong decision has serious consequences
  • Tests are independent or nearly so, which is where Bonferroni gives up the least power

The problem with Bonferroni

Bonferroni is conservative, often too conservative. By guarding against the worst case (any false positive at all), it misses many true effects, especially when:

  • You're testing many metrics simultaneously
  • Metrics are correlated (as they often are in product experiments)
  • Effect sizes are small but real

In large-scale experimentation, Bonferroni can make it nearly impossible to detect real improvements. You pay for a low chance of any false positive with a lot of missed true positives.


4. False Discovery Rate — A More Practical Standard

The core shift

Rather than asking "did I make any mistakes?", FDR control asks: "among everything I call significant, what fraction is wrong?"

Formally, FDR is defined as:

FDR = E[ false positives / total significant results ]

If you call 10 results significant and FDR = 0.10, you expect roughly 1 of those 10 to be a false positive. You're accepting a small, known rate of error — in exchange for much higher sensitivity.
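FDR is an expectation over repeated experiments, so it cannot be read off a single real experiment where the truth is unknown. In a simulation where the truth is known, the quantity inside the expectation (the false discovery proportion) can be computed directly. The sketch below assumes the usual convention that the proportion is 0 when nothing is called significant; the function name and the example labels are hypothetical.

```python
from typing import Sequence


def false_discovery_proportion(rejected: Sequence[bool], null_is_true: Sequence[bool]) -> float:
    """Share of rejected hypotheses whose null was actually true.

    Defined as 0.0 when nothing is rejected.
    """
    if len(rejected) != len(null_is_true):
        raise ValueError("rejected and null_is_true must have the same length")
    total_rejected = sum(1 for r in rejected if r)
    if total_rejected == 0:
        return 0.0
    false_positives = sum(1 for r, n in zip(rejected, null_is_true) if r and n)
    return false_positives / total_rejected


if __name__ == "__main__":
    # Hypothetical ground truth: 10 results called significant, 1 of them a true null.
    rejected = [True] * 10 + [False] * 5
    null_is_true = [True] + [False] * 9 + [True] * 5
    print(false_discovery_proportion(rejected, null_is_true))
```

Averaging this proportion over many simulated experiments gives an estimate of the FDR for a given procedure.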

The Benjamini-Hochberg Procedure

The most widely used FDR control method is the Benjamini-Hochberg (BH) procedure, introduced in 1995. Here's how it works:

  1. Run all m tests and collect their p-values
  2. Rank p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(m)
  3. Find the largest k such that:

p(k) ≤ (k / m) · q

Where q is your target FDR level (e.g., 0.10).

  4. Reject all hypotheses from p(1) through p(k)
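The steps above can be sketched in plain Python. This is an illustrative implementation under the assumptions stated in the text (m p-values, target level q); the function name `benjamini_hochberg` and the p-values in the usage example are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.10) -> List[bool]:
    """Return one reject flag per hypothesis, in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank k with p(k) <= (k / m) * q; 0 means reject nothing.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank

    # Reject every hypothesis ranked 1 through k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, deliberately unsorted.
    p_values = [0.5, 0.05, 0.01, 0.09, 0.055]
    print(benjamini_hochberg(p_values, q=0.10))
```

In this illustration the second-smallest p-value (0.05) is above its own threshold (0.04), but the third-smallest (0.055) is below its threshold (0.06), so k = 3 and all three smallest p-values are rejected. The loop scans every rank rather than stopping at the first failure, which is what makes this a search for the largest k.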

A worked example

Suppose you test 10 metrics and collect these p-values (sorted):

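| Rank (k) | Metric | p-value | BH threshold (k/10 × 0.10) | Reject? |
|---|---|---|---|---|
| 1 | CTR | 0.001 | 0.010 | Yes |
| 2 | Revenue | 0.008 | 0.020 | Yes |
| 3 | Retention | 0.019 | 0.030 | Yes |
| 4 | Session length | 0.043 | 0.040 | No |
| 5 | Bounce rate | 0.065 | 0.050 | No |
| ... | ... | ... | ... | ... |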

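At rank 4, the p-value (0.043) exceeds the BH threshold (0.040), and the same holds at rank 5. Assuming none of the remaining ranks (6 through 10) falls at or below its threshold, the largest k that satisfies the condition is 3. So we reject the first 3 hypotheses and declare CTR, Revenue, and Retention significant, controlling FDR at 10%.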

Note: once we find the largest k where the condition holds, all hypotheses ranked 1 through k are rejected, even if some intermediate p-values exceed their own thresholds. It's a step-up procedure, not a per-test cutoff.
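The thresholds in the worked example can be checked with a short script. This sketch covers only the five rows that are listed; the p-values for ranks 6 through 10 are not shown in the table, so the script compares each listed p-value with its threshold and does not try to determine the final k on its own.

```python
m, q = 10, 0.10  # number of tests and target FDR level from the example

# Only the five rows listed in the table; ranks 6-10 are not shown there.
listed = [
    ("CTR", 0.001),
    ("Revenue", 0.008),
    ("Retention", 0.019),
    ("Session length", 0.043),
    ("Bounce rate", 0.065),
]

rows = []
for rank, (metric, p) in enumerate(listed, start=1):
    threshold = rank / m * q  # BH threshold for this rank
    rows.append((rank, metric, p, threshold, p <= threshold))

for rank, metric, p, threshold, meets in rows:
    print(f"{rank}  {metric:<15} p = {p:.3f}  threshold = {threshold:.3f}  p <= threshold: {meets}")
```

Ranks 1 through 3 meet their thresholds and ranks 4 and 5 do not, which matches the conclusion in the text provided no later rank meets its threshold.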


5. Bonferroni vs. Benjamini-Hochberg — Side by Side

| | Bonferroni | Benjamini-Hochberg |
|---|---|---|
| Controls | FWER | FDR |
| Error standard | Low probability of any false positive | Known expected proportion of false positives |
| Correction formula | α / m per test | Ranked p-value comparison |
| Conservatism | High | Moderate |
| Statistical power | Low (many missed effects) | Higher |
| Best for | Few critical tests | Many simultaneous tests |
| Assumption | Valid whether tests are independent or correlated | Works under independence; extensions for correlation exist |

6. Which Standard Should You Use in Practice?

This is where context matters more than methodology.

Use Bonferroni (FWER) when:

  • You have a small, pre-specified set of primary metrics (1–5)
  • A single false positive has serious downstream consequences — a wrong product decision, a misleading headline metric, a regulated claim
  • You want a simple, defensible correction that everyone understands

Use Benjamini-Hochberg (FDR) when:

  • You're testing many metrics simultaneously (10+)
  • You're doing exploratory analysis — scanning for signals across a large metric taxonomy
  • You can tolerate a small known rate of false positives in exchange for higher sensitivity
  • You're running post-hoc subgroup analyses across many segments

A practical heuristic for product experimentation:

  • Define 1–3 primary metrics upfront and apply Bonferroni (or no correction if there's only one)
  • Treat all secondary and exploratory metrics as FDR-controlled, with q = 0.10 or 0.20
  • Pre-register the distinction between primary and secondary before the experiment starts

This hybrid approach preserves rigor on what matters most while staying sensitive to signals across a broader metric set.
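One way to wire the hybrid approach into an analysis script is sketched below. Everything here is an assumption made for illustration: the function names, the metric names, and the p-values are hypothetical, and the split between primary and secondary metrics is taken as fixed before the experiment starts.

```python
from typing import Dict, List


def bonferroni(p_values: Dict[str, float], alpha: float) -> List[str]:
    """Names of metrics with p < alpha / m (FWER control)."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < threshold]


def benjamini_hochberg(p_values: Dict[str, float], q: float) -> List[str]:
    """Names of metrics rejected by the BH step-up rule (FDR control)."""
    m = len(p_values)
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    k_max = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * q:
            k_max = rank
    return [name for name, _ in ranked[:k_max]]


def analyze(primary: Dict[str, float], secondary: Dict[str, float],
            alpha: float = 0.05, q: float = 0.10) -> Dict[str, List[str]]:
    """Apply Bonferroni to primary metrics and BH to secondary metrics."""
    return {
        "primary": bonferroni(primary, alpha),
        "secondary": benjamini_hochberg(secondary, q),
    }


if __name__ == "__main__":
    # Hypothetical p-values; the primary/secondary split is pre-registered.
    primary = {"conversion": 0.012, "revenue_per_user": 0.030}
    secondary = {"ctr": 0.004, "session_length": 0.030,
                 "bounce_rate": 0.200, "pages_per_visit": 0.045}
    print(analyze(primary, secondary))
```

Keeping the two families in separate dictionaries makes the pre-registered split explicit in the code, so a metric cannot quietly move from one family to the other after the results are in.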


7. A Common Mistake: Correcting After You've Already Looked

Multiple testing corrections only work if you commit to them before you analyze results. If you run 20 tests, see which ones look interesting, and then apply a correction to just those — you've already introduced selection bias that no correction can fix.

The discipline is in the design:

  • Decide which metrics you're testing before the experiment runs
  • Specify whether they're primary or secondary
  • Apply the appropriate correction at analysis time

Post-hoc corrections applied to a cherry-picked subset of results are not multiple testing corrections. They're rationalization.


Takeaway

Multiple testing isn't just a theoretical concern — it's a practical problem that affects every team running experiments across more than one metric. The question isn't whether to correct, but which standard applies.

One sentence summary:

Bonferroni asks "did I get anything wrong?" — FDR asks "how much of what I found is wrong?" Choose based on how much a false positive actually costs you.
