SQL Growth

Why You Shouldn't Peek at Your A/B Test Results

by DataMarvin
5 hours ago

Have you ever refreshed your experiment dashboard every day just to see if results are "significant yet"? That habit might be quietly ruining your experiments.

In this post, I'll explain why peeking at results mid-experiment is statistically dangerous — and how Sequential A/B Testing offers a principled way to handle it.


1. The Core Assumption of Standard A/B Tests

The A/B testing framework most of us learn rests on a strict set of rules:

  1. Calculate the required sample size before the experiment starts
  2. Wait until that sample is fully collected
  3. Analyze the results exactly once

When we say "significance level of 5%," we're making a promise: the probability of falsely concluding an effect exists (when it doesn't) will be kept below 5%. But that promise holds only if we analyze the data just once.
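To make step 1 concrete, here is a minimal sketch of a sample size calculation for comparing two conversion rates with a two-sided z-test. It uses the common normal-approximation formula. The function name, the 10% baseline, and the 12% target are illustrative assumptions, not values from a real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = std_normal.inv_cdf(power)          # quantile matching the desired power
    # Sum of the Bernoulli variances of the two arms
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical scenario: 10% baseline conversion, hoping to detect a lift to 12%
n_per_arm = sample_size_per_arm(0.10, 0.12)
print(f"Users needed per arm: {n_per_arm}")
```

The 5% guarantee described above applies to a single analysis run once this many users have been collected in each arm.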


2. The Peeking Problem

In practice, we run a two-week experiment and check the dashboard every single day. "Is it significant today? What about tomorrow?" Repeatedly testing as data accumulates is called peeking — and it silently inflates your false positive rate.

How bad does it get?

| Number of interim looks | Actual false positive rate (nominal α = 0.05) |
|---|---|
| 1 (no peeking) | 5% |
| 5 | ~14% |
| 14 | ~22% |
| 100 | ~37% |

Source: approximate values based on Armitage et al. (1969)

If you check daily during a two-week experiment and stop the moment results look significant, your real false positive rate is above 20%, not 5%, and it keeps climbing as you add looks. Put differently: when there is no true effect at all, roughly 1 in 5 such experiments would still declare a "winner."
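A quick way to see this inflation for yourself is a small simulation under the null hypothesis (no real effect). The sketch below assumes equally sized batches of data between looks, so the z-statistic at look k behaves like a sum of k independent standard normal draws divided by √k. The function name and simulation settings are my own choices for illustration.

```python
import random
from math import sqrt


def peeking_false_positive_rate(n_looks, n_sims=20000, z_crit=1.96, seed=0):
    """Share of null experiments that look 'significant' at any of n_looks peeks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # standardized signal from one new batch
            z = cumulative / sqrt(look)        # z-statistic on all data so far
            if abs(z) > z_crit:                # naive rule: stop at the first p < 0.05
                false_positives += 1
                break
    return false_positives / n_sims


for looks in (1, 5, 14):
    print(looks, round(peeking_false_positive_rate(looks), 3))
```

Exact numbers vary with the seed and the number of simulated experiments, but the rate should grow steadily as looks are added.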


3. Sequential Testing: A Framework That Allows Interim Looks

Sequential Testing is not a free pass to peek whenever you want. More precisely, it's a framework that permits interim analyses — provided they were planned in advance — while keeping the overall false positive rate under control.

The core idea is simple: spend your total α budget (0.05) across multiple planned checkpoints.

Total α budget = 0.05
  ↓
Distributed across N planned interim analyses
  ↓
Overall false positive rate stays at 0.05

For example, if you plan to check results at Week 1 and Week 2 of a two-week experiment, you decide upfront how to split the α budget across those two looks. The rule governing that split is called an Alpha Spending Function.
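As a rough illustration, here are two widely used spending functions of the Lan-DeMets type. They approximate the two boundaries discussed in the next section. Each returns the cumulative α you are allowed to have spent once a fraction t of the planned data is in (0 < t ≤ 1). The function names are my own, and this is a sketch of the formulas, not a full design tool.

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_type_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: almost nothing early, most of alpha at the end."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / sqrt(t)))


def pocock_type_spending(t, alpha=0.05):
    """Pocock-type spending: alpha is released much more evenly over time."""
    return alpha * log(1 + (e - 1) * t)


# Cumulative alpha spent at the halfway point (Week 1) and at the end (Week 2)
for t in (0.5, 1.0):
    print(t, round(obf_type_spending(t), 4), round(pocock_type_spending(t), 4))
```

Both functions reach the full 0.05 at t = 1. They differ only in how much of the budget is released at the halfway look.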


4. Alpha Spending: The Intuition

There are several Alpha Spending Functions, but the intuition is the same across all of them:

💡 Analogy: Imagine you have a ₩50,000 budget for the month. Spend it all in the first week and you'll have nothing left. Alpha spending works the same way — the more α you use on early checks, the stricter the threshold becomes at the end.

Two common approaches

Pocock boundary

  • Applies the same threshold at every look, interim and final
  • Easier to stop early, but the final threshold is also stricter than usual
  • Pro: uniform and intuitive
  • Con: the final p-value threshold is lower than 0.05 (about 0.029 with two looks), which can feel counterintuitive

O'Brien-Fleming boundary

  • Sets a very high bar early on — early stopping requires an overwhelming signal
  • The final threshold stays close to the standard 0.05
  • Pro: conservative and familiar-feeling at the end
  • Con: rarely stops early unless the effect is massive

In practice, O'Brien-Fleming is the more popular choice. It preserves the familiar final threshold while still allowing principled early stopping when the evidence is undeniable.
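To check that both boundaries really hold the overall error at 5%, here is a small simulation for a design with two equally spaced looks. The critical z-values are rounded from commonly tabulated group sequential constants and should be treated as assumptions of this sketch. In a real design you would get them from a dedicated group sequential package.

```python
import random
from math import sqrt

# Two-sided critical z-values for 2 equally spaced looks, overall alpha = 0.05.
# Rounded tabulated constants (an assumption of this sketch).
POCOCK_2_LOOKS = [2.178, 2.178]                   # same bar at both looks
OBF_2_LOOKS = [1.977 * sqrt(2), 1.977]            # about 2.796 early, 1.977 at the end
NAIVE_2_LOOKS = [1.96, 1.96]                      # unadjusted peeking, for comparison


def overall_false_positive_rate(boundaries, n_sims=50000, seed=0):
    """Share of null experiments that cross any boundary at any planned look."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look, bound in enumerate(boundaries, start=1):
            cumulative += rng.gauss(0.0, 1.0)      # standardized signal from one new batch
            if abs(cumulative / sqrt(look)) > bound:
                rejections += 1
                break
    return rejections / n_sims


for name, bounds in [("Pocock", POCOCK_2_LOOKS), ("O'Brien-Fleming", OBF_2_LOOKS),
                     ("Naive 1.96", NAIVE_2_LOOKS)]:
    print(name, round(overall_false_positive_rate(bounds), 3))
```

The two planned designs should land near 0.05, while the unadjusted 1.96 threshold at both looks should come out noticeably higher.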


5. Standard A/B Test vs. Sequential Testing

| | Standard A/B Test | Sequential Testing |
|---|---|---|
| Interim checks | ❌ Not valid (inflates false positives) | ✅ Allowed (if pre-planned) |
| Early stopping | ❌ No statistical basis | ✅ Possible on clear signal |
| False positive control | ✅ Maintained at 5% | ✅ Maintained at 5% (if design is followed) |
| Design complexity | Low | Medium to High |

The key difference isn't just "can you peek" — it's that Sequential Testing requires you to commit to the plan before the experiment begins. Deviating from the plan (e.g., adding extra looks, stopping at an unplanned time) breaks the statistical guarantees.
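One way to picture "committing to the plan" is to write the plan down as data and refuse any analysis that is not in it. The sketch below is purely hypothetical: the class, the function, and the boundary values (O'Brien-Fleming-style numbers for two looks) are illustrative assumptions, not part of any particular experimentation platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequentialPlan:
    """Pre-registered plan: maps look number to its two-sided critical z-value."""
    boundaries: dict


def decide(plan, look, z_stat):
    """Return the action for a planned look; reject looks that were never planned."""
    if look not in plan.boundaries:
        raise ValueError(f"Look {look} is not in the pre-registered plan")
    if abs(z_stat) > plan.boundaries[look]:
        return "stop: boundary crossed"
    if look == max(plan.boundaries):
        return "stop: final look reached, boundary not crossed"
    return "continue"


# Hypothetical two-look plan fixed before the experiment starts
plan = SequentialPlan(boundaries={1: 2.796, 2: 1.977})

print(decide(plan, look=1, z_stat=2.1))  # strong-looking but below the early bar
print(decide(plan, look=2, z_stat=2.1))  # same z-value clears the final bar
```

Note that the same z-statistic leads to different decisions at Week 1 and Week 2, because the thresholds were fixed in advance for each look.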


6. What Sequential Testing Is Not

It's easy to misread Sequential Testing as a way to "stop whenever things look good." It isn't. A few common misconceptions:

  • "I'll just stop when p < 0.05" → This is still peeking. Sequential Testing requires pre-specified boundaries, not ad hoc stopping.
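  • "Sequential testing means continuous monitoring" → Not quite. The group sequential designs described here allow a fixed number of pre-planned looks. Truly continuous monitoring calls for related but distinct methods, such as mSPRT or confidence sequences, which are built for always-valid inference.
  • "The more interim looks, the better" → More looks spread the same α budget across more checkpoints, so the thresholds get stricter and the experiment needs somewhat more data to keep the same power. There's a real cost.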

Takeaway

Sequential Testing doesn't make experimentation more casual — it makes planned flexibility statistically valid. If you want to look at results mid-experiment, commit to when and how you'll look before the experiment starts.

One line summary: If you want to peek, plan the peek first.
