Why You Shouldn't Peek at Your A/B Test Results

by DataMarvin Lab

May 30, 2026

Have you ever refreshed your experiment dashboard every day just to see if results are "significant yet"? That habit might be quietly ruining your experiments.

In this post, I'll explain why peeking at results mid-experiment is statistically dangerous — and how Sequential A/B Testing offers a principled way to handle it.

1. The Core Assumption of Standard A/B Tests

The A/B testing framework most of us learn rests on a strict set of rules:

Calculate the required sample size before the experiment starts
Wait until that sample is fully collected
Analyze the results exactly once

When we say "significance level of 5%," we're making a promise: the probability of falsely concluding an effect exists (when it doesn't) will be at most 5%. But that promise holds only if we analyze the data once, at the sample size we fixed in advance.
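To make the first rule concrete, here is a minimal sketch of a pre-experiment sample size calculation. It assumes a conversion-rate metric, a two-sided two-proportion z-test, and 80% power. The function name and the 10% vs. 12% conversion rates are hypothetical values chosen for illustration, not figures from a real experiment.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    # Sum of the Bernoulli variances of the two arms.
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical scenario: 10% baseline conversion, hoping to detect a lift to 12%.
print(sample_size_per_arm(0.10, 0.12))
```

The smaller the lift you want to detect, the larger the number gets, which is exactly why the temptation to look early is so strong.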

2. The Peeking Problem

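In practice, we run a two-week experiment and check the dashboard every single day. "Is it significant today? What about tomorrow?" Repeatedly testing as data accumulates, and stopping as soon as one test comes back significant, is called peeking. Each look gives random noise another chance to cross the threshold, so the false positive rate quietly climbs above the nominal 5%.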

How bad does it get?

Number of looks	Actual false positive rate (α = 0.05)
1 (no peeking)	5%
5	~14%
14	~22%
100	~37%

Source: Based on Armitage et al. (1969), assuming equally spaced looks and a two-sided test at each look

If you check daily and stop the moment results look significant, a two-week experiment carries a false positive rate above 20%, and the rate keeps climbing with every additional look. That means roughly 1 in 5 experiments with no real effect would still produce a "winner."
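The inflation is easy to reproduce yourself. The sketch below simulates A/A experiments (no true effect) under simplifying assumptions: equally sized batches of data between looks, a known variance, and a two-sided z-test at |z| > 1.96 on every look. The function name and simulation settings are my own choices for illustration.

```python
import math
import random

def false_positive_rate(n_looks, n_sims=4000, z_crit=1.96, seed=0):
    """Estimate how often an A/A test is declared significant at any look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each draw stands for the standardized contribution of one
            # equal-sized batch of data; the true effect is zero.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)  # z-statistic on all data so far
            if abs(z) > z_crit:
                false_positives += 1  # the peeker stops and ships a "winner"
                break
    return false_positives / n_sims

for looks in (1, 5, 14, 100):
    print(f"{looks:>3} looks: {false_positive_rate(looks):.1%}")
```

Because this is a Monte Carlo estimate, the printed rates wobble a little from seed to seed, but the upward trend with the number of looks is stable.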

3. Sequential Testing: A Framework That Allows Interim Looks

Sequential Testing is not a free pass to peek whenever you want. More precisely, it's a framework that permits interim analyses — provided they were planned in advance — while keeping the overall false positive rate under control.

The core idea is simple: spend your total α budget (0.05) across multiple planned checkpoints.

Total α budget = 0.05
↓
Distributed across N planned analyses
↓
Overall false positive rate stays at 0.05

For example, if you plan to check results at Week 1 and Week 2 of a two-week experiment, you decide upfront how to split the α budget across those two looks. The rule governing that split is called an Alpha Spending Function.

4. Alpha Spending: The Intuition

There are several Alpha Spending Functions, but the intuition is the same across all of them:

💡 Analogy: Imagine you have a ₩50,000 budget for the month. Spend it all in the first week and you'll have nothing left. Alpha spending works the same way — the more α you use on early checks, the stricter the threshold becomes at the end.
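To see what "spending" looks like in numbers, here is a small sketch of two spending functions from the Lan-DeMets family, named after the Pocock and O'Brien-Fleming boundaries discussed in this section. Each one returns the cumulative α you are allowed to have used once a fraction t of the planned sample is in. The two-look schedule (t = 0.5 and t = 1.0) mirrors the Week 1 / Week 2 example; the function names are mine.

```python
import math
from statistics import NormalDist

ALPHA = 0.05
_norm = NormalDist()

def obf_type_spending(t, alpha=ALPHA):
    """O'Brien-Fleming-type spending: very little alpha early, most at the end."""
    z = _norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _norm.cdf(z / math.sqrt(t)))

def pocock_type_spending(t, alpha=ALPHA):
    """Pocock-type spending: alpha is used up far more evenly across looks."""
    return alpha * math.log(1 + (math.e - 1) * t)

# t is the information fraction: share of the planned sample collected so far.
for t in (0.5, 1.0):
    print(f"t={t:.1f}  OBF-type: {obf_type_spending(t):.4f}  "
          f"Pocock-type: {pocock_type_spending(t):.4f}")
```

Both functions reach the full 0.05 at t = 1.0. They differ only in how much of the budget is gone by the halfway point, which is the monthly-budget analogy in code form.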

Two common approaches

Pocock boundary

Applies the same threshold at every look, including the final one
Easier to stop early, but every threshold, the final one included, is stricter than 0.05 (with five looks, roughly p < 0.016 at each)
Pro: uniform and easy to explain
Con: a final result with p just under 0.05 does not count as significant, which can feel counterintuitive

O'Brien-Fleming boundary

Sets a very high bar early on — early stopping requires an overwhelming signal
The final threshold stays close to the standard 0.05
Pro: conservative and familiar-feeling at the end
Con: rarely stops early unless the effect is massive

In practice, O'Brien-Fleming is the more popular choice. It preserves the familiar final threshold while still allowing principled early stopping when the evidence is undeniable.
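The contrast between the two boundaries can be approximated by simulation. The sketch below assumes five equally spaced looks and a two-sided test, simulates experiments with no true effect, and picks the constant that makes each boundary shape cross in 5% of them. Dedicated group sequential software computes these values by numerical integration; this Monte Carlo version is only meant to show the shapes, and its function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def calibrate_boundaries(n_looks=5, alpha=0.05, n_sims=20000, seed=1):
    """Monte Carlo z-boundaries for Pocock and O'Brien-Fleming shapes."""
    rng = random.Random(seed)
    pocock_stats, obf_stats = [], []
    for _ in range(n_sims):
        cumulative = 0.0
        max_pocock = max_obf = 0.0
        for k in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one batch of null data
            z = abs(cumulative) / math.sqrt(k)
            # Pocock rejects when |z_k| > c at any look.
            max_pocock = max(max_pocock, z)
            # O'Brien-Fleming rejects when |z_k| > c * sqrt(K / k),
            # which is the same as |z_k| * sqrt(k / K) > c.
            max_obf = max(max_obf, z * math.sqrt(k / n_looks))
        pocock_stats.append(max_pocock)
        obf_stats.append(max_obf)
    idx = int((1 - alpha) * n_sims)  # position of the (1 - alpha) quantile
    c_pocock = sorted(pocock_stats)[idx]
    c_obf = sorted(obf_stats)[idx]
    pocock = [c_pocock] * n_looks
    obf = [c_obf * math.sqrt(n_looks / k) for k in range(1, n_looks + 1)]
    return pocock, obf

def two_sided_p(z):
    """Nominal two-sided p-value threshold that corresponds to a z-boundary."""
    return 2 * (1 - NormalDist().cdf(z))

pocock, obf = calibrate_boundaries()
for k, (zp, zo) in enumerate(zip(pocock, obf), start=1):
    print(f"look {k}: Pocock p<{two_sided_p(zp):.4f}   "
          f"O'Brien-Fleming p<{two_sided_p(zo):.5f}")
```

Reading the output top to bottom, the Pocock column stays flat while the O'Brien-Fleming column starts out extremely strict and relaxes toward the familiar 0.05 at the last look.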

5. Standard A/B Test vs. Sequential Testing

	Standard A/B Test	Sequential Testing
Interim checks	❌ Not valid (inflates false positives)	✅ Allowed (if pre-planned)
Early stopping	❌ No statistical basis	✅ Possible on clear signal
False positive control	✅ Maintained at 5%	✅ Maintained at 5% (if design is followed)
Design complexity	Low	Medium to High

The key difference isn't just "can you peek" — it's that Sequential Testing requires you to commit to the plan before the experiment begins. Deviating from the plan (e.g., adding extra looks, stopping at an unplanned time) breaks the statistical guarantees.
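One way to keep yourself honest is to write the plan down as data and let code refuse to decide at any other moment. The sketch below is a hypothetical illustration: the `PlannedLook` and `decide` names, the sample sizes, and the two-look schedule are all assumptions, and the z-boundaries are approximate O'Brien-Fleming values for two equally spaced looks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlannedLook:
    sample_size: int   # per-arm sample size at which this look is allowed
    z_boundary: float  # pre-specified |z| required to stop at this look

# Hypothetical plan, fixed before launch (approximate two-look
# O'Brien-Fleming boundaries at a two-sided alpha of 0.05).
PLAN = (
    PlannedLook(sample_size=2000, z_boundary=2.797),
    PlannedLook(sample_size=4000, z_boundary=1.977),
)

def decide(plan, n_per_arm, z_stat):
    """Apply the pre-registered rule; refuse to decide at unplanned looks."""
    for i, look in enumerate(plan):
        if n_per_arm == look.sample_size:
            if abs(z_stat) >= look.z_boundary:
                return "stop: significant"
            if i == len(plan) - 1:
                return "stop: not significant"
            return "continue"
    return "no decision: unplanned look"

print(decide(PLAN, 2000, 2.1))  # planned interim look, boundary not crossed
print(decide(PLAN, 3000, 3.5))  # tempting, but this look was never planned
```

The second call is the interesting one: a z-statistic of 3.5 looks impressive, yet the function declines to act on it because that look was not part of the plan.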

6. What Sequential Testing Is Not

It's easy to misread Sequential Testing as a way to "stop whenever things look good." It isn't. A few common misconceptions:

"I'll just stop when p < 0.05" → This is still peeking. Sequential Testing requires pre-specified boundaries, not ad hoc stopping.
"Sequential testing means continuous monitoring" → Continuous monitoring methods (like mSPRT or confidence sequences) are a related but distinct approach. They provide always-valid inference, where error control holds no matter when you stop, while the designs in this post control error only at the pre-planned looks.
"The more interim looks, the better" → More looks = the same α budget split more ways = stricter thresholds at each look. There's a real cost.

Takeaway

Sequential Testing doesn't make experimentation more casual — it makes planned flexibility statistically valid. If you want to look at results mid-experiment, commit to when and how you'll look before the experiment starts.