Group Sequential Testing vs. Always-Valid Inference

by DataMarvin
14 hours ago

In the previous post on Sequential Testing, we established that peeking at experiment results mid-run inflates your false positive rate — unless you plan for it statistically. But "planning for interim looks" isn't a single approach. There are two fundamentally different frameworks for doing this:

  • Group Sequential Testing (GST) — you pre-specify exactly when you'll look
  • Always-Valid Inference (AVI) — you can look at any time, continuously

Both solve the peeking problem. But they make different assumptions, offer different flexibility, and suit different operational contexts. Understanding the distinction helps you choose the right tool — and avoid misapplying either one.


1. A Quick Recap: Why This Problem Exists

Standard hypothesis testing assumes you analyze data exactly once, after collecting a fixed sample. If you check results repeatedly — even with the same α threshold — your cumulative false positive rate grows with each look.
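
The inflation is easy to see in a small simulation. The sketch below is an illustration under simplifying assumptions (an A/A test, equally sized batches between looks, a two-sided z-test at every look); the function name and parameters are made up for this example.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_looks, n_sims=20000, alpha=0.05, seed=0):
    """Estimate the chance of at least one 'significant' look in an A/A test.

    Each look adds one equally sized batch of data. Under the null, the
    z-statistic after k batches is the sum of k independent N(0, 1) batch
    statistics divided by sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold, about 1.96
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for k in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / k ** 0.5) > z_crit:
                false_positives += 1
                break  # the experimenter stops at the first "significant" look
    return false_positives / n_sims


for looks in (1, 2, 5, 10):
    print(looks, round(peeking_false_positive_rate(looks), 3))
```

With a single look the estimate should sit near the nominal 5%; with more looks at the same threshold it climbs well above it.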

The core challenge: how do you enable valid inference at multiple points in time, without inflating Type I error?

GST and AVI answer this question in different ways.


2. Group Sequential Testing — Planned Looks, Allocated Budgets

The core idea

Group Sequential Testing was originally developed for clinical trials in the 1970s, where researchers needed to stop a trial early if a drug was clearly working (or clearly harmful). The key word is planned: GST requires you to commit, before the experiment starts, to:

  1. How many interim analyses you will run
  2. When each analysis will happen (e.g., at 25%, 50%, 75%, 100% of target sample)
  3. What stopping boundary applies at each look

The stopping boundaries are calibrated so that the total false positive rate across all planned looks stays at your target α (typically 0.05). This is the Alpha Spending framework covered in the Sequential Testing post.
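
To make the "budget" idea concrete, here is a minimal sketch of one common alpha spending function, the Lan-DeMets O'Brien-Fleming-type function. It is an assumption that this particular function is the right one for your setup; your platform may use a different one. The function name is hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent by information fraction t (0 < t <= 1).

    Lan-DeMets O'Brien-Fleming-type spending function:
        alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t))
    """
    if not 0 < info_fraction <= 1:
        raise ValueError("info_fraction must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 - 2 * _STD_NORMAL.cdf(z / info_fraction ** 0.5)


planned_looks = [0.25, 0.50, 0.75, 1.00]  # the 25/50/75/100% schedule
previous = 0.0
for t in planned_looks:
    cumulative = obf_alpha_spent(t)
    print(f"t={t:.2f}  cumulative={cumulative:.5f}  new at this look={cumulative - previous:.5f}")
    previous = cumulative
```

Almost none of the budget is spent early, and the cumulative total reaches exactly α at the final look. Note that the increment at each look is not the same thing as the nominal p-value threshold for that look: converting spent alpha into boundaries after the first look requires numerical integration, which dedicated group sequential software handles.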

How it works in practice

Suppose you run a two-week experiment and plan to check results at Day 7 and Day 14. Under O'Brien-Fleming boundaries, your stopping thresholds might look like:

| Interim look | Sample collected | p-value threshold to stop |
|---|---|---|
| Look 1 (Day 7) | 50% | p < 0.0054 |
| Look 2 (Day 14) | 100% | p < 0.0492 |

At Look 1, the bar is very high — you need overwhelming evidence to stop early. At the final look, the threshold is close to the standard 0.05. Across both looks, the combined false positive rate stays at 5%.
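
The combined error rate of these two thresholds can be checked numerically. The sketch below assumes two-sided z-tests and looks at exactly 50% and 100% of the data, so that the final statistic is the halfway statistic plus an independent second half; the function name is made up for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def overall_alpha_two_looks(p_look1, p_look2, n_steps=2000):
    """Overall two-sided type I error of a two-look design (looks at 50% and 100%).

    Under the null, Z1 ~ N(0, 1) at the halfway look and the final statistic
    is Z2 = (Z1 + W) / sqrt(2), with W ~ N(0, 1) independent of Z1.
    """
    a = STD_NORMAL.inv_cdf(1 - p_look1 / 2)  # |z| boundary at look 1
    b = STD_NORMAL.inv_cdf(1 - p_look2 / 2)  # |z| boundary at look 2

    def continue_density(z1):
        # Density of Z1 times P(|Z2| < b | Z1 = z1): no rejection at either look.
        upper = b * 2 ** 0.5 - z1
        lower = -b * 2 ** 0.5 - z1
        return STD_NORMAL.pdf(z1) * (STD_NORMAL.cdf(upper) - STD_NORMAL.cdf(lower))

    # Composite Simpson's rule over the continuation region -a < z1 < a.
    h = 2 * a / n_steps
    total = continue_density(-a) + continue_density(a)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * continue_density(-a + i * h)
    prob_never_reject = total * h / 3
    return 1 - prob_never_reject


print(round(overall_alpha_two_looks(0.0054, 0.0492), 4))  # table thresholds
print(round(overall_alpha_two_looks(0.05, 0.05), 4))      # naive: 0.05 at both looks
```

The table's thresholds should come out close to 0.05 overall, while using 0.05 at both looks lands noticeably higher.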

What makes GST powerful

  • Familiar framework: p-values, α, power — the same statistical language most teams already use
  • Well-understood error control: decades of theory and validation, especially in regulated industries (pharma, clinical research)
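  • Compatible with standard tooling: boundaries can be computed with established statistical packages (for example, gsDesign or rpact in R), and some experimentation platforms offer GST-style sequential testing. Check the documentation, though: a platform's "sequential testing" feature may actually be an always-valid method (Optimizely's Stats Engine, discussed below, is one example)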

The constraint you have to accept

GST is rigid by design. The statistical guarantees hold only if you follow the pre-specified plan:

  • You cannot add extra looks that weren't planned
  • You cannot stop at a time that doesn't correspond to a pre-specified boundary
  • Deviating from the plan in a data-dependent way (for example, adding a look because the interim numbers look promising) invalidates the error control

This rigidity is a feature in controlled environments (clinical trials). In fast-moving product teams, it can feel like a constraint.


3. Always-Valid Inference — Look Whenever You Want

The core idea

Always-Valid Inference takes a fundamentally different approach. Instead of allocating an α budget across pre-specified looks, AVI constructs test statistics that are valid at any stopping time — including continuous monitoring.

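The key mathematical object is the e-value (more precisely, a sequence of e-values called an e-process); confidence sequences and anytime-valid p-values are closely related objects built on the same idea. A standard p-value controls the false positive rate only at a pre-specified sample size. An anytime-valid p-value carries a stronger guarantee: under the null hypothesis, the probability that it ever drops below α, at any point during monitoring, is at most α, regardless of when you stop and look.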
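
Written out, the two guarantees differ in where the "for all sample sizes" sits. The notation below ($p_n$ for the p-value computed after $n$ observations) is introduced here for illustration only.

```latex
% Fixed-horizon p-value: the guarantee holds at one pre-specified sample size n
\Pr_{H_0}\left( p_n \le \alpha \right) \le \alpha
\quad \text{for the single, pre-specified } n

% Anytime-valid p-value: the guarantee holds simultaneously over all n
\Pr_{H_0}\left( \exists\, n \ge 1 : p_n \le \alpha \right) \le \alpha
```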

The most well-known implementation is mSPRT (mixture Sequential Probability Ratio Test), which underlies the "always-valid" inference used in platforms like Optimizely's Stats Engine and Stitch Fix's experimentation framework.

The intuition behind e-values

Think of it this way. A standard p-value is like a photograph — it's only valid at the moment it was taken. Look at it before the shutter clicks (too early) or after the scene has changed (too late), and it doesn't mean what you think.

An anytime-valid p-value is more like a live feed — it's valid whenever you choose to look, because the underlying test statistic is constructed to maintain error control over time.

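Mathematically, this relies on the optional stopping theorem and the concept of a test martingale: a nonnegative sequence of statistics that starts at 1 and, under the null hypothesis, keeps a constant expected value, so it cannot grow systematically no matter when you decide to stop. Ville's inequality then bounds the probability that it ever exceeds 1/α by α, which is what makes "reject as soon as the statistic reaches 1/α" a valid rule.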
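
A minimal sketch of such a statistic is the mixture likelihood ratio behind mSPRT. The version below assumes normal observations with known variance, a null hypothesis of mean zero, and a N(0, tau2) mixing distribution; these choices, the parameter values, and the function names are assumptions for illustration, not a description of any platform's implementation.

```python
import math
import random


def msprt_log_lr(sum_x, n, sigma2=1.0, tau2=1.0):
    """Log mixture likelihood ratio for H0: mean = 0 with N(0, tau2) mixing.

    Assumes observations are N(mean, sigma2) with known sigma2.
    """
    denom = sigma2 + n * tau2
    return 0.5 * math.log(sigma2 / denom) + tau2 * sum_x ** 2 / (2 * sigma2 * denom)


def ever_rejects(n_obs, alpha, rng, mean=0.0):
    """Monitor after every observation; reject once the LR reaches 1 / alpha."""
    threshold = math.log(1 / alpha)
    sum_x = 0.0
    for n in range(1, n_obs + 1):
        sum_x += rng.gauss(mean, 1.0)
        if msprt_log_lr(sum_x, n) >= threshold:
            return True
    return False


def continuous_monitoring_fpr(n_sims=2000, n_obs=300, alpha=0.05, seed=1):
    """Share of null (A/A) streams that are ever rejected under constant peeking."""
    rng = random.Random(seed)
    rejections = sum(ever_rejects(n_obs, alpha, rng) for _ in range(n_sims))
    return rejections / n_sims


print(continuous_monitoring_fpr())
```

Even though the statistic is checked after every single observation, the share of null streams that are ever rejected should stay at or below α. The anytime-valid p-value at time n is the smallest value of 1 / LR seen so far, capped at 1.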

What makes AVI powerful

  • Maximum operational flexibility: you can monitor continuously, stop at any point, resume after pausing, or run indefinitely
  • No pre-commitment required: no need to specify number of looks or timing in advance
  • Intuitive for data-driven teams: dashboards can show "valid" results at any moment, not just at pre-scheduled checkpoints

The tradeoff you have to accept

Flexibility comes at a cost: statistical power.

Always-valid tests are generally less powerful than GST at any fixed sample size. Because they must remain valid at every possible stopping point — including very early ones — they require more data to reach the same conclusions as a fixed-horizon or group sequential test.

In practical terms: if you run an AVI test and a GST test on the same experiment with the same target α and power, the AVI test will typically require a larger sample to detect the same effect.


4. Side-by-Side Comparison

| | Group Sequential Testing | Always-Valid Inference |
|---|---|---|
| Pre-specification required | Yes: number of looks, timing, boundaries | No |
| When can you look | Only at pre-specified interim points | Any time, continuously |
| Early stopping | Yes, at planned boundaries | Yes, at any point |
| Error control mechanism | Alpha spending (e.g., O'Brien-Fleming) | e-values / test martingales (e.g., mSPRT) |
| Statistical power | Higher at fixed sample | Lower (pays a power tax for flexibility) |
| Implementation complexity | Moderate | Higher (less standard tooling) |
| Best suited for | Planned experiments with fixed timelines | Continuous monitoring, always-on tests |

5. The Power Tax — How Much Does Flexibility Cost?

This is the most important practical tradeoff. AVI guarantees validity at every point in time, which means it has to "hedge" against the possibility that you stop very early — even when the evidence is thin. That hedging costs power.

As a rough benchmark: under typical experimental conditions, AVI approaches require roughly 10–30% more data than a well-designed fixed-horizon test to achieve the same power. The exact figure depends on the specific method, how it is tuned, and how large the true effect is, so treat the range as indicative. GST with O'Brien-Fleming boundaries sits between the two: slightly less efficient than a fixed-horizon test, but much more efficient than continuous monitoring.
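
If you want to measure the cost for your own setting, a simulation along the following lines is one way to do it. This is a toy sketch: one-sample normal data with unit variance, a hypothetical true effect, a normal-mixture mSPRT with a hypothetical mixing variance tau2, and a hard cap of n_max observations. The gap it reports is specific to those assumptions and can differ substantially from the rough benchmark above.

```python
import math
import random
from statistics import NormalDist


def compare_power(effect=0.2, n_max=200, alpha=0.05, tau2=0.04, n_sims=2000, seed=2):
    """Empirical power of a fixed-horizon z-test vs. a continuously monitored mSPRT.

    Toy setup: data ~ N(effect, 1), H0: mean = 0. Both tests see the same
    simulated stream and at most n_max observations.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    log_threshold = math.log(1 / alpha)
    fixed_hits = 0
    avi_hits = 0
    for _ in range(n_sims):
        sum_x = 0.0
        avi_rejected = False
        for n in range(1, n_max + 1):
            sum_x += rng.gauss(effect, 1.0)
            if not avi_rejected:
                # Mixture likelihood ratio with known unit variance.
                denom = 1.0 + n * tau2
                log_lr = 0.5 * math.log(1.0 / denom) + tau2 * sum_x ** 2 / (2 * denom)
                avi_rejected = log_lr >= log_threshold
        fixed_hits += abs(sum_x / math.sqrt(n_max)) > z_crit  # one look, at the end
        avi_hits += avi_rejected
    return fixed_hits / n_sims, avi_hits / n_sims


fixed_power, avi_power = compare_power()
print(f"fixed-horizon power: {fixed_power:.2f}, always-valid power: {avi_power:.2f}")
```

Rerunning with a larger n_max until the always-valid power matches the fixed-horizon power gives a direct estimate of the extra data needed in that particular setup.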

Whether this cost is worth paying depends entirely on your operational context.


6. Which One Should You Use?

Neither framework is universally better. The choice depends on how your team operates.

Choose Group Sequential Testing if:

  • You run experiments with a defined end date and can plan interim looks in advance
  • You want maximum statistical efficiency
  • Your experimentation platform already supports GST
  • Your organization values methodological transparency and documented stopping rules

Choose Always-Valid Inference if:

  • You run continuous or "always-on" experiments with no fixed end date
  • Your team monitors dashboards daily and it's unrealistic to enforce "no peeking" policies
  • You need to stop and restart experiments based on business events (launches, seasonality)
  • You prefer a simpler operational story: "the dashboard is always valid"

Use both, carefully, if:

  • You have a heterogeneous experiment portfolio — some short planned tests (GST), some long-running feature flags (AVI)

7. A Note on Terminology

The terminology in this space is inconsistent across platforms and papers, which causes a lot of confusion.

  • "Sequential testing" sometimes refers specifically to GST, and sometimes to the broader family of methods that includes AVI
  • "Always-valid p-values" and "anytime-valid p-values" are used interchangeably
  • mSPRT is one specific implementation of AVI, not the only one
  • Confidence sequences are the interval analog of anytime-valid p-values — a sequence of confidence intervals that jointly cover the true parameter with high probability, regardless of when you stop
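
To see what a confidence sequence looks like next to an ordinary confidence interval, here is a small sketch. It assumes normal data with known standard deviation and uses the normal-mixture construction (a hypothetical mixing variance tau2, centred on each candidate value); other constructions exist and give different widths.

```python
import math
from statistics import NormalDist


def fixed_ci_half_width(n, sigma=1.0, alpha=0.05):
    """Half-width of the usual z-interval, valid at one pre-specified n."""
    return NormalDist().inv_cdf(1 - alpha / 2) * sigma / math.sqrt(n)


def mixture_cs_half_width(n, sigma=1.0, tau2=1.0, alpha=0.05):
    """Half-width at time n of a normal-mixture confidence sequence for a mean.

    Obtained by inverting the mixture likelihood ratio (known variance
    sigma**2): keep every candidate mean whose LR is still below 1 / alpha.
    """
    sigma2 = sigma ** 2
    denom = sigma2 + n * tau2
    log_term = math.log(math.sqrt(denom / sigma2) / alpha)
    return math.sqrt(2 * sigma2 * denom * log_term / tau2) / n


for n in (10, 100, 1000, 10000):
    ci = fixed_ci_half_width(n)
    cs = mixture_cs_half_width(n)
    print(f"n={n:>6}  fixed CI ±{ci:.4f}  confidence sequence ±{cs:.4f}  ratio={cs / ci:.2f}")
```

The confidence sequence is wider at every n. That extra width is the interval-side view of the power cost of being valid at all times.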

When evaluating a platform's claims about "valid sequential testing," it's worth asking: is this GST (pre-specified looks, alpha spending) or AVI (anytime valid, e-values)? The answer affects how you should design and interpret experiments.


Takeaway

Both Group Sequential Testing and Always-Valid Inference solve the peeking problem — but they make fundamentally different bets about how your team operates.

GST says: commit to your plan, and we'll guarantee validity at those checkpoints. AVI says: look whenever you want, and we'll guarantee validity throughout — but you'll pay a power cost.

In short:

GST trades flexibility for efficiency. AVI trades efficiency for flexibility. Neither is free.

Choose the one that fits how your team actually runs experiments — not the one that sounds more rigorous on paper.
