SQL Growth

Always-Valid Inference

by DataMarvin
3 hours ago

In a previous post, we compared Group Sequential Testing (GST) and Always-Valid Inference (AVI) as two ways to stop an experiment early. We established the key difference: GST requires you to pre-specify when you'll look; AVI lets you look whenever you want.

But the "how" of AVI — why it works mathematically, what it actually constructs — deserves its own treatment. This post is that treatment. We'll build up the intuition for Always-Valid Inference from scratch, without assuming prior familiarity with the technical machinery.


1. The Problem, Restated Precisely

Standard hypothesis testing produces a p-value that is only valid at a fixed, pre-determined sample size. The guarantee is:

Under the null hypothesis, if we collect exactly n observations and compute the p-value once, it will be below α with probability at most α.

This guarantee breaks the moment you check the p-value before reaching n, because you've changed the procedure. You're no longer computing one p-value at n observations — you're computing many p-values at n₁, n₂, n₃, ... observations and stopping when one crosses the threshold.

As we showed in the Sequential Testing post, doing this without adjustment inflates your false positive rate dramatically. With 14 interim looks, your effective false positive rate reaches ~25% even though you set α = 0.05.
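To see the inflation for yourself, here is a minimal simulation sketch. It assumes normally distributed data with known variance, a two-sided z-test, and equally spaced looks; the batch size, seed, and function name are illustrative choices, not the setup from the earlier post, so the exact rate it prints will depend on those choices.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks=14, batch=100, alpha=0.05,
                                n_sims=20_000, seed=0):
    """Estimate how often a null experiment is declared significant
    when a two-sided z-test is run after every batch of observations."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(looks):
            # The sum of `batch` independent N(0, 1) draws is N(0, batch),
            # so one draw per look is enough.
            total += rng.gauss(0.0, math.sqrt(batch))
            n += batch
            if abs(total / math.sqrt(n)) > z_crit:
                false_positives += 1
                break  # stop at the first "significant" look
    return false_positives / n_sims


single_look = peeking_false_positive_rate(looks=1)
many_looks = peeking_false_positive_rate(looks=14)
print(f"1 look:   {single_look:.3f}")
print(f"14 looks: {many_looks:.3f}")
```

With a single look the rate should land near the nominal 5%; with 14 looks it should come out several times higher.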

The question AVI answers: is it possible to construct a test statistic that remains valid — in the sense that the false positive rate stays below α — no matter when you stop?

The answer is yes. But it requires building a fundamentally different kind of test statistic.


2. What "Always Valid" Actually Means

The formal guarantee of Always-Valid Inference is:

Under the null hypothesis, the probability that the test statistic ever crosses the rejection threshold — at any sample size, at any stopping time — is at most α.

This is a much stronger statement than the standard guarantee. Standard testing says: the false positive rate is controlled at a specific sample size. AVI says: the false positive rate is controlled across all sample sizes simultaneously.

The technical name for this guarantee is anytime validity or uniform validity over time.

To appreciate how unusual this is: imagine a security system that guarantees "the alarm won't go off falsely at exactly 3pm." That's the standard guarantee. AVI is more like a security system that guarantees "the alarm will never go off falsely — not at 3pm, not at 4pm, not at any time you check."


3. Why Standard P-Values Can't Do This

It's worth understanding precisely why the standard p-value fails under continuous monitoring.

A standard p-value under the null hypothesis is uniformly distributed on [0, 1]. This means:

  • P(p-value < 0.05) = 0.05 at any fixed sample size ✓
  • But P(p-value < 0.05 at some point during sequential monitoring) ≫ 0.05 ✗

The reason: each look gives the p-value another chance to dip below 0.05 by random fluctuation alone. Successive p-values are computed on overlapping data, so they are correlated rather than independent draws, but the correlation only slows the inflation; it does not stop it.

This is sometimes called the random walk problem. Under the null, the running sum of observations behind the test statistic is a random walk, and by the law of the iterated logarithm the standardized statistic eventually exceeds any fixed critical value if you keep monitoring long enough.

What we need instead is a test statistic whose behavior under the null is fundamentally different — one that doesn't tend to grow over time when there's no real effect, and whose crossing of a threshold is genuinely informative rather than an artifact of repeated observation.


4. The Key Insight: From P-Values to E-Values

The breakthrough in AVI comes from replacing the p-value with a different quantity: an e-value.

What is an e-value?

An e-value is a non-negative random variable E such that, under the null hypothesis:

E[E] ≤ 1

That's the entire definition. The expected value of an e-value, under the null, is at most 1.

This might seem like a strange definition. Why does this property matter?

It matters because of Markov's inequality: for any non-negative random variable E,

P(E ≥ 1/α) ≤ α · E[E] ≤ α

So if E is an e-value (E[E] ≤ 1 under the null), then P(E ≥ 1/α) ≤ α under the null. You reject when E ≥ 1/α, and your false positive rate is controlled at α.
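A quick numerical check of the Markov bound, as a sketch. The e-value used here is a likelihood ratio of N(delta, 1) against N(0, 1) at one fixed sample size; that choice, along with n, delta, the seed, and the function name, is an assumption made for illustration.

```python
import math
import random


def likelihood_ratio_e_value(sample_sum, n, delta):
    """Likelihood ratio of N(delta, 1) against N(0, 1) for n observations
    whose sum is `sample_sum`."""
    return math.exp(delta * sample_sum - n * delta ** 2 / 2)


rng = random.Random(1)
alpha, n, delta, n_sims = 0.05, 25, 0.2, 200_000

# Under the null, the sum of n standard normal draws is N(0, n).
e_values = [
    likelihood_ratio_e_value(rng.gauss(0.0, math.sqrt(n)), n, delta)
    for _ in range(n_sims)
]

mean_e = sum(e_values) / n_sims
rejection_rate = sum(e >= 1 / alpha for e in e_values) / n_sims
print(f"mean of E under the null: {mean_e:.3f}")
print(f"P(E >= 1/alpha):          {rejection_rate:.4f}")
```

The simulated mean should sit close to 1, and the rejection rate should fall well under α. Markov's inequality is a loose bound, which is part of why e-value tests give up some power.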

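More importantly, the same rejection rule survives continuous monitoring once the e-value is updated sequentially. Markov's inequality by itself covers only a single look at a sample size fixed in advance. The anytime guarantee comes from its sequential counterpart, Ville's inequality: if the running statistic E_t is a non-negative supermartingale under the null with starting value at most 1 (an "e-process", such as a likelihood ratio updated after each observation), then P(E_t ≥ 1/α for some t) ≤ α. You can stop after 100 observations or after 10,000, and the false positive guarantee remains.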

E-values vs. p-values

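|                                    | p-value           | e-value                              |
|------------------------------------|-------------------|--------------------------------------|
| Null distribution                  | Uniform [0,1]     | Non-negative, mean ≤ 1               |
| Rejection rule                     | p < α             | E ≥ 1/α                              |
| Valid at fixed n?                  | Yes               | Yes                                  |
| Valid under continuous monitoring? | No                | Yes                                  |
| Can be combined across studies?    | Not directly      | Yes (multiply e-values)              |
| Statistical power                  | Higher at fixed n | Lower (pays a cost for flexibility)  |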

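The second-to-last row of that table (e-values can be multiplied across independent studies) describes a remarkable property that p-values don't have. The product of two independent e-values is itself an e-value, because the expectation of a product of independent variables is the product of their expectations. That makes meta-analysis and sequential combination mathematically clean.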
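The multiplication property is easy to check numerically. The sketch below assumes two independent "studies", each summarized by a normal likelihood-ratio e-value; the sample sizes, target effects, seed, and helper name are made up for illustration.

```python
import math
import random


def lr_e_value(rng, n, delta):
    """One e-value: likelihood ratio of N(delta, 1) vs N(0, 1), computed on
    n null observations (whose sum is distributed N(0, n))."""
    s = rng.gauss(0.0, math.sqrt(n))
    return math.exp(delta * s - n * delta ** 2 / 2)


rng = random.Random(2)
alpha, n_sims = 0.05, 200_000

# Two independent studies with different sizes and target effects,
# both run under the null.
products = [
    lr_e_value(rng, n=25, delta=0.2) * lr_e_value(rng, n=16, delta=0.25)
    for _ in range(n_sims)
]

mean_product = sum(products) / n_sims
combined_rejection_rate = sum(p >= 1 / alpha for p in products) / n_sims
print(f"mean of E1 * E2 under the null: {mean_product:.3f}")
print(f"P(E1 * E2 >= 1/alpha):          {combined_rejection_rate:.4f}")
```

The product keeps a null mean of about 1, so rejecting when it reaches 1/α still controls the false positive rate for the combined evidence.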


5. Building an E-Value in Practice: The Likelihood Ratio

The most natural way to construct an e-value is through a likelihood ratio: how much more likely is the observed data under the alternative hypothesis than under the null?

E = P(data | alternative) / P(data | null)

If the data look much more like the alternative than the null, E is large — and you reject. If the data look equally consistent with both, E hovers near 1. If the data look more like the null, E is small.

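Under the null hypothesis, E[E] = E[P(data | alternative) / P(data | null)] ≤ 1. The expectation weights each possible dataset by P(data | null), which cancels the denominator and leaves the total probability that the alternative assigns to datasets possible under the null, and that is at most 1. So the likelihood ratio against a fully specified null is always a valid e-value.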
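Updated one observation at a time, the likelihood ratio becomes a running statistic, and by Ville's inequality the chance that it ever reaches 1/α under the null is at most α. The sketch below checks that over a finite horizon. It assumes unit-variance normal data and a single fixed alternative delta; the horizon, seed, and function name are illustrative.

```python
import math
import random


def crosses_threshold(rng, true_mean, delta, alpha, horizon):
    """Update the likelihood ratio of N(delta, 1) vs N(0, 1) after every
    observation and report whether it ever reaches 1/alpha."""
    log_e = 0.0
    log_threshold = math.log(1 / alpha)
    for _ in range(horizon):
        x = rng.gauss(true_mean, 1.0)
        log_e += delta * x - delta ** 2 / 2  # log of the one-step ratio
        if log_e >= log_threshold:
            return True
    return False


rng = random.Random(3)
alpha, delta, horizon, n_sims = 0.05, 0.3, 300, 3_000

# Checking after every single observation, first with no effect...
null_rate = sum(
    crosses_threshold(rng, 0.0, delta, alpha, horizon) for _ in range(n_sims)
) / n_sims
# ...then with a true effect equal to the alternative.
alt_rate = sum(
    crosses_threshold(rng, delta, delta, alpha, horizon) for _ in range(n_sims)
) / n_sims

print(f"ever crossed 1/alpha under the null:        {null_rate:.3f}")
print(f"ever crossed 1/alpha under the alternative: {alt_rate:.3f}")
```

Even though the statistic is checked 300 times per experiment, the null crossing rate should stay at or below α (up to simulation noise), while a real effect still gets detected most of the time.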

The challenge: you usually don't know the alternative hypothesis precisely. You need to specify what effect size you're looking for. One solution, which we'll cover in depth in the next post, is to average the likelihood ratio over a distribution of plausible alternatives. The result is a mixture likelihood ratio, the statistic behind the mSPRT (mixture Sequential Probability Ratio Test).
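For normal data with known variance and a normal mixing distribution, the mixture likelihood ratio has a closed form, sketched below. This is the one-sample case only; the prior variance tau2, the simulated effect of 0.5, the seed, and the function name are assumptions for illustration, not any platform's implementation.

```python
import math
import random


def msprt_statistic(sample_mean, n, sigma2, tau2, theta0=0.0):
    """Mixture likelihood ratio for H0: mean = theta0, with N(mean, sigma2)
    observations and a N(theta0, tau2) mixing distribution over the mean."""
    shrink = sigma2 / (sigma2 + n * tau2)
    exponent = (n ** 2 * tau2 * (sample_mean - theta0) ** 2
                / (2 * sigma2 * (sigma2 + n * tau2)))
    return math.sqrt(shrink) * math.exp(exponent)


# Hypothetical stream with a real effect: true mean 0.5, null mean 0.
rng = random.Random(4)
alpha, sigma2, tau2, horizon = 0.05, 1.0, 0.25, 2_000

total, stopped_at = 0.0, None
for n in range(1, horizon + 1):
    total += rng.gauss(0.5, math.sqrt(sigma2))
    # Check after every observation; stop the first time we reach 1/alpha.
    if msprt_statistic(total / n, n, sigma2, tau2) >= 1 / alpha:
        stopped_at = n
        break

print(f"rejected the null after {stopped_at} observations")
```

The statistic starts below 1 when the sample mean sits at the null value and grows as the mean drifts away from it, so no single alternative effect size has to be guessed in advance.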


6. Anytime-Valid Confidence Intervals

E-values have a natural companion: confidence sequences.

A confidence sequence is a sequence of intervals [l_t, u_t], one for each sample size t, such that the true parameter is contained in all of them simultaneously with probability at least 1 − α.

This is different from a standard confidence interval, which guarantees coverage only at a fixed sample size. A confidence sequence guarantees joint coverage over all time — the interval at every single time step simultaneously contains the truth with high probability.

Standard CI:      P(θ ∈ [l_n, u_n]) ≥ 1 − α               (at fixed n only)
Confidence seq:   P(θ ∈ [l_t, u_t] for all t) ≥ 1 − α

In practice, a confidence sequence looks like a funnel: it starts wide when you have little data and narrows as evidence accumulates. But unlike standard CIs, which give you a falsely narrow interval if you stopped early, the confidence sequence is always valid — it automatically accounts for the fact that you might stop at any point.
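One way to build such a funnel is to collect every candidate value of the parameter that a mixture likelihood ratio test would not yet reject. The sketch below does this for a normal mean with known variance and a normal mixing distribution, then checks joint coverage by simulation. The parameter values, horizon, seed, and function names are illustrative assumptions.

```python
import math
import random


def cs_radius(n, sigma2, tau2, alpha):
    """Half-width at sample size n of the confidence sequence obtained by
    inverting a normal-mixture likelihood ratio (known variance sigma2)."""
    scale = 2 * sigma2 * (sigma2 + n * tau2) / (n ** 2 * tau2)
    log_term = math.log(math.sqrt((sigma2 + n * tau2) / sigma2) / alpha)
    return math.sqrt(scale * log_term)


def always_covers(rng, theta, radii, sigma2):
    """True if the running mean stays within the radius of theta at every
    sample size, i.e. every interval in the sequence contains theta."""
    total = 0.0
    for n, radius in enumerate(radii, start=1):
        total += rng.gauss(theta, math.sqrt(sigma2))
        if abs(total / n - theta) >= radius:
            return False
    return True


rng = random.Random(5)
alpha, sigma2, tau2, theta = 0.05, 1.0, 1.0, 0.3
horizon, n_sims = 500, 2_000

radii = [cs_radius(n, sigma2, tau2, alpha) for n in range(1, horizon + 1)]
coverage = sum(
    always_covers(rng, theta, radii, sigma2) for _ in range(n_sims)
) / n_sims

print(f"half-width at n=10:  {radii[9]:.3f}")
print(f"half-width at n=500: {radii[-1]:.3f}")
print(f"share of runs where every interval covered theta: {coverage:.3f}")
```

The interval at step n is the running mean plus or minus `radii[n - 1]`. The coverage figure counts a run as a success only if all 500 intervals contain the true value, which is the joint guarantee described above.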


7. The Power Cost: What You Pay for Flexibility

Always-Valid Inference is not free. The anytime validity guarantee comes at a cost in statistical power.

Intuitively: to guarantee that the false positive rate is controlled at every possible stopping time, AVI has to be conservative at early time points — when you have little data, the threshold for rejection must be high. This conservatism persists throughout the test, resulting in a power loss relative to a fixed-horizon test.

The rough magnitude: under typical experimental conditions, AVI approaches require 10–30% more data than a fixed-horizon test to achieve the same statistical power.

Efficiency (approximate):

  Fixed-horizon test:        ████████████████████  100%
  Group Sequential (O'B-F):  ██████████████████░░  ~95%
  Always-Valid Inference:    ████████████████░░░░  ~75–85%

Whether this cost is acceptable depends on the operational context. For teams that can't realistically enforce "no peeking" policies, AVI's validity guarantee is worth the power cost. For teams that can pre-specify their analysis schedule, Group Sequential Testing is more efficient.
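To get a feel for the cost, the sketch below runs a fixed-horizon z-test and a continuously monitored mixture likelihood ratio test on the same simulated data, both capped at the same sample size. The effect size, cap, prior variance, and seed are arbitrary assumptions. It compares power at a fixed cap rather than the extra data needed for equal power, so the gap it prints is specific to these settings and is not meant to reproduce the percentages above.

```python
import math
import random
from statistics import NormalDist


def msprt_statistic(sample_mean, n, sigma2, tau2):
    """Normal-mixture likelihood ratio for H0: mean = 0 (known variance)."""
    shrink = sigma2 / (sigma2 + n * tau2)
    exponent = n ** 2 * tau2 * sample_mean ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
    return math.sqrt(shrink) * math.exp(exponent)


rng = random.Random(6)
alpha, effect, sigma2, tau2 = 0.05, 0.2, 1.0, 0.04
n_max, n_sims = 196, 2_000  # n = 196 gives roughly 80% power for the z-test
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

fixed_rejections = 0
msprt_rejections = 0
for _ in range(n_sims):
    total, crossed = 0.0, False
    for n in range(1, n_max + 1):
        total += rng.gauss(effect, math.sqrt(sigma2))
        # The sequential test may reject at any sample size up to the cap.
        if not crossed and msprt_statistic(total / n, n, sigma2, tau2) >= 1 / alpha:
            crossed = True
    # The fixed-horizon test looks exactly once, at n_max.
    fixed_rejections += abs(total / math.sqrt(n_max * sigma2)) > z_crit
    msprt_rejections += crossed

fixed_power = fixed_rejections / n_sims
msprt_power = msprt_rejections / n_sims
print(f"fixed-horizon power at n={n_max}:     {fixed_power:.3f}")
print(f"always-valid power, capped at {n_max}: {msprt_power:.3f}")
```

The sequential test can stop early when the effect is large, but at the same maximum sample size it detects fewer true effects than the single-look test. That is the trade described in this section.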


8. AVI in Practice: What Platforms Actually Implement

Most "always-valid" implementations in commercial experimentation platforms are based on the mSPRT framework, which constructs e-values by mixing likelihood ratios over a prior distribution on the effect size.

Platforms that offer AVI-style testing:

  • Optimizely Stats Engine — early adopter, uses mSPRT-based approach
  • Statsig — offers sequential testing with always-valid guarantees
  • Stitchfix — published foundational mSPRT work for two-sample tests
  • Eppo (now part of Datadog) — supports sequential testing with anytime-valid guarantees

The exact implementation differs across platforms — choice of prior, how the mixing distribution is specified, whether confidence sequences are reported alongside p-values — but the underlying mathematical guarantee is the same.


Takeaway

Always-Valid Inference solves a real operational problem: it makes continuous monitoring of experiments statistically valid. The key departure from standard testing is replacing the p-value — which is only valid at fixed sample sizes — with an e-value, whose expected value under the null is bounded regardless of when you stop.

One sentence summary:

Always-Valid Inference works because e-values, unlike p-values, maintain their statistical validity under continuous monitoring: the false positive rate stays at or below α whether you stop at 100 observations or 100,000.
