SQL Growth

Stratified Sampling in A/B Testing

by DataMarvin
5 hours ago

You flip a coin 10 times and get 7 heads. That's not rigged — it's just randomness. Now imagine running an A/B test where, by chance, 70% of your high-value users end up in the treatment group. Your results will look great. But was it your treatment — or just the users?

This is the problem that stratified sampling solves.


1. The Problem With Pure Random Assignment

Simple random assignment works well in theory. Given a large enough sample, treatment and control groups should look similar on average. But "on average" hides a lot.

In practice, especially with smaller experiments, pure randomization can produce imbalanced groups. If your treatment group happens to contain more power users, more mobile users, or more users from a high-converting region — your results are confounded before the experiment even begins.

The risk isn't just bias. It's also variance. Even without systematic bias, random imbalance on key variables adds noise to your estimates — making it harder to detect a real effect.
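
To get a feel for how often chance imbalance shows up, here is a minimal sketch that computes the exact probability that a segment of m users, each assigned 50/50 at random, ends up split at least as unevenly as 60/40. The function name `prob_imbalance` and the 60% threshold are illustrative assumptions, not anything from a specific experimentation platform.

```python
from math import comb

def prob_imbalance(m: int, threshold: float = 0.6) -> float:
    """Probability that, out of m users each assigned 50/50 at random,
    at least `threshold` of them land in the same arm (either arm)."""
    # Count the outcomes where the larger arm holds at least threshold * m users.
    favorable = sum(
        comb(m, k)
        for k in range(m + 1)
        if max(k, m - k) >= threshold * m
    )
    return favorable / 2 ** m

# Segment sizes are hypothetical; the point is how fast the risk shrinks with m.
for m in (10, 100, 1000):
    print(m, round(prob_imbalance(m), 4))
```

With only 10 users in a segment, a 60/40 split or worse is the norm rather than the exception. With 1,000 users it is very rare, which is why small experiments and small segments are where imbalance hurts.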


2. What Stratified Sampling Does

Stratified sampling divides your user population into subgroups — called strata — based on characteristics that matter, and then randomizes within each stratum.

The goal is to guarantee that each stratum is proportionally represented in both treatment and control — by design, not by luck.

Full population
  ↓
Split into strata (e.g., by user tier, platform, region)
  ↓
Randomize within each stratum
  ↓
Treatment and control are balanced on stratification variables

For example, if 20% of your users are on iOS, stratified sampling ensures that roughly 20% of both your treatment group and control group are iOS users — not 12% in one and 28% in the other.
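
The idea can be sketched in a few lines of Python: group users by stratum, shuffle within each group, and send half of each group to treatment. This is a simplified illustration, not a production bucketing system. The function `stratified_assign`, the `stratum_of` mapping, and the toy population are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_assign(users, stratum_of, seed=0):
    """Randomly split users 50/50 within each stratum.

    users: iterable of user IDs
    stratum_of: dict mapping user ID -> stratum label (hypothetical input)
    Returns a dict mapping user ID -> "treatment" or "control".
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in users:
        by_stratum[stratum_of[u]].append(u)

    assignment = {}
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        half = len(members) // 2
        # For an odd-sized stratum, a fair coin decides which arm gets the extra user.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            half += 1
        for i, u in enumerate(members):
            assignment[u] = "treatment" if i < half else "control"
    return assignment

# Toy population: 20% iOS, 80% Android.
users = list(range(100))
stratum_of = {u: ("ios" if u < 20 else "android") for u in users}
assignment = stratified_assign(users, stratum_of)
ios_in_treatment = sum(
    1 for u in users if assignment[u] == "treatment" and stratum_of[u] == "ios"
)
print(ios_in_treatment)  # 10 of the 50 treatment users, i.e. 20%
```

Whatever seed you pick, the iOS share in each arm stays at 20% here, because the split happens inside each stratum rather than across the whole population.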


3. A Concrete Example

Suppose you're testing a new recommendation algorithm on a shopping app. You know from past data that user spending tier (low / mid / high) strongly predicts purchase behavior.

Without stratification:

                Treatment   Control
Low spenders    55%         45%
Mid spenders    50%         50%
High spenders   38%         62%

High spenders ended up skewed toward control — by chance. Now your treatment looks worse than it actually is.

With stratification by spending tier:

                Treatment   Control
Low spenders    50%         50%
Mid spenders    50%         50%
High spenders   50%         50%

Balanced by design. Your estimate of the treatment effect is now cleaner.
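
To see what "cleaner" means in numbers, here is a small simulation sketch. It builds a synthetic population where spending tier drives the outcome, applies no treatment effect at all (an A/A test), and compares how much the estimated difference bounces around under simple versus stratified randomization. The tier sizes, means, and noise level are made-up values chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: spending tier strongly drives the outcome.
tiers = ["low"] * 150 + ["mid"] * 100 + ["high"] * 50
tier_mean = {"low": 10.0, "mid": 50.0, "high": 200.0}
outcome = [rng.gauss(tier_mean[t], 5.0) for t in tiers]  # no treatment effect (A/A)

def diff_in_means(treated_idx):
    """Treatment mean minus control mean for a given set of treated users."""
    treated = set(treated_idx)
    t = [y for i, y in enumerate(outcome) if i in treated]
    c = [y for i, y in enumerate(outcome) if i not in treated]
    return statistics.mean(t) - statistics.mean(c)

def simple_draw():
    """Pure random assignment: shuffle everyone, take the first half."""
    idx = list(range(len(outcome)))
    rng.shuffle(idx)
    return idx[: len(idx) // 2]

def stratified_draw():
    """Stratified assignment: shuffle within each tier, take half of each."""
    treated = []
    for tier in tier_mean:
        idx = [i for i, t in enumerate(tiers) if t == tier]
        rng.shuffle(idx)
        treated += idx[: len(idx) // 2]
    return treated

simple_sd = statistics.stdev([diff_in_means(simple_draw()) for _ in range(500)])
stratified_sd = statistics.stdev([diff_in_means(stratified_draw()) for _ in range(500)])
print(f"simple: {simple_sd:.2f}  stratified: {stratified_sd:.2f}")
```

Since the true effect is zero, any spread in the estimates is pure noise. In this toy setup the stratified design removes the noise that comes from tiers landing unevenly across arms, leaving only the within-tier noise.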


4. Stratification vs. CUPED — What's the Difference?

If you read the previous post on CUPED, you might be wondering: aren't these doing the same thing?

They're related, but different in approach.

                              Stratified Sampling                    CUPED
When it acts                  Before the experiment (at assignment)  After the experiment (at analysis)
What it controls              Discrete group imbalance               Continuous covariate variance
Requires pre-experiment data  For defining strata                    For covariate adjustment
Scope                         Randomization design                   Statistical analysis

Stratified sampling is a design-time intervention — you bake balance into the experiment from the start. CUPED is an analysis-time intervention — you adjust for pre-existing differences after the fact.

They're complementary. Many teams use both: stratify at assignment to prevent imbalance, then apply CUPED at analysis to further reduce variance.
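
For readers who want to see the analysis-time half of that pairing, here is a minimal sketch of the usual CUPED adjustment: subtract theta times the centered pre-experiment covariate from each outcome, with theta = cov(x, y) / var(x). The function name `cuped_adjust` and the synthetic data are assumptions for illustration. A real analysis would apply this to both arms and then compare the adjusted means.

```python
import random
import statistics

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x) estimated on the pooled data."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

rng = random.Random(0)
pre = [rng.gauss(100, 30) for _ in range(1000)]   # pre-experiment spend (synthetic)
post = [0.8 * p + rng.gauss(0, 10) for p in pre]  # in-experiment spend (synthetic)
adjusted = cuped_adjust(post, pre)
print(statistics.variance(post), statistics.variance(adjusted))
```

The adjustment leaves the mean of the outcome unchanged but shrinks its variance when the pre-period covariate is correlated with the outcome, which is the same condition that makes a variable a good stratifier.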


5. How to Choose Your Strata

Not every variable is worth stratifying on. The benefit of stratification comes from reducing variance, and variance reduction only happens if the strata are actually predictive of your outcome metric.

Good candidates for strata:

  • User value tier (low / mid / high spender) — if revenue is your metric
  • Platform (iOS / Android / web) — if behavior differs significantly across platforms
  • Geography (country or region) — if conversion rates vary by market
  • User tenure (new / returning) — if new users behave very differently from veterans
  • Prior engagement level — heavy vs. casual users

Rules of thumb:

  1. Stratify on variables that are strongly correlated with your outcome. If the variable doesn't predict behavior, stratifying on it adds complexity without benefit.
  2. Keep the number of strata manageable. Too many strata means too few users per cell, and thin cells are hard to randomize meaningfully.
  3. Strata should be defined before the experiment starts — never based on data collected during the experiment.
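
One rough way to check the first rule on historical data is to measure how much of the outcome's variance lies between strata rather than within them. The sketch below computes that share (between-stratum sum of squares over total sum of squares). The function name `variance_explained` and the tiny dataset are hypothetical, and this is only one of several reasonable ways to score a candidate variable.

```python
import statistics
from collections import defaultdict

def variance_explained(outcomes, strata):
    """Share of outcome variance that lies between strata (0 to 1)."""
    grand_mean = statistics.mean(outcomes)
    groups = defaultdict(list)
    for y, s in zip(outcomes, strata):
        groups[s].append(y)
    total_ss = sum((y - grand_mean) ** 2 for y in outcomes)
    between_ss = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss

# Hypothetical pre-experiment data: revenue per user and two candidate stratifiers.
revenue = [5, 7, 6, 40, 42, 38, 6, 41]
tier = ["low", "low", "low", "high", "high", "high", "low", "high"]
platform = ["ios", "android", "ios", "android", "ios", "android", "android", "ios"]
print(variance_explained(revenue, tier))
print(variance_explained(revenue, platform))
```

In this made-up data, spending tier explains almost all of the variance in revenue while platform explains almost none, so tier would be worth stratifying on and platform would not.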

6. Proportional vs. Disproportional Stratification

By default, most teams use proportional stratification — each stratum contributes to treatment and control in proportion to its share of the total population. If 30% of users are mobile, 30% of each group is mobile.

But there's a case for disproportional stratification: intentionally oversampling smaller but important strata to ensure you have enough statistical power to analyze them separately.

For example, suppose only 5% of your users are from a key new market you're expanding into. In an experiment with, say, 10,000 users per group, proportional sampling gives you only about 500 users per group from that market, which is often too few for a subgroup analysis. Disproportional stratification lets you allocate more users from that stratum on purpose.

The tradeoff: disproportional stratification requires weighting when computing overall estimates, which adds analytical complexity.
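
The weighting itself is simple arithmetic: estimate the effect inside each stratum, then combine those effects using each stratum's share of the real population, not its share of the experiment. The sketch below assumes per-stratum means are already computed. The dictionary keys and all numbers are hypothetical.

```python
def weighted_effect(strata):
    """Combine per-stratum effects using population shares as weights.

    strata: list of dicts with hypothetical keys
      'pop_share'  - stratum's share of the full user population (must sum to 1)
      'treat_mean' - mean outcome in treatment within the stratum
      'ctrl_mean'  - mean outcome in control within the stratum
    """
    total_share = sum(s["pop_share"] for s in strata)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(s["pop_share"] * (s["treat_mean"] - s["ctrl_mean"]) for s in strata)

# Assumed setup: the new market is 5% of the population but was oversampled
# to 30% of the experiment so it can be analyzed on its own.
strata = [
    {"pop_share": 0.95, "treat_mean": 10.5, "ctrl_mean": 10.0},  # established markets
    {"pop_share": 0.05, "treat_mean": 6.0, "ctrl_mean": 4.0},    # new market
]
overall = weighted_effect(strata)
print(round(overall, 3))
```

Here the population-weighted effect is 0.95 × 0.5 + 0.05 × 2.0 = 0.575. Weighting by the experiment's 70/30 sample shares instead would give 0.95, overstating the overall effect because the oversampled market responds more strongly.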


7. Stratification in Practice

Most modern experimentation platforms support stratified assignment, though the implementation details vary.

In practice, the most common approach is pre-stratification at the time of user bucketing — typically done by hashing user IDs within strata to ensure stable, reproducible assignment.
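
Platforms differ in how they do this, so treat the following as one possible sketch rather than how any particular tool works. It sorts users within each stratum by a salted SHA-256 hash of their ID and alternates arms down that order, which is deterministic and balanced for a fixed list of users. The salt value, function names, and user IDs are all hypothetical. Note the loud assumption: the full user list must be known at assignment time, because adding users later changes the ordering.

```python
import hashlib
from collections import defaultdict

def hash_key(user_id: str, salt: str) -> str:
    """Stable pseudo-random sort key; `salt` is a hypothetical experiment name."""
    return hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()

def hashed_stratified_assign(stratum_of: dict, salt: str) -> dict:
    """Order users by hash within each stratum, then alternate arms."""
    by_stratum = defaultdict(list)
    for user_id, stratum in stratum_of.items():
        by_stratum[stratum].append(user_id)

    assignment = {}
    for members in by_stratum.values():
        members.sort(key=lambda u: hash_key(u, salt))
        for rank, user_id in enumerate(members):
            # Even ranks go to treatment, so an odd-sized stratum gives
            # treatment one extra user.
            assignment[user_id] = "treatment" if rank % 2 == 0 else "control"
    return assignment

stratum_of = {f"user{i}": ("ios" if i % 5 == 0 else "android") for i in range(40)}
first = hashed_stratified_assign(stratum_of, salt="rec-algo-v2")
second = hashed_stratified_assign(stratum_of, salt="rec-algo-v2")
print(first == second)  # True: same inputs always give the same assignment
```

A common alternative for streaming traffic is to hash each user independently into a bucket. That stays stable as new users arrive, but balance within each stratum is then only approximate.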

A few things to watch for:

Late stratification doesn't work

If you define strata after users have already been assigned, you've lost the benefit of stratified randomization: balance is no longer guaranteed by design. Strata must be defined, and assignment must happen, before exposure.

Strata need to be stable

If a user's stratum membership can change during the experiment (e.g., they move from "low spender" to "mid spender"), you'll have an assignment consistency problem. Use strata based on behavior measured well before the experiment window.

Small strata can cause problems

If a stratum has very few users, the randomization within that stratum becomes noisy. Consider merging thin strata or collapsing categories before assignment.
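
Merging thin strata can be as simple as relabeling any stratum below a minimum size into a catch-all bucket before assignment. The sketch below does exactly that. The `min_size` cutoff of 50, the "other" label, and the region counts are arbitrary choices for illustration.

```python
from collections import Counter

def merge_thin_strata(strata, min_size, other_label="other"):
    """Relabel every stratum with fewer than `min_size` users as `other_label`."""
    counts = Counter(strata)
    return [s if counts[s] >= min_size else other_label for s in strata]

# Hypothetical region labels, one per user.
regions = ["US"] * 500 + ["DE"] * 300 + ["NZ"] * 12 + ["IS"] * 7
merged = merge_thin_strata(regions, min_size=50)
print(Counter(merged))
```

The catch-all bucket can itself end up small, as it does here with 19 users, so it is worth checking the merged counts before you run the assignment.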


8. When Stratification Matters Most

Stratification has the highest impact when:

  • Your experiment has a relatively small sample size — where random imbalance is more likely
  • Your outcome metric has high variance across user segments
  • You're planning subgroup analyses — if you want valid estimates per segment, you need adequate representation by design
  • You're running experiments in new or small markets where baseline behavior is less stable

For very large experiments (millions of users), pure randomization tends to produce balanced groups naturally — stratification is a "nice to have" rather than a necessity. But for smaller experiments, it can meaningfully reduce both bias risk and variance.


Takeaway

Pure random assignment is not always enough. When your user population is heterogeneous and your sample is limited, stratified sampling gives you balance by design — not by luck.

One sentence summary:

Don't leave group balance to chance — stratify on what matters before the experiment starts.

Used alongside CUPED at the analysis stage, stratified sampling is one of the most reliable ways to run cleaner, faster, more trustworthy experiments.
