Stratified Sampling in A/B Testing
You flip a coin 10 times and get 7 heads. That's not rigged — it's just randomness. Now imagine running an A/B test where, by chance, 70% of your high-value users end up in the treatment group. Your results will look great. But was it your treatment — or just the users?
This is the problem that stratified sampling solves.
1. The Problem With Pure Random Assignment
Simple random assignment works well in theory. Given a large enough sample, treatment and control groups should look similar on average. But "on average" hides a lot.
In practice, especially with smaller experiments, pure randomization can produce imbalanced groups. If your treatment group happens to contain more power users, more mobile users, or more users from a high-converting region, the measured difference between groups mixes the treatment effect with those pre-existing differences before the experiment even begins.
The risk isn't just bias. It's also variance. Even without systematic bias, random imbalance on key variables adds noise to your estimates — making it harder to detect a real effect.
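To get a feel for how often chance imbalance shows up, here is a minimal simulation sketch. It assumes each user is assigned by an independent 50/50 coin flip, and the segment sizes (40 and 4,000 high-value users) and the 10-point threshold are made up for illustration.

```python
import random

def treatment_share_of_segment(n_segment, rng):
    """Assign each user in a segment to treatment with probability 0.5
    and return the fraction that landed in treatment."""
    in_treatment = sum(rng.random() < 0.5 for _ in range(n_segment))
    return in_treatment / n_segment

def share_of_runs_imbalanced(n_segment, threshold=0.10, runs=2000, seed=0):
    """Fraction of simulated experiments where the segment's treatment share
    is more than `threshold` away from an even 50/50 split."""
    rng = random.Random(seed)
    off = sum(
        abs(treatment_share_of_segment(n_segment, rng) - 0.5) > threshold
        for _ in range(runs)
    )
    return off / runs

# Hypothetical segment sizes: a small experiment vs. a large one.
small = share_of_runs_imbalanced(n_segment=40)
large = share_of_runs_imbalanced(n_segment=4000)
print(f"40 high-value users:   {small:.1%} of runs are off by more than 10 points")
print(f"4000 high-value users: {large:.1%} of runs are off by more than 10 points")
```

The exact percentages depend on the seed, but the pattern should hold: the smaller the segment, the more often pure randomization splits it unevenly.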
2. What Stratified Sampling Does
Stratified sampling divides your user population into subgroups — called strata — based on characteristics that matter, and then randomizes within each stratum.
The goal is to guarantee that each stratum is proportionally represented in both treatment and control — by design, not by luck.
Full population
↓
Split into strata (e.g., by user tier, platform, region)
↓
Randomize within each stratum
↓
Treatment and control are balanced on stratification variables
For example, if 20% of your users are on iOS, stratified sampling ensures that roughly 20% of both your treatment group and control group are iOS users — not 12% in one and 28% in the other.
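The flow above can be sketched in a few lines of Python. This is one simple way to do it (shuffle within each stratum, then split in half), not the only one. The function name `stratified_assign`, the user IDs, and the platform mix are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_assign(users, stratum_of, seed=42):
    """Shuffle users within each stratum, then split each stratum in half.

    users: iterable of user ids
    stratum_of: dict mapping user id -> stratum label
    Returns a dict mapping user id -> "treatment" or "control".
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[stratum_of[user]].append(user)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = sorted(by_stratum[stratum])  # fixed order before shuffling
        rng.shuffle(members)
        half = len(members) // 2
        # With an odd-sized stratum, a coin flip decides which arm gets the extra user.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            half += 1
        for i, user in enumerate(members):
            assignment[user] = "treatment" if i < half else "control"
    return assignment

# Hypothetical population: 20% iOS, 50% Android, 30% web.
platforms = ["ios"] * 200 + ["android"] * 500 + ["web"] * 300
stratum_of = {f"user_{i}": p for i, p in enumerate(platforms)}
assignment = stratified_assign(stratum_of.keys(), stratum_of)

for arm in ("treatment", "control"):
    arm_users = [u for u, a in assignment.items() if a == arm]
    ios_share = sum(stratum_of[u] == "ios" for u in arm_users) / len(arm_users)
    print(arm, len(arm_users), f"iOS share = {ios_share:.0%}")
```

Because every stratum here has an even number of users, both arms end up with exactly the same platform mix as the full population.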
3. A Concrete Example
Suppose you're testing a new recommendation algorithm on a shopping app. You know from past data that user spending tier (low / mid / high) strongly predicts purchase behavior.
Without stratification:
| Spending tier | Treatment | Control |
|---|---|---|
| Low spenders | 55% | 45% |
| Mid spenders | 50% | 50% |
| High spenders | 38% | 62% |
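High spenders ended up skewed toward control by chance. Because high spenders purchase more regardless of which algorithm they see, the control group's average is inflated, and your treatment looks worse than it actually is.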
With stratification by spending tier:
| Spending tier | Treatment | Control |
|---|---|---|
| Low spenders | 50% | 50% |
| Mid spenders | 50% | 50% |
| High spenders | 50% | 50% |
Balanced by design. Your estimate of the treatment effect is now cleaner.
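A small simulation can make "cleaner" concrete. The sketch below assumes a hypothetical population of 300 users whose spend depends heavily on tier, with no true treatment effect at all, and compares how much the estimated effect bounces around under simple versus stratified assignment. All numbers are invented for illustration.

```python
import random
import statistics

# Hypothetical population: mean spend differs by tier, true treatment effect is zero.
TIER_MEAN = {"low": 10.0, "mid": 40.0, "high": 150.0}
TIERS = ["low"] * 150 + ["mid"] * 100 + ["high"] * 50

def diff_in_means(spend, in_treatment):
    treated = [y for y, t in zip(spend, in_treatment) if t]
    control = [y for y, t in zip(spend, in_treatment) if not t]
    return statistics.mean(treated) - statistics.mean(control)

def simple_assignment(rng):
    # Exactly half of all users go to treatment, ignoring tier.
    n = len(TIERS)
    flags = [True] * (n // 2) + [False] * (n - n // 2)
    rng.shuffle(flags)
    return flags

def stratified_assignment(rng):
    # Exactly half of each tier goes to treatment.
    flags = [False] * len(TIERS)
    for tier in TIER_MEAN:
        idx = [i for i, t in enumerate(TIERS) if t == tier]
        rng.shuffle(idx)
        for i in idx[: len(idx) // 2]:
            flags[i] = True
    return flags

def spread_of_estimates(assign, runs=1000, seed=1):
    """Standard deviation of the estimated effect across simulated experiments."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(runs):
        spend = [rng.gauss(TIER_MEAN[t], 5.0) for t in TIERS]
        estimates.append(diff_in_means(spend, assign(rng)))
    return statistics.stdev(estimates)

sd_simple = spread_of_estimates(simple_assignment)
sd_stratified = spread_of_estimates(stratified_assignment)
print(f"SD of estimated effect, simple randomization:     {sd_simple:.2f}")
print(f"SD of estimated effect, stratified randomization: {sd_stratified:.2f}")
```

The gap between the two is large here because tier explains most of the spend variance in this made-up population. With a weaker predictor the gain would be smaller.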
4. Stratification vs. CUPED — What's the Difference?
If you read the previous post on CUPED, you might be wondering: aren't these doing the same thing?
They're related, but different in approach.
| | Stratified Sampling | CUPED |
|---|---|---|
| When it acts | Before the experiment (at assignment) | After the experiment (at analysis) |
| What it controls | Discrete group imbalance | Continuous covariate variance |
| Requires pre-experiment data | For defining strata | For covariate adjustment |
| Scope | Randomization design | Statistical analysis |
Stratified sampling is a design-time intervention — you bake balance into the experiment from the start. CUPED is an analysis-time intervention — you adjust for pre-existing differences after the fact.
They're complementary. Many teams use both: stratify at assignment to prevent imbalance, then apply CUPED at analysis to further reduce variance.
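For the analysis-time half of that pairing, here is a minimal sketch of the standard CUPED adjustment: subtract theta times the centered pre-experiment covariate from each outcome, where theta = cov(X, Y) / var(X). The function name `cuped_adjust` and the spend figures are hypothetical, and a real analysis would estimate theta on pooled data from both arms before comparing adjusted means.

```python
import statistics

def cuped_adjust(outcome, covariate):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x) estimated from the data passed in."""
    x_bar = statistics.mean(covariate)
    y_bar = statistics.mean(outcome)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, outcome))
    var_x = sum((x - x_bar) ** 2 for x in covariate)
    theta = cov_xy / var_x
    return [y - theta * (x - x_bar) for y, x in zip(outcome, covariate)]

# Hypothetical data: spend before the experiment (covariate) and during it (outcome).
pre_spend = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
exp_spend = [12.0, 19.0, 33.0, 41.0, 48.0, 65.0]

adjusted = cuped_adjust(exp_spend, pre_spend)
print("variance before:", round(statistics.variance(exp_spend), 2))
print("variance after: ", round(statistics.variance(adjusted), 2))
```

The adjustment leaves the mean unchanged and only removes the part of the variance that the pre-experiment covariate already explains.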
5. How to Choose Your Strata
Not every variable is worth stratifying on. The benefit of stratification comes from reducing variance, and variance reduction only happens if the strata are actually predictive of your outcome metric.
Good candidates for strata:
- User value tier (low / mid / high spender) — if revenue is your metric
- Platform (iOS / Android / web) — if behavior differs significantly across platforms
- Geography (country or region) — if conversion rates vary by market
- User tenure (new / returning) — if new users behave very differently from veterans
- Prior engagement level — heavy vs. casual users
Rules of thumb:
- Stratify on variables that are strongly correlated with your outcome. If the variable doesn't predict behavior, stratifying on it adds complexity without benefit.
- Keep the number of strata manageable. Too many strata with too few users per cell lead to thin groups that are hard to randomize meaningfully.
- Strata should be defined before the experiment starts — never based on data collected during the experiment.
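One rough way to check whether a candidate variable is "strongly correlated with your outcome" is to measure how much of the outcome's variance sits between strata rather than within them. The sketch below computes that share on pre-experiment data. The function name `variance_explained` and the sample values are hypothetical, and this is a screening heuristic rather than a formal test.

```python
import statistics
from collections import defaultdict

def variance_explained(outcomes, labels):
    """Share of outcome variance that sits between strata (0 = none, 1 = all)."""
    grand_mean = statistics.mean(outcomes)
    groups = defaultdict(list)
    for y, label in zip(outcomes, labels):
        groups[label].append(y)
    between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values()
    )
    total = sum((y - grand_mean) ** 2 for y in outcomes)
    return between / total

# Hypothetical pre-experiment revenue per user, with two candidate stratifiers.
revenue = [5.0, 7.0, 6.0, 40.0, 45.0, 42.0, 150.0, 160.0]
tier = ["low", "low", "low", "mid", "mid", "mid", "high", "high"]
platform = ["ios", "android", "ios", "android", "ios", "android", "ios", "android"]

by_tier = variance_explained(revenue, tier)
by_platform = variance_explained(revenue, platform)
print(f"tier explains     {by_tier:.1%} of revenue variance")
print(f"platform explains {by_platform:.1%} of revenue variance")
```

In this toy data, spending tier would be a far better stratifier than platform, because platform barely separates high-revenue users from low-revenue ones.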
6. Proportional vs. Disproportional Stratification
By default, most teams use proportional stratification — each stratum contributes to treatment and control in proportion to its share of the total population. If 30% of users are mobile, 30% of each group is mobile.
But there's a case for disproportional stratification: intentionally oversampling smaller but important strata to ensure you have enough statistical power to analyze them separately.
For example, if only 5% of your users are from a key new market you're expanding into, proportional sampling with, say, 10,000 users per group would give you only 500 users from that market in each group, which may not be enough for a subgroup analysis. Disproportional stratification lets you allocate more users from that stratum on purpose.
The tradeoff: disproportional stratification requires weighting when computing overall estimates, which adds analytical complexity.
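The weighting step can be sketched as follows: estimate the effect inside each stratum, then combine the per-stratum effects using each stratum's share of the real population rather than its share of the experiment. The function name `weighted_effect`, the field names, and every number below are hypothetical.

```python
def weighted_effect(strata):
    """Combine per-stratum effects using population shares as weights.

    strata: list of dicts with keys
      "population_share": stratum's share of the full user base
      "treatment_mean", "control_mean": observed metric means in the sample
    """
    total_share = sum(s["population_share"] for s in strata)
    return sum(
        s["population_share"] / total_share * (s["treatment_mean"] - s["control_mean"])
        for s in strata
    )

# Hypothetical setup: the new market is 5% of users but 30% of the experiment sample.
strata = [
    {"name": "new_market", "population_share": 0.05, "sample_share": 0.30,
     "treatment_mean": 12.0, "control_mean": 10.0},
    {"name": "established", "population_share": 0.95, "sample_share": 0.70,
     "treatment_mean": 20.5, "control_mean": 20.0},
]

overall = weighted_effect(strata)
# For contrast: weighting by sample share overstates the oversampled stratum.
unweighted = sum(
    s["sample_share"] * (s["treatment_mean"] - s["control_mean"]) for s in strata
)
print(f"population-weighted effect: {overall:.3f}")
print(f"sample-weighted effect:     {unweighted:.3f}")
```

The two numbers differ because the oversampled market has a larger per-user effect in this made-up data. Skipping the reweighting would let it dominate the overall estimate.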
7. Stratification in Practice
Most modern experimentation platforms support stratified assignment, though the implementation details vary.
In practice, the most common approach is pre-stratification at the time of user bucketing — typically done by hashing user IDs within strata to ensure stable, reproducible assignment.
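Platforms differ in how they do this, so treat the following as one possible reading of "hashing user IDs within strata", not a description of any specific product. It assumes the full user list and each user's stratum are known before assignment. Users are ordered by a hash of experiment name plus user ID inside each stratum, and arms alternate down that order. The names `hash_key` and `assign_within_strata` are hypothetical.

```python
import hashlib
from collections import defaultdict

def hash_key(experiment, user_id):
    """Deterministic pseudo-random sort key for a user in a given experiment."""
    return hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()

def assign_within_strata(experiment, stratum_of):
    """Order users by hash inside each stratum and alternate arms.

    stratum_of: dict mapping user id -> stratum label, fixed before the experiment.
    Returns a dict mapping user id -> "treatment" or "control".
    """
    by_stratum = defaultdict(list)
    for user_id, stratum in stratum_of.items():
        by_stratum[stratum].append(user_id)

    assignment = {}
    for members in by_stratum.values():
        ordered = sorted(members, key=lambda u: hash_key(experiment, u))
        for position, user_id in enumerate(ordered):
            assignment[user_id] = "treatment" if position % 2 == 0 else "control"
    return assignment

# Hypothetical users: every fifth user is a high spender.
stratum_of = {f"user_{i}": ("high" if i % 5 == 0 else "low") for i in range(101)}
first = assign_within_strata("exp_1", stratum_of)
second = assign_within_strata("exp_1", stratum_of)
print("reproducible:", first == second)
```

Including the experiment name in the hash means a different experiment reshuffles the order, so the same users are not always grouped together. One simplification to be aware of: in an odd-sized stratum the extra user always lands in treatment here.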
A few things to watch for:
- Late stratification doesn't work: if you define strata after users have already been assigned, you've lost the balance guarantee that stratified assignment provides. Strata must be defined, and assignment must happen, before exposure.
- Strata need to be stable: if a user's stratum membership can change during the experiment (e.g., they move from "low spender" to "mid spender"), you'll have an assignment consistency problem. Use strata based on behavior measured well before the experiment window.
- Small strata can cause problems: if a stratum has very few users, the randomization within that stratum becomes noisy. Consider merging thin strata or collapsing categories before assignment.
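Merging thin strata can be as simple as folding every stratum below a minimum size into a catch-all bucket before assignment. The sketch below assumes that approach. The function name `collapse_thin_strata`, the `min_size` of 50, and the country counts are hypothetical, and the right threshold depends on your experiment.

```python
from collections import Counter

def collapse_thin_strata(stratum_of, min_size, other_label="other"):
    """Relabel every stratum with fewer than `min_size` users as `other_label`."""
    counts = Counter(stratum_of.values())
    return {
        user: (stratum if counts[stratum] >= min_size else other_label)
        for user, stratum in stratum_of.items()
    }

# Hypothetical user base: two large markets and two very small ones.
countries = ["US"] * 500 + ["DE"] * 300 + ["NZ"] * 3 + ["IS"] * 2
stratum_of = {f"user_{i}": c for i, c in enumerate(countries)}

collapsed = collapse_thin_strata(stratum_of, min_size=50)
print(Counter(collapsed.values()))
```

If the catch-all bucket itself ends up very small, it is worth checking its size before using it as a stratum.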
8. When Stratification Matters Most
Stratification has the highest impact when:
- Your experiment has a relatively small sample size — where random imbalance is more likely
- Your outcome metric has high variance across user segments
- You're planning subgroup analyses — if you want valid estimates per segment, you need adequate representation by design
- You're running experiments in new or small markets where baseline behavior is less stable
For very large experiments (millions of users), pure randomization tends to produce balanced groups naturally — stratification is a "nice to have" rather than a necessity. But for smaller experiments, it can meaningfully reduce both bias risk and variance.
Takeaway
Pure random assignment is not always enough. When your user population is heterogeneous and your sample is limited, stratified sampling gives you balance by design — not by luck.
One sentence summary:
Don't leave group balance to chance — stratify on what matters before the experiment starts.
Used alongside CUPED at the analysis stage, stratified sampling is one of the most reliable ways to run cleaner, faster, more trustworthy experiments.