A/B Test Trial Simulation
This simulator shows how real A/B tests progress over time, with each new participant adding noise and uncertainty.
Unlike static calculations, this demonstrates the actual trial experience where p-values start at 1 (no evidence) and gradually decrease as evidence accumulates.
Why do this?
Traditional method
- Recruits a target number of participants up front, then measures the effect difference.
- Has an "over-recruitment" problem: you may pay for more participants than the effect actually requires.
Sequential method
- Recruits participants sequentially, updating the results after each observation.
- Stops the experiment as soon as enough participants have shown a significant difference between control and treatment.
- Early lucky or unlucky observations can send the experiment off on a tangent, especially when control and treatment each have their own standard deviation, and more so when those standard deviations are unknown.
With this simulation, we...
- Check whether a given trial is worth running at all, by sampling the distribution of observations required to conclude it.
- Budget the cost of recruiting, especially when each observation is expensive.
E.g., if the median trials-to-significance is 150, it is reasonable to recruit up to 200 observations for a single experiment: you can exit early if you get lucky, and even in unlucky runs you have 200 observations prepared. A sketch of this sampling approach follows.
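The simulator's internals aren't shown here, so the following is a minimal sketch of the budgeting idea, assuming each observation is a Bernoulli outcome whose probability is jittered by the group's relative standard deviation; all names (`trials_to_significance`, `rel_std`) and the 0.05 threshold are illustrative, not taken from the tool itself:

```python
import numpy as np
from scipy.stats import chi2_contingency

def trials_to_significance(p_control=0.20, p_treatment=0.30,
                           rel_std=0.20, alpha=0.05,
                           max_trials=1000, rng=None):
    """Run one sequential experiment and return the trial count at
    which the accumulated 2x2 table first reaches p < alpha."""
    rng = rng or np.random.default_rng()
    succ, fail = [0, 0], [0, 0]  # index 0 = control, 1 = treatment
    for n in range(1, max_trials + 1):
        for g, p in enumerate((p_control, p_treatment)):
            # Jitter the per-observation probability by the group's
            # relative std, then draw a Bernoulli outcome.
            p_obs = float(np.clip(rng.normal(p, rel_std * p), 0.0, 1.0))
            if rng.random() < p_obs:
                succ[g] += 1
            else:
                fail[g] += 1
        table = np.array([[succ[0], fail[0]], [succ[1], fail[1]]])
        if table.min() > 0:  # chi-square needs nonzero expected counts
            _, p_val, _, _ = chi2_contingency(table)
            if p_val < alpha:
                return n
    return max_trials  # censored: never reached significance

rng = np.random.default_rng(42)
samples = [trials_to_significance(rng=rng) for _ in range(200)]
print("median:", np.median(samples),
      "90th percentile:", np.percentile(samples, 90))
```

Sampling a high percentile rather than just the median is what turns the distribution into a budget: it tells you how many observations to have on hand for an unlucky run.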
Understanding Trial Progression
The Real Problem
- Static calculations assume perfect knowledge
- Real trials have variance and noise
- Each participant adds uncertainty
- P-values fluctuate as data accumulates
True Effects vs. Observed Effects
- True Effect: The actual underlying difference (e.g., 2% vs 3% conversion)
- Observed Effect: What you measure (e.g., 1.8-2.2% vs 2.7-3.3% due to variance)
Variance Impact
- Low Variance: Quick detection of true effects
- High Variance: Harder to detect true effects, requires more trials
- No Variance: Perfect detection immediately (unrealistic)
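To illustrate, here is a minimal sketch treating the effect as a continuous measurement for simplicity (the sample sizes and variable names are assumptions, not the simulator's): as the relative standard deviation grows, the observed effect strays further from the true one, so more trials are needed to separate the groups.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, n = 20.0, 50  # true effect, observations per run
for rel_std in (0.0, 0.2, 0.5):
    # Observed effect = mean of n noisy observations; repeat 1000 runs
    # to see how far the observed effect strays from the true one.
    observed = rng.normal(true_effect, rel_std * true_effect,
                          size=(1000, n)).mean(axis=1)
    print(f"rel_std={rel_std:.1f}: observed effect spans "
          f"{observed.min():.1f}-{observed.max():.1f}")
```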
How the Simulation Works
Parameters
- Base Effect: Control group's true effect (e.g., 20)
- Base Std: Standard deviation as % of base effect (e.g., 20% = ±20% of 20 = 16-24)
- Target Effect: Treatment group's true effect (e.g., 30)
- Target Std: Standard deviation as % of target effect (e.g., 20% = ±20% of 30 = 24-36); the conversion is sketched below
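In code, these percent-based parameters translate to absolute standard deviations like so (variable names are illustrative):

```python
base_effect, base_std_pct = 20.0, 0.20      # control
target_effect, target_std_pct = 30.0, 0.20  # treatment

base_sd = base_std_pct * base_effect        # 0.20 * 20 = 4 -> ~16-24
target_sd = target_std_pct * target_effect  # 0.20 * 30 = 6 -> ~24-36
print(f"control ~ N(20, {base_sd}), treatment ~ N(30, {target_sd})")
```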
Simulation Process
- Start at p=1 (no evidence yet)
- For each trial: Simulate one observation for each group with variance
- Accumulate data: Keep running totals of successes/failures
- Calculate p-value: Use chi-square test on accumulated data
- Calculate odds ratio: From accumulated data
- Plot progression: Show how p-value and odds ratio evolve
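A minimal sketch of this loop, under the same assumptions as before (Bernoulli outcomes with jittered probabilities; the Haldane 0.5 correction for the odds ratio is an assumption, not necessarily what the simulator uses):

```python
import numpy as np
from scipy.stats import chi2_contingency

def simulate_progression(p_control=0.20, p_treatment=0.30,
                         rel_std=0.20, n_trials=500, seed=0):
    """Return per-trial p-values and odds ratios for one simulated run."""
    rng = np.random.default_rng(seed)
    succ, fail = [0, 0], [0, 0]  # index 0 = control, 1 = treatment
    p_values, odds_ratios = [], []
    for _ in range(n_trials):
        for g, p in enumerate((p_control, p_treatment)):
            p_obs = float(np.clip(rng.normal(p, rel_std * p), 0.0, 1.0))
            if rng.random() < p_obs:
                succ[g] += 1
            else:
                fail[g] += 1
        table = np.array([[succ[0], fail[0]], [succ[1], fail[1]]])
        if table.min() > 0:
            _, p_val, _, _ = chi2_contingency(table)
        else:
            p_val = 1.0  # no evidence until every cell is populated
        # Odds ratio from the accumulated table, with a 0.5 (Haldane)
        # correction so empty cells don't divide by zero.
        orr = ((succ[1] + 0.5) * (fail[0] + 0.5)) / \
              ((fail[1] + 0.5) * (succ[0] + 0.5))
        p_values.append(p_val)
        odds_ratios.append(orr)
    return np.array(p_values), np.array(odds_ratios)

p_values, odds_ratios = simulate_progression()
sig = np.flatnonzero(p_values < 0.05)
print("first significant trial:", sig[0] + 1 if sig.size else "none")
```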
Chart Interpretation
Axes
- X-axis: Trials - number of trials accumulated
- Left Y-axis: -Log₁₀(p-value) - higher means more significant (smaller p)
- Right Y-axis: Odds Ratio - values > 1 favor treatment
Lines
- Blue line: P-value evolution (starts at p=1, i.e., 0 on the -log scale, and becomes significant once it crosses the threshold)
- Red line: Observed odds ratio (fluctuates due to variance)
- Gray dashed line: Standard significance threshold (p=0.05)
- Orange dotted line: Doubt-adjusted threshold (stricter early, standard later)
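A plotting sketch matching this layout, assuming the `p_values` and `odds_ratios` arrays from `simulate_progression` above are in scope; the exact shape of the doubt-adjusted curve here is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

trials = np.arange(1, len(p_values) + 1)
fig, ax_p = plt.subplots()
# Blue: -log10(p); starts at 0 (p = 1) and rises as evidence builds.
ax_p.plot(trials, -np.log10(p_values), color="blue", label="-log10(p)")
# Gray dashed: the fixed p = 0.05 threshold.
ax_p.axhline(-np.log10(0.05), color="gray", linestyle="--", label="p = 0.05")
# Orange dotted: an assumed doubt-adjusted threshold, stricter early
# and relaxing toward 0.05 as trials accumulate.
doubt_alpha = 0.05 / (1 + 50 / trials)
ax_p.plot(trials, -np.log10(doubt_alpha), color="orange", linestyle=":",
          label="doubt-adjusted")
ax_p.set_xlabel("Trials")
ax_p.set_ylabel("-log10(p-value)")
# Red, on the right axis: the observed odds ratio.
ax_or = ax_p.twinx()
ax_or.plot(trials, odds_ratios, color="red", label="odds ratio")
ax_or.set_ylabel("Odds ratio")
ax_p.legend(loc="upper left")
plt.show()
```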
Key Insights
- P-value starts at 1 (no evidence)
- Gradual decrease as evidence accumulates
- High variance makes detection harder
- Odds ratio fluctuates around true value
- Doubt index discounts early significance, which is more likely to be luck
- Sustained evidence required for confirmation (e.g., 5 consecutive significant trials); see the sketch below
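The sustained-evidence rule can be sketched like this (the 5-trial streak and the function name are illustrative):

```python
def confirmed_at(p_values, alpha=0.05, streak=5):
    """First 1-based trial at which p < alpha has held for `streak`
    consecutive trials, or None if the run is never confirmed."""
    run = 0
    for i, p in enumerate(p_values, start=1):
        run = run + 1 if p < alpha else 0
        if run >= streak:
            return i
    return None

# Significance at trials 3-7 confirms the effect at trial 7.
print(confirmed_at([1.0, 0.2, 0.04, 0.03, 0.02, 0.04, 0.01]))  # -> 7
```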
Real-World Examples
E-commerce A/B Test
- Control: 20% conversion ± 20% variance (16-24% range)
- Treatment: 30% conversion ± 20% variance (24-36% range)
- Result: May take 200-500 trials to detect significance
Email Marketing Test
- Control: 20% open rate ± 10% variance (18-22% range)
- Treatment: 25% open rate ± 10% variance (22.5-27.5% range)
- Result: May take 100-300 trials to detect significance
Landing Page Test
- Control: 10% conversion ± 30% variance (7-13% range)
- Treatment: 20% conversion ± 30% variance (14-26% range)
- Result: May take 500-1000 trials to detect significance
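To estimate budgets like these for your own parameters, the scenarios can be fed through the `trials_to_significance` helper sketched earlier (the numbers come from the examples above; the medians you get will vary from run to run):

```python
import numpy as np

scenarios = {
    "e-commerce":   dict(p_control=0.20, p_treatment=0.30, rel_std=0.20),
    "email":        dict(p_control=0.20, p_treatment=0.25, rel_std=0.10),
    "landing page": dict(p_control=0.10, p_treatment=0.20, rel_std=0.30),
}
rng = np.random.default_rng(3)
for name, kwargs in scenarios.items():
    samples = [trials_to_significance(rng=rng, **kwargs) for _ in range(100)]
    print(f"{name:>12}: median trials-to-significance = "
          f"{np.median(samples):.0f}")
```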
This approach provides a realistic view of how A/B tests actually work in practice, helping you understand the uncertainty and variance inherent in real-world experimentation.