Building a Testing Culture
The teams that learn fastest win. This is the operating model I use to build experimentation programs that ship 200+ A/B tests a year without sacrificing rigor — covering hypothesis design, prioritization, velocity, statistical discipline, and the cultural shifts that decide whether the program lives or dies.
Data ends the argument
The single biggest blocker to a testing culture isn't tooling, sample size, or developer bandwidth — it's the HiPPO (Highest Paid Person's Opinion). Most product orgs ship features because someone senior felt strongly about them, then declare victory based on launch metrics that were never going to fail. A testing culture replaces that with a simple rule: if it can be tested, it should be tested.
The math is the easy part. A calculator tells you how much traffic you need and how long to run the test, and the results are usually straightforward. The hard part is what the results do to leaders. A/B testing is a humbling practice — roughly 80% of test ideas don't win. The best programs I've been part of accept that as the price of admission. You'll occasionally find a 30%, 40%, even 50% lift, but most of the time the output isn't a winner shipped to production — it's a learning that reshapes the next hypothesis. Senior leaders who can sit with that, and submit their own ideas to the same gate as everyone else's, are what makes the program work.
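The calculator itself is nothing exotic. As a minimal sketch, this is the standard two-proportion sample-size formula behind most of those calculators; the 3% baseline conversion rate and 10% relative minimum detectable effect are illustrative assumptions, not recommendations.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline_cr: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-proportion z-test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_rel)               # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return ceil(n)

# Illustrative: 3% baseline conversion, 10% relative lift target
print(sample_size_per_variant(0.03, 0.10), "visitors per variant")
# Divide by daily eligible traffic to estimate test duration.
```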
What testing demands of senior leaders
Lead by Example
Public Vulnerability
Standardize the Gate
Adaptive Strategy
Institutional Memory
Invest in the Platform
Move from Ideas to Fundable Bets
A great idea without a hypothesis is a wish. To maintain speed without sacrificing discipline, I require a non-negotiable format for every experiment: IF [change], THEN [user behavior], BECAUSE [reasoning].
For more complex surfaces, we extend this to Multivariate Testing (MVT). While a standard A/B test changes one variable, MVT tests multiple variables simultaneously to understand interaction effects (e.g., how a new headline performs when combined with a different CTA button). MVT requires significantly higher traffic but provides a deeper map of the user's mental model.
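To make the traffic cost concrete: the cell count of a full-factorial MVT multiplies across factors. The headline and CTA factors below are hypothetical.

```python
from itertools import product

# Hypothetical factors on a checkout surface
headlines = ["control", "urgency", "social_proof"]
cta_styles = ["control", "high_contrast"]

cells = list(product(headlines, cta_styles))  # full factorial: 3 x 2 = 6 cells
print(len(cells), "cells:", cells)

# Each cell needs roughly the sample of a single A/B arm, so this 6-cell MVT
# needs about three times the traffic of a simple two-arm test.
```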
“IF we display a security badge on the payment page, THEN customers will feel more confident in the site's security. This will increase conversion BECAUSE trusted brands convert better — and even a 2% lift on this surface would be material to the P&L.”
Why the format matters
The third clause — the because and the order-of-magnitude business case — is what separates an idea from a fundable bet. Without it, you're running a curiosity experiment. With it, you're running an investment.
What makes a hypothesis great
Rooted in a customer pain point
Not 'what should we change,' but 'what's broken for the user.' Pain is the strongest forcing function for ideas worth testing.
Backed by mixed evidence
Quantitative AND qualitative. Either alone is half a story; together they tell you the why behind the what.
Written before the solution
If the solution comes first, you're not testing — you're confirming. The hypothesis names the problem; the variations propose the answer.
Part of a decision tree
A hypothesis is best when it's one node in a 'suite of hypotheses' — so you know what to test next, whether this one wins or loses.
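One lightweight way to hold hypotheses in this shape, and to keep the decision-tree links explicit, is a structured record rather than free text. The fields and the badge example below are a sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    change: str                 # IF we make this change...
    expected_behavior: str      # THEN users will behave this way...
    reasoning: str              # BECAUSE of this evidence and business case
    evidence: list[str] = field(default_factory=list)  # quant + qual sources
    if_wins: str = ""           # next node in the decision tree
    if_loses: str = ""          # next node if this hypothesis is rejected

badge = Hypothesis(
    change="display a security badge on the payment page",
    expected_behavior="more visitors complete checkout",
    reasoning="trusted brands convert better; even a 2% lift is material on this surface",
    evidence=["qualitative: exit-survey comments about trust",
              "quantitative: payment-step drop-off"],
    if_wins="test badge placement above vs. below the pay button",
    if_loses="test trust-focused copy instead of a visual badge",
)
```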
Building a High-Signal Backlog
Quantitative Signals
Qualitative Insight
Heuristic Review
Market Context
Design Thinking
Elements of Value
Score everything; argue about scores, not opinions
Prioritization is where most testing programs collapse. The instinct is to build the backlog, then prioritize by political consensus. The result: the loudest stakeholder wins, and the team's velocity plateaus at whatever the most senior person can review in a week. The fix is to make every idea pass through a numeric scoring rubric before it gets discussed — so the conversation is about the score, not the idea.
For roadmap-level decisions I use a 10-signal scoring rubric (see my Prioritization Model). For test-backlog-level decisions a lighter three-axis rubric is enough — the tests are smaller bets, the cost of being wrong is lower, and speed matters more than precision.
The test-backlog rubric
Effort estimate (S / M / L)
Honest dev sizing — not the optimistic version. If every test is 'small' the rubric isn't doing work.
Forecasted business impact (S / M / L)
Required: the dollar number (or relative size) you'd defend in a CFO review. Forecasts can be wrong; missing forecasts are always wrong.
Priority bucket (High / Low / Hold)
Forces a decision instead of 'later.' Hold is allowed and useful — Low-priority is the trap, because Low becomes Never.
Vetting checklist before development
Validate the pain point with research, prioritize the most urgent, generate ≥3 solution candidates, pick 1–4 variations, define primary metric, define secondary metrics, define monitoring guardrails. Tests that skip this checklist tend to be the ones that produce ambiguous results.
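A minimal version of that scoring pass, with the rubric reduced to numbers so the ranking is mechanical. The weights and the example rows are illustrative, not the actual rubric values.

```python
SIZE = {"S": 1, "M": 2, "L": 3}
PRIORITY = {"High": 2, "Low": 1, "Hold": 0}

def score(effort: str, impact: str, priority: str) -> float:
    """Higher score schedules sooner; Hold always drops to the bottom."""
    if PRIORITY[priority] == 0:
        return 0.0
    return SIZE[impact] * PRIORITY[priority] / SIZE[effort]

backlog = [
    {"name": "security badge on payment page", "effort": "S", "impact": "M", "priority": "High"},
    {"name": "rebuild checkout navigation",    "effort": "L", "impact": "M", "priority": "Low"},
    {"name": "loyalty program concept",        "effort": "L", "impact": "L", "priority": "Hold"},
]

ranked = sorted(backlog, key=lambda i: score(i["effort"], i["impact"], i["priority"]), reverse=True)
for item in ranked:
    print(f'{score(item["effort"], item["impact"], item["priority"]):.1f}  {item["name"]}')
```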
Speed without sacrificing rigor
Velocity is the compounding metric of an experimentation program. A team running 50 tests a year and a team running 200 tests a year don't differ by 4× in output — they differ by 4× in learning rate, which compounds. The constraint is rarely developer hours; it's the QA and analysis loop. These targets break that constraint.
The cadence that ships 200+ tests/year
Test sizing at hypothesis stage
Small (~20% of tests), Medium (~60%), Large (~20%). Sized when the hypothesis is written, tracked through delivery. Surprises in sizing are themselves a signal.
UAT turnaround targets
Small: ½ day. Medium: 1 day. Large: 2 days. Anything slower means QA is the bottleneck — fix the QA process, not the tests.
End-to-end QA, not first-bug-stop
QA tests the full flow, batches all defects in one report, sends them back together. The stop-at-first-bug pattern costs more cycles than it saves.
Cross-browser/device matrix
Test cases written during development, not after. Modern matrix: Chrome / Firefox / Safari / Edge × Windows / Mac / iOS / Android. Older browsers handled via graceful degradation, not test coverage.
Developers QA their own code first
Catches the trivial defects before the formal QA pass — easily 30% of would-be bugs. Costs 15 minutes; saves a day of back-and-forth.
Tests you shouldn't run
Regulatory, legal, or accessibility-driven changes
Compliance isn't optional; you're not testing whether to comply. Test the rollout's secondary effects, not the change itself.
Traffic too low for the MDE
If you'd need 6 months to reach significance, the answer is qualitative research, not a degraded test. The calculator is honest with you here — listen to it.
Pure 0→1 MVP launches
A one-arm rollout with monitoring beats a 50/50 test for true 0→1 features. Test the iterations once you have a baseline.
Internal/operational improvements
Test the outcome (CR, NPS, support volume), not the implementation. Caring about the implementation is a category error.
Overlapping tests on the same surface
Either run mutually-exclusive groups or sequence them. Stacked tests pollute each other's signal and the post-mortem is unfalsifiable.
Major seasonality windows
Black Friday, Cyber Monday, Christmas, fiscal-year resets. Anomalies will mask or fake your effect — wait for the noise floor to drop.
No pre-registered success criterion
Without a written-down definition of 'winning' before launch, post-hoc rationalization wins every time. This is the most-violated rule in the field.
“Statistically significant” is necessary, not sufficient
Most teams declare a winner the moment a single metric crosses 95% confidence. This is wrong, and it's why so many “wins” don't replicate post-rollout. A winner is validated by five signals together. Any one in isolation is noise.
Five Winning-Test Criteria
Statistical Significance (95% Standard)
Default to 95% confidence. Drop the bar only for low-risk, reversible decisions where the downside is limited and the rationale is pre-registered.
Minimum Sample Volume
Measured by conversions, not sessions. A 100k-session test with 50 conversions is underpowered. Aim for at least 250–500 conversions per variant.
Stable Trend Lines
Read the shape, not the endpoint. Moving toward the mean (convergence) usually indicates a decaying effect or novelty response.
KPI Alignment
Primary metric wins must be supported by secondary KPIs. A lift in conversion that degrades Revenue per Visit is a local optimum, not a business win.
Product Sniff Test
Does the result align with customer behavior? The sniff test is the final filter against measurement error and tracking bugs.
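The first two criteria are mechanical enough to encode as a gate; trend stability, KPI alignment, and the sniff test stay with a human reviewer. A sketch, using a standard two-proportion z-test and the conversion floor from the list above.

```python
from math import sqrt
from statistics import NormalDist

def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a two-proportion z-test (variant B vs. control A)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def passes_mechanical_gate(conv_a, n_a, conv_b, n_b,
                           alpha=0.05, min_conversions=250) -> bool:
    enough_volume = min(conv_a, conv_b) >= min_conversions   # conversions, not sessions
    significant = p_value(conv_a, n_a, conv_b, n_b) < alpha  # the 95% standard
    return enough_volume and significant

# Illustrative readout: 3.0% vs. 3.6% conversion on 10,000 sessions per arm
print(passes_mechanical_gate(conv_a=300, n_a=10_000, conv_b=360, n_b=10_000))
```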
The win you shipped 18 months ago probably isn't winning anymore
No tested change is permanent. Winners decay for four reliable reasons: novelty effect (users responded to “different,” not “better”), segment composition shift (the audience that won the test isn't the audience using the product today), competitive response (whatever you copied or invented, your competitor caught up), and interaction effects (other changes shipped in the meantime altered the surface around your win).
The implication is operational, not philosophical: re-test winners on a 6–12 month cadence. Treat the testing backlog as having two lanes — net-new hypotheses and re-validation of past wins. The team that re-tests is the team that compounds learning instead of accumulating false confidence.
Nine steps from idea to documented learning
Brainstorm
Small group (≤5), no idea killed at the table. Rank afterward. The best ideas often come out 20 minutes in — kill the meeting too early and you miss them.
Submit
Every idea enters the same backlog (one tool, one format). Friction at intake = ideas you'll never see again.
Prioritize
Score (effort × impact × priority), commit publicly. Public commitment is the forcing function against quiet de-prioritization.
Document & design
Test Card or wiki entry per test, not per sprint. Include screenshots of every variation (including control) and the specific tracking logic.
Setup
Build per spec. For scale, I implement hash-based consistent bucketing — ensuring users stay in their variant without the latency or overhead of a central database (see the sketch after this list).
QA & approval
Preview link, full browser/device matrix, batched bug reporting. Sign-off is gated on the pre-registered success criterion being in writing.
Run & notify
Calendar invite to all stakeholders the moment it goes live. Surprises about who launched what are how testing programs lose political capital.
Analyze & review
Five-criteria check (sample size, significance, trend, KPI alignment, sniff). All five, not the most convenient subset.
Document result
Update the Wiki with the final code, result data, and the next-step hypothesis. Every test must produce a learning that compounds.
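The bucketing mentioned in the setup step can be as small as this. A minimal sketch, assuming a stable user identifier and an experiment slug as inputs.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment"), weights=(0.5, 0.5)) -> str:
    """Deterministic assignment: the same user + experiment always maps to the same
    variant, with no lookup table or shared state. Salting the hash with the
    experiment slug keeps assignments independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform value in [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point <= cumulative:
            return variant
    return variants[-1]

# Same inputs, same bucket — every time, on every server.
assert assign_variant("user-123", "security-badge") == assign_variant("user-123", "security-badge")
```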
Anti-patterns we actively block
These are the failure modes I watch for in any program audit. Most testing programs are running at least three of them right now.
Advanced testing methods
High-performance programs eventually outgrow simple 50/50 A/B testing. To maximize yield, I deploy two advanced approaches: Multi-Armed Bandit (MAB) and the Twice-Tested Rule for statistical anomalies.
The Twice-Tested Rule: If a result looks “too good to be true” (e.g., a 40% lift on a mature surface), I re-test it. At 95% confidence, the probability of a single false positive is 5%; the probability of getting the same false positive twice in independent runs is 0.25% (0.05 × 0.05). If the lift holds twice, it's real.
Multi-Armed Bandits (MAB): Used in sophisticated ad platforms (Facebook, Google) and high-velocity systems. Instead of fixed 50/50 splits, MAB algorithms (like Thompson Sampling) automatically shift traffic to the winning variation in real-time. This reduces the “regret” of serving a sub-optimal experience to users during the test window. It's ideal for short-lived campaigns where speed to value beats pure long-term learning.
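A minimal Thompson Sampling loop for a binary conversion metric; the "true" rates below are simulated stand-ins for live traffic, which is exactly what the algorithm doesn't know in production.

```python
import random

# Beta(1, 1) priors per arm: index 0 counts conversions, index 1 counts non-conversions
arms = {"control": [1, 1], "variant_b": [1, 1]}
true_rates = {"control": 0.030, "variant_b": 0.036}   # simulated; unknown in production

for _ in range(50_000):
    # Draw a plausible conversion rate for each arm and serve the highest draw
    chosen = max(arms, key=lambda a: random.betavariate(*arms[a]))
    converted = random.random() < true_rates[chosen]
    arms[chosen][0 if converted else 1] += 1

for arm, (wins, losses) in arms.items():
    served = wins + losses - 2   # subtract the prior pseudo-counts
    print(f"{arm}: served {served:,} times, posterior mean CR {wins / (wins + losses):.3%}")
```

Traffic drifts toward variant_b as evidence accumulates, which is the “regret” reduction described above.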
Building a Cross-Functional Squad
Product / Business Owner
Direct Response Copy
UX & UI Design
CRO & Data Analyst
Engineering Lead
QA / CRO QA Owner
Continuous Validation via AA Testing
A testing platform you can't trust is worse than no platform at all. To ensure our measurement is accurate, I implement always-on AA tests.
By running an identical experience through the testing pipeline as two separate groups, we verify that the system reports no statistically significant difference beyond the expected false-positive rate. If AA tests routinely show significance, we know our bucketing logic or data tracking is flawed. This is the bedrock of a data-driven culture.
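A quick way to sanity-check that expectation: under a healthy pipeline, an identical split should come back “significant” at roughly the nominal 5% rate and no more. This simulation is a stand-in for replaying real AA assignments through your own stack.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return 2 * (1 - NormalDist().cdf(abs((conv_b / n_b - conv_a / n_a) / se)))

RUNS, N, TRUE_CR = 500, 10_000, 0.03   # identical experience served to both groups
false_positives = 0
for _ in range(RUNS):
    conv_a = sum(random.random() < TRUE_CR for _ in range(N))
    conv_b = sum(random.random() < TRUE_CR for _ in range(N))
    if p_value(conv_a, N, conv_b, N) < 0.05:
        false_positives += 1

# Healthy: ~5% of AA runs look "significant". Much more than that points to
# broken bucketing, duplicated users, or tracking bugs.
print(f"{false_positives / RUNS:.1%} of AA runs were falsely significant")
```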
“The team that wins isn't the team with the best ideas. It's the team that's most comfortable being publicly wrong, fastest, in the smallest possible increments.”