Evidence-Based Design

Product design, especially growth design, can be a battleground where competing inputs take turns driving decisions: user needs, business impact, industry trends, designers' intuition, and even stakeholders' personal experience. Diverse inputs can spark good ideas, but they can also lead to deadlock when everyone has an opinion and no one has a framework. The decision goes to whoever speaks the loudest or whichever proposal comes with the most compelling story. Bias accumulates quietly. Over time, this can lead to significant revenue loss and harm the business, especially when design heavily impacts conversions.

Modern medicine once faced a remarkably similar crisis. For most of its history, a treatment was chosen because an experienced doctor liked it, not because it was rigorously tested. In the early 1990s, Evidence-Based Medicine emerged as a game changer with a simple but radical rule: clinical decisions should be grounded in the best available evidence. To support it, researchers developed a hierarchy of evidence that ranks sources not by status, but by how well they control the biases that distort judgment. The result is a field that can defend its decisions and improve systematically.

I'm proposing Evidence-Based Design, borrowing the same ambition from medicine. Evidence-Based Design is the practice of evaluating design decisions based on the quality of the evidence supporting them, so that designers can make better decisions and own those decisions confidently.

The evidence hierarchy

The evidence hierarchy is, at its core, a hierarchy of bias reduction. The higher the level, the more variables are controlled and the less the result depends on who designed the study, who participated, or what assumptions were built in.

Evidence is considered higher level when it:

  • is collected in real product contexts

  • controls variables and isolates causation

  • is replicable across contexts

  • reduces interpretation bias

Level 1: Replicated Experiments
What it is: Multiple Level 2 experiments testing the same hypothesis across different products, different audiences, or over time, showing consistent results.
Why it ranks here: Evidence validated repeatedly across contexts helps rule out the possibility that a single A/B test result was a coincidence or fluke.

Level 2: Well-Designed Experiment
What it is: A single well-designed randomized controlled experiment (A/B tests, A/B/N tests, multivariate tests).
Why it ranks here: Isolates causation by controlling variables and measuring specific metrics in the live product.

Level 3: Product Behavioral Data
What it is: Data showing how users interact with an existing product (visits, conversions, feature adoption, engagement, and user feedback).*
Why it ranks here: Reveals what is happening in the real product with real customers, but the signals are influenced by many factors at once, making it difficult to attribute results to any single design decision.

Level 4: Usability Research & Hypothesis-Driven Research
What it is: Usability testing against prototypes (task completion, findability, comprehension, etc.), as well as hypothesis-driven surveys and interviews where users respond to hypothetical questions or mocked scenarios.
Why it ranks here: Strong for diagnosing potential friction or likely user intent, but both methods share a say-do gap: responses are based on simulated or imagined experiences rather than real ones.

Level 5: External Secondary Research
What it is: Information gathered from external sources (published research, industry benchmarks, competitor analysis, and design patterns).
Why it ranks here: Useful for generating hypotheses and identifying conventions, but generic and lacking direct evidence about a specific product and its users.

Level 6: Expert Opinion
What it is: Input based on intuition and prior experience.
Why it ranks here: The most accessible and lowest-cost evidence, but also the most prone to authority bias and personal experience that may not transfer across contexts.

*Fake door tests, where a feature or CTA is shown to real users in the live product but clicking leads to a placeholder, can be considered product behavioral data even without a real feature built, as they capture genuine intent in a real context.

The cost of evidence

Higher quality evidence can cost significantly more in time, money, and organizational effort. Running a single well-designed A/B test requires solid experimentation infrastructure, reliable measurement systems, sufficient sample size, dedicated engineering capacity, and analytical capability. That can easily add up to millions of dollars. Most teams, especially early-stage ones, don't have all of this in place. That's not a failure; it's how everyday design decisions get made in reality, and why designers with strong instincts are still so valued.
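To make the sample-size part of that cost concrete, here is a minimal sketch in Python (standard library only) of the arithmetic behind "sufficient sample size". The baseline conversion rate and minimum detectable lift are hypothetical numbers, not figures from any real product.

```python
from math import sqrt, ceil
from statistics import NormalDist

def ab_sample_size_per_variant(baseline_rate, min_detectable_lift,
                               alpha=0.05, power=0.80):
    """Approximate sample size per variant for a two-proportion z-test.

    baseline_rate: control conversion rate; min_detectable_lift: smallest
    absolute improvement worth detecting; alpha: two-sided false-positive
    rate; power: chance of detecting the lift if it is real.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80

    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical example: detecting a 0.5 percentage point lift on a 3%
# baseline already needs roughly 20,000 users per variant.
print(ab_sample_size_per_variant(0.03, 0.005))
```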

That said, relying on lower quality evidence is not without risk. Our goal is to make those tradeoffs deliberately. The following principles are ones I find useful at work:

Match evidence quality to decision weight

Not every decision needs the highest level of evidence. Before deciding what evidence we need and how to get it, we should consider three things: business impact, build effort, and reversibility. A high-impact change that requires significant engineering investment and is difficult to undo deserves more rigorous evidence than a low-effort, easily reversible one. Designers should align with product, business, and engineering partners on these dimensions early; it makes evidence investment intentional rather than reactive.
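As a rough illustration of how those three dimensions could map to a minimum evidence level, here is a small Python sketch. The scoring scale and thresholds are entirely hypothetical; the point is only that the mapping gets agreed on explicitly rather than argued case by case.

```python
def suggested_evidence_level(business_impact, build_effort, reversible):
    """Suggest a minimum evidence level (1 = strongest, 6 = weakest).

    business_impact and build_effort are rough 1-5 scores agreed with
    product and engineering partners; reversible is a boolean.
    The thresholds below are illustrative, not a validated rubric.
    """
    weight = business_impact + build_effort + (0 if reversible else 3)
    if weight >= 10:
        return 2  # worth a dedicated, well-designed experiment
    if weight >= 7:
        return 3  # product behavioral data or a fake door signal
    if weight >= 4:
        return 4  # usability or hypothesis-driven research
    return 6      # expert opinion plus secondary research is enough

# Hypothetical example: a high-impact, costly, hard-to-undo pricing change.
print(suggested_evidence_level(business_impact=5, build_effort=4,
                               reversible=False))  # -> 2
```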

Be cautious about evidence debt

Just like tech debt, evidence debt builds up when decisions are consistently made on poor evidence. Unvalidated assumptions pile up, and correcting them downstream can cost much more than getting better evidence upfront. It's worth asking ourselves over time: where do our biggest unvalidated assumptions sit, and what would it take to test them?

Build evidence infrastructure incrementally

No team needs to start at the top. A practical and more economical approach is to build evidence infrastructure over time. Start from the bottom and build up: expert opinion and secondary research first, then add usability testing, then product analytics, and eventually A/B testing infrastructure. It's critical to know where we currently stand and what to invest in next. The team I'm working with has spent years building a strong evidence infrastructure. As it matured, decisions became more data-driven and discussions shifted from opinions to tradeoffs. It makes design decisions traceable and measurable, and it gives me and the team the confidence to make and own decisions, because we understand the risks behind them and know we will continuously validate the unknowns.

Infrastructure should also be designed to amortize cost across many decisions: reusable research panels, testing protocols and templates, shared analytics instrumentation, and no-code/low-code testing tools. My colleague Sabina introduced me to multi-armed bandit testing: unlike standard A/B tests that split traffic evenly until completion, a multi-armed bandit automatically shifts traffic toward better-performing variants as the test runs. In her words, "earning while learning". This is another good example of how infrastructure can be more cost-effective. While I haven't practiced it myself, some A/B testing tools already support this automation, and it's worth keeping an eye on.
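For readers curious what "earning while learning" looks like mechanically, here is a minimal simulation sketch of one common bandit strategy, Thompson sampling. The conversion rates are simulated, and this is not a description of any particular tool's implementation.

```python
import random

def thompson_bandit(true_rates, visitors=10_000, seed=7):
    """Simulate a Bernoulli multi-armed bandit with Thompson sampling.

    Each variant keeps a Beta(successes + 1, failures + 1) posterior over
    its conversion rate; every visitor goes to the variant whose sampled
    rate is highest, so traffic drifts toward the better performer.
    """
    random.seed(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)

    for _ in range(visitors):
        samples = [random.betavariate(s + 1, f + 1)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))       # route this visitor
        if random.random() < true_rates[arm]:   # did they convert?
            successes[arm] += 1
        else:
            failures[arm] += 1

    return [s + f for s, f in zip(successes, failures)]  # traffic per variant

# Hypothetical rates: variant B truly converts better, so it typically
# receives most of the traffic by the end of the run.
print(thompson_bandit([0.03, 0.05]))
```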

Evidence management is a design skill

Evidence-Based Design requires designers not only to gather insights from research, stakeholder interviews, and data, but also to actively manage them. I have three proposals:

Know what evidence to collect

This is a product management mindset, but it's also very helpful for designers: start with assumptions rather than jumping into solutions. List all the assumptions and assess them on two dimensions: how much impact it would have if an assumption proves wrong, and how strong the existing evidence for it already is. That naturally surfaces the assumptions worth investing in: the ones that are high impact but poorly evidenced. From there, the question becomes what type of evidence is actually needed (usually a higher-level one), and how to design research or tests specifically around validating those assumptions.

This also means sequencing the tests. Lower-level evidence can act as a filter to eliminate weak options early. For example, if a fake door test shows no interest, a full A/B test might no longer be needed.

We used a 2x2 matrix and tables to prioritize assumptions based on impact and existing evidence, as sketched below.
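The same prioritization can live in a lightweight script or spreadsheet just as easily as in a workshop artifact. Here is a sketch with made-up assumption names and scores, purely to show the sorting logic.

```python
# Hypothetical assumptions, scored 1-5 for impact-if-wrong and for the
# strength of the evidence we already hold. Names and scores are made up.
assumptions = [
    {"name": "Users understand the new pricing tier names", "impact": 5, "evidence": 1},
    {"name": "Mobile users want a shorter signup form", "impact": 4, "evidence": 3},
    {"name": "The promo banner color affects clicks", "impact": 2, "evidence": 2},
]

# High impact plus weak existing evidence floats to the top of the test plan.
for a in sorted(assumptions, key=lambda a: (-a["impact"], a["evidence"])):
    quadrant = "validate next" if a["impact"] >= 4 and a["evidence"] <= 2 else "monitor"
    print(f'{a["name"]}: {quadrant}')
```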

Understand what evidence is telling you, and what it isn’t

Every evidence type has its limitations: bias, statistical error, and methods that only capture part of the story; a usability test, for example, captures friction but not intent or business impact. Reading evidence honestly means asking what else could explain this result, where the gaps are, and what other signals would either reinforce or challenge what you're seeing. Cross-referencing multiple levels rather than relying on any single source is very helpful in getting a more complete picture.

In medicine, blind review is a standard safeguard. Product design rarely works this way. In most teams, the designer, product manager, or researcher who ran the research is also the one who reads the results and presents the conclusions. It's easy to unconsciously overweight evidence that supports the existing direction (a trap I often fall into myself). This is where teamwork should come into play. Two heads are better than one. Share test data across teams, with product, engineering, sales, customer service, finance, and so on, and get perspectives from different angles before drawing conclusions.

Improve the quality of evidence produced

The level of evidence matters, but so does how well it's executed. A usability test with a high-fidelity interactive prototype will produce much more reliable results than the same test run on static mockups. And a survey sent to the wrong audience has little value regardless of how well the questions are written. There are plenty of practical research guidelines available, so I won't go into detail here.

Build and launch with a testing mindset

This isn't a new concept. Lean design and lean product both emphasize rapid product iteration and learning from real product behavior. In fact, rapid iteration is a great way to generate higher-level evidence in a real product context. However, it can easily become reactive, constantly chasing short-term signals. A better approach is to make validation plans part of the MVP itself. Instead of throwing away designs that are not selected, park them in a testing backlog as alternative hypotheses. And when collecting data and user feedback after launch, consolidate them into an evidence library so you can cross-reference early research and post-launch data. This turns rapid iteration into a cumulative knowledge-building process rather than a series of reactive changes.
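What an evidence library looks like will differ by team; the sketch below is just one hypothetical shape for an entry, with made-up fields and content, to show how pre-launch research and post-launch data can be linked back to the same assumption.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class EvidenceEntry:
    """One record in a shared evidence library.

    evidence_level follows the hierarchy above (1 = replicated experiments,
    6 = expert opinion); assumption links the entry to the assumption it
    supports or challenges, so early research and post-launch data can be
    cross-referenced later.
    """
    assumption: str
    evidence_level: int
    source: str
    summary: str
    collected_on: date = field(default_factory=date.today)

# Hypothetical entries: a pre-launch usability finding and the post-launch
# behavioral signal that relates to the same assumption.
library = [
    EvidenceEntry("Users understand the new tier names", 4,
                  "Usability test, 8 participants",
                  "Most participants misread the middle tier as the top one"),
    EvidenceEntry("Users understand the new tier names", 3,
                  "Post-launch funnel data",
                  "Upgrade clicks concentrated on the wrong tier after launch"),
]
```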

Filling the gap in design frameworks

Design education and the methods that define our practice have always been strongest at exploration. We've mastered user journey mapping, brainstorming, workshops, and a growing toolkit of facilitation techniques. They are exceptionally good at helping teams find possible solutions. But when it comes to choosing between them, the existing frameworks go quiet. I find myself empowered and energized during the divergent phase of design, but scrambling in the convergent phase.

Evidence-Based Design is a direct response to that gap. It gives designers something concrete to bring to the decision-making table: not just recommendations, but defensible rationales grounded in what is actually known.

This matters most in the moments that define design impact: when we need to negotiate with business partners, defend a direction under pressure, or make the case that a proposed shortcut carries real risk. Final decisions will always be shaped by factors beyond evidence, like budget, timing, technical feasibility, or legal constraints. Evidence-Based Design does not change that reality. But it ensures the design perspective in those conversations is grounded in evidence and speaks the same language as product and business. That is what earns designers a genuine voice, and over time, genuine ownership.