Evidence-Based Design

Design—especially growth and product design—is often a battleground. User needs, business impact, designers’ intuition, stakeholders’ personal experience, and industry trends all pull in different directions. That diversity can spark good ideas. But it also creates a real problem: when everyone has an opinion and no one has a framework, decisions go to whoever speaks loudest, whatever trend feels most current, or whichever proposal comes with the most compelling story. And as bias accumulates quietly, designers struggle to own their decisions — widening the gap between design work and measurable impact.

Modern medicine faced a remarkably similar crisis. For most of its history, medical decisions were driven by the authority of senior physicians, institutional tradition, and anecdotal experience. A treatment was trusted because an experienced doctor believed in it — not because it had been rigorously tested. In the early 1990s, that started to change. Evidence-Based Medicine emerged with a simple but radical idea: clinical decisions should be grounded in the best available evidence, shaped by clinical expertise, and balanced against the patient’s own values and context. To support this, researchers developed a hierarchy of evidence — ranking research methods not by status, but by how well each one controls for the biases that distort judgment. The result was a field that could defend its decisions and get better over time.

Evidence-Based Design borrows that same ambition. It proposes that design decisions should be evaluated according to the quality of available evidence — and that designers, with the right framework, can move from makers of things to makers of decisions.

The Evidence Hierarchy

The evidence hierarchy in design is, at its core, a hierarchy of bias reduction. The higher the level, the more variables are controlled — and the less the result depends on who designed the study, who participated, or what assumptions were built in.

Level 1: Replicated Experiments
What: Multiple controlled experiments testing the same hypothesis across different cohorts or time periods, producing consistent results.
Why: Replication across contexts makes it very unlikely that a single A/B test result was a coincidence, a fluke, or specific to one audience.

Level 2: Well-Designed A/B Experiment
What: A single randomized experiment with sufficient sample size, controlled variables, and clearly defined success metrics.
Why: Isolates causation by controlling variables and measuring specific outcomes in the live product.

Level 3: Product Behavioral Data
What: Behavioral signals showing how users interact with an existing product — visits, conversions, feature adoption, engagement, interaction patterns, and user feedback or support data. *
Why: Reveals what is happening in the real product with real customers, but the signals are mixed — influenced by many factors at once, making it difficult to attribute outcomes to any single design decision.

Level 4: Usability Research & Hypothesis-Driven Research
What: Usability testing with prototypes (task completion, findability, comprehension) and hypothesis-driven surveys and interviews where users respond to hypothetical questions or prototype scenarios.
Why: Strong for diagnosing friction or capturing user intent and motivation, but cannot measure direct business impact. Both methods share a say-do gap — responses are based on simulated or imagined experiences rather than real ones. Within this level, evidence quality depends heavily on the fidelity of the prototype and the design of the research.

Level 5: Competitive & Market Intelligence
What: Competitor patterns and industry conventions.
Why: Useful for generating hypotheses and identifying conventions users may already expect, but provides no direct evidence of effectiveness for a specific product and its users.

Level 6: Expert Opinion
What: Input from experienced designers, product leaders, or stakeholders based on intuition, heuristics, and prior experience.
Why: The most accessible and lowest-cost evidence, but also the most prone to bias — authority bias, confirmation bias, and personal experience that may not transfer across contexts. Valuable as a starting point and for low-stakes decisions, but should not substitute for higher-level evidence on critical decisions.

*Tests like fake door tests — where a feature or CTA is shown to real users in the live product but clicking leads to a placeholder — can be treated as product behavioral data even though no real feature has been built, because they capture a genuine desirability signal in a real context.
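
The hierarchy is also easy to encode in a team's own tooling. The sketch below is illustrative rather than prescriptive: the level names come from the table above, while the Proposal shape and the strongest_evidence helper are invented for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class EvidenceLevel(IntEnum):
    # Levels from the hierarchy above; a lower number means stronger evidence.
    REPLICATED_EXPERIMENTS = 1
    AB_EXPERIMENT = 2
    PRODUCT_BEHAVIORAL_DATA = 3
    USABILITY_AND_HYPOTHESIS_RESEARCH = 4
    COMPETITIVE_INTELLIGENCE = 5
    EXPERT_OPINION = 6


@dataclass
class Proposal:
    name: str
    evidence: EvidenceLevel
    rationale: str


def strongest_evidence(proposals: list[Proposal]) -> Proposal:
    """Return the proposal backed by the highest-quality evidence."""
    return min(proposals, key=lambda p: p.evidence)


options = [
    Proposal("One-page checkout", EvidenceLevel.AB_EXPERIMENT,
             "Won a randomized test on checkout completion"),
    Proposal("Add trust badges", EvidenceLevel.COMPETITIVE_INTELLIGENCE,
             "Common pattern among competitors"),
]
print(strongest_evidence(options).name)  # -> One-page checkout
```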

Evidence Quality as a Design Skill

The designer’s job is not just to gather evidence — it is to actively manage its quality. That means three things:

1. Knowing what evidence to collect.

Before choosing a research method or planning a test, the question being asked needs to be precise. “Is this design better?” is not testable. “Does this flow reduce drop-off at the payment step?” is. Good evidence starts with a well-defined question — and recognizing when that question hasn’t been defined yet is itself a form of evidence management. This also means knowing when evidence is sufficient to act on. Waiting for perfect evidence can be as costly as acting on weak evidence. Part of the skill is knowing when you know enough.
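
A precise question also maps directly onto a measurable quantity. As a minimal sketch (the variable names and counts below are invented for illustration), drop-off at the payment step is simply the share of users who reach the step but never complete it:

```python
# Hypothetical funnel counts pulled from product analytics.
reached_payment = 4200    # users who arrived at the payment step
completed_payment = 2940  # users who finished paying

drop_off = 1 - completed_payment / reached_payment
print(f"Payment-step drop-off: {drop_off:.1%}")  # -> 30.0%

# "Does the new flow reduce drop-off?" then becomes a comparison of
# this one number between the current flow and the redesign.
```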

2. Understanding what evidence is telling you — and what it isn’t.

Every evidence type has limitations. Behavioral data shows what happened but not why. Usability research captures friction but not business impact. A single experiment may not hold across different audiences or time periods. Reading evidence honestly means asking what else could explain this result, where the gaps are, and what other signals would either reinforce or challenge what you’re seeing. Cross-referencing multiple levels — triangulating rather than relying on any single source — is how a partial picture becomes a more complete one.

3. Improving the quality of evidence you produce.

The level of evidence matters, but so does how well it’s executed within that level. A usability test with a high-fidelity interactive prototype will produce meaningfully more reliable results than the same test run on a static image. A survey sent to the wrong audience tells you little regardless of how well the questions are written. Within any level, there is always room to reduce bias — in how the study is designed, who participates, and what is measured.

These three things together — collecting deliberately, reading honestly, and executing rigorously — are what let designers make better decisions and own them with confidence. The skill is built by consistently asking the right questions at every stage: what are we actually trying to validate, how reliable is this signal, and what are we still assuming? The more deliberately those questions are asked, the stronger the evidence behind every decision becomes.

The Cost of Evidence

Higher-quality evidence costs more — in time, money, and organizational effort. Running a well-designed A/B experiment requires experimentation infrastructure, engineering capacity to build multiple design variations, measurement systems to track the right signals, and the analytical capability to interpret results reliably. And that's just one test. Most teams — especially early-stage ones — don't have all of this in place. That's not a failure. It's a reality that shapes how design decisions get made every day, and why designers with strong instincts are still so valued. Intuition is often the most cost-efficient evidence available.
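
Sample size is one way to see that cost concretely: even a single well-designed A/B test needs a minimum number of users before its result means anything. The sketch below uses the standard two-proportion power approximation with illustrative numbers; nothing in it is specific to any particular product or experimentation tool.

```python
from math import ceil


def sample_size_per_variant(p_base: float, p_target: float) -> int:
    """Approximate users per variant for a two-proportion z-test
    at alpha = 0.05 (two-sided) and 80% power."""
    z_alpha, z_beta = 1.96, 0.84  # standard normal quantiles
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_target - p_base) ** 2)


# Detecting a lift from 4.0% to 4.5% conversion:
print(sample_size_per_variant(0.04, 0.045))  # -> 25520 users per variant
```

Over 50,000 users across two variants, for one test of one change: a product without that traffic cannot buy Level 2 evidence quickly, no matter how disciplined the team is.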

That said, relying on lower-quality evidence is not without risk. The goal is to make deliberate tradeoffs — and the following principles can help:

1. Match evidence quality to decision impact

Not every decision needs the same level of evidence. The first step is aligning with product and business partners on how much impact a decision is expected to have — before deciding how to validate it. A change that touches a core conversion flow deserves more investment than a minor visual adjustment where designer experience is enough.
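
One lightweight way to operationalize this principle is an explicit mapping from impact tiers to minimum evidence levels. The tiers and thresholds below are purely illustrative, the kind of mapping a team would calibrate for itself:

```python
# Illustrative tiers: each maps a decision's impact to the minimum
# evidence level worth paying for (1 = strongest, 6 = weakest).
MIN_EVIDENCE_FOR_IMPACT = {
    "core conversion flow": 2,     # at least a well-designed A/B experiment
    "new feature surface": 3,      # behavioral data, e.g. a fake door test
    "navigation change": 4,        # usability research
    "minor visual adjustment": 6,  # expert opinion is enough
}


def evidence_is_sufficient(impact: str, evidence_level: int) -> bool:
    # A lower level number means stronger evidence, hence the <= comparison.
    return evidence_level <= MIN_EVIDENCE_FOR_IMPACT[impact]


print(evidence_is_sufficient("core conversion flow", 4))  # -> False
```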

2. Be cautious about evidence debt

Just as technical debt accumulates when engineering shortcuts compound over time, evidence debt builds when decisions are consistently made on weak signals. Unvalidated assumptions don’t disappear — they pile up, and correcting them downstream costs more than getting better evidence upfront. It’s worth periodically asking: where are our biggest unvalidated assumptions, and what would it take to test them?
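
Evidence debt gets easier to manage once it is written down. Below is a minimal sketch of an assumption register; the fields and entries are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    claim: str
    evidence_level: int  # per the hierarchy; 6 = expert opinion only
    at_stake: str        # what is affected if the assumption is wrong


register = [
    Assumption("Users understand the pricing tiers", 6, "signup conversion"),
    Assumption("Search is the primary discovery path", 3, "navigation redesign"),
]

# Review the riskiest debt first: weakest evidence at the top.
for a in sorted(register, key=lambda a: a.evidence_level, reverse=True):
    print(f"[Level {a.evidence_level}] {a.claim} -> at stake: {a.at_stake}")
```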

3. Build evidence infrastructure incrementally

No team needs to start at the top. A practical approach is to build capability progressively — starting from expert opinion and competitive research, then adding usability testing, then product analytics, and eventually A/B testing infrastructure. The goal is not to reach the top immediately, but to know where the team currently stands and invest intentionally in climbing over time.

Designers Can Own Decisions

Design education and the methods that define our practice have always been strongest at exploration — user journey mapping, brainstorming, workshops, and a growing toolkit of facilitation techniques, all rich and widely taught. They are exceptionally good at helping teams find possible solutions. But when it comes to choosing between them, the frameworks go quiet. Convergence is acknowledged as a step, but the criteria for making the call are left undefined — no structured way to weigh evidence, rank proposals, or defend a choice with rigor. The decision is left unanchored.

Evidence-Based Design is a direct response to that gap. It does not replace the tools designers already use—it fills the space those tools leave empty. By providing a framework for assessing the quality of evidence behind competing proposals, it gives designers something concrete to bring to decision conversations: not just a recommendation, but a defensible rationale grounded in what is actually known.

This matters most in the moments that define design impact—negotiating with business partners, defending a direction under pressure, or making the case that a proposed shortcut carries real risk. Final decisions will always be shaped by factors beyond evidence: budget, timing, technical feasibility, legal constraints. Evidence-Based Design does not change that reality. What it does is ensure that the design perspective in those conversations is grounded in evidence rather than opinion alone. That is what earns designers a genuine voice—and over time, genuine ownership—of the decisions that shape their products.

A New Standard for Design

Design as a discipline has always evolved by raising its own bar. It moved from purely aesthetic craft to user-centered practice. It developed methods for research, synthesis, and ideation that gave designers a shared language and a more credible seat at the table. Each of these shifts expanded what designers could contribute—and who they could be in an organization.

Evidence-Based Design proposes the next shift: from a discipline that generates options to one that owns decisions. Not because intuition, craft, or creative exploration lose their value—they remain essential inputs. But because in the moments that matter most, when a product direction must be chosen, a stakeholder must be persuaded, or a risk must be named, designers deserve more than a well-crafted opinion. They deserve a framework that makes their reasoning defensible, their assumptions visible, and their decisions accountable.