Metric Validation
Choosing the wrong metric is one of the most common reasons experiments produce misleading results. GrowthLab checks metric quality automatically wherever experiments are created or refined, so a weak metric gets caught before you launch. There is no separate "validate" button to remember.
Where validation happens
- During AI generation - Every generated experiment is anchored to a real target metric tied to the opportunity it serves, not a vanity number. The generation step rejects experiments whose metric is a pre-existing behavior that could move for unrelated reasons.
- In the Hypothesis Refiner - When you refine a hypothesis, the AI adds guardrail metrics (things that must not regress, e.g. "activation rate must not drop") and a method-fit note that checks whether your test method can actually detect the effect at your traffic.
- In the decision contract - The experiment's Details tab shows the contract the metric must satisfy: the metric, the ship-if threshold, the required sample size (n=), and the duration. The sample size and duration are the math that stops you from calling a result early.
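To make the decision contract concrete, here is a minimal Python sketch of its four fields. The class name, field names, and the `ready_to_decide` helper are hypothetical and are not GrowthLab's API. The sketch also assumes the ship-if threshold is a minimum relative lift, which may differ from how your contract expresses it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionContract:
    """Hypothetical model of the contract shown in the Details tab."""
    metric: str          # the primary metric, e.g. "activation rate"
    ship_if: float       # assumed: minimum relative lift needed to ship (0.05 = +5%)
    required_n: int      # required sample size (the "n=" value)
    duration_days: int   # planned run time in days

    def ready_to_decide(self, observed_n: int, days_elapsed: int) -> bool:
        # A result only counts once BOTH the sample size and the
        # duration have been reached; early peeks do not qualify.
        return observed_n >= self.required_n and days_elapsed >= self.duration_days


contract = DecisionContract(
    metric="activation rate", ship_if=0.05, required_n=4000, duration_days=14
)
print(contract.ready_to_decide(observed_n=2500, days_elapsed=14))
print(contract.ready_to_decide(observed_n=4000, days_elapsed=14))
```

Requiring both conditions is a deliberate choice in this sketch: reaching the sample size in three days still leaves out weekly cycles that the planned duration is meant to cover.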
What a good metric avoids
A good metric avoids the same pitfalls the AI screens for:
| Pitfall | What it is |
|---|---|
| Vanity | Looks impressive but doesn't correlate with business value |
| Lagging | Takes too long to move within your experiment window |
| Confounding | Could be distorted by variables outside the test |
| Poor validity | Doesn't actually measure what you think it measures |
Guardrail metrics
A guardrail is a metric you are NOT trying to improve but must not break. The Refiner suggests guardrails automatically (e.g. "refund rate must stay flat") so a win on your primary metric doesn't quietly hurt something else.
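The guardrail idea can be sketched as a simple check. Everything here is illustrative: the `Guardrail` class, its `tolerance` field, and the `breached` method are hypothetical names, and the 1% tolerance is an assumed value rather than anything the Refiner sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """Hypothetical guardrail: a metric that must not get worse."""
    metric: str
    tolerance: float              # assumed: largest harmful relative change allowed
    higher_is_better: bool = True

    def breached(self, control: float, variant: float) -> bool:
        if control == 0:
            raise ValueError("control value must be non-zero for a relative change")
        change = (variant - control) / control
        # For "higher is better" metrics a drop is harmful;
        # for "lower is better" metrics (e.g. refunds) a rise is harmful.
        harmful = -change if self.higher_is_better else change
        return harmful > self.tolerance


activation = Guardrail("activation rate", tolerance=0.01)
refunds = Guardrail("refund rate", tolerance=0.01, higher_is_better=False)

# A win on the primary metric should only ship if no guardrail is breached.
results = {
    activation.metric: activation.breached(control=0.40, variant=0.38),
    refunds.metric: refunds.breached(control=0.020, variant=0.020),
}
print(results)
```

In this example the activation guardrail is breached (a 5% relative drop exceeds the assumed 1% tolerance) while the refund guardrail holds, so the experiment would not count as a clean win.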
Sample size and duration
Every experiment carries a required sample size and duration in its decision contract. Don't decide before you hit the sample size, even if early numbers look good. The contract exists so the result maps to a real decision: ship, kill, or iterate.
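For intuition about where a required sample size comes from, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. This is a common textbook calculation, not necessarily the one GrowthLab uses. The function names and the defaults (5% two-sided significance, 80% power, two equal-sized variants) are assumptions.

```python
import math
from statistics import NormalDist


def required_sample_size(baseline: float, lift_abs: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size to detect an absolute lift in a conversion rate."""
    p1 = baseline
    p2 = baseline + lift_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / lift_abs ** 2)


def days_to_reach(n_per_variant: int, daily_visitors: int, variants: int = 2) -> int:
    """Days needed if traffic is split evenly across all variants."""
    return math.ceil(n_per_variant * variants / daily_visitors)


n = required_sample_size(baseline=0.10, lift_abs=0.02)
print(n, days_to_reach(n, daily_visitors=500))
```

Halving the lift you want to detect roughly quadruples the required sample size, which is why small expected effects on low traffic often lead to impractically long tests.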
💡 Tip: If the Refiner's method-fit note says your A/B test is underpowered for your traffic, switch to a faster method (painted door, interview, smoke test). Learning something quickly beats waiting on a slow, underpowered test.