Category research

We read 50 G2 reviews of the experiment-management category. Half complained about the same thing.


We spent an afternoon scraping G2 reviews of five major A/B testing platforms in the category. Half of the reviews across those five products said the same thing.

The premise

The experiment-management category is supposed to be solved. Optimizely has been around since 2010, VWO since 2009. Statsig and AB Tasty are well-funded and well-known. Amplitude bolted experimentation onto its analytics platform years ago.

By any reasonable measure, a growth team in 2026 should have a great option. So we went to the source. We logged into G2, the largest public review aggregator for B2B SaaS, and pulled the 10 most recent English-language reviews from each of five platforms: Optimizely Feature Experimentation, VWO Testing, AB Tasty, Amplitude Feature Experimentation, and Statsig.

That's 50 reviews from named or industry-verified buyers, mostly mid-market and enterprise. We read every word, tagged each pain point, and counted.

The finding

Twenty-five of fifty reviews, exactly half, mapped to a single underlying complaint:

The tool is engineer-shaped, but the person who's supposed to use it is not an engineer.

It showed up across every product. It showed up in different words. But the structure was identical: a growth lead, a PM, a marketing manager, or a designer signs up to run experiments, and then immediately needs an engineer's help to do anything meaningful. Worse, the analysis layer, the part where you actually learn what worked, almost always requires the engineer too.

Here's how three different reviewers, on three different products, said the same thing:

Statsig (Mid-Market customer):

"The UI and analysis workflows can feel opinionated and less flexible for deep, exploratory analysis, and advanced use cases still require engineering involvement."

Optimizely (Small business customer, on the onboarding burden):

"Users must have prior A/B testing knowledge about what's a variable, variation, and feature flag. It is essential to know what is a production environment vs development environment to make sure users are effectively using the platform."

AB Tasty (Enterprise customer, on dev dependency):

"I find the process of setting up specific tracking quite tedious as it requires communication with the AB Tasty developer team, which can be cumbersome for our team."

Three different platforms. Three different price points. One identical pain.

Why this is the category truth, not a vendor truth

You could read three reviews and call it noise. But the pattern shows up in fourteen out of fifty reviews on the silos-and-dev-dependency dimension specifically, and another eleven on the broader "this UI was built for someone whose job title isn't mine" dimension. Add them: 25 of 50.
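For anyone re-running the count, here is a minimal sketch of the tally. The two counts and the review total come from this article (five products, ten reviews each); the tag names are hypothetical shorthand, not the labels in the published taxonomy, and the sketch assumes no review is counted under both dimensions.

```python
from collections import Counter

# Five products, ten most recent reviews each (per the Sources note).
TOTAL_REVIEWS = 5 * 10

# Counts as reported in the article. Tag names are hypothetical shorthand.
tag_counts = Counter({
    "silos_and_dev_dependency": 14,
    "ui_built_for_another_role": 11,
})

# Assumption: the two dimensions do not overlap, so the counts simply add.
engineer_shaped = sum(tag_counts.values())
share = engineer_shaped / TOTAL_REVIEWS

print(f"{engineer_shaped} of {TOTAL_REVIEWS} reviews ({share:.0%})")
# prints "25 of 50 reviews (50%)"
```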

That's not vendor incompetence. Optimizely is well-funded. Statsig is well-funded. AB Tasty is well-funded. They've shipped against this for years. If the problem could be solved by shipping more, it would be solved already.

What's actually happening is that these tools were built when the buyer was an engineer. Statsig was founded by former Facebook engineers and modeled on the experimentation infrastructure they had used there. Optimizely's original wedge was JavaScript snippets that engineers dropped into pages. The interface's mental model is "feature flags + statistics console" because that's what the first customers needed. The growth-team-shaped UX got tacked on later, and it shows.

The G2 reviews are the receipts.

The second-largest cluster: pricing is structurally misaligned

Five quotes called out cost. Two were among the strongest single quotes in the entire dataset. Here's the one that should be on every comparison page in this category:

"Many users feel the pricing is high compared to the value, especially if you don't end up using all the features. It's often seen as better suited for mid-to-large companies rather than startups." (Aliya S., Enterprise reviewer of AB Tasty, via G2, 2026)

This is the single most honest line about the enterprise experimentation category we've come across on a public review site. Note the structure: it doesn't say "the price is wrong." It says "the price assumes you're using all the features, and you're probably not." This is a different complaint. It's about feature-to-use ratio, not absolute cost.

If you're paying $40,000+/year and using a fifth of the platform, the per-feature-used cost is enormous. The renewal conversation goes: "we used the visual editor for some landing page tests, we never used personalization, we never used server-side, we never used the segmentation API." The procurement person doesn't have a good answer.

This is the renewal-shock moment. It's predictable. It's structural. And it's the moment a growth team starts googling alternatives.
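To make the feature-to-use ratio concrete, here is a rough sketch. The $40,000 figure comes from the example above; the feature counts (five licensed, one used) are illustrative assumptions standing in for "a fifth of the platform", the function name is hypothetical, and treating every feature as equally valuable is a simplification.

```python
def feature_use_cost(annual_price, features_licensed, features_used):
    """Compare the price per licensed feature with the price per feature actually used."""
    if not 0 < features_used <= features_licensed:
        raise ValueError("features_used must be between 1 and features_licensed")
    return {
        "utilization": features_used / features_licensed,
        "price_per_licensed_feature": annual_price / features_licensed,
        "price_per_used_feature": annual_price / features_used,
    }

# Assumed scenario: $40,000/year contract, 5 licensed features, 1 in real use.
ratio = feature_use_cost(40_000, features_licensed=5, features_used=1)

for name, value in ratio.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions, the team pays $8,000 per licensed feature on paper but $40,000 per feature it actually uses. That five-fold gap is the number procurement ends up staring at.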

The third cluster: the statistical method ceiling

Four reviews wanted statistical methods the platform didn't have. The clearest:

"Looking at competitors, Amplitude Feature Experimentation lacks more advanced implementations of CUPED and Bonferroni correction, as well as from supporting a Bayesian inference approach." (Verified IT reviewer, Mid-Market, Amplitude, via G2)

And from Amplitude's Travel & Tourism reviewer:

"Would like it to surface interesting segment analysis automatically (AI)."

This is a quieter pattern than the engineering-shaped UX one, but it's directional. Mid-market teams are running into the statistical ceiling of the mainstream platforms and asking for things their tool doesn't do. CUPED for variance reduction. Bonferroni for multiple-comparison correction. Bayesian inference instead of null-hypothesis significance testing (NHST). AI-assisted segment surfacing.
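For readers who haven't met CUPED, here is a minimal sketch of the idea: use each user's pre-experiment value of a metric to subtract out predictable variation from the in-experiment value. The data below is simulated, the function name is hypothetical, and a production implementation would typically estimate theta on data pooled across arms; this is an illustration of the technique, not any vendor's implementation.

```python
import random
import statistics

def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x)."""
    mean_x = statistics.fmean(covariate)
    mean_y = statistics.fmean(metric)
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / statistics.variance(covariate)
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]

# Simulated users: the in-experiment metric correlates with the pre-period one.
rng = random.Random(7)
pre_period = [rng.gauss(100, 20) for _ in range(2000)]
in_experiment = [0.8 * x + rng.gauss(0, 10) for x in pre_period]

adjusted = cuped_adjust(in_experiment, pre_period)

# The mean is preserved while the variance drops, so the same experiment
# can detect smaller effects.
print("mean before/after:", statistics.fmean(in_experiment), statistics.fmean(adjusted))
print("variance before/after:", statistics.variance(in_experiment), statistics.variance(adjusted))
```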

None of these are exotic. They're table-stakes for a team that's serious about learning from experiments. The platforms that should have shipped them aren't shipping them.
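To show how compact two of these methods are, here is a sketch of a Bonferroni correction and a Beta-Binomial Bayesian comparison using only the Python standard library. The p-values, conversion counts, uniform Beta(1, 1) prior, and function names are all illustrative assumptions, not figures from the reviews.

```python
import random

def bonferroni(p_values, alpha=0.05):
    """Flag which p-values stay significant after dividing alpha by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical p-values for four metrics tested in one experiment.
# With alpha = 0.05 and four tests, the per-test threshold is 0.0125.
still_significant = bonferroni([0.004, 0.02, 0.03, 0.2])
print(still_significant)  # only the first metric survives the correction

# Hypothetical conversions: A = 120/1000, B = 150/1000.
probability = prob_b_beats_a(120, 1000, 150, 1000)
print(f"P(B beats A) is roughly {probability:.2f}")
```

The Bayesian output answers the question a growth lead actually asks ("how likely is it that B is better?") rather than the one NHST answers.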

What this means if you're running a growth program

Three things.

One. If you're a growth lead or a PM and you've felt friction trying to operate your experiment tool, you're not alone, and it's not you. Half of all G2 reviewers in this sample said the same thing. The tool isn't designed for the role you're playing. Stop trying to be a power user of an engineer's console. It's the wrong vehicle.

Two. If you're approaching renewal on Optimizely, AB Tasty, or Amplitude's experimentation tier, the Aliya S. quote is your conversation with procurement. The category is structurally over-priced for teams that don't use the full feature surface, and the public reviews now say so out loud. Use it.

Three. If you're building or evaluating an alternative, the category gap is clear. The under-served customer is the growth lead at a 50-200 person SaaS team who has a backlog but not a system, and who shouldn't need to file a ticket with engineering to learn whether last quarter's pricing test actually moved retention. The tool that wins this customer will be the one that picks growth-team-shaped UX over engineer-shaped console as its first design principle.

About this analysis

GrowthLab is an experiment-management platform built around the gap this analysis describes. Yes, this essay supports our positioning indirectly. The data is also the data: every quote is verbatim, every source is linked, and the count is what it is. The analysis can be re-run in about 45 minutes; the G2 product review pages are public.

The full 60-quote dataset, each verbatim tagged against an 11-point pain taxonomy, is here: 60 G2 reviews of the experiment-management category, every complaint tagged. The methodology, source URLs, and per-product breakdowns are all on that page.

If you're inside one of the five products and any of this resonates, we'd value a 30-minute conversation. Get in touch.

Sources: G2.com product review pages, sort: most recent, language: English, captured 2026-05-21. Optimizely Feature Experimentation, VWO Testing, AB Tasty, Amplitude Feature Experimentation, Statsig. 10 reviews per product = 50 total. GrowthBook reviews are not yet in the dataset and will be appended in a follow-up.