Why your funnel conversion rate is lying to you
- analytics
- funnels
- simpsons-paradox
- skills
A founder I met with last month walked me through their pricing-page launch — I've changed numbers and identifying details for privacy. The redesign had gone live on Monday; by Friday afternoon, conversion rate was up from 8% to 15%. Every step of the funnel was converting better. Paid signups had nearly doubled. They shipped the new design to 100% of traffic and started planning the next quarter.
Two weeks later, revenue had fallen below where it was before the redesign.
The Friday report
Here's what they were looking at:
| Step | Before | After | Δ |
|---|---|---|---|
| /pricing pageview | 5,000 | 5,000 | 0 |
| signup_started | 1,150 | 1,650 | +500 |
| signup_completed | 950 | 1,400 | +450 |
| paid_subscription | 400 | 750 | +350 |
| Conversion rate | 8.0% | 15.0% | +7.0 pts |
Same visitor count week over week. Every step converted better than the prior week. Paid signups had nearly doubled. By any standard read, this was a launch worth shipping.
Where the visitors came from
The same five days, broken out by source:
| Source | Visits before | Visits after | CVR before | CVR after |
|---|---|---|---|---|
| Paid social | 4,000 | 1,000 | 5.0% | 3.0% |
| Direct | 1,000 | 4,000 | 20.0% | 18.0% |
| Total | 5,000 | 5,000 | 8.0% | 15.0% |
Direct visitors converted at 20% before the redesign and 18% after. Paid social converted at 5% and then 3%. Both worse. The 7-point jump in the headline came from how the mix shifted that week: paid spend was paused for QA, an organic launch wave landed, and direct grew from 20% of the funnel to 80% of it.
Every type of visitor was performing worse with the new design than the old one. The headline had only gone up because that week's visitors were a different mix from the previous week's, not because the funnel had actually changed for the better.
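The arithmetic behind the two tables can be checked in a few lines. This is a minimal Python sketch that uses only the visits and rates from the per-source table; the `blended_cvr` helper and the dictionary layout are illustrative names, not part of any tool.

```python
# Per-source (visits, conversion rate), taken from the per-source table.
before = {"paid_social": (4000, 0.05), "direct": (1000, 0.20)}
after = {"paid_social": (1000, 0.03), "direct": (4000, 0.18)}


def blended_cvr(segments):
    """Top-line CVR: total conversions divided by total visits."""
    visits = sum(v for v, _ in segments.values())
    conversions = sum(v * rate for v, rate in segments.values())
    return conversions / visits


print(f"before: {blended_cvr(before):.1%}")  # before: 8.0%
print(f"after:  {blended_cvr(after):.1%}")   # after:  15.0%
```

Both per-source rates are lower in `after`, yet the blended rate is higher, because 80% of the visits now come from the source that converts at 18% rather than the one that converts at 3%.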
Simpson's paradox
Formalized by Edward Simpson in 1951, the paradox describes an aggregate rate moving in a direction none of its subgroups did. The aggregate is a weighted average of its parts. Shift enough traffic from low-converting sources to high-converting ones and it can rise even when every individual source falls.
That's exactly what happened to the team's funnel. Both segment rates fell two points; the share of high-converting traffic quadrupled; the headline came out seven points higher than the week before.
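One way to see how much of the seven points came from mix and how much from the funnel itself is to compute a counterfactual: the new traffic mix with the old per-source rates. The sketch below does this for the numbers in the per-source table. The function names are illustrative, and this is a standard mix-versus-rate decomposition rather than a description of how any particular tool works.

```python
# Visits and conversion rates per source, from the per-source table.
before = {"paid_social": (4000, 0.05), "direct": (1000, 0.20)}
after = {"paid_social": (1000, 0.03), "direct": (4000, 0.18)}


def shares(segments):
    """Fraction of total visits contributed by each source."""
    total = sum(visits for visits, _ in segments.values())
    return {name: visits / total for name, (visits, _) in segments.items()}


def weighted_rate(mix, rates):
    """Blended CVR for a given traffic mix and set of per-source rates."""
    return sum(mix[name] * rates[name] for name in mix)


rates_before = {name: rate for name, (_, rate) in before.items()}
rates_after = {name: rate for name, (_, rate) in after.items()}

cvr_before = weighted_rate(shares(before), rates_before)   # 8%
cvr_after = weighted_rate(shares(after), rates_after)      # 15%
# Counterfactual: the new traffic mix with the old per-source rates.
cvr_mix_only = weighted_rate(shares(after), rates_before)  # 17%

mix_effect = cvr_mix_only - cvr_before   # change from who showed up
rate_effect = cvr_after - cvr_mix_only   # change from how they converted

print(f"mix effect:  {mix_effect * 100:+.1f} pts")   # mix effect:  +9.0 pts
print(f"rate effect: {rate_effect * 100:+.1f} pts")  # rate effect: -2.0 pts
```

With the old design and the new mix, the headline would have been 17%. The redesign cost two points, and the mix shift added nine. In general the split depends on which baseline is held fixed; here both orderings give the same answer because both sources fell by the same two points.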
The trap runs in both directions.
False positive (the team's case). A regression hiding behind a mix shift in your favor — top-line up while every segment converts worse. Ship the change and find out two weeks later when the mix reverts.
False negative. A pricing page converting at a "bad" 3.8% top-line that actually has direct visitors at 12%, organic at 8%, email at 15%, and one channel (say paid social at 0.1%) dragging the average down. The team launches a redesign sprint, breaks what was working for the high-intent audience, and the headline barely moves because the dragging channel is most of the traffic. The flagship post in this series walks the false-negative direction end-to-end.
The rule that catches both
Don't read a change in top-line conversion rate as real until each individual segment has moved in the same direction.
Manually: split by source. Then device. Then new vs returning. If every split moves with the headline, a mix shift along those dimensions is not what's driving it, and the move is very likely real. If any split contradicts the headline, you have a mix-shift hypothesis to investigate before shipping.
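The manual check can be sketched as a small function. This is a hedged illustration in Python, not the implementation of any skill or library: `contradicting_segments` is a hypothetical name, and the input shape (segment name mapped to visits and conversions) is an assumption about how you export your data.

```python
def contradicting_segments(before, after):
    """Return the segments whose CVR moved against the top-line CVR.

    `before` and `after` map segment name -> (visits, conversions).
    """
    def cvr(visits, conversions):
        return conversions / visits if visits else 0.0

    def topline(segments):
        visits = sum(v for v, _ in segments.values())
        conversions = sum(c for _, c in segments.values())
        return cvr(visits, conversions)

    headline_delta = topline(after) - topline(before)
    contradictions = []
    for name in before.keys() & after.keys():
        delta = cvr(*after[name]) - cvr(*before[name])
        # Opposite signs mean the segment disagrees with the headline.
        if delta * headline_delta < 0:
            contradictions.append(name)
    return sorted(contradictions)


# Source split from the founder's example: (visits, conversions).
before = {"paid_social": (4000, 200), "direct": (1000, 200)}
after = {"paid_social": (1000, 30), "direct": (4000, 720)}
print(contradicting_segments(before, after))  # ['direct', 'paid_social']
```

An empty list means no segment in that split contradicts the headline. Run it once per dimension (source, device, new vs returning). The sketch ignores sample size, so tiny segments should be filtered out before trusting the result.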
What we built to catch this
Clamp publishes an open-source pack of analytics skills for Claude Code. The one for this trap is channel-and-funnel-quality. It runs the per-segment check on every funnel CVR change before the agent reports anything, plus three more checks that are easy to skip when doing the work by hand:
- Sample-size discipline. Per-segment CVRs computed on fewer than ~300 visits are flagged as noise and excluded from the verdict. A 12-point swing on 40 visits is not a swing.
- Splits beyond source. Mix shifts happen along device (mobile vs desktop) and new vs returning. After a launch wave, the new-vs-returning split is often as informative as source.
- "Direct" credit attribution. When a paid campaign or content push is running, some of its visitors land in the analytics as "direct": they typed the URL, came back later, or copy-pasted a link. If direct surges the same week as a campaign launches, the change is partly borrowed credit. The traffic-change-diagnosis skill has the full pattern.
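To make the sample-size bullet concrete, here is a minimal Python sketch of that kind of gate. It is not the skill's actual implementation: the function name, the input shape, and the device numbers are all hypothetical, and the only value taken from the text is the ~300-visit threshold.

```python
MIN_VISITS = 300  # threshold from the sample-size bullet; tune to your traffic


def split_by_sample_size(before, after, min_visits=MIN_VISITS):
    """Partition segments into usable ones and ones too small to judge.

    `before` and `after` map segment name -> (visits, conversions).
    A segment is usable only if it clears the threshold in both periods.
    """
    usable, too_small = [], []
    for name in sorted(before.keys() & after.keys()):
        if min(before[name][0], after[name][0]) >= min_visits:
            usable.append(name)
        else:
            too_small.append(name)
    return usable, too_small


# Hypothetical device split; these numbers are illustrative only.
before = {"desktop": (3200, 310), "mobile": (1760, 88), "tablet": (40, 2)}
after = {"desktop": (3100, 420), "mobile": (1860, 320), "tablet": (40, 7)}
usable, too_small = split_by_sample_size(before, after)
print(usable)     # ['desktop', 'mobile']
print(too_small)  # ['tablet']
```

In this made-up data the tablet segment swings from 5% to 17.5% on 40 visits, which is exactly the kind of movement that should be left out of the verdict rather than counted as agreeing or disagreeing with the headline.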
Run the founder's example through the skill and the per-segment check fails on the first comparison. The agent returns "the +7 points came from the visitor mix changing, not from the redesign," with the per-source table as evidence. The team rolls back on Friday afternoon instead of finding out two weeks later.
Try it
The skill pack installs with one command. It works with Claude Code, Claude Desktop, and any MCP-compatible client, against any analytics source (GA4, Mixpanel, Amplitude). With Clamp MCP, the agent gets typed tool calls for funnels, segment splits, and channel breakdowns.
/plugin marketplace add clamp-sh/analytics-skills
/plugin install analytics-skills@clamp-sh
Then the one-time profile setup so subsequent answers are calibrated to your model and traffic scale:
/analytics-skills:analytics-profile-setup
If you don't yet have analytics an agent can query, Clamp Analytics gives you the data layer. The free tier covers 100K events/month, which is enough to instrument an entire early-stage app.
Skill source: github.com/clamp-sh/analytics-skills. Methodology background: the flagship post.
FAQ
Where else does Simpson's paradox show up in analytics?
Anywhere a rate is computed as a weighted average across subgroups. A/B test winners that are mix-shift artifacts. Retention curves that look stable in aggregate while a falling cohort is replaced by a healthier new one. Churn rates that "improve" when the high-churn cohort just left. Ad campaign ROI that looks healthy blended but is dragged by one expensive segment. The fix is always the same: split before reading, check sample size per subgroup, check direction per subgroup.
Does this matter for A/B tests?
It matters more for A/B tests, not less. Randomizing at the visitor level keeps the traffic mix balanced between arms, but the measured lift is still an average over whoever showed up during the test. If the test launched the same week as something that shifted traffic mix, a variant can win on the blended number while losing in the segment that makes up most of your normal traffic, and the result won't hold once the mix reverts. Run every A/B test result through a per-segment check before calling a winner. The experiment-result-reader skill enforces this automatically.
What if I only have the top-line number?
Then you don't have enough to spot this. The minimum data you need is the conversion rate split by at least source and device. Without that split, you can't tell whether a top-line move is real or mix.