How we taught Claude Code to diagnose analytics like a senior analyst

  • analytics
  • skills
  • mcp
  • claude-code

Last week our analytics said sessions dropped 26%. We almost launched a paid campaign to chase it. The drop turned out to be a Hacker News post decaying out of the comparison window — meanwhile branded organic search was up 14%, the real signal of what the launch did. Three AI agents we tried gave us five-bullet listicles of "things to check." The senior analyst we eventually consulted asked one question — "is last week's launch still in the comparison period?" — and the diagnosis was done in 90 seconds.

This post is about what that 90 seconds actually contains, and how we built it into the analytics tool itself so the next traffic drop doesn't burn an afternoon.

Why every "AI analytics" tool gives generic answers

The default AI analytics workflow is what senior analysts call the dashboard hunt: load every metric you can find, interpret each one in isolation, list possible causes for each, return the union as a recommendation. It produces output like:

"Sessions are down 26% week-over-week. Possible causes: (1) seasonality, (2) algorithm update, (3) site changes, (4) campaign budget, (5) tracking issues. Recommend reviewing each."

This looks like analysis. It isn't. There's no triangulation, no sample-size check, no measurement-first hygiene, no prioritization by likelihood. It treats the question — what happened? — as retrieval rather than investigation.

The reason this happens is structural. LLMs are good at retrieving and summarizing. They aren't, by default, good at the slow, boring discipline that makes analytics correct: check the cheap things first, narrow before broadening, separate signal from noise with sample-size discipline, and present the answer as one diagnosed cause with confidence rather than a list of guesses.

A senior analyst's first move on a traffic-drop question isn't to look at sessions. It's to ask whether the drop is real at all. Did tracking change? Did a launch fall out of the comparison window? Is the sample size big enough to trust the rate? Three checks, two minutes. By the time most AI agents are at "let me run a few queries," the analyst has already ruled out 60% of the hypothesis tree.

The methodology, encoded

We took the methodology that makes a senior analyst correct and encoded it as Claude Skills — six model-invoked agent skills the LLM loads automatically when an analytics question arrives. The thesis: methodology is the moat. Anyone can build an MCP server that exposes analytics queries; the differentiator is what the agent does with those queries.

The methodology has five non-obvious pieces.

1. MECE hypothesis trees (Mutually Exclusive, Collectively Exhaustive, developed by Barbara Minto at McKinsey in the 1960s and formalized in The Minto Pyramid Principle). Before running any query, split the possible causes so they don't overlap and no major branch is missed. For a traffic-change question:

Traffic change
├── Measurement (tracking regression, bot filter, attribution shift)
├── Time shape (cliff = discrete event; ramp = campaign/SEO; spike = bot/viral)
├── Channel (one moved, all moved proportionally, or mix shift)
└── Cohort (new vs returning, geography, device)

Walk in order, cheap checks first. Most "dropped traffic" questions resolve at the first branch.
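
Concretely, the spine skill treats this as an ordered checklist rather than prose. A minimal sketch of that structure, assuming hypothetical check functions that stand in for real queries against your analytics source:

```python
# Sketch only: branch names mirror the tree above; the check functions are
# hypothetical stand-ins for real queries against your analytics source.
from typing import Callable, Optional

Check = Callable[[], Optional[str]]  # returns a finding, or None if ruled out

def measurement_check() -> Optional[str]:
    # Stand-in: diff the deploy log, test pageview/session ratio stability.
    return None

def time_shape_check() -> Optional[str]:
    # Stand-in: classify the daily series as cliff, ramp, or spike.
    return None

def channel_check() -> Optional[str]:
    # Stand-in: split the delta by channel, flag concentration.
    return "97% of the net drop is one referral source"

def cohort_check() -> Optional[str]:
    return None

# Cheapest checks first; branches don't overlap, and together they cover
# the major causes of a traffic change.
HYPOTHESIS_TREE: list[tuple[str, Check]] = [
    ("measurement", measurement_check),
    ("time shape", time_shape_check),
    ("channel", channel_check),
    ("cohort", cohort_check),
]

def diagnose(tree: list[tuple[str, Check]]) -> str:
    """Walk the branches in order; stop at the first one that explains the change."""
    for name, check in tree:
        if (finding := check()) is not None:
            return f"{name}: {finding}"
    return "no branch explains the change; the tree is missing a hypothesis"

print(diagnose(HYPOTHESIS_TREE))  # channel: 97% of the net drop is one referral source
```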

2. Sample-size discipline. Below ~300 observations per bucket, most "changes" are noise. A rate computed on 50 sessions is statistically meaningless and the agent should refuse to interpret it. The Skills check this before any benchmark comparison.
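
A sketch of the guard, assuming the 300-observation rule of thumb; the confidence interval is a standard normal approximation, included only to make "noise" concrete:

```python
# Sketch: refuse to interpret rates computed on too-small buckets.
import math

MIN_OBSERVATIONS = 300  # the methodology's rough threshold

def interpret_rate(successes: int, n: int) -> str:
    if n < MIN_OBSERVATIONS:
        return f"n={n} < {MIN_OBSERVATIONS}: refusing to interpret this rate"
    p = successes / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # 95% CI half-width
    return f"{p:.1%} ± {half_width:.1%}"

print(interpret_rate(3, 50))      # refuses: a rate on 50 sessions is noise
print(interpret_rate(326, 8600))  # 3.8% ± 0.4%
```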

3. The quality matrix for channels. Channels can't be ranked on a single dimension; score volume × engagement × conversion as a matrix instead. A channel with high volume and zero engagement is vanity traffic. A channel with low volume and high conversion is probably underinvested. Most "best channel" rankings pick the wrong channel because they sort by a single column.
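
A sketch of the matrix as code; the thresholds are illustrative assumptions, not the Skills' calibrated values:

```python
# Sketch: place each channel in the volume x engagement x conversion matrix.
# Thresholds are illustrative; the real Skills calibrate against your profile.
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    sessions: int            # volume
    engagement_rate: float   # engagement
    conversion_rate: float   # conversion

def classify(c: Channel) -> str:
    high_volume = c.sessions >= 1_000
    engaged = c.engagement_rate >= 0.50
    converting = c.conversion_rate >= 0.02
    if high_volume and not engaged and not converting:
        return "vanity traffic"
    if not high_volume and converting:
        return "probably underinvested"
    if high_volume and converting:
        return "core channel"
    return "monitor"

for ch in (Channel("paid social", 2000, 0.20, 0.001),
           Channel("email", 400, 0.80, 0.150)):
    print(f"{ch.name}: {classify(ch)}")
# paid social: vanity traffic
# email: probably underinvested
```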

4. Benchmark calibration with population, year, and definition. Quoting "70% bounce rate is bad" without checking that you're talking about GA4 (which redefined bounce in 2023 — engaged sessions are >10s OR have a key event OR have 2+ pageviews; bounce rate = 1 − engagement rate) on a docs page (where 70% engagement is healthy) is the kind of error most AI tools ship every day. The Skills carry the qualifiers and refuse to compare across definitions.
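
To make the definition concrete, here is the GA4 arithmetic as a sketch (the session tuples are invented example data; the engaged-session rule is the one GA4 documents):

```python
# Sketch of the GA4 definitions as arithmetic.
def is_engaged(duration_s: float, key_events: int, pageviews: int) -> bool:
    """GA4 engaged session: >10s, OR a key event, OR 2+ pageviews."""
    return duration_s > 10 or key_events >= 1 or pageviews >= 2

def ga4_bounce_rate(sessions: list[tuple[float, int, int]]) -> float:
    """GA4 bounce rate = 1 - engagement rate (not UA's single-page definition)."""
    engaged = sum(is_engaged(*s) for s in sessions)
    return 1 - engaged / len(sessions)

# 7 quick single-page reads, 3 deep sessions: a 70% "bounce" by the GA4
# definition, which is not comparable to 70% measured under Universal
# Analytics or on a different page population.
sessions = [(8.0, 0, 1)] * 7 + [(45.0, 1, 3)] * 3
print(f"{ga4_bounce_rate(sessions):.0%}")  # 70%
```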

5. Fingerprint library for common failure modes. Tracking regressions, bot spikes, deploy-correlated drops, SEO decay, campaign ramps. Each has a recognizable shape in the data. Trained agents check the fingerprints before generating speculation.
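
A rough sketch of what one fingerprint check can look like, with illustrative thresholds rather than the Skills' actual heuristics:

```python
# Sketch: classify the shape of a daily traffic series. Thresholds and the
# example series are illustrative assumptions.
def shape(daily: list[float]) -> str:
    baseline = sorted(daily)[len(daily) // 2]  # median as baseline
    peak = max(daily)
    if peak > 3 * baseline and 0 < daily.index(peak) < len(daily) - 1:
        return "spike: check referrers and user agents (bot or viral)"
    if daily[-1] < 0.6 * daily[0]:
        day_drops = [a - b for a, b in zip(daily, daily[1:])]
        if max(day_drops) > 0.3 * daily[0]:
            return "cliff: check deploys and tracking (discrete event)"
        return "ramp down: check rankings and campaign end dates (SEO decay)"
    return "no obvious fingerprint"

print(shape([100, 102, 98, 101, 45, 42, 40]))  # cliff
print(shape([100, 95, 88, 80, 72, 66, 58]))    # ramp down
```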

The Skills bundle this as analytics-diagnostic-method (the spine) plus four specialists: traffic-change-diagnosis, channel-and-funnel-quality, metric-context-and-benchmarks, experiment-result-reader. Each specialist builds on the spine. A sixth, analytics-profile-setup, runs once and captures your business context (model, primary conversion, traffic range, ICP) into a local file so subsequent answers calibrate to your industry instead of cross-industry averages.

Worked example: the phantom traffic drop

Hypothetical B2B SaaS at ~80k monthly sessions. Last week's numbers:

| Metric    | Last 7d | Prior 7d | Change |
|-----------|---------|----------|--------|
| Sessions  | 18,400  | 24,800   | -26%   |
| Visitors  | 12,200  | 14,800   | -18%   |
| Pageviews | 41,000  | 58,600   | -30%   |

A naive read: traffic fell off a cliff, ship a paid campaign, write a post-mortem.

The Skills-led read, in order:

Measurement check first. Has tracking changed? Pull the deploy log: no analytics-touching deploys in the window. Pageview-to-session ratio is stable (2.2 vs 2.4). The tracking is intact; the change is real behaviour.
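
The ratio check itself is one line of arithmetic, shown here with the example's numbers:

```python
# Ratio-stability check: if pageviews per session hold steady while the
# totals move, the tracking pipeline is probably intact.
last_pv, last_sessions = 41_000, 18_400
prior_pv, prior_sessions = 58_600, 24_800

last_ratio = last_pv / last_sessions     # 2.23
prior_ratio = prior_pv / prior_sessions  # 2.36
drift = abs(last_ratio - prior_ratio) / prior_ratio
print(f"{last_ratio:.1f} vs {prior_ratio:.1f} pages/session, drift {drift:.0%}")
# 2.2 vs 2.4 pages/session, drift 6%: treat the change as real behaviour
```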

Time shape. Pageviews dropped harder than visitors (-30% vs -18%). Same audience, fewer pages per session — engagement-side change is part of the story too. Note that and keep going.

Channel split:

| Channel        | Last 7d | Prior 7d | Δ      |
|----------------|---------|----------|--------|
| Organic search | 8,000   | 7,400    | +600   |
| Referral       | 6,800   | 13,000   | -6,200 |
| Paid           | 2,800   | 3,000    | -200   |
| Direct         | 800     | 1,400    | -600   |

Referral accounts for 6,200 of the 6,400 sessions lost in total. The drop is concentrated in one channel — the rest of the site is essentially flat.
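
The concentration check is just as mechanical; here it is with the table's numbers:

```python
# Concentration check: what share of the net change does each channel explain?
channels = {  # channel: (last_7d, prior_7d)
    "organic search": (8_000, 7_400),
    "referral": (6_800, 13_000),
    "paid": (2_800, 3_000),
    "direct": (800, 1_400),
}
net = sum(last - prior for last, prior in channels.values())  # -6,400
for name, (last, prior) in channels.items():
    delta = last - prior
    print(f"{name:15s} {delta:+6,d}  ({delta / net:4.0%} of net change)")
# referral explains 97% of the net drop: drill into referrer host next
```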

Drill into referrer host:

| Referrer             | Last 7d | Prior 7d | Δ       |
|----------------------|---------|----------|---------|
| news.ycombinator.com | 200     | 11,200   | -11,000 |
| All other referral   | 6,600   | 1,800    | +4,800  |

Hacker News fell 11,000 sessions; the rest of referral grew 4,800. The "drop" is one HN post decaying out of its first week — exactly the shape a launch-list referral produces.

Branded search. Pull a breakdown of organic search by query intent. Branded queries (containing the product name) are +14%. Non-branded is flat. The launch produced a step-change in brand awareness; people are coming back via Google after their first HN visit.

The diagnosis, in one sentence: Sessions excluding HN are up, the headline drop is a single-source decay, and the launch is downstream-converting into branded organic. Don't react.

The decision a dashboard couldn't reach: keep investing in compounding channels. Don't launch a paid campaign to "make up for the drop" — there is no drop in real demand. Re-measure in two weeks to see whether the branded-search lift sustains.

The numbers were all there. They just weren't arranged in the order that produces the answer.

Worked example: the funnel that's actually fine

Same hypothetical SaaS, different question: "Our pricing-to-paid funnel converts at 3.8%. Should we redesign /pricing?"

Funnel headline:

| Step               | Count | Conversion |
|--------------------|-------|------------|
| /pricing pageview  | 8,600 | 100%       |
| signup_started     | 1,890 | 22%        |
| signup_completed   | 1,548 | 18%        |
| paid_subscription  | 326   | 3.8%       |

3.8% sits below the rough B2B SaaS pricing-page benchmark of 5-8%. Naive prescription: A/B test the CTA, redesign the table, hire a CRO consultant.

The Skills-led read asks: for which segment? Quality matrix split by source:

| Source               | /pricing visits | Paid customers | CVR   | LTV/CAC*     |
|----------------------|-----------------|----------------|-------|--------------|
| Direct               | 1,200           | 144            | 12.0% | healthy      |
| Organic search       | 1,000           | 80             | 8.0%  | healthy      |
| Email                | 400             | 60             | 15.0% | healthy      |
| Hacker News referral | 4,000           | 40             | 1.0%  | break-even   |
| Paid social          | 2,000           | 2              | 0.1%  | catastrophic |

* LTV/CAC math assumes a $50/month plan and $600 LTV. Paid social at $2 CPC × 2,000 visits = $4,000 spent for 2 customers = $2,000 CAC, for an LTV/CAC of 0.3, an order of magnitude below the 3:1 threshold.

Two segments do the work: direct + organic + email = 2,600 visits and 284 customers, blended 10.9%. Two segments drag the average: HN + paid social = 6,000 visits and 42 customers, blended 0.7%.
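
The mix-shift arithmetic, reproduced from the table as a sketch:

```python
# Mix-shift arithmetic with the table's numbers.
segments = {  # source: (pricing_visits, paid_customers)
    "direct": (1_200, 144),
    "organic search": (1_000, 80),
    "email": (400, 60),
    "hn referral": (4_000, 40),
    "paid social": (2_000, 2),
}

def blended_cvr(names: list[str]) -> float:
    visits = sum(segments[n][0] for n in names)
    customers = sum(segments[n][1] for n in names)
    return customers / visits

print(f"headline: {blended_cvr(list(segments)):.1%}")                            # 3.8%
print(f"fit traffic: {blended_cvr(['direct', 'organic search', 'email']):.1%}")  # 10.9%
print(f"cold traffic: {blended_cvr(['hn referral', 'paid social']):.1%}")        # 0.7%

# Paid-social unit economics from the footnote above:
ltv, cpc, visits, customers = 600, 2, 2_000, 2
cac = cpc * visits / customers       # $2,000 per customer
print(f"LTV/CAC = {ltv / cac:.1f}")  # 0.3, far below the 3:1 threshold
```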

The diagnosis: the pricing page works. It converts the audience that matches its assumptions — high-intent visitors who already know what the product is — at 8-15%, well above benchmark. The 3.8% headline is a mix-shift artifact, dragged by 70% of the traffic the pricing page was never designed to convert.

Three Monday-actionable decisions:

  1. Cut paid social to /pricing. Two customers from $4,000 spent isn't an A/B-test problem; it's a channel-fit problem. The 0.1% CVR isn't going to triple with a button change.
  2. Build /from/hacker-news as a cold-cohort landing page. HN converts at 1% on /pricing because the page assumes context HN visitors don't have. If the cold-cohort page lifted HN to even 3%, that's 80 additional customers at zero acquisition cost.
  3. Stop optimizing /pricing for the headline number. Direct at 12%, organic at 8%, email at 15%: the page is healthy for the audience it's designed for. A redesign optimizing for the average risks breaking what works.

The headline number led one direction. The matrix led somewhere completely different. That's the difference between a dashboard and a diagnosis.

Try it

Six skills, one install. Works with Claude Code, Claude Desktop, and any MCP-compatible client. MIT-licensed and analytics-source-agnostic — the methodology applies to GA4, Mixpanel, Amplitude, anything — with first-class Clamp MCP integration so the agent knows exactly which tool to call at each step of the method.

```bash
/plugin marketplace add clamp-sh/analytics-skills
/plugin install analytics-skills@clamp-sh
```

Then run the one-time profile setup so every subsequent answer is calibrated to your business:

```bash
/analytics-skills:analytics-profile-setup
```

If you don't yet have analytics an agent can query, Clamp gives you the data layer in two minutes. The free tier covers 100K events/month, enough for most early-stage SaaS to instrument an entire app.

Skill source and changelog: github.com/clamp-sh/analytics-skills. Reference for the full skill catalog and what each one does: /docs/skills.

FAQ

Does this work without Clamp?

Yes. The Skills are analytics-agnostic. Use them against GA4, Mixpanel, Amplitude, PostHog — the methodology applies anywhere. With Clamp connected via MCP, the agent has typed tool calls and skips the tool-name guessing step.

Is this just prompt engineering?

The Skills are agent skills following the open Agent Skills spec. They're loaded by the model based on the question and constrain the agent's procedure — not just style or tone. The diagnostic-method skill, for example, forces a measurement-first check before any speculation about behaviour change.

Can my agent answer "is 2% conversion good" correctly?

With Skills: yes — it asks for the model (B2B SaaS vs ecom vs lead gen), checks sample size, quotes the benchmark with population/year/source, and connects to LTV/CAC if known. Without Skills: the agent will quote a cross-industry average that's almost certainly wrong for your business.

How much data do I need before this is useful?

The sample-size threshold for trusting a rate is roughly 300 observations per bucket. Below that, the Skills will say so and refuse to interpret. For absolute counts and trend lines, smaller is fine — and the methodology of "measurement first, narrow before broaden, one hypothesis not five" applies at any scale.
