AI · February 25, 2026 · 10 min read

Our AI Agents Ran $10k+ of Data Analysis for $253

We built three agentic tools that run 1,059 queries across a research dataset, score 475 findings, & produce a publishable deck. Cost: $253. Time: under three hours.

AI · Agents · Engineering · Data

Matthew Warneford

CEO

[Image: HMM Algeciras container ship loaded with thousands of colourful shipping containers entering port]

Three AI tools ran 1,059 queries across 45 behavioural areas in our youth trends dataset, surfaced 475 unique findings, & produced a designed slide deck. Total cost: $253. Total time: 2 hours 40 minutes.1 That's a 39x reduction in cost.

This isn't the first time this has happened. In 1956, loading a ship by hand cost $5.86 per ton. The shipping container dropped it to $0.16, a 36x reduction. That single ratio made it cheaper to manufacture goods 8,000 miles away, restructured the global economy, and lifted 800m people out of poverty.

The compression ratio for AI-driven intellectual work is almost identical. I don't know what that means for the global economy, but I can tell you how we did it.

70 of those findings were headline-worthy. Not because previous analysis was wrong, but because a human team would never run 1,059 exploratory cross-tabs. The economics don't allow it. AI changes that.

When the agents sliced data across four or more dimensions simultaneously (country × age group × parental trust × sharing behaviour), nearly 70% of results were genuinely surprising.

These are the first in a series of agentic research tools we're building, initially to interrogate our own data, & increasingly to help clients do the same with theirs. This post is about how they work.

The output: Five Platforms, Five Different Jobs


Why most data goes unexplored

An analyst exploring a dataset picks the most promising angles first, tests a handful of hypotheses, & stops when time runs out.

1,059 queries at 10 minutes each is 176 hours. A month of analyst time. Over $10k.

So what normally happens: the team runs ~50 queries, writes up the strongest findings, & moves on. The other 1,009 questions never get asked. AI removes that constraint. It doesn't replace the analyst; it removes the ceiling on how many questions they can afford to ask.

The tools

We built three agentic tools, each with a specific job & a strict scope.2

AI Analyst

The AI Analyst takes a free-form research question & finds relevant data. It searches the survey database by meaning, not by schema: vector embeddings map "video consumption" to questions about TikTok, YouTube, & Netflix without knowing the question IDs.3

The search is two-stage. Vector similarity casts a wide net of possible matches. An LLM then ranks them by actual relevance to the research question & cuts to the best.
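The two stages can be sketched in a few lines. Everything here is illustrative: toy in-memory vectors stand in for the embedding index, and the `llm_score` callback stands in for the LLM reranking call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def wide_net(query_vec, question_vecs, k=50):
    """Stage 1: vector similarity casts a wide net of candidate questions."""
    ranked = sorted(question_vecs,
                    key=lambda qid: cosine(query_vec, question_vecs[qid]),
                    reverse=True)
    return ranked[:k]

def rerank(research_question, candidates, llm_score, keep=5):
    """Stage 2: an LLM scores each candidate's actual relevance & cuts to the best."""
    return sorted(candidates,
                  key=lambda qid: llm_score(research_question, qid),
                  reverse=True)[:keep]
```

The point of the split is that stage 1 is cheap recall and stage 2 is expensive precision: only the wide-net survivors ever reach the LLM.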

Once it has the right questions, it writes SQL. It knows the schema: question types (multi-select, single-choice, grid, frequency) & how items, variants, & answer codes work. It executes, interprets, & records findings.

Every finding passes through a multi-gate validation before it's recorded. An LLM parses the SQL to extract which questions, items, & answer codes it references. A programmatic check validates those references against the database. Then the SQL is re-run & an LLM checks whether the results actually support the claimed finding.4

One of the most common errors: querying a multi-variant question without specifying which variant, silently mixing data from different platforms. It looks plausible in the output; you'd only catch it by reading the SQL.5
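The gate sequence can be sketched as a small pipeline. The `extract_entities`, `verify_claim`, & `fix_sql` callbacks stand in for the LLM calls; the retry budget matches footnote 5. A sketch of the pattern, not the production validator.

```python
def validate_finding(finding, db, extract_entities, verify_claim, fix_sql,
                     max_attempts=3):
    """Run a finding through the three validation gates, retrying up to
    max_attempts times before abandoning it."""
    sql = finding["sql"]
    for _ in range(max_attempts):
        refs = extract_entities(sql)                      # gate 1: LLM parses the SQL
        if not all(r in db["known_refs"] for r in refs):  # gate 2: programmatic check
            sql = fix_sql(sql)
            continue
        rows = db["run"](sql)                             # gate 3: re-run the query...
        if verify_claim(finding["claim"], rows):          # ...LLM checks support
            return True
        sql = fix_sql(sql)
    return False                                          # abandoned
```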

Deep Research

The Deep Research tool is the orchestrator. It maps the full question space first (~20% of budget), then dispatches the Analyst into each area (~80%).6

The separation is driven by context windows. If one agent is both mapping the territory & running SQL, it burns context on query results & validation errors. Splitting them means the orchestrator stays focused on "where to look next" & the Analyst gets a fresh context window for each area.

The orchestrator maps the territory. The executor explores it. Separate them, or one agent burns its context doing both.

The harder problem was knowing when to move on. The Analyst exploring "TikTok usage by age" needs to recognise when to stop splitting by age & try a different behavioural cut, like device type or time of day. Two signals: the findings stop being interesting (two consecutive low-value results means switch dimension, three means stop the area entirely), or the sample sizes get too small to split further.7

Agents don't need to be told what to do. They need to be told when to stop.
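The stopping rules above reduce to a small decision function. The thresholds come straight from the post & footnote 7; routing a too-small sub-group to "switch dimension" is my reading of "too small to split further".

```python
LOW_VALUE_SWITCH = 2   # two consecutive low-value results: switch dimension
LOW_VALUE_STOP = 3     # three: stop the area entirely
MIN_SAMPLE = 200       # sub-groups below this are not split further (footnote 7)

def next_action(recent_scores, sample_size):
    """Decide whether to keep splitting, switch dimension, or finish the area."""
    if sample_size < MIN_SAMPLE:
        return "switch_dimension"
    streak = 0
    for s in reversed(recent_scores):   # count the trailing low-value run
        if s == "low":
            streak += 1
        else:
            break
    if streak >= LOW_VALUE_STOP:
        return "area_complete"
    if streak >= LOW_VALUE_SWITCH:
        return "switch_dimension"
    return "continue"
```

Encoding the rule as hard thresholds matters: footnote 7 notes a softer "when to stop" guideline was routinely ignored.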

Report Designer

A good report should surprise the reader & help them make smarter decisions. The Report Designer surfaces the findings most likely to do both, scoring 475 data points into four buckets: headline, interesting, context, skip. Batches of 15, scored in parallel.8

Bucket                                      Count   Share
Headline: challenges assumptions               70   14.7%
Interesting: unexpected direction/magnitude   174   36.7%
Context: factually useful, not surprising     147   31.0%
Skip: obvious                                  84   17.7%

51.4% rated interesting or better.
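The batched parallel scoring is a natural fit for `asyncio.gather`. In this sketch `score_batch` stands in for a Sonnet call; batching 15 findings per call lets the model calibrate grades within a batch (footnote 8).

```python
import asyncio

BUCKETS = ("headline", "interesting", "context", "skip")

async def score_all(findings, score_batch, batch_size=15):
    """Split findings into batches of 15 & score all batches in parallel.
    score_batch(batch) -> list of bucket labels, one per finding."""
    batches = [findings[i:i + batch_size]
               for i in range(0, len(findings), batch_size)]
    scored = await asyncio.gather(*(score_batch(b) for b in batches))
    return [bucket for batch in scored for bucket in batch]
```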

Then a tension pass. The scorer identifies contradiction pairs: findings where direction & magnitude conflict, or where a dominant platform is absent from an expected behaviour. "High awareness but near-zero usage" is more interesting than either stat alone. These pairs become the strongest angles for the report.9
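A minimal version of the tension pass, assuming each finding carries a topic & a direction (the real pass is an LLM call over the full finding set, per footnote 9; this sketch only shows the pairing logic).

```python
def tension_pairs(findings):
    """Pair findings on the same topic whose directions conflict,
    e.g. high awareness of a platform paired with near-zero usage."""
    by_topic = {}
    for f in findings:
        by_topic.setdefault(f["topic"], []).append(f)
    pairs = []
    for group in by_topic.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if group[i]["direction"] != group[j]["direction"]:
                    pairs.append((group[i]["id"], group[j]["id"]))
    return pairs
```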

The biggest iteration was fighting obviousness. Early reports led with volume: the topic with the most findings dominated. We added two corrections. A volume bias rule: a topic with 30 findings is not more important than a topic with 3. And a "parent test": if someone who works with children would say "well, obviously", it's not insight; it's confirmation.10

From the scored findings, the Report Designer builds a narrative, writes the report in MDX, generates charts from live SQL, & lays it out as a slide deck.11 Every slide is rendered as an image & visually inspected. If the layout is wrong, text overflows, or a chart is unreadable, the agent adjusts & re-renders.
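The render-inspect-adjust loop can be sketched as below. The three callbacks stand in for the deterministic renderer & the vision-model inspection; `max_passes` is an assumed cap, since the post doesn't state one.

```python
def validate_slide(render, inspect, adjust, slide, max_passes=3):
    """Render a slide to an image, visually inspect the layout, and
    adjust & re-render until the inspection passes (or the cap is hit)."""
    for _ in range(max_passes):
        image = render(slide)
        problems = inspect(image)   # e.g. text overflow, unreadable chart
        if not problems:
            break
        slide = adjust(slide, problems)
    return slide
```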

After writing, a separate model fact-checks every claim. It extracts every number from the report & verifies it maps to a real finding. Three issue types: unsupported (hallucination), inaccurate (wrong number), misleading (findings combined in a way that changes meaning). Two revision rounds max.12
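The first two issue types reduce to a lookup-and-compare once claims have been extracted with reference IDs (footnote 12). This sketch covers only those; "misleading" needs an LLM judgment over combined findings & is out of scope here.

```python
def fact_check(claims, findings_by_id):
    """Classify each extracted number against its cited finding.
    claims: [{"ref": finding_id, "value": number}, ...]"""
    issues = []
    for claim in claims:
        finding = findings_by_id.get(claim["ref"])
        if finding is None:
            issues.append((claim["ref"], "unsupported"))   # hallucination
        elif claim["value"] != finding["value"]:
            issues.append((claim["ref"], "inaccurate"))    # wrong number
    return issues
```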

The writing takes about 20 minutes. The design & visual validation take another 20.

Different models for different jobs

Opus reasons. Sonnet scores. GPT validates. OpenAI embeds.13

The expensive model does creative work where accumulated context matters : exploring data, writing narrative, designing layouts. Cheap models do stateless jobs where it doesn't : scoring findings into buckets, checking SQL references, verifying numbers.

The model that produced an error should not be the model that checks for it.
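The routing reduces to a static table plus one invariant. Model names here are shorthand placeholders, not exact model IDs, & the job names are illustrative.

```python
# Illustrative routing table: expensive model for context-heavy creative work,
# cheap models for stateless jobs, a different vendor for checking.
ROUTES = {
    "explore":  "opus",               # exploring data
    "write":    "opus",               # writing narrative
    "layout":   "opus",               # designing layouts
    "score":    "sonnet",             # scoring findings into buckets
    "validate": "gpt",               # checking SQL references
    "verify":   "gpt",               # verifying numbers
    "embed":    "openai-embeddings",  # semantic search
}

def pick_model(job):
    return ROUTES[job]

def independent_checker(producer_job, checker_job):
    """The model that produced an error should not be the one checking for it."""
    return ROUTES[checker_job] != ROUTES[producer_job]
```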

We've written about this architectural pattern separately.14

The numbers

Queries             1,059
Time                2h 40m
Cost                $253
Unique findings     475
Headline findings   70
Output              Publishable deck

The scoring pipeline reliably separated signal from noise: headline findings averaged 24.5pp effect sizes, versus 17.1pp for interesting findings. The deeper the agents went, the better the results. Two-dimensional splits had a 51% hit rate. Three dimensions: 55%. Four or more: 69%, with a skip rate of just 5.6%.

The agents didn't just slice by existing dimensions. They created composite segments from combinations of behavioural flags : daily TikTok users who also create content, or parents who distrust a platform but whose children use it daily. A human analyst could build those cross-tabs, but wouldn't prioritise them without a hypothesis driving it. The agents had no hypothesis. They just ran everything.

Our agents can unlock insights in your data too

We built these tools for our own data : a longitudinal youth trends dataset covering media consumption, platform behaviour, & brand engagement across Gen Alpha & Gen Z. But the tools work against any structured dataset.

If you're sitting on research, survey data, or behavioural logs, there are likely findings in there that nobody's had the budget to look for. We can run these tools against your data, or if you work with young audiences, we can share what we're finding in ours.

Agent architecture

If you love agents as much as I do, you can nerd out on the full tool-call pipeline...

Full pipeline: Deep Research → AI Analyst → Report Designer

Three agentic tools, 1,059 queries, 475 findings, one publishable deck.

Deep Research (orchestrator)
- Map the question space (~45 areas), then pick the next area to explore (least-explored first).

AI Analyst (×3 concurrent)
- Semantic search → LLM rank → write SQL → execute & interpret → record finding.
- Validation, 3 gates:
  1. Extract SQL entities (LLM)
  2. Check references (programmatic)
  3. Re-run query & verify claim (LLM)
  On failure: retry up to 3×, or abandon.
- Momentum check: interesting → continue exploring; 2 consecutive low-value → switch dimension; 3 low-value → area complete.
- Loop until all areas are explored: 475 unique findings.

Report Designer
- Score findings (15× Sonnet) → 4 buckets → tension pairs → obviousness filters → write narrative → build charts from live SQL.
- Fact-check (separate model): extract every claim → verify against cited findings. Issue types: unsupported · inaccurate · misleading. 2 revision rounds max.
- Visual validation: render slide → inspect layout → adjust if needed → accept.

Output: a publishable deck. $253 · 2h 40m.

Footnotes

  1. Research phase: ~$203, ~2 hours. Report writing & design: ~$50, ~40 minutes. All costs are LLM API credits (Claude, GPT, OpenAI embeddings). No human time in the pipeline.

  2. Built on the Claude Agent SDK. Each tool is an AgentTool: a class that runs its own Claude loop with its own tool set. From the caller's perspective, invoking an agent is indistinguishable from calling a database query. Same interface, same contract. This is what makes nesting & composition straightforward.
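  The shape of that pattern, stripped to its essentials. This is a sketch of the idea, not the real Claude Agent SDK class; the reasoning loop is stubbed with a plain callable.

```python
class AgentTool:
    """A tool that is itself an agent: it runs its own reasoning loop with
    its own private tool set, but exposes the same call interface as any
    other tool. (Illustrative sketch, not the SDK implementation.)"""

    def __init__(self, name, tools, loop):
        self.name = name
        self.tools = tools   # the agent's private tool set
        self._loop = loop    # the agent's own Claude loop (stubbed here)

    def __call__(self, request):
        # From the caller's view this is just a function call:
        # same interface, same contract as a database query.
        return self._loop(request, self.tools)

# Nesting is straightforward: an AgentTool can sit inside another agent's tools.
analyst = AgentTool("ai_analyst",
                    tools={"sql": lambda q: "rows"},
                    loop=lambda req, tools: f"finding for {req}")
deep_research = AgentTool("deep_research",
                          tools={"analyst": analyst},
                          loop=lambda req, tools: tools["analyst"](req))
```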

  3. OpenAI text-embedding-3-large (3072 dimensions) with HNSW indexes on pgvector (Postgres). Embedding-based search means the agent doesn't need to know the database schema to find relevant questions.

  4. Full validation architecture described in Agents Got Good at Calling Tools. What If the Tools Are Agents?.

  5. Bug types detected by the validation gate: possible_overcounting, missing_variant_filter, raw_count_comparison, ambiguous_aggregation. These emerged from observing common agent failures over several months. On failure, the agent can fix the SQL or justify why the bug isn't a problem. A second LLM reviews the justification. Three attempts per finding. After that, abandoned.

  6. Three Analyst instances run concurrently via semaphore, picking areas from a shared queue. Areas with the fewest prior explorations are picked first for fair scheduling.
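  The concurrency pattern in footnote 6 can be sketched with a semaphore & a priority heap. The `areas` mapping (area → prior exploration count) & the `explore` callback are illustrative stand-ins for the shared queue & one Analyst run.

```python
import asyncio
import heapq

async def run_analysts(areas, explore, concurrency=3):
    """Analyst workers pull areas from a shared priority queue,
    fewest prior explorations first (fair scheduling).
    areas: {area_name: prior_exploration_count}"""
    sem = asyncio.Semaphore(concurrency)
    queue = [(prior, area) for area, prior in areas.items()]
    heapq.heapify(queue)   # least-explored areas sort first
    results = []

    async def worker():
        while queue:
            _, area = heapq.heappop(queue)
            async with sem:
                results.append(await explore(area))

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results
```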

  7. Sub-group sample sizes below 200 are not split further. The diminishing returns test replaced a softer "when to stop" guideline that the agents routinely ignored.

  8. 15 concurrent Sonnet calls. ~$0.35. ~10 seconds. Bucket definitions include a calibration instruction: compare findings within the batch to ensure consistent grading.

  9. Tension pairs are identified in a separate pass after individual scoring. The system looks for contradictions, unexpected absences, & directional conflicts across the full finding set.

  10. The parent test applies to the specific claim, not the category. "Kids use TikTok a lot" is obvious. "Kids watch full movies on TikTok" is not. We also added a three-tier signal hierarchy: surface splits (expected demographic differences; background context only), structural signals (inflection points, conversion gaps; the narrative backbone), & counterintuitive patterns (most valuable).

  11. Charts generated by a build_chart tool that loads raw SQL query results for referenced findings & produces React chart components. Available types: line, stacked column, grouped column, stacked area, scatter, heatmap, indexed bar, flow. The deck is MDX (markdown + JSX components).

  12. The fact-checker receives only the cited findings, not all 475. It extracts every number, percentage, & comparison & maps each to a specific finding by reference ID. Issue types: unsupported (no matching finding), inaccurate (wrong number or flipped direction), misleading (findings combined in a way that changes meaning).

  13. Claude Opus for agent reasoning & report writing. Claude Sonnet for scoring & classification. GPT for SQL validation & numerical verification. OpenAI for embeddings.

  14. See Agents Got Good at Calling Tools. What If the Tools Are Agents? for the full pattern : stateful agents create, stateless tools validate, deterministic tools render.


Want to learn more?

Get in touch with our team to discuss how Dubit can help your brand connect with digital-native audiences.