We benchmarked 9 AI data analysts

Most AI benchmarks measure code generation or general reasoning. Almost none of them answer the question we actually care about as a data tool: can an AI agent sit down in front of a real, messy production database and return a correct answer to a hard business question? So we built one that does, ran it against nine AI data analysts, and published the results.

The benchmark is called BI Bench. This post walks through what we found, how we ran it, and why the gaps between tools are bigger than any demo would suggest.

TL;DR

We benchmarked nine AI data analysts — Basedash, Codex, Hex, Claude Code, TextQL, Querio, Julius, Sigma, and Metabase — on the same set of difficult BI questions against the same real database with a complex, production-grade schema.
We used each tool’s default experience, including the default model, reasoning effort, memory, context, skills, semantic layer behavior, and related controls where those settings exist.
Basedash ranked first with 92.1% accuracy and the fastest average response time at 28.6 seconds per task. It was both the most accurate and the fastest tool tested.
Codex was the next most accurate at 90.9%, but it took 54.3 seconds per task — making Basedash roughly 1.9× faster for a comparable level of accuracy.
Hex was next on accuracy (80.6%) but much slower (198.2s). Claude Code (78.0%), TextQL (64.7%), Querio (54.9%), Julius (46.1%), Sigma (35.1%), and Metabase (12.4%) trailed further back.
Accuracy and speed do not move together. Codex was strong on both, Hex was slow despite high accuracy, and Metabase and Sigma were fast but much less accurate.

The results

Every agent answered the same questions against the same database and was graded against the same fixed criteria. Accuracy is the share of those criteria each response met; response time is the average wall-clock time per task. Here is the full leaderboard:

Rank	Tool	Accuracy	Avg response time
1	Basedash	92.1%	28.6s
2	Codex	90.9%	54.3s
3	Hex	80.6%	198.2s
4	Claude Code	78.0%	118.5s
5	TextQL	64.7%	134.5s
6	Querio	54.9%	255.7s
7	Julius	46.1%	68.1s
8	Sigma	35.1%	42.6s
9	Metabase	12.4%	40.9s

The best position is high accuracy combined with low response time. Basedash is the only tool that achieves both, with the highest accuracy in the group and the fastest response time. An interactive chart that plots accuracy against response time is available on the BI Bench page.
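One way to make "high accuracy and low response time" concrete is a Pareto check: a tool is dominated if some other tool is at least as accurate and at least as fast, and strictly better on one of the two. The sketch below applies that check to the leaderboard numbers above. The function name `pareto_frontier` is ours for illustration and is not part of BI Bench.

```python
# Leaderboard values copied from the table above: (tool, accuracy %, avg seconds).
LEADERBOARD = [
    ("Basedash", 92.1, 28.6),
    ("Codex", 90.9, 54.3),
    ("Hex", 80.6, 198.2),
    ("Claude Code", 78.0, 118.5),
    ("TextQL", 64.7, 134.5),
    ("Querio", 54.9, 255.7),
    ("Julius", 46.1, 68.1),
    ("Sigma", 35.1, 42.6),
    ("Metabase", 12.4, 40.9),
]


def pareto_frontier(rows):
    """Return the tools that no other tool beats on both accuracy and speed."""
    frontier = []
    for name, acc, secs in rows:
        dominated = any(
            other_acc >= acc
            and other_secs <= secs
            and (other_acc > acc or other_secs < secs)
            for _, other_acc, other_secs in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(LEADERBOARD)
print(frontier)  # ['Basedash']
```

Because one tool holds both the highest accuracy and the lowest response time in this run, the frontier collapses to a single entry. In a run where no tool led on both axes, the same function would return several tools and the choice between them would be a trade-off.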

What the numbers actually tell us

Accuracy and speed are not the same axis

The most common assumption about AI analysts is that a slower answer is a more thorough, more accurate answer. The data does not support that. Codex is the second most accurate tool and relatively quick at 54.3 seconds, while Hex is the third most accurate and one of the slowest at 198.2 seconds. Querio is the slowest tool in the run at 255.7 seconds, yet it ranks only sixth of nine on accuracy at 54.9%. Metabase is fast at 40.9 seconds but least accurate at 12.4%, and Sigma is nearly as fast at 42.6 seconds but scores 35.1%. Spending three minutes on an answer is no guarantee that it is right, and a fast answer is not automatically a correct one.
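A quick way to check this claim against the leaderboard is to compute the Pearson correlation between accuracy and average response time. The sketch below does that by hand on the nine published data points. Nine points is a small sample, so treat the result as a sanity check rather than a statistical finding.

```python
from math import sqrt

# Accuracy (%) and average response time (s), in leaderboard order.
ACCURACY = [92.1, 90.9, 80.6, 78.0, 64.7, 54.9, 46.1, 35.1, 12.4]
SECONDS = [28.6, 54.3, 198.2, 118.5, 134.5, 255.7, 68.1, 42.6, 40.9]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(ACCURACY, SECONDS)
print(f"Pearson r between accuracy and response time: {r:.2f}")
```

A value near +1 would mean slower tools are reliably more accurate, and a value near -1 the opposite. On these numbers the coefficient sits close to zero, which is consistent with the point that time spent says little about correctness.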

That matters because response time is not a vanity metric — it is the difference between two products. An answer that arrives in thirty seconds invites a follow-up question and keeps a person in flow. An answer that arrives three minutes later turns an interactive analysis into a batch job, and people stop asking.

The accuracy gap is wide

The spread between first and last place is enormous: from 92.1% down to 12.4% on the exact same questions and the same schema. This is the single most important takeaway. “AI data analyst” is a category where two tools can look nearly identical in a scripted demo and then differ by nearly 80 percentage points of accuracy on real questions. The only way to know where a tool lands is to test it on hard questions against a real schema, which is exactly what BI Bench is designed to do.

The Basedash vs. Codex comparison

The most instructive head-to-head is Basedash against Codex, OpenAI’s coding agent and the next most accurate agent. Codex’s 90.9% is a genuinely strong score — it is a capable coding agent that can reason its way through SQL. But it takes an average of 54.3 seconds per task to get there, which makes Basedash roughly 1.9× faster while also being more accurate. Claude Code lands lower in this run at 78.0% accuracy and 118.5 seconds per task. The lesson is that raw model capability is necessary but not sufficient. Purpose-built tooling around the model — schema understanding, the right retrieval, and a tight execution loop — is what turns a capable model into a fast, accurate data analyst.
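The 1.9× figure is the ratio of the two average response times. A minimal check of that arithmetic, using only the numbers quoted above:

```python
# Published averages for the two most accurate tools.
basedash_secs, codex_secs = 28.6, 54.3
basedash_acc, codex_acc = 92.1, 90.9

speedup = codex_secs / basedash_secs    # how many times faster Basedash is
accuracy_gap = basedash_acc - codex_acc  # difference in percentage points

print(f"{speedup:.1f}x faster, {accuracy_gap:.1f} points more accurate")
# prints "1.9x faster, 1.2 points more accurate"
```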

How we ran the benchmark

The whole point of BI Bench is that it is repeatable and fair. Every agent sees the same database and the same questions, and every response is graded the same way. The run breaks down into three steps, described below after a note on configuration and coverage.

We ran every tool with its default settings to best represent the default user experience a new team would get: the default model, reasoning effort, memory, context, skills, semantic layer behavior, and any other setup that ships out of the box. Different tools expose different levels of control over these settings. Some let users choose a model or reasoning effort, some let teams configure context, skills, or semantic layers in detail, and some make most of those choices automatically. A tool may perform better with more manual configuration, but the default path is the fairest way to compare what most teams experience first.

This is not a complete map of every AI data analyst product in the market. Some tools disallow benchmarking in their terms, and others do not have self-serve onboarding that lets us run the same benchmark independently.

1. Connect

Each agent connects to the same real database with a deliberately complex, production-grade schema. We do not simplify it or hand the agent a curated subset of tables. Part of what we are testing is whether an agent can navigate a schema where several tables look plausible but only one is correct — the situation a human analyst actually faces.

2. Run

We send a difficult set of real-world BI questions to every agent and capture each response verbatim, along with the time it took to produce. These are the kinds of questions a data team actually fields: multi-step, ambiguous, and dependent on joining the right tables together.
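As an illustration of the capture step, the sketch below times a single question and stores the response unchanged. It is not the BI Bench harness. `Capture`, `run_question`, and `fake_agent` are hypothetical names, and a real run would replace `fake_agent` with a call into each tool.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Capture:
    question: str
    response: str   # stored verbatim, with no trimming or post-processing
    seconds: float  # wall-clock time for this one question


def run_question(ask: Callable[[str], str], question: str) -> Capture:
    """Send one question to an agent and record its response and elapsed time."""
    start = time.perf_counter()
    response = ask(question)
    elapsed = time.perf_counter() - start
    return Capture(question, response, elapsed)


def fake_agent(question: str) -> str:
    """Hypothetical stand-in for a real tool."""
    return f"(stub answer to: {question})"


# Hypothetical questions, for illustration only.
questions = [
    "What was net revenue last quarter?",
    "Which plan has the highest churn?",
]
captures = [run_question(fake_agent, q) for q in questions]
for c in captures:
    print(f"{c.seconds:.4f}s  {c.response}")
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and intended for measuring elapsed intervals.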

3. Evaluate

Finally, every response is graded against fixed accuracy criteria covering correctness, whether the agent used the right tables and joins, and how well the answer matches what was asked. Accuracy is the share of those criteria each response meets, and response time is the average wall-clock time per task. Nothing is hand-scored differently from one agent to the next.
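The two reported metrics can be sketched as follows. This assumes criteria are pooled across all tasks before the share is taken. The post does not say whether BI Bench pools criteria or averages per-task scores, so treat that choice as ours. The graded data is hypothetical.

```python
from statistics import mean

# Hypothetical graded tasks: one boolean per criterion, plus wall-clock seconds.
graded = [
    {"criteria_met": [True, True, False], "seconds": 30.0},
    {"criteria_met": [True, True, True, True], "seconds": 50.0},
]


def accuracy(tasks):
    """Share of all criteria met, pooled across tasks (an assumption)."""
    met = sum(sum(t["criteria_met"]) for t in tasks)
    total = sum(len(t["criteria_met"]) for t in tasks)
    return met / total


def avg_response_time(tasks):
    """Average wall-clock seconds per task."""
    return mean(t["seconds"] for t in tasks)


print(f"accuracy = {accuracy(graded):.1%}, avg time = {avg_response_time(graded):.1f}s")
# prints "accuracy = 85.7%, avg time = 40.0s"
```

Pooling weights tasks with more criteria more heavily. Averaging per-task shares instead would give each task equal weight and can produce a slightly different number.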

These numbers are a snapshot of where things stand today, not a final verdict. We will keep re-running BI Bench as these tools evolve and as we add harder questions.

Why this benchmark is different

Most public AI benchmarks score models on isolated, well-specified tasks: a coding puzzle, a math problem, a reading comprehension passage. Those are useful, but they do not reflect what happens when you point an agent at a company’s actual data warehouse. Real BI work is messy in ways that are hard to capture in a clean test set:

The schema is large and ambiguous. Several tables look like they could answer a question; only one is correct. Choosing wrong produces a confident, plausible, wrong number.
The questions are underspecified. “How is the new pricing doing?” requires the agent to make and surface assumptions, not just translate English to SQL.
The joins are non-trivial. The right answer often depends on combining several tables correctly, where a small mistake quietly changes the result.
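To make the join problem concrete, here is a small self-contained example using Python's built-in `sqlite3` module. The `orders` and `order_items` tables are hypothetical and are not part of the BI Bench schema. Joining orders to their line items before summing the order amount counts each order once per item, so the total is inflated without any error being raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 has two line items, order 2 has one.
    INSERT INTO order_items VALUES (1, 'A'), (1, 'B'), (2, 'C');
    """
)

# Correct: revenue is summed at the grain of the orders table.
correct = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Plausible but wrong: the join repeats order 1 once per line item.
inflated = conn.execute(
    """
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN order_items AS i ON i.order_id = o.id
    """
).fetchone()[0]

print(correct, inflated)  # 150.0 250.0
```

Both queries run cleanly and both return a believable number, which is why this kind of mistake is hard to spot without a known correct answer to compare against.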

BI Bench is built to reward the behavior that matters in production: picking the right tables, joining them correctly, and returning a verifiable answer quickly. That is a different test than “can this model write SQL,” and it produces a different ranking.

What this means if you are evaluating AI data analyst tools

If you are choosing an AI data analyst, the practical takeaway is that you cannot trust the demo and you cannot trust the model name. Two things consistently separate the tools that survive in production from the ones that stall in a pilot:

Accuracy on your own schema. Run your own version of this test. Pick five hard questions you have asked your data team this quarter, where you know the right answer, and ask each shortlisted tool the same five against a real slice of your warehouse — not a clean sandbox. We wrote a full framework for evaluating AI data analyst tools if you want a structured way to score them.
Speed at that accuracy. A tool that is accurate but takes three minutes per answer changes how people use it. Measure wall-clock time, not just whether the final answer was right.
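A minimal sketch of how the two measurements could be tabulated for your own test, assuming numeric answers and a 1% relative tolerance. All question names, values, tool names, and the tolerance are hypothetical. Three questions are shown for brevity where the text suggests five.

```python
from math import isclose

# Hypothetical questions with answers you already trust.
KNOWN_ANSWERS = {
    "monthly_active_accounts": 1840.0,
    "q2_net_revenue": 412500.0,
    "trial_to_paid_rate": 0.183,
}

# Hypothetical results per tool: question -> (reported value, seconds taken).
RESULTS = {
    "tool_a": {
        "monthly_active_accounts": (1840.0, 31.0),
        "q2_net_revenue": (412500.0, 44.0),
        "trial_to_paid_rate": (0.183, 27.0),
    },
    "tool_b": {
        "monthly_active_accounts": (1902.0, 12.0),
        "q2_net_revenue": (412500.0, 15.0),
        "trial_to_paid_rate": (0.21, 9.0),
    },
}


def summarize(results, known, rel_tol=0.01):
    """Return {tool: (share of questions answered correctly, avg seconds)}."""
    summary = {}
    for tool, answers in results.items():
        correct = sum(
            isclose(answers[q][0], expected, rel_tol=rel_tol)
            for q, expected in known.items()
        )
        avg_secs = sum(secs for _, secs in answers.values()) / len(answers)
        summary[tool] = (correct / len(known), avg_secs)
    return summary


summary = summarize(RESULTS, KNOWN_ANSWERS)
for tool, (share, secs) in summary.items():
    print(f"{tool}: {share:.0%} correct, {secs:.1f}s average")
```

In this made-up data the faster tool gets only one of three answers right, which is the pattern to watch for: speed only matters once the answers are correct.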

BI Bench is one external reference point for both. It is public, it uses a real schema, and we publish the methodology so you can see exactly what is being measured.

FAQ

What is BI Bench?

BI Bench is Basedash’s benchmark for evaluating AI data analyst agents on complex, real-world business intelligence tasks. We connect a real database with a complicated schema, run a difficult set of questions through each agent, capture its responses, and score those responses against a fixed set of accuracy criteria. We also measure how long each agent takes to respond. You can see the live results at basedash.com/bi-bench.

Which AI data analysts were benchmarked?

The current run compares nine tools: Basedash, Codex, Hex, Claude Code, TextQL, Querio, Julius, Sigma, and Metabase. Each agent answers the same questions against the same database so the accuracy and speed numbers are directly comparable. We could not benchmark every tool in the category because some tools disallow benchmarking and others do not offer self-serve onboarding.

Which AI data analyst was the most accurate?

Basedash was the most accurate AI data analyst in the benchmark, with a 92.1% accuracy score. Codex was second at 90.9%, and Hex was third at 80.6%.

Which AI data analyst was the fastest?

Basedash was the fastest, averaging 28.6 seconds per task. That made it roughly 1.9× faster than Codex (54.3 seconds), the next most accurate tool. Metabase was the next fastest at 40.9 seconds, but it was also the least accurate at 12.4%. Sigma followed closely on speed at 42.6 seconds, with 35.1% accuracy.

How are agents evaluated on BI Bench?

Every agent connects to the same real database with a complex schema and answers an identical set of hard BI questions. We use each tool’s default user experience, including the default model, reasoning effort, memory, context, skills, and semantic layer behavior where those controls exist. Each response is graded against criteria covering correctness, use of the right tables and joins, and how well the answer matches the question. Accuracy is the share of criteria met, and we track the average response time per task alongside it.

How can I run a benchmark like this for my own data?

Pick five real questions you have asked your data team in the last quarter where you know the correct answer, connect each shortlisted tool to a read-only role on a representative slice of your warehouse, and ask each tool the same questions. Score the answers on correctness and measure how long each one took. Our guide to evaluating AI data analyst tools walks through a structured version of this process.

Basedash is the AI data analyst that ranked first on BI Bench — most accurate and fastest. If you want to see how it performs on your own data, you can explore the AI data analyst or book a demo.

Max Musing
