
BI Bench: Comparing AI data analyst products

Benchmarks for AI agents tend to measure code generation or general reasoning. Almost none of them answer the question we actually care about: can an agent sit down in front of a real, messy production database and return a correct answer to a hard business question? BI Bench is our attempt to measure exactly that.

It's an internal benchmark we run against our own agent and against the other AI data analysts that teams are evaluating today. The setup is deliberately unforgiving. We connect each agent to the same real database with a large, complicated schema, where the right answer depends on knowing which of several similar tables to use and how they join together. Then we send every agent an identical set of difficult questions, capture each response, and grade it against a fixed set of criteria. Alongside accuracy, we record how long each agent takes, because an answer that arrives after three minutes is a very different product experience from one that arrives in thirty seconds.

Accuracy and speed, measured

The chart below plots every agent on two axes: accuracy runs up the vertical, and average response time runs along the horizontal. We've reversed the time axis so that faster agents sit toward the right, which means the best place to be is the top-right corner, where an agent is both accurate and fast.

[Chart: Performance on BI Bench, accuracy vs. response time. Vertical axis: accuracy score. Horizontal axis: average response time, with faster agents toward the right. Plotted points: Basedash 80.3% · 18.2s; Hex Magic 78.8% · 198.2s; TextQL Ana 67.4% · 134.5s; Julius 45.5% · 68.1s; Metabase Metabot 25.0% · 40.9s.]
Performance on BI Bench: accuracy versus average response time per task. Closer to the top-right is better.

Basedash lands squarely in that top-right corner. It posts the highest accuracy in the group at 80.30%, and it gets there while being the fastest agent we tested, at an average of 18.2 seconds per task. Being both the most accurate and the quickest is the result we were hoping for, but it's worth looking at the rest of the field to see why that combination matters.

The most instructive comparison is with Hex Magic, the next most accurate agent at 78.79%. That's a genuinely strong score, but Hex Magic takes an average of 198.2 seconds per task to produce it, more than 10 times as long as Basedash. In practice, that gap is the difference between an answer you wait on and an answer you can have a conversation with. TextQL Ana lands further back on accuracy at 67.42%, and it's also slow, at 134.5 seconds per task. Julius and Metabase Metabot trail on accuracy, at 45.45% and 25.00% respectively, on the same set of questions.

Here are the full results:

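| # | Tool             | Accuracy | Avg response time |
|---|------------------|----------|-------------------|
| 1 | Basedash         | 80.30%   | 18.2s             |
| 2 | Hex Magic        | 78.79%   | 198.2s            |
| 3 | TextQL Ana       | 67.42%   | 134.5s            |
| 4 | Julius           | 45.45%   | 68.1s             |
| 5 | Metabase Metabot | 25.00%   | 40.9s             |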
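
The speed gaps quoted above can be checked directly from the results. The snippet below is a small illustrative sketch, not part of the benchmark itself: the `RESULTS` dictionary and the `slowdown_vs` helper are names made up for this example, and the numbers are copied from the results listed above.

```python
# Results copied from the article: (accuracy in percent, average seconds per task).
RESULTS = {
    "Basedash": (80.30, 18.2),
    "Hex Magic": (78.79, 198.2),
    "TextQL Ana": (67.42, 134.5),
    "Julius": (45.45, 68.1),
    "Metabase Metabot": (25.00, 40.9),
}


def slowdown_vs(baseline: str, results: dict = RESULTS) -> dict:
    """Express each agent's average response time as a multiple of the baseline's.

    Hypothetical helper for illustration; rounds to one decimal place.
    """
    base_seconds = results[baseline][1]
    return {
        name: round(seconds / base_seconds, 1)
        for name, (_accuracy, seconds) in results.items()
    }


if __name__ == "__main__":
    for name, factor in slowdown_vs("Basedash").items():
        print(f"{name}: {factor}x the Basedash response time")
```

On these figures, Hex Magic's average response time comes out at roughly 10.9 times Basedash's, and TextQL Ana's at roughly 7.4 times.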

How we run the benchmark

The whole point of BI Bench is that it's repeatable and fair. Every agent sees the same database and the same questions, and every response is graded the same way. The run breaks down into three steps.

Connect

Each agent connects to the same real database with a deliberately complex, production-grade schema. We don't simplify it or hand the agent a curated subset of tables. Part of what we're testing is whether an agent can navigate a schema where several tables look plausible but only one is correct.

Run

We send a difficult set of real-world BI questions to every agent and capture each response verbatim, along with the time it took to produce. These are the kinds of questions a data team actually fields: multi-step, ambiguous, and dependent on joining the right tables together.
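
As a rough sketch of what capturing a response along with its time can look like, the snippet below wraps a single question in a wall-clock timer. It is an assumption-laden illustration rather than our actual harness: `Capture`, `run_question`, `run_suite`, and the `ask` callable standing in for an agent are all hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Capture:
    """One recorded task: the question, the verbatim response, and elapsed time."""
    question: str
    response: str
    seconds: float


def run_question(ask: Callable[[str], str], question: str) -> Capture:
    """Send one question to an agent and record wall-clock time.

    `ask` is a stand-in for however a given agent is actually called.
    """
    start = time.perf_counter()
    response = ask(question)  # stored as returned, with no post-processing
    elapsed = time.perf_counter() - start
    return Capture(question=question, response=response, seconds=elapsed)


def run_suite(ask: Callable[[str], str], questions: List[str]) -> List[Capture]:
    """Run every question through the same agent, in the same order."""
    return [run_question(ask, q) for q in questions]
```

Using `time.perf_counter` rather than `time.time` is a deliberate choice in this sketch: it is a monotonic clock intended for measuring intervals, so system clock adjustments don't skew a measurement.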

Evaluate

Finally, every response is graded against fixed accuracy criteria covering correctness, whether the agent used the right tables and joins, and how well the answer matches what was asked. Accuracy is the share of those criteria each response meets, and response time is the average wall-clock time per task. Nothing is hand-scored differently from one agent to the next.
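
The two metrics can be sketched in a few lines. This example assumes that criteria are pooled across all tasks before taking the share; averaging a per-task share instead would be another reasonable reading, and can give a different number when tasks have different numbers of criteria. The function names and the sample data are made up for illustration.

```python
from statistics import mean
from typing import List


def accuracy(criteria_met: List[List[bool]]) -> float:
    """Percentage of criteria met, pooled across all tasks (an assumption).

    `criteria_met[i][j]` is True when task i's response met its j-th criterion.
    """
    flat = [met for task in criteria_met for met in task]
    return 100 * sum(flat) / len(flat)


def average_response_time(seconds_per_task: List[float]) -> float:
    """Mean wall-clock time per task, in seconds."""
    return mean(seconds_per_task)


if __name__ == "__main__":
    # Made-up grading results for three tasks; not real benchmark data.
    graded = [[True, True, False], [True, False], [True, True, True]]
    timings = [12.0, 20.0, 25.0]
    print(accuracy(graded))                # 6 of 8 criteria met -> 75.0
    print(average_response_time(timings))  # 19.0
```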

We'll keep re-running BI Bench as these tools evolve and as we add harder questions, so the numbers above are a snapshot of where things stand today rather than a final word.

BI Bench FAQ

What is BI Bench?

BI Bench is Basedash's internal benchmark for evaluating AI data analyst agents on complex, real-world business intelligence tasks. We connect a real database with a complicated schema, run a difficult set of questions through each agent, capture its responses, and score those responses against a fixed set of accuracy criteria. We also measure how long each agent takes to respond.

How does Basedash score on BI Bench?

Basedash ranks first on BI Bench with an accuracy score of 80.30% and an average response time of 18.2 seconds, making it both the most accurate and the fastest agent in the evaluation.

Which AI data analyst agents were evaluated?

The current run compares Basedash, Hex Magic, TextQL Ana, Julius, and Metabase Metabot. Each agent answers the same questions against the same database so the accuracy and speed numbers are directly comparable.

How are agents evaluated on BI Bench?

Every agent connects to the same real database with a complex schema and answers an identical set of hard BI questions. We grade each response against criteria covering correctness, use of the right tables and joins, and how well the answer matches the question. Accuracy is the share of criteria met, and we track the average response time per task alongside it.

See the top-scoring agent on your data

Connect your database and put Basedash to work on your hardest questions.