DuckDB for BI: when single-node analytics beats a cloud warehouse
Max Musing
Max MusingFounder and CEO of Basedash · June 6, 2026

Max Musing
Max MusingFounder and CEO of Basedash · June 6, 2026

DuckDB is a single-process, columnar analytics database that runs inside your application or BI tool the same way SQLite runs inside a phone app. For a growing number of BI use cases, that single-node model is faster, cheaper, and simpler than spinning up a cloud data warehouse. It also breaks down in predictable ways once data, concurrency, or governance requirements pass certain thresholds.
This guide is for data engineers, analytics engineers, and engineering leaders deciding whether DuckDB belongs under a dashboard. It covers what DuckDB actually is, the three production patterns it shows up in, where it beats Snowflake or BigQuery for BI workloads, and the specific cases where you should keep using a cloud warehouse.
DuckDB is an embedded OLAP database. The most useful mental model is “SQLite for analytics.”
It runs inside the process that’s querying it, not as a separate server. There is no daemon, no listener, no port to open. A Python script, a Node service, a Rust CLI, a desktop app, and a BI tool all link DuckDB the same way: import the library, open a database file (or an in-memory database), run SQL.
The engine is columnar, vectorized, and built for analytical queries: aggregations, joins, window functions, scans across large tables. It also reads external file formats natively, with no ETL step:
select date_trunc('month', created_at) as month,
sum(amount) as revenue
from 's3://my-bucket/orders/*.parquet'
where created_at >= '2026-01-01'
group by 1
order by 1;
The query above streams Parquet files from S3, projects only the two columns it needs, pushes the date filter into the file scan, and aggregates the result without writing anything to disk. There is no “load” step, no managed warehouse, no compute cluster to start.
DuckDB has first-class support for Parquet, CSV, JSON, Arrow, Iceberg, and Delta. The httpfs extension makes any HTTP- or S3-addressable file a queryable table. The postgres and mysql extensions let DuckDB query external databases. DuckDB-Wasm runs the entire engine in the browser.
What DuckDB is not:
The case for a distributed warehouse used to be obvious: analytical datasets were larger than a single machine could hold or process. That assumption is increasingly wrong for the dataset shapes most BI teams actually have.
Three things changed:
Machines got large. A single AWS m7i.metal-24xl has 96 vCPUs and 384 GB of RAM for around $5 an hour on-demand and well under $1 an hour reserved. A Hetzner dedicated server with similar specs is under $300 a month. A 2024 MacBook Pro has more RAM than a 2014 Redshift node. The “single-node” limit moved from gigabytes to terabytes.
Storage got separated. Object storage (S3, GCS, R2) is essentially infinite, durable, and cheap. Putting Parquet files on object storage and reading them from a query engine on demand is now the canonical pattern, used by Snowflake, BigQuery, Databricks, and every lakehouse vendor. DuckDB participates in the same ecosystem as a reader.
Columnar formats got fast. Parquet with predicate pushdown, row group statistics, and dictionary encoding lets a query engine skip 90 percent or more of a file without reading it. A modern columnar reader on a single machine is genuinely competitive with a distributed warehouse for selective analytical queries.
The result is that a serious BI workload, say 200 GB of orders, sessions, and product events, fits comfortably on one machine, runs on Parquet in S3, and does not need a multi-node cluster or a five-figure monthly warehouse bill.
DuckDB is a strong choice when the BI workload looks like one of the following.
A startup with 50 GB of production data in Postgres, a few dozen dashboards, and 20 internal users does not need Snowflake. A nightly job that exports relevant tables to Parquet, plus a BI tool that points at DuckDB reading those Parquet files, is faster end to end. Cold queries return in under a second. There is no compute to manage, no warehouse to suspend, no per-query cost.
The savings are real. The same workload on a small Snowflake warehouse runs $1,500-$5,000 a month after queries, storage, and serverless features. The DuckDB version runs on a single $200 server, or for free as part of the BI tool itself.
Running dbt against a cloud warehouse for development is slow and expensive. Every test run hits real compute, every iteration pays for credits. The dbt-duckdb adapter lets engineers run the entire project locally against Parquet copies of production data, with identical SQL and the same models. Cycle time drops from minutes to seconds.
This is now standard practice in analytics engineering: develop against DuckDB locally, deploy to a cloud warehouse in production. For teams where the cloud warehouse is overkill in production too, DuckDB just becomes the runtime.
Some embedded analytics workloads are naturally single-tenant per query: each customer sees only their own data, and the data per customer is small (megabytes to a few GB). Spinning up a Snowflake virtual warehouse for each customer is wasteful. A DuckDB instance per query, reading that customer’s Parquet shard from S3, finishes in milliseconds and costs nothing at idle.
This pattern works best when the data is naturally partitioned by customer and the per-customer working set is small. It does not work well when customers share large dimension tables or when cross-tenant analytics is part of the product.
DuckDB is the fastest way to query a CSV, a JSON file, an S3 prefix, or a Parquet dataset interactively. Analysts use it the way they used to use Pandas, with two differences: SQL is the interface and the engine actually streams data. Datasets that previously required a notebook running on a beefy instance now run on a laptop.
DuckDB reads Iceberg and Delta natively. For teams already producing those formats from Spark, Flink, or a lakehouse vendor, DuckDB is a serviceable interactive query layer for BI without standing up a separate warehouse for the read path. It is not a replacement for the lakehouse ingestion stack, just a fast reader on top of it.
There are workloads where DuckDB is the wrong answer, and the failure modes are predictable.
A SaaS product embedding dashboards for thousands of customers needs strong tenant isolation, consistent latency under concurrent load, and a shared metric layer that holds across customers. DuckDB does not natively enforce row-level security, does not have a built-in concept of users, and does not scale concurrent queries across nodes. You can build the isolation in the layer above, but you are reinventing what a cloud warehouse already gives you. For this case, a cloud warehouse with row access policies plus a query layer that injects tenant predicates is the safer architecture. See multi-tenant analytics architecture for the full pattern.
DuckDB can read terabytes of Parquet. Joining two multi-TB tables on arbitrary keys, with no partition pruning to help, is still going to use a lot of memory and time on a single node. If your workload involves frequent joins across very large fact tables, a distributed warehouse with shuffle joins will outperform a single-node engine, and the operational cost is justified.
DuckDB is not a transactional system. It supports inserts and updates, but it is not designed to be the system of record for high-throughput writes, change data capture, or real-time event streams. Use it as a read layer over data that lands in object storage, or over a transactional database via the postgres extension.
If you need column masking, row-level security as a database feature, column-level lineage, fine-grained roles, time-travel queries, and an audit log enforced by the database, you want a cloud warehouse. DuckDB has none of those as native features. Bolting them on at the application layer is doable but not the path of least resistance.
A dashboard with 500 concurrent users hitting the same DuckDB process will queue. The engine is highly parallel inside a single query, but it is not designed for the level of concurrent multi-query throughput a cloud warehouse provides. The fix is one DuckDB process per query (cheap at low concurrency, expensive at high) or to use a different engine. For collaborative analytics and team workflows where many people query the same data simultaneously, cloud warehouses still win.
If DuckDB fits your workload, it shows up in production in one of three shapes.
A single DuckDB database file (or a directory of Parquet files) is the analytical store. Source data is loaded via ETL or CDC into Parquet on object storage. The BI tool runs DuckDB locally or on a small server and reads directly from Parquet. There is no other warehouse.
This is the simplest setup and works for a surprising number of teams. It’s especially common in early-stage SaaS, small operations teams, and analytics consultancies who need a fast read path without a Snowflake bill.
A typical layout:
s3://acme-analytics/
orders/
year=2026/month=01/orders.parquet
year=2026/month=02/orders.parquet
sessions/
year=2026/month=01/sessions.parquet
dim_customers/
customers.parquet
DuckDB queries this as:
create view orders as
select * from 's3://acme-analytics/orders/**/*.parquet';
create view sessions as
select * from 's3://acme-analytics/sessions/**/*.parquet';
Partition pruning happens automatically from the directory layout. A query filtered to a single month reads only that month’s files.
The canonical store is Snowflake, BigQuery, or Postgres. DuckDB sits in front, runs dashboards against materialized Parquet exports, and falls back to the warehouse for fresh or unmaterialized data.
This is the right pattern when you already have a cloud warehouse and want to cut its bill for BI workloads. The warehouse handles ingestion, transformations, and the long tail of queries. A nightly (or hourly) job exports the most-queried tables to Parquet. DuckDB serves dashboards from Parquet, and the warehouse only gets hit when an analyst writes an ad-hoc query against a table that wasn’t exported.
Teams who go through this exercise typically see warehouse spend drop by 60-80 percent for dashboard traffic. The warehouse cost playbook walks through the migration in detail.
For customer-facing analytics with per-customer data, a DuckDB process is spun up for each session or each query. It reads the customer’s data from S3 (or from a customer-specific database) and serves the dashboard. When the session ends, the process exits.
This pattern works well when:
MotherDuck and several embedded analytics platforms have built this as a managed service: you specify the customer’s data path, the platform spins up a DuckDB worker, and the dashboard renders against it.
At Basedash, we recently went all in on DuckDB for Basedash Warehouse. The product surface did not need to change: customers still sync data, build dashboards, ask questions in chat, schedule reports, and trigger Slack workflows from the same Basedash workspace. What changed is the analytical engine underneath that warehouse serving layer.
The reason is workload fit. Basedash Warehouse is not trying to be a general-purpose enterprise warehouse for every ad-hoc query a company might ever run. It is a managed BI serving layer for data that is already being shaped for charts, reports, and AI-generated follow-up questions. Those workloads are mostly reads, usually bounded by a workspace, and often repeat the same dashboard-ready scans. That is exactly where DuckDB is strongest.
Moving the warehouse layer to DuckDB gave us a few concrete advantages:
This is the same tradeoff this guide recommends for other teams. DuckDB is not the right answer because it is new or because it can replace every cloud warehouse. It is the right answer when the workload is bounded, read-heavy, and latency-sensitive, and when the product benefits from owning the BI serving layer. Basedash Warehouse fits that shape, so DuckDB became the right default under the hood.
The relevant dimensions for a BI backend, not generic “data warehouse” criteria:
| Attribute | DuckDB | Snowflake | BigQuery | ClickHouse Cloud |
|---|---|---|---|---|
| Deployment | Embedded library | Managed SaaS | Managed SaaS | Managed SaaS or self-host |
| Scale model | Vertical (single node) | Horizontal (virtual warehouses) | Serverless (slots) | Horizontal (replicated shards) |
| Storage | Parquet on object store or local file | Proprietary (S3-backed) | Proprietary (Colossus) | MergeTree on object store or local |
| Idle cost | Zero | Warehouse credits (auto-suspend) | Zero (storage only) | Cluster cost (auto-suspend) |
| Cold-query latency | Sub-second to a few seconds | Several seconds (warehouse resume) | Sub-second to a few seconds | Sub-second |
| Concurrency model | Per-process | Multi-cluster warehouses | Slots and reservations | Per-cluster, per-node |
| Native RLS | No | Yes | Yes | Yes |
| BI tool integrations | Growing (Evidence, Rill, Hex, dbt, Mode, Basedash) | Universal | Universal | Strong and growing |
| Best fit | Internal BI, embedded per-customer, dev/test | Multi-team enterprise BI | Ad-hoc + Google ecosystem | Real-time, high-cardinality |
The columns to read carefully are idle cost and concurrency. DuckDB wins on idle cost (zero) and loses on concurrency. Cloud warehouses win on managed concurrency and lose on idle cost. The right choice depends on which one matters more for your workload.
For workload-specific comparisons across warehouses, see the best BI dashboarding tools for ClickHouse and other warehouse-specific guides.
DuckDB connectivity varies more by BI tool than most other warehouses, because the embedded model is unusual.
For BI tools that don’t yet support DuckDB natively, the practical workaround is to put a small DuckDB server behind a Postgres wire protocol shim (DuckDB has an experimental Postgres-compatible interface) and connect the BI tool as if it were Postgres.
Five patterns show up in nearly every team’s first deployment.
Treating DuckDB as a server. Running a long-lived DuckDB process and hitting it with concurrent queries from a multi-user BI tool will queue. DuckDB is built for parallel queries inside a process, not many concurrent processes hitting the same database. Either use a process-per-query model, route queries through MotherDuck, or use a different engine for high-concurrency dashboards.
Skipping partitioning. Querying a single 500 GB Parquet file is slow. Querying a Hive-partitioned directory of 500 MB files by date is fast. The DuckDB engine pushes filters into file scans, but only if the data is laid out so file pruning helps.
Reading from S3 without httpfs caching. The first query against an S3 dataset downloads the relevant byte ranges. Subsequent queries against the same data should hit a local cache. If you don’t configure the httpfs extension’s cache (or use a CDN like CloudFront in front of the bucket), every query pays the network round-trip.
Mixing types badly. DuckDB is strict about types. Inferred Parquet schemas with mixed integer-and-string columns or unexpected nulls will fail at query time. Validate schemas at ingestion, not at query time, especially for data going into a dashboard people refresh every morning.
Ignoring memory limits. DuckDB uses as much memory as you give it. On a shared server, a single bad query can OOM the process. Set memory_limit and threads per session or globally, and run the heaviest queries with a deliberate cap.
A practical decision sequence:
If you answer “DuckDB is fine” for all five, DuckDB is probably the right backend and you can stop paying for warehouse compute for BI. If even one says “cloud warehouse,” the hybrid pattern (DuckDB for hot Parquet, warehouse for the rest) is usually right.
Yes. DuckDB hit 1.0 in 2024 and is used in production by Evidence, Rill, MotherDuck, Mode, Hex, dbt, and many smaller analytics products, plus internal data platforms at large companies. The engine has formal release notes, semantic versioning, and a maintained extension ecosystem. The constraint is that “production” for DuckDB means embedded analytics workloads, not “global, distributed, multi-tenant database.” Use it for what it’s good at.
For some workloads, yes; for others, no. A team with a few hundred GB of BI data, an internal analytics use case, and a small data team can replace Snowflake with DuckDB and Parquet on object storage. A team with multi-TB datasets, many concurrent users across departments, and external customer-facing analytics will still need Snowflake (or BigQuery, Redshift, or ClickHouse). The honest answer is most teams who already have Snowflake are not better off ripping it out, but most teams who don’t have it yet can probably skip it.
Different shapes of the same idea. ClickHouse is a distributed, multi-node OLAP database optimized for high-cardinality, high-throughput, real-time workloads like product analytics, observability, and security logging. DuckDB is a single-node embedded engine optimized for ad-hoc analytical SQL and BI workloads. ClickHouse handles billions of rows per second of ingestion; DuckDB does not. DuckDB sits inside a Python script or a BI tool; ClickHouse runs as a server. For BI dashboards over small to mid-sized data, DuckDB is simpler. For real-time analytics over high-cardinality event streams, ClickHouse is the right tool.
MotherDuck is a hosted DuckDB service. It runs DuckDB on shared infrastructure, lets you query the same data from a laptop or a server with the same SQL, and adds features (storage, sharing, scheduling) that an embedded library does not provide on its own. It is the path of least resistance if you like DuckDB’s economics but do not want to operate it yourself. For BI tools that connect over standard SQL, MotherDuck looks like a cloud warehouse with DuckDB-style performance and pricing.
Not as a database feature. You can implement RLS by giving each tenant a separate DuckDB process or database file, by parameterizing views per tenant, or by enforcing the predicate in the BI tool above DuckDB. For embedded analytics with strict isolation requirements, this works but adds operational complexity. If RLS is a hard requirement and you have many tenants, a cloud warehouse with native row access policies is usually a better fit.
Yes. AI agents that generate SQL don’t care which engine runs the query, as long as the dialect is supported. DuckDB’s SQL is broadly compatible with Postgres, which most agents already target. The thing to watch is that the agent connects to DuckDB with the right schema in context and that any tenant or permission predicates are applied before the agent’s SQL runs. The semantic layer guide covers the wider pattern for keeping AI-generated SQL inside guardrails.
Three concrete steps:
Most teams who run this experiment for one dashboard end up running it for many. Some keep the warehouse for ad-hoc workloads and move all dashboards. Others rip the warehouse out entirely. Both are reasonable outcomes for the right kind of workload.
Written by
Founder and CEO of Basedash
Max Musing is the founder and CEO of Basedash, an AI-native business intelligence platform designed to help teams explore analytics and build dashboards without writing SQL. His work focuses on applying large language models to structured data systems, improving query reliability, and building governed analytics workflows for production environments.
Basedash lets you build charts, dashboards, and reports in seconds using all your data.