Skip to content

DuckDB is a single-process, columnar analytics database that runs inside your application or BI tool the same way SQLite runs inside a phone app. For a growing number of BI use cases, that single-node model is faster, cheaper, and simpler than spinning up a cloud data warehouse. It also breaks down in predictable ways once data, concurrency, or governance requirements pass certain thresholds.

This guide is for data engineers, analytics engineers, and engineering leaders deciding whether DuckDB belongs under a dashboard. It covers what DuckDB actually is, the three production patterns it shows up in, where it beats Snowflake or BigQuery for BI workloads, and the specific cases where you should keep using a cloud warehouse.

TL;DR

  • DuckDB is an in-process OLAP database. It runs on a laptop, a single server, or inside a browser (DuckDB-Wasm), and reads Parquet, CSV, JSON, Iceberg, and Delta directly. It scales vertically, not horizontally.
  • DuckDB reached 1.0 in 2024 and is now used in production by BI vendors (Evidence, Rill, MotherDuck) and analytics tooling (dbt-duckdb, Mode notebooks, Hex’s DuckDB engine).
  • For BI workloads under roughly a few hundred GB of working set, with a handful of concurrent users and no strong governance or multi-tenant isolation requirements, DuckDB is usually the right backend. It’s faster on cold queries, cheaper at idle, and simpler to operate.
  • For multi-tenant SaaS analytics, datasets in the multi-TB range, high-concurrency embedded use cases, or environments with strict row-level security and audit requirements, a cloud warehouse (Snowflake, BigQuery, Redshift, ClickHouse) is still the safer default.
  • The most common production pattern is hybrid: a cloud warehouse for the canonical store, scheduled jobs that materialize partitioned Parquet to object storage, and a DuckDB query layer that reads from Parquet to serve dashboards.
  • Basedash recently migrated Basedash Warehouse to DuckDB under the hood because our warehouse workload is read-heavy, dashboard-shaped, and naturally scoped by workspace. DuckDB lets us serve charts, AI chats, reports, and Slack workflows from a simpler analytical layer without adding a usage-based query meter.

What DuckDB actually is

DuckDB is an embedded OLAP database. The most useful mental model is “SQLite for analytics.”

It runs inside the process that’s querying it, not as a separate server. There is no daemon, no listener, no port to open. A Python script, a Node service, a Rust CLI, a desktop app, and a BI tool all link DuckDB the same way: import the library, open a database file (or an in-memory database), run SQL.

The engine is columnar, vectorized, and built for analytical queries: aggregations, joins, window functions, scans across large tables. It also reads external file formats natively, with no ETL step:

select date_trunc('month', created_at) as month,
       sum(amount) as revenue
from 's3://my-bucket/orders/*.parquet'
where created_at >= '2026-01-01'
group by 1
order by 1;

The query above streams Parquet files from S3, projects only the two columns it needs, pushes the date filter into the file scan, and aggregates the result without writing anything to disk. There is no “load” step, no managed warehouse, no compute cluster to start.

DuckDB has first-class support for Parquet, CSV, JSON, Arrow, Iceberg, and Delta. The httpfs extension makes any HTTP- or S3-addressable file a queryable table. The postgres and mysql extensions let DuckDB query external databases. DuckDB-Wasm runs the entire engine in the browser.

What DuckDB is not:

  • A distributed system. There is no cluster mode. Scale is vertical: more cores, more RAM, faster disk.
  • A multi-tenant SaaS warehouse. There is no built-in user management, role-based access control, or row-level security as a database feature.
  • A streaming database. DuckDB is optimized for batch and interactive analytical queries, not high-throughput inserts or change data capture.

Why single-node analytics became practical again

The case for a distributed warehouse used to be obvious: analytical datasets were larger than a single machine could hold or process. That assumption is increasingly wrong for the dataset shapes most BI teams actually have.

Three things changed:

Machines got large. A single AWS m7i.metal-24xl has 96 vCPUs and 384 GB of RAM for around $5 an hour on-demand and well under $1 an hour reserved. A Hetzner dedicated server with similar specs is under $300 a month. A 2024 MacBook Pro has more RAM than a 2014 Redshift node. The “single-node” limit moved from gigabytes to terabytes.

Storage got separated. Object storage (S3, GCS, R2) is essentially infinite, durable, and cheap. Putting Parquet files on object storage and reading them from a query engine on demand is now the canonical pattern, used by Snowflake, BigQuery, Databricks, and every lakehouse vendor. DuckDB participates in the same ecosystem as a reader.

Columnar formats got fast. Parquet with predicate pushdown, row group statistics, and dictionary encoding lets a query engine skip 90 percent or more of a file without reading it. A modern columnar reader on a single machine is genuinely competitive with a distributed warehouse for selective analytical queries.

The result is that a serious BI workload, say 200 GB of orders, sessions, and product events, fits comfortably on one machine, runs on Parquet in S3, and does not need a multi-node cluster or a five-figure monthly warehouse bill.

When DuckDB is the right backend for BI

DuckDB is a strong choice when the BI workload looks like one of the following.

Internal analytics on small to mid-sized datasets

A startup with 50 GB of production data in Postgres, a few dozen dashboards, and 20 internal users does not need Snowflake. A nightly job that exports relevant tables to Parquet, plus a BI tool that points at DuckDB reading those Parquet files, is faster end to end. Cold queries return in under a second. There is no compute to manage, no warehouse to suspend, no per-query cost.

The savings are real. The same workload on a small Snowflake warehouse runs $1,500-$5,000 a month after queries, storage, and serverless features. The DuckDB version runs on a single $200 server, or for free as part of the BI tool itself.

Local development and dbt models

Running dbt against a cloud warehouse for development is slow and expensive. Every test run hits real compute, every iteration pays for credits. The dbt-duckdb adapter lets engineers run the entire project locally against Parquet copies of production data, with identical SQL and the same models. Cycle time drops from minutes to seconds.

This is now standard practice in analytics engineering: develop against DuckDB locally, deploy to a cloud warehouse in production. For teams where the cloud warehouse is overkill in production too, DuckDB just becomes the runtime.

Customer-facing analytics with per-customer data

Some embedded analytics workloads are naturally single-tenant per query: each customer sees only their own data, and the data per customer is small (megabytes to a few GB). Spinning up a Snowflake virtual warehouse for each customer is wasteful. A DuckDB instance per query, reading that customer’s Parquet shard from S3, finishes in milliseconds and costs nothing at idle.

This pattern works best when the data is naturally partitioned by customer and the per-customer working set is small. It does not work well when customers share large dimension tables or when cross-tenant analytics is part of the product.

Notebooks, prototyping, and ad-hoc analysis

DuckDB is the fastest way to query a CSV, a JSON file, an S3 prefix, or a Parquet dataset interactively. Analysts use it the way they used to use Pandas, with two differences: SQL is the interface and the engine actually streams data. Datasets that previously required a notebook running on a beefy instance now run on a laptop.

Lakehouse query layer

DuckDB reads Iceberg and Delta natively. For teams already producing those formats from Spark, Flink, or a lakehouse vendor, DuckDB is a serviceable interactive query layer for BI without standing up a separate warehouse for the read path. It is not a replacement for the lakehouse ingestion stack, just a fast reader on top of it.

When DuckDB is the wrong choice

There are workloads where DuckDB is the wrong answer, and the failure modes are predictable.

Multi-tenant SaaS embedded analytics at scale

A SaaS product embedding dashboards for thousands of customers needs strong tenant isolation, consistent latency under concurrent load, and a shared metric layer that holds across customers. DuckDB does not natively enforce row-level security, does not have a built-in concept of users, and does not scale concurrent queries across nodes. You can build the isolation in the layer above, but you are reinventing what a cloud warehouse already gives you. For this case, a cloud warehouse with row access policies plus a query layer that injects tenant predicates is the safer architecture. See multi-tenant analytics architecture for the full pattern.

Multi-TB working sets with cross-table joins

DuckDB can read terabytes of Parquet. Joining two multi-TB tables on arbitrary keys, with no partition pruning to help, is still going to use a lot of memory and time on a single node. If your workload involves frequent joins across very large fact tables, a distributed warehouse with shuffle joins will outperform a single-node engine, and the operational cost is justified.

High write throughput or streaming ingestion

DuckDB is not a transactional system. It supports inserts and updates, but it is not designed to be the system of record for high-throughput writes, change data capture, or real-time event streams. Use it as a read layer over data that lands in object storage, or over a transactional database via the postgres extension.

Strict governance and audit requirements

If you need column masking, row-level security as a database feature, column-level lineage, fine-grained roles, time-travel queries, and an audit log enforced by the database, you want a cloud warehouse. DuckDB has none of those as native features. Bolting them on at the application layer is doable but not the path of least resistance.

High concurrency with shared compute

A dashboard with 500 concurrent users hitting the same DuckDB process will queue. The engine is highly parallel inside a single query, but it is not designed for the level of concurrent multi-query throughput a cloud warehouse provides. The fix is one DuckDB process per query (cheap at low concurrency, expensive at high) or to use a different engine. For collaborative analytics and team workflows where many people query the same data simultaneously, cloud warehouses still win.

The three production patterns for DuckDB in BI

If DuckDB fits your workload, it shows up in production in one of three shapes.

Pattern 1: DuckDB as the warehouse

A single DuckDB database file (or a directory of Parquet files) is the analytical store. Source data is loaded via ETL or CDC into Parquet on object storage. The BI tool runs DuckDB locally or on a small server and reads directly from Parquet. There is no other warehouse.

This is the simplest setup and works for a surprising number of teams. It’s especially common in early-stage SaaS, small operations teams, and analytics consultancies who need a fast read path without a Snowflake bill.

A typical layout:

s3://acme-analytics/
  orders/
    year=2026/month=01/orders.parquet
    year=2026/month=02/orders.parquet
  sessions/
    year=2026/month=01/sessions.parquet
  dim_customers/
    customers.parquet

DuckDB queries this as:

create view orders as
  select * from 's3://acme-analytics/orders/**/*.parquet';

create view sessions as
  select * from 's3://acme-analytics/sessions/**/*.parquet';

Partition pruning happens automatically from the directory layout. A query filtered to a single month reads only that month’s files.

Pattern 2: DuckDB as a query layer over a cloud warehouse

The canonical store is Snowflake, BigQuery, or Postgres. DuckDB sits in front, runs dashboards against materialized Parquet exports, and falls back to the warehouse for fresh or unmaterialized data.

This is the right pattern when you already have a cloud warehouse and want to cut its bill for BI workloads. The warehouse handles ingestion, transformations, and the long tail of queries. A nightly (or hourly) job exports the most-queried tables to Parquet. DuckDB serves dashboards from Parquet, and the warehouse only gets hit when an analyst writes an ad-hoc query against a table that wasn’t exported.

Teams who go through this exercise typically see warehouse spend drop by 60-80 percent for dashboard traffic. The warehouse cost playbook walks through the migration in detail.

Pattern 3: DuckDB per session or per query (embedded analytics)

For customer-facing analytics with per-customer data, a DuckDB process is spun up for each session or each query. It reads the customer’s data from S3 (or from a customer-specific database) and serves the dashboard. When the session ends, the process exits.

This pattern works well when:

  • The per-customer working set is small (under a few GB).
  • Customers do not share large dimension tables.
  • Cold-start latency is acceptable (DuckDB can attach to a Parquet dataset in tens of milliseconds, but downloading large data over the network adds time).

MotherDuck and several embedded analytics platforms have built this as a managed service: you specify the customer’s data path, the platform spins up a DuckDB worker, and the dashboard renders against it.

Why Basedash Warehouse moved to DuckDB

At Basedash, we recently went all in on DuckDB for Basedash Warehouse. The product surface did not need to change: customers still sync data, build dashboards, ask questions in chat, schedule reports, and trigger Slack workflows from the same Basedash workspace. What changed is the analytical engine underneath that warehouse serving layer.

The reason is workload fit. Basedash Warehouse is not trying to be a general-purpose enterprise warehouse for every ad-hoc query a company might ever run. It is a managed BI serving layer for data that is already being shaped for charts, reports, and AI-generated follow-up questions. Those workloads are mostly reads, usually bounded by a workspace, and often repeat the same dashboard-ready scans. That is exactly where DuckDB is strongest.

Moving the warehouse layer to DuckDB gave us a few concrete advantages:

  • Simpler serving infrastructure. DuckDB runs inside the service that needs analytical SQL, so we do not need a separate always-on warehouse cluster just to answer dashboard queries.
  • Better economics for BI consumption. Basedash does not charge usage-based query fees, and DuckDB’s zero-idle model helps us keep warehouse-backed charts, chats, reports, and Slack workflows on a predictable serving path.
  • Fast local execution over modeled data. When data has already been synced and shaped for BI, DuckDB can scan and aggregate it close to the product instead of sending every interaction back through Snowflake, BigQuery, or another usage-billed warehouse.
  • Cleaner workspace isolation. Most Basedash Warehouse queries naturally belong to one workspace’s synced data. That maps well to DuckDB’s single-node model and avoids forcing every workload through one shared, highly concurrent database surface.

This is the same tradeoff this guide recommends for other teams. DuckDB is not the right answer because it is new or because it can replace every cloud warehouse. It is the right answer when the workload is bounded, read-heavy, and latency-sensitive, and when the product benefits from owning the BI serving layer. Basedash Warehouse fits that shape, so DuckDB became the right default under the hood.

DuckDB compared to cloud warehouses for BI

The relevant dimensions for a BI backend, not generic “data warehouse” criteria:

AttributeDuckDBSnowflakeBigQueryClickHouse Cloud
DeploymentEmbedded libraryManaged SaaSManaged SaaSManaged SaaS or self-host
Scale modelVertical (single node)Horizontal (virtual warehouses)Serverless (slots)Horizontal (replicated shards)
StorageParquet on object store or local fileProprietary (S3-backed)Proprietary (Colossus)MergeTree on object store or local
Idle costZeroWarehouse credits (auto-suspend)Zero (storage only)Cluster cost (auto-suspend)
Cold-query latencySub-second to a few secondsSeveral seconds (warehouse resume)Sub-second to a few secondsSub-second
Concurrency modelPer-processMulti-cluster warehousesSlots and reservationsPer-cluster, per-node
Native RLSNoYesYesYes
BI tool integrationsGrowing (Evidence, Rill, Hex, dbt, Mode, Basedash)UniversalUniversalStrong and growing
Best fitInternal BI, embedded per-customer, dev/testMulti-team enterprise BIAd-hoc + Google ecosystemReal-time, high-cardinality

The columns to read carefully are idle cost and concurrency. DuckDB wins on idle cost (zero) and loses on concurrency. Cloud warehouses win on managed concurrency and lose on idle cost. The right choice depends on which one matters more for your workload.

For workload-specific comparisons across warehouses, see the best BI dashboarding tools for ClickHouse and other warehouse-specific guides.

How BI tools connect to DuckDB

DuckDB connectivity varies more by BI tool than most other warehouses, because the embedded model is unusual.

  • Native DuckDB support. Evidence, Rill, MotherDuck’s own UI, and a growing list of newer tools treat DuckDB as a first-class backend. The BI tool runs DuckDB inside its own process or alongside it, and queries go straight to the engine.
  • Generic SQL drivers. Hex, Mode, Jupyter, and most modern notebook tools support DuckDB through the official Python and Node bindings. The DuckDB instance lives in the notebook kernel.
  • JDBC/ODBC. DuckDB ships JDBC and ODBC drivers, so Tableau, Power BI, and Looker can connect to a DuckDB file or a MotherDuck server. This works but loses the “embedded” benefit; the BI tool runs DuckDB on a separate server and queries it as a remote database.
  • MotherDuck-based connections. MotherDuck offers a hosted DuckDB service that any BI tool can connect to as if it were a cloud warehouse. This is a useful way to get DuckDB-style economics with a server other people manage.
  • Basedash. Connects to DuckDB through a standard SQL connection (file-based or MotherDuck), and treats DuckDB the same as any other warehouse. The AI agent generates SQL that runs against DuckDB without translation, and metrics and dashboards can mix DuckDB queries with queries against Postgres or Snowflake in the same workspace.

For BI tools that don’t yet support DuckDB natively, the practical workaround is to put a small DuckDB server behind a Postgres wire protocol shim (DuckDB has an experimental Postgres-compatible interface) and connect the BI tool as if it were Postgres.

Common mistakes when using DuckDB for BI

Five patterns show up in nearly every team’s first deployment.

Treating DuckDB as a server. Running a long-lived DuckDB process and hitting it with concurrent queries from a multi-user BI tool will queue. DuckDB is built for parallel queries inside a process, not many concurrent processes hitting the same database. Either use a process-per-query model, route queries through MotherDuck, or use a different engine for high-concurrency dashboards.

Skipping partitioning. Querying a single 500 GB Parquet file is slow. Querying a Hive-partitioned directory of 500 MB files by date is fast. The DuckDB engine pushes filters into file scans, but only if the data is laid out so file pruning helps.

Reading from S3 without httpfs caching. The first query against an S3 dataset downloads the relevant byte ranges. Subsequent queries against the same data should hit a local cache. If you don’t configure the httpfs extension’s cache (or use a CDN like CloudFront in front of the bucket), every query pays the network round-trip.

Mixing types badly. DuckDB is strict about types. Inferred Parquet schemas with mixed integer-and-string columns or unexpected nulls will fail at query time. Validate schemas at ingestion, not at query time, especially for data going into a dashboard people refresh every morning.

Ignoring memory limits. DuckDB uses as much memory as you give it. On a shared server, a single bad query can OOM the process. Set memory_limit and threads per session or globally, and run the heaviest queries with a deliberate cap.

A decision rubric: should you put DuckDB under your BI tool?

A practical decision sequence:

  1. How big is the working set for dashboards? Under 100 GB after compression: DuckDB is almost certainly fine. 100 GB to a few TB: DuckDB works with careful partitioning, but a cloud warehouse may be simpler. Multi-TB and growing: cloud warehouse.
  2. How many concurrent BI users? Up to a few dozen: DuckDB handles it cleanly with a process-per-query model. Hundreds: use a cloud warehouse or a hosted DuckDB service like MotherDuck.
  3. Are dashboards multi-tenant? Internal only: DuckDB is fine. Embedded per-customer with naturally partitioned data: DuckDB-per-session works. Embedded with shared schemas and many tenants: cloud warehouse with RLS.
  4. What is your governance posture? Light internal governance: DuckDB is fine. SOC 2 with audit log requirements: layer logging in the BI tool, possible. HIPAA, FedRAMP, contractual isolation: cloud warehouse.
  5. What’s your data freshness requirement? Hourly or longer: DuckDB on scheduled Parquet exports works. Near-real-time: DuckDB on a Postgres extension or fall back to a warehouse, depending on volume.

If you answer “DuckDB is fine” for all five, DuckDB is probably the right backend and you can stop paying for warehouse compute for BI. If even one says “cloud warehouse,” the hybrid pattern (DuckDB for hot Parquet, warehouse for the rest) is usually right.

FAQ

Is DuckDB production-ready?

Yes. DuckDB hit 1.0 in 2024 and is used in production by Evidence, Rill, MotherDuck, Mode, Hex, dbt, and many smaller analytics products, plus internal data platforms at large companies. The engine has formal release notes, semantic versioning, and a maintained extension ecosystem. The constraint is that “production” for DuckDB means embedded analytics workloads, not “global, distributed, multi-tenant database.” Use it for what it’s good at.

Can DuckDB replace Snowflake?

For some workloads, yes; for others, no. A team with a few hundred GB of BI data, an internal analytics use case, and a small data team can replace Snowflake with DuckDB and Parquet on object storage. A team with multi-TB datasets, many concurrent users across departments, and external customer-facing analytics will still need Snowflake (or BigQuery, Redshift, or ClickHouse). The honest answer is most teams who already have Snowflake are not better off ripping it out, but most teams who don’t have it yet can probably skip it.

How does DuckDB compare to ClickHouse?

Different shapes of the same idea. ClickHouse is a distributed, multi-node OLAP database optimized for high-cardinality, high-throughput, real-time workloads like product analytics, observability, and security logging. DuckDB is a single-node embedded engine optimized for ad-hoc analytical SQL and BI workloads. ClickHouse handles billions of rows per second of ingestion; DuckDB does not. DuckDB sits inside a Python script or a BI tool; ClickHouse runs as a server. For BI dashboards over small to mid-sized data, DuckDB is simpler. For real-time analytics over high-cardinality event streams, ClickHouse is the right tool.

What about MotherDuck?

MotherDuck is a hosted DuckDB service. It runs DuckDB on shared infrastructure, lets you query the same data from a laptop or a server with the same SQL, and adds features (storage, sharing, scheduling) that an embedded library does not provide on its own. It is the path of least resistance if you like DuckDB’s economics but do not want to operate it yourself. For BI tools that connect over standard SQL, MotherDuck looks like a cloud warehouse with DuckDB-style performance and pricing.

Does DuckDB support row-level security?

Not as a database feature. You can implement RLS by giving each tenant a separate DuckDB process or database file, by parameterizing views per tenant, or by enforcing the predicate in the BI tool above DuckDB. For embedded analytics with strict isolation requirements, this works but adds operational complexity. If RLS is a hard requirement and you have many tenants, a cloud warehouse with native row access policies is usually a better fit.

Can a BI tool’s AI agent use DuckDB?

Yes. AI agents that generate SQL don’t care which engine runs the query, as long as the dialect is supported. DuckDB’s SQL is broadly compatible with Postgres, which most agents already target. The thing to watch is that the agent connects to DuckDB with the right schema in context and that any tenant or permission predicates are applied before the agent’s SQL runs. The semantic layer guide covers the wider pattern for keeping AI-generated SQL inside guardrails.

How do I get started with DuckDB for BI?

Three concrete steps:

  1. Pick one workload that currently runs against a cloud warehouse and accounts for a meaningful share of cost or latency. A daily executive dashboard or a heavily-queried customer report is a good first candidate.
  2. Materialize the underlying tables to partitioned Parquet on S3 (or local disk for internal-only use). Use a simple cron job, an Airflow DAG, or a dbt-duckdb model.
  3. Point a BI tool at DuckDB reading those Parquet files. Compare cold and warm query latency, monthly cost, and developer experience side by side with the warehouse-backed version. If it’s better, expand.

Most teams who run this experiment for one dashboard end up running it for many. Some keep the warehouse for ad-hoc workloads and move all dashboards. Others rip the warehouse out entirely. Both are reasonable outcomes for the right kind of workload.

Written by

Max Musing avatar

Max Musing

Founder and CEO of Basedash

Max Musing is the founder and CEO of Basedash, an AI-native business intelligence platform designed to help teams explore analytics and build dashboards without writing SQL. His work focuses on applying large language models to structured data systems, improving query reliability, and building governed analytics workflows for production environments.

View full author profile →

Looking for an AI-native BI tool?

Basedash lets you build charts, dashboards, and reports in seconds using all your data.