When to add a data warehouse: signals your startup has outgrown its production database
Max Musing · Founder and CEO of Basedash · May 14, 2026
Most startups begin with analytics on the same database that runs their product. For a while, that is the right answer: the data is fresh, there is one source of truth, and there is no pipeline to maintain. The question is not whether to add a data warehouse eventually, but how to recognize the moment when running analytics on the production database starts to cost more than it saves.
This guide is for founders, engineering leads, and early data hires deciding whether to keep analytics on PostgreSQL or MySQL, add a read replica, or move to a real data warehouse like Snowflake, BigQuery, Redshift, or ClickHouse. It walks through the signals that actually matter, the intermediate steps before a full warehouse, and the practical migration path that does not require a dedicated data team.
Run analytics on your production database for as long as you can. Add a read replica when analytical queries start to slow down product writes. Move to a data warehouse when you need to combine production data with data from other systems (Stripe, HubSpot, Segment, Salesforce, Intercom), when transformations and historical snapshots become a regular part of reporting, or when more than a handful of people need to query data without breaking anything.
The mistake is rarely “moved too late.” It is “moved too early, then under-invested in the warehouse, and ended up with worse analytics than before.”
A surprising amount of early-stage analytics works well on a production PostgreSQL or MySQL instance. If your data lives in one database, one team runs the queries, the aggregations are simple, and performance is holding up, a warehouse is probably premature.
Under those conditions, the simplest workable stack is: a few well-named SQL views in the production database, a BI tool pointed at it, and a small set of dashboards. We’ve written about this pattern in how to set up business intelligence without a data team. It is not a temporary hack. Plenty of companies between $0 and $10M ARR run their entire analytics surface this way.
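As a sketch of that pattern, here is what one well-named view on the production database might look like. Table and column names (users, created_at) are illustrative assumptions, not a prescribed schema:

```sql
-- The entire "analytics layer" at this stage: a named view the BI tool
-- queries instead of raw tables. Names are illustrative.
CREATE VIEW analytics_monthly_signups AS
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*)                        AS signups
FROM users
GROUP BY 1
ORDER BY 1;
```

The BI tool then points at `analytics_monthly_signups` rather than at raw tables, which keeps dashboards stable even when the underlying schema changes.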
What changes that picture is not size, exactly. It is the specific kinds of pain that emerge when more people, more sources, and more historical questions arrive.
The signals below are listed in roughly the order teams tend to hit them. Not every team will hit all of them, but if you find yourself nodding at three or more, you are past due for a warehouse.
The first hard signal is performance. A query that joins users to events to subscriptions and aggregates over three months of data is not the same shape as the queries your product runs. It scans more rows, holds more memory, and competes with transactional traffic for I/O.
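A hedged illustration of that shape, in Postgres-flavored SQL with hypothetical table and column names:

```sql
-- The analytical shape: wide joins plus a three-month aggregation.
-- This scans far more rows than any single product request and holds
-- a snapshot open for the duration of the query.
SELECT
    s.plan,
    DATE_TRUNC('week', e.occurred_at) AS week,
    COUNT(DISTINCT e.user_id)         AS weekly_active_users
FROM events e
JOIN users u         ON u.id = e.user_id
JOIN subscriptions s ON s.user_id = u.id
WHERE e.occurred_at >= NOW() - INTERVAL '3 months'
GROUP BY 1, 2;
```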
You will see this as p95 latency creeping up during the time of day when someone runs the weekly metrics report, or as a transaction lock that times out because an analyst’s query is holding a snapshot. The first response is usually a read replica, not a warehouse. A read replica isolates read traffic from writes and is straightforward to set up on most managed Postgres or MySQL providers. It buys you another six to twelve months in most cases.
A warehouse only becomes the right answer once a read replica is also struggling, or once the queries you need to run benefit from columnar storage. If you are scanning hundreds of millions of rows and aggregating, columnar engines like Snowflake, BigQuery, Redshift, and ClickHouse are an order of magnitude faster than row-oriented Postgres or MySQL.
This is the signal that pushes most teams over the edge. The product database knows about users, sessions, and transactions. But “revenue” lives in Stripe. “Pipeline” lives in Salesforce or HubSpot. “Marketing attribution” lives in Segment or Mixpanel. “Support volume” lives in Intercom or Zendesk.
Once you need to answer questions that join two or more of those systems (“how does CAC vary by acquisition source for customers who upgraded in their second month?”), you have to land the data together in one place. That place is the warehouse.
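Once both systems land in the warehouse, that kind of question collapses into one query. A sketch, assuming a connector-managed layout where Stripe lands in a `stripe` schema and the product database in a `product` schema; table and column names are assumptions, not any connector's exact output:

```sql
-- Revenue by acquisition source: one join across two systems that
-- previously lived in different tools. Schema and column names
-- (stripe.invoice, product.users, stripe_customer_id) are assumptions.
SELECT
    u.acquisition_source,
    SUM(i.amount_paid) / 100.0 AS revenue_usd
FROM stripe.invoice i
JOIN product.users u
  ON u.stripe_customer_id = i.customer_id
WHERE i.status = 'paid'
GROUP BY u.acquisition_source
ORDER BY revenue_usd DESC;
```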
There are tools that connect Stripe data to dashboards directly without a warehouse. They work for narrow questions and stop working as soon as the question crosses systems. The warehouse is, more than anything else, a place to land data from many sources and join it consistently.
Production databases overwrite. When a customer changes their plan, the subscriptions row updates in place. When a deal moves stages, the opportunities row gets a new stage. The current state is preserved; the history is lost unless someone built explicit audit logging for that table.
Analytics rarely cares about current state alone. It cares about what changed and when: how long deals sit in each stage, how plan downgrades cluster around billing events, what fraction of churned customers downgraded first. Answering those questions requires append-only history.
You can build history into your application database with audit tables and triggers, but that is real engineering work and slows down product changes. A warehouse fed by change data capture (CDC) or scheduled snapshots gives you that history almost for free. Once historical analysis becomes a recurring need rather than a one-off, the warehouse stops being optional.
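A minimal sketch of the scheduled-snapshot approach inside the warehouse, in Postgres-flavored SQL with assumed schema and table names:

```sql
-- One-time setup: a snapshot table that mirrors subscriptions
-- plus a date column.
CREATE TABLE snapshots.subscriptions_daily AS
SELECT CURRENT_DATE AS snapshot_date, s.*
FROM product.subscriptions s;

-- Scheduled daily: append today's state instead of overwriting.
INSERT INTO snapshots.subscriptions_daily
SELECT CURRENT_DATE, s.*
FROM product.subscriptions s;

-- Later, history is a filter: plan mix as of ninety days ago.
SELECT plan, COUNT(*) AS customers
FROM snapshots.subscriptions_daily
WHERE snapshot_date = CURRENT_DATE - 90
GROUP BY plan;
```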
This is a softer signal than the first three but often the most expensive when ignored. Once finance, growth, customer success, and operations are all asking different questions of the same data, the costs of running everything on the production database compound: every new question becomes an engineering request, and every ad-hoc query is another risk to production.
A warehouse, paired with a BI tool that handles permissions and review, makes self-serve analytics safe. Modern tools like Basedash, Hex, Omni, and Lightdash are built around the assumption that several non-engineers will be in the data daily, with read-only access scoped to the parts of the warehouse they should see.
If you have ever found yourself defining “active customer” in a SQL view, in a Looker explore, in a Notion doc, and in a calculated field inside the BI tool, and getting four slightly different answers, you have hit the transformation problem. The warehouse, paired with a transformation tool like dbt or SQLMesh, is where this is solved.
Without a warehouse, transformations end up scattered across application code, BI tool calculated fields, and ad-hoc SQL. With a warehouse, transformations have an obvious home: dbt models that compile to tables and views inside the warehouse, with tests, documentation, and a dependency graph. We’ve written more about this in where to define business metrics.
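As a sketch of what that home looks like, here is one hypothetical dbt model that defines "active customer" exactly once. Model, column, and staging names are illustrative:

```sql
-- models/marts/active_customers.sql
-- One canonical "active customer" definition, compiled to a table in
-- the warehouse so every dashboard reads the same answer.
{{ config(materialized='table') }}

SELECT
    customer_id,
    plan,
    current_period_start
FROM {{ ref('stg_subscriptions') }}
WHERE status = 'active'
```

A schema.yml entry can then attach a uniqueness test on customer_id and documentation, which is where dbt starts paying for itself.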
You do not need dbt on day one of a warehouse. But the warehouse is what makes a coherent transformation layer possible later.
Most teams skip too quickly from “production database” to “Snowflake plus dbt plus a semantic layer plus a BI tool.” There are useful intermediate stops.
A read replica. A read-only copy of the production database, kept in sync via streaming replication. Solves performance isolation. Costs almost nothing if you already use a managed provider. Does not solve any of the cross-source, history, or transformation problems.
A reporting database. A Postgres instance separate from production, populated by nightly dumps or scheduled jobs. Slightly more flexible than a read replica because you can add tables that do not exist in production (transformed tables, joined tables from other sources). The reporting-database pattern works well at small scale and gets messy as soon as you have more than a few sources or care about freshness.
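One minimal way to wire that up, assuming both sides are Postgres; host, credentials, schema, and table names below are placeholders:

```sql
-- Populate a reporting database from production: foreign tables via
-- postgres_fdw, copied locally on a nightly schedule.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER prod_db FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-db.internal', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER SERVER prod_db
    OPTIONS (user 'reporting_ro', password 'REDACTED');

CREATE SCHEMA prod_mirror;
IMPORT FOREIGN SCHEMA public LIMIT TO (users, subscriptions)
    FROM SERVER prod_db INTO prod_mirror;

-- Nightly job: rebuild local copies so reporting queries never touch prod.
CREATE SCHEMA IF NOT EXISTS reporting;
DROP TABLE IF EXISTS reporting.users;
CREATE TABLE reporting.users AS SELECT * FROM prod_mirror.users;
```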
A small warehouse with one or two sources. This is the underrated middle option. Spin up BigQuery or a small Snowflake account. Use Fivetran, Airbyte, or Estuary to land Stripe and one or two product systems alongside a snapshot of your production database. Run a BI tool against the warehouse. No dbt yet, no semantic layer, just a warehouse with raw schemas and a few SQL views on top.
This is the configuration most companies should aim for as their first warehouse. It is not the final form, but it solves three of the five signals (cross-source data, historical snapshots, isolated read traffic) without committing to a complex stack. dbt and a semantic layer can be added later, once the warehouse has proven its value.
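In that configuration, the "few SQL views" live in the warehouse directly over the raw landed schemas. A minimal sketch; the `raw_stripe` schema name follows a typical connector convention and is an assumption:

```sql
-- A thin reporting view over connector-landed raw data. No dbt yet;
-- just a stable name for the BI tool to query. raw_stripe is whatever
-- schema your connector writes to.
CREATE VIEW reporting.active_subscriptions AS
SELECT
    customer_id,
    plan_id,
    current_period_end
FROM raw_stripe.subscription
WHERE status = 'active';
```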
Use this to choose where you are right now. The right architecture depends mostly on which signals you are hitting, not on revenue or headcount.
| If this is true… | Do this |
|---|---|
| One database, one team, simple aggregations, no performance issues | Stay on production. Add SQL views. |
| Production performance is degrading from analytical queries | Add a read replica. Point analytics at it. |
| You need to join product data with Stripe, HubSpot, Segment, or Salesforce regularly | Add a warehouse with managed connectors. Skip dbt at first. |
| Reports require historical snapshots of data that the production DB overwrites | Move to a warehouse with CDC or scheduled snapshots. |
| Several non-engineers need self-serve access without risk to production | Move to a warehouse and a BI tool with permissions. |
| You have repeated metric-definition conflicts across dashboards | Add dbt on top of the warehouse. |
| Multiple consumers (BI, AI assistants, embedded dashboards) need consistent metric answers | Add a semantic layer on top of dbt. |
The matrix reads top to bottom as a maturity ladder. Most companies skip rungs and pay for it. The expensive mistake is buying Snowflake plus Fivetran plus dbt plus a semantic layer plus a BI tool before any of the underlying signals are present, then watching adoption stall because nobody can answer questions faster than they could before.
When you do move, the temptation is to replicate everything. Do not.
A useful first inventory has three buckets:

- Core product tables to land first: users, accounts, subscriptions, events (or their equivalents), and a handful of domain-specific tables.
- Everything else in the production database, added as needed rather than replicated wholesale.
- Sources to deprioritize at the start: marketing platforms with low query volume, support tools, and HR systems. They are useful eventually but rarely on the critical path for the first warehouse.
The first dashboards in the warehouse should be the ones that previously required reaching into Stripe or stitching together exports. That is where the warehouse pays for itself fastest.
Not every company benefits from a warehouse, even at scale. A few patterns repeat:
If your business is a single-product, single-database company with a small operations team and metrics that do not change much, a warehouse adds operational cost without much payoff. A read replica plus a BI tool plus a few well-maintained SQL views can carry a focused team to nine figures of revenue.
If your data needs are dominated by real-time operational dashboards (alerting, fraud detection, sub-minute monitoring), a warehouse is the wrong shape. You want a streaming system or an HTAP database (CockroachDB, SingleStore, ClickHouse Cloud) that can handle both writes and analytical reads with low latency.
If your team has no one who is comfortable with SQL or willing to own data infrastructure, a warehouse will be a worse experience than a focused BI tool that connects directly to your existing database. The warehouse is not a substitute for someone who understands the data.
The warehouse pays off when the costs of fragmentation across sources, across history, and across consumers exceed the costs of running another piece of infrastructure. That crossover happens at different times for different companies. The signals above are how to recognize it without guessing.
For most companies that are ready, a useful sequence follows the rungs of the decision matrix above:

1. Add well-named SQL views on the production database.
2. Add a read replica and point analytics at it when performance degrades.
3. Spin up a small warehouse with managed connectors for the one or two sources you join most often.
4. Land a snapshot or CDC feed of the production database for history.
5. Put a permissioned BI tool in front of the warehouse for self-serve access.
6. Add dbt once metric definitions start to conflict.
7. Add a semantic layer once multiple consumers need consistent answers.
Each step solves a specific problem. The trap is doing all seven at once, before any of them has been stress-tested against real questions.
AI-assisted analytics shifts the curve slightly. Tools that generate SQL against a database, built into BI products like Basedash, Hex, and Omni or used standalone, make ad-hoc questions cheaper and increase the number of people querying data. That accelerates signal #4 (more non-engineers in the data) and indirectly accelerates signal #5 (transformation drift, because more people writing queries means more inconsistent definitions).
It does not, on its own, force a warehouse. AI tools can run against a production database or a read replica, and many do well there. But the questions they make easy to ask are exactly the ones that benefit from a warehouse: cross-source, historical, and self-serve. If AI-assisted analytics is part of your roadmap, the warehouse decision usually arrives a quarter or two earlier than it otherwise would. We’ve covered the broader pattern in text-to-SQL for data warehouses.
Is a read replica enough, or do I need a real warehouse?
A read replica solves performance isolation and nothing else. It does not give you cross-source data, historical snapshots, or columnar performance on large aggregations. If the only signal you are hitting is slow queries on production, start with a replica. If any of the other four signals are also present, the replica is a temporary measure.
Do I need dbt to start with a warehouse?
No. Plenty of teams run a useful warehouse with raw schemas plus a handful of SQL views. dbt becomes valuable when transformations start having dependencies on other transformations, and when you want tests and documentation as part of the workflow. That usually happens six to eighteen months into using a warehouse, not on day one.
Can I run analytics on a Postgres read replica forever?
Some companies do. The limits are: you cannot easily land data from other sources, you cannot easily keep historical snapshots, and Postgres performance on large aggregations is worse than columnar warehouses. If your business never crosses those limits, a replica plus a BI tool is a complete stack.
Should I use Snowflake, BigQuery, Redshift, or ClickHouse?
For most general-purpose analytics workloads, Snowflake and BigQuery are the safest choices, with BigQuery favored on Google Cloud and Snowflake working anywhere. Redshift is a reasonable default if you are deeply committed to AWS. ClickHouse is the best choice when your primary workload is high-volume event data with sub-second query requirements. The bigger decision is the surrounding stack (connectors, transformation tool, BI tool), not the warehouse itself.
How big does a company have to be before a warehouse makes sense?
There is no headcount or revenue threshold. The right trigger is the signals above, not company size. Some 15-person companies need a warehouse; some 200-person companies do not. The closest size-based heuristic: most companies hit the cross-source signal somewhere between $1M and $5M ARR, and that is when a warehouse decision usually becomes concrete.
Can a BI tool replace the need for a warehouse?
Sometimes. A BI tool that connects directly to multiple sources and handles its own caching can substitute for a warehouse for a while. The substitution breaks down when transformations get complex, when historical snapshots are needed, or when multiple tools need to read the same canonical data. At that point the warehouse becomes the system of record and the BI tool becomes one of several consumers.
Written by Max Musing, Founder and CEO of Basedash
Max Musing is the founder and CEO of Basedash, an AI-native business intelligence platform designed to help teams explore analytics and build dashboards without writing SQL. His work focuses on applying large language models to structured data systems, improving query reliability, and building governed analytics workflows for production environments.
Basedash lets you build charts, dashboards, and reports in seconds using all your data.