When to add a data warehouse: signals your startup has outgrown its production database
Max Musing · Founder and CEO of Basedash · May 14, 2026
Most startups begin with analytics on the same database that runs their product. For a while, that is the right answer: the data is fresh, there is one source of truth, and there is no pipeline to maintain. The question is not whether to add a data warehouse eventually, but how to recognize the moment when running analytics on the production database starts to cost more than it saves.
This guide is for founders, engineering leads, and early data hires deciding whether to keep analytics on PostgreSQL or MySQL, add a read replica, or move to a real data warehouse like Snowflake, BigQuery, Redshift, or ClickHouse. It walks through the signals that actually matter, the intermediate steps before a full warehouse, and the practical migration path that does not require a dedicated data team.
Run analytics on your production database for as long as you can. Add a read replica when analytical queries start to slow down product writes. Move to a data warehouse when you need to combine production data with data from other systems (Stripe, HubSpot, Segment, Salesforce, Intercom), when transformations and historical snapshots become a regular part of reporting, or when more than a handful of people need to query data without breaking anything.
The mistake is rarely “moved too late.” It is “moved too early, then under-invested in the warehouse, and ended up with worse analytics than before.”
A surprising amount of early-stage analytics works well on a production PostgreSQL or MySQL instance. If your data lives in one database, one team runs the queries, the aggregations are simple, and performance is holding up, a warehouse is probably premature.
Under those conditions, the simplest workable stack is: a few well-named SQL views in the production database, a BI tool pointed at it, and a small set of dashboards. We’ve written about this pattern in how to set up business intelligence without a data team. It is not a temporary hack. Plenty of companies between $0 and $10M ARR run their entire analytics surface this way.
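As a sketch of that pattern, here is what one well-named view on the production database might look like. Table and column names (users, created_at) are illustrative assumptions, not a prescribed schema:

```sql
-- The entire "analytics layer" at this stage: a named view the BI tool
-- queries instead of raw tables. Names are illustrative.
CREATE VIEW analytics_monthly_signups AS
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*)                        AS signups
FROM users
GROUP BY 1
ORDER BY 1;
```

The BI tool then points at `analytics_monthly_signups` rather than at raw tables, which keeps dashboards stable even when the underlying schema changes.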
What changes that picture is not size, exactly. It is the specific kinds of pain that emerge when more people, more sources, and more historical questions arrive.
The signals below are listed in roughly the order teams tend to hit them. Not every team will hit all of them, but if you find yourself nodding at three or more, you are past due for a warehouse.
The first hard signal is performance. A query that joins users to events to subscriptions and aggregates over three months of data is not the same shape as the queries your product runs. It scans more rows, holds more memory, and competes with transactional traffic for I/O.
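A hedged illustration of that shape, in Postgres-flavored SQL with hypothetical table and column names:

```sql
-- The analytical shape: wide joins plus a three-month aggregation.
-- This scans far more rows than any single product request and holds
-- a snapshot open for the duration of the query.
SELECT
    s.plan,
    DATE_TRUNC('week', e.occurred_at) AS week,
    COUNT(DISTINCT e.user_id)         AS weekly_active_users
FROM events e
JOIN users u         ON u.id = e.user_id
JOIN subscriptions s ON s.user_id = u.id
WHERE e.occurred_at >= NOW() - INTERVAL '3 months'
GROUP BY 1, 2;
```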
You will see this as p95 latency creeping up during the time of day when someone runs the weekly metrics report, or as a transaction lock that times out because an analyst’s query is holding a snapshot. The first response is usually a read replica, not a warehouse. A read replica isolates read traffic from writes and is straightforward to set up on most managed Postgres or MySQL providers. It buys you another six to twelve months in most cases.
A warehouse only becomes the right answer once a read replica is also struggling, or once the queries you need to run benefit from columnar storage. If you are scanning hundreds of millions of rows and aggregating, columnar engines like Snowflake, BigQuery, Redshift, and ClickHouse are an order of magnitude faster than row-oriented Postgres or MySQL.
This is the signal that pushes most teams over the edge. The product database knows about users, sessions, and transactions. But “revenue” lives in Stripe. “Pipeline” lives in Salesforce or HubSpot. “Marketing attribution” lives in Segment or Mixpanel. “Support volume” lives in Intercom or Zendesk.
Once you need to answer questions that join two or more of those systems (“how does CAC vary by acquisition source for customers who upgraded in their second month?”), you have to land the data together in one place. That place is the warehouse.
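Once both systems land in the warehouse, that kind of question collapses into one query. A sketch, assuming a connector-managed layout where Stripe lands in a `stripe` schema and the product database in a `product` schema; table and column names are assumptions, not any connector's exact output:

```sql
-- Revenue by acquisition source: one join across two systems that
-- previously lived in different tools. Schema and column names
-- (stripe.invoice, product.users, stripe_customer_id) are assumptions.
SELECT
    u.acquisition_source,
    SUM(i.amount_paid) / 100.0 AS revenue_usd
FROM stripe.invoice i
JOIN product.users u
  ON u.stripe_customer_id = i.customer_id
WHERE i.status = 'paid'
GROUP BY u.acquisition_source
ORDER BY revenue_usd DESC;
```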
There are tools that connect Stripe data to dashboards directly without a warehouse. They work for narrow questions and stop working as soon as the question crosses systems. The warehouse is, more than anything else, a place to land data from many sources and join it consistently.
Production databases overwrite. When a customer changes their plan, the subscriptions row updates in place. When a deal moves stages, the opportunities row gets a new stage. The current state is preserved; the history is lost unless someone built explicit audit logging for that table.
Analytics rarely cares about current state alone. It cares about what changed and when: how long deals sit in each stage, how plan downgrades cluster around billing events, what fraction of churned customers downgraded first. Answering those questions requires append-only history.
You can build history into your application database with audit tables and triggers, but that is real engineering work and slows down product changes. A warehouse fed by change data capture (CDC) or scheduled snapshots gives you that history almost for free. Once historical analysis becomes a recurring need rather than a one-off, the warehouse stops being optional.
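A minimal sketch of the scheduled-snapshot approach inside the warehouse, in Postgres-flavored SQL with assumed schema and table names:

```sql
-- One-time setup: a snapshot table that mirrors subscriptions
-- plus a date column.
CREATE TABLE snapshots.subscriptions_daily AS
SELECT CURRENT_DATE AS snapshot_date, s.*
FROM product.subscriptions s;

-- Scheduled daily: append today's state instead of overwriting.
INSERT INTO snapshots.subscriptions_daily
SELECT CURRENT_DATE, s.*
FROM product.subscriptions s;

-- Later, history is a filter: plan mix as of ninety days ago.
SELECT plan, COUNT(*) AS customers
FROM snapshots.subscriptions_daily
WHERE snapshot_date = CURRENT_DATE - 90
GROUP BY plan;
```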
This is a softer signal than the first three but often the most expensive when ignored. Once finance, growth, customer success, and operations are all asking different questions of the same data, the costs of running everything on the production database compound: every new question becomes an engineering request, and every ad-hoc query is another risk to production.
A warehouse, paired with a BI tool that handles permissions and review, makes self-serve analytics safe. Modern tools like Basedash, Hex, Omni, and Lightdash are built around the assumption that several non-engineers will be in the data daily, with read-only access scoped to the parts of the warehouse they should see.
If you have ever found yourself defining “active customer” in a SQL view, in a Looker explore, in a Notion doc, and in a calculated field inside the BI tool, and getting four slightly different answers, you have hit the transformation problem. The warehouse, paired with a transformation tool like dbt or SQLMesh, is where this is solved.
Without a warehouse, transformations end up scattered across application code, BI tool calculated fields, and ad-hoc SQL. With a warehouse, transformations have an obvious home: dbt models that compile to tables and views inside the warehouse, with tests, documentation, and a dependency graph. We’ve written more about this in where to define business metrics.
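As a sketch of what that home looks like, here is one hypothetical dbt model that defines "active customer" exactly once. Model, column, and staging names are illustrative:

```sql
-- models/marts/active_customers.sql
-- One canonical "active customer" definition, compiled to a table in
-- the warehouse so every dashboard reads the same answer.
{{ config(materialized='table') }}

SELECT
    customer_id,
    plan,
    current_period_start
FROM {{ ref('stg_subscriptions') }}
WHERE status = 'active'
```

A schema.yml entry can then attach a uniqueness test on customer_id and documentation, which is where dbt starts paying for itself.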
You do not need dbt on day one of a warehouse. But the warehouse is what makes a coherent transformation layer possible later.
Most teams skip too quickly from “production database” to “Snowflake plus dbt plus a semantic layer plus a BI tool.” There are useful intermediate stops.
A read replica. A read-only copy of the production database, kept in sync via streaming replication. Solves performance isolation. Costs almost nothing if you already use a managed provider. Does not solve any of the cross-source, history, or transformation problems.
A reporting database. A Postgres instance separate from production, populated by nightly dumps or scheduled jobs. Slightly more flexible than a read replica because you can add tables that do not exist in production (transformed tables, joined tables from other sources). The reporting-database pattern works well at small scale and gets messy as soon as you have more than a few sources or care about freshness.
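One minimal way to wire that up, assuming both sides are Postgres; host, credentials, schema, and table names below are placeholders:

```sql
-- Populate a reporting database from production: foreign tables via
-- postgres_fdw, copied locally on a nightly schedule.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER prod_db FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-db.internal', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER SERVER prod_db
    OPTIONS (user 'reporting_ro', password 'REDACTED');

CREATE SCHEMA prod_mirror;
IMPORT FOREIGN SCHEMA public LIMIT TO (users, subscriptions)
    FROM SERVER prod_db INTO prod_mirror;

-- Nightly job: rebuild local copies so reporting queries never touch prod.
CREATE SCHEMA IF NOT EXISTS reporting;
DROP TABLE IF EXISTS reporting.users;
CREATE TABLE reporting.users AS SELECT * FROM prod_mirror.users;
```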
A small warehouse with one or two sources. This is the underrated middle option. Spin up BigQuery or a small Snowflake account. Use Fivetran, Airbyte, or Estuary to land Stripe and one or two product systems alongside a snapshot of your production database. Run a BI tool against the warehouse. No dbt yet, no semantic layer, just a warehouse with raw schemas and a few SQL views on top.
This is the configuration most companies should aim for as their first warehouse. It is not the final form, but it solves three of the five signals (cross-source data, historical snapshots, isolated read traffic) without committing to a complex stack. dbt and a semantic layer can be added later, once the warehouse has proven its value.
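In that configuration, the "few SQL views" live in the warehouse directly over the raw landed schemas. A minimal sketch; the `raw_stripe` schema name follows a typical connector convention and is an assumption:

```sql
-- A thin reporting view over connector-landed raw data. No dbt yet;
-- just a stable name for the BI tool to query. raw_stripe is whatever
-- schema your connector writes to.
CREATE VIEW reporting.active_subscriptions AS
SELECT
    customer_id,
    plan_id,
    current_period_end
FROM raw_stripe.subscription
WHERE status = 'active';
```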
Use this to choose where you are right now. The right architecture depends mostly on which signals you are hitting, not on revenue or headcount.
| If this is true… | Do this |
|---|---|
| One database, one team, simple aggregations, no performance issues | Stay on production. Add SQL views. |
| Production performance is degrading from analytical queries | Add a read replica. Point analytics at it. |
| You need to join product data with Stripe, HubSpot, Segment, or Salesforce regularly | Add a warehouse with managed connectors. Skip dbt at first. |
| Reports require historical snapshots of data that the production DB overwrites | Move to a warehouse with CDC or scheduled snapshots. |
| Several non-engineers need self-serve access without risk to production | Move to a warehouse and a BI tool with permissions. |
| You have repeated metric-definition conflicts across dashboards | Add dbt on top of the warehouse. |
| Multiple consumers (BI, AI assistants, embedded dashboards) need consistent metric answers | Add a semantic layer on top of dbt. |
The matrix reads top to bottom as a maturity ladder. Most companies skip rungs and pay for it. The expensive mistake is buying Snowflake plus Fivetran plus dbt plus a semantic layer plus a BI tool before any of the underlying signals are present, then watching adoption stall because nobody can answer questions faster than they could before.
When you do move, the temptation is to replicate everything. Do not.
A useful first inventory has three buckets:

- Core product tables to land first: users, accounts, subscriptions, events (or their equivalents), and a handful of domain-specific tables.
- Everything else in the production database, added as needed rather than replicated wholesale.
- Sources to deprioritize at the start: marketing platforms with low query volume, support tools, and HR systems. They are useful eventually but rarely on the critical path for the first warehouse.
The first dashboards in the warehouse should be the ones that previously required reaching into Stripe or stitching together exports. That is where the warehouse pays for itself fastest.
Not every company benefits from a warehouse, even at scale. A few patterns repeat:
If your business is a single-product, single-database company with a small operations team and metrics that do not change much, a warehouse adds operational cost without much payoff. A read replica plus a BI tool plus a few well-maintained SQL views can carry a focused team to nine figures of revenue.
If your data needs are dominated by real-time operational dashboards (alerting, fraud detection, sub-minute monitoring), a warehouse is the wrong shape. You want a streaming system or an HTAP database (CockroachDB, SingleStore, ClickHouse Cloud) that can handle both writes and analytical reads with low latency.
If your team has no one who is comfortable with SQL or willing to own data infrastructure, a warehouse will be a worse experience than a focused BI tool that connects directly to your existing database. The warehouse is not a substitute for someone who understands the data.
The warehouse pays off when the costs of fragmentation across sources, across history, and across consumers exceed the costs of running another piece of infrastructure. That crossover happens at different times for different companies. The signals above are how to recognize it without guessing.
For most companies that are ready, a useful sequence follows the rungs of the decision matrix above:

1. Add well-named SQL views on the production database.
2. Add a read replica and point analytics at it when performance degrades.
3. Spin up a small warehouse with managed connectors for the one or two sources you join most often.
4. Land a snapshot or CDC feed of the production database for history.
5. Put a permissioned BI tool in front of the warehouse for self-serve access.
6. Add dbt once metric definitions start to conflict.
7. Add a semantic layer once multiple consumers need consistent answers.
Each step solves a specific problem. The trap is doing all seven at once, before any of them has been stress-tested against real questions.
AI-assisted analytics shifts the curve slightly. Tools that generate SQL against a database, built into BI products like Basedash, Hex, and Omni or used standalone, make ad-hoc questions cheaper and increase the number of people querying data. That accelerates signal #4 (more non-engineers in the data) and indirectly accelerates signal #5 (transformation drift, because more people writing queries means more inconsistent definitions).
It does not, on its own, force a warehouse. AI tools can run against a production database or a read replica, and many do well there. But the questions they make easy to ask are exactly the ones that benefit from a warehouse: cross-source, historical, and self-serve. If AI-assisted analytics is part of your roadmap, the warehouse decision usually arrives a quarter or two earlier than it otherwise would. We’ve covered the broader pattern in text-to-SQL for data warehouses.
Is a read replica enough, or do I need a real warehouse?
A read replica solves performance isolation and nothing else. It does not give you cross-source data, historical snapshots, or columnar performance on large aggregations. If the only signal you are hitting is slow queries on production, start with a replica. If any of the other four signals are also present, the replica is a temporary measure.
Do I need dbt to start with a warehouse?
No. Plenty of teams run a useful warehouse with raw schemas plus a handful of SQL views. dbt becomes valuable when transformations start having dependencies on other transformations, and when you want tests and documentation as part of the workflow. That usually happens six to eighteen months into using a warehouse, not on day one.
Can I run analytics on a Postgres read replica forever?
Some companies do. The limits are: you cannot easily land data from other sources, you cannot easily keep historical snapshots, and Postgres performance on large aggregations is worse than columnar warehouses. If your business never crosses those limits, a replica plus a BI tool is a complete stack.
Should I use Snowflake, BigQuery, Redshift, or ClickHouse?
For most general-purpose analytics workloads, Snowflake and BigQuery are the safest choices, with BigQuery favored on Google Cloud and Snowflake working anywhere. Redshift is a reasonable default if you are deeply committed to AWS. ClickHouse is the best choice when your primary workload is high-volume event data with sub-second query requirements. The bigger decision is the surrounding stack (connectors, transformation tool, BI tool), not the warehouse itself.
How big does a company have to be before a warehouse makes sense?
There is no headcount or revenue threshold. The right trigger is the signals above, not company size. Some 15-person companies need a warehouse; some 200-person companies do not. The closest size-based heuristic: most companies hit the cross-source signal somewhere between $1M and $5M ARR, and that is when a warehouse decision usually becomes concrete.
Can a BI tool replace the need for a warehouse?
Sometimes. A BI tool that connects directly to multiple sources and handles its own caching can substitute for a warehouse for a while. The substitution breaks down when transformations get complex, when historical snapshots are needed, or when multiple tools need to read the same canonical data. At that point the warehouse becomes the system of record and the BI tool becomes one of several consumers.
Written by Max Musing, Founder and CEO of Basedash
Max Musing is the founder and CEO of Basedash, an AI-native business intelligence platform designed to help teams explore analytics and build dashboards without writing SQL. His work focuses on applying large language models to structured data systems, improving query reliability, and building governed analytics workflows for production environments.
Basedash lets you build charts, dashboards, and reports in seconds using all your data.