Text-to-SQL for data warehouses: querying Snowflake, BigQuery, and PostgreSQL in plain English
Max Musing, Founder and CEO of Basedash
· March 11, 2026
You have data spread across Snowflake, BigQuery, and PostgreSQL. You want to ask questions in plain English and get answers. The problem is that these aren’t interchangeable databases — they each speak a different dialect of SQL, structure schemas differently, handle data types with their own quirks, and optimize queries in fundamentally different ways.
A text-to-SQL tool that works on one warehouse doesn’t automatically work on another. The AI needs to know that Snowflake uses ILIKE for case-insensitive matching while BigQuery uses LOWER() with LIKE. It needs to know that PostgreSQL arrays are native types while Snowflake represents them as VARIANT. It needs to understand that a query that runs in 200ms on PostgreSQL might need a completely different structure to run efficiently on BigQuery’s columnar storage.
This guide covers how modern text-to-SQL engines handle these differences, what breaks when they don’t, and what to look for in a tool that works reliably across your data stack.
SQL is supposedly a standard. In practice, every warehouse vendor has added extensions, changed syntax, and made design decisions that make their SQL meaningfully different. When a user asks “show me revenue by month for the last year,” the AI has to produce syntactically correct SQL for whichever warehouse it’s targeting — and “correct” means different things for each one.
Date functions are where warehouse dialects diverge the most, and date operations appear in almost every analytical query.
To get revenue by month for the last year:
Snowflake:
SELECT DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS total_revenue
FROM orders
WHERE order_date >= DATEADD('year', -1, CURRENT_DATE())
GROUP BY 1 ORDER BY 1
BigQuery:
SELECT DATE_TRUNC(order_date, MONTH) AS month,
SUM(revenue) AS total_revenue
FROM orders
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
GROUP BY 1 ORDER BY 1
PostgreSQL:
SELECT DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS total_revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY 1 ORDER BY 1
Three warehouses, three different syntaxes for the same question. DATE_TRUNC takes arguments in reversed order between Snowflake and BigQuery. Date arithmetic uses DATEADD in Snowflake, DATE_SUB in BigQuery, and the - operator with INTERVAL in PostgreSQL. A text-to-SQL tool that generates Snowflake syntax against a BigQuery backend returns an error, not an answer.
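One way to picture how an engine keeps these dialects straight is a per-warehouse template table. This is an illustrative sketch, not any particular tool's implementation; the dialect names and template structure are assumptions:

```python
# Hypothetical sketch: templating the same "revenue by month, last year"
# question per warehouse dialect. Templates mirror the three queries above.
DATE_TEMPLATES = {
    "snowflake": {
        "trunc": "DATE_TRUNC('month', {col})",
        "last_year": "{col} >= DATEADD('year', -1, CURRENT_DATE())",
    },
    "bigquery": {
        "trunc": "DATE_TRUNC({col}, MONTH)",  # note: reversed argument order
        "last_year": "{col} >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)",
    },
    "postgresql": {
        "trunc": "DATE_TRUNC('month', {col})",
        "last_year": "{col} >= CURRENT_DATE - INTERVAL '1 year'",
    },
}

def revenue_by_month(dialect: str, date_col: str = "order_date") -> str:
    # Assemble a dialect-correct query from the templates above.
    t = DATE_TEMPLATES[dialect]
    return (
        f"SELECT {t['trunc'].format(col=date_col)} AS month, "
        f"SUM(revenue) AS total_revenue FROM orders "
        f"WHERE {t['last_year'].format(col=date_col)} "
        f"GROUP BY 1 ORDER BY 1"
    )
```

Real engines lean on the LLM rather than hand-built templates, but the mapping problem is the same: one logical operation, three concrete syntaxes.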
Snowflake stores identifiers in uppercase by default. If you create a table called user_events without quoting it, Snowflake stores it as USER_EVENTS. Query it as user_events and it works because unquoted identifiers are case-insensitive. But if someone created the table with double quotes as "user_events", now it’s case-sensitive and only matches that exact casing.
BigQuery uses backticks for quoting and requires project and dataset prefixes: `project.dataset.table`. PostgreSQL uses double quotes and folds unquoted identifiers to lowercase — the opposite of Snowflake’s behavior.
A text-to-SQL engine needs to track these conventions per connection. Getting the quoting wrong doesn’t just cause query failures — it can cause the AI to reference the wrong table entirely if the warehouse has both USERS and "users" (which is more common than you’d think in migrated schemas).
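The folding rules themselves are small enough to sketch. Function names here are illustrative, and the BigQuery qualification assumes on-demand project/dataset strings:

```python
# Identifier folding per dialect: Snowflake folds unquoted identifiers to
# UPPERCASE, PostgreSQL to lowercase, BigQuery preserves case.
def fold_identifier(name: str, dialect: str, quoted: bool = False) -> str:
    if quoted:
        return name  # quoted identifiers keep their exact casing everywhere
    if dialect == "snowflake":
        return name.upper()
    if dialect == "postgresql":
        return name.lower()
    return name  # BigQuery: case-preserving

def qualify_table(table: str, dialect: str,
                  project: str = "", dataset: str = "") -> str:
    # BigQuery references need backticks and project.dataset prefixes.
    if dialect == "bigquery":
        return f"`{project}.{dataset}.{table}`"
    return table
```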
Modern warehouses all support JSON and semi-structured data, but the access syntax varies:
Snowflake uses the : operator and LATERAL FLATTEN:
SELECT payload:user:name::STRING AS user_name
FROM events
BigQuery uses dot notation and JSON_EXTRACT_SCALAR:
SELECT JSON_EXTRACT_SCALAR(payload, '$.user.name') AS user_name
FROM events
PostgreSQL uses -> and ->> operators:
SELECT payload->'user'->>'name' AS user_name
FROM events
When a user asks “show me user names from the events payload,” the AI needs to know which JSON access pattern the target warehouse expects. Getting this wrong produces cryptic type errors that a non-technical user can’t debug.
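A minimal sketch of dialect-aware JSON path generation, assuming a simplified path model (no array indexing, escaping, or casts beyond what the snippets above show):

```python
# Generate a dialect-correct expression for a JSON path like
# payload.user.name, mirroring the three syntaxes above.
def json_path_expr(column: str, path: list, dialect: str) -> str:
    if dialect == "snowflake":
        # Colon traversal, plus ::STRING to cast the VARIANT result to text.
        return column + "".join(f":{p}" for p in path) + "::STRING"
    if dialect == "bigquery":
        return f"JSON_EXTRACT_SCALAR({column}, '$.{'.'.join(path)}')"
    if dialect == "postgresql":
        *inner, last = path
        expr = column + "".join(f"->'{p}'" for p in inner)
        return f"{expr}->>'{last}'"  # ->> returns text for the final key
    raise ValueError(f"unknown dialect: {dialect}")
```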
The good news is that modern text-to-SQL tools have developed several strategies for handling multi-warehouse environments.
The most reliable approach is training or prompting the language model with warehouse-specific SQL syntax. When the system knows it’s connected to BigQuery, it generates BigQuery SQL from the start rather than generating generic SQL and trying to translate it.
This works because the major LLMs have seen enough SQL from each warehouse in their training data to understand dialect differences. The text-to-SQL engine’s job is to provide the right context: this connection targets Snowflake, here’s the schema, here are the available functions.
In practice, tools like Basedash connect directly to your warehouse with read-only access and introspect the schema at connection time. The system knows your exact tables, columns, data types, and which warehouse engine is on the other end. When you ask a question, the AI generates dialect-correct SQL from the first token.
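The prompt-side context might look something like the following. This is a hypothetical sketch of the general technique, not Basedash's actual prompt format:

```python
# Assemble dialect and schema context for the language model prompt.
# The prompt structure and schema format here are assumptions.
def build_prompt(dialect: str, schema: dict, question: str) -> str:
    schema_lines = "\n".join(
        f"  {table}({', '.join(cols)})" for table, cols in schema.items()
    )
    return (
        f"You are generating {dialect} SQL. Use only {dialect} syntax.\n"
        f"Schema:\n{schema_lines}\n"
        f"Question: {question}\n"
        f"Return a single read-only SELECT statement."
    )
```

The point is that dialect is declared up front, so the model never has to guess which warehouse it is targeting.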
Each warehouse exposes schema metadata differently:
- Snowflake: INFORMATION_SCHEMA views and the SHOW command family. Schema metadata includes clustering keys, micro-partition info, and time travel retention — all useful context for generating efficient queries.
- BigQuery: INFORMATION_SCHEMA and the BigQuery API. Partitioning and clustering metadata is critical for cost-effective queries since BigQuery charges by bytes scanned.
- PostgreSQL: pg_catalog and information_schema. Index definitions, constraint details, and statistics (via pg_stats) help the AI understand query plan implications.

A text-to-SQL tool that only reads basic table and column definitions misses warehouse-specific optimization opportunities. Knowing that a BigQuery table is partitioned by date means the AI can add partition filters to avoid scanning terabytes of data. Knowing that a Snowflake table is clustered by customer_id means the AI can structure WHERE clauses to leverage micro-partition pruning.
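Concretely, the introspection step amounts to running per-warehouse metadata queries at connection time. These statements are representative sketches; the schema, project, and dataset names are placeholders, and exact metadata columns vary by warehouse version:

```python
# Representative metadata queries an engine might run at connection time.
INTROSPECTION_QUERIES = {
    "snowflake": "SHOW TABLES IN SCHEMA analytics.public",
    "bigquery": (
        "SELECT table_name, ddl "
        "FROM `project.dataset.INFORMATION_SCHEMA.TABLES`"
    ),
    "postgresql": (
        "SELECT table_name, column_name, data_type "
        "FROM information_schema.columns "
        "WHERE table_schema = 'public'"
    ),
}

def introspection_query(dialect: str) -> str:
    return INTROSPECTION_QUERIES[dialect]
```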
Every warehouse has a unique function library. Text-to-SQL engines maintain internal mappings between common operations and their warehouse-specific implementations:
| Operation | Snowflake | BigQuery | PostgreSQL |
|---|---|---|---|
| String concatenation | CONCAT() or || | CONCAT() | CONCAT() or || |
| Current timestamp | CURRENT_TIMESTAMP() | CURRENT_TIMESTAMP() | NOW() or CURRENT_TIMESTAMP |
| Null coalescing | NVL() or COALESCE() | IFNULL() or COALESCE() | COALESCE() |
| Array aggregation | ARRAY_AGG() | ARRAY_AGG() | ARRAY_AGG() |
| Regex matching | REGEXP_LIKE() | REGEXP_CONTAINS() | ~ operator |
| Approximate count | APPROX_COUNT_DISTINCT() | APPROX_COUNT_DISTINCT() | (not native) |
The AI doesn’t need to memorize this table — it needs to generate the correct function for the target warehouse when a user’s question requires that operation. Asking “how many unique users visited each page” should produce COUNT(DISTINCT user_id) everywhere, but might use APPROX_COUNT_DISTINCT on large Snowflake or BigQuery tables where exact counts are too slow.
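A mapping like the table above can be sketched as a lookup keyed by logical operation and dialect. The entries mirror the table and are not exhaustive; the PostgreSQL approximate-count fallback to an exact COUNT(DISTINCT) is an illustrative design choice:

```python
# Operation-to-function mapping, keyed by (logical operation, dialect).
FUNCTION_MAP = {
    ("approx_count_distinct", "snowflake"): "APPROX_COUNT_DISTINCT({col})",
    ("approx_count_distinct", "bigquery"): "APPROX_COUNT_DISTINCT({col})",
    # PostgreSQL has no native approximate count; fall back to exact.
    ("approx_count_distinct", "postgresql"): "COUNT(DISTINCT {col})",
    ("regex_match", "snowflake"): "REGEXP_LIKE({col}, {pattern})",
    ("regex_match", "bigquery"): "REGEXP_CONTAINS({col}, {pattern})",
    ("regex_match", "postgresql"): "{col} ~ {pattern}",
}

def render(op: str, dialect: str, **args) -> str:
    return FUNCTION_MAP[(op, dialect)].format(**args)
```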
Generating syntactically correct SQL is table stakes. Generating efficient SQL requires understanding how each warehouse processes queries.
Snowflake separates compute from storage. The query itself doesn’t determine cost — the warehouse size and run time do. A text-to-SQL engine generating queries for Snowflake should:
- Avoid SELECT * when the user only needs a few columns, since Snowflake’s columnar storage benefits from column pruning.
- Reference clustering keys in WHERE and GROUP BY clauses when possible. If the table is clustered by created_at, filtering on that column triggers micro-partition pruning.
- Use QUALIFY for window function filtering instead of nested subqueries, since Snowflake optimizes QUALIFY natively.

BigQuery charges by bytes scanned (in the on-demand model), so query cost is directly proportional to how much data the query reads. Efficient text-to-SQL for BigQuery means:

- Avoiding SELECT * aggressively. BigQuery’s columnar storage means selecting fewer columns reads less data and costs less.
- Applying LIMIT judiciously. Unlike row-store databases where LIMIT can short-circuit execution, BigQuery typically processes the full query before applying LIMIT, so it doesn’t reduce cost unless combined with other optimizations.

PostgreSQL is a row-oriented database (ignoring extensions like Citus or TimescaleDB). Efficient queries leverage indexes:

- Structuring WHERE clauses to hit index scans instead of sequential scans.
- Using EXPLAIN ANALYZE output, when available to the AI, to learn which query patterns perform well on a specific schema. Some advanced tools run explain plans on generated queries before executing them.

If your data lives in more than one warehouse — or if you’re evaluating tools before picking a warehouse — these are the capabilities that separate tools that work from tools that break.
The tool should connect to Snowflake, BigQuery, PostgreSQL, MySQL, ClickHouse, Redshift, and other common warehouses without requiring you to consolidate everything into one system first. Each connection should be independent, with its own schema introspection and dialect handling.
Basedash connects to all major warehouses and databases with read-only access and generates dialect-correct SQL for each one. For teams that don’t maintain their own warehouse, Basedash also offers a managed data warehouse with 750+ pre-built connectors that sync data automatically — eliminating the need to set up ETL pipelines before you can start asking questions.
Some text-to-SQL tools treat each database connection as an island. Better tools maintain context across your entire data estate. When a user asks “compare website traffic from our analytics database with revenue from our warehouse,” the system should understand that this requires querying two sources and combining the results.
This doesn’t mean running cross-database joins at the SQL level (that rarely works). It means the AI understands the full picture of your data, can query each source independently, and combine results at the application layer.
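An application-layer combine can be as simple as joining two independently fetched result sets in memory. The row shapes below are hypothetical:

```python
# Combine results from two sources at the application layer:
# query each independently, then join rows on a shared key in memory.
def combine_by_month(traffic_rows, revenue_rows):
    """Join [(month, visits)] from the analytics database with
    [(month, revenue)] from the warehouse on the month key."""
    revenue = dict(revenue_rows)
    return [
        (month, visits, revenue.get(month))
        for month, visits in traffic_rows
    ]
```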
A metric like “monthly recurring revenue” might be defined differently in your PostgreSQL application database (where it’s calculated from individual subscriptions) versus your Snowflake warehouse (where it might be a pre-aggregated column in a dbt model). The text-to-SQL tool’s semantic layer should allow you to define metrics per data source, not just globally.
When someone asks “what’s our MRR,” the tool should know which source to query and which calculation to apply, based on where the data lives and how the metric is defined for that context.
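A per-source metric registry might look like the following sketch. The table, schema, and column names are assumptions chosen to match the MRR example:

```python
# The same metric name maps to a different calculation depending on
# which connection it targets.
METRICS = {
    ("mrr", "postgresql"): (
        "SELECT SUM(amount) FROM subscriptions WHERE status = 'active'"
    ),
    ("mrr", "snowflake"): (
        "SELECT mrr FROM analytics.finance.mrr_monthly "
        "ORDER BY month DESC LIMIT 1"
    ),
}

def metric_query(name: str, source: str) -> str:
    return METRICS[(name, source)]
```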
Before running AI-generated SQL against your warehouse, the tool should validate the query against the schema. A comparison that implicitly casts a STRING to a DATE should be flagged, not executed.

Warehouse schemas change. Columns get renamed, tables get refactored, new data sources get added. A text-to-SQL tool that caches schema metadata indefinitely will generate queries against stale schemas. Look for tools that refresh schema metadata periodically or on-demand, and that handle missing columns gracefully rather than producing cryptic SQL errors.
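Pre-execution validation against the introspected schema can be sketched as two checks: unknown columns and type-mismatched comparisons. The schema format and table contents here are illustrative:

```python
# Validate generated SQL pieces against an introspected schema snapshot.
SCHEMA = {"orders": {"id": "INT64", "order_date": "DATE", "status": "STRING"}}

def unknown_columns(table: str, columns: list) -> list:
    """Return any referenced columns missing from the current schema."""
    known = SCHEMA.get(table, {})
    return [c for c in columns if c not in known]

def comparison_ok(table: str, col: str, literal_type: str) -> bool:
    """Flag comparisons that would require an implicit cast."""
    return SCHEMA[table][col] == literal_type
```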
The most common mistake is treating all SQL databases as interchangeable. Teams that prototype on PostgreSQL and deploy on Snowflake discover that queries need non-trivial changes. A text-to-SQL tool that “works on PostgreSQL” doesn’t automatically work on Snowflake — you need to verify per-warehouse support explicitly.
A query that returns the right answer can still be a bad query if it costs $50 on BigQuery because it scanned an unpartitioned table. Text-to-SQL tools that don’t understand warehouse-specific cost models can generate correct but expensive queries. This is especially dangerous in self-service environments where many users run ad hoc queries simultaneously.
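Simple guardrails can catch the worst of this before execution. The heuristics below are a minimal sketch, assuming a known partition column name, and are nowhere near a production linter:

```python
import re

# Flag cost-risky patterns before running a query: SELECT * everywhere,
# and (for BigQuery) a missing filter on an assumed partition column.
def lint_query(sql: str, dialect: str, partition_col: str = "event_date") -> list:
    warnings = []
    if re.search(r"SELECT\s+\*", sql, re.IGNORECASE):
        warnings.append("select-star: enumerate needed columns instead")
    if dialect == "bigquery" and partition_col not in sql:
        warnings.append(
            f"no filter on partition column {partition_col}; "
            "query may scan the full table"
        )
    return warnings
```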
Each warehouse implements access control differently. Snowflake uses roles and virtual warehouses. BigQuery uses IAM and authorized views. PostgreSQL uses roles and row-level security policies. A text-to-SQL tool needs to respect the target warehouse’s security model, not just apply its own access layer on top. The best tools integrate with your warehouse’s native access controls rather than bypassing them.
If you use multiple data sources, test the text-to-SQL tool against each one with representative questions. Some tools work brilliantly on PostgreSQL (because most training data is PostgreSQL) but generate subtly wrong queries on Snowflake or BigQuery. Ask the same question against different warehouses and verify that the generated SQL is idiomatically correct for each target.
The fastest path to querying your warehouse in plain English:
Choose a tool that natively supports your warehouse. Don’t settle for “SQL support” — confirm that the tool generates dialect-correct SQL for your specific warehouse and has been tested against it.
Connect with read-only credentials. There’s no reason a text-to-SQL tool needs write access. Read-only connections eliminate an entire category of risk.
Add business context. Raw schema metadata isn’t enough. Define your key metrics, add descriptions to important columns, and document any naming conventions that aren’t self-explanatory. This context dramatically improves query accuracy.
Start with known questions. Before opening the tool to your team, test it with questions where you already know the answer. Compare the AI’s output to your existing dashboards. This builds confidence and catches configuration issues early.
Monitor query performance and cost. Once people start querying, track execution times and warehouse costs. Identify expensive queries early and work with the tool’s semantic layer or guardrails to optimize them.
Text-to-SQL technology has reached the point where non-technical users can genuinely query production data warehouses without learning SQL. The remaining challenge is handling the real-world complexity of different warehouses, messy schemas, and enterprise security requirements. Tools that handle these well — by understanding warehouse dialects, optimizing for each engine’s cost model, and enforcing governance at the query level — deliver on the promise. Tools that don’t end up generating more confusion than clarity.
Written by Max Musing, Founder and CEO of Basedash
Max Musing is the founder and CEO of Basedash, an AI-native business intelligence platform designed to help teams explore analytics and build dashboards without writing SQL. His work focuses on applying large language models to structured data systems, improving query reliability, and building governed analytics workflows for production environments.
Basedash lets you build charts, dashboards, and reports in seconds using all your data.