What is a data dictionary, and how to build one for analytics
Max Musing
Max MusingFounder and CEO of Basedash
· June 28, 2026

Max Musing
Max MusingFounder and CEO of Basedash
· June 28, 2026

A data dictionary is a structured reference that defines what every field, table, and metric in your data means: its name, type, source, definition, allowed values, and who owns it. It is the answer to the question every analytics team gets ten times a week, “what does this column actually mean?”, written down once so nobody has to guess.
If you have ever seen two dashboards report different numbers for “active users,” or watched a new hire spend a week reverse-engineering what status = 3 means, that is the gap a data dictionary fills. This guide explains what belongs in one, how to build it without turning it into a documentation project that nobody maintains, and how it differs from a data catalog and a semantic layer. It is written for founders, analysts, and data leads who want their data to be readable, not just queryable.
A data dictionary describes the contents of a dataset at the level of individual fields. For each column it records the technical facts (name, data type, nullability), the business meaning (what the field represents in plain language), and the governance facts (who owns it, where it comes from, whether it contains sensitive data).
The key distinction: a data dictionary documents meaning, not just structure. A database schema already tells you that orders.amount is a numeric column. A data dictionary tells you that amount is the gross order total in cents before refunds and excluding tax, sourced from Stripe, and that you should use net_amount for revenue reporting. That sentence is the difference between a trustworthy dashboard and a support ticket.
Historically a data dictionary was a static document. In practice it now lives anywhere your team will actually read it: a table in your warehouse, a dbt schema file, a wiki page, a column-description field inside your BI tool, or a dedicated catalog. The format matters less than two things: it is correct, and it is close to where people use the data.
These three terms get used interchangeably and they are not the same thing. They sit at different scopes and solve different problems.
| Data dictionary | Data catalog | Semantic layer | |
|---|---|---|---|
| Scope | One database or dataset | The whole data estate | The metrics layer on top of tables |
| Primary job | Define what each field means | Discover and index assets across sources | Compute consistent metrics from raw tables |
| Granularity | Columns, tables, metrics | Datasets, lineage, ownership | Measures, dimensions, joins |
| Who maintains it | Analysts, data owners | Data platform team | Analysts, analytics engineers |
| Typical form | A document, table, or schema file | A platform (Alation, Atlan, DataHub) | A modeling layer (dbt, LookML, BI tool) |
| When you need it | Almost immediately | When you have many sources and teams | When metric definitions drift |
In practice they layer on top of each other. A small team starts with a data dictionary. As the number of sources grows, a data catalog automates discovery and lineage across them. A semantic layer goes one step further and turns the definitions into computed, reusable metrics so that “revenue” means the same thing in every chart. A data dictionary is the foundation the other two assume already exists.
A useful data dictionary records the following for each field. You do not need every attribute for every column, but the first five are close to mandatory for any field that shows up in reporting.
| Attribute | What it captures | Example |
|---|---|---|
| Field name | The technical column name | mrr_cents |
| Type | Data type and nullability | integer, not null |
| Definition | Plain-language meaning, in one sentence | Monthly recurring revenue in cents, excluding one-time charges |
| Source | Where the value originates | Stripe subscriptions, synced hourly |
| Owner | Person or team accountable for it | Finance / @dana |
| Allowed values | Enums, ranges, or units | Cents; non-negative |
| Sensitivity | PII or restricted-data flag | None (no PII) |
| Freshness | How current the data is | Updated hourly |
| Notes / caveats | Gotchas that cause wrong queries | Excludes refunds; use net_mrr for net |
The two attributes that do the most work are definition and notes. A good definition prevents the wrong column from being used. A good caveat (“this excludes refunds”) prevents the right column from being used the wrong way. If you only have time to write two fields per column, write those.
Beyond individual columns, a complete dictionary also documents the table itself: its grain (what one row represents), its primary key, its update cadence, and how it relates to other tables. Grain is the single most common source of double-counted metrics, so recording it explicitly is worth the effort. For more on why grain matters, see how to model data for BI.
Most data dictionaries fail because they try to be a complete technical manual. A more durable approach is to treat the dictionary as three layers, and to be honest about which layer each field actually needs.
1. The physical layer. Column names, types, nullability, keys. This layer can be generated automatically from the database schema. Do not hand-write it. Anything a query against information_schema can produce should be produced that way.
2. The business layer. Definitions, units, caveats, and the mapping from technical names to business concepts (“mrr_cents is monthly recurring revenue”). This layer is where humans add value, and it is the part worth protecting. It cannot be auto-generated reliably, because the meaning lives in people’s heads, not in the schema.
3. The governance layer. Ownership, sensitivity classification, source, and freshness. This layer matters most for anything that touches customer data, finance, or compliance. It connects directly to access control: if a field is flagged as PII here, your data governance rules should restrict who can see it.
The practical implication is that you should automate layer one, invest human effort in layer two, and treat layer three as a governance requirement rather than nice-to-have documentation. Teams that try to hand-maintain all three layers usually end up with a stale dictionary that nobody trusts.
You do not need a documentation sprint. Build it incrementally, starting with the data people actually query.
information_schema (or your warehouse’s equivalent) into a table or spreadsheet. This gives you the skeleton in minutes.A realistic first pass covers the 50 to 150 fields that actually appear in dashboards. That is a few hours of work, not a quarter-long initiative, and it captures most of the value.
The best location is wherever the people who need it already are. A dictionary that requires opening a separate tool is a dictionary that gets ignored.
schema.yml files alongside the models, and they are version-controlled with the code. This keeps definitions next to the transformations that produce them. It is the strongest option for teams that already model in dbt. See where to define business metrics for how this connects to metric definitions.The pattern that scales: generate the physical layer automatically, keep the business and governance layers as close to the data as your tooling allows, and avoid maintaining the same definition in three different places.
A stale data dictionary is worse than none, because people trust it and get burned. Three habits keep it current.
Tie definitions to schema changes. When someone adds, renames, or repurposes a column, updating the definition should be part of that change, not a follow-up task. In a dbt workflow this is natural, because the description sits in the same file as the model. Treating documentation as part of the change is the same discipline as versioning your BI as code.
Make ownership real. Every reporting table should have a named owner who is expected to keep its definitions accurate. Unowned fields rot. A short quarterly review where owners confirm or fix their definitions is enough for most teams.
Capture caveats when they are discovered. The most valuable entries are written the moment someone gets burned. When an analyst discovers that a field excludes refunds or resets at month end, that caveat should land in the dictionary immediately, while the lesson is fresh.
AI-assisted analytics raises the stakes on definitions. When a person writes a natural-language question and a tool generates SQL, the quality of the answer depends heavily on whether the model knows what the columns mean.
An AI assistant pointed at a schema with cryptic names like flg_2 and dt_x will guess, and guessing is how you get confidently wrong numbers. The same assistant pointed at columns annotated with clear definitions (“net_mrr is monthly recurring revenue after refunds, in cents”) can map a question like “what was net MRR last month?” to the correct field on the first try. Definitions are no longer just for humans; they are context that AI systems read to generate correct queries.
This is part of why AI-native BI tools treat field and metric descriptions as first-class. Basedash, for example, surfaces table and column context where analysts and AI both work, so that generated SQL and human queries draw on the same definitions rather than diverging. A good data dictionary is the difference between an AI assistant that reflects your team’s real definitions and one that invents its own.
You need a data dictionary when:
You can skip a formal one when:
Even then, the moment a second person asks “what does this column mean?”, you have crossed the threshold. The cheapest version of a data dictionary is a shared document with the attributes from the table above; the most durable version lives next to the data itself.
What is the difference between a data dictionary and a data catalog? A data dictionary defines the fields, types, and meanings within one dataset or database. A data catalog is a platform that automatically discovers and indexes metadata across your entire data estate, adding lineage, usage stats, and search. A dictionary describes one source in depth; a catalog indexes many. Small teams start with a dictionary and add a catalog when the number of sources grows.
Is a data dictionary the same as a semantic layer? No. A data dictionary documents what fields mean. A semantic layer goes further and computes consistent metrics from raw tables, so “revenue” resolves to the same calculation everywhere. The dictionary is documentation; the semantic layer is executable definitions. They complement each other.
Who should own the data dictionary? Ownership should sit with the team that produces each dataset, not the team that consumes it. Producers know the source and the caveats. A central data or analytics lead can own the structure and process, but per-field definitions should map to the people accountable for that data.
How do I create a data dictionary for a PostgreSQL or MySQL database?
Start by querying information_schema.columns to extract column names, types, and nullability into a table or spreadsheet. That gives you the physical layer automatically. Then add one-sentence definitions, owners, and caveats for the fields that appear in reports. If you use dbt, store these in schema.yml so they stay version-controlled with your models.
How often should a data dictionary be updated? Update definitions whenever the schema changes, as part of the change itself rather than as a separate task. Beyond that, a short quarterly review where data owners confirm their definitions is enough for most teams. The fields most likely to drift are calculated metrics and enums, so review those first.
Written by

Founder and CEO of Basedash
Max Musing is the founder and CEO of Basedash, an AI-native business intelligence platform designed to help teams explore analytics and build dashboards without writing SQL. His work focuses on applying large language models to structured data systems, improving query reliability, and building governed analytics workflows for production environments.
Basedash lets you build charts, dashboards, and reports in seconds using all your data.