Add snapshots to your DAG

What are snapshots?

Analysts often need to "look back in time" at previous data states in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt provides a mechanism, snapshots, which records changes to a mutable table over time.

Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an orders table where the status field can be overwritten as the order is processed.

| id | status | updated_at |
| --- | --- | --- |
| 1 | pending | 2024-01-01 |

Now, imagine that the order goes from "pending" to "shipped". That same record will now look like:

| id | status | updated_at |
| --- | --- | --- |
| 1 | shipped | 2024-01-02 |

This order is now in the "shipped" state, but we've lost the information about when the order was last in the "pending" state. This makes it difficult (or impossible) to analyze how long it took for an order to ship. dbt can "snapshot" these changes to help you understand how values in a row change over time. Here's an example of a snapshot table for the previous example:

| id | status | updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- |
| 1 | pending | 2024-01-01 | 2024-01-01 | 2024-01-02 |
| 1 | shipped | 2024-01-02 | 2024-01-02 | null |
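Because every state is preserved, questions like "how long did this order spend in each state?" become answerable downstream. A minimal sketch of such a query, assuming the snapshot above is available as a relation named orders_snapshot:

```sql
-- Derive per-order state timestamps from the snapshot table.
-- "orders_snapshot" is the snapshot from the example above.
select
    id,
    min(case when status = 'pending' then dbt_valid_from end) as entered_pending_at,
    min(case when status = 'shipped' then dbt_valid_from end) as shipped_at
from orders_snapshot
group by id
```

Subtracting the two timestamps then gives the time-to-ship that was impossible to compute from the mutable source alone.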

Configuring snapshots

Previewing or compiling snapshots in IDE not supported

It is not possible to "preview data" or "compile SQL" for snapshots in dbt Cloud. Instead, run the dbt snapshot command in the IDE.

Configure your snapshots in YAML files to tell dbt how to detect record changes. Define snapshot configurations in YAML files, alongside your models, for a cleaner, faster, and more consistent setup.

snapshots/orders_snapshot.yml

snapshots:
  - name: string
    relation: relation  # source('my_source', 'my_table') or ref('my_model')
    config:
      database: string
      schema: string
      alias: string
      strategy: timestamp | check
      unique_key: column_name_or_expression
      check_cols: [column_name] | all
      updated_at: column_name
      invalidate_hard_deletes: true | false
      snapshot_meta_column_names: dictionary

The following table outlines the configurations available for snapshots:

| Config | Description | Required? | Example |
| --- | --- | --- | --- |
| database | Specify a custom database for the snapshot | No | analytics |
| schema | Specify a custom schema for the snapshot | No | snapshots |
| alias | Specify an alias for the snapshot | No | your_custom_snapshot |
| strategy | The snapshot strategy to use. Valid values: timestamp or check | Yes | timestamp |
| unique_key | A primary key column or expression for the record | Yes | id |
| check_cols | If using the check strategy, the columns to check | Only if using the check strategy | ["status"] |
| updated_at | If using the timestamp strategy, the timestamp column to compare | Only if using the timestamp strategy | updated_at |
| invalidate_hard_deletes | Find hard-deleted records in the source and set dbt_valid_to to the current time if the record no longer exists | No | True |
| snapshot_meta_column_names | Customize the names of the snapshot meta fields | No | dictionary |
  • In versions prior to v1.9, the target_schema (required) and target_database (optional) configurations defined a single schema or database to build a snapshot across users and environments. This created problems when testing or developing a snapshot, as there was no clear separation between development and production environments. In v1.9, target_schema became optional, allowing snapshots to be environment-aware. By default, without target_schema or target_database defined, snapshots now use the generate_schema_name or generate_database_name macros to determine where to build. Developers can still set a custom location with schema and database configs, consistent with other resource types.
  • A number of other configurations are also supported (for example, tags and post-hook). For the complete list, refer to Snapshot configurations.
  • You can configure snapshots from both the dbt_project.yml file and a config block. For more information, refer to the configuration docs.
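As a sketch of the project-file approach, a project-level snapshot configuration might look like the following (the project name my_project is a placeholder; the + prefix marks resource configs in dbt_project.yml):

```yaml
# dbt_project.yml (fragment)
snapshots:
  my_project:
    +schema: snapshots
    +database: analytics
```

Configs set here apply to every snapshot in the project unless overridden in a snapshot's own YAML.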

Add a snapshot to your project

To add a snapshot to your project, follow these steps. For users on versions 1.8 and earlier, refer to Configure snapshots in versions 1.8 and earlier.

  1. Create a YAML file in your snapshots directory: snapshots/orders_snapshot.yml and add your configuration details. You can also configure your snapshot from your dbt_project.yml file (docs).

    snapshots/orders_snapshot.yml

    snapshots:
      - name: orders_snapshot
        relation: source('jaffle_shop', 'orders')
        config:
          schema: snapshots
          database: analytics
          unique_key: id
          strategy: timestamp
          updated_at: updated_at

  2. Since snapshots focus on configuration, the transformation logic is minimal. Typically, you'd select all data from the source. If you need to apply transformations (like filters, deduplication), it's best practice to define an ephemeral model and reference it in your snapshot configuration.

    models/ephemeral_orders.sql
    {{ config(materialized='ephemeral') }}

    select * from {{ source('jaffle_shop', 'orders') }}
  3. Check whether the result set of your query includes a reliable timestamp column that indicates when a record was last updated. For our example, the updated_at column reliably indicates record changes, so we can use the timestamp strategy. If your query result set does not have a reliable timestamp, you'll need to instead use the check strategy — more details on this below.

  4. Run the dbt snapshot command — for our example, a new table will be created at analytics.snapshots.orders_snapshot. The schema config will utilize the generate_schema_name macro.

    $ dbt snapshot
    Running with dbt=1.9.0

    15:07:36 | Concurrency: 8 threads (target='dev')
    15:07:36 |
    15:07:36 | 1 of 1 START snapshot snapshots.orders_snapshot...... [RUN]
    15:07:36 | 1 of 1 OK snapshot snapshots.orders_snapshot..........[SELECT 3 in 1.82s]
    15:07:36 |
    15:07:36 | Finished running 1 snapshot in 0.68s.

    Completed successfully

    Done. PASS=1 ERROR=0 SKIP=0 TOTAL=1
  5. Inspect the results by selecting from the table dbt created (analytics.snapshots.orders_snapshot). After the first run, you should see the results of your query, plus the snapshot meta fields as described later on.

  6. Run the dbt snapshot command again and inspect the results. If any records have been updated, the snapshot should reflect this.

  7. Select from the snapshot in downstream models using the ref function.

    models/changed_orders.sql
    select * from {{ ref('orders_snapshot') }}
  8. Snapshots are only useful if you run them frequently — schedule the dbt snapshot command to run regularly.
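A common downstream pattern is a model that keeps only the current version of each record. A minimal sketch (the model name current_orders is a placeholder):

```sql
-- models/current_orders.sql
-- The most recent version of each record has dbt_valid_to = null.
select *
from {{ ref('orders_snapshot') }}
where dbt_valid_to is null
```

This gives downstream consumers a view equivalent to the mutable source table, while the snapshot itself retains the full history.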

Configuration best practices

 Use the timestamp strategy where possible
 Ensure your unique key is really unique
 Use a schema that is separate from your models' schema
 Use an ephemeral model to clean or transform data before snapshotting

How snapshots work

When you run the dbt snapshot command:

  • On the first run: dbt will create the initial snapshot table — this will be the result set of your select statement, with additional columns including dbt_valid_from and dbt_valid_to. All records will have a dbt_valid_to = null.
  • On subsequent runs: dbt will check which records have changed or if any new records have been created:
    • The dbt_valid_to column will be updated for any existing records that have changed
    • The updated record and any new records will be inserted into the snapshot table. These records will now have dbt_valid_to = null

Note that these column names can be customized to your team or organizational conventions using the snapshot_meta_column_names config.
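For example, renaming the meta fields might look like the following sketch (the custom names valid_from, valid_to, scd_id, and modified_at are placeholders for your own conventions):

```yaml
# snapshots/orders_snapshot.yml (fragment)
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      snapshot_meta_column_names:
        dbt_valid_from: valid_from
        dbt_valid_to: valid_to
        dbt_scd_id: scd_id
        dbt_updated_at: modified_at
```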

Snapshots can be referenced in downstream models the same way as referencing models — by using the ref function.

Detecting row changes

Snapshot "strategies" define how dbt knows if a row has changed. There are two strategies built into dbt:

  • Timestamp — Uses an updated_at column to determine if a row has changed.
  • Check — Compares a list of columns between their current and historical values to determine if a row has changed.

Timestamp strategy

The timestamp strategy uses an updated_at field to determine if a row has changed. If the configured updated_at column for a row is more recent than the last time the snapshot ran, then dbt will invalidate the old record and record the new one. If the timestamps are unchanged, then dbt will not take any action.

The timestamp strategy requires the following configurations:

| Config | Description | Example |
| --- | --- | --- |
| updated_at | A column which represents when the source row was last updated | updated_at |

Example usage:

snapshots/orders_snapshot.yml
snapshots:
  - name: orders_snapshot_timestamp
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at

Check strategy

The check strategy is useful for tables which do not have a reliable updated_at column. This strategy works by comparing a list of columns between their current and historical values. If any of these columns have changed, then dbt will invalidate the old record and record the new one. If the column values are identical, then dbt will not take any action.

The check strategy requires the following configurations:

| Config | Description | Example |
| --- | --- | --- |
| check_cols | A list of columns to check for changes, or all to check all columns | ["name", "email"] |

check_cols = 'all'

The check snapshot strategy can be configured to track changes to all columns by supplying check_cols = 'all'. It is better to explicitly enumerate the columns that you want to check. Consider using a surrogate key to condense many columns into a single column.
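One way to build such a surrogate key is to hash the tracked columns in an ephemeral model and check only the hash. The sketch below assumes the dbt_utils package is installed; the model name orders_hashed and the column name row_hash are placeholders:

```sql
-- models/orders_hashed.sql
{{ config(materialized='ephemeral') }}

select
    *,
    -- condense the tracked columns into one comparable value
    -- (requires the dbt_utils package)
    {{ dbt_utils.generate_surrogate_key(['status', 'is_cancelled']) }} as row_hash
from {{ source('jaffle_shop', 'orders') }}
```

The snapshot can then set relation to ref('orders_hashed') and check_cols to [row_hash].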

Example Usage

snapshots/orders_snapshot.yml
snapshots:
  - name: orders_snapshot_check
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: check
      check_cols:
        - status
        - is_cancelled

Hard deletes (opt-in)

Rows that are deleted from the source query are not invalidated by default. With the config option invalidate_hard_deletes, dbt can track rows that no longer exist. This is done by left joining the snapshot table with the source table and filtering the rows that are still valid at that point but can no longer be found in the source table. dbt_valid_to will be set to the current snapshot time.

This configuration is not a different strategy as described above, but is an additional opt-in feature. It is not enabled by default since it alters the previous behavior.

For this configuration to work with the timestamp strategy, the configured updated_at column must be of timestamp type. Otherwise, queries will fail due to mixing data types.

Example Usage

snapshots/orders_snapshot.yml
snapshots:
  - name: orders_snapshot_hard_delete
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      invalidate_hard_deletes: true

Snapshot query best practices

This section outlines some best practices for writing snapshot queries:

  • Snapshot source data

    As much as possible, snapshot your source data in its raw form, and use downstream models to clean up the data. Your models should then select from these snapshots, treating them like regular data sources.

  • Use the source function in your query

    This helps when understanding data lineage in your project.

  • Include as many columns as possible

    In fact, go for select * if performance permits! Even if a column doesn't feel useful at the moment, it might be better to snapshot it in case it becomes useful – after all, you won't be able to recreate the column later.

  • Avoid joins in your snapshot query

    Joins can make it difficult to build a reliable updated_at timestamp. Instead, snapshot the two tables separately, and join them in downstream models.

  • Limit the amount of transformation in your query

    If you apply business logic in a snapshot query, and this logic changes in the future, it can be impossible (or, at least, very difficult) to apply the change in logic to your snapshots.

Basically – keep your query as simple as possible! Some reasonable exceptions to these recommendations include:

  • Selecting specific columns if the table is wide.
  • Doing light transformation to get data into a reasonable shape, for example, unpacking a JSON blob to flatten your source data into columns.
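As an example of the "avoid joins" advice above, two separately snapshotted tables can be combined in a downstream model instead. This sketch assumes a second snapshot named customers_snapshot and a customer_id join key (both placeholders):

```sql
-- models/orders_with_customers.sql
select
    orders.*,
    customers.email
from {{ ref('orders_snapshot') }} as orders
left join {{ ref('customers_snapshot') }} as customers
    on orders.customer_id = customers.id
    -- pick the customer version that was valid when the order version began
    and orders.dbt_valid_from >= customers.dbt_valid_from
    and orders.dbt_valid_from < coalesce(customers.dbt_valid_to, current_timestamp)
```

Each snapshot keeps its own reliable change detection, and the history-aware join lives in a model that can be rebuilt at any time.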

Snapshot meta-fields

Snapshot tables will be created as a clone of your source dataset, plus some additional meta-fields*.

Starting in 1.9 or with dbt Cloud Versionless, these column names can be customized to your team or organizational conventions via the snapshot_meta_column_names config.

| Field | Meaning | Usage |
| --- | --- | --- |
| dbt_valid_from | The timestamp when this snapshot row was first inserted | This column can be used to order the different "versions" of a record. |
| dbt_valid_to | The timestamp when this row became invalidated | The most recent snapshot record will have dbt_valid_to set to null. |
| dbt_scd_id | A unique key generated for each snapshotted record | This is used internally by dbt. |
| dbt_updated_at | The updated_at timestamp of the source record when this snapshot row was inserted | This is used internally by dbt. |

*The timestamps used for each column are subtly different depending on the strategy you use:

For the timestamp strategy, the configured updated_at column is used to populate the dbt_valid_from, dbt_valid_to and dbt_updated_at columns.

Details for the timestamp strategy

Snapshot query results at 2024-01-01 11:00

| id | status | updated_at |
| --- | --- | --- |
| 1 | pending | 2024-01-01 10:47 |

Snapshot results (note that 11:00 is not used anywhere):

| id | status | updated_at | dbt_valid_from | dbt_valid_to | dbt_updated_at |
| --- | --- | --- | --- | --- | --- |
| 1 | pending | 2024-01-01 10:47 | 2024-01-01 10:47 | null | 2024-01-01 10:47 |

Query results at 2024-01-01 11:30:

| id | status | updated_at |
| --- | --- | --- |
| 1 | shipped | 2024-01-01 11:05 |

Snapshot results (note that 11:30 is not used anywhere):

| id | status | updated_at | dbt_valid_from | dbt_valid_to | dbt_updated_at |
| --- | --- | --- | --- | --- | --- |
| 1 | pending | 2024-01-01 10:47 | 2024-01-01 10:47 | 2024-01-01 11:05 | 2024-01-01 10:47 |
| 1 | shipped | 2024-01-01 11:05 | 2024-01-01 11:05 | null | 2024-01-01 11:05 |

For the check strategy, the current timestamp is used to populate each column. If configured, the check strategy uses the updated_at column instead, as with the timestamp strategy.

Details for the check strategy

Snapshot query results at 2024-01-01 11:00

| id | status |
| --- | --- |
| 1 | pending |

Snapshot results:

| id | status | dbt_valid_from | dbt_valid_to | dbt_updated_at |
| --- | --- | --- | --- | --- |
| 1 | pending | 2024-01-01 11:00 | null | 2024-01-01 11:00 |

Query results at 2024-01-01 11:30:

| id | status |
| --- | --- |
| 1 | shipped |

Snapshot results:

| id | status | dbt_valid_from | dbt_valid_to | dbt_updated_at |
| --- | --- | --- | --- | --- |
| 1 | pending | 2024-01-01 11:00 | 2024-01-01 11:30 | 2024-01-01 11:00 |
| 1 | shipped | 2024-01-01 11:30 | null | 2024-01-01 11:30 |

Configure snapshots in versions 1.8 and earlier

This section is for users on dbt versions 1.8 and earlier. To configure snapshots in versions 1.9 and later, refer to Configuring snapshots. The latest versions use an updated snapshot configuration syntax that optimizes performance.

FAQs

How do I run one snapshot at a time?
How often should I run the snapshot command?
What happens if I add new columns to my snapshot query?
Do hooks run with snapshots?
Can I store my snapshots in a directory other than the `snapshot` directory in my project?
Debug Snapshot target is not a snapshot table errors