Using the information present in sources, dbt can determine how "fresh" source data is at a given point in time. dbt should provide a command capable of snapshotting data freshness (the max(loaded_at_field) for each table) at a given point in time. When this command is invoked, dbt should produce a JSON file containing information about the freshness.
Arguments
--select
This flag allows users to select specific sources to describe. It should accept multiple values, each of which is either:
The name of a source (eg. snowplow, quickbooks, etc)
The name of a specific table in a source (eg. snowplow.event, quickbooks.accounts). This name is generated by concatenating the source and table names with a dot.
If no sources are selected with --select, then dbt should calculate the freshness for all of the sources in a project.
-o
A path to a .json file (relative to the target/ directory?) to write the file to.
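Taken together, an invocation might look like the following. Note that this spec does not fix a subcommand name, so `source snapshot-freshness` here is a placeholder:

```
# Hypothetical invocation: snapshot freshness for one source and one
# source table, writing results to a custom output path.
# The subcommand name is a placeholder, not part of this spec.
dbt source snapshot-freshness --select snowplow quickbooks.accounts -o freshness.json
```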
Calculating Freshness
with source_snapshot as (

    select max({{ loaded_at_field }}) as max_loaded_at
    from github_stars.stargazers

)

select
    max_loaded_at,
    getdate() as snapshotted_at,
    datediff(second, max_loaded_at, getdate()) as max_loaded_at_time_ago_in_s

from source_snapshot
This query will vary in all of the usual, unfortunate ways across databases:
inconsistent mechanisms for getting the current timestamp (getdate(), now(), current_timestamp, etc)
inconsistent data types (timestamp_tz vs. timestamp_ntz on Snowflake)
incomplete support for datediff (namely on postgres)
As such, this command should be implemented using the adapter macro paradigm. Moreover, it would be convenient to support a contract of fields in this query, then let users supply their own macro to calculate the time delta. This is a nice-to-have for the first cut of this feature, but if it's easy to do, we should do it!
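To sketch what the adapter macro paradigm might look like here, a dispatching macro for the current timestamp could be structured roughly as follows (macro names and the set of adapter overrides are illustrative, not final):

```
{% macro current_timestamp() %}
    {{ adapter_macro('current_timestamp') }}
{% endmacro %}

{% macro default__current_timestamp() %}
    current_timestamp
{% endmacro %}

{% macro redshift__current_timestamp() %}
    getdate()
{% endmacro %}

{% macro snowflake__current_timestamp() %}
    convert_timezone('UTC', current_timestamp())
{% endmacro %}
```

The freshness query would then call {{ current_timestamp() }} instead of hardcoding getdate(), and a user-supplied override macro could slot into the same dispatch mechanism.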
Output file format
All times should be UTC
{
  "meta": {
    "generated_at": "2019-01-15T19:57:51.793643Z",
    "elapsed_time": 0.314208984375
  },
  "sources": {
    # map the source unique id onto data about the source
    "source.project.source_name.table": {
      "max_loaded_at": "2018-01-01 12:00:00.123",
      "snapshotted_at": "2018-01-01 12:02:12.456",
      "max_loaded_at_time_ago_in_s": 1234,
      "state": "warn",   # one of {ok|warn|error}
      "criteria": {....} # copied from the schema.yml spec
    }
  }
}
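To make the `state` field concrete, here is a hypothetical sketch of how it could be derived from the freshness criteria. The `{count, period}` shape of the criteria is an assumption based on the conditions shown in the stdout examples (e.g. count=800 period=day), not a settled format:

```python
# Hypothetical sketch: derive the freshness "state" from warn/error criteria.
# The criteria shape ({"count": ..., "period": ...}) is an assumption.
PERIOD_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def freshness_state(max_loaded_at_time_ago_in_s, warn_after=None, error_after=None):
    def exceeded(criteria):
        # A missing criterion never triggers.
        if criteria is None:
            return False
        threshold_s = criteria["count"] * PERIOD_SECONDS[criteria["period"]]
        return max_loaded_at_time_ago_in_s > threshold_s

    # Check the more severe condition first.
    if exceeded(error_after):
        return "error"
    if exceeded(warn_after):
        return "warn"
    return "ok"


# A table 812 days stale, with warn at 10 days and error at 800 days:
print(freshness_state(812 * 86400,
                      warn_after={"count": 10, "period": "day"},
                      error_after={"count": 800, "period": "day"}))  # -> error
```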
Stdout
This command should work a lot like the dbt run command, outputting a parallelized list of resource invocations to the console.
17:05:22 | Concurrency: 8 threads (target='dev')
17:05:22 |
17:05:22 | 1 of 3 START freshness of source.table_1 ................ [RUN]
17:05:22 | 2 of 3 START freshness of source.table_2 ................ [RUN]
17:05:22 | 3 of 3 START freshness of source.table_3 ................ [RUN]
17:05:22 | 1 of 3 OK freshness of source.table_1 ................... [OK in 1.2s]
17:05:22 | 2 of 3 WARN freshness of source.table_2 ................. [WARN in 2.4s]
17:05:22 | 3 of 3 ERROR freshness of source.table_3 ................ [ERROR in 3.5s]
17:05:22 |
17:05:22 | Finished running 3 sources in 9.29s.
Completed with 1 error and 1 warning:
Freshness Error in source table_3 (models/sources.yml)
The table source.table_3 is 812 days out of date. Error for condition count=800, period=day.
Freshness Warning in source table_2 (models/sources.yml)
The table source.table_2 is 12 days out of date. Warning for condition count=10, period=day.