Using the information present in sources, dbt can determine how "fresh" source data is at a given point in time. dbt should provide a command capable of snapshotting data freshness (the max(loaded_at_field) for each table) at a given point in time. When this command is invoked, dbt should produce a JSON file containing information about the freshness.
Arguments
--select
This flag allows users to select specific sources to describe. It should accept multiple values, each of which is either:
The name of a source (eg. snowplow, quickbooks, etc)
The name of a specific table in a source (eg. snowplow.event, quickbooks.accounts). This name is generated by concatenating the source and table names with a dot.
If no sources are selected with --select, then dbt should calculate the freshness for all of the sources in a project.
-o
A path to a .json file (relative to the target/ directory?) to write the file to.
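Taken together, an invocation might look like the following. Note that this spec does not fix a subcommand name, so `source snapshot-freshness` here is a placeholder:

```
# Hypothetical invocation: snapshot freshness for one source and one
# source table, writing results to a custom output path.
# The subcommand name is a placeholder, not part of this spec.
dbt source snapshot-freshness --select snowplow quickbooks.accounts -o freshness.json
```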
Calculating Freshness
with source_snapshot as (

    select max({{ loaded_at_field }}) as max_loaded_at
    from github_stars.stargazers

)

select
    max_loaded_at,
    getdate() as snapshotted_at,
    datediff(second, max_loaded_at, getdate()) as max_loaded_at_time_ago_in_s

from source_snapshot
This query will vary in all of the usual, unfortunate ways across databases:
inconsistent mechanisms for getting the current timestamp (getdate(), now(), current_timestamp, etc)
inconsistent data types (timestamp_tz vs. timestamp_ntz on Snowflake)
incomplete support for datediff (namely on postgres)
As such, this command should be implemented using the adapter macro paradigm. Moreover, it would be convenient to support a contract of fields in this query, then let users supply their own macro to calculate the time delta. This is a nice-to-have for the first cut of this feature, but if it's easy to do, we should do it!
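To sketch what the adapter macro paradigm might look like here, a dispatching macro for the current timestamp could be structured roughly as follows (macro names and the set of adapter overrides are illustrative, not final):

```
{% macro current_timestamp() %}
    {{ adapter_macro('current_timestamp') }}
{% endmacro %}

{% macro default__current_timestamp() %}
    current_timestamp
{% endmacro %}

{% macro redshift__current_timestamp() %}
    getdate()
{% endmacro %}

{% macro snowflake__current_timestamp() %}
    convert_timezone('UTC', current_timestamp())
{% endmacro %}
```

The freshness query would then call {{ current_timestamp() }} instead of hardcoding getdate(), and a user-supplied override macro could slot into the same dispatch mechanism.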
Output file format
All times should be UTC
{
  "meta": {
    "generated_at": "2019-01-15T19:57:51.793643Z",
    "elapsed_time": 0.314208984375
  },
  "sources": {
    # map the source unique id onto data about the source
    "source.project.source_name.table": {
      "max_loaded_at": "2018-01-01 12:00:00.123",
      "snapshotted_at": "2018-01-01 12:02:12.456",
      "max_loaded_at_time_ago_in_s": 1234,
      "state": "warn",   # one of {ok|warn|error}
      "criteria": {....} # copied from the schema.yml spec
    }
  }
}
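To make the `state` field concrete, here is a hypothetical sketch of how it could be derived from the freshness criteria. The `{count, period}` shape of the criteria is an assumption based on the conditions shown in the stdout examples (e.g. count=800 period=day), not a settled format:

```python
# Hypothetical sketch: derive the freshness "state" from warn/error criteria.
# The criteria shape ({"count": ..., "period": ...}) is an assumption.
PERIOD_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def freshness_state(max_loaded_at_time_ago_in_s, warn_after=None, error_after=None):
    def exceeded(criteria):
        # A missing criterion never triggers.
        if criteria is None:
            return False
        threshold_s = criteria["count"] * PERIOD_SECONDS[criteria["period"]]
        return max_loaded_at_time_ago_in_s > threshold_s

    # Check the more severe condition first.
    if exceeded(error_after):
        return "error"
    if exceeded(warn_after):
        return "warn"
    return "ok"


# A table 812 days stale, with warn at 10 days and error at 800 days:
print(freshness_state(812 * 86400,
                      warn_after={"count": 10, "period": "day"},
                      error_after={"count": 800, "period": "day"}))  # -> error
```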
Stdout
This command should work a lot like the dbt run command, outputting a parallelized list of resource invocations to the console.
17:05:22 | Concurrency: 8 threads (target='dev')
17:05:22 |
17:05:22 | 1 of 3 START freshness of source.table_1 ................ [RUN]
17:05:22 | 2 of 3 START freshness of source.table_2 ................ [RUN]
17:05:22 | 3 of 3 START freshness of source.table_3 ................ [RUN]
17:05:22 | 1 of 3 OK freshness of source.table_1 ................... [OK in 1.2s]
17:05:22 | 2 of 3 WARN freshness of source.table_2 ................. [WARN in 2.4s]
17:05:22 | 3 of 3 ERROR freshness of source.table_3 ................ [ERROR in 3.5s]
17:05:22 |
17:05:22 | Finished running 3 sources in 9.29s.
Completed with 1 error and 1 warning:
Freshness Error in source table_3 (models/sources.yml)
The table source.table_3 is 812 days out of date. Error for condition count=800, period=day.
Freshness Warning in source table_2 (models/sources.yml)
The table source.table_2 is 12 days out of date. Warning for condition count=10, period=day.