
[connectors] Store data source columns in the data warehouse #33

Closed · rudokemper opened this issue Dec 11, 2024 · 1 comment · Fixed by #57

rudokemper commented Dec 11, 2024

With the recent work in guardianconnector-explorer to visualize selected Mapeo data on the Alerts dashboard, we are now working with two different data sources (Change Detection Alerts and Mapeo Data) in the same view. This has led me to want to do things like provide a header with the name of the data source for a selected feature in the sidebar, to make it clear to the user which of the two data sources the feature belongs to.[^1]

We could handle this entirely in GCE by providing the name of the data source in the GCE API endpoint for alerts. However, this need reminded me of recent conversations we've been having about data provenance and about accumulating metadata for incoming data as it is processed, and I am wondering whether it's a good idea for the script to create a record about the data source in the data warehouse, which GCE (or other front ends) can then utilize directly.

At the moment, we don't store any information about the data source. For example, in CMI's demo data warehouse, there is a table demoform_responses containing the responses to a KoboToolbox form processed by the kobotoolbox script. However, nowhere in the data warehouse is there any indication that this was a KoboToolbox form; one would have to deduce that, non-deterministically, from the name of the table or its specific set of columns. (And even then, it could have been an ODK form, which shares much of the same structure.)

For KoboToolbox in particular, a bunch of metadata is available from the https://kf.kobotoolbox.org/api/v2/assets/{form_id}/ endpoint. This includes a name field with the actual name of the form as provided by the user, something like "Biocultural monitoring form (example)". Wouldn't it be great to use that on the front end as the name of the data source, instead of "Kobo data"? (See also #29, where this work is redescribed.)
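As a rough sketch, the connector script could pull that name with a single request (the helper name and token handling here are assumptions for illustration, not existing connector code):

```python
import requests

KOBO_API = "https://kf.kobotoolbox.org/api/v2"

def get_form_name(form_id: str, api_token: str) -> str:
    """Fetch the user-provided form name from the KoboToolbox assets endpoint."""
    resp = requests.get(
        f"{KOBO_API}/assets/{form_id}/",
        headers={"Authorization": f"Token {api_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # The asset payload includes a "name" field,
    # e.g. "Biocultural monitoring form (example)".
    return resp.json()["name"]
```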

However, not all upstream data sources will have such metadata. In those cases, we could add a dataset_source property that stores the name of the tool / script (e.g. KoboToolbox, ODK, Mapeo). That way, we can deterministically expect a specific schema for db tables that contain data from each tool, which will be helpful for any business logic that renders data from each source.
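A minimal sketch of what stamping rows with dataset_source might look like before writing them to the warehouse (the table name, DataFrame, and connection string are placeholders, not part of any existing connector):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the data warehouse.
engine = create_engine("postgresql://user:pass@warehouse:5432/gc")

def write_with_source(df: pd.DataFrame, table: str, source: str) -> None:
    """Tag each row with its upstream tool so front ends can branch on it."""
    df = df.copy()
    df["dataset_source"] = source  # e.g. "KoboToolbox", "ODK", "Mapeo"
    df.to_sql(table, engine, if_exists="append", index=False)

# Example usage:
# write_with_source(responses_df, "demoform_responses", "KoboToolbox")
```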

Footnotes

[^1]: See also the screenshot in #48.

rudokemper added the feature (New specs for new behavior) label on Dec 11, 2024
rudokemper (Member Author) commented:
In response, @IamJeffG wrote (on the original issue):

> This includes a name field for the actual name of the form provided by the user - something like "Biocultural monitoring form (example)". Wouldn't it be great to use that on the front end as the name of the data source, instead of "Kobo data"?

Yes it would! (modulo the edge case where the user changes the Kobo form name [or another identifier] upstream)

The piece you are describing is a data catalog, and I think it's what can prevent our data "lake" from decomposing into a "data bog." "Data catalog" is a slightly overloaded term that is sometimes used to refer to different things, but the one I'm thinking of:

  • describes what data is present in the warehouse. It is the central point of reference for all data assets (blobs, SQL tables, schemas, raster and vector layers, ...)
  • is observable: exposes those datasets in a discoverable way
  • includes important metadata about how to interpret the data and the fields on those data (e.g. column names, units, provenance)

There are off-the-shelf open-source data catalogs, and I recommend we see if we can integrate one of those instead of building our own. Likewise, I suggest we consider keeping it separate from Frizzle, so that the pattern is: Frizzle both writes data (to blob storage or SQL) and also writes an update to the data catalog at the same time.
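As a minimal sketch of that dual-write pattern (the _catalog table, its columns, and the connection string are assumptions for illustration, not an existing Frizzle or catalog API):

```python
import datetime
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@warehouse:5432/gc")

def write_dataset(df: pd.DataFrame, table: str,
                  source: str, source_name: str) -> None:
    # 1. Write the data itself.
    df.to_sql(table, engine, if_exists="append", index=False)
    # 2. Upsert a catalog entry describing what was just written.
    #    (Assumes a hypothetical _catalog table with a UNIQUE
    #    constraint on table_name.)
    with engine.begin() as conn:
        conn.execute(
            text("""
                INSERT INTO _catalog (table_name, dataset_source,
                                      source_name, last_updated)
                VALUES (:t, :s, :n, :ts)
                ON CONFLICT (table_name) DO UPDATE
                SET dataset_source = :s, source_name = :n, last_updated = :ts
            """),
            {"t": table, "s": source, "n": source_name,
             "ts": datetime.datetime.now(datetime.timezone.utc)},
        )
```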

Here are some to give a flavor of what I am talking about; I do not have firsthand experience with any:

Not data catalog software per se, but related:

  • whale does not store the source of truth itself, but it does facilitate searching over the tables and columns in a SQL database; it can also generate a markdown page for each table.
  • STAC is a standard for cataloging specifically geospatial datasets (a minimal sketch follows below).
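For the geospatial case, here is a toy example using the pystac library (an implementation of the STAC standard); the id, geometry, and properties are all made up:

```python
import datetime
import pystac

# Hypothetical: describe a change-detection alerts dataset as a STAC Item.
item = pystac.Item(
    id="change-detection-alerts-2024-12",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-60, -3], [-59, -3], [-59, -2],
                         [-60, -2], [-60, -3]]],
    },
    bbox=[-60, -3, -59, -2],
    datetime=datetime.datetime(2024, 12, 1),
    properties={"dataset_source": "alerts"},
)
print(item.to_dict())
```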

rudokemper self-assigned this on Dec 16, 2024
rudokemper changed the title from "[frizzle] Store data_source (and other metadata) columns in the data warehouse" to "[connectors] Store data_source (and other metadata) columns in the data warehouse" on Dec 16, 2024
rudokemper changed the title to "[connectors] Store dataset_source columns in the data warehouse" on Jan 2, 2025
rudokemper changed the title to "[connectors] Store data source columns in the data warehouse" on Jan 2, 2025