
[connectors] Store data source columns in the data warehouse #33

Closed · rudokemper opened this issue Dec 11, 2024 · 1 comment · Fixed by #57

rudokemper commented Dec 11, 2024

With the recent work in guardianconnector-explorer to visualize selected Mapeo data on the Alerts dashboard, we are now working with two different data sources (Change Detection Alerts and Mapeo Data) in the same view. This has led me to want to do things like provide a header with the name of the data source for a selected feature in the sidebar, to make it clear to the user which of the two data sources the feature belongs to.[^1]

We could handle this entirely in GCE by providing the name of the data source in the GCE API endpoint for alerts. However, this need reminded me of recent conversations we've been having about data provenance and about accumulating metadata for incoming data as it is processed, and I am wondering whether it's a good idea for the script to create a record about the data source in the data warehouse, which GCE (or other front ends) can then utilize directly.

At the moment, we don't store any information about the data source. For example, in CMI's demo data warehouse, there is a table demoform_responses containing the responses to a KoboToolbox form processed by the kobotoolbox script. However, nowhere in the data warehouse is there any indication that this was a KoboToolbox form; one would have to deduce that, non-deterministically, from the name of the table or its specific set of columns. (And even then, it could have been an ODK form, which shares much of the same structure.)

For KoboToolbox in particular, a bunch of metadata is available from the https://kf.kobotoolbox.org/api/v2/assets/{form_id}/ endpoint. This includes a name field with the actual name of the form as provided by the user, something like "Biocultural monitoring form (example)". Wouldn't it be great to use that on the front end as the name of the data source, instead of "Kobo data"? (See also #29, where this work is redescribed.)
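As a rough sketch, the connector script could pull that name with a single request (the helper name and token handling here are assumptions for illustration, not existing connector code):

```python
import requests

KOBO_API = "https://kf.kobotoolbox.org/api/v2"

def get_form_name(form_id: str, api_token: str) -> str:
    """Fetch the user-provided form name from the KoboToolbox assets endpoint."""
    resp = requests.get(
        f"{KOBO_API}/assets/{form_id}/",
        headers={"Authorization": f"Token {api_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # The asset payload includes a "name" field,
    # e.g. "Biocultural monitoring form (example)".
    return resp.json()["name"]
```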

However, not all upstream data sources will have such metadata. In those cases, we could add a dataset_source property that stores the name of the tool / script (e.g. KoboToolbox, ODK, Mapeo). That way, we can deterministically expect a specific schema for db tables that contain data from each tool, which will be helpful for any business logic that renders data from each source.
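A minimal sketch of what stamping rows with dataset_source might look like before writing them to the warehouse (the table name, DataFrame, and connection string are placeholders, not part of any existing connector):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the data warehouse.
engine = create_engine("postgresql://user:pass@warehouse:5432/gc")

def write_with_source(df: pd.DataFrame, table: str, source: str) -> None:
    """Tag each row with its upstream tool so front ends can branch on it."""
    df = df.copy()
    df["dataset_source"] = source  # e.g. "KoboToolbox", "ODK", "Mapeo"
    df.to_sql(table, engine, if_exists="append", index=False)

# Example usage:
# write_with_source(responses_df, "demoform_responses", "KoboToolbox")
```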

Footnotes

[^1]: See also the screenshot in #48.

rudokemper added the feature (New specs for new behavior) label on Dec 11, 2024
rudokemper (Member Author) commented:
In response, @IamJeffG wrote (on the original issue):

> This includes a name field for the actual name of the form provided by the user - something like "Biocultural monitoring form (example)". Wouldn't it be great to use that on the front end as the name of the data source, instead of "Kobo data"?

Yes it would! (modulo the edge case where the user changes the Kobo form name [or another identifier] upstream)

The piece you are describing is a data catalog, and I think it's what can prevent our data "lake" from decomposing into a "data bog." "Data catalog" is a slightly overloaded term that is sometimes used to refer to different things, but the one I'm thinking of:

  • describes what data is present in the warehouse. It is the central point of reference for all data assets (blobs, SQL tables, schemas, raster and vector layers, ...)
  • is observable: exposes those datasets in a discoverable way
  • includes important metadata about how to interpret the data and the fields on those data (e.g. column names, units, provenance)

There are off-the-shelf open-source data catalogs, and I recommend we see if we can integrate one of those instead of building our own. Likewise, I suggest we consider keeping it separate from Frizzle, so that the pattern is: Frizzle both writes data (to blob storage or SQL) and also writes an update to the data catalog at the same time.
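As a minimal sketch of that dual-write pattern (the _catalog table, its columns, and the connection string are assumptions for illustration, not an existing Frizzle or catalog API):

```python
import datetime
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@warehouse:5432/gc")

def write_dataset(df: pd.DataFrame, table: str,
                  source: str, source_name: str) -> None:
    # 1. Write the data itself.
    df.to_sql(table, engine, if_exists="append", index=False)
    # 2. Upsert a catalog entry describing what was just written.
    #    (Assumes a hypothetical _catalog table with a UNIQUE
    #    constraint on table_name.)
    with engine.begin() as conn:
        conn.execute(
            text("""
                INSERT INTO _catalog (table_name, dataset_source,
                                      source_name, last_updated)
                VALUES (:t, :s, :n, :ts)
                ON CONFLICT (table_name) DO UPDATE
                SET dataset_source = :s, source_name = :n, last_updated = :ts
            """),
            {"t": table, "s": source, "n": source_name,
             "ts": datetime.datetime.now(datetime.timezone.utc)},
        )
```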

Here are some to give a flavor of what I am talking about; I do not have firsthand experience with any:

Not data catalog software per se, but related:

  • whale does not store the source of truth itself, but it does facilitate searching over the tables and columns in a SQL database; it can also generate a markdown page for each table.
  • STAC is a standard for cataloging specifically geospatial datasets (a minimal sketch follows below).
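For the geospatial case, here is a toy example using the pystac library (an implementation of the STAC standard); the id, geometry, and properties are all made up:

```python
import datetime
import pystac

# Hypothetical: describe a change-detection alerts dataset as a STAC Item.
item = pystac.Item(
    id="change-detection-alerts-2024-12",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-60, -3], [-59, -3], [-59, -2],
                         [-60, -2], [-60, -3]]],
    },
    bbox=[-60, -3, -59, -2],
    datetime=datetime.datetime(2024, 12, 1),
    properties={"dataset_source": "alerts"},
)
print(item.to_dict())
```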

rudokemper self-assigned this on Dec 16, 2024
rudokemper changed the title from "[frizzle] Store data_source (and other metadata) columns in the data warehouse" to "[connectors] Store data_source (and other metadata) columns in the data warehouse" on Dec 16, 2024
rudokemper changed the title to "[connectors] Store dataset_source columns in the data warehouse" on Jan 2, 2025
rudokemper changed the title to "[connectors] Store data source columns in the data warehouse" on Jan 2, 2025