With the recent work in guardianconnector-explorer to visualize selected Mapeo data on the Alerts dashboard, we are now working with two different data sources (Change Detection Alerts and Mapeo Data) in the same view. This has led me to want to do things like provide a header with the name of the data source for a selected feature in the sidebar, to make it clear to the user which of the two data sources the feature belongs to.[^1]
We could handle this entirely in GCE by providing the name of the data source in the GCE API endpoint for alerts. However, this need reminded me of recent conversations we've been having about data provenance and accumulating metadata for incoming data as it is processed, and I am wondering whether it's a good idea for the script to create a record in the data warehouse about the data source, which GCE (or other front ends) could then use directly.
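To make that concrete, here is a minimal sketch of what such a record could look like. The `_dataset_metadata` table name and its columns are hypothetical, and sqlite3 stands in for the actual warehouse just to keep the sketch self-contained and runnable:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical provenance table; sqlite3 is a stand-in for the real warehouse.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS _dataset_metadata (
        table_name     TEXT PRIMARY KEY,
        dataset_source TEXT NOT NULL,  -- tool that produced the data, e.g. 'KoboToolbox'
        dataset_name   TEXT,           -- human-readable name for front ends
        last_updated   TEXT NOT NULL   -- ISO 8601 timestamp of the last write
    )
    """
)

def record_dataset_metadata(table_name: str, dataset_source: str, dataset_name: str) -> None:
    """Upsert one provenance row each time a connector writes data."""
    conn.execute(
        """
        INSERT INTO _dataset_metadata (table_name, dataset_source, dataset_name, last_updated)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(table_name) DO UPDATE SET
            dataset_source = excluded.dataset_source,
            dataset_name   = excluded.dataset_name,
            last_updated   = excluded.last_updated
        """,
        (table_name, dataset_source, dataset_name,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

record_dataset_metadata(
    "demoform_responses", "KoboToolbox", "Biocultural monitoring form (example)"
)
```

A front end like GCE could then join against such a table to label features by their source, instead of hard-coding a name per endpoint.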
As of this moment, we don't store any information about the data source. For example, in CMI's demo data warehouse, there is a table `demoform_responses`, which contains the responses to a KoboToolbox form processed by the `kobotoolbox` script. However, nowhere in the data warehouse is there any indication that this was a KoboToolbox form; one would have to non-deterministically deduce that from the name of the table or its specific set of columns. (And even so, it could have been an ODK form, which shares much of the same structure.)
For KoboToolbox in particular, there is a bunch of metadata stored at the https://kf.kobotoolbox.org/api/v2/assets/{form_id}/ endpoint. This includes a `name` field containing the actual name of the form as provided by the user - something like "Biocultural monitoring form (example)". Wouldn't it be great to use that on the front end as the name of the data source, instead of "Kobo data"? (See also #29.)
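Fetching that name is a one-request affair. A rough sketch, assuming an API token (KoboToolbox accepts `Authorization: Token ...` headers) and a placeholder `form_id`:

```python
import requests

KOBO_TOKEN = "your-api-token"       # placeholder
FORM_ID = "aBcDeFgHiJkLmNoPqRsTuV"  # placeholder asset uid

resp = requests.get(
    f"https://kf.kobotoolbox.org/api/v2/assets/{FORM_ID}/",
    headers={"Authorization": f"Token {KOBO_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

# The asset's 'name' field holds the user-provided form title,
# e.g. "Biocultural monitoring form (example)".
print(resp.json()["name"])
```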
However, not all upstream data sources will have such metadata. In such cases, we could add a `dataset_source` property that stores the name of the tool / script (e.g. KoboToolbox, ODK, Mapeo). That way, we can deterministically expect a specific schema for db tables that contain data from each tool, which will be helpful for any business logic for rendering data from each source.
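In practice that could be as simple as stamping each row at write time. A sketch (the column name and the sqlite3 stand-in are assumptions, not the connectors' actual implementation):

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS demoform_responses (
        id             TEXT PRIMARY KEY,
        payload        TEXT,           -- flattened form response
        dataset_source TEXT NOT NULL   -- 'KoboToolbox', 'ODK', 'Mapeo', ...
    )
    """
)

def write_rows(rows: list[dict], dataset_source: str) -> None:
    """Write rows, stamping each one with the tool that produced it."""
    conn.executemany(
        "INSERT OR REPLACE INTO demoform_responses (id, payload, dataset_source) "
        "VALUES (?, ?, ?)",
        [(r["id"], r["payload"], dataset_source) for r in rows],
    )
    conn.commit()

write_rows([{"id": "1", "payload": '{"answer": 42}'}], dataset_source="KoboToolbox")
```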
In response, @IamJeffG wrote (on the original issue):
> This includes a `name` field containing the actual name of the form as provided by the user - something like "Biocultural monitoring form (example)". Wouldn't it be great to use that on the front end as the name of the data source, instead of "Kobo data"?
Yes it would! (modulo the edge case where the user changes the Kobo form name [or other identifier] upstream)
The piece you are describing is a data catalog, and I think it's what can prevent our data "lake" from decomposing into "data bog." "Data catalog" is a slightly overloaded term that is sometimes used to refer to different things, but the one I'm thinking of:
- describes what data is present in the warehouse. It is the central point of reference for all data assets (blobs, SQL tables, schemas, raster and vector layers, ...)
- is observable: it exposes those datasets in a discoverable way
- includes important metadata about how to interpret the data and the fields on those data (e.g. column names, units, provenance, etc.); a rough sketch of such an entry follows this list
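Here is an illustrative sketch of the shape one catalog entry might carry. The field names are invented for this sketch, not taken from any existing catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative shape of one data-catalog record (field names are invented)."""
    asset_id: str        # e.g. "warehouse.demoform_responses"
    kind: str            # "sql_table" | "blob" | "raster" | "vector"
    dataset_source: str  # tool that produced it: "KoboToolbox", "Mapeo", ...
    dataset_name: str    # human-readable name for front ends
    columns: dict = field(default_factory=dict)  # column name -> description / units
    provenance: str = ""                         # how and when the data got here

entry = CatalogEntry(
    asset_id="warehouse.demoform_responses",
    kind="sql_table",
    dataset_source="KoboToolbox",
    dataset_name="Biocultural monitoring form (example)",
    columns={"submission_time": "ISO 8601 timestamp, UTC"},
    provenance="fetched by the kobotoolbox connector",
)
```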
There are off-the-shelf open-source data catalogs, and I recommend we see if we can integrate one of those instead of building our own. Likewise, I suggest we consider keeping it separate from Frizzle, so the pattern would be that Frizzle both writes data (to blob storage or SQL) and writes an update to the data catalog at the same time.
Here are some to give a flavor of what I am talking about; I do not have firsthand experience with any:
- whale does not store its source of truth but does facilitate searching over the tables and columns in a SQL database. You can also generate markdown for each table.
- STAC is a standard for cataloging geospatial datasets specifically (a minimal Item is sketched below).
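For flavor, a STAC Item is just JSON describing one spatiotemporal asset; here is a minimal one as a Python dict (all values are placeholders):

```python
import json

# Minimal STAC Item with placeholder values.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "alerts-2024-12-01",
    "geometry": {"type": "Point", "coordinates": [-55.2, 4.1]},
    "bbox": [-55.2, 4.1, -55.2, 4.1],
    "properties": {"datetime": "2024-12-01T00:00:00Z"},
    "links": [],
    "assets": {
        "data": {
            "href": "https://example.com/alerts/2024-12-01.tif",
            "type": "image/tiff; application=geotiff",
        }
    },
}

print(json.dumps(stac_item, indent=2))
```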
[^1]: See also the screenshot in #48.