System Design Document ‐ v1.0.0
The Open Science Catalogue is one of the elements contributing to an Open Science framework and infrastructure, with the scope to enhance the discoverability and use of products, data and knowledge resulting from Scientific Earth Observation exploitation studies.
The Open Science Catalogue has the capability to hold product metadata for assets that are stored externally or internally if required. Open Science Catalogue is also developing the capability to discover processes that can be deployed to and executed on remote EOEPCA platforms, using assets discovered by Open Science Catalogue.
Adhering by design to the “FAIR” (findable, accessible, interoperable, reproducible/reusable) principles, the Open Science Catalogue aims to support better knowledge discovery and innovation, and facilitate data and knowledge integration and reuse by the scientific community.
This document describes the design of the Open Science Catalogue, which is driven by:
- The requirements outlined in OSCATALOGUE_v2.doc provided by ESA
- The user stories derived from OSCATALOGUE_v2.doc and also those user stories added following Scrum sprint reviews.
The design is based upon the principles and framework provided by EOEPCA. The Open Science Catalogue will, where possible:
- Re-use EOEPCA
- Support the evolution of new EOEPCA user stories
The Open Science Catalogue follows the design principles above.
Its architecture is built around the core EOEPCA design concepts and re-uses core EOEPCA components, as described later.
The Open Science Catalog is a deployment of several EOEPCA components, in combination with additional supplementary components. The re-used components from EOEPCA are as follows:
- Resource Management:
  - Resource Catalogue
  - Harvester
  - Registrar
- User Management:
  - Login-Service
  - Policy Decision Point
  - User Profile
The following diagram depicts the sub-systems with their respective components:
All components re-used from EOEPCA will be used as-is, but will be configured to meet the requirements of the Open Science Catalog.
The following figure describes how the components are connected to each other and in what manner they interact with each other:
The components will be described in detail in the following section.
The metadata repository is a git repository that stores the metadata contents of the Open Science Catalog and keeps a history of all the content changes that have been applied to it.
The metadata repository is hosted as a repository on the GitHub platform. All themes, variables, projects and products and their associated metadata are hosted in a directory structure. Each change must be handled via a Pull Request. This pull request allows for reviewers to see the changes to be applied in advance and to provide reviews as comments. If appropriate, the changes can be merged with the main branch of the repository.
When a Pull Request is merged, the continuous integration pipeline is run, which triggers the building of a STAC catalog representing the Open Science Catalog on the Static Catalog component.
Git and GitHub API
The Static Catalog hosts the directory and file structure making up the STAC catalog.
Please refer to the Readme at https://github.com/EOEPCA/open-science-catalog-frontend
- Open Science Catalog (“static catalog”)
- Resource Catalogue (used as “dynamic catalog”)
- User Management
This is the main frontend that users use to browse the contents of the Open Science Catalogue.
Please refer to the Readme at https://github.com/EOEPCA/open-science-catalog-frontend
- Open Science Catalog (“static catalog”)
- Resource Catalogue (used as “dynamic catalog”)
- User Management
pycsw is an OARec and OGC CSW server implementation written in Python. The pycsw project is certified OGC Compliant and OGC Reference Implementation for both CSW 2.0.2 and CSW 3.0.0. It is also an early implementation of OGC API - Records - Part 1: Core and STAC API.
pycsw is the Resource Catalogue building block of EOEPCA.
Open Science Catalogue re-uses the EOEPCA Resource Catalogue. The design of the component can be found in the following document: https://eoepca.github.io/rm-resource-catalogue/SDD/
Open Science Catalogue re-uses the EOEPCA Resource Catalogue. The interfaces of the component can be found in the following document: https://eoepca.github.io/rm-resource-catalogue/ICD/
The Resource Catalogue component provides the following external interfaces as listed from the upstream pycsw project documentation:
- https://docs.pycsw.org/en/latest/introduction.html#standards-support
- https://docs.pycsw.org/en/latest/index.html
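As an illustration of the OGC API - Records interface, the following minimal sketch builds an item search URL without sending a request. The host name is the Live Dynamic Catalog endpoint listed in this document; the collection identifier `metadata:main` is assumed to be the pycsw default, and the function name and search term are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: host taken from this document's endpoint list; not verified to be reachable.
BASE_URL = "https://resource-catalogue.opensciencedata.esa.int"
# Assumption: the default collection identifier used by pycsw.
COLLECTION = "metadata:main"


def records_search_url(q, limit=10, base_url=BASE_URL, collection=COLLECTION):
    """Build an OGC API - Records item search URL (no request is sent)."""
    query = urlencode({"q": q, "limit": limit, "f": "json"})
    return f"{base_url}/collections/{collection}/items?{query}"


print(records_search_url("sea ice", limit=5))
```

The resulting URL can be fetched with any HTTP client; a deployment that uses a different collection identifier only needs a different `collection` argument.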
The task of the scheduler is to regularly send signals to the harvester to start the harvesting process. The interval is specified in the configuration.
The scheduler is a re-used component of the View Server software suite and is meant to function as an autonomous component.
The scheduler does not provide any outbound interfaces.
The scheduler sends the harvesting signals to a configured Redis queue.
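The scheduling behaviour can be sketched as a simple loop. This is not the View Server implementation: the message payload, the function name and the parameters are assumptions, and the Redis push is replaced by an injected `push` callable so the sketch runs without a Redis server.

```python
import json
import time


def run_scheduler(push, interval_seconds, max_signals=None, sleep=time.sleep):
    """Push one harvest signal per interval via `push`; return the number sent.

    `push` stands in for the call that appends a message to the Redis queue.
    `max_signals` bounds the loop for demonstration; None means run forever.
    """
    sent = 0
    while max_signals is None or sent < max_signals:
        # Hypothetical payload naming the harvest job to start.
        push(json.dumps({"name": "osc-static-catalog"}))
        sent += 1
        if max_signals is None or sent < max_signals:
            sleep(interval_seconds)
    return sent


# Demonstration with an in-memory list instead of Redis and no real waiting.
signals = []
run_scheduler(signals.append, interval_seconds=3600, max_signals=2, sleep=lambda s: None)
print(signals)
```

With the redis-py client, `push` could be a small wrapper around a list push on the configured queue; the queue name would come from the configuration.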
The main task of the harvester is to consume a web or file-based service to detect items for registration. Many kinds of service can be consumed, but in the Open Science Catalog the Harvester reads from the central static catalog and walks recursively through the provided collections in order to read all leaf items.
The Harvester is 100% re-used from the EOEPCA Common Architecture, and a component of the View Server software framework (https://gitlab.eox.at/vs/harvester). The harvester is documented in the View Server Operators Guide (https://vs.pages.eox.at/documentation/operator/main/index.html).
The Harvester does not provide any outbound interfaces in the traditional sense. It runs in daemon mode, listening on a Redis-based task queue for harvest requests.
In OSC, the harvester is configured to read the catalog.json in order to iterate through all contained collections and items from the static catalog on https://eoepca.github.io/open-science-catalog-metadata/catalog.json
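The recursive traversal described above can be sketched as follows. This is a minimal illustration, not the Harvester's code: it assumes that child catalogs and collections are referenced by `child` links, that leaf items are referenced by `item` links, and that the caller supplies a `fetch` function returning parsed JSON.

```python
from urllib.parse import urljoin


def walk_items(url, fetch):
    """Yield every STAC Item reachable from the catalog document at `url`.

    `fetch(url)` must return the parsed JSON document. It is injected so the
    traversal can be exercised without network access.
    """
    doc = fetch(url)
    if doc.get("type") == "Feature":  # STAC Items are GeoJSON Features
        yield doc
        return
    for link in doc.get("links", []):
        # Follow only downward links; root and parent links are ignored,
        # which prevents the walk from looping back up the tree.
        if link.get("rel") in ("child", "item"):
            yield from walk_items(urljoin(url, link["href"]), fetch)
```

Relative `href` values are resolved against the URL of the document that contains them, so the same function works for the published static catalog and for a local copy served over HTTP.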
The Backend API allows for users of the OSC to contribute to the contents of the catalog. This is done in a controlled manner, in which contributions are passed through a review process called “Pull Request”, in which all changes can be reviewed by administrators before being applied to the catalog.
The Backend API is a web service application implemented with the FastAPI software framework. It is developed specifically for use within the Open Science Catalogue.
The exposed API of the Backend API can be inspected from this Swagger API document: https://open-science-catalog-backend.develop.eoepca.org/docs
The Backend API interfaces with the GitHub API (https://docs.github.com/en/rest) in order to create Pull Requests and monitor their status.
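To illustrate the kind of call involved, the sketch below prepares a GitHub REST "create a pull request" request without sending it. It is not taken from the Backend API source: the function name, branch name and token handling are assumptions, and only the standard library is used.

```python
import json
from urllib.request import Request

GITHUB_API = "https://api.github.com"


def build_pull_request(owner, repo, head, title, body, token, base="main"):
    """Prepare (but do not send) a request that opens a Pull Request.

    `head` is the branch holding the proposed changes and `base` is the
    branch they should be merged into.
    """
    payload = {"title": title, "head": head, "base": base, "body": body}
    return Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/pulls",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical branch name and placeholder token, for illustration only.
request = build_pull_request(
    "EOEPCA", "open-science-catalog-metadata",
    head="update-product-example", title="Update product metadata",
    body="Proposed via the Frontend.", token="<token>",
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would submit it; the response describes the created Pull Request, whose status can then be polled through the same API.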
The User Management components make it possible to specify user access mechanisms, store user data, let users log in to the system (potentially through an external identity provider), and retrieve, validate and authorize requests with access tokens. User Management is taken from the corresponding EOEPCA User Management building blocks.
The design documentation of the building blocks can be found in the corresponding sections of the EOEPCA online documentation.
These files contain the initial metadata contents of the Open Science Catalog. They are structured as tabular data in the CSV (comma separated values) format. For each of the record types one such file exists: Products.csv, Projects.csv, Variables.csv, Themes.csv, EO Missions.csv.
Product fields
Field | Description | STAC representation |
---|---|---|
ID | Numeric identifier | |
Status | “ongoing” or “completed” | osc:status property |
Project | The project identifier | osc:project property, collection link |
Website | | link |
Product | Name | link |
Short_Name | identifier | |
Description | | description property |
Access | URL | link |
Documentation | URL | link |
Version | | version property |
DOI | Digital Object Identifier | sci:doi property and cite-as link |
Variable | Variable identifier | collection link |
Start | | extent.temporal[] |
End | | extent.temporal[] |
Region | | osc:region property |
Polygon | | geometry |
Released | | created property |
Theme1 - Theme6 | Theme identifiers | osc:themes property |
EO_Missions | Semicolon-separated list of missions | osc:missions property |
Standard_Name | | cf:parameter.name property |
Project fields
Field | Description | STAC representation |
---|---|---|
Project_ID | Numeric identifier | |
Status | “ongoing” or “completed” | osc:status property |
Project_Name | Name | title property |
Short_Description | | description property |
Website | | link |
Eo4Society_link | | link |
Consortium | | contacts[].name property |
Start_Date_Project | | extent.temporal[] property |
End_Date_Project | | extent.temporal[] property |
TO | | contacts[].name property |
TO_E-mail | | contacts[].emails[].value property |
Theme1 - Theme6 | Theme identifiers | osc:themes property |
Theme fields
Field | Description | STAC representation |
---|---|---|
theme | Theme name | id |
description | Theme description | description |
link | Link to further resources | link |
Variable fields
Field | Description | STAC representation |
---|---|---|
theme | The associated theme name | osc:theme property |
theme_description | | |
link | Link to further resources | link |
variable | The variable name | id |
domain | The variable's domain | |
variable description | | description |
The builder can convert these files to the format of the metadata repository in order to create its initial content.
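The mapping from the product table to STAC properties can be sketched as follows. This is not the builder's code: using Short_Name as the `id` and Product as the `title` are assumptions, only a subset of the fields is covered, and the sample row is invented.

```python
import csv
import io


def product_row_to_stac(row):
    """Map one Products.csv row (a dict) to a partial STAC Collection dict."""
    # Theme1 .. Theme6 are separate columns; empty or missing ones are skipped.
    themes = [row[f"Theme{i}"] for i in range(1, 7) if row.get(f"Theme{i}")]
    # EO_Missions holds a semicolon-separated list.
    missions = [m.strip() for m in row.get("EO_Missions", "").split(";") if m.strip()]
    return {
        "type": "Collection",
        "id": row["Short_Name"],   # assumption: Short_Name becomes the id
        "title": row["Product"],   # assumption: Product becomes the title
        "description": row["Description"],
        "osc:status": row["Status"].lower(),
        "osc:project": row["Project"],
        "osc:region": row["Region"],
        "osc:themes": themes,
        "osc:missions": missions,
    }


# Invented sample data with a reduced set of columns.
SAMPLE = (
    "Short_Name,Product,Description,Status,Project,Region,Theme1,Theme2,EO_Missions\n"
    "demo-sst,Demo SST,Example record,ONGOING,12,Global,Oceans,,Mission A; Mission B\n"
)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    print(product_row_to_stac(row))
```

Reading the real files would only require replacing the `io.StringIO` wrapper with an opened CSV file.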
The metadata repository follows a static STAC catalog scheme. The root catalog (henceforward known as root) is the entrypoint to the whole catalog.
On the first level the root catalog branches into the following catalogs:
- Products
- Projects
- Themes
- Variables
- EO-Missions
Each of those first-level catalogs references STAC Collections/Catalogs, each depicting a record of the respective type.
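The branching of the root catalog can be sketched as a minimal STAC Catalog document. The directory names, file layout, identifier and description below are assumptions made for illustration; the actual repository layout may differ.

```python
# Assumption: one sub-directory per record type, each with its own catalog.json.
CHILD_CATALOGS = ["products", "projects", "themes", "variables", "eo-missions"]


def build_root_catalog(children=CHILD_CATALOGS):
    """Return a minimal STAC root Catalog with one child link per record type."""
    links = [{"rel": "root", "href": "./catalog.json", "type": "application/json"}]
    for name in children:
        links.append(
            {"rel": "child", "href": f"./{name}/catalog.json", "type": "application/json"}
        )
    return {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": "osc-root",  # hypothetical identifier
        "description": "Root of the Open Science Catalog",
        "links": links,
    }
```

Serialising the returned dictionary with `json.dumps` yields a document that a STAC client can use as its entry point.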
Each Product is represented by a STAC Collection, containing both core and extension metadata. The product's links are laid out as follows:
Relation type | “rel” keyword | Description |
---|---|---|
Website | via | The “Website” link from the original CSVs |
Access | via | The “Access” link from the original CSVs. Has the title=”Access” property. |
Documentation | via | The “Documentation” link from the original CSVs. Has the title=”Documentation” property. |
Root | root | Link to the root catalog.json |
Parent | parent | Link to the parent catalog.json |
The STAC Collection contains at least the following metadata fields; further fields may be present.
Field name | Field type | Description |
---|---|---|
type | string | “Collection” |
id | string | The collection's unique identifier |
title | string | A short title for display |
stac_version | string | Fixed to “1.0.0” |
stac_extensions | array | A list of used extensions, containing at least the OSC and the scientific extension identifiers |
description | string | A longer description for the Product |
links | array | A list of link objects as described by the table |
themes | array | A list of theme objects. This is used to define the known vocabulary used for this product and will be the place where associated themes, variables and EO missions are encoded. |
created | string | An ISO 8601 datetime string depicting the time instant the product was created |
extent | object | The spatio-temporal extent of the product |
osc:type | string | Used to distinguish between types. Fixed to “product”. |
osc:project | string | The project name this product is associated with |
osc:variables | array | The names of the variables this product comprises |
osc:missions | array | The names of the EO Missions this product uses data from |
osc:status | string | “planned”, “ongoing” or “completed” |
osc:region | string | A short title of the geographic region this product is associated with |
contacts | array | A list of contact information points. Used to describe the consortium members and technical officer of the science projects. |
themes | array | A list of theme objects |
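The fixed values in the table above lend themselves to a simple consistency check. The sketch below is not part of the OSC tooling: which fields are treated as mandatory is an assumption, the example values are invented, and `stac_extensions` is left out of the example because the extension identifiers are not given here.

```python
ALLOWED_STATUS = {"planned", "ongoing", "completed"}


def check_product(collection):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    if collection.get("type") != "Collection":
        problems.append("type must be 'Collection'")
    if collection.get("stac_version") != "1.0.0":
        problems.append("stac_version must be '1.0.0'")
    if collection.get("osc:type") != "product":
        problems.append("osc:type must be 'product'")
    if collection.get("osc:status") not in ALLOWED_STATUS:
        problems.append("osc:status must be planned, ongoing or completed")
    for name in ("id", "title", "description", "osc:project"):
        if not isinstance(collection.get(name), str) or not collection[name]:
            problems.append(f"{name} must be a non-empty string")
    for name in ("links", "osc:variables", "osc:missions"):
        if not isinstance(collection.get(name), list):
            problems.append(f"{name} must be an array")
    return problems


# Invented example values, for illustration only.
example = {
    "type": "Collection",
    "id": "demo-sst",
    "title": "Demo SST",
    "stac_version": "1.0.0",
    "description": "Example product record",
    "links": [{"rel": "root", "href": "../../catalog.json"}],
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [["2020-01-01T00:00:00Z", None]]},
    },
    "osc:type": "product",
    "osc:project": "demo-project",
    "osc:variables": ["sea-surface-temperature"],
    "osc:missions": ["Mission A"],
    "osc:status": "ongoing",
    "osc:region": "Global",
}
print(check_product(example))
```

A check of this kind could run in the continuous integration pipeline before a Pull Request is merged, although the document does not state that it does.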
The model of the static catalog is a static STAC Catalog with STAC Collections representing Themes and Variables and STAC Items representing Products and Projects.
The STAC Items for Products and Projects are re-used from the Metadata Repository; only the links between Themes, Variables, Products and Projects are inserted as appropriate. The following tables show which links are added for which types:
Relation type | “rel” keyword | Description |
---|---|---|
Website | via | The “Website” link from the original CSVs |
Access | via | The “Access” link from the original CSVs. Has the title=”Access” property. |
Documentation | via | The “Documentation” link from the original CSVs. Has the title=”Documentation” property. |
Root | root | Link to the root catalog.json |
Variable | collection | Link to the associated variable. |
Themes |
Requirement: “The Catalogue shall enable Data and Metadata management from the frontend. The Catalogue shall provide means to control, review and monitor the data and the data lifecycle”
This workflow allows users to review a metadata record (Product, Project, Variable or Theme) and apply changes. The user queries the Catalogue through the Frontend Application for the record to be changed and applies the changes in the GUI. The Frontend then communicates with the Backend API, which in turn creates a new git branch with the changes on the metadata repository and requests the creation of a Pull Request.
Once changes have been submitted, the user can view their pending submissions via the Frontend. They are retrieved from the Backend API, which in turn queries the metadata repository.
In this workflow, a data administrator decides whether to publish a pending change submission to the actual Open Science Catalog. To do so, the administrator uses either the Frontend or the GUI of the metadata repository to review the changes made in the branch. If the changes are deemed correct, the branch is merged into the main branch of the metadata repository. A continuous integration process then transforms the contents of the metadata repository into a STAC catalog with associated metadata files, which is subsequently published to a public HTTP server. The Harvester component of the OSC, which runs at regular, configurable intervals, traverses the contents of the STAC catalog and forwards every STAC Collection and Item file representing the Products, Projects, Variables and Themes to the Registrar, which in turn publishes the records into the catalogue.
The dynamic components of the Open Science Catalog are deployed within a Kubernetes cluster. This cluster is pre-configured and connected via the Flux continuous delivery (GitOps) software, which regularly reconciles the state of the cluster with the deployment repository, a git repository in which the configuration files are stored under version control. Any commit to that repository triggers an adjustment within the cluster.
By default, two instances of the Open Science Catalog are deployed from the same repository, one for live and one for testing. Each is a directory in the configuration repository. New features are expected to be developed first in the developer's environment, then transferred to the testing environment and, finally, released in the live environment.
In Kubernetes terms, next to the system internals, there are two distinct namespaces, live and testing, into which all OSC components are deployed. In the deployment repository, the environments are grouped into two directory structures.
The following listing describes the directory structure of the kubernetes configuration directory in the deployment repository:
kubernetes/
├── live
│ ├── backend-api
│ │ └── ...
│ ├── frontend
│ │ └── ...
│ ├── metadata-proxy
│ │ └── ...
│ ├── resource-management
│ │ └── ...
│ ├── user-management
│ │ └── ...
│ └── ...
├── testing
│ ├── backend-api
│ │ └── ...
│ ├── frontend
│ │ └── ...
│ ├── metadata-proxy
│ │ └── ...
│ ├── resource-management
│ │ └── ...
│ ├── user-management
│ │ └── ...
│ └── ...
└── system
  └── ...
Each component is installed using its Helm chart, when one is available. Otherwise, the component is deployed using plain Kubernetes resources, such as ingress, service, and deployment configurations.
Description | Domain Name | IP Address | Environment |
---|---|---|---|
Home Page | opensciencedata.esa.int | 185.52.192.220 | Live |
Home Page | https://staging.opensciencedata.esa.int/ | 185.52.192.220 | Test |
Login | https://auth.staging.opensciencedata.esa.int | | Test |
Registration | https://auth.staging.opensciencedata.esa.int/identity/register.htm | | Test |
Home Page | NA - Use Developer Machine | NA - Use Developer Machine | Dev |
Endpoint | Target |
---|---|
opensciencedata.esa.int | Live Frontend |
auth.opensciencedata.esa.int | Live User Management |
resource-catalogue.opensciencedata.esa.int | Live Dynamic Catalog |
backend-api.opensciencedata.esa.int | Live Backend API |
metadata.opensciencedata.esa.int | Live Metadata Proxy |
testing.opensciencedata.esa.int | Testing Frontend |
auth.testing.opensciencedata.esa.int | Testing User Management |
resource-catalogue.testing.opensciencedata.esa.int | Testing Dynamic Catalog |
backend-api.testing.opensciencedata.esa.int | Testing Backend API |
metadata.testing.opensciencedata.esa.int | Testing Metadata Proxy |
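The endpoint names in the table above follow a regular pattern: an optional service prefix, an optional environment label, then the base domain. The helper below merely restates that pattern; the service keys and the function name are hypothetical.

```python
BASE_DOMAIN = "opensciencedata.esa.int"

# Host prefixes as they appear in the endpoint table; the keys are hypothetical.
SERVICE_PREFIX = {
    "frontend": "",
    "user-management": "auth",
    "dynamic-catalog": "resource-catalogue",
    "backend-api": "backend-api",
    "metadata-proxy": "metadata",
}


def endpoint(service, environment):
    """Return the host name of `service` in the 'live' or 'testing' environment."""
    if environment not in ("live", "testing"):
        raise ValueError(f"unknown environment: {environment!r}")
    parts = [SERVICE_PREFIX[service]]
    if environment == "testing":
        parts.append("testing")
    parts.append(BASE_DOMAIN)
    # Empty parts (the frontend has no prefix) are dropped before joining.
    return ".".join(part for part in parts if part)


print(endpoint("backend-api", "testing"))
```

Keeping the pattern in one place makes it easier to spot an entry that deviates from it when a new service or environment is added.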
Endpoint | Target |
---|---|
github.com/EOEPCA/open-science-catalog-metadata | Live Metadata Repository |
eoepca.github.io/open-science-catalog-metadata | Live Static Catalog |
github.com/EOEPCA/open-science-catalog-metadata-staging | Testing Metadata Repository |
eoepca.github.io/open-science-catalog-metadata-staging | Testing Static Catalog |