System Design Document ‐ v1.0.0

Fabian Schindler edited this page Dec 19, 2023 · 6 revisions

Introduction

The Open Science Catalogue is one of the elements contributing to an Open Science framework and infrastructure, with the scope to enhance the discoverability and use of products, data and knowledge resulting from Scientific Earth Observation exploitation studies.

The Open Science Catalogue has the capability to hold product metadata for assets that are stored externally or internally if required. Open Science Catalogue is also developing the capability to discover processes that can be deployed to and executed on remote EOEPCA platforms, using assets discovered by Open Science Catalogue.

Adhering by design to the “FAIR” (findable, accessible, interoperable, reproducible/reusable) principles, the Open Science Catalogue aims to support better knowledge discovery and innovation, and facilitate data and knowledge integration and reuse by the scientific community.

This document describes the design of the Open Science Catalogue, which is driven by:

  • The requirements outlined in OSCATALOGUE_v2.doc provided by ESA
  • The user stories derived from OSCATALOGUE_v2.doc and also those user stories added following Scrum sprint reviews.

The design is based upon the principles and framework provided by EOEPCA. The Open Science Catalogue will, where possible:

  • Re-use EOEPCA
  • Support the evolution of new EOEPCA user stories

High Level Design

image7

We will base the Open Science Catalogue on the above design.

The Open Science Catalogue architecture is based on the core EOEPCA design concepts shown above and on the re-use of core EOEPCA components, as described later.

Detailed Design

System Components

image8

The Open Science Catalog is a deployment of several EOEPCA components, in combination with additional supplementary components. The re-used components from EOEPCA are as follows:

  • Resource Management:
    • Resource Catalogue
    • Harvester
    • Registrar
  • User Management:
    • Login-Service
    • Policy Decision Point
    • User Profile

The following diagram depicts the sub-systems with their respective components:

image3

All components re-used from EOEPCA will be used as-is, but will be configured to meet the requirements of the Open Science Catalog.

The following figure shows how the components are connected and how they interact with each other:

image6

The components will be described in detail in the following section.

Metadata Repository

Intro

The metadata repository is a git repository that stores the metadata contents of the Open Science Catalog and keeps a history of all the content changes that have been applied to it.

Design

The metadata repository is hosted as a repository on the GitHub platform. All themes, variables, projects and products and their associated metadata are hosted in a directory structure. Each change must be handled via a Pull Request. This pull request allows for reviewers to see the changes to be applied in advance and to provide reviews as comments. If appropriate, the changes can be merged with the main branch of the repository.

When a Pull Request is merged, the continuous integration pipeline is run, which triggers the building of a STAC catalog representing the Open Science Catalog on the Static Catalog component.

Interfaces Provided

Git and GitHub API

Static Catalog

Intro

The Static Catalog hosts the directory and file structure making up the STAC catalog.

Design

Please refer to the Readme at https://github.com/EOEPCA/open-science-catalog-frontend

Interfaces Consumed

  • Open Science Catalog (“static catalog”)
  • Resource Catalogue (used as “dynamic catalog”)
  • User Management

Client Web Portal / Frontend

Intro

This is the main frontend through which users browse the contents of the Open Science Catalogue.

Design

Please refer to the Readme at https://github.com/EOEPCA/open-science-catalog-frontend

Interfaces Consumed

  • Open Science Catalog (“static catalog”)
  • Resource Catalogue (used as “dynamic catalog”)
  • User Management

Resource Catalogue

Intro

pycsw is an OARec and OGC CSW server implementation written in Python. The pycsw project is certified OGC Compliant and OGC Reference Implementation for both CSW 2.0.2 and CSW 3.0.0. It is also an early implementation of OGC API - Records - Part 1: Core and STAC API.

pycsw is the Resource Catalogue building block of EOEPCA.

Design

Open Science Catalogue re-uses the EOEPCA Resource Catalogue. The design of the component can be found in the following document: https://eoepca.github.io/rm-resource-catalogue/SDD/

Interfaces Provided

Open Science Catalogue re-uses the EOEPCA Resource Catalogue. The interfaces of the component can be found in the following document: https://eoepca.github.io/rm-resource-catalogue/ICD/
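To illustrate how a client might query the Resource Catalogue through OGC API - Records, the following sketch only builds a search URL and does not send it. The host name is the live Dynamic Catalog endpoint listed later in this document; the collection identifier `metadata:main` is pycsw's usual default and is an assumption for this deployment, as are the function name and the search term.

```python
from urllib.parse import quote, urlencode

# Live Dynamic Catalog endpoint as listed under "Internal Endpoints".
BASE_URL = "https://resource-catalogue.opensciencedata.esa.int"


def build_records_search_url(query: str, limit: int = 10,
                             collection: str = "metadata:main") -> str:
    """Build an OGC API - Records item search URL (the request is not sent).

    `collection` defaults to pycsw's usual collection id, which is an
    assumption for this deployment.
    """
    path = f"/collections/{quote(collection, safe=':')}/items"
    params = urlencode({"q": query, "limit": limit})
    return f"{BASE_URL}{path}?{params}"


if __name__ == "__main__":
    print(build_records_search_url("sea ice"))
```

The `q` and `limit` parameters come from OGC API - Records - Part 1: Core; other filters such as `bbox` or `datetime` could be added to the parameter dictionary in the same way.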

Interfaces Consumed

The Resource Catalogue component provides the following external interfaces as listed from the upstream pycsw project documentation:

Scheduler

Intro

The task of the scheduler is to regularly send signals to the harvester to start the harvesting process. The interval between signals is set in the scheduler configuration.

Design

The scheduler is a re-used component of the View Server software suite and is meant to function as an autonomous component.

Interfaces Provided

The scheduler does not provide any outbound interfaces.

Interfaces Consumed

The scheduler sends the harvesting signals to a configured Redis queue.
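As a rough illustration of this behaviour, the sketch below pushes a harvest signal onto a queue at a fixed interval. It uses an in-memory `collections.deque` as a stand-in for the Redis queue so that it runs without a server. The queue name `harvester_queue`, the harvester name `osc-static-catalog` and the function name are hypothetical and are not taken from the View Server scheduler.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple

Signal = Tuple[str, str]  # (queue name, harvester name)


def run_scheduler(
    queue: Deque[Signal],
    queue_name: str,
    harvester_name: str,
    interval_seconds: float,
    max_signals: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send `max_signals` harvest signals, waiting `interval_seconds` between them."""
    for index in range(max_signals):
        if index > 0:
            sleep(interval_seconds)
        # A real deployment would push to Redis here instead of a deque.
        queue.append((queue_name, harvester_name))


if __name__ == "__main__":
    demo_queue: Deque[Signal] = deque()
    # Hypothetical names; the no-op sleep keeps the demo instant.
    run_scheduler(demo_queue, "harvester_queue", "osc-static-catalog",
                  interval_seconds=3600, max_signals=3, sleep=lambda _: None)
    print(list(demo_queue))
```

Passing the `sleep` function as a parameter keeps the loop testable without waiting for real time to pass.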

Harvester

Intro

The main task of the harvester is to consume a web or file-based service to detect items for registration. The harvester can interface with many kinds of services, but in the Open Science Catalog it reads from the central static catalog and walks recursively through the provided collections in order to read all leaf items.
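The recursive walk can be sketched as a depth-first traversal that follows STAC `child` links and yields the documents behind `item` links. This is a minimal illustration under simplifying assumptions, not the View Server harvester: the documents live in an in-memory dictionary instead of being fetched over HTTP, the hrefs are already resolved relative to the root, and all identifiers are made up.

```python
from typing import Any, Callable, Dict, Iterator

Document = Dict[str, Any]


def walk_items(href: str, fetch: Callable[[str], Document]) -> Iterator[Document]:
    """Depth-first walk: descend into "child" links, yield "item" documents."""
    document = fetch(href)
    for link in document.get("links", []):
        if link["rel"] == "child":
            yield from walk_items(link["href"], fetch)
        elif link["rel"] == "item":
            yield fetch(link["href"])


# Tiny hypothetical catalog; a real harvester would resolve relative hrefs
# against the URL of the containing document and fetch them over HTTP.
DOCUMENTS: Dict[str, Document] = {
    "catalog.json": {"type": "Catalog", "id": "root", "links": [
        {"rel": "child", "href": "products/catalog.json"},
    ]},
    "products/catalog.json": {"type": "Catalog", "id": "products", "links": [
        {"rel": "item", "href": "products/a.json"},
        {"rel": "item", "href": "products/b.json"},
    ]},
    "products/a.json": {"type": "Feature", "id": "a", "links": []},
    "products/b.json": {"type": "Feature", "id": "b", "links": []},
}

if __name__ == "__main__":
    for item in walk_items("catalog.json", DOCUMENTS.__getitem__):
        print(item["id"])
```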

Design

The Harvester is re-used unchanged from the EOEPCA Common Architecture and is a component of the View Server software framework (https://gitlab.eox.at/vs/harvester). The harvester is documented in the View Server Operators Guide (https://vs.pages.eox.at/documentation/operator/main/index.html).

Interfaces Provided

The Harvester does not provide any outbound interfaces in the traditional sense. It runs in daemon mode and listens on a Redis-based task queue for harvest requests.

Interfaces Consumed

In OSC, the harvester is configured to read the catalog.json of the static catalog at https://eoepca.github.io/open-science-catalog-metadata/catalog.json and to iterate through all collections and items it contains.

Backend API

Intro

The Backend API allows for users of the OSC to contribute to the contents of the catalog. This is done in a controlled manner, in which contributions are passed through a review process called “Pull Request”, in which all changes can be reviewed by administrators before being applied to the catalog.

Design

The Backend API is a web service application implemented with the FastAPI software framework. It is developed specifically for use within the Open Science Catalogue.

Interfaces Provided

The exposed API of the Backend API can be inspected from this Swagger API document: https://open-science-catalog-backend.develop.eoepca.org/docs

Interfaces Consumed

The Backend API interfaces with the GitHub API (https://docs.github.com/en/rest) in order to create Pull Requests and monitor their status.
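For orientation, the sketch below builds (but does not send) a request to the GitHub REST endpoint for creating a Pull Request, `POST /repos/{owner}/{repo}/pulls`. It is not the Backend API's actual code: the function name, the branch name `osc-change-1` and the token value are hypothetical, and the base branch `main` is assumed from the description of the metadata repository.

```python
import json
import urllib.request


def build_pull_request(owner: str, repo: str, token: str, title: str,
                       head: str, base: str = "main",
                       body: str = "") -> urllib.request.Request:
    """Prepare a "create pull request" call for the GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls"
    payload = {"title": title, "head": head, "base": base, "body": body}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_pull_request(
        "EOEPCA", "open-science-catalog-metadata",
        token="dummy-token",      # hypothetical; never hard-code real tokens
        title="Update product metadata",
        head="osc-change-1",      # hypothetical branch holding the changes
    )
    print(request.get_method(), request.full_url)
```

Sending the prepared request with `urllib.request.urlopen(request)` would create the Pull Request, provided the token has write access to the repository.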

User Management

Intro

The User Management components make it possible to specify user access mechanisms, store user data, let users log in to the system (potentially via an external identity provider), and retrieve, validate and authorize requests with access tokens. The User Management is taken from the corresponding EOEPCA User Management building blocks.

Design

The design documentation of the building blocks can be found in the corresponding sections of the EOEPCA online documentation.

Metadata Models

Original Data

These files contain the initial metadata contents of the Open Science Catalog. They are structured as tabular data in the CSV (comma separated values) format. For each of the record types one such file exists: Products.csv, Projects.csv, Variables.csv, Themes.csv, EO Missions.csv.

Product fields

| Field | Description | STAC representation |
| --- | --- | --- |
| ID | Numeric identifier | |
| Status | “ongoing” or “completed” | osc:status property |
| Project | The project identifier | osc:project property, collection link |
| Website | | link |
| Product | Name | link |
| Short_Name | | identifier |
| Description | | description property |
| Access | URL | link |
| Documentation | URL | link |
| Version | | version property |
| DOI | Digital Object Identifier | sci:doi property and cite-as link |
| Variable | Variable identifier | collection link |
| Start | | extent.temporal[] |
| End | | extent.temporal[] |
| Region | | osc:region property |
| Polygon | | geometry |
| Released | | created property |
| Theme1 - Theme6 | Theme identifiers | osc:themes property |
| EO_Missions | Semi-colon separated list of missions | osc:missions property |
| Standard_Name | | cf:parameter.name property |

Project fields

| Field | Description | STAC representation |
| --- | --- | --- |
| Project_ID | Numeric identifier | |
| Status | “ongoing” or “completed” | osc:status property |
| Project_Name | Name | title property |
| Short_Description | | description property |
| Website | | link |
| Eo4Society_link | | link |
| Consortium | | contacts[].name property |
| Start_Date_Project | | extent.temporal[] property |
| End_Date_Project | | extent.temporal[] property |
| TO | | contacts[].name property |
| TO_E-mail | | contacts[].emails[].value property |
| Theme1 - Theme6 | Theme identifiers | osc:themes property |

Theme fields

| Field | Description | STAC representation |
| --- | --- | --- |
| theme | Theme name | id |
| description | Theme description | description |
| link | Link to further resources | link |

Variable fields

| Field | Description | STAC representation |
| --- | --- | --- |
| theme | The associated theme name | osc:theme property |
| theme_description | | |
| link | Link to further resources | link |
| variable | The variable name | id |
| domain | The variable's domain | |
| variable description | | description |

The builder can convert these files into the format of the metadata repository in order to create its initial content.
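The field mapping for products can be illustrated with a small conversion sketch. It is not the builder's implementation: it covers only a subset of the columns, the sample row is invented, and the result is not a complete STAC Collection (links, spatial extent and other required fields are left out). The temporal extent follows the STAC 1.0.0 `extent.temporal.interval` layout, and the date strings are passed through without conversion.

```python
import csv
import io
from typing import Any, Dict

# Invented sample data using a subset of the Products.csv columns.
SAMPLE_CSV = """Short_Name,Description,Status,Project,Region,EO_Missions,Theme1,Theme2,Start,End
demo-product,A made-up product,ongoing,demo-project,Global,Mission-A;Mission-B,Oceans,,2020-01-01,2021-01-01
"""


def product_row_to_collection(row: Dict[str, str]) -> Dict[str, Any]:
    """Map one product CSV row to a partial STAC Collection dictionary."""
    themes = [value for i in range(1, 7) if (value := row.get(f"Theme{i}"))]
    missions = [name.strip()
                for name in (row.get("EO_Missions") or "").split(";")
                if name.strip()]
    return {
        "type": "Collection",
        "id": row["Short_Name"],
        "description": row.get("Description") or "",
        "osc:status": row.get("Status"),
        "osc:project": row.get("Project"),
        "osc:region": row.get("Region"),
        "osc:themes": themes,
        "osc:missions": missions,
        "extent": {
            "temporal": {
                "interval": [[row.get("Start") or None, row.get("End") or None]],
            },
        },
    }


if __name__ == "__main__":
    first_row = next(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    print(product_row_to_collection(first_row))
```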

Metadata Repository Models

The metadata repository follows a static STAC catalog scheme. The root catalog (henceforward known as root) is the entrypoint to the whole catalog.

On the first level the root catalog branches into the following catalogs:

  • Products
  • Projects
  • Themes
  • Variables
  • EO-Missions

Each of those first-level catalogs references STAC Collections/Catalogs, each depicting a record of the respective type.

image5
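The branching of the root catalog can be sketched as a STAC Catalog document with one `child` link per first-level catalog. The identifier `osc`, the description text and the relative paths are assumptions made for illustration; the actual repository layout may differ.

```python
import json
from typing import Any, Dict, List

# First-level catalogs named in the text; the directory names are assumed.
SECTIONS = ["products", "projects", "themes", "variables", "eo-missions"]


def build_root_catalog(sections: List[str]) -> Dict[str, Any]:
    """Build a minimal STAC root catalog that links to its child catalogs."""
    links = [{"rel": "root", "href": "./catalog.json", "type": "application/json"}]
    for name in sections:
        links.append({
            "rel": "child",
            "href": f"./{name}/catalog.json",  # hypothetical path
            "type": "application/json",
            "title": name,
        })
    return {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": "osc",  # hypothetical identifier
        "description": "Root catalog of the Open Science Catalog",
        "links": links,
    }


if __name__ == "__main__":
    print(json.dumps(build_root_catalog(SECTIONS), indent=2))
```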

Products

Each Product is represented by a STAC Collection, containing both core and extension metadata. The product's links are laid out as follows:

| Relation type | “rel” keyword | Description |
| --- | --- | --- |
| Website | via | The “Website” link from the original CSVs |
| Access | via | The “Access” link from the original CSVs. Has the title="Access" property. |
| Documentation | via | The “Documentation” link from the original CSVs. Has the title="Documentation" property. |
| Root | root | Link to the root catalog.json |
| Parent | parent | Link to the parent catalog.json |

The STAC Collection contains at least the following metadata fields; further fields may be present.

| Field name | Field type | Description |
| --- | --- | --- |
| type | string | “Collection” |
| id | string | The collection's unique identifier |
| title | string | A short title for display |
| stac_version | string | Fixed to “1.0.0” |
| stac_extensions | array | A list of used extensions, containing at least the OSC and the scientific extension identifiers |
| description | string | A longer description of the Product |
| links | array | A list of link objects as described by the table above |
| themes | array | A list of theme objects. This is used to define the known vocabulary used for this product and will be the place where associated themes, variables and EO missions are encoded. |
| created | string | An ISO 8601 datetime string depicting the time instant the product was created |
| extent | object | The spatio-temporal extent of the product |
| osc:type | string | Used to distinguish between types. Fixed to “product”. |
| osc:project | string | The name of the project this product is associated with |
| osc:variables | array | The names of the variables this product comprises |
| osc:missions | array | The names of the EO Missions whose data this product uses |
| osc:status | string | “planned”, “ongoing” or “completed” |
| osc:region | string | A short title of the geographic region this product is associated with |
| contacts | array | A list of contact information points. Used to describe the consortium members and technical officer of the science projects. |
| themes | array | A list of theme objects |
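A simple consistency check over these fields might look like the sketch below. The selection of fields treated as expected, the function name and the example values are assumptions made for illustration; this is not a replacement for validation against the STAC and extension JSON schemas.

```python
from typing import Any, Dict, List

# Subset of the fields from the table that this sketch expects to be present.
EXPECTED_FIELDS = (
    "type", "id", "title", "stac_version", "stac_extensions", "description",
    "links", "extent", "osc:type", "osc:project", "osc:status",
)
ALLOWED_STATUS = {"planned", "ongoing", "completed"}


def check_product_collection(collection: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing field: {name}"
                for name in EXPECTED_FIELDS if name not in collection]
    if collection.get("type", "Collection") != "Collection":
        problems.append('type must be "Collection"')
    if collection.get("osc:type", "product") != "product":
        problems.append('osc:type must be "product"')
    status = collection.get("osc:status")
    if status is not None and status not in ALLOWED_STATUS:
        problems.append(f"unexpected osc:status: {status}")
    return problems


# Invented example; extension identifiers are omitted rather than guessed.
EXAMPLE = {
    "type": "Collection",
    "id": "demo-product",
    "title": "Demo product",
    "stac_version": "1.0.0",
    "stac_extensions": [],
    "description": "A made-up product used for illustration",
    "links": [],
    "extent": {"temporal": {"interval": [["2020-01-01T00:00:00Z", None]]}},
    "osc:type": "product",
    "osc:project": "demo-project",
    "osc:status": "ongoing",
}

if __name__ == "__main__":
    print(check_product_collection(EXAMPLE))
```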

Static Catalog Models

The model of the static catalog is a static STAC Catalog with STAC Collections representing Themes and Variables and STAC Items representing Products and Projects.

The STAC Items for Products and Projects are re-used from the Metadata Repository; only the links between Themes, Variables, Products and Projects are inserted as appropriate. The following table shows which links are added:

| Relation type | “rel” keyword | Description |
| --- | --- | --- |
| Website | via | The “Website” link from the original CSVs |
| Access | via | The “Access” link from the original CSVs. Has the title="Access" property. |
| Documentation | via | The “Documentation” link from the original CSVs. Has the title="Documentation" property. |
| Root | root | Link to the root catalog.json |
| Variable | collection | Link to the associated variable. |
| Themes | | |

Workflows

Life Cycle Management Workflow

Requirement: “The Catalogue shall enable Data and Metadata management from the frontend. The Catalogue shall provide means to control, review and monitor the data and the data lifecycle.”

Apply changes

This workflow allows users to review a metadata record (Product, Project, Variable or Theme) and apply changes. For this purpose, the User uses the Frontend Application to query the Catalogue for the record they want to change and uses the GUI to apply those changes. The Frontend then communicates with the Backend API, which in turn creates a new git branch with the changes on the metadata repository and requests the creation of a Pull Request.

image1

List pending submissions

Once changes have been submitted, the User can view their pending submissions via the Frontend. They are retrieved from the Backend API, which in turn queries the metadata repository.

image2

Apply submitted changes to the OSC

In this workflow, a data administrator decides to publish a pending change submission to the actual Open Science Catalog. To do so, the administrator uses either the Frontend or the GUI of the metadata repository to review the changes made in the branch. If the changes are deemed correct, the branch is merged into the main branch of the metadata repository. A continuous integration process then transforms the contents of the metadata repository into a STAC catalog with associated metadata files, which is subsequently published to a public HTTP server. The Harvester component of the OSC, which runs at regular, configurable intervals, traverses the contents of the STAC catalog and forwards every STAC collection and item file representing the Products, Projects, Variables and Themes to the registrar, which in turn publishes the records into the catalogue.

image4
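The three workflow steps (submit changes, list pending submissions, merge) can be traced with a small in-memory model. This is only a conceptual sketch: the class and method names are hypothetical, a dictionary stands in for the git repository, and the real system performs these steps through the Backend API and the GitHub API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PullRequest:
    number: int
    author: str
    branch: str
    changes: Dict[str, str]  # file path -> new file content
    state: str = "open"


@dataclass
class MetadataRepository:
    """In-memory stand-in for the metadata repository (not the GitHub API)."""
    main: Dict[str, str] = field(default_factory=dict)
    pull_requests: List[PullRequest] = field(default_factory=list)

    def submit(self, author: str, changes: Dict[str, str]) -> PullRequest:
        """Apply changes: record them on a new branch and open a pull request."""
        number = len(self.pull_requests) + 1
        request = PullRequest(number, author, f"change-{number}", dict(changes))
        self.pull_requests.append(request)
        return request

    def pending(self, author: str) -> List[PullRequest]:
        """List pending submissions of one user."""
        return [pr for pr in self.pull_requests
                if pr.author == author and pr.state == "open"]

    def merge(self, number: int) -> None:
        """Apply submitted changes: merge the branch into main."""
        request = self.pull_requests[number - 1]
        if request.state != "open":
            raise ValueError(f"pull request {number} is not open")
        self.main.update(request.changes)
        request.state = "merged"


if __name__ == "__main__":
    repo = MetadataRepository()
    pr = repo.submit("alice", {"products/demo/collection.json": "{}"})
    print(len(repo.pending("alice")))  # one open submission
    repo.merge(pr.number)
    print(sorted(repo.main))
```

In the real system the merge additionally triggers the continuous integration build and, later, the harvesting run; both are outside the scope of this sketch.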

Deployment of OSC

The dynamic components of the Open Science Catalog are deployed within a Kubernetes cluster. This cluster is pre-configured and connected, via the Flux continuous delivery software, to the deployment repository, a git repository in which the configuration files are stored under version control. Flux regularly reconciles the state of the cluster with the deployment repository, so any commit to that repository triggers an adjustment within the cluster.
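The idea of adjusting the cluster to match the repository can be illustrated with a conceptual reconciliation step that compares a desired state with the actual state and derives the required actions. This is not how Flux is implemented; the function name, the component names and the version strings are made up for the example.

```python
from typing import Dict, List, Tuple


def plan_reconciliation(desired: Dict[str, str],
                        actual: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare desired and actual state and list the actions to take.

    Both dictionaries map a component name to a configuration fingerprint
    (here simply a version string).
    """
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if actual.get(name) != desired[name]:
            actions.append(("apply", name))   # missing or out of date
    for name in sorted(set(actual) - set(desired)):
        actions.append(("delete", name))      # no longer in the repository
    return actions


if __name__ == "__main__":
    desired_state = {"frontend": "v2", "backend-api": "v1"}
    actual_state = {"frontend": "v1", "backend-api": "v1", "old-service": "v1"}
    print(plan_reconciliation(desired_state, actual_state))
```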

By default, two instances of the Open Science Catalog are deployed via the same repository, one for live and one for testing. Each is a directory in the configuration repository. It is assumed that new features will first be developed in the developer's environment, then transferred to the testing environment and, finally, released in the live environment.

In Kubernetes terms, next to the system internals, there are two distinct namespaces, live and testing, into which all OSC components are deployed. In the deployment repository, the environments are grouped into two directory structures.

The following listing shows the structure of the Kubernetes configuration directory in the deployment repository:

kubernetes/
├── live
│   ├── backend-api
│   │   └── ...
│   ├── frontend
│   │   └── ...
│   ├── metadata-proxy
│   │   └── ...
│   ├── resource-management
│   │   └── ...
│   ├── user-management
│   │   └── ...
│   └── ...
├── testing
│   ├── backend-api
│   │   └── ...
│   ├── frontend
│   │   └── ...
│   ├── metadata-proxy
│   │   └── ...
│   ├── resource-management
│   │   └── ...
│   ├── user-management
│   │   └── ...
│   └── ...
└── system
    └── ...

Each component is installed using its Helm chart, when available. Otherwise, the component is deployed using plain Kubernetes resources, such as ingress, service, and deployment configurations.

OSC Environments

| Description | Domain Name | IP Address | Environment |
| --- | --- | --- | --- |
| Home Page | opensciencedata.esa.int | 185.52.192.220 | Live |
| Home Page | https://staging.opensciencedata.esa.int/ | 185.52.192.220 | Test |
| Login | https://auth.staging.opensciencedata.esa.int | | Test |
| Registration | https://auth.staging.opensciencedata.esa.int/identity/register.htm | | Test |
| Home Page | NA - Use Developer Machine | NA - Use Developer Machine | Dev |

Internal Endpoints

| Endpoint | Target |
| --- | --- |
| opensciencedata.esa.int | Live Frontend |
| auth.opensciencedata.esa.int | Live User Management |
| resource-catalogue.opensciencedata.esa.int | Live Dynamic Catalog |
| backend-api.opensciencedata.esa.int | Live Backend API |
| metadata.opensciencedata.esa.int | Live Metadata Proxy |
| testing.opensciencedata.esa.int | Testing Frontend |
| auth.testing.opensciencedata.esa.int | Testing User Management |
| resource-catalogue.testing.opensciencedata.esa.int | Testing Dynamic Catalog |
| backend-api.testing.opensciencedata.esa.int | Testing Backend API |
| metadata.testing.opensciencedata.esa.int | Testing Metadata Proxy |
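The internal endpoints follow a regular naming pattern: an optional component prefix, an optional `testing` label, and the base domain. The sketch below reproduces that pattern; the component keys and the function name are hypothetical labels chosen for this example.

```python
BASE_DOMAIN = "opensciencedata.esa.int"

# Host name prefix per component, read off the endpoint table above.
# The dictionary keys are hypothetical labels chosen for this sketch.
COMPONENT_PREFIXES = {
    "frontend": "",
    "user-management": "auth",
    "dynamic-catalog": "resource-catalogue",
    "backend-api": "backend-api",
    "metadata-proxy": "metadata",
}


def internal_endpoint(component: str, environment: str) -> str:
    """Return the host name of a component in the given environment."""
    if environment not in ("live", "testing"):
        raise ValueError(f"unknown environment: {environment}")
    labels = [COMPONENT_PREFIXES[component]]
    if environment == "testing":
        labels.append("testing")
    labels.append(BASE_DOMAIN)
    # Drop the empty prefix used by the frontend.
    return ".".join(label for label in labels if label)


if __name__ == "__main__":
    for name in COMPONENT_PREFIXES:
        print(internal_endpoint(name, "live"), internal_endpoint(name, "testing"))
```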

External Endpoints

| Endpoint | Target |
| --- | --- |
| github.com/EOEPCA/open-science-catalog-metadata | Live Metadata Repository |
| eoepca.github.io/open-science-catalog-metadata | Live Static Catalog |
| github.com/EOEPCA/open-science-catalog-metadata-staging | Testing Metadata Repository |
| eoepca.github.io/open-science-catalog-metadata-staging | Testing Static Catalog |