From 5aa5aca72074eca360566cdfba7ebb569cb7198d Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Wed, 10 Feb 2021 00:19:07 -0500 Subject: [PATCH] Update README.md --- README.md | 102 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 79 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index 8dce0dd..2b8e74c 100644 --- a/README.md +++ b/README.md @@ -3,50 +3,106 @@ # Pangeo Forge public roadmap In this repository, you can find the the [Pangeo Forge project roadmap](https://github.com/pangeo-forge/roadmap/projects/2). -The roadmap is where you can learn about Pangeo Forge project, its subprojects (i.e. _pangeo-smithy_) and how they fit together, and the road ahead. +The roadmap is where you can learn about Pangeo Forge project, its subprojects, how they fit together, and the road ahead. Pangeo Forge is just getting started so please open [issues](https://github.com/pangeo-forge/roadmap/issues) to ask questions or to propose changes and/or additions to the roadmap itself. +Pangeo Forge has grown out of the [Pangeo Project](http://pangeo.io/), an open-source community promoting open, reproducible, and scalable science. -## Background +## Inspiration -The idea of Pangeo Forge is to copy the very successful pattern of [Conda Forge](https://conda-forge.org/) for crowdsourcing the curation of an analysis-ready data library. +Pangeo Forge is inspired to copy the very successful pattern of [Conda Forge](https://conda-forge.org/). +Conda Forge makes it easy for anyone to create a [conda package](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/packages.html), a binary software package that can be installed with the conda package manager. In Conda Forge, a maintainer contributes [a recipe](https://conda-forge.org/#add_recipe) which is used to generate a conda package from a source code tarball. Behind the scenes, CI downloads the source code, builds the package, and uploads it to a repository. 
-In Pangeo Forge, a maintainer contributes a recipe which is used to generate an analysis-ready cloud-based copy of a dataset in a cloud-optimized format like Zarr. Behind the scenes, CI downloads the original files from their source (e.g. FTP, HTTP, or OpenDAP), combines them using xarray, writes out the Zarr file, and uploads to cloud storage. -Pangeo Forge has grown out of the [Pangeo Project](http://pangeo.io/), an open source community promoting open, reproducible, and scalable science. +By automating the difficult parts of package creation, Conda Forge has enabled the open-source community to collaboratively maintain a huge and dynamic library of software packages. + +## Vision + +Pangeo Forge aspires to be like Conda Forge, but for data--specifically, Analysis Ready, Cloud Optimized (ARCO) data. +(For a detailed working definition of ARCO data, see our preprint [Cloud Native Repositories for Big Scientific Data](https://www.authorea.com/doi/full/10.22541/au.160443768.88917719/v2).) +We envision a vibrant, dynamic library of open-access ARCO data stored in public clouds, shared among thousands of scientists and directly accessible to data-proximate computing. +However, manually populating such a library would be prohibitively difficult and tedious. +Instead, we are building Pangeo Forge to automate the production of ARCO data and enable the crowdsourcing of such a data library. + +In Pangeo Forge, a maintainer contributes a recipe which is used to generate an analysis-ready cloud-based copy of a dataset in a cloud-optimized format like Zarr. Behind the scenes, Pangeo Forge cloud-based automation downloads the original files from their source (e.g. FTP, HTTP, or OpenDAP), combines them into one coherent dataset (e.g. using xarray), and writes the data in a cloud-optimized format (e.g. Zarr) to cloud storage in a streaming fashion. + +## Technical Concepts and Architecture + +:exclamation: **Warning!** Pangeo Forge doesn't actually "work" yet. 
The integration and development of these components is a work in progress. + +### Recipes + +A recipe defines how to transform data in one format / location into another format / location. +The primary way people contribute to Pangeo Forge is by writing / maintaining recipes. +Recipes are Python objects generated by the [pangeo_forge](https://pangeo-forge.readthedocs.io/en/latest/) package. +These recipes can be used in a standalone fashion, without integration with the Pangeo Forge cloud automation infrastructure. +Or they can be turned into feedstocks and become part of the library. + +### Feedstocks + +Feedstocks are recipes that are managed and executed by Pangeo Forge cloud automation. +Feedstocks are stored in GitHub repositories in the [pangeo-forge GitHub organization](https://github.com/pangeo-forge/). +The community develops and maintains recipes through interaction with these repositories. + +### Bakeries + +Bakeries turn recipes into data. +They do the heavy lifting of actually executing the recipes: extracting data from its source, transforming it, and loading it into its target destination. +Bakeries are controlled by triggers from GitHub workflows. +Bakeries can run in cloud or on-premises compute nodes; they should be placed in close network proximity to data sources and / or targets. +We hope that eventually there will be Pangeo Forge bakeries running in most regions of major cloud providers. + +![diagram](pangeo-forge-diagram.png) + ## Subprojects -Pangeo Forge brings together a number of smaller subprojects to enable automatic the automatic production and publication of cloud-optimized datasets. Those subprojects are described briefly below: +Pangeo Forge brings together a number of smaller subprojects to implement this vision. +The currently active subprojects are: ### pangeo-forge -[Pangeo-forge](https://github.com/pangeo-forge/pangeo-forge) provides a central workflow manager and API for the productions of cloud-optimized datasets. 
-It is being designed to include a high-level Pipeline API (built on top of [Prefect](https://www.prefect.io/)) that will be useful inside and outside of pangeo-forge infrastructure. -Read about the pangeo-forge roadmap [here](./subprojects/pangeo-forge.md). + -### pangeo-smithy +![CI](https://github.com/pangeo-forge/pangeo-forge/workflows/CI/badge.svg) +![Codecov](https://img.shields.io/codecov/c/github/pangeo-forge/pangeo-forge) +[![Documentation Status](https://readthedocs.org/projects/pangeo-forge/badge/?version=latest)](https://pangeo-forge.readthedocs.io/en/latest/?badge=latest) +[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) -[Pangeo-smithy](https://github.com/pangeo-forge/pangeo-smithy) is a tool for managing pangeo-forge feedstocks. -It combines a pangeo-forge recipes with the Continuous Integration and Continuous Deployment (CI/CD) services. -Read about the pangeo-smithy roadmap [here](./subprojects/pangeo-smithy.md). +The `pangeo_forge` Python package provides the core API for creating Recipes. +All of the "business logic" for how to extract, transform, and load data lives in this library; as such, it is the focal point of Pangeo Forge development. ### staged-recipes -[Staged-recipes](https://github.com/pangeo-forge/staged-recipes) is a GitHub repository that manages the submission of new pangeo-forge recipes. You can think of this as a holding area for new feedstocks. -Read about the staged-recipes roadmap [here](./subprojects/staged-recipes.md). + + +Staged-recipes is a GitHub repository that manages the submission of new Pangeo Forge recipes. +You can think of this as a holding area for new feedstocks. +This repo contains the automation components of Pangeo Forge. + +### bakery + +Coming soon... + +### Pangeo Forge Website + + + +Once the system is operating, the data library catalog will be viewable on this vue.js website. +However, we aren't quite ready for this front-end work yet. 
## Contributing Pangeo-forge is just getting started. There's lots of work to do and lots of room for contributors to engage. -Here are a few ways you may consider getting involved: - -1. [Document an example recipie](https://github.com/pangeo-forge/staged-recipes/issues/new?assignees=&labels=example&template=example-pipeline.md&title=Example+pipeline+for+%5BDataset+Name%5D) -2. Contribute to any of the subprojects above. At the time of writing (8/11/2020), the pangeo-forge API is the most active area of development. -3. Comment on the project road map in this repository. +Overall progress on the project can be tracked via two project boards: +- The [Recipe Implementation project board](https://github.com/pangeo-forge/staged-recipes/projects/1). + This tracks the progress of implementing the recipes outlined in staged-recipes. +- The [software development project board](https://github.com/orgs/pangeo-forge/projects/1) shows the progress of the `pangeo_forge` Python package, which defines what sort of recipes Pangeo Forge can support. -## Definitions +At this stage, there are a few ways you may consider getting involved. -- **Pipeline**: A Python object that defines the steps to aquire, convert, and publish a dataset. -- **Feedstock**: A GitHub repository in the pangeo-forge GitHub organization that is managed by pangeo-smithy. +1. Scientists and data managers can [document an example recipe](https://github.com/pangeo-forge/staged-recipes/issues/new?assignees=&labels=example&template=example-pipeline.md&title=Example+pipeline+for+%5BDataset+Name%5D). Gathering use cases is very helpful for defining the technical needs of pangeo-forge. You don't have to write any code to do this; you just have to understand the dataset you want to work with. +2. Python software developers can contribute to the code base. The [software development project board](https://github.com/orgs/pangeo-forge/projects/1) is a great place to start. +3. 
Anyone can comment on the project road map in this repository. +4. Eventually (but not yet), organizations can provide support for operating the bakeries (or run their own). ------