
Deploy GCP bakery into Columbia-owned project #19

Closed
Tracked by #33
rabernat opened this issue Aug 24, 2021 · 33 comments
Comments

@rabernat

Thanks for all of the quick work on this repo!

We (myself and @cisaacstern) will need to actually deploy our own GCP bakery into our own project. This will be the main "production" bakery for Pangeo forge when we first launch.

This bakery will have to manage TWO public buckets:

  • A GCS US-CENTRAL1 bucket
  • Our OSN bucket at NCSA (S3 protocol)

Let us know when you think this repo is in shape for us to try to deploy. We are happy to serve as guinea pigs to work out any kinks in the deployment process.

cc @sharkinsspatial @tracetechnical

@tracetechnical
Contributor

Awesome! Glad it is going to be useful.

Tentatively, give it a go now and let me know if you hit any bumps in the road. I'm keen to take an iterative approach on this, if you're up for it?

I have run it through on my vanilla instance here using my own instructions, so that should mean it is 98% there.

@rabernat
Author

Ok I'm going to create a new GCP project for this. Charles stand by for details.

@rabernat
Author

Charles, I created a new GCP project called pangeo-forge-4967 and added you as a project owner. You will need to log into https://console.cloud.google.com/ using your uni [email protected]

@cisaacstern
Member

Logged in. So next step is I follow the instructions in
https://github.com/pangeo-forge/pangeo-forge-gcs-bakery#readme
to deploy a bakery in this project?

@tracetechnical
Contributor

@cisaacstern Yarp :)

@tracetechnical
Contributor

How is it going on your side @cisaacstern? Anything I can do to help?

@cisaacstern
Member

Thanks for checking in, @tracetechnical. I've been working on some other stuff and haven't gotten a chance to start this yet. Will check back with you soon!

@rabernat
Author

rabernat commented Sep 9, 2021

One comment comes to mind here.

I expect that the bakery code will evolve rapidly over the next year. For development purposes, it would probably be good to have two bakeries:

  • a "production" bakery - the stable one we are actually using for real users
  • a "staging" bakery where we can deploy updates to the bakery and test them before moving them to production

@cisaacstern
Member

cisaacstern commented Sep 9, 2021

Following the README now, and adding/checking items to/off a list here as I complete them:

Setup bucket

  • A bucket for the terraform state:
Name: pangeo-forge-columbia-staging-bakery-terraform-tfstate-gcp
Region: us-central1
Class: Standard
Prevent public access: ☑️ Uniform
Advanced settings: No change to defaults. (Google-managed encryption, no retention policy, no labels.)
  • whose name is updated in terraform/providers.tf
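For reference, the same bucket could also be created from the CLI instead of the console. A sketch (bucket name and project from the checklist above; flags are my assumed mapping of the console settings, not from the repo docs):

```shell
# Sketch: create the Terraform state bucket from the CLI rather than the
# console. Flags mirror the console settings listed above (region,
# Standard class, uniform bucket-level access); assumes gcloud/gsutil
# are already authenticated against the pangeo-forge-4967 project.
gsutil mb \
  -p pangeo-forge-4967 \
  -l us-central1 \
  -c standard \
  -b on \
  gs://pangeo-forge-columbia-staging-bakery-terraform-tfstate-gcp
```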

Local tooling installs

  • Terraform CLI
  • Google Cloud tools

    README might mention that you choose the GCP project and default compute region when running gcloud init

  • Make
  • Kubectl

    Can't configure until I have a cluster https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/#verify-kubectl-configuration

  • Docker

    I believe the README notes about groupadd are not relevant for macOS users without root/non-root concerns?

  • Lens

    Same config note as for Kubectl

  • Helm

    Also can't config without a Kubernetes cluster first (same as Kubectl and Lens)

@cisaacstern
Member

@tracetechnical, could we jump on video chat sometime today or tomorrow to review .env? I'm unclear on what values should be added for most of these fields!

On a related note, looks like the pangeo-forge-recipes and pangeo-notebook versions are hardcoded in there

BAKERY_IMAGE="pangeo/pangeo-forge-bakery-images:pangeonotebook-2021.06.05_prefect-0.14.22_pangeoforgerecipes-0.4.0"

... which seems ... incorrect? ADR 2 specifies that recipe contributors should provide the pangeo-forge-recipes version in the recipe-specific meta.yaml. (xref pangeo-forge/staged-recipes#78)

Also, @sharkinsspatial, I noticed in looking into this that ADR 2 doesn't seem to mention anything about pangeo_notebook_version. Should that be added to the Top Level Data section of the spec?

@tracetechnical
Contributor

@cisaacstern Sure! If you compile me a quick list, I may be able to explain them on here.
I will open an issue to get it added to the documentation once I've got a list.

@cisaacstern
Member

Questions indented below each line.

BAKERY_NAMESPACE=""

What is this?

BAKERY_IMAGE="pangeo/pangeo-forge-bakery-images:pangeonotebook-2021.06.05_prefect-0.14.22_pangeoforgerecipes-0.4.0"

Referring back to #19 (comment), why do we need to provide global pangeo-notebook and pangeo-forge-recipes versions here if they are specified in the per-recipe meta.yaml?

STORAGE_SERVICE_ACCOUNT_NAME="<ACCOUNT 1 HERE>"

Where do I find this?

CLUSTER_SERVICE_ACCOUNT_NAME="<ACCOUNT 2 HERE>"

Where do I find this?

PROJECT_NAME=""

Is this the name of the Google Cloud Project pangeo-forge-4967 mentioned in #19 (comment)?

STORAGE_NAME=""

Is this the name of a target storage bucket to write data to? As noted in #19 (comment), we will be writing to two different buckets, one of which is on S3. What value do I assign here?

CLUSTER_NAME=""

I do have the Kubernetes API enabled in the Google Cloud Project. Where do I find this name?

Thanks in advance for your help, @tracetechnical.

@tracetechnical
Contributor

tracetechnical commented Sep 13, 2021

Replies in bold below

Questions indented below each line.

BAKERY_NAMESPACE=""

What is this?
This is the name of the Kubernetes namespace where your Prefect agent and related jobs will live

BAKERY_IMAGE="pangeo/pangeo-forge-bakery-images:pangeonotebook-2021.06.05_prefect-0.14.22_pangeoforgerecipes-0.4.0"

Referring back to #19 (comment), why do we need to provide global pangeo-notebook and pangeo-forge-recipes versions here if they are specified in the per-recipe meta.yaml?
This image is used for the Prefect agent, templated in by envsubst here.

STORAGE_SERVICE_ACCOUNT_NAME="<ACCOUNT 1 HERE>"

Where do I find this?
This is created for you by the terraform, so pick any name you like, and the terraform will create a service account with this name

CLUSTER_SERVICE_ACCOUNT_NAME="<ACCOUNT 2 HERE>"

Where do I find this?
As above

PROJECT_NAME=""

Is this the name of the Google Cloud Project pangeo-forge-4967 mentioned in #19 (comment)?
Yes

STORAGE_NAME=""

Is this the name of a target storage bucket to write data to? As noted in #19 (comment), we will be writing to two different buckets, one of which is on S3. What value do I assign here?
I will need to look into this further, but I believe this is only used in the terraform to create a storage account for you to use for caching/flow storage

CLUSTER_NAME=""

I do have the Kubernetes API enabled in the Google Cloud Project. Where do I find this name?
This is the name which terraform assigns to the cluster it creates, as above, pick any name you like

Thanks in advance for your help, @tracetechnical.
No problem :)
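Pulling the answers above together, a filled-in `.env` might look roughly like this. This is a sketch only: the namespace, service account, storage, and cluster names are placeholder values I chose for illustration (Terraform creates resources with whatever names you pick), and only PROJECT_NAME and BAKERY_IMAGE come from the thread itself.

```shell
# Sketch of a filled-in .env based on the Q&A above.
# All names below except PROJECT_NAME and BAKERY_IMAGE are illustrative
# placeholders -- Terraform will create resources with whatever names
# you choose here.
BAKERY_NAMESPACE="pangeo-forge"                # k8s namespace for the Prefect agent & jobs
BAKERY_IMAGE="pangeo/pangeo-forge-bakery-images:pangeonotebook-2021.06.05_prefect-0.14.22_pangeoforgerecipes-0.4.0"
STORAGE_SERVICE_ACCOUNT_NAME="pf-storage-sa"   # created for you by Terraform
CLUSTER_SERVICE_ACCOUNT_NAME="pf-cluster-sa"   # created for you by Terraform
PROJECT_NAME="pangeo-forge-4967"               # existing GCP project
STORAGE_NAME="pf-bakery-storage"               # bucket Terraform creates for cache/flow storage
CLUSTER_NAME="pf-bakery-cluster"               # name Terraform assigns to the cluster it creates
```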

@rabernat
Author

Is it possible to use different buckets for cache vs. production data? That way permissions can be set on the bucket level, which is way simpler. The cache data should probably be private, while the production data should be public (perhaps w/ requester-pays).

@tracetechnical
Contributor

tracetechnical commented Sep 13, 2021

As far as I understand, the selection of cache and target buckets is done at recipe level. The terraform setup of the storage is done purely for convenience of the bakery owner. @sharkinsspatial Please correct me if this is wrong.

@rabernat
Author

Ah ok so STORAGE_NAME is only for dockerized prefect flows?

How then should a bakery operator be configuring their actual data storage locations? Do those have to be created outside of terraform?

@tracetechnical
Contributor

StorageName is used by the terraform to set up a storage account + containers for flow and cache storage, and this is in turn used in the test recipe bundled in the repo.
It is then up to you whether or not you use the above storage in production.
My understanding is that the bakery operator would use one set of cache storage for everyone, but that is based on my uninformed viewpoint on how the bakeries are used IRL.

May be worth running through some user journeys to see if there is some more thinking/documentation needed around this.

@sharkinsspatial
Contributor

sharkinsspatial commented Sep 13, 2021

@cisaacstern With regards to BAKERY_IMAGE: there was originally some question around this, since it was in flux in Prefect, but with previous versions of Prefect the agent would unpickle the Flow from storage (thus it requires matching dependencies). I'm not sure if the Prefect agent implementation has changed or not, but in an ideal world the image used by the agent would only need Prefect, and then the scheduler and workers could use any https://github.com/pangeo-forge/pangeo-forge-bakery-images image as long as its Prefect version was <= the agent's Prefect version.

In reference to STORAGE_NAME: all the bakeries also require storage for serialized Flows. In the other bakeries this key is configured using the bakery identifier https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/main/terraform/storage.tf#L2. @tracetechnical is working on aligning the GCS bakery with the structure of the Azure bakery code in #21. @tracetechnical, can we remove this setting by using the stack identifier to create the storage key name in GCS?

@tracetechnical
Contributor

@sharkinsspatial I will address this as a separate PR to the Azure stuff, but yeah, I think that is wise.

@tracetechnical
Contributor

Addressed in #24

@sharkinsspatial
Contributor

@cisaacstern As another note on this ongoing deployment, we will also need to update pangeo-forge-prefect to support a GCS k8s cluster type https://github.com/pangeo-forge/pangeo-forge-prefect/blob/master/pangeo_forge_prefect/flow_manager.py#L152. This is a small lift and we can create a PR and run integration tests when you have a bakery deployed.

@cisaacstern
Member

We were paused on this due to the lack of a Prefect account, and then (for the last few weeks) by other work.

Starting back into this today! Can't wait to get it all plugged together. 🎉

@sharkinsspatial
Contributor

@cisaacstern As a reference, it looks like we still haven't seen any movement from Prefect on their serialization memory issues PrefectHQ/prefect#5004 (comment) yet.

@cisaacstern
Member

We are now blocked by #29. @tracetechnical @sharkinsspatial, I eagerly await any insight you may have in resolving this issue!

@cisaacstern
Member

Status update:

☑️ I believe I now have all of the infrastructure deployed
☑️ I'm able to successfully register a Prefect flow with make test-flow
☑️ I can run the registered flows from the pangeo-forge Prefect Cloud web interface
🤔 The test flow run fails. Prefect provides this schematic, fwiw:

[Screenshot: Prefect flow run schematic]

I am able to load the logs in Loki, but for now the Prefect Cloud interface seems like an easier place to browse them. One relevant (and recurring) traceback is as follows:

Task 'MappedTaskWrapper[466]': Exception encountered during task execution!
Task 'MappedTaskWrapper[466]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 861, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/opt/oisst_recipe.py", line 27, in wrapper
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/rechunker/executors/prefect.py", line 30, in run
    return self.stage.func(key)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 124, in cache_input
    input_cache.cache_file(fname, **fsspec_open_kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/storage.py", line 153, in cache_file
    _copy_btw_filesystems(input_opener, target_opener)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/storage.py", line 30, in _copy_btw_filesystems
    with output_opener as target:
  File "/srv/conda/envs/notebook/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/storage.py", line 111, in open
    with self.fs.open(full_path, **kwargs) as f:
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/spec.py", line 1010, in open
    f = self._open(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 1026, in _open
    return GCSFile(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 1129, in __init__
    det = getattr(self, "details", {})  # only exists in read mode
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/spec.py", line 1357, in details
    self._details = self.fs.info(self.path)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise return_result
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 610, in _info
    out = await self._ls(path, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 646, in _ls
    out = await self._list_objects(path, prefix=prefix)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 434, in _list_objects
    return [await self._get_object(path)]
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 388, in _get_object
    res = await self._call("GET", "b/{}/o/{}", bucket, key, json_out=True)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 330, in _call
    status, headers, info, contents = await self._request(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/decorator.py", line 221, in fun
    return await caller(func, *(extras + args), **kw)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/retry.py", line 110, in retry_request
    return await func(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 322, in _request
    validate_response(status, contents, path)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/retry.py", line 89, in validate_response
    raise FileNotFoundError
FileNotFoundError

Looks like this has something to do with caching, though what, specifically, I haven't quite figured out yet. Open to any suggestions, but I imagine it should be relatively approachable to debug with a bit more time.

@rabernat
Author

rabernat commented Feb 3, 2022

Can you turn the pangeo_forge_recipes debugging logs on?

@cisaacstern
Member

#37 gets pangeo_forge_recipes debug logging turned on for the workers. I then went one step lower than this, and replicated the worker recipe execution context by running a local docker container with

docker run -it --platform linux/amd64 -v "`pwd`/kubernetes/storage_key.json":/opt/storage_key.json -e GOOGLE_APPLICATION_CREDENTIALS="/opt/storage_key.json" 5766d92a4b8d /bin/bash

where 5766d92a4b8d is the IMAGE ID associated with the bakery worker image

$ docker images
REPOSITORY                          TAG                                                                  IMAGE ID       CREATED        SIZE
pangeo/pangeo-forge-bakery-images   pangeonotebook-2021.06.05_prefect-0.14.22_pangeoforgerecipes-0.4.0   5766d92a4b8d   2 months ago   4.41GB

and the approach for mounting the Google creds is borrowed from

docker run -it \
-v "$FLOW_FILE":"/opt/$FLOW_FILENAME" \
-v "$STORAGE_KEY":/opt/storage_key.json \
-e GOOGLE_APPLICATION_CREDENTIALS="/opt/storage_key.json" \

Within the worker image container, I then opened a python3 interpreter session and manually ran

the /test/recipes/oisst_recipe.py recipe definition & storage assignments

input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)
dates = pd.date_range("2019-09-01", "2021-01-05", freq="D")
input_urls = [
    input_url_pattern.format(
        yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
    )
    for day in dates
]
pattern = pattern_from_file_sequence(input_urls, "time", nitems_per_file=1)
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=20)
register_recipe(recipe)

storage_name = os.environ["STORAGE_NAME"]
fs_remote = GCSFileSystem(
    project=os.environ["PROJECT_NAME"],
    bucket=os.environ["STORAGE_NAME"],
)
target = FSSpecTarget(
    fs_remote,
    root_path=f"{storage_name}/target",
)
recipe.target = target
recipe.input_cache = CacheFSSpecTarget(
    fs_remote,
    root_path=f"{storage_name}/cache",
)
recipe.metadata_cache = target

followed by (still within the worker image container)

for input_name in recipe.iter_inputs():
    recipe.cache_input(input_name)

which produced this

Traceback
INFO:pangeo_forge_recipes.recipes.xarray_zarr:Caching input '(0,)'
INFO:pangeo_forge_recipes.storage:Caching file 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/201909/oisst-avhrr-v02r01.20190901.nc'
INFO:pangeo_forge_recipes.storage:Coping remote file 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/201909/oisst-avhrr-v02r01.20190901.nc' to cache
DEBUG:pangeo_forge_recipes.storage:entering fs.open context manager for pfcsb-bucket/cache/f11a58c4987c8c3af6c16145253b2a51-https_www.ncei.noaa.gov_data_sea-surface-temperature-optimum-interpolation_v2.1_access_avhrr_201909_oisst-avhrr-v02r01.20190901.nc
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 124, in cache_input
    input_cache.cache_file(fname, **fsspec_open_kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/storage.py", line 153, in cache_file
    _copy_btw_filesystems(input_opener, target_opener)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/storage.py", line 30, in _copy_btw_filesystems
    with output_opener as target:
  File "/srv/conda/envs/notebook/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pangeo_forge_recipes/storage.py", line 111, in open
    with self.fs.open(full_path, **kwargs) as f:
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/spec.py", line 1010, in open
    f = self._open(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 1026, in _open
    return GCSFile(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 1129, in __init__
    det = getattr(self, "details", {})  # only exists in read mode
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/spec.py", line 1357, in details
    self._details = self.fs.info(self.path)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise return_result
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 610, in _info
    out = await self._ls(path, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 646, in _ls
    out = await self._list_objects(path, prefix=prefix)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 434, in _list_objects
    return [await self._get_object(path)]
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 388, in _get_object
    res = await self._call("GET", "b/{}/o/{}", bucket, key, json_out=True)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 330, in _call
    status, headers, info, contents = await self._request(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/decorator.py", line 221, in fun
    return await caller(func, *(extras + args), **kw)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/retry.py", line 110, in retry_request
    return await func(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/core.py", line 322, in _request
    validate_response(status, contents, path)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/gcsfs/retry.py", line 89, in validate_response
    raise FileNotFoundError
FileNotFoundError

that looks quite similar to what we saw on the Prefect Cloud logs copied in #19 (comment)

@cisaacstern
Member

cisaacstern commented Feb 3, 2022

My current guess is that for some reason the worker container's GCSFileSystem object can't write to the GCS bucket? Going to try writing an arbitrary text file to GCS using gcsfs within the container now...

@cisaacstern
Member

Going to try writing an arbitrary text file to GCS using gcsfs within the container now...

🤔 Ok, so this works fine. I'll try manually running the caching with _copy_btw_filesystems now.
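For anyone reproducing this sanity check, the write test amounts to the following fsspec open/write/read pattern. It's shown here against fsspec's in-memory filesystem so the snippet runs without credentials; the actual check used `gcsfs.GCSFileSystem` against the bakery bucket (the path below just reuses the bucket name for illustration).

```python
import fsspec

# Same open/write/read pattern as the GCS check, but against the
# "memory" filesystem so no cloud credentials are needed. For the real
# check, swap in gcsfs.GCSFileSystem(project=...) and a path inside
# the bakery bucket.
fs = fsspec.filesystem("memory")

with fs.open("/pfcsb-bucket/hello.txt", "w") as f:
    f.write("hello, bakery")

with fs.open("/pfcsb-bucket/hello.txt", "r") as f:
    print(f.read())  # prints "hello, bakery"
```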

@cisaacstern
Member

cisaacstern commented Feb 4, 2022

Status update:

I'll try manually running the caching with _copy_btw_filesystems now.

Calling pangeo_forge_recipes.storage._copy_btw_filesystems directly from within the current default bakery image

BAKERY_IMAGE="pangeo/pangeo-forge-bakery-images:pangeonotebook-2021.06.05_prefect-0.14.22_pangeoforgerecipes-0.4.0"

predictably raises the same FileNotFoundError we saw both in the Prefect Cloud logs and when calling recipe.cache_input from within a locally deployed container based on the same worker image. It was only after seeing this third appearance of the same error that I realized, 🤦, this bakery image is running pangeo-forge-recipes 0.4.0, which was released on... June 25, 2021.

So, rather than try to debug this long-outdated version (anything could be happening here!), I moved on to trying out what appears to be the latest (stable) bakery image release.

Running this container locally, I can 🎉 get a modified version of the test recipe to cache inputs to GCS. So, next action points are:

  • make destroy the current cluster and re-make deploy after updating to the latest bakery image in .env
  • maybe add an extra trap in deploy.sh around here to warn users if they are not using the latest bakery image?
  • update test/recipes/oisst_recipe.py to use pangeo-forge-recipes 0.6.1 (rather than 0.4.0) syntax (the pre-1.0 API has been in flux. I'm not entirely sure if there are actual breaking changes between those two versions, but I have a modified version of the test recipe which works, so I'll use that)
  • try make test-flow again!
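The "extra trap" idea above could be as small as a tag comparison in deploy.sh. A hypothetical sketch (the function name and the `LATEST_BAKERY_TAG` variable are assumptions for illustration; how the latest tag is actually obtained, e.g. from Docker Hub, is left out):

```shell
#!/usr/bin/env bash
# Hypothetical guard sketch for deploy.sh: warn if the pinned
# BAKERY_IMAGE tag differs from a known-latest tag. The function name
# and LATEST_BAKERY_TAG variable are assumptions for illustration.
warn_if_stale_image() {
  local pinned="${BAKERY_IMAGE##*:}"   # strip everything up to the tag
  if [ "$pinned" != "$LATEST_BAKERY_TAG" ]; then
    echo "WARNING: BAKERY_IMAGE tag '$pinned' is not the latest ('$LATEST_BAKERY_TAG')" >&2
  fi
}
```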

@cisaacstern
Member

cisaacstern commented Feb 4, 2022

One additional thought:

  • At one point a few hours ago, I thought I'd need to make a new pangeo-forge-recipes release (and associated bakery image release) to make this all work. It now looks like that's not the case (i.e., I think the latest image release will work fine for now). But that thought experiment certainly reinforced the value of Automatically build images? pangeo-forge-bakery-images#7 to me! We can revisit that once the first bakery is deployed.

@cisaacstern
Member

I think the latest image release will work fine for now

I spoke too soon. The 0.6.1 image turned out to have incompatible fsspec and gcsfs versions, which was preventing the test recipe from writing to GCS. I fixed that in pangeo-forge/pangeo-forge-bakery-images#25 and pushed the update to Docker Hub. With this updated 0.6.1 image, plus #39, we can now execute make test-flow and run a successful Prefect Flow

[Screenshot: successful Prefect flow run]

which builds the pruned NOAA OISST Zarr store to the bakery's GCS bucket

import fsspec
import xarray as xr

m = fsspec.get_mapper("gs://pfcsb-bucket/target")  # "pfcsb" stands for "pangeo forge columbia staging bakery"
ds = xr.open_zarr(m, consolidated=True)
ds
<xarray.Dataset>
Dimensions:  (time: 2, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 2019-09-01T12:00:00 2019-09-02T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    err      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    ice      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    sst      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    cdm_data_type:              Grid
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    creator_email:              [email protected]
    creator_url:                https://www.ncei.noaa.gov/
    date_created:               2020-01-18T10:07:00Z
    ...                         ...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    standard_name_vocabulary:   CF Standard Name Table (v40, 25 January 2017)
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    time_coverage_end:          2019-09-01T23:59:59Z
    time_coverage_start:        2019-09-01T00:00:00Z
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...

So I'm tempted to say this mega-issue is now ... closed? It certainly looks like we have a functioning bakery. One last check I will try is manually registering a recipe (without make test-flow, but borrowing its syntax) to build a Zarr store onto our Pangeo Forge OSN bucket.

@cisaacstern
Member

One last check I will try is manually registering a recipe (without make test-flow, but borrowing its syntax) to build a Zarr store onto our Pangeo Forge OSN bucket.

Ok! I'm convinced. I ended up just using make test-flow with a locally-edited version of test/recipes/oisst_recipe.py, which made writing to OSN as simple as dropping our credentialed s3fs.S3FileSystem (and a new root_path) in here

recipe.target = MetadataTarget(
    fs_remote,
    root_path=f"{storage_name}/target",
)

Then with a call to make test-flow and a click on QUICK RUN from the Prefect UI, we get

[Screenshot: successful Prefect flow run in the Prefect UI]

followed by

import fsspec
import xarray as xr

m = fsspec.get_mapper(
    "s3://Pangeo/pangeo-forge/pfcsb-test/noaa-oisst-pruned/",  # the new `root_path`
    client_kwargs=dict(endpoint_url="https://ncsa.osn.xsede.org"),
    anon=True,
)
ds = xr.open_zarr(m, consolidated=True)
ds
<xarray.Dataset>
Dimensions:  (time: 2, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 2019-09-01T12:00:00 2019-09-02T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    err      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    ice      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    sst      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    cdm_data_type:              Grid
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    creator_email:              [email protected]
    creator_url:                https://www.ncei.noaa.gov/
    date_created:               2020-01-18T10:07:00Z
    ...                         ...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    standard_name_vocabulary:   CF Standard Name Table (v40, 25 January 2017)
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    time_coverage_end:          2019-09-01T23:59:59Z
    time_coverage_start:        2019-09-01T00:00:00Z
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...

Thanks to everyone for your assistance on this, it was truly a community effort! Especially @sharkinsspatial and @tracetechnical for the foundation (and answering so many questions along the way), @rabernat, and @sgibson91 for your epic save on #29. This one's a wrap!
