Possible refactoring of interface between recipe and file patterns #99

Closed
rabernat opened this issue Apr 12, 2021 · 2 comments · Fixed by #101

Comments

@rabernat
Contributor

Having worked with pangeo_forge recipes extensively over the past month, I am now considering some potential internal refactors to simplify the code base.

Current situation

Currently, most of the logic lives in the recipe module, while the patterns module has a few simple routines to generate filenames. There is a lot of implicit logic in the recipe classes about how files are organized. That's why we have separate classes for NetCDFtoZarrSequentialRecipe and NetCDFtoZarrMultiVarSequentialRecipe. I have come to feel that this is not a clear separation of concerns.

Proposal: move all logic about how files are organized into the patterns module

Instead, we could imagine having a Pattern object represent everything about how a particular set of files is organized. It would explain:

  • What are the "keys" used to generate the filenames
  • How to format the filenames
  • How different keys are related. For example, time might be a "concat" key, while variable might be a "merge" key. This would be similar to NcML.

A recipe could then look at a Pattern and decide what to do. (It might decide it can't support that pattern and raise an error.) But then we would only need one XarrayZarrRecipe.
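
To make the proposal concrete, here is a minimal sketch of what such a Pattern object and a recipe-side check might look like. Every name in it (`Pattern`, `CombineOp`, `check_supported`, the example URL and keys) is hypothetical and chosen only for illustration; none of it is existing pangeo_forge API.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Dict, Iterator, List, Tuple


class CombineOp(Enum):
    """How files that differ along one key relate to each other (hypothetical)."""
    CONCAT = "concat"
    MERGE = "merge"


@dataclass
class Pattern:
    """Hypothetical description of how a set of files is organized."""
    format_string: str             # how to format the filenames
    keys: Dict[str, List]          # key name -> values used to generate filenames
    combine: Dict[str, CombineOp]  # key name -> how that key relates the files

    def items(self) -> Iterator[Tuple[Dict[str, object], str]]:
        """Yield (index, filename) for every combination of key values."""
        names = list(self.keys)
        for values in product(*(self.keys[name] for name in names)):
            index = dict(zip(names, values))
            yield index, self.format_string.format(**index)


def check_supported(pattern: Pattern) -> None:
    """Hypothetical recipe-side check: require exactly one concat key."""
    concat_keys = [k for k, op in pattern.combine.items() if op is CombineOp.CONCAT]
    if len(concat_keys) != 1:
        raise ValueError(f"expected exactly one concat key, got {concat_keys}")


# Assumed example: one file per (variable, time) pair at a made-up URL.
pattern = Pattern(
    format_string="https://example.com/{variable}_{time}.nc",
    keys={"variable": ["temp", "salt"], "time": [2000, 2001]},
    combine={"variable": CombineOp.MERGE, "time": CombineOp.CONCAT},
)
check_supported(pattern)  # a recipe would raise here if it could not handle the pattern
filenames = [name for _, name in pattern.items()]
```

With the file layout described in one place like this, a single recipe class could inspect `pattern.combine` to decide whether to concatenate, merge, or refuse, rather than encoding that choice in separate recipe classes.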

@cisaacstern
Member

As I mentioned in today's coordination meeting, from my perspective, this general direction feels more approachable than the existing design. These Pattern objects then become, in some sense, a form of configuration, which seems appropriate given their expected high degree of variability and the sheer number of them we anticipate seeing over time. All of this is, of course, the viewpoint of someone entirely new to the library; others more familiar with the internals may have very good reasons that it should not change.

@martindurant
Contributor

Here is the file pattern thing in Intake https://www.anaconda.com/blog/intake-parsing-data-from-filenames-and-paths . Currently this is only implemented for CSV, but there's no reason it shouldn't be universal, at least for data containers supporting labels.
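
For readers unfamiliar with the idea, the sketch below shows the general technique of recovering labels from a path by running a format string in reverse. It uses only the standard library and is not Intake's implementation; the function name `parse_fields` and the example path are assumptions made for illustration.

```python
import re
from typing import Dict


def parse_fields(format_string: str, path: str) -> Dict[str, str]:
    """Recover {field} values from a path that was built from format_string."""
    # re.split with a capturing group alternates literal text (even positions)
    # and field names (odd positions).
    parts = re.split(r"\{(\w+)\}", format_string)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, path)
    if match is None:
        raise ValueError(f"{path!r} does not match {format_string!r}")
    return match.groupdict()


# Assumed example path; all recovered values come back as strings.
fields = parse_fields("data/{variable}_{year}.csv", "data/temp_2000.csv")
```

The recovered values could then be attached to the loaded data as labels, which is why the approach fits containers that support labelled dimensions.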

@rabernat mentioned this issue Apr 14, 2021