Addressing versioned data #32

Closed
c42f opened this issue Nov 24, 2021 · 0 comments · Fixed by #33

c42f commented Nov 24, 2021

Suppose I want to use git to store and version a dataset. How would I open a particular version of that dataset?

One option mentioned elsewhere by @pfitzseb would be to have dataset("name") load the latest version and have some syntax like dataset("name@v2") / dataset("name#hk98s2") load a specific version/hash, much like Pkg.
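To make the Pkg-like idea concrete, here is a minimal sketch of how such a string could be split into a name, a version and a hash. It is written in Python purely for illustration; `DatasetSpec`, `parse_dataset_spec` and the field names are hypothetical and not part of any existing API.

```python
from typing import NamedTuple, Optional


class DatasetSpec(NamedTuple):
    """Hypothetical result of parsing a Pkg-like dataset string."""
    name: str
    version: Optional[str] = None       # taken from "name@v2"
    content_hash: Optional[str] = None  # taken from "name#hk98s2"


def parse_dataset_spec(spec: str) -> DatasetSpec:
    # Split off the hash first, so that "name@v2#abc" would also work.
    head, has_hash, hash_part = spec.partition("#")
    name, has_version, version_part = head.partition("@")
    if not name:
        raise ValueError(f"missing dataset name in {spec!r}")
    return DatasetSpec(
        name,
        version_part if has_version else None,
        hash_part if has_hash else None,
    )


print(parse_dataset_spec("name"))
print(parse_dataset_spec("name@v2"))
print(parse_dataset_spec("name#hk98s2"))
```

With no suffix, both optional fields stay `None`, which a caller could interpret as "load the latest version".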

Another similar idea would be to add keyword arguments to dataset(). But I do think there are some benefits to using syntax within the string rather than keywords. URLs show how useful a standard string representation of resources-with-parameters can be.
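As a rough illustration of the "parameters inside the string" idea, the sketch below reuses plain URL query and fragment syntax and parses it with Python's standard `urllib.parse`. The spec strings, the `version` and `cache` parameter names, and the function `parse_url_style` are all assumptions made up for this example, not a proposed final syntax.

```python
from urllib.parse import parse_qs, urlsplit


def parse_url_style(spec: str):
    """Split a hypothetical 'name?key=value#fragment' dataset string."""
    parts = urlsplit(spec)
    # parse_qs returns a list per key; keep the last value for simplicity.
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    # Caveat: a name containing ':' would be read as a URL scheme by urlsplit.
    return parts.path, params, parts.fragment or None


print(parse_url_style("name?version=v2"))
print(parse_url_style("name?version=v2&cache=false#sub/dir"))
```

The appeal is that every parameter, version included, goes through the same `key=value` mechanism, so adding a new one needs no new syntax.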

The URN RFC is a good source of inspiration here. In particular, it specifies three sets of parameters: the r-component, q-component and f-component (see https://datatracker.ietf.org/doc/html/rfc8141#page-12):

  • r-component - parameters passed to the name resolver. (For us, this corresponds to passing parameters to the AbstractDataProject.)
  • q-component - parameters passed to either the named resource or a system that can supply the requested service. The q-component is specified to have the same syntax as the query part of a URL. (For us, I guess this corresponds to passing parameters to the storage backend when it open()s the dataset.)
  • f-component - interpreted by the client as a specification for a location within, or region of, the named resource; similar to the fragment of a URL. (For us, this would be parameters applied to the object which comes from open()ing a dataset. For example, to supply a relative path within a BlobTree.)

While I think the URN RFC has some useful concepts, I'm not super keen on its syntax. It is like URI syntax but confusingly, subtly different: the r-component is prefixed with ?+ and the normal query part with ?=, giving name?+rcomponent?=qcomponent#fcomponent.
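For reference, the three-way split in the URN syntax can be sketched as follows. This is an illustrative Python parser only: `UrnStyleSpec`, `parse_urn_style` and the field names are hypothetical, chosen to mirror the mapping described above (resolver, storage backend, opened object), and the example parameters are made up.

```python
from typing import NamedTuple, Optional


class UrnStyleSpec(NamedTuple):
    """Hypothetical holder for the parts of name?+r?=q#f."""
    name: str
    resolver_params: Optional[str]  # r-component: for the data project
    backend_params: Optional[str]   # q-component: for the storage backend
    location: Optional[str]         # f-component: location within the opened object


def parse_urn_style(spec: str) -> UrnStyleSpec:
    # Peel the components off from the right: "#f", then "?=q", then "?+r".
    rest, _, f_part = spec.partition("#")
    rest, _, q_part = rest.partition("?=")
    name, _, r_part = rest.partition("?+")
    return UrnStyleSpec(name, r_part or None, q_part or None, f_part or None)


print(parse_urn_style("name?+proj=a?=version=v2#sub/path"))
print(parse_urn_style("name#sub/path"))
```

Each component is left as an uninterpreted string here; how its contents would be parsed is a separate decision for whichever layer receives it.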

But I'm also not sure the Pkg syntax is quite what we want. For packages, it's useful to make versioning central to the syntax, to the extent of spending two different pieces of syntax (@ and #) just on specifying versions. Unlike Pkg, I think there could be other parameters we want to pass when addressing data storage, not just a version.
