model: need to avoid a "pure" Union or "serialize" it with type info #333
…generic See #333 for more information === Do not change lines below === { "chain": [], "cmd": "sed -i -e 's/str, AnyUrl/AnyUrl, str/g' {inputs}", "exit": 0, "extra_inputs": [], "inputs": [ "dandi/models.py" ], "outputs": [ "dandi/models.py" ], "pwd": "." } ^^^ Do not change lines above ^^^
thanks @yarikoptic for this detailed issue. there are a few additional things - this is true in json schema as well, not just pydantic, where you can specify a union of allowed types in the schema. unfortunately there is no type information in the data itself. the validator can then only determine whether the data fits any of the schemas. one thing we can do is to add "type" (as is the case in jsonld) as a property in every object and make this a readonly value. here is the schema.org jsonld example for book with identifier.
{
"@context": "https://schema.org/",
"@type": "Book",
"name": "Library linked data in the cloud : OCLC's experiments with new models of resource description",
"author": "Carol Jean Godby",
"isbn": "9781627052191",
"identifier": {
"@type": "PropertyValue",
"propertyID": "OCoLC",
"value": "889647468"
},
"sameAs": "http://www.worldcat.org/oclc/889647468"
}
this will not address trivial types like AnyUrl and str, but should handle the more complex types.
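A minimal sketch of the "type as a readonly property" idea, assuming plain stdlib dataclasses as a stand-in for the pydantic models. The trimmed-down `Book` and `PropertyValue` classes and the `REGISTRY`, `load` and `dump` helpers are hypothetical names used only for illustration; they are not part of dandi or pydantic.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Optional


@dataclass(frozen=True)
class PropertyValue:
    propertyID: str
    value: str
    # Fixed tag: not accepted by the constructor, and frozen => read-only.
    type: str = field(default="PropertyValue", init=False)


@dataclass(frozen=True)
class Book:
    name: str
    identifier: Optional[PropertyValue] = None
    type: str = field(default="Book", init=False)


# Hypothetical lookup table from tag value to class.
REGISTRY = {"PropertyValue": PropertyValue, "Book": Book}


def load(data):
    """Rebuild an object from a dict, dispatching on its "@type" key."""
    if not isinstance(data, dict) or "@type" not in data:
        return data  # plain value: nothing to dispatch on
    cls = REGISTRY[data["@type"]]
    # Keys starting with "@" (e.g. "@type", "@context") are metadata, not fields.
    kwargs = {k: load(v) for k, v in data.items() if not k.startswith("@")}
    return cls(**kwargs)


def dump(obj):
    """Serialize to a dict, emitting the fixed tag as "@type"."""
    if not is_dataclass(obj):
        return obj
    out = {"@type": obj.type}
    for f in fields(obj):
        value = getattr(obj, f.name)
        if f.name != "type" and value is not None:
            out[f.name] = dump(value)
    return out


doc = {
    "@type": "Book",
    "name": "Library linked data in the cloud",
    "identifier": {
        "@type": "PropertyValue",
        "propertyID": "OCoLC",
        "value": "889647468",
    },
}
book = load(doc)
```

Because every object names its own type, `load` never has to guess, and `dump(book)` reproduces the input dict. As noted above, this does nothing for bare values such as a string that may or may not be a URL.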
The relevant notion we are dancing around here is that of the tagged union. Unfortunately, it seems that pydantic explicitly doesn't yet support this. Here is someone's attempt at a custom validator for Pydantic that uses …

I only have rather vague advice for this situation, which boils down to: let's keep things as simple as possible. In this case, simplicity meshes with explicitness (making it a twofer in terms of Python Zen), requiring something like a …

It may feel wrong because we shouldn't have to do it, and it makes the schemata themselves bulkier, but IMO that would be made up for by forcing the extra complexity on the machine, and detaching it from our human selves (keeping in mind that we are trying, as a design goal, not to expose DANDI users to the raw schemata). I believe this would also allow an easier path to UI customization (should that be necessary).
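To make the difference concrete, here is a minimal sketch contrasting an untagged union (first candidate that accepts the data wins) with a tagged one. It uses stdlib dataclasses rather than pydantic. The class names `Disorder` and `Anatomy` and the optional `dxdate` attribute come from the issue; the `name` field, the `"type"` key, the sample value and both helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Disorder:
    name: str
    dxdate: Optional[str] = None  # optional, so it cannot disambiguate


@dataclass
class Anatomy:
    name: str


def first_match(data, candidates):
    """Untagged union: return the first candidate whose constructor accepts the data."""
    for cls in candidates:
        try:
            return cls(**data)
        except TypeError:
            continue
    raise ValueError("no candidate type accepts the data")


def by_tag(data, candidates, tag_key="type"):
    """Tagged union: the data itself names the type to build."""
    data = dict(data)  # do not mutate the caller's dict
    tag = data.pop(tag_key)
    for cls in candidates:
        if cls.__name__ == tag:
            return cls(**data)
    raise ValueError(f"unknown tag {tag!r}")


record = {"name": "cerebellum"}
untagged = first_match(record, [Disorder, Anatomy])
tagged = by_tag({"type": "Anatomy", **record}, [Disorder, Anatomy])
```

Here `untagged` comes back as a `Disorder` simply because that class is listed first, while `tagged` is an `Anatomy` regardless of the order of the candidates.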
Just want to point out that if we are going to include type tags on unions, they specifically need to be e.g. …
in order for the client UI component to recognize them first hand. Otherwise, we'd still have to transform the schema on the client, which is not something we want to do.
@AlmightyYakob and @waxlamp - i did a bunch of work on this yesterday to better understand the issues and i'm getting closer to a newer model. indeed certain specific types of union are what is causing trouble. not all unions. and the other piece is the union in identifier. i hope to resolve both of these things today or by tomorrow.
This is to continue on our metadata meetup with @satra @jwodder @AlmightyYakob @waxlamp discussing the difficulties of establishing a sensible yet generic and scalable UI for (dandiset) metadata editing.

One of the problematic cases is the use of the `Union` type/construct in the model, and then relying on pydantic to "make the right choice" of the underlying model while consuming serialized (yaml or json) data.

Here are the `Union`s we have ATM (0.10.0-40-gcee1cbd):

The core assumption is that our types would have some attributes unique to them, so the choice of the type becomes unambiguous. That is not the case ATM (if it could ever be achieved), and here are some thoughts/observations:
Order of types should not matter but it does, preventing round trip to .json and back
ATM pydantic would choose the first type (in the list within `Union`) which 'satisfies' the data. Unfortunately IMHO this is a mis-feature and ideally we should not rely on it. It makes it virtually impossible to define/use two types which have only a semantic difference by being different types and otherwise having the same attributes. E.g.,
- `about` above (`Disorder` does have an attribute `dxdate`, but it is optional): the former would always take "precedence" when we just read in serialized data without any type annotation. So ATM we cannot do R/T on those into/from a serialized form which has no type annotations. Yes, we could "improve" the schema models by making them more 'specialized' (magic word to mention here: "Ontologies") but IMHO it would be "artificial"/duplicate/fragile (depending on the case ;)), since semantically we already know that they are different types, since we defined them separately.
- `Union[str, AnyUrl]`. We would need to swap the order because otherwise no R/T is possible, since `AnyUrl` serializes just into a URL. Quick demo
  results in
- … `name`, `url`, `email` are provided) (see another example below).

So my question is ...
is there some existing type construct we could use instead of `Union`, or in addition to it (like a "decorator" for each value which would disclose the "type"), so that serialized data (given the schema) could be unambiguously deserialized?

E.g. taking for example the top of https://github.com/dandi/dandi-api-datasets/blob/master/000004/dandiset.yaml (in current schema):
which corresponds to
where it takes my unquestionably high expertise in the matter to say that it is an `Anatomy` and not e.g. `Disorder` ... (Exercising on the case of the `List[Union]`, not just a pure `Union`,
but I think it should generalize). I could see the following serializations which would disambiguate:

Both would make serialization only a bit heavier, but clear(er) to humans and computers (UI), and thus allow for R/T. The latter one is IMHO more "generic" if, for every value participating in a `Union`ed type, we export the `_schemaType`. That would then apply only to values which are part of the union, and make it easily "upgradeable": if we change some attribute from a single to a Union type, it will just add that "protected" attribute to an existing record without changing the actual "layout".

BUT it would be quite ugly in that simple case of `AnyUrl, str`. My answer would be: that is a reasonable compromise to achieve unambiguous R/T at a minor cost in visual/human readability. But maybe we could do better, and just establish a "programmable" rule that in the case of the first type chosen in the Union, we omit the explicit `_schemaType`? Then it could be coded in any schema handling.

Yet another (wild idea) alternative is to forget about JSON and use its superset YAML for serialization, and then provide type annotations in the comments
even though it would likely inflict more developer pains (in particular in the JS world; and I have no clue if it would be feasible to achieve with pydantic ATM) while making it more user-friendly. That would also keep opportunities for later using additional features of YAML: anchors (`&`) and references (`*`) in case of those circular entities (not the topic of this "post") we touched upon.
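The `_schemaType` proposal, including the rule of omitting the tag for the first type of the union, could look roughly like the sketch below. It is stdlib-only and makes several assumptions: `Url` is a hypothetical stand-in for pydantic's `AnyUrl`, the `name` field is invented, unions are passed as plain tuples of classes, and tagged bare values are wrapped under a made-up `"value"` key.

```python
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class Anatomy:
    name: str


@dataclass
class Disorder:
    name: str


class Url(str):
    """Hypothetical stand-in for pydantic's AnyUrl: just a str subclass."""


def dump_member(value, union):
    """Serialize one value of a union; the tag is omitted for the union's first type."""
    cls = type(value)
    if cls not in union:
        raise TypeError(f"{cls.__name__} is not a member of the union")
    if cls is union[0]:
        # Default type: keep the existing layout untouched.
        return asdict(value) if is_dataclass(value) else str(value)
    payload = asdict(value) if is_dataclass(value) else {"value": str(value)}
    return {"_schemaType": cls.__name__, **payload}


def load_member(data, union):
    """Inverse of dump_member: use the tag if present, else the first type."""
    by_name = {c.__name__: c for c in union}
    if isinstance(data, dict) and "_schemaType" in data:
        data = dict(data)  # do not mutate the caller's dict
        cls = by_name[data.pop("_schemaType")]
        return cls(**data) if is_dataclass(cls) else cls(data["value"])
    cls = union[0]  # no tag: by convention, the first type of the union
    return cls(**data) if is_dataclass(cls) else cls(data)


about_union = (Anatomy, Disorder)
url_union = (Url, str)
```

With this convention, `dump_member(Anatomy("cerebellum"), about_union)` keeps the plain `{"name": ...}` layout, a `Disorder` gains only the extra `_schemaType` key, and a `Url` in `url_union` still serializes to a bare string. The cost is visible in the other direction: a plain `str` in `url_union` has to be wrapped, which is the "ugly" case discussed above.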