Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use metadata schemas #82

Closed
wants to merge 4 commits into from
Closed

Conversation

petrelharp
Copy link
Contributor

The interface for metadata is unchanged, but under the hood uses the new metadata schemas in tskit. We shouldn't merge this until SLiM writes out metadata schemas to the tables: mostly, everything is the same, except that previously, we allowed non-SLiM metadata (eg empty metadata), but that will no longer be allowed; this may break previous code that relied on checking the presence of metadata to see if mutations or nodes were added by SLiM or if populations were used.

Furthermore, msprime.mutate( ) will no longer return a trivially SlimTreeSequence-able tree sequence, because newly added mutations will need SLiM metadata. I suppose this calls for a pyslim.mutate( ) method, and previous code, which did pyslim.SlimTreeSequence(msprime.mutate(...)) will break - unless we have pyslim.SlimTreeSequence( ) add default metadata to entries in need of such. We'll probably want to write a pyslim.mutate( ) method using the slim mutation generation ability of msprime, anyhow.

Note that the code in this PR already adds default metadata for populations without metadata.

Similarly, pyslim.recapitate( ) will need to be modified to add metadata to newly added nodes.

It would also be very nice to ditch the metadata processing we do here enitrely - leave the metadata as dicts as returned by tskit instead of translating them to objects. That would be a more serious change, although if we're making breaking changes, then perhaps it's OK - it would only require changing code from e.g. node.metadata.slim_id to node.metadata['slim_id'].

@petrelharp
Copy link
Contributor Author

Whoops - there's a number of remaining errors. I see that one problem is that SLiM itself is counting on empty metadata as signifying "not really a population", so can't reload these tree sequences. We'll have to come up with a way to specify "not really a population" in the metadata.

@benjeffery
Copy link
Member

This is interesting that SLiM uses empty metadata as a signal. It would be possible to modify tskit's metadata handling to allow b'' to be decoded to None. The metaschema would support this by allow the top-level type to be object or null.

@petrelharp
Copy link
Contributor Author

This is interesting that SLiM uses empty metadata as a signal.

This wasn't a design choice we thought much about, but it makes a bit of sense - it avoids us having to make up e.g. sex ratios for populations that didn't exist.

Allowing null values would avoid some breakage out there in the wild; in particular everyone's scripts that are doing pyslim.SlimTreeSequence(msprime.mutate(ts, ...)). If this doesn't seem bad to others, I'd be in favor of it.

@jeromekelleher
Copy link
Member

This sounds like a good idea to me - could we specify that the top-level object can be null directly for SLiM's schema, rather than in the metaschema?

@benjeffery
Copy link
Member

benjeffery commented Jun 1, 2020

Yes, JSON schema allows union types so how I saw this working is that the metaschema allows the schema to specify a union of ["null", "object"] at the top-level only.

I'm not sure we have discussed the restriction on "fixed objects" (all keys present) enough. It's related to this as I can imagine software modifying a tree sequence only adding a particular metadata key to some rows, and the current stuct codec forcing it to pick a "na" value for the other rows. It would be possible to extend the struct codec to cope with missing fields, at a cost of some complexity and perf (hopefully only perf in the case of optional keys).

@jeromekelleher
Copy link
Member

Yes, JSON schema allows union types so how I saw this working is that the metaschema allows the schema to specify a union of ["null", "object"] at the top-level only.

Sounds good to me. I think it's fine to have all keys required in the struct codec for now (with this one exception).

@benjeffery
Copy link
Member

Filed as tskit-dev/tskit#659

@petrelharp
Copy link
Contributor Author

Thanks a bunch. I'll have a go at this.

@petrelharp petrelharp mentioned this pull request Aug 12, 2020
@petrelharp
Copy link
Contributor Author

Closing in favor of #89

@petrelharp petrelharp closed this Aug 12, 2020
@petrelharp petrelharp deleted the tskit_metadata branch September 6, 2020 00:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants