Use metadata schemas #82

petrelharp · 2020-05-28T06:10:34Z

The interface for metadata is unchanged, but under the hood it now uses the new metadata schemas in tskit. We shouldn't merge this until SLiM writes out metadata schemas to the tables. Mostly, everything is the same, except that previously we allowed non-SLiM metadata (e.g., empty metadata), and that will no longer be allowed; this may break existing code that relied on checking for the presence of metadata to see whether mutations or nodes were added by SLiM or whether populations were used.

Furthermore, msprime.mutate() will no longer return a trivially SlimTreeSequence-able tree sequence, because newly added mutations will need SLiM metadata. I suppose this calls for a pyslim.mutate() method, and previous code that did pyslim.SlimTreeSequence(msprime.mutate(...)) will break, unless we have pyslim.SlimTreeSequence() add default metadata to entries in need of it. We'll probably want to write a pyslim.mutate() method using the SLiM mutation generation ability of msprime, anyhow.
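A minimal sketch of the "add default metadata to entries in need" idea, assuming decoded metadata rows are plain dicts and that an empty row shows up as `None`, `b""`, or `{}`. The helper name `fill_default_metadata` and the default values are hypothetical and do not reproduce pyslim's or SLiM's actual metadata layout.

```python
from copy import deepcopy


def is_empty(row):
    """True for the values this sketch treats as 'no metadata'."""
    return row is None or row == b"" or row == {}


def fill_default_metadata(rows, default):
    """Return a new list where empty rows are replaced by a copy of `default`.

    Hypothetical helper: rows that already carry metadata are left untouched.
    """
    return [deepcopy(default) if is_empty(row) else row for row in rows]


# Placeholder default, assumed for illustration only.
default_mutation = {"mutation_type": 1, "selection_coeff": 0.0}

# One SLiM-style row followed by two rows added without metadata.
rows = [{"mutation_type": 2, "selection_coeff": 0.1}, b"", None]
filled = fill_default_metadata(rows, default_mutation)
```

Each filled row gets its own copy of the default, so later edits to one row do not leak into the others.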

Note that the code in this PR already adds default metadata for populations without metadata.

Similarly, pyslim.recapitate() will need to be modified to add metadata to newly added nodes.

It would also be very nice to ditch the metadata processing we do here entirely - leave the metadata as dicts as returned by tskit instead of translating them to objects. That would be a more serious change, although if we're making breaking changes, then perhaps it's OK - it would only require changing code from e.g. node.metadata.slim_id to node.metadata['slim_id'].
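To illustrate the size of that change, here is a minimal sketch using a hypothetical `NodeMetadata` dataclass as a stand-in for the objects currently built from metadata. The `is_null` field is an illustrative assumption, not pyslim's actual definition.

```python
from dataclasses import dataclass


@dataclass
class NodeMetadata:
    """Hypothetical stand-in for a translated metadata object."""

    slim_id: int
    is_null: bool  # illustrative field, assumed for this sketch


# What a schema-aware decoder would hand back: a plain dict.
as_dict = {"slim_id": 12, "is_null": False}

# What an object-translation layer would build from that dict.
as_object = NodeMetadata(**as_dict)

# Old-style attribute access versus new-style key access.
old_style = as_object.slim_id
new_style = as_dict["slim_id"]
```

Both lookups return the same value; only the access syntax in user code would change.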

petrelharp · 2020-05-28T06:18:16Z

Whoops - there are a number of remaining errors. One problem is that SLiM itself counts on empty metadata to signify "not really a population", so it can't reload these tree sequences. We'll have to come up with a way to specify "not really a population" in the metadata.

benjeffery · 2020-05-31T01:08:02Z

This is interesting that SLiM uses empty metadata as a signal. It would be possible to modify tskit's metadata handling to allow b'' to be decoded to None. The metaschema would support this by allowing the top-level type to be object or null.

petrelharp · 2020-05-31T04:04:27Z

> This is interesting that SLiM uses empty metadata as a signal.

This wasn't a design choice we thought much about, but it makes a bit of sense - it avoids us having to make up e.g. sex ratios for populations that didn't exist.

Allowing null values would avoid some breakage out there in the wild; in particular everyone's scripts that are doing pyslim.SlimTreeSequence(msprime.mutate(ts, ...)). If this doesn't seem bad to others, I'd be in favor of it.

jeromekelleher · 2020-06-01T09:09:35Z

This sounds like a good idea to me - could we specify that the top-level object can be null directly for SLiM's schema, rather than in the metaschema?

benjeffery · 2020-06-01T09:34:14Z

Yes, JSON schema allows union types so how I saw this working is that the metaschema allows the schema to specify a union of ["null", "object"] at the top-level only.
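A rough sketch of how decoding could behave under such a union, assuming a JSON-encoded row for simplicity. The schema contents and the `decode_metadata` and `allows_null` helpers are hypothetical illustrations, not tskit's implementation.

```python
import json

# Hypothetical schema: the top-level type is a union of object and null.
NULLABLE_SCHEMA = {
    "codec": "json",
    "type": ["object", "null"],
    "properties": {"slim_id": {"type": "integer"}},
}

# Hypothetical schema without the union: every row must be an object.
STRICT_SCHEMA = {
    "codec": "json",
    "type": "object",
    "properties": {"slim_id": {"type": "integer"}},
}


def allows_null(schema):
    """True if the schema's top-level type permits null."""
    top = schema.get("type")
    return top == "null" or (isinstance(top, list) and "null" in top)


def decode_metadata(raw, schema):
    """Decode one row, mapping b'' to None only when the schema allows null."""
    if raw == b"":
        if allows_null(schema):
            return None
        raise ValueError("empty metadata is not permitted by this schema")
    return json.loads(raw.decode("utf-8"))
```

With this shape, `decode_metadata(b"", NULLABLE_SCHEMA)` yields `None`, while the same call with `STRICT_SCHEMA` raises a `ValueError`.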

I'm not sure we have discussed the restriction on "fixed objects" (all keys present) enough. It's related to this, as I can imagine software modifying a tree sequence only adding a particular metadata key to some rows, and the current struct codec forcing it to pick a "na" value for the other rows. It would be possible to extend the struct codec to cope with missing fields, at a cost of some complexity and perf (hopefully only perf in the case of optional keys).
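The "na" problem can be sketched with Python's standard `struct` module: a fixed binary layout has a slot for every key, so a row that lacks a key must still write something there. The layout, the `extra_key` name, and the `NA` sentinel below are assumptions made for illustration and are not tskit's struct codec.

```python
import struct

# Hypothetical fixed layout: little-endian int64 slim_id, then int32 extra_key.
ROW_FORMAT = struct.Struct("<qi")

# Sentinel written when a row has no value for extra_key (an assumed convention).
NA = -1


def encode_row(row):
    """Pack a row; a missing extra_key is forced to the NA sentinel."""
    return ROW_FORMAT.pack(row["slim_id"], row.get("extra_key", NA))


def decode_row(raw):
    """Unpack a row; the reader cannot tell NA apart from a real -1."""
    slim_id, extra_key = ROW_FORMAT.unpack(raw)
    return {"slim_id": slim_id, "extra_key": extra_key}


with_key = decode_row(encode_row({"slim_id": 7, "extra_key": 42}))
without_key = decode_row(encode_row({"slim_id": 8}))
```

The row that never set `extra_key` still decodes with that key present, which is the ambiguity that optional fields would remove.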

jeromekelleher · 2020-06-01T09:43:39Z

> Yes, JSON schema allows union types so how I saw this working is that the metaschema allows the schema to specify a union of ["null", "object"] at the top-level only.

Sounds good to me. I think it's fine to have all keys required in the struct codec for now (with this one exception).

benjeffery · 2020-06-01T09:48:04Z

Filed as tskit-dev/tskit#659

petrelharp · 2020-06-01T16:01:50Z

Thanks a bunch. I'll have a go at this.

petrelharp · 2020-08-12T20:41:03Z

Closing in favor of #89

petrelharp mentioned this pull request May 28, 2020

Is struct code sufficiently rich for SLiM and fwdpy11? tskit-dev/tskit#635

Closed

benjeffery mentioned this pull request Jun 1, 2020

Allow top-level metadata type to be a union of null, object. tskit-dev/tskit#659

Closed

petrelharp force-pushed the tskit_metadata branch from 0af6ae1 to e44f268 Compare June 8, 2020 19:33

petrelharp force-pushed the master branch from 5c4fea1 to b254393 Compare July 8, 2020 21:43

petrelharp added 4 commits August 9, 2020 15:22

switched over to new methods

1c3ff81

dump reference sequence; closes tskit-dev#84

fd2f165

more changes

246d828

.

c0254d4

petrelharp force-pushed the tskit_metadata branch from e44f268 to c0254d4 Compare August 9, 2020 22:22

petrelharp mentioned this pull request Aug 12, 2020

Update tskit #89

Merged

petrelharp closed this Aug 12, 2020

petrelharp deleted the tskit_metadata branch September 6, 2020 00:32
