update tskit #101
Conversation
"table collections loaded from files no longer live on top of the underlying kastore memory" -so this fixes #72 ? Or is that a different issue? |
"It would be nice to have some developer documentation that says how to run the tests." Well, the problem is that this is pretty technical and not very automated, at present. You can run the -eidosTest and -slimTest tests, which are documented, but they don't exercise tree-sequence recording significantly; they are more tests of APIs and such than tests that complex models work. Then there's the https://github.com/MesserLab/SLiM-Tests repository, but since the tests there are complex models, their results are stochastic, and there's not a well-defined "right" answer for most of them; you have to compare current results against past results, and when there's a discrepancy, track down the cause and verify that it's an intended change in behavior, not a bug. This is an ongoing problem; testing code that is stochastic is tricky. I'm not saying it's not solvable, but I haven't had the time to solve it in a way that is documentable. Sorry! This is why we need a funding line for a QA/testing person who can build better tests that automatically validate the software; it's a hard problem, and is really a full-time job for someone. |
These diffs include the new hook for external sorting of the edge table, right? I want to plug into that. Yes, we can do the file format change, but of course it's tricky since we want to preserve backward compatibility. Let's chat about that. Anyway, regarding the testing, once the PR feels final, I'll just take it and then I'll do the testing on my side. Not great, but it's what we've got.
Yes, this will close #72.
So: about the metadata - I forget how much of this you know already, but: First, tskit now has the sort of thing we do with metadata built in - one specifies a metadata schema, which is a new (optional) property of tables, and then tskit will automatically validate and decode it for you (much like pyslim already does). This happens at the python level, so this will only affect us in that we'll add appropriate metadata schemas to the tables. We could also check that metadata schemas are correct when loading in tables, but this seems unnecessary. Second, there is table collection-level metadata also. We should make up a schema for SLiM, and put the things we need to know but currently are pulling out of provenance in there: WF/nonWF, SLiM generation, not sure what else. Third, there's the mutation time. Currently we store origin_generation in metadata; the time column is in tskit time-ago. We could leave everything the way it is, which would be convenient, since we wouldn't have to translate back and forth. But we really should not be storing the same thing in more than one place. So, we should remove origin_generation from the mutation metadata struct, and get the information out of the mutation time column. This should be straightforward, except for dealing with the discrepancy between tskit_generation and generation.
Checks are failing because the current GitHub head fails a couple of -eidosTest unit tests due to changes in progress under the hood. I can commit a fix that will resolve all of those issues, and then this PR should be able to pass checks, once you rebase it to the current head. All the changes are in Eidos, not SLiM, so you should have zero conflicts. Shall I go ahead and do that commit?
Oh, and: yes, this does have the new sorting hooks in! Here's an example of how to do it: https://github.com/tskit-dev/tskit/blob/master/c/examples/cpp_sorting_example.cpp
Sure, go ahead, but no hurry. I can ignore the Eidos tests for now. |
I forget how much of it I know already, too. :->
OK. Checking that the schemas are what we expect them to be on load actually seems like a good idea to me – this would be a way to check that everything is in "SLiM format", without relying on the rest of the load code to catch discrepancies/mismatches when it isn't. It would just be a simple string equality test, right? We should define our schemas in
Yep, this makes sense. Some decisions to be made regarding what belongs in the metadata versus the provenance, probably. Can pyslim insulate end users from the fact that these things are moving around? I.e., can the old way of accessing all this metadata continue to work (perhaps designed as deprecated/legacy APIs, if you like), so that people don't need to dance around this in their Python code? The alternative, I guess, if people want to write Python code that works with both old and new .trees files, would be that they would have to check the version of the file somehow, and use different pyslim APIs depending on the file version, which would be really gross.
I'm a little bit leery of this. The origin_generation field is an integer, defined/managed by SLiM, and actually SLiM even allows models to manipulate it for their own purposes if they want to – it can be used as an extra user-defined value. And the mutation time column is a floating-point number, right? Since tskit time is continuous? And it would potentially contain fractional values, even for SLiM models, because of the way we play with the tree sequence time to compensate for early()/late(), etc., right? All of this would make it very complicated to use that column anyway. And then – not that I expect this to be an issue for anybody any time soon – there is the fact that the range of 64-bit integers is much larger than the range of integer values that can be exactly represented by double-precision floats; they only get 53 bits of significand, plus their exponent and sign, so above 2^53 they can no longer represent each successive integer exactly; the set of integers they can represent starts to have gaps. So a model that ran for more than 2^53 generations would lose origin_generation precision. Again, I don't expect that to bite anyone any time soon, realistically; but I don't like stuffing a 64-bit int value into a double; it sets off alarm bells in my brain, and "oh, this will never bite anybody" is a common source of problems that do eventually bite someone. :->
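The precision concern above is easy to demonstrate: an IEEE double's 53-bit significand represents every integer up to 2^53 exactly, and above that, consecutive integers start to collide. A quick illustration in Python (whose `float` is a C double):

```python
# IEEE double precision: integers are exact only up to 2**53.
big = 2**53

assert float(big) == float(big + 1)   # 2**53 + 1 is not representable; it rounds to 2**53
assert float(big - 1) != float(big)   # just below 2**53, every integer is still exact
assert int(float(big + 1)) == big     # the +1 was silently lost on the round trip
```

So a generation counter stored in a double is exact until 2^53 ≈ 9×10^15, and silently lossy beyond that, which is the alarm bell being described.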
Ok, let's do it. I think you're right, it'll be just string comparison. (There are probably equivalent schema that differ in eg whitespace, but we can ignore that.)
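One way around the whitespace caveat, if it ever matters, is to compare schemas as parsed JSON rather than raw strings. A sketch, using an illustrative schema fragment rather than SLiM's actual schema:

```python
# Sketch: whitespace-insensitive schema comparison via parsed JSON.
# The schema text below is an illustrative fragment, not SLiM's real schema.
import json

expected_schema = '{"codec": "json", "properties": {"model_type": {"type": "string"}}}'
loaded_schema = '{ "codec":"json", "properties":{"model_type":{"type":"string"}} }'

# Raw string comparison is tripped up by whitespace differences...
assert expected_schema != loaded_schema
# ...but comparing the parsed objects ignores whitespace and key order.
assert json.loads(expected_schema) == json.loads(loaded_schema)
```

For schemas that SLiM itself wrote, plain string equality is fine, as noted above; the parsed comparison only matters for files produced by other tools.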
Yes, totally.
Oh goodness.
Hm, good point. It would contain fractional time values for msprime-added neutral mutations, unless we require people to add mutations using the discrete-time Wright-Fisher model, which we don't want to do. I don't think we use fractional values, at present, to denote early/late, however?
Hm. As you say, unlikely to hit anyone anytime soon... 10^15 generations is firmly beyond what anyone is doing. So, ok - we'll keep this mostly-but-not-entirely redundancy in there. But we'd better document it!
Good.
Great.
Yeah, allowing that might have been a bad decision. :->
Ah, not a fractional value for early/late, but the tree-seq time counter flips at a different time than SLiM's counter does in the generation cycle, IIRC. But we do use fractional times in SLiM to resolve time conflicts involved in successive calls to addSubpopSplit() within a single generation; search on
Yep. Let me know when the ball is in my court. Thanks for dealing with this stuff, happy to help when help is needed. I'll commit those Eidos fixes soon. Right now, going back to assembling IKEA furniture. :-O
It might be worth making a list of overly-flexible things to disallow in SLiM 4.0...
Sure. Want to kick it off with a new issue? :->
It looks to me from ReadProvenanceTable that the only things we get from provenance are:
... so, these should go in metadata, instead (this will be a file format bump, but it will greatly simplify things, since we won't have to parse provenance any more). So, it'll look like this:
For reference, here's the other things written to provenance. Would it also be helpful to say whether we've got a reference sequence? Anything else? We also need to say "this is a SLiM tree sequence", but I think we'll do this by checking whether the metadata schema – i.e., the thing that says what should be in the metadata – is a SLiM metadata schema (but, I'm checking with Ben J about best practices here...).
This sounds good. Of course we can't actually get rid of the provenance-parsing code, because we need to be able to read the old-format files for backward compatibility; but it'll certainly streamline the code path taken by up-to-date users.
Hmm. I don't think there's anything else we need, in the sense that we seem to be getting by fine with what we had. But perhaps we should write out a flag saying whether the model was declared as nucleotide-based or not (and if it was, that would imply that a reference sequence is present, since a reference sequence is required by SLiM for all nucleotide-based models). And perhaps we should write out the spatial dimensionality of the model, as "", "x", "xy", or "xyz", and write out which dimensions are periodic (which would be "", "x", "y", "xy", "xz", "yz", or "xyz"). And perhaps we should write out whether it is a separate-sex model (i.e., initializeSex() was called) or not, with a boolean flag. I don't think we need any of those things, and there's no need to do anything with any of them on read, for now; but since we have a convenient/correct place to put them now, there's no harm in writing them out, and perhaps we or someone else will find them useful in future, and perhaps it will even prevent us from having to revise the file format again some time down the road. If you're not sure where to get the necessary info from, I can add these things after your PR is accepted, of course.
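The flags proposed above are easy to sanity-check on write or read. A sketch using the enumerated values from the comment — the key names themselves (`nucleotide_based`, `spatial_dimensionality`, etc.) are assumptions, not SLiM's final schema:

```python
# Sketch of validating the proposed top-level flags. Key names are
# hypothetical; the allowed values are the ones enumerated above.
VALID_DIMENSIONALITY = {"", "x", "xy", "xyz"}
VALID_PERIODICITY = {"", "x", "y", "xy", "xz", "yz", "xyz"}

def validate_flags(md):
    assert md.get("nucleotide_based") in (True, False)
    assert md.get("spatial_dimensionality") in VALID_DIMENSIONALITY
    assert md.get("spatial_periodicity") in VALID_PERIODICITY
    assert md.get("separate_sexes") in (True, False)
    # A reference sequence is required by SLiM for all nucleotide-based models:
    if md["nucleotide_based"]:
        assert md.get("has_reference_sequence", False)
    return True

assert validate_flags({
    "nucleotide_based": False,
    "spatial_dimensionality": "xy",
    "spatial_periodicity": "x",
    "separate_sexes": True,
})
```

A fuller version would also check that the periodic dimensions are a subset of the declared dimensionality; this sketch leaves that out.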
Perhaps also by checking that the tree sequence metadata is present, and that its length matches the expected length? But yeah, checking the schema makes sense to me. |
Oh, and: I was going to fix Eidos so the checks don't fail, but I realized that that's waiting on the quantile() function getting fixed; at present it fails its unit tests, so the check for this PR is going to fail until that gets fixed. |
Ok, so:
The other Ben suggests putting all this stuff under a top-level "slim" key, so that other programs can add other stuff. So:
And: thinking more, I don't think we should just do string comparison to check for slim-tree-sequence-ey-ness: really, all we need to do is check for presence of certain metadata; that means checking for presence of certain keys in the schema. If other stuff is there, that's OK, so it leaves the door open for some other program to make a SLiM tree sequence with extra information in it that SLiM doesn't use. Since you've already got json-parsing code in there, this will be easy. So, we just need to check if the top-level metadata has model_type and generation; if the mutation metadata has selection_coeff and origin_generation and the other things, etcetera.
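The key-presence check described above might look like this. A sketch only — the schema fragments are illustrative, not SLiM's actual schemas, and the nesting under a top-level "slim" key follows the earlier suggestion:

```python
# Sketch: recognize a SLiM tree sequence by checking that required keys
# appear in the metadata schemas, while tolerating extra keys from other
# programs. Schema fragments are illustrative, not SLiM's real schemas.
import json

def looks_like_slim(top_level_schema_json, mutation_schema_json):
    top = json.loads(top_level_schema_json)
    mut = json.loads(mutation_schema_json)
    top_keys = top.get("properties", {}).get("slim", {}).get("properties", {})
    mut_keys = mut.get("properties", {})
    return ({"model_type", "generation"} <= set(top_keys)
            and {"selection_coeff", "origin_generation"} <= set(mut_keys))

top = json.dumps({"codec": "json", "properties": {"slim": {"properties": {
    "model_type": {"type": "string"},
    "generation": {"type": "integer"},
    "extra_from_other_program": {"type": "string"},  # extra keys are OK
}}}})
mut = json.dumps({"codec": "json", "properties": {
    "selection_coeff": {"type": "number"},
    "origin_generation": {"type": "integer"},
}})
assert looks_like_slim(top, mut)
```

Because the test is subset-based rather than equality-based, another program can add its own keys without breaking SLiM's recognition of the file.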
OK, the build is green on Travis-CI again now, all tests passing including those for quantile(). So if you rebase this PR onto the current head, hopefully Travis will say this PR is green, too.
So, the Travis build is now succeeding on Xenial but failing on macOS. The macOS failure looks math-library-related; I think the culprit is a header that redefines math functions.
Messing with these kinds of internal math library definitions seems pretty evil to me (although I realize it's trying to work around a compiler bug), especially when done with preprocessor redefinitions like this, and this seems directly related to the build errors on Travis, which look like this:
Whatever bug in clang >= 6.0 that stuff in the header is trying to fix, it breaks the build of C++ files that (directly or indirectly) include that header. More broadly, I would suggest that it might be a bad idea for a header to redefine math-library symbols like this. But it may be that fixing/removing that workaround is the way to go. Anyway, apart from that issue, it looks like the self-tests are passing on Xenial, so it looks like you fixed the bug that was biting you yesterday, so that's great. I'm a little nervous about the nature of the fix, though. This whole "missing data" thing seems like a problem to me; I find it worrisome that tskit might be inferring that "missing data" is present in a SLiM tree sequence just because of a particular pattern of ancestry – a pattern that is (as I understand it) omnipresent in forward simulation, no less! I'm not sure it's the right decision to ship tskit 0.3 with this issue present, and I'm concerned about shipping SLiM 3.5 with this issue present, too. I know Jerome wants to get 0.3 out the door, but I'm concerned that letting this problem go out in a major release may prove to be a mistake. I'd like to see whether the full R test suite for SLiM passes with crosscheck enabled. If it does, then perhaps I'll relax about this, but "ship it wrong and we'll think about it later" just feels like a bad idea to me.
It looks like the clang bug in question was fixed in January of 2019; I don't know how to tell what version of clang/llvm that means the fix was shipped in. If removing that fix is not possible, and moving the math-y stuff to the
I think I'll tag @jeromekelleher here, since it's already Friday afternoon in the UK and @petrelharp probably won't be online for a couple hours. Jerome, see the two comments above; this is a high-priority bug since it breaks C++ code that includes
I'll find some input on the library problem - it'd be nice to fix it in a way that works upstream, too, and I don't know what I'm doing here. About missing data: an isolated sample with no mutations above it is a genome that we don't know anything about, so I think it does make sense that this is 'missing data'. I do think that one that has mutations above it shouldn't be missing. At least the behavior is optional; we should at least put a big note about setting
hah!
Your chief weapon is surprise!
I don't think I agree. It's common to start a SLiM simulation with, e.g., 1000 individuals (2000 genomes), one of which contains a mutation (a beneficial sweep mutation, say), the other 1999 of which are empty. We know the same amount of information about all of them: we know that the one genome contains the sweep mutation, and we know that the other 1999 don't. None of them should be taken to be "missing data" – they are all in a known, deliberately chosen state, and we have definitive information about the presence or absence of the sweep mutation for all 2000 of them. I am very concerned about designating any of them as being "missing" according to some heuristic; I think being designated as "missing" ought to involve some kind of explicit declaration that says "this node is not just empty, it is actually missing". Empty nodes happen all the time in forward simulation. Even if it is presently harmless for tskit to consider empty first-generation individuals "missing", that may not remain true forever; at some point, it may make sense for tskit to treat "missing" nodes differently from other nodes in this or that respect, and at that point, when the "missing" versus "not missing" distinction is made more consequential, I am concerned that that would then expose this past decision as being the wrong choice – but it would then be too late to fix it easily, because it would have become ossified into explicit policy, other code would depend upon it, etc. "Missing data" is not a concept that exists in SLiM, and so I think it is a mistake to allow tskit to infer that any node in SLiM is "missing". Even if it's harmless now, I think it may well come back to bite us later.
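The rule being debated can be sketched concretely. Based on the description in the comments above (not tskit's actual implementation): a sample node is "isolated" on a genomic interval if no edge attaches it to the rest of the tree there, and an isolated sample with no mutation above it is treated as missing. Applying that rule to the sweep scenario just described:

```python
# Sketch of the missing-data heuristic under discussion, as described in
# the comments above; this is an illustration, not tskit's implementation.
def missing_samples(sample_nodes, edges, mutated_nodes):
    """edges: (parent, child) pairs active on some genomic interval."""
    connected = {child for _parent, child in edges}
    return {s for s in sample_nodes
            if s not in connected and s not in mutated_nodes}

# 2000 first-generation genomes; founders have no ancestry edges yet,
# and genome 0 carries the sweep mutation from the example above.
samples = set(range(2000))
edges = []       # forward simulation: first-generation nodes are isolated
mutated = {0}

missing = missing_samples(samples, edges, mutated)
assert 0 not in missing        # the mutated genome is not flagged
assert len(missing) == 1999    # every other founder gets flagged as missing
```

This makes the objection concrete: under the heuristic, 1999 deliberately empty founder genomes are classified as missing, even though their state (no sweep mutation) is fully known to the model.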
Hey. Made some comments on your last commit. Maybe I should have done them as a "review"; does it matter? What's the difference? Anyway, looking pretty good. I should do a more comprehensive code review once you've got things final-ish. Thanks!
I'm still fuzzy on the distinction also, but I think that 'reviewing' lets the comments come through in one discrete chunk (like, I don't get them 'til you're done with the review, so you can wait 'til you're done going through entirely to send them to me), and also gives a more formal way to "approve" the changes. Either way works.
Ok - I think everything is working? I've got things synced with pyslim and passing tests over there, at tskit-dev/pyslim#89 . Now'd be a good time to have a look, @bhaller.
Hi Peter. Great stuff, looks like you had to wade through a lot of tricky things!
I think I've addressed all these comments. Have a look? You should read the metadata schema over here: https://github.com/tskit-dev/pyslim/pull/89/files#diff-e1abd0a63c6a0ff314e018d205d42ca4R17
I think I've replied to your replies as they came at me. Is there a place to see only the loose ends that remain? When I view "Files changed" now it's still showing me lots of conversations that I think we have resolved. What's left?
The "Files changed" tab gives you all the associated changes (don't forget to refresh it!), and if you go to the "Commits" tab you can click on particular commits; all my edits are in this commit except for adding the meaning of values in this commit.
Ok - I think I got all those issues, in 32747cd.
And renaming the constants, in 3cdf4e8.
CMakeLists.txt
Removed:
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -std=c11 -Wno-deprecated-register -Wno-attributes -Wunused-label -Wimplicit -Wunused-variable -Wunused-value -Wno-pragmas -Wimplicit-fallthrough -Wempty-body -Wshadow -Wparentheses -Wdangling-else -Wmissing-prototypes -Wswitch -Wpointer-sign -Wsign-compare -Wstrict-prototypes -Wconversion -Wno-sign-conversion -Wuninitialized")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wno-deprecated-register -Wno-attributes -Wunused-label -Wimplicit -Wunused-variable -Wunused-value -Wno-pragmas -Wimplicit-fallthrough -Wempty-body -Wshadow -Wparentheses -Wdangling-else -Wmissing-prototypes -Wswitch -Wpointer-sign -Wsign-compare -Wstrict-prototypes -Wconversion -Wno-sign-conversion -Wuninitialized")
Added:
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -std=c11 -Wno-attributes -Wunused-label -Wimplicit -Wunused-variable -Wunused-value -Wno-pragmas -Wimplicit-fallthrough -Wempty-body -Wshadow -Wparentheses -Wdangling-else -Wmissing-prototypes -Wswitch -Wpointer-sign -Wsign-compare -Wstrict-prototypes -Wno-sign-conversion -Wuninitialized")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wno-attributes -Wunused-label -Wunused-variable -Wunused-value -Wno-pragmas -Wimplicit-fallthrough -Wempty-body -Wshadow -Wparentheses -Wdangling-else -Wswitch -Wsign-compare -Wno-sign-conversion -Wuninitialized")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm about to commit these same changes on the master branch; I'm trying to build on Ubuntu now and the warnings are bugging me. So you can just revert this file back to the state of master.
Squashed commit messages: time adjustments; passes treerec/tests!; genotypes fix!; top-level metadata; first pass at metadata schemas; schemas; schemas; more schemas; fighting with metadata; pyslim passes tests; update to remove spaces; minor update of tskit C edits; added possible values; fixes; rename constants; change migration_rec_count_ to uint32_t; fix two warnings; clean up warnings on gcc; schema update; update is_null type in schema
OK, I rebased and squashed down a bunch of those commits. I've done the same for pyslim. This could be merged, if that's the plan.
Awesome! I'll do it tomorrow morning, since if there's fallout I don't want to deal with it tonight. Thanks!
OK, passes all tests with crosscheck on. Deleting the branch.
Changes (see the tskit changelog):
- tsk_mutation_table_X methods need to use it.
- tsk_edge_table_X and tsk_migration_table_X now have metadata and metadata length arguments. But, we don't use these at all, so this is trivial.
TODO:
- Remove mutation time from the mutation metadata, and use the mutation time column instead. (I think we should do this? This is a file format change!) (not doing this)
- This hasn't been tested at all yet. It would be nice to have some developer documentation that says how to run the tests.