Add file offset #2501
This pull request is being automatically deployed with Vercel (learn more). 🔍 Inspect: https://vercel.com/seqan/seqan3/456mKAcGmKHXF7jYv8knkvRet379 |
Codecov Report
@@            Coverage Diff             @@
##           master    #2501      +/-   ##
==========================================
+ Coverage   98.21%   98.37%   +0.15%
==========================================
  Files         271      276       +5
  Lines       10829    10947     +118
==========================================
+ Hits        10636    10769     +133
+ Misses        193      178      -15
@joshuak94 I put this decision on the agenda of our next strategy meeting. @seqan/core I put it on the triage board; please remind me that we have to discuss this :) |
@marehr Sounds good! |
By the way, I've done some more reading on Using Edit: I found this quote here
So now my question is, do we need to know the shift states for BAM files? I'm inclined to say we don't. |
The thing is that you need both (offset + state) to make sure to get the same absolute position, and you are on the other hand right, that we always use a mode that should be agnostic to this (mostly, because we always try to read byte's). I can understand that it might be a waste of space if you store many of them (I tried it out and I'm in slight favour for
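To make the offset-versus-state trade-off concrete, here is a minimal sketch using only the standard library. `std::streampos` is `std::fpos<std::mbstate_t>`, so it carries a conversion (shift) state next to the offset, while `std::streamoff` is a plain integer offset. The helper names `drop_state`, `restore` and `line_at` are hypothetical and not part of this PR; the sketch assumes a byte-oriented stream where the state carries no information.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: keeps only the offset part of a stream position.
inline std::streamoff drop_state(std::streampos const pos)
{
    return pos; // std::fpos converts to std::streamoff, discarding the state
}

// Hypothetical helper: rebuilds a position from a bare offset.
// The conversion state of the result is value-initialised.
inline std::streampos restore(std::streamoff const offset)
{
    assert(offset >= 0);
    return std::streampos(offset);
}

// Reads the line that starts at the given bare offset.
inline std::string line_at(std::istream & in, std::streamoff const offset)
{
    in.clear();               // reset eof/fail flags from earlier reads
    in.seekg(restore(offset));
    std::string line;
    std::getline(in, line);
    return line;
}
```

If many positions are stored, keeping only the `std::streamoff` is the more compact option; the price is that the shift state is lost, which only matters for stateful multibyte encodings.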
Can you explain this a bit more? Why is |
I mean that |
8eb2208
to
56c77b8
Compare
Core-Meeting 2021-04-12
Hi all, we have the following idea to make the change smaller in scope, but as usable as you want it to be:
I hope this example is enough to show how the interface should be defined:
Before:
sam_file.seek(position);
for (auto && record: sam_file)
{
    record.file_offset();
}
Now:
auto it = sam_file.begin(); // reads in header, is now at the position of the first record
auto end = sam_file.end();
for (; it != end; ++it)
{
    // the following should have no "effect"
    // (of course it re-seeks internally and is a waste of time, but the returned record is the same :))
    it.seek_to(it.file_position());
    *it; // returns record
    *it.seek_to(it.file_position()); // returns the record, i.e. seek_to() -> iterator&; returns the iterator itself
    it.seek_to(some_other_position); // the iterator seeks to that position and ++it will go on from there
}
That means the iterator gets two new functions: file_position() and seek_to().
In any case, if you have any question, we can talk about the changes :) |
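The proposed iterator interface can be tried out on a plain `std::istream`. The following is only a sketch under assumptions: `line_iterator` is a hypothetical stand-in where one line equals one record, and it is not the seqan3 `in_file_iterator`. It throws `std::runtime_error` because `std::exception` has no standard constructor that takes a message.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a file iterator: one record == one line.
class line_iterator
{
public:
    explicit line_iterator(std::istream & stream) : stream_{&stream}
    {
        read_next(); // like begin(): positions the iterator on the first record
    }

    // Position at which the current record starts.
    std::streampos file_position() const { return record_start_; }

    // Re-positions the stream and reads the record found there.
    line_iterator & seek_to(std::streampos const pos)
    {
        assert(stream_ != nullptr);
        stream_->clear(); // drop a possible eof/fail flag first
        stream_->seekg(pos);
        if (stream_->fail())
            throw std::runtime_error{"Seeking to file position failed!"};
        read_next();
        return *this;
    }

    std::string const & operator*() const { return record_; }
    line_iterator & operator++() { read_next(); return *this; }
    bool at_end() const { return at_end_; }

private:
    void read_next()
    {
        assert(stream_ != nullptr);
        record_start_ = stream_->tellg();
        at_end_ = !std::getline(*stream_, record_);
    }

    std::istream * stream_;
    std::streampos record_start_{};
    std::string record_{};
    bool at_end_{false};
};
```

With this sketch, `*it.seek_to(it.file_position())` yields the same record as `*it`, and `++it` after a `seek_to(other)` continues with the record that follows `other`.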
@@ -333,6 +333,15 @@ struct sam_file_read<seqan3::format_bam> : public sam_file_data
'\x12', '\x48', '\x00', '\x02', '\x02', '\x03', '\x62', '\x48', '\x48', '\x31', '\x41', '\x45', '\x33',
'\x30', '\x00'
};

std::vector<std::streampos> file_offsets
Note to myself: this does not test a real BAM file, but an already decompressed (un-BGZF'ed) BAM file, and thus does not test file offset seeking.
@marehr One question! For the tests, where should I put them? The in_file_iterator_test only has a dummy type, which I'm pretty sure will not work with the Actually, is there any way within the |
You are right, it seems that we need to adapt that test to actually handle a seqan3 "input_file". It seems that iterator implementation requires the following things now:
I looked at it, and you would need to change the underlying data structure from
That seems to be missing right now in our test infrastructure. I would suggest adding a new test case (I would create a new file) that specifically tests this on a manually crafted BAM file that has maybe 3 alignments stored, where each alignment is in its own BGZF-Block. I.e. a BAM File with one BGZF-Block for the header, three BGZF-Blocks for three alignments and one STOP-BGZF-Block. I can help you with that if you need help :) (That will show that the expected positions are not byte-positions, but a special addressing scheme)
|
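As an illustration of why positions in a BGZF-compressed file are not plain byte positions: the SAM/BAM specification defines a "virtual file offset" that packs the byte offset of the compressed block (upper 48 bits) together with the offset inside the decompressed block (lower 16 bits). Whether the stream in this PR encodes its positions exactly this way is not stated here, so treat the sketch below as an example of such an addressing scheme; `virtual_offset`, `pack` and `unpack` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type mirroring the two parts of a BGZF virtual file offset.
struct virtual_offset
{
    std::uint64_t block_start;  // byte offset of the BGZF block in the compressed file
    std::uint16_t within_block; // byte offset inside the decompressed block
};

// Packs both parts into one 64-bit value: block_start << 16 | within_block.
inline std::uint64_t pack(virtual_offset const v)
{
    assert(v.block_start < (std::uint64_t{1} << 48)); // only 48 bits are available
    return (v.block_start << 16) | v.within_block;
}

// Splits a packed value back into its two parts.
inline virtual_offset unpack(std::uint64_t const packed)
{
    return {packed >> 16, static_cast<std::uint16_t>(packed & 0xFFFFu)};
}
```

Packed values still compare in file order, but their difference is not a byte distance, which is what a test with one alignment per BGZF block would make visible.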
Right, is there a place for me to specifically require these things? From what I see of the iterator class,
I might need help with creating this file...! |
@marehr I'm probably doing this incorrectly, but here's what my current implementation looks like:
//!\brief Returns the current position in the file via `std::streampos`.
std::streampos const file_position()
{
    assert(host != nullptr);
    return host->tellg();
}
//!\brief Low level API. Sets the current position of the iterator to the given position, throws if at end of file.
in_file_iterator & seek_to(std::streampos const & pos)
{
    assert(host != nullptr);
    host->seekg(pos);
    host->read_next_record();
    if (host->fail())
    {
        throw std::exception("Seeking to file position failed!\n");
    }
    return *this;
}
I'm not sure why host needs to have a secondary stream. To be honest, I'm not quite sure how the secondary_stream and primary_stream interact with one another. Are you saying that in my above code, I should have
// In file_position()
return host->secondary_stream->tellg();
// In seek_to()
host->secondary_stream->seekg(pos);
Like this? |
You are right,
Can you give me a sam_file that you want to test? I can convert it into such a bam file. We could also re-use a sam_file from the test. Maybe this one seqan3/test/unit/io/sam_file/format_sam_test.cpp Lines 34 to 40 in a3a4267
This looks right :) |
bf8f947
to
a4cb5ea
Compare
@marehr Here is the first implementation of this... I'm 100% sure the new test file is not the best way to do that. |
a4cb5ea
to
54ae402
Compare
@marehr I'm not sure what the best way is to change the tests such that they all pass... You probably have a better idea! |
@marehr I rebased it... Still having the issues with the CI tests though, if you have any solutions! |
54bb6f2
to
91bc126
Compare
b492dee
to
44a24a2
Compare
@marehr Thanks so much for your help! Just wondering regarding the custom SAM hash function, why was it necessary? And was it related to the file position stuff? |
It fixes a different problem and could be its own PR. Maybe I should split it from this PR. |
Two style things maybe, the rest looks good to me!
test/include/seqan3/test/fixture/io/sam_file/simple_three_verbose_reads_fixture.hpp
…ing the position_buffer of the different files.
@Irallia Sorry you were too fast, and I rebased in the meantime to clean up the history. I hope this is fine. I added a new seek test for the sequences. |
LGFM
@SGSSGene Can you do 2nd review? |
@marehr does a change like this require a changelog entry?
No, currently it is "noapi" / detail code and is only a "feature" for us. |
Co-authored-by: Simon Gene Gottlieb <[email protected]>
Looks good! (sorry for delaying this PR so much)
Adds a new functionality to sam_file_input, namely the ability to store the file offset of a read, and then later to seek to the given file offset. This is done by adding seqan3::field::file_offset as a field, and adding the function sam_file_input::seek.
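The store-then-seek workflow described above can be sketched with standard streams alone. This is not the seqan3 API: `index_records` and `record_at` are hypothetical helpers, and a "read" is simply one line of text here.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: records the start position of every line ("read").
inline std::vector<std::streampos> index_records(std::istream & in)
{
    std::vector<std::streampos> offsets;
    std::string line;
    for (std::streampos pos = in.tellg(); std::getline(in, line); pos = in.tellg())
        offsets.push_back(pos);
    in.clear(); // reading to the end set eof/fail; reset for later seeks
    return offsets;
}

// Hypothetical helper: jumps to a stored offset and returns the read found there.
inline std::string record_at(std::istream & in, std::streampos const pos)
{
    in.seekg(pos);
    assert(!in.fail());
    std::string line;
    std::getline(in, line);
    return line;
}
```

Storing one offset per read in a first pass allows reads to be revisited in any order later without re-parsing everything before them.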