Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file offset #2501

Merged
merged 11 commits into from
Sep 23, 2021
Merged

Add file offset #2501

merged 11 commits into from
Sep 23, 2021

Conversation

joshuak94
Copy link
Contributor

Adds a new functionality to sam_file_input, namely the ability to store the file offset of a read, and then later to seek to the given file offset. This is done by adding seqan3::field::file_offset as a field, and adding the function sam_file_input::seek.

@vercel
Copy link

vercel bot commented Mar 29, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/seqan/seqan3/456mKAcGmKHXF7jYv8knkvRet379
✅ Preview: https://seqan3-git-fork-joshuak94-addfileoffset-seqan.vercel.app

@joshuak94 joshuak94 requested a review from marehr March 29, 2021 17:00
@joshuak94 joshuak94 marked this pull request as draft March 29, 2021 17:01
@codecov
Copy link

codecov bot commented Mar 29, 2021

Codecov Report

Merging #2501 (a916a35) into master (cd2b2d7) will increase coverage by 0.15%.
The diff coverage is 98.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2501      +/-   ##
==========================================
+ Coverage   98.21%   98.37%   +0.15%     
==========================================
  Files         271      276       +5     
  Lines       10829    10947     +118     
==========================================
+ Hits        10636    10769     +133     
+ Misses        193      178      -15     
Impacted Files Coverage Δ
include/seqan3/io/sam_file/format_bam.hpp 96.37% <ø> (+0.08%) ⬆️
include/seqan3/io/sam_file/format_sam.hpp 97.47% <ø> (+0.06%) ⬆️
include/seqan3/io/sam_file/input.hpp 100.00% <ø> (ø)
...nclude/seqan3/io/sam_file/input_format_concept.hpp 100.00% <ø> (ø)
...e/seqan3/io/sequence_file/input_format_concept.hpp 100.00% <ø> (ø)
include/seqan3/io/detail/in_file_iterator.hpp 96.96% <90.90%> (-3.04%) ⬇️
include/seqan3/io/sam_file/header.hpp 100.00% <100.00%> (ø)
include/seqan3/io/sequence_file/format_embl.hpp 100.00% <100.00%> (ø)
include/seqan3/io/sequence_file/format_fasta.hpp 98.70% <100.00%> (+0.01%) ⬆️
include/seqan3/io/sequence_file/format_fastq.hpp 100.00% <100.00%> (ø)
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd2b2d7...a916a35. Read the comment docs.

@joshuak94 joshuak94 marked this pull request as ready for review March 30, 2021 10:56
@marehr
Copy link
Member

marehr commented Mar 30, 2021

@joshuak94 I put this decision in our next strategy meeting

@seqan/core I put that into triage board, please remember me that we have to discuss this :)

@joshuak94
Copy link
Contributor Author

@marehr Sounds good!

@joshuak94
Copy link
Contributor Author

joshuak94 commented Mar 30, 2021

By the way, I've done some more reading on std::streampos and I was wondering if you could explain std::streampos vs std::streamoff. I see that the former stores also the "shift state", but I'm not really sure what that means and if it's really necessary for our case, because then I think I'd rather use std::streamoff here.

Using std::streamoff is nice because the size is defined as long long, so I could store the file offsets in an index using an int64_t. Interestingly, BAM index format stores the offsets int uint64_t, even though -1 is a valid value for it.

Edit: I found this quote here

To illustrate shift state and shift sequences, suppose we decide that the sequence 0200 (just one byte) enters Japanese mode, in which pairs of bytes in the range from 0240 to 0377 are single characters, while 0201 enters Latin-1 mode, in which single bytes in the range from 0240 to 0377 are characters, and interpreted according to the ISO Latin-1 character set. This is a multibyte code that has two alternative shift states (“Japanese mode” and “Latin-1 mode”), and two shift sequences that specify particular shift states.

So now my question is, do we need to know the shift states for BAM files? I'm inclined to say we don't.

@marehr
Copy link
Member

marehr commented Mar 31, 2021

By the way, I've done some more reading on std::streampos and I was wondering if you could explain std::streampos vs std::streamoff. I see that the former stores also the "shift state", but I'm not really sure what that means and if it's really necessary for our case, because then I think I'd rather use std::streamoff here.

The thing is that you need both (offset + state) to make sure to get the same absolute position, and you are on the other hand right, that we always use a mode that should be agnostic to this (mostly, because we always try to read byte's).

I can understand that it might be a waste of space if you store many of them (I tried it out and sizeof(std::streampos) == sizeof(std::streamoff) + 8 which doubles the size on my system), but you are always able to convert it into a std::streamoff.

I'm in slight favour for std::streampos as return type, as this has the specific semantic of an absolute position into a file (which might be different from the number of bytes read). Furthermore, I'm not too sure if relative offset-positions will always be correctly implemented as functions like seekg primarily expect absolute positions (our bgzf-stream for example only supports stream offsets from the beginning of a file, other io implementations might only support some certain offset-directions, but all of them should support absolute positioning).

Interestingly, BAM index format stores the offsets int uint64_t, even though -1 is a valid value for it.

Can you explain this a bit more? Why is -1 a valid value? I mean -1 will underflow and produces 2^64-1, of course it is a valid value, but is it an unexpected value or is it a special value?

@joshuak94
Copy link
Contributor Author

I mean that std::streampos and std::streamoff both can have a value of -1, according to cppreference. But in the SAM format standard, they store their file offsets as uint64_t types.

@marehr
Copy link
Member

marehr commented Apr 12, 2021

Core-Meeting 2021-04-12

Hi all, we have the following idea to make the change smaller in scope, but as usable as you want it to be:

  • We want to use std::streampos instead of std::streamoff, as some implementations, like BGZF streams, have no meaningful "offset" implementation. Thus, we want the naming file_position instead of file_offset, as it is an absolute position into a file and not some offset between two positions.
  • We want to have the seek/position feature as a low-level API that sits directly in the iterator.
    (If we ever want to support memory mapping, setting a global position in the file itself seems like some unnecessary global state to store)
  • We found the hidden requirement that all "secondary streams" (e.g., BGZF uncompressing) must support global positioning.

I hope this example is enough to show how the interface should be defined:

Before:

sam_file.seek(position);
for (auto && record: sam_file)
{
  record.file_offset();
}

Now:

auto it = sam_file.begin(); // reads in header, is now at the position of the first record
auto end = sam_file.end();
for (; it != end; ++it)
{
    // the follwing should have no "effect"
    // (of course it re-seeks internally and is a waste of time, but the returned record is the same :))
    it.seek_to(it.file_position());
    *it; // returns record
     
    *it.seek_to(it.file_position()); // returns the record, i.e. seek_to() -> iterator&; returns the iterator itself

    it.seek_to(some_other_position); // the iterator will be seeked to that position and ++it will go on from there
}

That means the iterator gets two new functions

  • We think you can add the functionality to include/seqan3/io/detail/in_file_iterator.hpp
  • As that iterator is currently non-public API, we keep that functionality as a "hidden" feature.
  • it.file_position() -> std::streampos gives the current file (absolute) position of the record
    • document the function as a low-level API
  • it.seek_to(std::streampos{50}) -> iterator & seek to the given position and read / buffer the record
  • Any suggestions? Have we missed something?

In any case, if you have any question, we can talk about the changes :)

@@ -333,6 +333,15 @@ struct sam_file_read<seqan3::format_bam> : public sam_file_data
'\x12', '\x48', '\x00', '\x02', '\x02', '\x03', '\x62', '\x48', '\x48', '\x31', '\x41', '\x45', '\x33',
'\x30', '\x00'
};

std::vector<std::streampos> file_offsets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself: this does not test the real BAM file, but an already decompressed (un'bgzf'ed) BAM file.

And thus does not test file offset seeking

@joshuak94
Copy link
Contributor Author

joshuak94 commented Apr 12, 2021

@marehr One question! For the tests, where should I put them? The in_file_iterator_test only has a dummy type, which I'm pretty sure will not work with the tellg() function... In sam_file_input_test it seems the tests are all done with an istringstream as input, not a file (outside of empty files). Do we have any tests which parse actual files?

Actually, is there any way within the file_position() function to make sure host is an ifstream?

@marehr
Copy link
Member

marehr commented Apr 13, 2021

@marehr One question! For the tests, where should I put them? The in_file_iterator_test only has a dummy type, which I'm pretty sure will not work with the tellg() function...

You are right, it seems that we need to adapt that test to actually handle a seqan3 "input_file". It seems that iterator implementation requires the following things now:

  • file_type has a member function host->read_next_record()
  • file_type has a member host->record_buffer
  • file_type has a member host->at_end
  • file_type has a member host->secondary_stream which is a std::istream (NEW requirement)

I looked at it, and you would need to change the underlying data structure from std::vector<int> to std::istream and a bit how the fake_file is being constructed. I can help you if you want :)

In sam_file_input_test it seems the tests are all done with an istringstream as input, not a file (outside of empty files). Do we have any tests which parse actual files?

That seems to be missing right now in our test infrastructure. I would suggest adding a new test case (I would create a new file) that specifically tests this on a manually crafted BAM file that has maybe 3 alignments stored, where each alignment is in its own BGZF-Block. I.e. a BAM File with one BGZF-Block for the header, three BGZF-Blocks for three alignments and one STOP-BGZF-Block. I can help you with that if you need help :)

(That will show that the expected positions are not byte-positions, but a special addressing scheme)

Actually, is there any way within the file_position() function to make sure host is an ifstream?

@joshuak94
Copy link
Contributor Author

@marehr One question! For the tests, where should I put them? The in_file_iterator_test only has a dummy type, which I'm pretty sure will not work with the tellg() function...

You are right, it seems that we need to adapt that test to actually handle a seqan3 "input_file". It seems that iterator implementation requires the following things now:

* file_type has a member function `host->read_next_record()`

* file_type has a member `host->record_buffer`

* file_type has a member `host->at_end`

* file_type has a member `host->secondary_stream` which is a `std::istream` (NEW requirement)

Right, is there a place for me to specifically require these things? From what I see of the iterator class, file_type doesn't seem to have any explicit requirements.

I looked at it, and you would need to change the underlying data structure from std::vector<int> to std::istream and a bit how the fake_file is being constructed. I can help you if you want :)

In sam_file_input_test it seems the tests are all done with an istringstream as input, not a file (outside of empty files). Do we have any tests which parse actual files?

That seems to be missing right now in our test infrastructure. I would suggest adding a new test case (I would create a new file) that specifically tests this on a manually crafted BAM file that has maybe 3 alignments stored, where each alignment is in its own BGZF-Block. I.e. a BAM File with one BGZF-Block for the header, three BGZF-Blocks for three alignments and one STOP-BGZF-Block. I can help you with that if you need help :)

(That will show that the expected positions are not byte-positions, but a special addressing scheme)

I might need help with creating this file...!

@joshuak94
Copy link
Contributor Author

@marehr I'm probably doing this incorrectly, but here's what my current implementation looks like:

     //!\brief Returns the current position in the file via `std::streampos`.
     std::streampos const file_position()
     {
         assert(host != nullptr);
         return host->tellg();
     }

     //!\brief Low level API. Sets the current position of the iterator to the given position, throws if at end of file.
     in_file_iterator & seek_to(std::streampos const & pos)
     {
         assert(host != nullptr);
         host->seekg(pos);
         host->read_next_record();
         if (host->fail())
         {
             throw std::exception("Seeking to file position failed!\n");
         }
         return *this;
     }

I"m not sure why host needs to have a secondary stream. To be honest, I'm not quite sure how the secondary_stream/primary_stream are interacting with one another. Are you saying that in my above code, I should have

        // In file_position()
        return host->secondary_stream->tellg();
        
        // In seek_to()
        host->secondary_stream->seekg(pos);

Like this?

@marehr
Copy link
Member

marehr commented Apr 13, 2021

Right, is there a place for me to specifically require these things? From what I see of the iterator class, file_type doesn't seem to have any explicit requirements.

You are right, seqan3::detail::in_file_iterator is a non-public API, and file_type has this hard requirement only implicitly. The description of that class is also completely out-dated. xD

I might need help with creating this file...!

Can you give me a sam_file that you want to test? I can convert it into such a bam file. We could also re-use a sam_file from the test. Maybe this one

std::string simple_three_reads_input{
R"(@HD VN:1.6
@SQ SN:ref LN:34
read1 41 ref 1 61 1S1M1D1M1I ref 10 300 ACGT !##$ AS:i:2 NM:i:7
read2 42 ref 2 62 1H7M1D1M1S2H ref 10 300 AGGCTGNAG !##$&'()* xy:B:S,3,4,5
read3 43 ref 3 63 1S1M1P1M1I1M1I1D1M1S ref 10 300 GGAGTATA !!*+,-./
)"};
?

@marehr I'm probably doing this incorrectly, but here's what my current implementation looks like:
I"m not sure why host needs to have a secondary stream. To be honest, I'm not quite sure how the secondary_stream/primary_stream are interacting with one another. Are you saying that in my above code, I should have

        // In file_position()
        return host->secondary_stream->tellg();
        
        // In seek_to()
        host->secondary_stream->seekg(pos);

Like this?

This looks right :)

@marehr
Copy link
Member

marehr commented Apr 13, 2021

@joshuak94
Copy link
Contributor Author

@marehr Here is the first implementation of this... I'm 100% sure the new test file is not the best way to do that.

@joshuak94
Copy link
Contributor Author

@marehr I'm not sure what the best way is to change the tests such that they all pass... You probably have a better idea!

@eseiler eseiler requested a review from marehr April 20, 2021 14:35
@joshuak94
Copy link
Contributor Author

@marehr I rebased it... Still having the issues with the CI tests though, if you have any solutions!

@joshuak94 joshuak94 force-pushed the add_file_offset branch 2 times, most recently from b492dee to 44a24a2 Compare June 14, 2021 11:35
@marehr marehr requested a review from Irallia July 26, 2021 08:21
@joshuak94
Copy link
Contributor Author

@marehr Thanks so much for your help! Just wondering regarding the custom SAM hash function, why was it necessary? And was it related to the file position stuff?

@marehr
Copy link
Member

marehr commented Jul 26, 2021

@marehr Thanks so much for your help! Just wondering regarding the custom SAM hash function, why was it necessary? And was it related to the file position stuff?

It fixes a different problem and could be its own PR. Maybe I should split it from this PR.

Copy link
Contributor

@Irallia Irallia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two style things maybe, the rest looks good to me!

@marehr marehr force-pushed the add_file_offset branch from 691cda5 to 7cf2c10 Compare July 27, 2021 15:00
@marehr marehr force-pushed the add_file_offset branch from 7cf2c10 to 652dc2f Compare July 27, 2021 15:06
@marehr
Copy link
Member

marehr commented Jul 27, 2021

@Irallia Sorry you were too fast, and I rebased in the meantime to clean up the history. I hope this is fine. I added a new seek test for the sequences.

@marehr marehr requested a review from Irallia July 27, 2021 15:08
Copy link
Contributor

@Irallia Irallia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGFM

@Irallia Irallia requested a review from eseiler July 28, 2021 08:24
@marehr marehr requested review from SGSSGene and removed request for eseiler July 28, 2021 09:05
@marehr
Copy link
Member

marehr commented Jul 28, 2021

@SGSSGene Can you do 2nd review?

Copy link
Contributor

@SGSSGene SGSSGene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marehr does a change like this require a changelog entry?

include/seqan3/io/detail/in_file_iterator.hpp Outdated Show resolved Hide resolved
@marehr
Copy link
Member

marehr commented Aug 31, 2021

@marehr does a change like this require a changelog entry?

No, currently it is "noapi" / detail code and is only a "feature" for us.

@marehr marehr requested a review from SGSSGene August 31, 2021 12:49
Copy link
Contributor

@SGSSGene SGSSGene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! (sorry for delaying this PR so much)

@marehr marehr merged commit 2e6f31f into seqan:master Sep 23, 2021
@joshuak94 joshuak94 deleted the add_file_offset branch October 14, 2021 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants