-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand CRAM API a bit to cope with new samtools cram_size command. #1546
Conversation
- cram_block_get_content_id is updated to return -1 for CORE block, removing the need to add a content_type API and share that enum. - New cram_block_get_method method - New cram_codec_get_content_ids, which externalises the existing cram_codec_to_id (with a hopefully better API and name). - Make cram_decode_compression_header and cram_free_compression_header external. The struct itself is opaque. - Added cram_update_cid2ds_map, cram_cid2ds_query and cram_cid2ds_free to produce, query and free a block Content-ID to Data-Series mapping from the meta-data held within a CRAM Compression Header. - Added a new cram_codec iterator to sequentially step through all cram_codec structures produced by decoding a compression header. This new code is entirely internal to htslib and is solely used by the new cram_update_cid2ds_map function. - Add a codec->describe method. This produces a text description of codec parameters, such as EXTERNAL(id=1) HUFFMAN(codes={5,1,3},lengths={1,2,2}) BYTE_ARRAY_LEN(len_codec={EXTERNAL(id=42)},val_codec={EXTERNAL(id=37)} Adds the external API: int cram_codec_describe(cram_codec *c, kstring_t *ks) TODO: Some of the proposed CRAM 4.0 codecs don't yet have descriptions, but these aren't official codecs anyway. - Add a function to describe all encodings in a compression header block: int cram_describe_encodings(cram_block_compression_hdr *hdr, kstring_t *ks); - Add cram_container_get_num_records and cram_container_get_num_bases functions. These are two new external APIs for querying sequence meta-data from containers, which in theory could be used for a (non-filtering) "view -c" cmd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the functions in the header file could do with a bit more documentation.
Looking at the associated samtools PR samtools/samtools#1777, I wonder if more of the code should be moved here. A good candidate would be the cram_block_expand_method()
function as it would allow the messy grubbing around in data structures to be safely encapsulated in HTSlib. And if we did that, I might also suggest returning a struct
so that the codec description could be broken out to make it easier to interpret...
ceb0b3c
to
20063d6
Compare
I've made a public version of I'll think on this, but will probably end up with something in htscodecs and a returned structure as you suggest. I may even add the short and long string form as elements to that struct too, or write dedicated functions. Undecided atm. |
6726303
to
2cf80b9
Compare
Well, following advice I've moved code from the samtools cram_size code into where it fits best (htscodecs rather than htslib given it's data codec querying rather than specifically CRAM), but it's messy! Specifically samtools has no way of using htscodecs directly, so it'll require a whole raft of build system hacks. Sigh. (For now I have it copied in there so I can check if it compiles and passes CI, but it will be fixed.) I don't really want to bless the cram_size code as a canonical enum for an expanded list of codecs numbers, because it's rather arbitrary and possibly subject to change. Putting it in htslib or htscodecs means we can't change it, unless we end up copying the whole of cram_size there and make it just fill out a kstring, but that's also wrong as we're punting one-use application code to the library. I'm starting to think I should have left it all as it was. |
2cf80b9
to
794fa70
Compare
794fa70
to
f0ba60a
Compare
I resolved it by backing out the changes to htscodecs and punted them to htslib instead. Not ideal as it doesn't quite feel right, but it makes things vastly simpler for samtools and I don't think the minimal difference justifies the extra complexity with putting those functions in htscodecs. |
cram_block_get_content_id is updated to return -1 for CORE block, removing the need to add a content_type API and share that enum.
New cram_block_get_method method
New cram_codec_get_content_ids, which externalises the existing cram_codec_to_id (with a hopefully better API and name).
Make cram_decode_compression_header and cram_free_compression_header external. The struct itself is opaque.
Added cram_update_cid2ds_map, cram_cid2ds_query and cram_cid2ds_free to produce, query and free a block Content-ID to Data-Series mapping from the meta-data held within a CRAM Compression Header.
Added a new cram_codec iterator to sequentially step through all cram_codec structures produced by decoding a compression header. This new code is entirely internal to htslib and is solely used by the new cram_update_cid2ds_map function.
Add a codec->describe method. This produces a text description of codec parameters, such as
EXTERNAL(id=1)
HUFFMAN(codes={5,1,3},lengths={1,2,2})
BYTE_ARRAY_LEN(len_codec={EXTERNAL(id=42)},val_codec={EXTERNAL(id=37)}
Adds the external API: int cram_codec_describe(cram_codec *c, kstring_t *ks) TODO: Some of the proposed CRAM 4.0 codecs don't yet have descriptions, but these aren't official codecs anyway.
Add a function to describe all encodings in a compression header block: int cram_describe_encodings(cram_block_compression_hdr *hdr, kstring_t *ks);
Add cram_container_get_num_records and cram_container_get_num_bases functions. These are two new external APIs for querying sequence meta-data from containers, which in theory could be used for a (non-filtering) "view -c" cmd.