Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand CRAM API a bit to cope with new samtools cram_size command. #1546

Merged
merged 3 commits into from
Jan 20, 2023

Conversation

jkbonfield
Copy link
Contributor

  • cram_block_get_content_id is updated to return -1 for CORE block, removing the need to add a content_type API and share that enum.

  • New cram_block_get_method method

  • New cram_codec_get_content_ids, which externalises the existing cram_codec_to_id (with a hopefully better API and name).

  • Make cram_decode_compression_header and cram_free_compression_header external. The struct itself is opaque.

  • Added cram_update_cid2ds_map, cram_cid2ds_query and cram_cid2ds_free to produce, query and free a block Content-ID to Data-Series mapping from the meta-data held within a CRAM Compression Header.

  • Added a new cram_codec iterator to sequentially step through all cram_codec structures produced by decoding a compression header. This new code is entirely internal to htslib and is solely used by the new cram_update_cid2ds_map function.

  • Add a codec->describe method. This produces a text description of codec parameters, such as

    EXTERNAL(id=1)
    HUFFMAN(codes={5,1,3},lengths={1,2,2})
    BYTE_ARRAY_LEN(len_codec={EXTERNAL(id=42)},val_codec={EXTERNAL(id=37)}

    Adds the external API: int cram_codec_describe(cram_codec *c, kstring_t *ks) TODO: Some of the proposed CRAM 4.0 codecs don't yet have descriptions, but these aren't official codecs anyway.

  • Add a function to describe all encodings in a compression header block: int cram_describe_encodings(cram_block_compression_hdr *hdr, kstring_t *ks);

  • Add cram_container_get_num_records and cram_container_get_num_bases functions. These are two new external APIs for querying sequence meta-data from containers, which in theory could be used for a (non-filtering) "view -c" cmd.

- cram_block_get_content_id is updated to return -1 for CORE block,
  removing the need to add a content_type API and share that enum.

- New cram_block_get_method method

- New cram_codec_get_content_ids, which externalises the existing
  cram_codec_to_id (with a hopefully better API and name).

- Make cram_decode_compression_header and cram_free_compression_header
  external.  The struct itself is opaque.

- Added cram_update_cid2ds_map, cram_cid2ds_query and cram_cid2ds_free
  to produce, query and free a block Content-ID to Data-Series mapping
  from the meta-data held within a CRAM Compression Header.

- Added a new cram_codec iterator to sequentially step through all
  cram_codec structures produced by decoding a compression header.
  This new code is entirely internal to htslib and is solely used by
  the new cram_update_cid2ds_map function.

- Add a codec->describe method.
  This produces a text description of codec parameters, such as

    EXTERNAL(id=1)
    HUFFMAN(codes={5,1,3},lengths={1,2,2})
    BYTE_ARRAY_LEN(len_codec={EXTERNAL(id=42)},val_codec={EXTERNAL(id=37)}

  Adds the external API: int cram_codec_describe(cram_codec *c, kstring_t *ks)
  TODO: Some of the proposed CRAM 4.0 codecs don't yet have
  descriptions, but these aren't official codecs anyway.

- Add a function to describe all encodings in a compression header block:
    int cram_describe_encodings(cram_block_compression_hdr *hdr, kstring_t *ks);

- Add cram_container_get_num_records and cram_container_get_num_bases functions.
  These are two new external APIs for querying sequence meta-data from
  containers, which in theory could be used for a (non-filtering) "view -c" cmd.
Copy link
Member

@daviesrob daviesrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the functions in the header file could do with a bit more documentation.

Looking at the associated samtools PR samtools/samtools#1777, I wonder if more of the code should be moved here. A good candidate would be the cram_block_expand_method() function as it would allow the messy grubbing around in data structures to be safely encapsulated in HTSlib. And if we did that, I might also suggest returning a struct so that the codec description could be broken out to make it easier to interpret...

cram/cram_external.c Outdated Show resolved Hide resolved
cram/cram_external.c Outdated Show resolved Hide resolved
cram/cram_external.c Outdated Show resolved Hide resolved
htslib/cram.h Outdated Show resolved Hide resolved
@jkbonfield
Copy link
Contributor Author

I've made a public version of enum cram_block_method, avoiding the problems we had before with shortened enums that could clash. This matches the CRAM specs and doesn't expand upon it to describe any of the specialisations of the codecs.

I'll think on this, but will probably end up with something in htscodecs and a returned structure as you suggest. I may even add the short and long string form as elements to that struct too, or write dedicated functions. Undecided atm.

@jkbonfield
Copy link
Contributor Author

Well, following advice I've moved code from the samtools cram_size code into where it fits best (htscodecs rather than htslib given it's data codec querying rather than specifically CRAM), but it's messy! Specifically samtools has no way of using htscodecs directly, so it'll require a whole raft of build system hacks. Sigh. (For now I have it copied in there so I can check if it compiles and passes CI, but it will be fixed.)

I don't really want to bless the cram_size code as a canonical enum for an expanded list of codecs numbers, because it's rather arbitrary and possibly subject to change. Putting it in htslib or htscodecs means we can't change it, unless we end up copying the whole of cram_size there and make it just fill out a kstring, but that's also wrong as we're punting one-use application code to the library.

I'm starting to think I should have left it all as it was.

@jkbonfield
Copy link
Contributor Author

I resolved it by backing out the changes to htscodecs and punted them to htslib instead. Not ideal as it doesn't quite feel right, but it makes things vastly simpler for samtools and I don't think the minimal difference justifies the extra complexity with putting those functions in htscodecs.

@daviesrob daviesrob self-assigned this Jan 17, 2023
cram/cram_external.c Outdated Show resolved Hide resolved
cram/cram_external.c Show resolved Hide resolved
htslib/cram.h Outdated Show resolved Hide resolved
htslib/cram.h Outdated Show resolved Hide resolved
@daviesrob daviesrob merged commit 826ceea into samtools:develop Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants