bgzip max compression level should be 12 when compiled with libdeflate #1477

Closed
ghuls opened this issue Jul 20, 2022 · 2 comments · Fixed by #1488

ghuls commented Jul 20, 2022

bgzip max compression level should be 12 when compiled with libdeflate:

❯  bgzip -h

Version: 1.15.1
Usage:   bgzip [OPTIONS] [FILE] ...
Options:
   -b, --offset INT           decompress at virtual file pointer (0-based uncompressed offset)
   -c, --stdout               write on standard output, keep original files unchanged
   -d, --decompress           decompress
   -f, --force                overwrite files without asking
   -g, --rebgzip              use an index file to bgzip a file
   -h, --help                 give this help
   -i, --index                compress and create BGZF index
   -I, --index-name FILE      name of BGZF index file [file.gz.gzi]
   -k, --keep                 don't delete input files during operation
   -l, --compress-level INT   Compression level to use when compressing; 0 to 9, or -1 for default [-1]
   -r, --reindex              (re)index compressed file
   -s, --size INT             decompress INT bytes (uncompressed size)
   -t, --test                 test integrity of compressed file
   -@, --threads INT          number of compression threads to use [1]

From the libdeflate README:

"For this reason, the commonly used zlib library provides nine compression levels. Level 1 is the fastest but provides the worst compression; level 9 provides the best compression but is the slowest. It defaults to level 6. libdeflate uses this same design but is designed to improve on both zlib's performance and compression ratio at every compression level. In addition, libdeflate's levels go up to 12 to make room for a minimum-cost-path based algorithm (sometimes called "optimal parsing") that can significantly improve on zlib's compression ratio."

https://github.com/ebiggers/libdeflate#compression-levels
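
As a rough illustration of the speed-versus-ratio levels described in the quoted README text, the sketch below compresses the same buffer at zlib levels 1, 6 and 9 using Python's standard `zlib` module. It does not exercise libdeflate (levels 10 to 12 are not reachable through `zlib`), and the FASTQ-like sample data is made up for the example.

```python
import zlib

# Made-up, FASTQ-like sample data (an assumption for illustration only).
record = b"@read1\nACGTACGTTAGCCGATAGCTAGCTAGGATCCA\n+\nIIIIHHHHGGGGFFFFEEEEDDDDCCCCBBBB\n"
data = record * 2000


def compressed_sizes(payload, levels=(1, 6, 9)):
    """Return {level: compressed size in bytes} for each zlib level."""
    sizes = {}
    for level in levels:
        blob = zlib.compress(payload, level)
        # Every level must decompress back to the original bytes.
        assert zlib.decompress(blob) == payload
        sizes[level] = len(blob)
    return sizes


sizes = compressed_sizes(data)
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(data):.2%} of input)")
```

The exact sizes depend on the input and on the zlib build, so no particular numbers are claimed here.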

@jkbonfield
Contributor

We could look into extending the level range to 12 directly, but some places may assume a single-digit level, such as the format string ("z9" etc.). So remapping 1-9 onto 1-12 is potentially the easier fix. In fact, I assumed it already did this!

CRAM does this when using libdeflate, with a mapping of:

requested level   libdeflate level
1                 1
2                 2
3                 3
4                 4
5                 6
6                 7
7                 9
8                 10
9                 12
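
The mapping listed above can be sketched as a simple lookup table. This is an illustration only, not htslib's actual code: the helper name `remap_level`, the treatment of -1 as zlib's default of 6 before remapping, and passing 0 through unchanged are all assumptions.

```python
# zlib-style level (1-9) -> libdeflate level (1-12), as listed in the comment above.
ZLIB_TO_LIBDEFLATE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 6, 6: 7, 7: 9, 8: 10, 9: 12}


def remap_level(level):
    """Map a user-facing level (-1, or 0 to 9) to a libdeflate level.

    Hypothetical helper: treating -1 as level 6 and leaving 0 unchanged
    are assumptions, not behaviour confirmed in this thread.
    """
    if level == -1:
        level = 6
    if level == 0:
        return 0
    if level not in ZLIB_TO_LIBDEFLATE:
        raise ValueError(f"compression level must be -1 or 0..9, got {level}")
    return ZLIB_TO_LIBDEFLATE[level]


for user_level in range(1, 10):
    print(user_level, "->", remap_level(user_level))
```

With a remap like this the user-facing level stays a single digit, so strings such as "z9" would not need to change, while level 9 still reaches libdeflate's level 12.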

Author

ghuls commented Jul 20, 2022

Remapping is fine, I guess. Recompressing a gzipped FASTQ file (71G) with bgzip at the default compression level made it 73.1G, due to this larger libdeflate level range.
