Make nibble2base faster using x86-64 pshufb instruction (SSSE3) and using dynamic dispatch. #1764
Conversation
I can confirm the code works and this function is considerably faster. Curiously, despite using SSSE3 intrinsics, the code uses different instructions if I enable -mavx2. I can also get it a bit faster on older platforms (less so on modern ones) by interleaving more loads together (two 128-bit lines) and removing instruction dependency latency. I'm guessing newer CPUs take fewer cycles for some of these instructions, or have quicker memory loads, so the latency vanishes. I didn't test AMD either. It's largely irrelevant though, as another 10-20% speed-up of this function doesn't matter once it's no longer a significant bottleneck in the overall performance. However, it's clumsy with regard to portability.
I don't know what the appropriate HTS_COMPILER_HAS check for SSSE3 CPU support would be, though. Maybe it just needs an equivalent clang version check (which appears to require clang >= 3.9). Heaven help anyone attempting to use older pre-clang iccs, but it's not a huge issue as it just falls back to the old code anyway. We'll probably be putting a new release together soon, but I feel this is a bit bleeding edge still to incorporate. Thank you for the improvements. It will get merged, but please don't be put off by it not likely being in the next release.
@jkbonfield Thank you for the feedback. As you mention, this only significantly speeds up some use cases (uBAM + long reads), and only by low double-digit percentages. So it is not critical to get this out ASAP. In fact, I have already implemented this in my own code that only parses uBAM. I figured it would be nice to contribute it back to htslib so that all users may enjoy it eventually.
I did. It works. This should be faster on any SSSE3-enabled platform, simply because a memory lookup is almost always going to be slower than a register operation. In this case the pshufb instruction has 1 cycle of latency, so memory can never beat that; L1 cache latency is typically 3 cycles. Thank you for all your feedback on the previous PR as well. It urged me to learn dynamic dispatch, and it is great! It enabled an AVX2 routine for one of my projects.
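For readers following along, the core of the trick can be sketched roughly as below. This is a minimal illustration with hypothetical names, not htslib's actual nibble2base implementation; it assumes a GCC/clang x86-64 toolchain (the `target` attribute enables SSSE3 code generation for just this function, so no global `-mssse3` flag is needed):

```c
/* Sketch: convert BAM-style packed 4-bit nibbles to base characters
 * using pshufb (_mm_shuffle_epi8) as a 16-entry in-register lookup table.
 * Hypothetical names; requires an x86-64 CPU with SSSE3 at run time. */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics */

/* seq_nt16_str-style table: 4-bit code -> IUPAC base character. */
static const char nt16[16] = "=ACMGRSVTWYHKDBN";

__attribute__((target("ssse3")))
static void nibble2base_ssse3(const uint8_t *nibbles, char *out, size_t nbases)
{
    const __m128i lut = _mm_loadu_si128((const __m128i *)nt16);
    const __m128i lo_mask = _mm_set1_epi8(0x0f);
    size_t i = 0;
    /* Each input byte holds two bases: high nibble first, low nibble second. */
    for (; i + 32 <= nbases; i += 32) {
        __m128i packed = _mm_loadu_si128((const __m128i *)(nibbles + i / 2));
        /* Isolate high and low nibbles of each byte. */
        __m128i hi = _mm_and_si128(_mm_srli_epi32(packed, 4), lo_mask);
        __m128i lo = _mm_and_si128(packed, lo_mask);
        /* pshufb: each 4-bit index selects one byte from the 16-entry table. */
        __m128i hi_chars = _mm_shuffle_epi8(lut, hi);
        __m128i lo_chars = _mm_shuffle_epi8(lut, lo);
        /* Interleave back into sequence order: hi0 lo0 hi1 lo1 ... */
        _mm_storeu_si128((__m128i *)(out + i),
                         _mm_unpacklo_epi8(hi_chars, lo_chars));
        _mm_storeu_si128((__m128i *)(out + i + 16),
                         _mm_unpackhi_epi8(hi_chars, lo_chars));
    }
    for (; i < nbases; i++)  /* scalar tail for the remaining bases */
        out[i] = nt16[(nibbles[i / 2] >> (i % 2 ? 0 : 4)) & 0x0f];
}
```

The shift-then-mask on 32-bit lanes is the usual workaround for x86 lacking a byte-wise shift: bits leaking in from neighbouring bytes are cleared by the `0x0f` mask.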
This is expected. It will probably use ymm rather than xmm registers, since these are practically the same.
I have now added the appropriate clang checks as well, following https://clang.llvm.org/docs/LanguageExtensions.html#builtin-cpu-supports
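The dispatch pattern being discussed can be sketched as follows (hypothetical names, not htslib's actual code): a function pointer initially points at a resolver, and the first call replaces it with the best implementation for the running CPU.

```c
/* Sketch of first-call dynamic dispatch via __builtin_cpu_supports
 * (GCC >= 4.8, clang >= 3.9). Hypothetical names for illustration. */
#include <assert.h>

typedef int (*impl_fn)(void);

static int impl_scalar(void) { return 1; }  /* portable fallback */
static int impl_ssse3(void)  { return 2; }  /* stand-in for the SSSE3 path */

static int resolve_first_call(void);

/* Starts out pointing at the resolver; the first call swaps it out. */
static impl_fn active_impl = resolve_first_call;

static int resolve_first_call(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("ssse3"))
        active_impl = impl_ssse3;
    else
        active_impl = impl_scalar;
#else
    active_impl = impl_scalar;
#endif
    return active_impl();  /* forward this first call too */
}
```

Callers always go through `active_impl()`, so after the first call the probe cost disappears entirely.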
I had a look at the code. Using a broadcast instruction, the 16 bytes can be loaded into two 128-bit lanes within a 256-bit vector. Instead of two sequences of operations, only one sequence is needed. The code would look cleaner at the cost of supporting fewer computers. In theory it should be faster, but I think the boost in application performance compared to what is there now is negligible. For now I think it is better to merge the code as is. I may try it out on some other projects I work on before updating the htslib code in a new PR, if it turns out to be worth it.
Thanks. I'll take another look at this soon.
Sorry for the spam, but I want to close off the AVX2 possibilities. AVX2 is definitely not worth it. Tested on roughly 750MB of bases (in BAM format), the speedup is 20ms on the Skylake Intel CPU I am currently working on. It also saves 6 lines of code. Since SSSE3 is supported much more broadly and gets more than 90% of the speed gain, it is better to stick with that.
I'm happy with fixing the low-hanging fruit only. Once it's "fast enough", polishing the remainder is just adding complexity for little gain. Thanks for the updates. I'll check them this week and hopefully get it merged.
It is never fast enough. As long as bioinformatics pipelines take longer than my coffee breaks, the performance is still subpar as far as I am concerned. Having said that, I don't mind taking a 20ms longer coffee break. ;-)
Thanks. I've had a look at the algorithm, which is pretty simple and well documented, but very effective. The CPU detection is a bit of a change from how we've done it before, but I like the compiler auto-detection methods and automatic dispatch. As you say, they're very powerful, and maybe we can use them for other bits of code in the future where custom logic will help. Thanks.
Thank you for not merging the initial PR, which encouraged me to look into dynamic dispatching. I have used that a few times now in other projects. It is especially useful for enabling the pshufb instruction, which basically provides a 16-entry lookup table. Since a lot of code contains 256-entry or 128-entry (for ASCII text) lookup tables, these can often be rewritten with shuffle instructions. For example, DNA only contains A, C, G, T. So rather than using some sort of lookup, there can be some quick checking using vector compare instructions against A, C, G, T. The resulting masks can then be merged with bitwise OR, giving an ACGT-only mask. If everything is ACGT, which we just checked with only a few instructions, we can do a bitwise AND with 0b1111, resulting in 4-bit indices that range from 0-15. Since A, C, G, T are distinct in their last few bits, and we verified the vector contains only A, C, G, T, we can then apply the shuffle. A bit verbose, but in terms of CPU instructions it saves a lot of work. I used this for converting DNA to two-bit representation in Sequali. See the code here: https://github.com/rhpvorderman/sequali/blob/a39ceee8a5b25668e068c0812719429386673dc3/src/sequali/function_dispatch.h#L259
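The compare-merge-shuffle sequence described above can be sketched like this. It is a hypothetical illustration, not Sequali's actual code: it emits one two-bit code per output byte (the further step of packing four codes per byte is omitted). The low nibbles of ASCII 'A', 'C', 'G', 'T' are 1, 3, 7, 4, all distinct, which is what makes the 16-entry shuffle table work:

```c
/* Sketch: validate that 16 bytes are all A/C/G/T, then convert them to
 * two-bit codes (A=0, C=1, G=2, T=3) with a single pshufb.
 * Hypothetical names; requires an x86-64 CPU with SSSE3 at run time. */
#include <assert.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics */

__attribute__((target("ssse3")))
static int acgt_to_twobit(const char *seq16, unsigned char out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)seq16);
    /* Compare against each allowed letter; merge the masks with OR. */
    __m128i is_a = _mm_cmpeq_epi8(v, _mm_set1_epi8('A'));
    __m128i is_c = _mm_cmpeq_epi8(v, _mm_set1_epi8('C'));
    __m128i is_g = _mm_cmpeq_epi8(v, _mm_set1_epi8('G'));
    __m128i is_t = _mm_cmpeq_epi8(v, _mm_set1_epi8('T'));
    __m128i acgt = _mm_or_si128(_mm_or_si128(is_a, is_c),
                                _mm_or_si128(is_g, is_t));
    if (_mm_movemask_epi8(acgt) != 0xffff)
        return -1;  /* something other than A/C/G/T present */
    /* Low nibbles: 'A'->1, 'C'->3, 'G'->7, 'T'->4. Map them to 0..3. */
    __m128i lut = _mm_setr_epi8(0, 0, 0, 1, 3, 0, 0, 2,
                                0, 0, 0, 0, 0, 0, 0, 0);
    __m128i idx = _mm_and_si128(v, _mm_set1_epi8(0x0f));
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(lut, idx));
    return 0;
}
```

Four compares, three ORs, one movemask, one AND, and one shuffle replace sixteen scalar table lookups plus sixteen validity checks.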
Interesting discussion in the last couple of comments… 🤔 It is probably fine, and the assignment to …
@jmarshall That seems to be an accurate observation. I never considered that. Some locking structure in the dispatch would not hurt the performance. |
A very good point, and generally something I've thought of too (hence the plethora of pthread_once things around). Given the way this is implemented, yes, it ideally does need locking, although as you say it's probably "atomic enough" to never cause an issue in practice. On x86-64, that is, which in this case is what matters due to the ifdefs. In general though, it is not acceptable to assume that aligned 64-bit writes on a 64-bit system will be atomic. E.g. https://godbolt.org/z/vso1K7b64 (Edit: although that's still one store instruction, so maybe... It also depends on the memory back end and whether CPU instructions themselves are happening asynchronously)
Ah, interesting. I'd like to add that while x86-64 compilers do compile a pointer write to one store instruction, that does not guarantee that the memory is written in one go by the CPU. If it writes byte by byte, and the backend reads byte by byte, then an issue might still occur. If I understood correctly, the smallest memory unit that is retrieved from and written to memory is the cache line, which is 64 bytes, so there might not be a problem at all. Adding mutexes won't hurt, however.
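The pthread_once approach mentioned above removes the ambiguity entirely: the resolver runs exactly once, with the necessary memory synchronisation guaranteed by POSIX, regardless of how the hardware handles the pointer store. A minimal sketch with hypothetical names:

```c
/* Sketch: guard dispatch initialisation with pthread_once so the
 * function-pointer assignment is race-free by construction.
 * Hypothetical names; link with -lpthread on older libcs. */
#include <assert.h>
#include <pthread.h>

typedef int (*impl_fn)(void);

static int impl_scalar(void) { return 1; }  /* portable fallback */

static impl_fn active_impl;
static pthread_once_t dispatch_once = PTHREAD_ONCE_INIT;

static void init_dispatch(void)
{
    /* CPU feature probing would go here; this sketch always picks scalar. */
    active_impl = impl_scalar;
}

static int call_impl(void)
{
    /* Runs init_dispatch exactly once, even with concurrent callers. */
    pthread_once(&dispatch_once, init_dispatch);
    return active_impl();
}
```

The trade-off is a pthread_once check on every call instead of a bare indirect call, which is why the "atomic enough" self-assigning pointer is tempting in the first place.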
See #1677 for prior discussion.
I made the PR such that nibble2base is dynamically dispatched on x86-64 CPUs with SSSE3 instructions.
No build options need to be changed, and nibble2base gets a nice speed-up on the majority of the install base.