Make nibble2base faster using x86-64 pshufb instruction (SSSE3) and using dynamic dispatch. #1764
Conversation
I can confirm the code works and this function is considerably faster. Curiously, despite using SSSE3 intrinsics, the code uses different instructions if I enable -mavx2. I can also get it a bit faster on older platforms (less so on modern ones) by interleaving more loads together (two 128-bit lines) and removing instruction dependency latency. I'm guessing newer CPUs take fewer cycles for some of these instructions, or have quicker memory loads, so the latency vanishes. I didn't test AMD either. It's largely irrelevant though, as another 10-20% speed-up of this function doesn't matter once it's no longer a significant bottleneck in the overall performance. However, it's clumsy with regard to portability.
I don't know what the appropriate HTS_COMPILER_HAS check for SSSE3 CPU support would be, though. Maybe it just needs an equivalent clang version check (which appears to require clang >= 3.9). Heaven help anyone attempting to use older pre-clang iccs, but it's not a huge issue as it just falls back to the old code anyway. We'll probably be putting a new release together soon, but I feel this is a bit bleeding edge still to incorporate. Thank you for the improvements. It will get merged, but please don't be put off by it not likely being in the next release.
@jkbonfield Thank you for the feedback. As you mention, this only significantly speeds up some use cases (uBAM + long reads), and only by low double-digit percentages. So it is not critical to get this out ASAP. In fact, I have already implemented this in my own code that only parses uBAM. I figured it would be nice to contribute it back to htslib so that all users may enjoy it eventually.
I did. It works. This should be faster on any SSSE3-enabled platform, simply because a memory lookup is almost always going to be slower than a register operation. In this case the pshufb instruction has 1 cycle of latency, so memory can never beat that; L1 cache latency is typically 3 cycles. Thank you for all your feedback on the previous PR as well. It urged me to learn dynamic dispatch, and it is great! It enabled an AVX2 routine for one of my projects.
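For readers following along, the core of the trick can be sketched roughly as below. This is a minimal illustration with hypothetical names, not htslib's actual nibble2base implementation; it assumes a GCC/clang x86-64 toolchain (the `target` attribute enables SSSE3 code generation for just this function, so no global `-mssse3` flag is needed):

```c
/* Sketch: convert BAM-style packed 4-bit nibbles to base characters
 * using pshufb (_mm_shuffle_epi8) as a 16-entry in-register lookup table.
 * Hypothetical names; requires an x86-64 CPU with SSSE3 at run time. */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics */

/* seq_nt16_str-style table: 4-bit code -> IUPAC base character. */
static const char nt16[16] = "=ACMGRSVTWYHKDBN";

__attribute__((target("ssse3")))
static void nibble2base_ssse3(const uint8_t *nibbles, char *out, size_t nbases)
{
    const __m128i lut = _mm_loadu_si128((const __m128i *)nt16);
    const __m128i lo_mask = _mm_set1_epi8(0x0f);
    size_t i = 0;
    /* Each input byte holds two bases: high nibble first, low nibble second. */
    for (; i + 32 <= nbases; i += 32) {
        __m128i packed = _mm_loadu_si128((const __m128i *)(nibbles + i / 2));
        /* Isolate high and low nibbles of each byte. */
        __m128i hi = _mm_and_si128(_mm_srli_epi32(packed, 4), lo_mask);
        __m128i lo = _mm_and_si128(packed, lo_mask);
        /* pshufb: each 4-bit index selects one byte from the 16-entry table. */
        __m128i hi_chars = _mm_shuffle_epi8(lut, hi);
        __m128i lo_chars = _mm_shuffle_epi8(lut, lo);
        /* Interleave back into sequence order: hi0 lo0 hi1 lo1 ... */
        _mm_storeu_si128((__m128i *)(out + i),
                         _mm_unpacklo_epi8(hi_chars, lo_chars));
        _mm_storeu_si128((__m128i *)(out + i + 16),
                         _mm_unpackhi_epi8(hi_chars, lo_chars));
    }
    for (; i < nbases; i++)  /* scalar tail for the remaining bases */
        out[i] = nt16[(nibbles[i / 2] >> (i % 2 ? 0 : 4)) & 0x0f];
}
```

The shift-then-mask on 32-bit lanes is the usual workaround for x86 lacking a byte-wise shift: bits leaking in from neighbouring bytes are cleared by the `0x0f` mask.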
This is expected. It will probably use ymm rather than xmm registers, since these are practically the same.
I have now added the appropriate clang checks as well, following https://clang.llvm.org/docs/LanguageExtensions.html#builtin-cpu-supports
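The dispatch pattern being discussed can be sketched as follows (hypothetical names, not htslib's actual code): a function pointer initially points at a resolver, and the first call replaces it with the best implementation for the running CPU.

```c
/* Sketch of first-call dynamic dispatch via __builtin_cpu_supports
 * (GCC >= 4.8, clang >= 3.9). Hypothetical names for illustration. */
#include <assert.h>

typedef int (*impl_fn)(void);

static int impl_scalar(void) { return 1; }  /* portable fallback */
static int impl_ssse3(void)  { return 2; }  /* stand-in for the SSSE3 path */

static int resolve_first_call(void);

/* Starts out pointing at the resolver; the first call swaps it out. */
static impl_fn active_impl = resolve_first_call;

static int resolve_first_call(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("ssse3"))
        active_impl = impl_ssse3;
    else
        active_impl = impl_scalar;
#else
    active_impl = impl_scalar;
#endif
    return active_impl();  /* forward this first call too */
}
```

Callers always go through `active_impl()`, so after the first call the probe cost disappears entirely.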
I had a look at the code. Using a broadcast instruction, the 16 bytes can be loaded into two 128-bit lanes within a 256-bit vector. Instead of two sequences of operations, only one sequence is needed. The code would look cleaner at the cost of supporting fewer computers. In theory it should be faster, but I think the boost in application performance compared to what is there now is negligible. For now I think it is better to merge the code as is. I may try it out on some other projects I work on before updating the htslib code in a new PR, if it turns out to be worth it.
Thanks. I'll take another look at this soon.
Sorry for the spam, but I want to close off the AVX2 possibilities. AVX2 is definitely not worth it. Tested on roughly 750MB of bases (in BAM format), the speedup is 20ms on the Skylake Intel CPU I am currently working on. It also saves 6 lines of code. Since SSSE3 is supported much more broadly and gets more than 90% of the speed gain, it is better to stick with that.
I'm happy with fixing the low-hanging fruit only. Once it's "fast enough", polishing the remainder is just adding complexity for little gain. Thanks for the updates. I'll check them this week and hopefully get it merged.
It is never fast enough. As long as bioinformatics pipelines take longer than my coffee breaks, the performance is still subpar as far as I am concerned. Having said that, I don't mind taking a 20ms longer coffee break. ;-)
Thanks. I've had a look at the algorithm, which is pretty simple and well documented, but very effective. The CPU detection is a bit of a change from how we've done it before, but I like the compiler auto-detection methods and automatic dispatch. As you say, they're very powerful, and maybe we can use them for other bits of code in the future where custom logic will help. Thanks.
Thank you for not merging the initial PR, which encouraged me to look into dynamic dispatching. I have used that a few times now in other projects. It is especially useful for enabling the pshufb instruction, which basically provides a 16-entry lookup table. Since a lot of code contains 256-entry or 128-entry (for ASCII text) lookup tables, these can often be rewritten with shuffle instructions. For example, DNA only contains A, C, G, T. So rather than using some sort of lookup, there can be some quick checking using vector compare instructions against A, C, G, T. The resulting masks can then be merged with bitwise OR, giving an ACGT-only mask. If everything is ACGT, which we just checked with only a few instructions, we can do a bitwise AND with 0b1111, resulting in 4-bit indices that range from 0-15. Since A, C, G, T are distinct in their last few bits, and we verified the vector contains only A, C, G, T, we can then apply the shuffle. A bit verbose, but in terms of CPU instructions it saves a lot of work. I used this for converting DNA to two-bit representation in Sequali. See the code here: https://github.com/rhpvorderman/sequali/blob/a39ceee8a5b25668e068c0812719429386673dc3/src/sequali/function_dispatch.h#L259
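The compare-merge-shuffle sequence described above can be sketched like this. It is a hypothetical illustration, not Sequali's actual code: it emits one two-bit code per output byte (the further step of packing four codes per byte is omitted). The low nibbles of ASCII 'A', 'C', 'G', 'T' are 1, 3, 7, 4, all distinct, which is what makes the 16-entry shuffle table work:

```c
/* Sketch: validate that 16 bytes are all A/C/G/T, then convert them to
 * two-bit codes (A=0, C=1, G=2, T=3) with a single pshufb.
 * Hypothetical names; requires an x86-64 CPU with SSSE3 at run time. */
#include <assert.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics */

__attribute__((target("ssse3")))
static int acgt_to_twobit(const char *seq16, unsigned char out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)seq16);
    /* Compare against each allowed letter; merge the masks with OR. */
    __m128i is_a = _mm_cmpeq_epi8(v, _mm_set1_epi8('A'));
    __m128i is_c = _mm_cmpeq_epi8(v, _mm_set1_epi8('C'));
    __m128i is_g = _mm_cmpeq_epi8(v, _mm_set1_epi8('G'));
    __m128i is_t = _mm_cmpeq_epi8(v, _mm_set1_epi8('T'));
    __m128i acgt = _mm_or_si128(_mm_or_si128(is_a, is_c),
                                _mm_or_si128(is_g, is_t));
    if (_mm_movemask_epi8(acgt) != 0xffff)
        return -1;  /* something other than A/C/G/T present */
    /* Low nibbles: 'A'->1, 'C'->3, 'G'->7, 'T'->4. Map them to 0..3. */
    __m128i lut = _mm_setr_epi8(0, 0, 0, 1, 3, 0, 0, 2,
                                0, 0, 0, 0, 0, 0, 0, 0);
    __m128i idx = _mm_and_si128(v, _mm_set1_epi8(0x0f));
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(lut, idx));
    return 0;
}
```

Four compares, three ORs, one movemask, one AND, and one shuffle replace sixteen scalar table lookups plus sixteen validity checks.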
Interesting discussion in the last couple of comments… 🤔 It is probably fine, and the assignment to …
@jmarshall That seems to be an accurate observation. I never considered that. Some locking structure in the dispatch would not hurt the performance. |
A very good point, and generally something I've thought of too (hence the plethora of pthread_once things around). Given the way this is implemented, yes, it ideally does need locking, although as you say it's probably "atomic enough" to never cause an issue in practice. On x86-64, that is, which in this case is what matters due to the ifdefs. In general though, it is not acceptable to assume that aligned 64-bit writes on a 64-bit system will be atomic. E.g. https://godbolt.org/z/vso1K7b64 (Edit: although that's still one store instruction, so maybe... It also depends on the memory back end and whether CPU instructions themselves are happening asynchronously)
Ah, interesting. I'd like to add that while x86-64 compilers do compile a pointer write to one store instruction, that does not guarantee that the memory is written in one go by the CPU. If it writes byte by byte, and the backend reads byte by byte, then an issue might still occur. If I understood correctly, the smallest memory unit that is retrieved from and written to memory is the cache line, which is 64 bytes, so there might not be a problem at all. Adding mutexes won't hurt, however.
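The pthread_once approach mentioned above removes the ambiguity entirely: the resolver runs exactly once, with the necessary memory synchronisation guaranteed by POSIX, regardless of how the hardware handles the pointer store. A minimal sketch with hypothetical names:

```c
/* Sketch: guard dispatch initialisation with pthread_once so the
 * function-pointer assignment is race-free by construction.
 * Hypothetical names; link with -lpthread on older libcs. */
#include <assert.h>
#include <pthread.h>

typedef int (*impl_fn)(void);

static int impl_scalar(void) { return 1; }  /* portable fallback */

static impl_fn active_impl;
static pthread_once_t dispatch_once = PTHREAD_ONCE_INIT;

static void init_dispatch(void)
{
    /* CPU feature probing would go here; this sketch always picks scalar. */
    active_impl = impl_scalar;
}

static int call_impl(void)
{
    /* Runs init_dispatch exactly once, even with concurrent callers. */
    pthread_once(&dispatch_once, init_dispatch);
    return active_impl();
}
```

The trade-off is a pthread_once check on every call instead of a bare indirect call, which is why the "atomic enough" self-assigning pointer is tempting in the first place.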
See #1677 for prior discussion.
I made the PR such that nibble2base is dynamically dispatched on x86-64 CPUs with SSSE3 instructions.
No build options need to be changed, and nibble2base gets a nice speed-up on the majority of the install base.