-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster nibble parsing and a SSSE3 routine for even faster nibble parsing #1677
Conversation
0bdacbc
to
d739ecd
Compare
This looks interesting, but it would need some form of dynamic dispatch. While SSSE3 may be widespread now, we can't make any assumptions about the hardware in use. |
@daviesrob Sure. I have been meaning to practice with dynamic dispatch for sometime. At least it is also quite a lot faster without the SSSE3 routine. Would it be an idea to remove the SSSE3 code from this PR and do that later in a separate PR? |
1037343
to
d739ecd
Compare
Strangely enough I retested on a intel core i7 -6700 system and the result was slower than on my ryzen 5 5650 system. So it seems to differ per CPU strangely enough whether it is faster or slower using the while loop vs the for loop. Here is the loop in assembly for this PR .L3:
movzx esi, BYTE PTR [rcx]
add rax, 2
add rcx, 1
movzx esi, WORD PTR code2base.0[rsi+rsi]
mov WORD PTR [rax-2], si
cmp rax, r8
jb .L3 Develop branch
I can't see why one branch would be faster than the other. The SSSE3 routine is always faster, but indeed that requires some extra work. |
I was testing on Intel and found it's also highly dependent on compiler, and probably version. Original scalar code, for 100k ONT reads (NB this nibble function only, perf counts): clan16 543, gcc13 413 The SSSE3 implementation of this function however was around 110 perf hits/counts for both compilers. The unrolled alternative looks better than both orig and the new variant as far as scalar code goes. It's a work-in-progress, but I have atm:
I can tidy that up a bit more as it has remnant of earlier experiments in there too with uint16_ts. I can test AMD too though as it sounds like that's different again. |
I checked AMD EPYC 7713 too and found your new scalar code was considerably slower than the old scalar code. I recall exploring pointer vs array indices in this code before and found it very variable on system / compiler, so I think the old code was basically a happy medium of the lot. However in all cases manual unrolling seems to win. Curiously it's also important to cast code2base into uint16_t first as without it the unrolled code takes 4 moves and 1 add per entry. With the cast it's 3 moves. I can't explain this, but it appears the *2 array indices is really being processed that way (add-self is *2). Maybe it's because it's unsure of alignment, but code2base is local and it ought to know this. Manual alignment instructions may fix it (I didn't try), but the easy solution is a cast. Ie the code I quoted above. I tried doing it as a loop and letting it auto-unroll, but that demonstrates another issue:
That should be the same as the manually unrolled one, but it runs at half speed and it doesn't unroll. To fix that we either need I think it's easier just to stick with the manually unrolled version for easy portability. (Oddly clang doesn't need the restrict bit. Maybe it's generating two code paths and testing at run time? I didn't check the assembly output.) |
Finally, some full testing on the AMD system for 100k ONT records from uncompressed BAM to SAM. Scalar unrolled code vs SSSE3 enabled: clang 16 was 15% faster and gcc13 was 10% faster. So it does add up, but that's pretty much the majority of the time in this function so it's hard to see how it improves things to the extent you saw. I assume it's something specific about your data. Length of reads, existance of aux tags maybe? |
The reads in this set are very long. They are from the GM24385_1.fastq.gz file (from the GIAB data indices, extra long ONT reads). The average read length was longer than 7000. This data was converted to BAM using Picard FastqToSam, in order to simulate uBAM. Because of the very long reads I think the algorithm has a chance to truly shine. I wrote a little uBAM-only parser for dnaio (alignments are ignored) and found that the SSSE3 routine improved parsing speed about 20%, so that is concordant with your result. If I cheat and do not store the BAM in bgzip blocks (just decompress the bam with gzip), then the speedup is 33%. In my uBAM parser I also used SSE2 for adding +33 to all the qualities so that is less of a bottleneck. How should I proceed from this point? It is clear that the scalar code in this PR is mixed depending on hardware (so not worth it) and the SSSE3 code is phenomally better, but also much harder to integrate. 10-15% overall application performance by improving a single function is quite nice. So it would feel a bit wasteful not to do anything with it. |
My ONT file had an average read length of about 4.6kb, so similar ballpark. You probably gained more because you didn't have any alignment fields or aux tags to print up, meaning sequence printing becomes more significant. Agreed a 10-15% speed gain for this case is worth having, even more so if the circumstances make that a larger percentage. I'm not sure it's worth it if it's only included if the user builds using I haven't tried it, but there is this: https://gcc.gnu.org/wiki/FunctionMultiVersioning and more blurb at https://johnnysswlab.com/cpu-dispatching-make-your-code-both-portable-and-fast/ It'd need appropriate compiler guards to ensure it's only there on compilers that support it (see htslib/hts_defs.h for examples). For htscodecs I went with the more complex dispatching route, but primarily because when I started it I wasn't aware of the builtin methods. Since then though it's turned out to be helpful because I have multiple versions and it's not always best to use the highest SIMD implementation on all platforms. That's over the top though for this function I think. Regarding the qual+33, I'm surprised it's not auto-vectorised. If not then maybe this is a case where we ought to be using |
It turns out that gcc really hates using SIMD in loops. I've absolutely no idea why.
The first compilation is disabling restrict and asking what gcc thinks. It notices an alias issue and refuses the vectorise. Fine - that's what restrict is for. (Although I'll note clang builds vectorised and non-vectorised versions with a run-time check for aliasing.) The second compilation line is the one that astounded me. It just doesn't believe vectorising is worth while. Whatever costing model it uses is clearly not fit for purpose here. Explicitly telling it to use the tree-loop-vectorize option fixes this (it's also part of -O3), but it does amaze me that it's not an automatic thing. Adding Using this, with and without the -O3 attribute, on my I rebuilt both versions (with and without -O3) with -fno-inline so I could report just that one function. Auto-vectorisation of this code is massively beneficial, as we'd expect.
That's not even remotely close! I don't want to add explicit vectorisation code, but clearly supplying some hints to the compiler are appropriate. An alternative implementation, rather brain dead, does get vectorised, but only with gcc 12 onwards. Otherwise it's still rubbish:
|
Yeah I don't usually bother with the compiler optimizations and go straight for the SIMD. This is my function: static void
decode_bam_qualities(uint8_t *dest, const uint8_t *encoded_qualities, size_t length)
{
const uint8_t *end_ptr = encoded_qualities + length;
const uint8_t *cursor = encoded_qualities;
uint8_t *dest_cursor = dest;
#ifdef __SSE2__
const uint8_t *vec_end_ptr = end_ptr - sizeof(__m128i);
while (cursor < vec_end_ptr) {
__m128i quals = _mm_loadu_si128((__m128i *)cursor);
__m128i phreds = _mm_add_epi8(quals, _mm_set1_epi8(33));
_mm_storeu_si128((__m128i *)dest_cursor, phreds);
cursor += sizeof(__m128i);
dest_cursor += sizeof(__m128i);
}
#endif
while (cursor < end_ptr) {
*dest_cursor = *cursor + 33;
cursor += 1;
dest_cursor += 1;
}
} Because of the compile guards this will compile to slow but working code. Since SSE2 is part of the X86_64 spec, it is automatically compiled in on that platform. (Which is 90%+ of the target audience anyway so quite happy with that). |
My own preference is where possible to avoid explicit intrinsics, as it doesn't help ARM for example. There are generalised vectorisation libraries, but then we get additional code complexities (although some are as simple as a header file you copy which isn't so bad). Best though is working out how to get the compiler to do the work for us. I added #1679 to beat the qual+33 problems into submission, but the nibble-to-ASCII sequence conversion is your PR here. I'll let you progress with that one. I think it would be good though to improve both the scalar version (eg my unrolled form) as well as to incorporate the SSSE3 variant. |
Yes, but it only works well for trivial problems. Writing the nibble decoding in such a way that the compiler automatically comes up with the sequence presented here is not feasible. Luckily most problems are trivial. For this function I will try out dynamic dispatch. I think it is a good fit for this problem as the dispatch overhead won't be that bad given that sequences are usually at least 150bp. So compiler inlining won't save the day here. Also it is a nice opportunity for me to learn about dynamic dispatch and having someone already experienced review the code. It might take me a few day to get to this though, if you don't mind. |
Is this still being worked on. We cannot accept this as it currently stands as it'd need CPU detection or some auto-dispatch mechanism so the binaries are compatible across hardware platforms. At the very least, we could consider a non SSSE3 variant if there is better auto-vectorised or optimisable code. |
Hi, this is still on my todo list, but work has been tremendously busy at the moment. Feel free to close this and I can make another attempt using dynamic dispatch at a later date. Unless you want to implement that yourself of course. |
Ok, I'll close this for now. Feel free to open a new pull request if you come up with a new version. |
Nibble parsing was written as a for loop. This leads to an extra loop control variable being used. By using pointer arithmetic, the loop control variable can be parsed out. This leads to faster code.
I also wrote a routine to utilise the pshufb instruction in order to get even faster translation.
Speed before:
Speed after:
Speed after, using SSSE3
Both the non-ssse3 and ssse3 function deliver the same htsfile output for all the 100.000 reads tested. (Tested using a checksum.)