
Faster nibble parsing and a SSSE3 routine for even faster nibble parsing #1677

Closed

Conversation

rhpvorderman
Contributor

Nibble parsing was written as a for loop, which uses an extra loop control variable. By using pointer arithmetic, the loop control variable can be eliminated, which leads to faster code.
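Roughly, the change is the difference between the two loop shapes sketched below (illustrative only, not the exact htslib source; code2base is assumed to hold two ASCII bases for each of the 256 packed-byte values):

#include <stdint.h>
#include <string.h>

extern const char code2base[512];   /* 256 entries of two ASCII bases (assumed) */

/* for-loop version: the index i is an extra live induction variable */
static void nibble2base_for(const uint8_t *nib, char *seq, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        memcpy(&seq[i * 2], &code2base[(size_t)nib[i] * 2], 2);
}

/* pointer version: only the two cursors advance */
static void nibble2base_ptr(const uint8_t *nib, char *seq, size_t nbytes)
{
    const char *seq_end = seq + nbytes * 2;
    while (seq < seq_end) {
        memcpy(seq, &code2base[(size_t)*nib * 2], 2);
        seq += 2;
        nib += 1;
    }
}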

I also wrote a routine to utilise the pshufb instruction in order to get even faster translation.
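A pshufb-based translation can look roughly like the sketch below. This is not necessarily the exact code in the commit; the function name is illustrative and the lookup table mirrors the order of htslib's seq_nt16_str ("=ACMGRSVTWYHKDBN"):

#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* 4-bit code -> base, same order as htslib's seq_nt16_str. */
static const char nt16_lut[16] = {
    '=','A','C','M','G','R','S','V','T','W','Y','H','K','D','B','N'
};

/* nib: packed sequence, two bases per byte (high nibble first);
 * seq: output buffer of at least len bytes; len: number of bases. */
static void nibble2base_ssse3(const uint8_t *nib, char *seq, size_t len)
{
    const __m128i lut     = _mm_loadu_si128((const __m128i *)nt16_lut);
    const __m128i lo_mask = _mm_set1_epi8(0x0f);
    size_t nbytes = len / 2, i = 0;

    for (; i + 16 <= nbytes; i += 16) {
        __m128i packed = _mm_loadu_si128((const __m128i *)(nib + i));
        /* split each byte into its high and low nibble */
        __m128i hi = _mm_and_si128(_mm_srli_epi16(packed, 4), lo_mask);
        __m128i lo = _mm_and_si128(packed, lo_mask);
        /* 16 parallel table lookups per half */
        __m128i hi_chars = _mm_shuffle_epi8(lut, hi);
        __m128i lo_chars = _mm_shuffle_epi8(lut, lo);
        /* interleave so the high-nibble base precedes the low-nibble base */
        _mm_storeu_si128((__m128i *)(seq + 2*i),
                         _mm_unpacklo_epi8(hi_chars, lo_chars));
        _mm_storeu_si128((__m128i *)(seq + 2*i + 16),
                         _mm_unpackhi_epi8(hi_chars, lo_chars));
    }
    for (; i < nbytes; i++) {            /* scalar tail */
        seq[2*i]     = nt16_lut[nib[i] >> 4];
        seq[2*i + 1] = nt16_lut[nib[i] & 0x0f];
    }
    if (len & 1)                         /* odd length: last base is a high nibble */
        seq[len - 1] = nt16_lut[nib[nbytes] >> 4];
}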

Speed before:

Benchmark 1: ./test/bin/htsfile ~/test/nanopore_100000reads.u.bam  --view > /dev/null
  Time (mean ± σ):     893.6 ms ±   9.7 ms    [User: 770.5 ms, System: 122.9 ms]
  Range (min … max):   878.6 ms … 913.4 ms    10 runs

Speed after:

Benchmark 1: ./test/bin/htsfile ~/test/nanopore_100000reads.u.bam  --view > /dev/null
  Time (mean ± σ):     694.4 ms ±   6.7 ms    [User: 583.3 ms, System: 110.6 ms]
  Range (min … max):   682.1 ms … 703.0 ms    10 runs

Speed after, using SSSE3

Benchmark 1: ./test/bin/htsfile ~/test/nanopore_100000reads.u.bam  --view > /dev/null
  Time (mean ± σ):     524.7 ms ±  11.9 ms    [User: 400.3 ms, System: 124.3 ms]
  Range (min … max):   510.0 ms … 552.7 ms    10 runs

Both the non-SSSE3 and SSSE3 functions deliver the same htsfile output for all 100,000 reads tested (verified with a checksum).

@rhpvorderman changed the title from "Enable SSSE3 intrinsics for nibble parsing" to "Faster nibble parsing and a SSSE3 routine for even faster nibble parsing" on Sep 26, 2023
@daviesrob
Member

This looks interesting, but it would need some form of dynamic dispatch. While SSSE3 may be widespread now, we can't make any assumptions about the hardware in use.

@rhpvorderman
Contributor Author

@daviesrob Sure. I have been meaning to practice with dynamic dispatch for some time. In any case, the code is also quite a lot faster even without the SSSE3 routine. Would it be an idea to remove the SSSE3 code from this PR and do that later in a separate PR?

@rhpvorderman
Contributor Author

rhpvorderman commented Sep 26, 2023

Strangely enough, I retested on an Intel Core i7-6700 system and there the result was slower, unlike on my Ryzen 5 5650 system. So whether the while loop or the for loop is faster seems to depend on the CPU.

Here is the loop in assembly for this PR:

.L3:
        movzx   esi, BYTE PTR [rcx]
        add     rax, 2
        add     rcx, 1
        movzx   esi, WORD PTR code2base.0[rsi+rsi]
        mov     WORD PTR [rax-2], si
        cmp     rax, r8
        jb      .L3

Develop branch:

.L3:
        movzx   esi, BYTE PTR [rdi+rax]
        movzx   esi, WORD PTR code2base.0[rsi+rsi]
        mov     WORD PTR [rbp+0+rax*2], si
        add     rax, 1
        cmp     ecx, eax
        jg      .L3

I can't see why one branch would be faster than the other.

The SSSE3 routine is always faster, but indeed that requires some extra work.

@jkbonfield
Contributor

I was testing on Intel and found it's also highly dependent on compiler, and probably version.

Original scalar code, for 100k ONT reads (NB this nibble function only, perf counts): clang16 543, gcc13 413.
Your pointer-based scalar code: clang16 314, gcc13 553.
An unrolled alternative: clang16 325, gcc13 368.

The SSSE3 implementation of this function, however, was around 110 perf hits/counts for both compilers.
Still, I'm unsure whether the code is worth it, as for me at least it only amounts to a ~4% speed gain for the overall test_view program. What was in your BAM file? Is it just sequence, or does it also have quality values, alignments, aux tags, etc.? Also, what compiler and version are you using? What CPU?

The unrolled alternative looks better than both the original and the new variant as far as scalar code goes. It's a work in progress, but at the moment I have:

    const char *seq_end_ptr_8atatime = seq + (len & (~7ULL));                                        
    uint16_t *c16 = (uint16_t *)code2base;                                                           
    while (seq_cursor < seq_end_ptr_8atatime) {                                                      
        memcpy(seq_cursor+0*2, &c16[nibble_cursor[0]], 2);                                           
        memcpy(seq_cursor+1*2, &c16[nibble_cursor[1]], 2);                                           
        memcpy(seq_cursor+2*2, &c16[nibble_cursor[2]], 2);                                           
        memcpy(seq_cursor+3*2, &c16[nibble_cursor[3]], 2);                                           
        memcpy(seq_cursor+4*2, &c16[nibble_cursor[4]], 2);                                           
        memcpy(seq_cursor+5*2, &c16[nibble_cursor[5]], 2);                                           
        memcpy(seq_cursor+6*2, &c16[nibble_cursor[6]], 2);                                           
        memcpy(seq_cursor+7*2, &c16[nibble_cursor[7]], 2);                                           
                                                                                        
        seq_cursor += 16;                                                                            
        nibble_cursor += 8;                                                                          
    }                                                                                                

I can tidy that up a bit more, as it still has remnants of earlier experiments with uint16_ts in there.

I can test AMD too though as it sounds like that's different again.

@jkbonfield
Contributor

I checked an AMD EPYC 7713 too and found your new scalar code was considerably slower than the old scalar code. I recall exploring pointer vs array indexing in this code before and finding it highly variable across systems and compilers, so I think the old code was basically a happy medium of the lot.

However in all cases manual unrolling seems to win.

Curiously it's also important to cast code2base to uint16_t first, as without it the unrolled code takes 4 moves and 1 add per entry; with the cast it's 3 moves. I can't explain this, but it appears the *2 array index really is being computed that way (add-self is *2). Maybe the compiler is unsure of the alignment, but code2base is local and it ought to know this. Manual alignment directives may fix it (I didn't try), but the easy solution is the cast.

I.e. the code I quoted above.
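Concretely, the two indexing styles being compared look roughly like this (a sketch; the extern table declaration is an assumption, not htslib's actual definition):

#include <stdint.h>
#include <string.h>

extern const char code2base[512];   /* 256 two-character entries (assumed) */

/* char-table indexing: each lookup carries an explicit *2 (the add-self) */
static inline void store_pair_char(char *dst, uint8_t packed)
{
    memcpy(dst, &code2base[(size_t)packed * 2], 2);
}

/* uint16_t view: the *2 is folded into the element size */
static inline void store_pair_u16(char *dst, uint8_t packed)
{
    const uint16_t *c16 = (const uint16_t *)code2base;
    memcpy(dst, &c16[packed], 2);
}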

I tried doing it as a loop and letting it auto-unroll, but that demonstrates another issue:

#define NX 8
    const char *seq_end_ptr_8atatime = seq + (len & (~(NX-1)));                                      
    uint16_t *c16 = (uint16_t *)code2base;                                                           
    while (seq_cursor < seq_end_ptr_8atatime) {                                                      
        for (unsigned int i = 0; i < NX; i++)                                                        
            memcpy(seq_cursor+i*2, &c16[nibble_cursor[i]], 2);                                       
        seq_cursor += NX*2;                                                                          
        nibble_cursor += NX;                                                                         
    }                                                                                                

That should be the same as the manually unrolled one, but it runs at half speed and it doesn't unroll. To fix that we either need #pragma GCC ivdep to indicate there are no dependencies between loop iterations, or, more portably, add restrict to the seq and nib args to tell the compiler that they are non-overlapping memory objects.
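For example, applied to the loop above it would look roughly like this (a sketch; the signature is illustrative, and the remaining nbytes % NX entries are assumed to be handled by a scalar tail elsewhere):

#include <stdint.h>
#include <string.h>

#define NX 8

static void nibble2base_unrolled(const uint8_t *restrict nibble_cursor,
                                 char *restrict seq_cursor,
                                 const uint16_t *restrict c16,
                                 size_t nbytes)
{
    const char *seq_end = seq_cursor + 2 * (nbytes & ~(size_t)(NX - 1));
    while (seq_cursor < seq_end) {
        /* With restrict (or #pragma GCC ivdep on this inner loop) the
         * compiler can prove the stores don't alias the loads and is
         * free to unroll/vectorise. */
        for (unsigned int i = 0; i < NX; i++)
            memcpy(seq_cursor + i*2, &c16[nibble_cursor[i]], 2);
        seq_cursor += NX * 2;
        nibble_cursor += NX;
    }
}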

I think it's easier just to stick with the manually unrolled version for easy portability.

(Oddly clang doesn't need the restrict bit. Maybe it's generating two code paths and testing at run time? I didn't check the assembly output.)

@jkbonfield
Contributor

Finally, some full testing on the AMD system for 100k ONT records from uncompressed BAM to SAM.

Scalar unrolled code vs SSSE3 enabled: clang 16 was 15% faster and gcc13 was 10% faster.

So it does add up, but that's already pretty much the majority of the time spent in this function, so it's hard to see how it improves things to the extent you saw. I assume it's something specific about your data. Length of reads, existence of aux tags maybe?

@rhpvorderman
Contributor Author

The reads in this set are very long. They are from the GM24385_1.fastq.gz file (from the GIAB data indices; extra-long ONT reads). The average read length was over 7,000 bases. This data was converted to BAM using Picard FastqToSam in order to simulate uBAM. Because of the very long reads I think the algorithm has a chance to truly shine.

I wrote a little uBAM-only parser for dnaio (alignments are ignored) and found that the SSSE3 routine improved parsing speed about 20%, so that is concordant with your result. If I cheat and do not store the BAM in bgzip blocks (just decompress the bam with gzip), then the speedup is 33%. In my uBAM parser I also used SSE2 for adding +33 to all the qualities so that is less of a bottleneck.

How should I proceed from this point? It is clear that the scalar code in this PR gives mixed results depending on the hardware (so it is not worth it), while the SSSE3 code is phenomenally better but also much harder to integrate. 10-15% overall application performance from improving a single function is quite nice, so it would feel a bit wasteful not to do anything with it.

@jkbonfield
Contributor

My ONT file had an average read length of about 4.6kb, so a similar ballpark. You probably gained more because you didn't have any alignment fields or aux tags to print out, meaning sequence printing becomes more significant.

Agreed a 10-15% speed gain for this case is worth having, even more so if the circumstances make that a larger percentage. I'm not sure it's worth it if it's only included if the user builds using -march=native or -mssse3 though.

I haven't tried it, but there is this: https://gcc.gnu.org/wiki/FunctionMultiVersioning and more blurb at https://johnnysswlab.com/cpu-dispatching-make-your-code-both-portable-and-fast/

It'd need appropriate compiler guards to ensure it's only there on compilers that support it (see htslib/hts_defs.h for examples).
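The simplest portable form, as an alternative to full multi-versioning, is a one-off runtime check. A minimal sketch (the HAVE_SSSE3_NIBBLE guard and the function names are hypothetical, not htslib's):

#include <stddef.h>
#include <stdint.h>

/* Per-ISA implementations, assumed to be defined elsewhere. */
void nibble2base_scalar(const uint8_t *nib, char *seq, size_t len);
#ifdef HAVE_SSSE3_NIBBLE
void nibble2base_ssse3(const uint8_t *nib, char *seq, size_t len);
#endif

static inline void nibble2base(const uint8_t *nib, char *seq, size_t len)
{
#if defined(HAVE_SSSE3_NIBBLE) && defined(__GNUC__) && \
    (defined(__x86_64__) || defined(__i386__))
    /* __builtin_cpu_supports is available in gcc >= 4.8 and clang. */
    if (__builtin_cpu_supports("ssse3")) {
        nibble2base_ssse3(nib, seq, len);
        return;
    }
#endif
    nibble2base_scalar(nib, seq, len);
}

For sequences of a few hundred bases or more, the extra branch per call is negligible.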

For htscodecs I went with the more complex dispatching route, but primarily because when I started it I wasn't aware of the builtin methods. Since then though it's turned out to be helpful because I have multiple versions and it's not always best to use the highest SIMD implementation on all platforms. That's over the top though for this function I think.

Regarding the qual+33, I'm surprised it's not auto-vectorised. If not then maybe this is a case where we ought to be using restrict.

@jkbonfield
Contributor

jkbonfield commented Sep 27, 2023

Regarding the qual+33, I'm surprised it's not auto-vectorised. If not then maybe this is a case where we ought to be using restrict.

It turns out that gcc really hates using SIMD in loops. I've absolutely no idea why.

@ seq4d[samtools.../htslib]; cat _.c
void add33(char *restrict a, char *restrict b, int len) {
    for (int i = 0; i < len; i++)
        a[i] = b[i] + 33;
}

@ seq4d[samtools.../htslib]; gcc13 -fopt-info-loop -O2 -fopt-info-vec-missed -Drestrict="" -c _.c
_.c:2:23: missed: couldn't vectorize loop
_.c:2:23: missed: would need a runtime alias check

@ seq4d[samtools.../htslib]; gcc13 -fopt-info-loop -O2 -fopt-info-vec-missed -c _.c
_.c:2:23: missed: couldn't vectorize loop
_.c:2:23: missed: Loop costings not worthwhile.

@ seq4d[samtools.../htslib]1; gcc13 -fopt-info-loop -O2 -fopt-info-vec-missed -ftree-loop-vectorize -c _.c
_.c:2:23: optimized: loop vectorized using 16 byte vectors
_.c:2:23: optimized: loop vectorized using 8 byte vectors
_.c:1:6: optimized: loop turned into non-loop; it never loops

The first compilation is disabling restrict and asking what gcc thinks. It notices an alias issue and refuses to vectorise. Fine - that's what restrict is for. (Although I'll note clang builds vectorised and non-vectorised versions with a run-time check for aliasing.)

The second compilation line is the one that astounded me. It just doesn't believe vectorising is worthwhile. Whatever costing model it uses is clearly not fit for purpose here. Explicitly telling it to use the tree-loop-vectorize option fixes this (it's also part of -O3), but it does amaze me that it's not automatic.

Adding __attribute__((optimize("O3"))) around that function also rescues this so the default optimisation levels we use build sensible code.
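Applied to the earlier example, that looks like the snippet below (GCC-specific; real code would presumably hide the attribute behind a feature-test macro along the lines of those in htslib/hts_defs.h):

__attribute__((optimize("O3")))
void add33(char *restrict a, char *restrict b, int len) {
    for (int i = 0; i < len; i++)
        a[i] = b[i] + 33;
}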

Comparing this with and without the -O3 attribute, my test_view command turning uncompressed BAM into SAM went from 3.9 s to 3.2 s. So a full 20% speed gain across the entire program.

I rebuilt both versions (with and without -O3) with -fno-inline so I could report just that one function. Auto-vectorisation of this code is massively beneficial, as we'd expect.

$ for i in `seq 1 5`;do taskset -c 20 perf record ./test/test_view -N 100000 /tmp/_.bam -p /dev/null 2>/dev/null;perf report -n 2>/dev/null| grep add33;done
    27.04%          1302  test_view  test_view           [.] add33
    27.80%          1268  test_view  test_view           [.] add33
    27.86%          1261  test_view  test_view           [.] add33
    27.07%          1222  test_view  test_view           [.] add33
    27.09%          1225  test_view  test_view           [.] add33

$ for i in `seq 1 5`;do taskset -c 20 perf record ./test/test_view -N 100000 /tmp/_.bam -p /dev/null 2>/dev/null;perf report -n 2>/dev/null| grep add33;done
     2.26%            80  test_view  test_view           [.] add33
     2.33%            84  test_view  test_view           [.] add33
     2.56%            91  test_view  test_view           [.] add33
     2.25%            80  test_view  test_view           [.] add33
     2.23%            80  test_view  test_view           [.] add33

That's not even remotely close! I don't want to add explicit vectorisation code, but clearly supplying some hints to the compiler is appropriate. #pragma GCC ivdep isn't enough either, as that's much the same as restrict. Maybe gcc just doesn't know how many loop iterations to expect?

An alternative implementation, rather brain dead, does get vectorised, but only with gcc 12 onwards. Otherwise it's still rubbish:

#include <stdint.h>   /* for uint8_t */

void add33(uint8_t *restrict a, uint8_t *restrict b, int len) {
    int i, j;
#define NSTEP 16

    // Unrolled inner loop with fixed size to aid gcc -O2
    for (i = 0; i < len-NSTEP; i+=NSTEP)
        for (j = 0; j < NSTEP; j++)
            a[i+j] = b[i+j] + 33;

    // remainder
    for (; i < len; i++)
        a[i] = b[i] + 33;
}

@rhpvorderman
Contributor Author

Yeah I don't usually bother with the compiler optimizations and go straight for the SIMD. This is my function:

#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

static void
decode_bam_qualities(uint8_t *dest, const uint8_t *encoded_qualities, size_t length)
{
    const uint8_t *end_ptr = encoded_qualities + length;
    const uint8_t *cursor = encoded_qualities;
    uint8_t *dest_cursor = dest;
    #ifdef __SSE2__
    /* Vectorised path: 16 quality values per iteration. */
    const uint8_t *vec_end_ptr = end_ptr - sizeof(__m128i);
    while (cursor < vec_end_ptr) {
        __m128i quals = _mm_loadu_si128((const __m128i *)cursor);
        __m128i phreds = _mm_add_epi8(quals, _mm_set1_epi8(33));
        _mm_storeu_si128((__m128i *)dest_cursor, phreds);
        cursor += sizeof(__m128i);
        dest_cursor += sizeof(__m128i);
    }
    #endif
    /* Scalar fallback and tail. */
    while (cursor < end_ptr) {
        *dest_cursor = *cursor + 33;
        cursor += 1;
        dest_cursor += 1;
    }
}

Because of the compile guards this still compiles to slower but working code on platforms without SSE2. Since SSE2 is part of the x86_64 spec, it is automatically compiled in on that platform (which is 90%+ of the target audience anyway, so I'm quite happy with that).
I am very thankful for your digressions into making the compiler behave; they are very instructive to me. I am quite new to this area of performance programming, especially when it comes to dynamic dispatch and coaxing the compiler.

@jkbonfield
Contributor

My own preference is where possible to avoid explicit intrinsics, as it doesn't help ARM for example. There are generalised vectorisation libraries, but then we get additional code complexities (although some are as simple as a header file you copy which isn't so bad). Best though is working out how to get the compiler to do the work for us.

I added #1679 to beat the qual+33 problems into submission, but the nibble-to-ASCII sequence conversion is your PR here. I'll let you progress with that one.

I think it would be good though to improve both the scalar version (eg my unrolled form) as well as to incorporate the SSSE3 variant.

@rhpvorderman
Contributor Author

My own preference is where possible to avoid explicit intrinsics, as it doesn't help ARM for example. There are generalised vectorisation libraries, but then we get additional code complexities (although some are as simple as a header file you copy which isn't so bad). Best though is working out how to get the compiler to do the work for us.

Yes, but that only works well for trivial problems. Writing the nibble decoding in such a way that the compiler automatically comes up with the instruction sequence presented here is not feasible. Luckily, most problems are trivial.

For this function I will try out dynamic dispatch. I think it is a good fit for this problem, as the dispatch overhead won't be that bad given that sequences are usually at least 150 bp, so compiler inlining won't save the day here. It is also a nice opportunity for me to learn about dynamic dispatch and to have someone experienced review the code.

It might take me a few days to get to this though, if you don't mind.

@jkbonfield
Contributor

Is this still being worked on?

We cannot accept this as it currently stands, as it'd need CPU detection or some auto-dispatch mechanism so that the binaries are compatible across hardware platforms. At the very least, we could consider a non-SSSE3 variant if there is better auto-vectorised or optimisable code.

@rhpvorderman
Contributor Author

Hi, this is still on my todo list, but work has been tremendously busy at the moment. Feel free to close this and I can make another attempt using dynamic dispatch at a later date, unless you want to implement that yourself of course.

@daviesrob
Member

Ok, I'll close this for now. Feel free to open a new pull request if you come up with a new version.
