Improve sha256 performance on ppc64 by 4.5x #394

gut · 2017-03-20T11:58:40Z

I removed the other hash algorithms from test.cpp and used a ~700MB file as input. The results are below.

Crypto++:

$ time ./cryptest.exe m ~/ubuntu-16.10-server-ppc64el.iso
SHA-256: d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e

real    0m18.811s
user    0m18.456s
sys     0m0.356s

This patch:

$ time ./cryptest.exe m ~/ubuntu-16.10-server-ppc64el.iso
SHA-256: d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e

real    0m4.158s
user    0m3.992s
sys     0m0.168s

This approach used Altivec + VSX instructions found on POWER8 systems and newer.

If unwanted on PPC, "cmake" it with -DDISABLE_ALTIVEC=1 or "make" it with USE_ALTIVEC=0.
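As a quick check of the headline figure, the ratio of the two timings can be computed directly. This is only illustrative arithmetic on the `time` output from the two runs above; the helper name `speedup` is made up for this example. It works out to roughly 4.5x on wall-clock time and roughly 4.6x on user time.

```cpp
#include <cstdio>

// Ratio of baseline time to patched time (hypothetical helper, not part of Crypto++).
constexpr double speedup(double baseline_seconds, double patched_seconds)
{
    return baseline_seconds / patched_seconds;
}

// Print the wall-clock and user-time ratios for the two runs quoted above.
inline void print_speedups()
{
    std::printf("real: %.2fx\n", speedup(18.811, 4.158));
    std::printf("user: %.2fx\n", speedup(18.456, 3.992));
}
```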

noloader · 2017-03-20T15:06:16Z

Ack, thanks @gut. Give me a couple of days for the cut-in.

noloader · 2017-09-15T07:00:31Z

@gut,

We got our first Power8 implementation behind us; see Add Power8 AES Encryption. AES performance is very impressive. It's about 20x faster than C/C++.

We'd like to pursue SHA now. Do you have some spare cycles to work with us?

gut · 2017-09-19T14:12:10Z

Hi!
Great! Huge boost!
What do you have in mind? The whole SHA-2? This pull request added only SHA-256.

noloader · 2017-09-22T12:17:13Z

@gut,

Thanks for the continued interest, and sorry about the late reply. We are in a much better position to support you now.

> What do you have in mind? The whole SHA-2? This pull request added only SHA-256.

As much or as little as you feel like working on.

Eventually it has to get done, so it does not matter to me if you or I do it. I think I would prefer you to cut-in SHA256. I'm guessing you can do it much quicker than me.

Are you OK with placing your work in public domain so no license is required?

We started tracking the SHA addition at Issue 513, Add Power8 SHA Hashing. It's got several supporting commits based on our experience with the AES cut-in.

The first two commits are the library abstracting Linux, AIX, GCC and XLC and hiding the differences behind a common interface in a single header. The single header is ppc-crypto.h, and I already updated the documentation. Effectively we cleaned up the mess you would encounter in C/C++. It also handles that cursed vec_sld (vsldoi) that breaks on little endian.

The second is, the header now has VectorSHA256 and VectorSHA512 for you. VectorSHA256 and VectorSHA512 are template functions due to the "compile time constant" requirement. You would call them like so, depending on whether you wanted sigma0, sigma1, Sigma0 or Sigma1.

uint8x16_p8 x;
x = VectorSHA256<0,0>(x);
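For reference, the four functions named above are defined in FIPS 180-4 as combinations of rotates and shifts on 32-bit words. The portable sketch below mirrors the two-constant template style of the call above. It assumes the first template argument selects lower-case versus upper-case sigma and the second selects the 0 or 1 variant, which, to my understanding, is how the POWER8 vshasigmaw instruction uses its ST and SIX fields. The names `rotr32` and `ScalarSHA256` are hypothetical and are not part of ppc-crypto.h.

```cpp
#include <cstdint>

// Rotate a 32-bit word right by n bits (0 < n < 32).
constexpr std::uint32_t rotr32(std::uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

// Scalar stand-in for one 32-bit lane (hypothetical; not the ppc-crypto.h API).
// Upper == 0 selects sigma0/sigma1, Upper != 0 selects Sigma0/Sigma1.
// Which == 0 selects the "0" variant, Which != 0 selects the "1" variant.
template <unsigned Upper, unsigned Which>
constexpr std::uint32_t ScalarSHA256(std::uint32_t x)
{
    return Upper == 0
        ? (Which == 0
            ? rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3)        // sigma0
            : rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10))     // sigma1
        : (Which == 0
            ? rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22)   // Sigma0
            : rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25)); // Sigma1
}

// Both selectors are template arguments, so each instantiation is fixed at compile time.
static_assert(ScalarSHA256<0, 0>(1u) == 0x02004000u, "sigma0(1)");
static_assert(ScalarSHA256<1, 1>(1u) == 0x04200080u, "Sigma1(1)");
```

Making the selectors template arguments rather than function arguments is what satisfies a "compile time constant" requirement: the compiler knows both values when it instantiates the function.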

The third is, we stubbed-out SHA for you so you don't need to hijack C/C++. You need to open config.h and uncomment CRYPTOPP_POWER8_SHA_AVAILABLE. Then, work your magic in sha-simd.cpp.

Don't worry about CMake. We removed support at Issue 506, Remove CMake from library sources. Also see CMake | Removal on the wiki for the back story, if interested.

All you need to do is (assuming you have 80 or 160 cores):

# Build with default compiler
make -j 80

# Build with GCC
CXX=g++ make -j 80

# Build with default XL C/C++
CXX=xlC make -j 80

You will see either:

g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -mcpu=power8 -maltivec -c sha-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c sha.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -mcpu=power8 -maltivec -c shacal2-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c shacal2.cpp
...

Or:

xlC -DNDEBUG -g2 -O3 -qpic -qrtti -qarch=pwr8 -qaltivec -c sha-simd.cpp
xlC -DNDEBUG -g2 -O3 -qpic -qrtti -c sha.cpp
xlC -DNDEBUG -g2 -O3 -qpic -qrtti -qarch=pwr8 -qaltivec -c shacal2-simd.cpp
xlC -DNDEBUG -g2 -O3 -qpic -qrtti -c shacal2.cpp
...
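Because only the *-simd.cpp files receive the POWER8 flags, code inside them can key off the compiler's predefined macros. The sketch below reports what the current translation unit was built for. `__ALTIVEC__`, `__VSX__` and `_ARCH_PWR8` are macros GCC predefines for PowerPC targets; I assume XL C/C++ behaves similarly, but check its documentation. The function name `PowerTargetSummary` is hypothetical.

```cpp
#include <string>

// Summarize which POWER-related predefined macros are visible in this
// translation unit (hypothetical helper, for illustration only).
inline std::string PowerTargetSummary()
{
    std::string s;
#if defined(__ALTIVEC__)
    s += "altivec ";
#endif
#if defined(__VSX__)
    s += "vsx ";
#endif
#if defined(_ARCH_PWR8)
    s += "power8 ";
#endif
    if (s.empty())
        s = "none";  // e.g. a file built without the POWER8 flags, or a non-PowerPC host
    return s;
}
```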

As far as the current commits go, you can reset your clone with the following. You will be even with Wei's Master.

cd cryptopp
cp TestScripts/reset-fork.sh .
./reset-fork.sh

gut · 2017-09-22T14:17:35Z

Hi @noloader
I can't really tell right now whether my company allows placing it in the public domain, but probably not. I'll get back to you on that soon.

Either way, our SIMD implementation is available, at least for reference purposes, on https://github.com/PPC64/sha2-le so you can feel free to use that.

Also, I don't know if I'll be able to help with porting SHA-512, as my project priorities shifted a bit and I'm not looking into these things at the moment. So it'll be up to you, ok?

Thanks for getting back!

noloader · 2018-01-21T14:32:07Z

Closing again. We still did not get to it. It will be on the top of the TODO list.

noloader · 2018-03-08T04:24:19Z

@gut,

We finally rolled an implementation using Power8 SHA intrinsics. See my incubator at Noloader | SHA-Intrinsics.

The performance absolutely sucks. I think your implementation probably outperforms the SHA intrinsic implementation. IBM should be ashamed of themselves for such an under-performing waste of time.

Related, the implementation might be useful as a datapoint for PPC64 | Issue 3.

gut · 2018-03-08T11:12:53Z

It does suck, but it may also be that we used it incorrectly (think about inline assembly barriers and how the compiler is extra careful when handling that code). It was a long time ago, and I really don't remember how far we got with the investigation; mostly we used disassembly to find out why that code was inefficient.

In the end we opted to write the whole thing in assembly ourselves, and it is now preprocessed with GNU m4 to match the architecture configuration (such as endianness and alignment of the input data):
https://github.com/PPC64/sha2-le/blob/master/sha256_compress_ppc.m4#L4-L12

noloader · 2018-03-08T13:03:55Z

@gut,

> It does suck, but it also may be that we used it incorrectly...

Yeah, I agree with you. The problem is IBM did not tell anyone how to use it. The company did not provide a whitepaper, and the company did not provide a reference implementation. Even Andy's implementation for OpenSSL underperforms.

I also made the effort to contact the folks on the IBM team who supply the GCC patches. The GCC team received the email and passed it on to the appropriate team. No one responded or supplied the requested information. The company just ignored the requests. As best I can tell, they don't have a whitepaper or a reference implementation.

The onus is on IBM to provide the necessary documentation.

noloader · 2018-03-10T21:45:34Z

Thanks again @gut.

We checked-in an intrinsic/built-in based implementation at Commit 0630d46fe828.

We know the little-endian numbers are off because the compiler is screwing up the loads. Also see GCC Issue 84753 - GCC does not fold xxswapd followed by vperm. What is happening is, the compiler is generating lvx followed by xxswapd and xxlnand to correct endianness, and then it is applying our vperm mask. So there are four instructions to load and permute an arrangement.
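The permute is needed because SHA-256 interprets each 32-bit message word as big-endian, so a little-endian machine has to reverse the bytes of every word after loading. The scalar sketch below shows the per-word result that the vector load plus vperm is meant to produce. `LoadBE32` is a hypothetical helper name, not a Crypto++ function.

```cpp
#include <cstdint>

// Assemble a 32-bit word from four bytes in big-endian order, independent
// of the host byte order (hypothetical helper, for illustration only).
inline std::uint32_t LoadBE32(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) <<  8) |
            static_cast<std::uint32_t>(p[3]);
}
```

On a big-endian host this is equivalent to a plain word load, which is consistent with big-endian builds not needing the extra swap.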

The big-endian numbers look a lot better but they probably lag behind your implementation. Big-endian does not suffer the little-endian brain dead-ness. It looks like there's some inefficiency in the way registers are allocated and offloaded. I don't know how to control it from C/C++ code.

gut · 2018-03-13T12:54:52Z

@lbianc and @leitao: ping. Please take a closer look at this.

gut mentioned this pull request Mar 20, 2017

Add ppc64le (POWER8 little endian) as supported cpu randombit/botan#929

Merged

gut closed this Jul 13, 2017

noloader reopened this Sep 22, 2017

noloader closed this Jan 21, 2018
