Improve sha256 performance on ppc64 by 4.5x #394

Closed

@gut gut commented Mar 20, 2017

I removed the other hash algorithms from test.cpp and used a ~700MB file as input. Below are the results.

Crypto++:

$ time ./cryptest.exe m ~/ubuntu-16.10-server-ppc64el.iso
SHA-256: d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e

real    0m18.811s
user    0m18.456s
sys     0m0.356s

This patch:

$ time ./cryptest.exe m ~/ubuntu-16.10-server-ppc64el.iso
SHA-256: d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e

real    0m4.158s
user    0m3.992s
sys     0m0.168s
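
As a quick check of the headline figure, the ratio of the two "real" times above can be computed directly. This is only a sketch: the function names are made up for illustration, and the 700 MB input size is the approximate figure quoted above, not an exact measurement.

```cpp
#include <cstdio>

// Wall-clock ("real") times copied from the two runs above, in seconds.
constexpr double kBaselineSeconds = 18.811;
constexpr double kPatchedSeconds  = 4.158;

// Assumption: the input is treated as exactly 700 MB, although the
// original only says "~700MB".
constexpr double kInputMegabytes = 700.0;

// Speedup is the baseline time divided by the patched time.
constexpr double Speedup(double baseline, double patched) {
    return baseline / patched;
}

// Approximate hashing throughput in MB/s.
constexpr double ThroughputMBps(double megabytes, double seconds) {
    return megabytes / seconds;
}

// Prints the derived figures; call this from main() if desired.
inline void PrintSummary() {
    std::printf("speedup:  %.2fx\n", Speedup(kBaselineSeconds, kPatchedSeconds));
    std::printf("baseline: %.1f MB/s\n", ThroughputMBps(kInputMegabytes, kBaselineSeconds));
    std::printf("patched:  %.1f MB/s\n", ThroughputMBps(kInputMegabytes, kPatchedSeconds));
}
```

The ratio comes out at a little over 4.5, which matches the figure in the pull request title.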

This approach used Altivec + VSX instructions found on POWER8 systems and newer.

If unwanted on PPC, "cmake" it with -DDISABLE_ALTIVEC=1 or "make" it
with USE_ALTIVEC=0.
@noloader
Collaborator

Ack, thanks @gut. Give me a couple of days for the cut-in.

@gut gut closed this Jul 13, 2017
@noloader
Collaborator

noloader commented Sep 15, 2017

@gut,

We got our first Power8 implementation behind us; see Add Power8 AES Encryption. AES performance is very impressive. It's about 20x faster than C/C++.

We'd like to pursue SHA now. Do you have some spare cycles to work with us?

@gut
Author

gut commented Sep 19, 2017

Hi!
Great! Huge boost!
What do you have in mind? The whole SHA-2? This pull request added only SHA-256.

@noloader
Collaborator

noloader commented Sep 22, 2017

@gut,

Thanks for the continued interest, and sorry about the late reply. We are in a much better position to support you now.

What do you have in mind? The whole SHA-2? This pull request added only SHA-256.

As much or as little as you feel like working on.

Eventually it has to get done, so it does not matter to me if you or I do it. I think I would prefer you to cut in SHA256. I'm guessing you can do it much quicker than me.

Are you OK with placing your work in public domain so no license is required?


We started tracking the SHA addition at Issue 513, Add Power8 SHA Hashing. It has several supporting commits based on our experience with the AES cut-in.

First, two commits make the library abstract Linux, AIX, GCC and XLC, hiding the differences behind a common interface in a single header. The single header is ppc-crypto.h, and I already updated the documentation. Effectively we cleaned up the mess you would encounter in C/C++. It also handles that cursed vec_sld (vsldoi) that breaks on little endian.

The second is, the header now has VectorSHA256 and VectorSHA512 for you. VectorSHA256 and VectorSHA512 are template functions due to the "compile time constant" requirement. You would call them like so, depending on whether you wanted sigma0, sigma1, Sigma0 or Sigma1.

uint8x16_p8 x;
x = VectorSHA256<0,0>(x);
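
For reference, the four SHA-256 functions mentioned above can be modelled in portable scalar code. The sketch below is not the ppc-crypto.h implementation: Sha256Sigma and Rotr are hypothetical names, the function works on a single 32-bit word rather than a vector, and the meaning given to the two template arguments (first picks lower-case or upper-case, second picks the 0 or 1 variant) is an assumption made for illustration. The rotation and shift amounts are the standard SHA-256 ones.

```cpp
#include <cstdint>

// Rotate a 32-bit word right by n bits (n must be in 1..31).
constexpr std::uint32_t Rotr(std::uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

// Hypothetical scalar model of the sigma selection.
// Assumption: Upper == 0 selects sigma0/sigma1 (message schedule),
// Upper == 1 selects Sigma0/Sigma1 (compression round);
// Index selects the 0 or 1 variant.
template <unsigned Upper, unsigned Index>
constexpr std::uint32_t Sha256Sigma(std::uint32_t x) {
    static_assert(Upper <= 1 && Index <= 1, "selectors must be 0 or 1");
    return Upper == 0
        ? (Index == 0
               ? (Rotr(x, 7) ^ Rotr(x, 18) ^ (x >> 3))      // sigma0
               : (Rotr(x, 17) ^ Rotr(x, 19) ^ (x >> 10)))   // sigma1
        : (Index == 0
               ? (Rotr(x, 2) ^ Rotr(x, 13) ^ Rotr(x, 22))   // Sigma0
               : (Rotr(x, 6) ^ Rotr(x, 11) ^ Rotr(x, 25))); // Sigma1
}
```

Because the selectors are template arguments, they are fixed at compile time, which mirrors the "compile time constant" requirement described above. A call such as Sha256Sigma<1,0>(a) would then compute Sigma0 of a under this assumed mapping.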

The third is, we stubbed-out SHA for you so you don't need to hijack C/C++. You need to open config.h and uncomment CRYPTOPP_POWER8_SHA_AVAILABLE. Then, work your magic in sha-simd.cpp.
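
The general pattern of gating an implementation behind a config macro might look like the sketch below. Only the macro name CRYPTOPP_POWER8_SHA_AVAILABLE comes from the comment above; the enum and both functions are hypothetical, and the real config.h and sha-simd.cpp are likely organized differently.

```cpp
// Hypothetical labels for the two code paths.
enum class ShaBackend { Portable, Power8 };

// Reports which SHA-256 path this translation unit was built with.
// The macro would normally be defined (uncommented) in config.h.
constexpr ShaBackend SelectedSha256Backend() {
#if defined(CRYPTOPP_POWER8_SHA_AVAILABLE)
    return ShaBackend::Power8;    // hardware-accelerated path compiled in
#else
    return ShaBackend::Portable;  // plain C/C++ fallback
#endif
}

// Human-readable name, e.g. for a self-test or benchmark banner.
constexpr const char* BackendName(ShaBackend b) {
    return b == ShaBackend::Power8 ? "Power8" : "Portable";
}
```

With the macro left commented out, the portable path is selected and the C/C++ code stays untouched, which is the point of the stub.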


Don't worry about CMake. We removed support at Issue 506, Remove CMake from library sources. Also see CMake | Removal on the wiki for the back story, if interested.

All you need to do is (assuming you have 80 or 160 cores):

# Build with default compiler
make -j 80

# Build with GCC
CXX=g++ make -j 80

# Build with default XL C/C++
CXX=xlC make -j 80

You will see either:

g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -mcpu=power8 -maltivec -c sha-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c sha.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -mcpu=power8 -maltivec -c shacal2-simd.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c shacal2.cpp
...

Or:

xlC -DNDEBUG -g2 -O3 -qpic -qrtti -qarch=pwr8 -qaltivec -c sha-simd.cpp
xlC -DNDEBUG -g2 -O3 -qpic -qrtti -c sha.cpp
xlC -DNDEBUG -g2 -O3 -qpic -qrtti -qarch=pwr8 -qaltivec -c shacal2-simd.cpp
xlC -DNDEBUG -g2 -O3 -qpic -qrtti -c shacal2.cpp
...

As far as the current commits go, you can reset your clone with the following. You will be even with Wei's Master.

cd cryptopp
cp TestScripts/reset-fork.sh .
./reset-fork.sh

@noloader noloader reopened this Sep 22, 2017
@gut
Author

gut commented Sep 22, 2017

Hi @noloader
I can't really tell right now if my company allows me to release it into the public domain, but probably not. I'll get back to you on that soon.

Either way, our SIMD implementation is available, at least for reference purposes, on https://github.com/PPC64/sha2-le so you can feel free to use that.

Also, I don't know if I'll be able to help port SHA-512, as my project priorities have shifted a bit and I'm not looking into these things at the moment. So it'll be up to you, ok?

Thanks for getting back!

@noloader
Collaborator

Closing again. We still did not get to it. It will be on the top of the TODO list.

@noloader noloader closed this Jan 21, 2018
@noloader
Collaborator

noloader commented Mar 8, 2018

@gut,

We finally rolled an implementation using Power8 SHA intrinsics. See my incubator at Noloader | SHA-Intrinsics.

The performance absolutely sucks. I think your implementation probably outperforms the SHA intrinsic implementation. IBM should be ashamed of themselves for such an under-performing waste of time.

Related, the implementation might be useful as a datapoint for PPC64 | Issue 3.

@gut
Author

gut commented Mar 8, 2018

It does suck, but it may also be that we used it incorrectly (think about inline assembly barriers and how the compiler is extra careful when handling that code). As it was a long time ago, I really don't remember how far we got with this investigation (mostly we used disassembly to find out why that code was inefficient).

In the end we opted to write the whole thing in assembly ourselves, and it's now preprocessed with GNU m4 to match the architecture configuration (like endianness and alignment of input data):
https://github.com/PPC64/sha2-le/blob/master/sha256_compress_ppc.m4#L4-L12

@noloader
Collaborator

noloader commented Mar 8, 2018

@gut,

It does suck, but it may also be that we used it incorrectly...

Yeah, I agree with you. The problem is IBM did not tell anyone how to use it. The company did not provide a whitepaper, and the company did not provide a reference implementation. Even Andy's implementation for OpenSSL underperforms.

I also made the effort to contact the folks on the IBM team who supply the GCC patches. The GCC team received the email and passed it on to the appropriate team. No one responded or supplied the requested information. The company just ignored the requests. As best I can tell, they don't have a whitepaper or a reference implementation.

The onus is on IBM to provide the necessary documentation.

@noloader
Collaborator

noloader commented Mar 10, 2018

Thanks again @gut.

We checked-in an intrinsic/built-in based implementation at Commit 0630d46fe828.

We know the little-endian numbers are off because the compiler is screwing up the loads. Also see GCC Issue 84753 - GCC does not fold xxswapd followed by vperm. What is happening is, the compiler is generating lvx followed by xxswapd and xxlnand to correct endianness, and then it is applying our vperm mask. So there are four instructions to load and permute an arrangement.
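
For context on why a permute is needed at all: SHA-256 interprets its input as big-endian 32-bit words, so on a little-endian machine the bytes of each word have to be reordered after loading. The scalar sketch below shows that reordering for one word. It is only a model of the end result, presumably what the load plus vperm mask achieves across a whole vector; LoadBigEndian32 and kExampleBytes are hypothetical names and are not taken from the commit.

```cpp
#include <cstdint>

// Assemble one big-endian 32-bit message word from four bytes.
// The result is the same on big- and little-endian hosts because
// the byte order is spelled out with shifts rather than a raw load.
constexpr std::uint32_t LoadBigEndian32(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

// Example input: the byte sequence 01 02 03 04 as it sits in memory.
constexpr std::uint8_t kExampleBytes[4] = {0x01, 0x02, 0x03, 0x04};
```

A raw 32-bit load of kExampleBytes on a little-endian host would yield 0x04030201, whereas the hash needs 0x01020304. Ideally the vector code would get that reordering from a single permute per load rather than the four-instruction sequence described above.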

The big-endian numbers look a lot better but they probably lag behind your implementation. Big-endian does not suffer the little-endian brain dead-ness. It looks like there's some inefficiency in the way registers are allocated and offloaded. I don't know how to control it from C/C++ code.

@gut
Author

gut commented Mar 13, 2018

@lbianc and @leitao : ping. Please take a closer look at this.
