Improve sha256 performance on ppc64 by 4.5x #394
Results: I removed the other hash algorithms from test.cpp and gave it a ~700MB file as input.

Upstream:

    $ time ./cryptest.exe m ~/ubuntu-16.10-server-ppc64el.iso
    SHA-256: d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e
    real 0m18.811s
    user 0m18.456s
    sys 0m0.356s

This patch:

    $ time ./cryptest.exe m ~/ubuntu-16.10-server-ppc64el.iso
    SHA-256: d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e
    real 0m4.158s
    user 0m3.992s
    sys 0m0.168s

This approach uses AltiVec + VSX instructions found on POWER8 systems and newer. If unwanted on PPC, "cmake" it with -DDISABLE_ALTIVEC=1 or "make" it with USE_ALTIVEC=0.
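For readers who want to check the headline figure, here is a minimal sketch of the arithmetic behind the "4.5x" claim, using the wall-clock ("real") times reported above. The helper names `speedup` and `throughput_mb_per_s` are made up for this illustration, and the input size is assumed to be exactly 700 MB even though the post only says "~700MB".

```cpp
#include <cassert>

// Ratio of the old wall-clock time to the new one.
// Both arguments are in seconds and must be positive.
inline double speedup(double before_seconds, double after_seconds) {
    assert(before_seconds > 0.0 && after_seconds > 0.0);
    return before_seconds / after_seconds;
}

// Rough hashing throughput in MB/s for a given input size and run time.
// Assumption: the caller supplies the size in megabytes; the post only
// gives an approximate size, so the result is an estimate.
inline double throughput_mb_per_s(double megabytes, double seconds) {
    assert(seconds > 0.0);
    return megabytes / seconds;
}
```

Calling `speedup(18.811, 4.158)` gives a ratio a little above 4.5, which matches the title. Note that the times include file I/O and process startup, so the speedup of the SHA-256 compression function alone may differ.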
Ack, thanks @gut. Give me a couple of days for the cut-in.
@gut, We got our first Power8 implementation behind us; see Add Power8 AES Encryption. AES performance is very impressive. It's about 20x faster than C/C++. We'd like to pursue SHA now. Do you have some spare cycles to work with us?
Hi!
@gut, Thanks for the continued interest, and sorry about the late reply. We are in a much better position to support you now.
As much or as little as you feel like working on. Eventually it has to get done, so it does not matter to me if you or I do it. I think I would prefer you to cut-in SHA256. I'm guessing you can do it much quicker than me. Are you OK with placing your work in the public domain so no license is required? We started tracking the SHA addition at Issue 513, Add Power8 SHA Hashing. It has several supporting commits based on our experience with the AES cut-in. The first two commits are the library abstracting Linux, AIX, GCC and XLC and hiding the differences behind a common interface in a single header. The single header is The second is, the header now has
The third is, we stubbed-out SHA for you so you don't need to hijack C/C++. You need to open Don't worry about CMake. We removed support at Issue 506, Remove CMake from library sources. Also see CMake | Removal on the wiki for the back story, if interested. All you need to do is (assuming you have 80 or 160 cores):
You will see either:
Or:
As far as the current commits go, you can reset your clone with the following. You will be even with Wei's Master.
Hi @noloader Either way, our SIMD implementation is available, at least for reference purposes, at https://github.com/PPC64/sha2-le, so feel free to use that. Also, I don't know if I'll be able to help port SHA-512, as my project priorities have shifted a bit and I'm not looking into these things at the moment. So it'll be up to you, ok? Thanks for getting back!
Closing again. We still did not get to it. It will be on the top of the TODO list.
@gut, We finally rolled an implementation using Power8 SHA intrinsics. See my incubator at Noloader | SHA-Intrinsics. The performance absolutely sucks. I think your implementation probably outperforms the SHA intrinsic implementation. IBM should be ashamed of themselves for such an under-performing waste of time. Related, the implementation might be useful as a datapoint for PPC64 | Issue 3.
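For context on what the Power8 SHA intrinsics compute, here is a hedged sketch of the SHA-256 small sigma functions. It is not the code from the incubator or from sha2-le. The scalar path follows the definitions in FIPS 180-4. The vector path uses GCC's `__builtin_crypto_vshasigmaw` built-in and is only compiled when the compiler defines `__CRYPTO__` (an assumption about how POWER8 crypto support is detected; adjust for your toolchain). The macro `SKETCH_HAVE_POWER8_SHA` and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__CRYPTO__) && defined(__ALTIVEC__)
# include <altivec.h>
# define SKETCH_HAVE_POWER8_SHA 1  // hypothetical macro, for this sketch only
#endif

// Rotate a 32-bit word right by n bits (0 < n < 32).
static inline std::uint32_t rotr32(std::uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

// SHA-256 sigma0: ROTR7 ^ ROTR18 ^ SHR3 (message schedule).
std::uint32_t sha256_sigma0(std::uint32_t x) {
#ifdef SKETCH_HAVE_POWER8_SHA
    // Second argument 0 selects the lowercase sigma functions;
    // third argument 0 selects sigma0 in every 32-bit lane.
    __vector unsigned int v = {x, x, x, x};
    v = __builtin_crypto_vshasigmaw(v, 0, 0);
    return v[0];
#else
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
#endif
}

// SHA-256 sigma1: ROTR17 ^ ROTR19 ^ SHR10 (message schedule).
std::uint32_t sha256_sigma1(std::uint32_t x) {
#ifdef SKETCH_HAVE_POWER8_SHA
    // Third argument 0xf selects sigma1 in every 32-bit lane.
    __vector unsigned int v = {x, x, x, x};
    v = __builtin_crypto_vshasigmaw(v, 0, 0xf);
    return v[0];
#else
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
#endif
}
```

Splatting one word across a vector and extracting a lane, as done here, is for illustration only. A real implementation would keep four schedule words in a vector register, and the surrounding loads, permutes and register allocation then dominate performance, which seems to be what the thread is wrestling with.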
It does suck, but it may also be that we used it incorrectly (think about inline assembly barriers and how the compiler is extra careful when handling that code). It was a long time ago, so I really don't remember how far we got with this investigation (mostly we used disassembly to find out why that code was inefficient). In the end we opted to write the whole thing in assembly ourselves, and it's now preprocessed with GNU m4 to match the architecture configuration (like endianness and alignment of input data):
@gut,
Yeah, I agree with you. The problem is IBM did not tell anyone how to use it. The company did not provide a whitepaper, and the company did not provide a reference implementation. Even Andy's implementation for OpenSSL underperforms. I also made the effort to contact the folks on the IBM team who supply the GCC patches. The GCC team received the email and passed it on to the appropriate team. No one responded back and supplied the requested information. The company just ignored the requests for the information. The best I can tell, they don't have a whitepaper or a reference implementation. The onus is on IBM to provide the necessary documentation.
Thanks again @gut. We checked-in an intrinsic/built-in based implementation at Commit 0630d46fe828. We know the little-endian numbers are off because the compiler is screwing up the loads. Also see GCC Issue 84753 - GCC does not fold xxswapd followed by vperm. What is happening is, the compiler is generating The big-endian numbers look a lot better but they probably lag behind your implementation. Big-endian does not suffer the little-endian brain dead-ness. It looks like there's some inefficiency in the way registers are allocated and offloaded. I don't know how to control it from C/C++ code.
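As background on why little-endian loads cost extra: SHA-256 interprets each 64-byte block as sixteen big-endian 32-bit words, so a little-endian machine has to reverse the bytes of every word it loads, while a big-endian machine can use the data as-is. The scalar sketch below shows that conversion in portable C++. It is an illustration of the byte-order requirement, not the vector load sequence from the commit, and the names `load_be32` and `load_block_words` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Read one 32-bit big-endian word from memory.
// Works the same on big- and little-endian hosts because it
// assembles the value byte by byte instead of casting the pointer.
static inline std::uint32_t load_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

// Fill the first 16 message-schedule words from one 64-byte block.
void load_block_words(const unsigned char block[64], std::uint32_t w[16]) {
    for (std::size_t i = 0; i < 16; ++i) {
        w[i] = load_be32(block + 4 * i);
    }
}
```

In a vector implementation the same reordering is typically done with a permute after the load, which is why an extra, unfolded swap in the generated code shows up directly in the little-endian numbers.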