Cleanup of the switch from yescrypt to yespower #17

solardiz · 2018-11-05T16:00:21Z

I thought I'd only drop the misattribution from yespower.h and fix the dangerous bug of failing to check yespower_tls() return value, but I ended up noticing and fixing other issues as well.

Two things I left as-is for now for not knowing their rationale:

Why does sha256.h have extern "C" { ... } dropped? It has those in yespower 1.0, and nested ones are meant to work OK: https://stackoverflow.com/questions/48099828/what-happens-if-you-nest-extern-c
Why does yespower-opt.c have #undef unlikely added, and why only in the GCC-specific branch? At least the latter is probably a bug: if this name was somehow a previously defined macro, then it probably needs to be undefined for non-GCC as well. But I didn't find any other code in the tree that would define this macro.

These changes are untested, but they got to work. ;-) So perhaps test them before merging. Not being even a user of Koto, I feel it'd be too much for me to also contribute testing. ;-)

…longer misattributed.

…roper."

possibly compile without AVX enabled in the compiler, not to disable this code in the source. Doing so in the source as it was done in Koto prior to this commit is dangerous: the resulting code is a mix of VEX-encoded compiler-generated instructions (e.g., for Salsa20) and non-VEX-encoded/legacy SSE2 instructions from the inline asm. Such mix can be extremely slow on some CPUs. Not slightly slower, but many times slower, depending on how frequent or not the transitions are (does the compiler also introduce occasional VEX-encoded instructions between the pieces of inline asm or not) and on CPU microarch (some only penalize the transitions, for some others the slowdown is smaller but it persists past transition point). A good explanation: https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake/41349852#41349852

okoto-xyz · 2018-11-07T12:57:16Z

These source files are that I copied from cpuminer-yescrypt.
Regarding the case AVX code was disabled, I think it was implimented to pursue faster performance as much as possible.
Actually, I also observed slightly faster result on testing for inline asm (AVX disabled) code with -march=native compile option.
but in case of only -mavx option, sandybridge and haswell ares slower and skylake is faster.

asm: inline asm (Koto cpuminer optimized)
avx: avx code (original yespower)
based on benchmark.c in yespower-1.0.0, gcc 7.3.0

For Koto core daemon, it seems that there are no -mavx or -march=native option for build script therefore Koto will be compiled with sse2 option in most case because gcc for x86_64 is sse2 enabled by default.
However, considering Koto core daemon is used in various cpu environments and compile options,
it should be a safe code as @solardiz pointed out.

solardiz · 2018-11-07T14:06:32Z

I'm sorry I sent this PR against master - apparently, it should have been against develop.

Yes, cpuminer will need similar changes, and probably more: I'd fully replace the old yescrypt code with yespower, not be adding yespower while still keeping the old yescrypt code like it's been done.

Regarding the speeds, I'm aware that the SSE2 inline asm is typically the fastest - that's why I wrote and included it in there. We just shouldn't use it in AVX-enabled builds for the reason I mentioned. Unfortunately.

Koto's cpuminer should probably build for SSE2 by default, not use any "native" optimization flags.

None of your benchmarks include plain SSE2 builds. You always try to enable some native instruction set in the compiler, and that allows the compiler to also include AVX and even AVX2 instructions in other places, which as I explained might hurt. It'd be useful to include a plain SSE2 build in your comparison. Quite possibly it'll be the fastest on at least some of those CPUs.

In PERFORMANCE file for yespower, I included this wording:

On x86(-64), the following code versions may reasonably be built: SSE2,
AVX, and XOP.  (There's no reason to build for AVX2 and higher, which is
unsuitable for and thus unused by current yespower anyway.  There's also
no reason to build yespower as-is for SSE4, although there's a disabled
by default 32-bit specific SSE4 code version that may be re-enabled and
given a try if someone is so inclined; it may perform slightly slower or
slightly faster across different systems.)

yescrypt and especially yespower 1.0 have been designed to fit the SSE2
instruction set almost perfectly, so there's very little benefit from
the AVX and XOP builds, yet even at yespower 1.0 there may be
performance differences between SSE2, AVX, and XOP builds within 2% or
so (and it is unclear which is the fastest on a given system until
tested, except that where XOP is supported it is almost always faster
than AVX).

For the old yescrypt/yespower 0.5 mode, AVX and XOP matter slightly more, because the code uses Salsa20 more, and Salsa20 benefits from AVX and XOP. Especially when r is low, like it is in Koto. So it is possible AVX is of some help, and it is even possible that the SSE2/AVX mix is the fastest on some builds on some CPUs - but this is risky territory as huge slowdowns may be observed in different builds and/or on different CPUs. I once observed a ~30x slowdown from tight mix of SSE2/AVX2 on Haswell. No joke.

okoto-xyz · 2018-11-10T15:05:00Z

I ran benchmark again with SSE2.
Certainly plain SSE2 build is the fastest performance on Haswell and Skylake.
Interesting.

For Ryzen, AVX seems to be slightly faster.

solardiz · 2018-11-10T16:55:18Z

These new results make sense to me as it relates to relative speeds of the different builds. I'm a bit surprised i5-6500 (3.3 GHz all-core turbo) performs so poorly - I'd expect it to be closer to i5-4590 (3.5 GHz all-core turbo). I'm not surprised i5-4590 performs much worse than i7-4770K, for which I included a comparable benchmark result in PERFORMANCE - it gives 3773 H/s there on CentOS 7 and with default "gcc version 4.8.5 20150623 (Red Hat 4.8.5-28)" - that's clock rate difference (all-core turbo 3.7 GHz) combined with cache size difference (8 vs. 6 MiB). Maybe there's something "wrong" with the specific i5-6500 system - e.g., only one DIMM installed. Whatever.

For the Ryzen 1700 system, optimal thread count is probably 16 or 8, not 4 which you measured - or is that an error in the footnote? The speed is unrealistically high for 4 threads.

In PERFORMANCE, I recommend limiting the threads to physical cores, but that's for yespower 1.0. For yespower 0.5 mode, some systems will benefit from matching logical CPU count and some others from matching physical cores, but either way the performance differences are smaller and this isn't as important as it is for yespower 1.0 mode.

okoto-xyz · 2018-11-11T07:29:58Z

That's right.
i5-6500 was result of single channel DIMM.
Ryzen was result of 8 threads. I updated the image. 12 threads was the best result.

solardiz · 2018-11-16T11:30:55Z

Also related: Koto adds libbitcoin_common_a_CFLAGS = -fPIC, which I understand was needed for the build to succeed at all (because of -fPIE on linking, inherited from Zcash, which they had for security hardening?), but possibly this leaves out the optimization flags when building yespower C files. A better alternative line may be libbitcoin_common_a_CFLAGS = $(libbitcoin_common_a_CXXFLAGS) (works for me when integrating yespower in another tree, not Koto) or even libbitcoin_common_a_CFLAGS = $(libbitcoin_common_a_CXXFLAGS) -O2 -fomit-frame-pointer to optimize these files in that way regardless of whether the rest of the tree is built with optimizations or not.

Ignore exceptions when deserializing note plaintexts

196962ff0 Add AcceleratedCRC32C to port_win.h 1bdf1c34c Merge upstream LevelDB v1.20 d31721eb0 Merge #17: Fixed file sharing errors fecd44902 Fixed file sharing error in Win32Env::GetFileSize(), Win32SequentialFile::_Init(), Win32RandomAccessFile::_Init() Fixed error checking in Win32SequentialFile::_Init() 5b7510f1b Merge #14: Merge upstream LevelDB 1.19 0d969fd57 Merge #16: [LevelDB] Do no crash if filesystem can't fsync c8c029b5b [LevelDB] Do no crash if filesystem can't fsync a53934a3a Increase leveldb version to 1.20. f3f139737 Separate Env tests from PosixEnv tests. eb4f0972f leveldb: Fix compilation warnings in port_posix_sse.cc on x86 (32-bit). d0883b600 Fixed path to doc file: index.md. 7fa20948d Convert documentation to markdown. ea175e28f Implement support for Intel crc32 instruction (SSE 4.2) 95cd743e5 Including <limits> for std::numeric_limits. 646c3588d Limit the number of read-only files the POSIX Env will have open. d40bc3fa5 Merge #13: Typo ebbd772d3 Typo a2fb086d0 Add option for max file size. The currend hard-coded value of 2M is inefficient in colossus. git-subtree-dir: src/leveldb git-subtree-split: 196962ff01c39b4705d8117df5c3f8c205349950

98fadc090 Merge #24: Push bool into array correctly 5f03f1f39 Push bool into array correctly 98261b1e7 Merge #22: Clamp JSON object depth to PHP limit 54c401541 Clamp JSON object depth to PHP limit 5a58a4667 Merge #21: Remove hand-coded UniValue destructor. b4cdfc4f4 Remove hand-coded UniValue destructor. 7fba60b5a Merge #17: [docs] Update readme 4577454e7 Merge #13: Fix typo ac7e73cda [docs] Update readme 7890db99d Merge #11: Remove deprecated std pair wrappers 40e34852a Merge #14: Cleaned up namespace imports to reduce symbol collisions 4a4964729 Fix typo 85052a481 Remove deprecated std::pair wrappers 51d3ab34b Merge #10: Add pushKV(key, boolean) function (replaces #5) 129bad96d [tests] test pushKV for boolean values b3c44c947 Pushing boolean value to univalue correctly 07947ff2d Merge #9: [tests] Fix BOOST_CHECK_THROW macro ec849d9a2 [tests] Fix BOOST_CHECK_THROW macro d208f986d Cleaned up namespace imports to reduce symbol collisions 31bc9f5a4 Merge #8: Remove unused Homebrew workaround fa042093d Remove HomeBrew workaround a523e08ae Merge #7: Declare single-argument (non-converting) constructors "explicit" a9e53b38b Merge #4: Pull upstream fe805ea74 Declare single-argument (non-converting) constructors "explicit" 8a2d6f1e3 Merge pull request #41 from jgarzik/get-obj-map ba341a20d Add getObjMap() helper method. Also, constify checkObject(). ceb119413 Handle .pushKV() and .checkObject() edge cases. 107db9829 Add ::push_back(double) method for feature parity. d41530031 Move one-line implementation of UniValue::read() to header. 52e85b35b Move exception-throwing get_* methods into separate implementation module. dac529675 README.md: update code quotes 3e31dcffb README.md: close code quote d09b8429d Update README.md f1b86edb4 Convert README to markdown style. 1dfe464ef Import UniValue class unit tests from bitcoin project. 0d3e74dd1 operator[] takes size_t index parameter (versus unsigned int) 640158fa2 Private findKey() method becomes size_t clean, and returns bool on failure. 709913585 Merge pull request #36 from ryanofsky/pr/end-str 4fd5444d1 Reject unterminated strings 16a1f7f6e Merge #3: Pull upstream daf1285af Merge pull request #2 from jgarzik/master f32df99e9 Merge branch '2016_04_unicode' into bitcoin 280b191cb Merge remote-tracking branch 'jgarzik/master' into bitcoin 2740c4f71 Merge branch '2015_11_escape_plan' into bitcoin REVERT: 9ef5b78c1 Use size_t for UniValue array indexing git-subtree-dir: src/univalue git-subtree-split: 98fadc090984fa7e070b6c41ccb514f69a371c85

wo01 and others added 8 commits September 8, 2018 18:41

Merge branch 'dev-1.1.2'

2bac512

Update Koto 1.1.2.1

5ab3b6c

Merge remote-tracking branch 'origin/dev-1.1.2'

83f7461

Replace the copyright statement with Koto's, so that this file is no …

792478b

…longer misattributed.

Check for errors from yespower_tls() and abort() on any.

26b909b

Note that "yespower_hash() is a Koto addition, not part of yespower p…

d307fed

…roper."

Update sysendian.h to yespower's revision of it.

676353a

wo01 changed the base branch from master to develop November 6, 2018 13:22

okoto-xyz referenced this pull request in okoto-xyz/node-multi-hashing-yespower Nov 16, 2018

Cleanup of yespower

37a09a4

wo01 pushed a commit that referenced this pull request Oct 4, 2019

Merge pull request #17 from zcash/exception-handling

f5832a9

Ignore exceptions when deserializing note plaintexts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup of the switch from yescrypt to yespower #17

Cleanup of the switch from yescrypt to yespower #17

solardiz commented Nov 5, 2018 •

edited

Loading

okoto-xyz commented Nov 7, 2018

solardiz commented Nov 7, 2018

okoto-xyz commented Nov 10, 2018 •

edited

Loading

solardiz commented Nov 10, 2018 •

edited

Loading

okoto-xyz commented Nov 11, 2018

solardiz commented Nov 16, 2018

Cleanup of the switch from yescrypt to yespower #17

Are you sure you want to change the base?

Cleanup of the switch from yescrypt to yespower #17

Conversation

solardiz commented Nov 5, 2018 • edited Loading

okoto-xyz commented Nov 7, 2018

solardiz commented Nov 7, 2018

okoto-xyz commented Nov 10, 2018 • edited Loading

solardiz commented Nov 10, 2018 • edited Loading

okoto-xyz commented Nov 11, 2018

solardiz commented Nov 16, 2018

solardiz commented Nov 5, 2018 •

edited

Loading

okoto-xyz commented Nov 10, 2018 •

edited

Loading

solardiz commented Nov 10, 2018 •

edited

Loading