Skip to content

Cleanup of the switch from yescrypt to yespower #17

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

solardiz
Copy link

@solardiz solardiz commented Nov 5, 2018

I thought I'd only drop the misattribution from yespower.h and fix the dangerous bug of failing to check yespower_tls() return value, but I ended up noticing and fixing other issues as well.

Two things I left as-is for now for not knowing their rationale:

  1. Why does sha256.h have extern "C" { ... } dropped? It has those in yespower 1.0, and nested ones are meant to work OK: https://stackoverflow.com/questions/48099828/what-happens-if-you-nest-extern-c

  2. Why does yespower-opt.c have #undef unlikely added, and why only in the GCC-specific branch? At least the latter is probably a bug: if this name was somehow a previously defined macro, then it probably needs to be undefined for non-GCC as well. But I didn't find any other code in the tree that would define this macro.

These changes are untested, but they got to work. ;-) So perhaps test them before merging. Not being even a user of Koto, I feel it'd be too much for me to also contribute testing. ;-)

wo01 and others added 8 commits September 8, 2018 18:41
possibly compile without AVX enabled in the compiler, not to disable this
code in the source.  Doing so in the source as it was done in Koto prior to
this commit is dangerous: the resulting code is a mix of VEX-encoded
compiler-generated instructions (e.g., for Salsa20) and
non-VEX-encoded/legacy SSE2 instructions from the inline asm.  Such mix can
be extremely slow on some CPUs.  Not slightly slower, but many times
slower, depending on how frequent or not the transitions are (does the
compiler also introduce occasional VEX-encoded instructions between the
pieces of inline asm or not) and on CPU microarch (some only penalize the
transitions, for some others the slowdown is smaller but it persists past
transition point).
A good explanation:
https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake/41349852#41349852
@wo01 wo01 changed the base branch from master to develop November 6, 2018 13:22
@okoto-xyz
Copy link

These source files are that I copied from cpuminer-yescrypt.
Regarding the case AVX code was disabled, I think it was implimented to pursue faster performance as much as possible.
Actually, I also observed slightly faster result on testing for inline asm (AVX disabled) code with -march=native compile option.
but in case of only -mavx option, sandybridge and haswell ares slower and skylake is faster.
benchmark
asm: inline asm (Koto cpuminer optimized)
avx: avx code (original yespower)
based on benchmark.c in yespower-1.0.0, gcc 7.3.0

For Koto core daemon, it seems that there are no -mavx or -march=native option for build script therefore Koto will be compiled with sse2 option in most case because gcc for x86_64 is sse2 enabled by default.
However, considering Koto core daemon is used in various cpu environments and compile options,
it should be a safe code as @solardiz pointed out.

@solardiz
Copy link
Author

solardiz commented Nov 7, 2018

I'm sorry I sent this PR against master - apparently, it should have been against develop.

Yes, cpuminer will need similar changes, and probably more: I'd fully replace the old yescrypt code with yespower, not be adding yespower while still keeping the old yescrypt code like it's been done.

Regarding the speeds, I'm aware that the SSE2 inline asm is typically the fastest - that's why I wrote and included it in there. We just shouldn't use it in AVX-enabled builds for the reason I mentioned. Unfortunately.

Koto's cpuminer should probably build for SSE2 by default, not use any "native" optimization flags.

None of your benchmarks include plain SSE2 builds. You always try to enable some native instruction set in the compiler, and that allows the compiler to also include AVX and even AVX2 instructions in other places, which as I explained might hurt. It'd be useful to include a plain SSE2 build in your comparison. Quite possibly it'll be the fastest on at least some of those CPUs.

In PERFORMANCE file for yespower, I included this wording:

On x86(-64), the following code versions may reasonably be built: SSE2,
AVX, and XOP.  (There's no reason to build for AVX2 and higher, which is
unsuitable for and thus unused by current yespower anyway.  There's also
no reason to build yespower as-is for SSE4, although there's a disabled
by default 32-bit specific SSE4 code version that may be re-enabled and
given a try if someone is so inclined; it may perform slightly slower or
slightly faster across different systems.)

yescrypt and especially yespower 1.0 have been designed to fit the SSE2
instruction set almost perfectly, so there's very little benefit from
the AVX and XOP builds, yet even at yespower 1.0 there may be
performance differences between SSE2, AVX, and XOP builds within 2% or
so (and it is unclear which is the fastest on a given system until
tested, except that where XOP is supported it is almost always faster
than AVX).

For the old yescrypt/yespower 0.5 mode, AVX and XOP matter slightly more, because the code uses Salsa20 more, and Salsa20 benefits from AVX and XOP. Especially when r is low, like it is in Koto. So it is possible AVX is of some help, and it is even possible that the SSE2/AVX mix is the fastest on some builds on some CPUs - but this is risky territory as huge slowdowns may be observed in different builds and/or on different CPUs. I once observed a ~30x slowdown from tight mix of SSE2/AVX2 on Haswell. No joke.

@okoto-xyz
Copy link

okoto-xyz commented Nov 10, 2018

I ran benchmark again with SSE2.
Certainly plain SSE2 build is the fastest performance on Haswell and Skylake.
Interesting.
benchmark2

For Ryzen, AVX seems to be slightly faster.
benchmark3

@solardiz
Copy link
Author

solardiz commented Nov 10, 2018

These new results make sense to me as it relates to relative speeds of the different builds. I'm a bit surprised i5-6500 (3.3 GHz all-core turbo) performs so poorly - I'd expect it to be closer to i5-4590 (3.5 GHz all-core turbo). I'm not surprised i5-4590 performs much worse than i7-4770K, for which I included a comparable benchmark result in PERFORMANCE - it gives 3773 H/s there on CentOS 7 and with default "gcc version 4.8.5 20150623 (Red Hat 4.8.5-28)" - that's clock rate difference (all-core turbo 3.7 GHz) combined with cache size difference (8 vs. 6 MiB). Maybe there's something "wrong" with the specific i5-6500 system - e.g., only one DIMM installed. Whatever.

For the Ryzen 1700 system, optimal thread count is probably 16 or 8, not 4 which you measured - or is that an error in the footnote? The speed is unrealistically high for 4 threads.

In PERFORMANCE, I recommend limiting the threads to physical cores, but that's for yespower 1.0. For yespower 0.5 mode, some systems will benefit from matching logical CPU count and some others from matching physical cores, but either way the performance differences are smaller and this isn't as important as it is for yespower 1.0 mode.

@okoto-xyz
Copy link

That's right.
i5-6500 was result of single channel DIMM.
Ryzen was result of 8 threads. I updated the image. 12 threads was the best result.

@solardiz
Copy link
Author

Also related: Koto adds libbitcoin_common_a_CFLAGS = -fPIC, which I understand was needed for the build to succeed at all (because of -fPIE on linking, inherited from Zcash, which they had for security hardening?), but possibly this leaves out the optimization flags when building yespower C files. A better alternative line may be libbitcoin_common_a_CFLAGS = $(libbitcoin_common_a_CXXFLAGS) (works for me when integrating yespower in another tree, not Koto) or even libbitcoin_common_a_CFLAGS = $(libbitcoin_common_a_CXXFLAGS) -O2 -fomit-frame-pointer to optimize these files in that way regardless of whether the rest of the tree is built with optimizations or not.

okoto-xyz referenced this pull request in okoto-xyz/node-multi-hashing-yespower Nov 16, 2018
wo01 pushed a commit that referenced this pull request Oct 4, 2019
Ignore exceptions when deserializing note plaintexts
wo01 pushed a commit that referenced this pull request Nov 6, 2020
196962ff0 Add AcceleratedCRC32C to port_win.h
1bdf1c34c Merge upstream LevelDB v1.20
d31721eb0 Merge #17: Fixed file sharing errors
fecd44902 Fixed file sharing error in Win32Env::GetFileSize(), Win32SequentialFile::_Init(), Win32RandomAccessFile::_Init() Fixed error checking in Win32SequentialFile::_Init()
5b7510f1b Merge #14: Merge upstream LevelDB 1.19
0d969fd57 Merge #16: [LevelDB] Do no crash if filesystem can't fsync
c8c029b5b [LevelDB] Do no crash if filesystem can't fsync
a53934a3a Increase leveldb version to 1.20.
f3f139737 Separate Env tests from PosixEnv tests.
eb4f0972f leveldb: Fix compilation warnings in port_posix_sse.cc on x86 (32-bit).
d0883b600 Fixed path to doc file: index.md.
7fa20948d Convert documentation to markdown.
ea175e28f Implement support for Intel crc32 instruction (SSE 4.2)
95cd743e5 Including <limits> for std::numeric_limits.
646c3588d Limit the number of read-only files the POSIX Env will have open.
d40bc3fa5 Merge #13: Typo
ebbd772d3 Typo
a2fb086d0 Add option for max file size. The currend hard-coded value of 2M is inefficient in colossus.

git-subtree-dir: src/leveldb
git-subtree-split: 196962ff01c39b4705d8117df5c3f8c205349950
wo01 pushed a commit that referenced this pull request Mar 25, 2021
98fadc090 Merge #24: Push bool into array correctly
5f03f1f39 Push bool into array correctly
98261b1e7 Merge #22: Clamp JSON object depth to PHP limit
54c401541 Clamp JSON object depth to PHP limit
5a58a4667 Merge #21: Remove hand-coded UniValue destructor.
b4cdfc4f4 Remove hand-coded UniValue destructor.
7fba60b5a Merge #17: [docs] Update readme
4577454e7 Merge #13: Fix typo
ac7e73cda [docs] Update readme
7890db99d Merge #11: Remove deprecated std pair wrappers
40e34852a Merge #14: Cleaned up namespace imports to reduce symbol collisions
4a4964729 Fix typo
85052a481 Remove deprecated std::pair wrappers
51d3ab34b Merge #10: Add pushKV(key, boolean) function (replaces #5)
129bad96d [tests] test pushKV for boolean values
b3c44c947 Pushing boolean value to univalue correctly
07947ff2d Merge #9: [tests] Fix BOOST_CHECK_THROW macro
ec849d9a2 [tests] Fix BOOST_CHECK_THROW macro
d208f986d Cleaned up namespace imports to reduce symbol collisions
31bc9f5a4 Merge #8: Remove unused Homebrew workaround
fa042093d Remove HomeBrew workaround
a523e08ae Merge #7: Declare single-argument (non-converting) constructors "explicit"
a9e53b38b Merge #4: Pull upstream
fe805ea74 Declare single-argument (non-converting) constructors "explicit"
8a2d6f1e3 Merge pull request #41 from jgarzik/get-obj-map
ba341a20d Add getObjMap() helper method.  Also, constify checkObject().
ceb119413 Handle .pushKV() and .checkObject() edge cases.
107db9829 Add ::push_back(double) method for feature parity.
d41530031 Move one-line implementation of UniValue::read() to header.
52e85b35b Move exception-throwing get_* methods into separate implementation module.
dac529675 README.md: update code quotes
3e31dcffb README.md: close code quote
d09b8429d Update README.md
f1b86edb4 Convert README to markdown style.
1dfe464ef Import UniValue class unit tests from bitcoin project.
0d3e74dd1 operator[] takes size_t index parameter (versus unsigned int)
640158fa2 Private findKey() method becomes size_t clean, and returns bool on failure.
709913585 Merge pull request #36 from ryanofsky/pr/end-str
4fd5444d1 Reject unterminated strings
16a1f7f6e Merge #3: Pull upstream
daf1285af Merge pull request #2 from jgarzik/master
f32df99e9 Merge branch '2016_04_unicode' into bitcoin
280b191cb Merge remote-tracking branch 'jgarzik/master' into bitcoin
2740c4f71 Merge branch '2015_11_escape_plan' into bitcoin
REVERT: 9ef5b78c1 Use size_t for UniValue array indexing

git-subtree-dir: src/univalue
git-subtree-split: 98fadc090984fa7e070b6c41ccb514f69a371c85
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants