-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base64: add avx512 and vbmi version. #6361
Conversation
675c488
to
7748c2e
Compare
did you check the performance impact on a system where the COMPILER supports avx512/vbmi but the cpu doesn't? (i've heard horror stories about gcc emulating intrinsics horribly slow on non-capable cpus) |
I just checked on a AVX2 machine, no impact. It has a runtime path(see resolve_base64_encode/resolve_base64_decode) to resolve the SIMD path capable by the machine, for a AVX2 device it still pick the original AVX2 intrinsic code. |
@cmb69 Do you know who can help to review this? Thanks. Also we has other AVX512 related performance optimization, this PR add some build and runtime support for AVX512 variants. |
AFAIK, @laruence originally implemented this. Maybe he want to take a look? |
@jianxind my comment is off-topic: I see that you work at Intel so it is in your interest to optimizer hardware use on mainstream languages. Therefore a nice follow-up of this PR would be to port such optimization to the JVM/OpenJDK too (using intrinsics or the https://openjdk.java.net/jeps/338 ). It's just an HS suggestion from a jealous java dev, feel free to not answer my comment :) |
Great thanks for the possible pathfinding, yes it's our interest to fully utilize hardware capacity on all frameworks. I see AdoptOpenJDK/openjdk-jdk11@d764765 added AVX512 support for base64 encoding, seems it use a different algorithm. We'd like to check the technology detail then and add Java to our scope. |
7748c2e
to
d6c1436
Compare
@jianxind could you rebase this on latest master? |
d6c1436
to
749862e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't object against this.
The only bad thing is reading strings above boundary. This should be resolved to avoid valgrind and address sanitizer warnings.
The speed measurement was done on short strings. The speed up on longer strings should be much better. Usually, SIMD optimized string function may provide even worse result on short strings.
Yes, more improvements on the longer strings. The benchmark measure the average performance on string size from 1 to 10000. I don't know if has much longer size then 10000 in real case:) |
Right again. Sorry, I should be more careful :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please split your new test cases into a new file as recommended by Dmitry?
749862e
to
83b4683
Compare
Done, all test case are new adding now. Also add one loop test to cover all SIMD variant. |
83b4683
to
abc1d8d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check this in action, but sources look good and tests passed.
In worst case of serious problems we may revert this.
@dstogov PHP is often executed in many process. The AVX-2, AVX-512 instructions improve the efficiency of single instruction performance, but threads running in parallel that do not use SIMD are subject to clock down. This can affect servers with high RPS. On the other hand, these problems are mitigated on CPUs from IceLake or later: Sorry, this is a multi-post with room11, but it was timely. |
The clock-down mostly happens on heavy data operation(FMA/crypto instruction), in current case, the SIMD part for base64 only use the light byte shuffle/shift instructions, even for CPUs before Icelake the the chance of clock-down is very low. |
Yes, but this source, all AVX-512 instructions used the L1 (Power State 1 or higher), even on the ICL. This is meant to result in a lightweight clock down. I am all for SIMD speedups, but I think we need to look at this carefully. |
I didn't know about frequency level licensing yet. First make instruction sets that may bring at most 2 times speed up, but then drop frequency of all cores when using them, that may bring up to 1.5 times slowdown for all cores. Funny :) Anyway, I don't think this is a stopper for this particular use case. |
1. Implementation based on https://github.com/WojciechMula/base64simd 2. Only runtime path is added to reduce the complexity of SIMD variants. 3. Expand test case to cover SIMD implementation. Signed-off-by: Frank Du <[email protected]>
/home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c: In function ‘zend_jit_init’:
/home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c:4850:119: error: passing argument 4 of ‘ts_allocate_id’ from incompatible pointer type [-Werror=incompatible-pointer-types]
4850 | jit_globals_id = ts_allocate_id(&jit_globals_id, sizeof(zend_jit_globals), (ts_allocate_ctor) zend_jit_globals_ctor, zend_jit_globals_dtor);
| ^~~~~~~~~~~~~~~~~~~~~
| |
| void (*)(zend_jit_globals *) {aka void (*)(struct _zend_jit_globals *)}
In file included from /home/runner/work/php-src/php-src/Zend/zend_portability.h:47,
from /home/runner/work/php-src/php-src/Zend/zend_types.h:25,
from /home/runner/work/php-src/php-src/Zend/zend.h:27,
from /home/runner/work/php-src/php-src/main/php.h:31,
from /home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c:19:
/home/runner/work/php-src/php-src/Zend/../TSRM/TSRM.h:91:110: note: expected ‘ts_allocate_dtor’ {aka ‘void (*)(void *)’} but argument is of type ‘void (*)(zend_jit_globals *)’ {aka ‘void (*)(struct _zend_jit_globals *)’}
91 | TSRM_API ts_rsrc_id ts_allocate_id(ts_rsrc_id *rsrc_id, size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor);
| ~~~~~~~~~~~~~~~~~^~~~
cc1: all warnings being treated as errors Please fix the compilation error. |
abc1d8d
to
f37e064
Compare
This error is not related with this commit. Just rebase the PR and then we can see how it's going. |
I'll let CI run and merge this if it's mostly green. Sorry for this taking so long to get merged. |
Signed-off-by: Frank Du [email protected]
Benchmarking with below synthetic code on a capable device.
Results, little is best