base64: add avx512 and vbmi version. #6361

frankdjx · 2020-10-21T03:24:26Z

1. Implementation based on https://github.com/WojciechMula/base64simd
2. Only the runtime dispatch path is added, to limit the complexity of the SIMD variants.
3. Test cases are expanded to cover the SIMD implementations.

Signed-off-by: Frank Du [email protected]

Benchmarked with the synthetic code below on a machine that supports AVX512 and VBMI.

function simple_base64_encode() {
    $a = "foo";
    for ($i = 0; $i < 10000; $i++) {
        base64_encode($a);
        $a .= "o";
    }
}

function simple_base64_decode() {
    $a = "foo";
    for ($i = 0; $i < 10000; $i++) {
        base64_decode($a);
        $a .= "o";
    }
}

Results (lower is better):

Avx2:
base64_encode      0.042
base64_decode      0.044

Avx512:
base64_encode      0.029
base64_decode      0.032

Vbmi:
base64_encode      0.020
base64_decode      0.025

divinity76 · 2020-10-27T23:21:15Z

did you check the performance impact on a system where the COMPILER supports avx512/vbmi but the cpu doesn't?

(i've heard horror stories about gcc emulating intrinsics horribly slow on non-capable cpus)

frankdjx · 2020-10-28T01:12:55Z

> did you check the performance impact on a system where the COMPILER supports avx512/vbmi but the cpu doesn't?
>
> (i've heard horror stories about gcc emulating intrinsics horribly slow on non-capable cpus)

I just checked on an AVX2 machine: no impact. The code has a runtime path (see resolve_base64_encode/resolve_base64_decode) that selects the SIMD variant the machine supports, so an AVX2 machine still picks the original AVX2 intrinsic code.
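To make the runtime path concrete, here is a minimal sketch of the dispatch idea in C. It is not the code from this PR: the `cpu_features` struct, the `variant_*` stand-ins and `resolve_variant` are hypothetical names chosen for illustration, and the feature probe assumes GCC or Clang on x86 (`__builtin_cpu_supports`). On any other compiler or architecture the sketch falls back to the scalar path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature flags, filled in once at startup. */
typedef struct {
    int avx2;
    int avx512;  /* AVX512F and AVX512BW both available */
    int vbmi;    /* AVX512_VBMI available */
} cpu_features;

/* Hypothetical stand-ins for the real encoders: each only names its path. */
typedef const char *(*variant_fn)(void);

static const char *variant_scalar(void) { return "scalar"; }
static const char *variant_avx2(void)   { return "avx2"; }
static const char *variant_avx512(void) { return "avx512"; }
static const char *variant_vbmi(void)   { return "vbmi"; }

/* Pick the widest variant the CPU supports; the compiler's own
 * capabilities play no part in this decision. */
static variant_fn resolve_variant(const cpu_features *f)
{
    assert(f != NULL);
    if (f->avx512 && f->vbmi) return variant_vbmi;
    if (f->avx512)            return variant_avx512;
    if (f->avx2)              return variant_avx2;
    return variant_scalar;
}

/* Probe the running CPU (GCC/Clang on x86 only; otherwise all flags stay 0). */
static cpu_features detect_cpu_features(void)
{
    cpu_features f = {0, 0, 0};
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    f.avx2   = __builtin_cpu_supports("avx2") != 0;
    f.avx512 = __builtin_cpu_supports("avx512f") &&
               __builtin_cpu_supports("avx512bw");
    f.vbmi   = __builtin_cpu_supports("avx512vbmi") != 0;
#endif
    return f;
}

/* Name of the variant that would run on this machine. */
const char *base64_variant_name(void)
{
    cpu_features f = detect_cpu_features();
    return resolve_variant(&f)();
}
```

Because the choice depends only on what the CPU reports at run time, a binary built with an AVX512-capable compiler still takes the AVX2 branch on an AVX2-only machine.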

frankdjx · 2020-11-11T08:37:57Z

@cmb69 Do you know who could help review this? Thanks. We also have other AVX512-related performance optimizations, and this PR adds some build and runtime support for AVX512 variants.

cmb69 · 2020-11-11T11:49:32Z

AFAIK, @laruence originally implemented this. Maybe he wants to take a look?

LifeIsStrange · 2020-11-24T18:24:09Z

@jianxind my comment is off-topic: I see that you work at Intel, so it is in your interest to optimize hardware use in mainstream languages. A nice follow-up to this PR would therefore be to port such optimizations to the JVM/OpenJDK too (using intrinsics or the Vector API, https://openjdk.java.net/jeps/338). It's just an off-topic suggestion from a jealous Java dev, feel free not to answer my comment :)

frankdjx · 2020-11-25T02:23:24Z

> @jianxind my comment is off-topic: I see that you work at Intel, so it is in your interest to optimize hardware use in mainstream languages. A nice follow-up to this PR would therefore be to port such optimizations to the JVM/OpenJDK too (using intrinsics or the Vector API, https://openjdk.java.net/jeps/338). It's just an off-topic suggestion from a jealous Java dev, feel free not to answer my comment :)

Many thanks for the pointer. Yes, it is in our interest to fully utilize hardware capabilities across all frameworks. I see that AdoptOpenJDK/openjdk-jdk11@d764765 added AVX512 support for base64 encoding; it seems to use a different algorithm. We'd like to look into the technical details and then add Java to our scope.

stayeronglass · 2020-12-03T19:53:07Z

@dstogov Dmitry?
@nikic Nikita?
Could you take a look?

Girgias · 2023-02-05T17:40:47Z

@jianxind could you rebase this on latest master?

dstogov

I don't object to this.

The only problem is reading past the end of the strings. This should be fixed to avoid Valgrind and AddressSanitizer warnings.
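One common way to avoid such over-reads is to bound the vector loop by the load width rather than by the number of bytes consumed, and to leave the remainder to scalar code. The sketch below only illustrates that arithmetic. The helper names are hypothetical, and the 64-byte load with 48 bytes consumed per iteration used in the usage note is an assumption about a typical 512-bit encoder, not something confirmed for this PR.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of vector iterations that can run without any load crossing the
 * end of the input. Each iteration loads `load_width` bytes starting at the
 * current offset but advances by only `consumed` bytes.
 */
size_t safe_vector_iterations(size_t len, size_t load_width, size_t consumed)
{
    assert(consumed > 0 && consumed <= load_width);
    if (len < load_width) {
        return 0;  /* even the first full-width load would run past the end */
    }
    return (len - load_width) / consumed + 1;
}

/* Bytes left for the scalar tail loop once the vector iterations are done. */
size_t scalar_tail_length(size_t len, size_t load_width, size_t consumed)
{
    return len - safe_vector_iterations(len, load_width, consumed) * consumed;
}
```

With a 64-byte load that consumes 48 bytes, a 112-byte input allows two vector iterations (loads at offsets 0 and 48, the second ending exactly at byte 112) and leaves 16 bytes for the scalar tail. A loop that continued while 48 or more bytes remained would instead issue a third load reaching 48 bytes past the end of the buffer.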

The speed measurement was done on short strings. The speedup on longer strings should be much better. SIMD-optimized string functions often give even worse results on short strings.

frankdjx · 2023-02-06T08:05:02Z

> was done on short strings. The speedup on longer strings should be much better. SIMD-optimized string functions often give even worse results on short strings.

Yes, the improvement is larger on longer strings. The benchmark measures the average performance over string sizes from 1 to 10000. I don't know whether strings much longer than 10000 occur in real cases :)

dstogov · 2023-02-06T08:15:15Z

> Yes, the improvement is larger on longer strings. The benchmark measures the average performance over string sizes from 1 to 10000. I don't know whether strings much longer than 10000 occur in real cases :)

Right again. Sorry, I should have been more careful :)

Girgias

Could you please split your new test cases into a new file as recommended by Dmitry?

frankdjx · 2023-02-07T01:48:40Z

> Could you please split your new test cases into a new file as recommended by Dmitry?

Done, all the new test cases are now in newly added files. I also added one loop test to cover all SIMD variants.
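The idea behind such a loop test can be sketched in C as a round trip over every input length up to a few vector widths, so that each variant's bulk loop and its tail handling are both exercised. This is an illustration only, not the .phpt test added in the PR: `ref_encode`, `ref_decode` and `roundtrip_failures` are hypothetical names, and the scalar reference pair stands in for the dispatched encoder and decoder that a real test would call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Scalar reference encoder; `out` needs 4 * ((len + 2) / 3) bytes. */
size_t ref_encode(const unsigned char *in, size_t len, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i += 3) {
        size_t n = len - i < 3 ? len - i : 3;  /* bytes in this group */
        unsigned v = (unsigned)in[i] << 16;
        if (n > 1) v |= (unsigned)in[i + 1] << 8;
        if (n > 2) v |= in[i + 2];
        out[o++] = B64[v >> 18];
        out[o++] = B64[(v >> 12) & 63];
        out[o++] = n > 1 ? B64[(v >> 6) & 63] : '=';
        out[o++] = n > 2 ? B64[v & 63] : '=';
    }
    return o;
}

/* Scalar reference decoder for well-formed, padded input. */
size_t ref_decode(const char *in, size_t len, unsigned char *out)
{
    size_t o = 0;
    assert(len % 4 == 0);
    for (size_t i = 0; i < len; i += 4) {
        unsigned v = 0;
        int pad = 0;
        for (int k = 0; k < 4; k++) {
            char c = in[i + k];
            const char *p = (c == '=') ? NULL : strchr(B64, c);
            if (c == '=') pad++;
            v = (v << 6) | (p ? (unsigned)(p - B64) : 0u);
        }
        out[o++] = (unsigned char)(v >> 16);
        if (pad < 2) out[o++] = (unsigned char)(v >> 8);
        if (pad < 1) out[o++] = (unsigned char)v;
    }
    return o;
}

/* Count the lengths in [0, max_len] whose encode/decode round trip fails. */
size_t roundtrip_failures(size_t max_len)
{
    static unsigned char src[512], back[512];
    static char enc[4 * ((512 + 2) / 3)];
    size_t failures = 0;
    assert(max_len <= sizeof(src));
    for (size_t i = 0; i < sizeof(src); i++) {
        src[i] = (unsigned char)(i * 131u + 7u);  /* arbitrary byte pattern */
    }
    for (size_t len = 0; len <= max_len; len++) {
        size_t e = ref_encode(src, len, enc);
        size_t d = ref_decode(enc, e, back);
        if (e != 4 * ((len + 2) / 3) || d != len ||
            memcmp(src, back, len) != 0) {
            failures++;
        }
    }
    return failures;
}
```

Sweeping the length one byte at a time matters because the SIMD variants switch between their vector loop and the scalar tail at different thresholds, and a fixed set of short inputs can miss those boundaries.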

dstogov

I didn't check this in action, but the sources look good and the tests pass.
In the worst case, if serious problems appear, we can revert this.

zeriyoshi · 2023-02-07T07:28:46Z

@dstogov
We may need to be aware of the clock-down problem that Intel CPUs have with AVX instructions:
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

PHP is often run as many processes. AVX2 and AVX-512 instructions improve per-instruction throughput, but threads running in parallel that do not use SIMD are also subject to the clock-down. This can affect servers with high RPS.

On the other hand, these problems are mitigated on Ice Lake and later CPUs:
https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Sorry for cross-posting this from room11, but it seemed timely.

frankdjx · 2023-02-07T07:46:00Z

> @dstogov We may need to be aware of the clock-down problem that Intel CPUs have with AVX instructions: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/
>
> PHP is often run as many processes. AVX2 and AVX-512 instructions improve per-instruction throughput, but threads running in parallel that do not use SIMD are also subject to the clock-down. This can affect servers with high RPS.
>
> On the other hand, these problems are mitigated on Ice Lake and later CPUs: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
>
> Sorry for cross-posting this from room11, but it seemed timely.

The clock-down mostly happens with heavy operations (FMA/crypto instructions). In this case, the SIMD part of base64 only uses light byte shuffle/shift instructions, so even on CPUs before Ice Lake the chance of a clock-down is very low.

zeriyoshi · 2023-02-07T08:07:00Z

> The clock-down mostly happens with heavy operations (FMA/crypto instructions). In this case, the SIMD part of base64 only uses light byte shuffle/shift instructions, so even on CPUs before Ice Lake the chance of a clock-down is very low.

Yes, but according to this source, all AVX-512 instructions use the L1 license (power state 1 or higher), even on ICL. That means a light clock-down still results.

I am all for SIMD speedups, but I think we need to look at this carefully.

dstogov · 2023-02-07T09:50:04Z

I didn't know about frequency license levels until now. First they make instruction sets that may bring at most a 2x speedup, then they drop the frequency of all cores when those instructions are used, which may bring up to a 1.5x slowdown for all cores. Funny :)

Anyway, I don't think this is a stopper for this particular use case.

Girgias · 2023-02-13T01:39:54Z

 /home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c: In function ‘zend_jit_init’:
/home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c:4850:119: error: passing argument 4 of ‘ts_allocate_id’ from incompatible pointer type [-Werror=incompatible-pointer-types]
 4850 |  jit_globals_id = ts_allocate_id(&jit_globals_id, sizeof(zend_jit_globals), (ts_allocate_ctor) zend_jit_globals_ctor, zend_jit_globals_dtor);
      |                                                                                                                       ^~~~~~~~~~~~~~~~~~~~~
      |                                                                                                                       |
      |                                                                                                                       void (*)(zend_jit_globals *) {aka void (*)(struct _zend_jit_globals *)}
In file included from /home/runner/work/php-src/php-src/Zend/zend_portability.h:47,
                 from /home/runner/work/php-src/php-src/Zend/zend_types.h:25,
                 from /home/runner/work/php-src/php-src/Zend/zend.h:27,
                 from /home/runner/work/php-src/php-src/main/php.h:31,
                 from /home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c:19:
/home/runner/work/php-src/php-src/Zend/../TSRM/TSRM.h:91:110: note: expected ‘ts_allocate_dtor’ {aka ‘void (*)(void *)’} but argument is of type ‘void (*)(zend_jit_globals *)’ {aka ‘void (*)(struct _zend_jit_globals *)’}
   91 | TSRM_API ts_rsrc_id ts_allocate_id(ts_rsrc_id *rsrc_id, size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor);
      |                                                                                             ~~~~~~~~~~~~~~~~~^~~~
cc1: all warnings being treated as errors

Please fix the compilation error.

frankdjx · 2023-02-13T01:45:19Z

> /home/runner/work/php-src/php-src/Zend/../TSRM/TSRM.h:91:110: note: expected ‘ts_allocate_dtor’ {aka ‘void (*)(void *)’} but argument is of type ‘void (*)(zend_jit_globals *)’ {aka ‘void (*)(struct _zend_jit_globals *)’}
>    91 | TSRM_API ts_rsrc_id ts_allocate_id(ts_rsrc_id *rsrc_id, size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor);
>       |                                                                                             ~~~~~~~~~~~~~~~~~^~~~
> cc1: all warnings being treated as errors
>
> Please fix the compilation error.

This error is not related to this commit. I just rebased the PR, so we can see how it goes.

Girgias · 2023-02-13T01:46:46Z

> > /home/runner/work/php-src/php-src/Zend/../TSRM/TSRM.h:91:110: note: expected ‘ts_allocate_dtor’ {aka ‘void (*)(void *)’} but argument is of type ‘void (*)(zend_jit_globals *)’ {aka ‘void (*)(struct _zend_jit_globals *)’}
> >    91 | TSRM_API ts_rsrc_id ts_allocate_id(ts_rsrc_id *rsrc_id, size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor);
> >       |                                                                                             ~~~~~~~~~~~~~~~~~^~~~
> > cc1: all warnings being treated as errors
> >
> > Please fix the compilation error.
>
> This error is not related to this commit. I just rebased the PR, so we can see how it goes.

I'll let CI run and merge this if it's mostly green. Sorry for this taking so long to get merged.

frankdjx force-pushed the base64_avx512 branch from 675c488 to 7748c2e · October 21, 2020 03:50

divinity76 mentioned this pull request · Oct 21, 2020
BLAKE3 hash support #6358 (closed)

frankdjx force-pushed the base64_avx512 branch from 7748c2e to d6c1436 · December 8, 2020 03:24

ramsey added Feature Waiting on Review labels · Jun 9, 2021

pierrejoye reviewed · Sep 2, 2021
build/php.m4 (resolved)

frankdjx force-pushed the base64_avx512 branch from d6c1436 to 749862e · February 6, 2023 03:17

github-actions bot added Category: Build System, Category: Engine, Extension: standard labels · Feb 6, 2023

dstogov reviewed · Feb 6, 2023
ext/standard/base64.c (resolved)

dstogov reviewed · Feb 6, 2023
ext/standard/tests/url/base64_decode_basic_002.phpt (outdated, resolved)

dstogov reviewed · Feb 6, 2023

Girgias reviewed · Feb 6, 2023
build/php.m4 (outdated, resolved)

frankdjx force-pushed the base64_avx512 branch from 749862e to 83b4683 · February 7, 2023 01:39

frankdjx closed this · Feb 7, 2023

frankdjx reopened this · Feb 7, 2023

frankdjx force-pushed the base64_avx512 branch from 83b4683 to abc1d8d · February 7, 2023 03:20

dstogov approved these changes · Feb 7, 2023

Commit f37e064: base64: add avx512 and vbmi version.

frankdjx force-pushed the base64_avx512 branch from abc1d8d to f37e064 · February 13, 2023 01:43

Girgias merged commit a9437ce into php:master · Feb 13, 2023

frankdjx deleted the base64_avx512 branch · February 13, 2023 04:16

mvorisek mentioned this pull request · Feb 14, 2023
Major overhaul of mbstring (part 31) #10591 (closed)
