Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base64: add avx512 and vbmi version. #6361

Merged
merged 1 commit into from
Feb 13, 2023
Merged

Conversation

frankdjx
Copy link
Contributor

@frankdjx frankdjx commented Oct 21, 2020

  1. Implementation based on https://github.com/WojciechMula/base64simd
  2. Only runtime path is added to reduce the complexity of SIMD variants.
  3. Expand test case to cover SIMD implementation.

Signed-off-by: Frank Du [email protected]

Benchmarking with below synthetic code on a capable device.

function simple_base64_encode() {
  $a = "foo";
  for ($i = 0; $i < 10000; $i++) {
      base64_encode($a);
      $a .= "o";
  }
}

function simple_base64_decode() {
  $a = "foo";
  for ($i = 0; $i < 10000; $i++) {
      base64_decode($a);
      $a .= "o";
  }
}

Results, little is best

Avx2:
base64_encode      0.042
base64_decode      0.044

Avx512:
base64_encode      0.029
base64_decode      0.032

Vbmi:
base64_encode      0.020
base64_decode      0.025

@divinity76
Copy link
Contributor

divinity76 commented Oct 27, 2020

did you check the performance impact on a system where the COMPILER supports avx512/vbmi but the cpu doesn't?

(i've heard horror stories about gcc emulating intrinsics horribly slow on non-capable cpus)

@frankdjx
Copy link
Contributor Author

did you check the performance impact on a system where the COMPILER supports avx512/vbmi but the cpu doesn't?

(i've heard horror stories about gcc emulating intrinsics horribly slow on non-capable cpus)

I just checked on a AVX2 machine, no impact. It has a runtime path(see resolve_base64_encode/resolve_base64_decode) to resolve the SIMD path capable by the machine, for a AVX2 device it still pick the original AVX2 intrinsic code.

@frankdjx
Copy link
Contributor Author

@cmb69 Do you know who can help to review this? Thanks. Also we has other AVX512 related performance optimization, this PR add some build and runtime support for AVX512 variants.

@cmb69
Copy link
Member

cmb69 commented Nov 11, 2020

AFAIK, @laruence originally implemented this. Maybe he want to take a look?

@LifeIsStrange
Copy link

@jianxind my comment is off-topic: I see that you work at Intel so it is in your interest to optimizer hardware use on mainstream languages. Therefore a nice follow-up of this PR would be to port such optimization to the JVM/OpenJDK too (using intrinsics or the https://openjdk.java.net/jeps/338 ). It's just an HS suggestion from a jealous java dev, feel free to not answer my comment :)

@frankdjx
Copy link
Contributor Author

@jianxind my comment is off-topic: I see that you work at Intel so it is in your interest to optimizer hardware use on mainstream languages. Therefore a nice follow-up of this PR would be to port such optimization to the JVM/OpenJDK too (using intrinsics or the https://openjdk.java.net/jeps/338 ). It's just an HS suggestion from a jealous java dev, feel free to not answer my comment :)

Great thanks for the possible pathfinding, yes it's our interest to fully utilize hardware capacity on all frameworks. I see AdoptOpenJDK/openjdk-jdk11@d764765 added AVX512 support for base64 encoding, seems it use a different algorithm. We'd like to check the technology detail then and add Java to our scope.

@stayeronglass
Copy link

@dstogov Dmitry?
@nikic Nikita?
can take a look?

@Girgias
Copy link
Member

Girgias commented Feb 5, 2023

@jianxind could you rebase this on latest master?

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't object against this.

The only bad thing is reading strings above boundary. This should be resolved to avoid valgrind and address sanitizer warnings.

The speed measurement was done on short strings. The speed up on longer strings should be much better. Usually, SIMD optimized string function may provide even worse result on short strings.

@frankdjx
Copy link
Contributor Author

frankdjx commented Feb 6, 2023

was done on short strings. The speed up on longer strings should be much better. Usually, SIMD optimized string function may provide even worse result on short strings.

Yes, more improvements on the longer strings. The benchmark measure the average performance on string size from 1 to 10000. I don't know if has much longer size then 10000 in real case:)

@dstogov
Copy link
Member

dstogov commented Feb 6, 2023

Yes, more improvements on the longer strings. The benchmark measure the average performance on string size from 1 to 10000. I don't know if has much longer size then 10000 in real case:)

Right again. Sorry, I should be more careful :)

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please split your new test cases into a new file as recommended by Dmitry?

build/php.m4 Outdated Show resolved Hide resolved
@frankdjx
Copy link
Contributor Author

frankdjx commented Feb 7, 2023

Could you please split your new test cases into a new file as recommended by Dmitry?

Done, all test case are new adding now. Also add one loop test to cover all SIMD variant.

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check this in action, but sources look good and tests passed.
In worst case of serious problems we may revert this.

@zeriyoshi
Copy link
Contributor

@dstogov
We may need to be aware of the clock-down problem that Intel CPUs have with the AVX instruction
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

PHP is often executed in many process. The AVX-2, AVX-512 instructions improve the efficiency of single instruction performance, but threads running in parallel that do not use SIMD are subject to clock down. This can affect servers with high RPS.

On the other hand, these problems are mitigated on CPUs from IceLake or later:
https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Sorry, this is a multi-post with room11, but it was timely.

@frankdjx
Copy link
Contributor Author

frankdjx commented Feb 7, 2023

@dstogov We may need to be aware of the clock-down problem that Intel CPUs have with the AVX instruction https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

PHP is often executed in many process. The AVX-2, AVX-512 instructions improve the efficiency of single instruction performance, but threads running in parallel that do not use SIMD are subject to clock down. This can affect servers with high RPS.

On the other hand, these problems are mitigated on CPUs from IceLake or later: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Sorry, this is a multi-post with room11, but it was timely.

The clock-down mostly happens on heavy data operation(FMA/crypto instruction), in current case, the SIMD part for base64 only use the light byte shuffle/shift instructions, even for CPUs before Icelake the the chance of clock-down is very low.

@zeriyoshi
Copy link
Contributor

The clock-down mostly happens on heavy data operation(FMA/crypto instruction), in current case, the SIMD part for base64 only use the light byte shuffle/shift instructions, even for CPUs before Icelake the the chance of clock-down is very low.

Yes, but this source, all AVX-512 instructions used the L1 (Power State 1 or higher), even on the ICL. This is meant to result in a lightweight clock down.

I am all for SIMD speedups, but I think we need to look at this carefully.

@dstogov
Copy link
Member

dstogov commented Feb 7, 2023

I didn't know about frequency level licensing yet. First make instruction sets that may bring at most 2 times speed up, but then drop frequency of all cores when using them, that may bring up to 1.5 times slowdown for all cores. Funny :)

Anyway, I don't think this is a stopper for this particular use case.

1. Implementation based on https://github.com/WojciechMula/base64simd
2. Only runtime path is added to reduce the complexity of SIMD variants.
3. Expand test case to cover SIMD implementation.

Signed-off-by: Frank Du <[email protected]>
@Girgias
Copy link
Member

Girgias commented Feb 13, 2023

 /home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c: In function ‘zend_jit_init’:
/home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c:4850:119: error: passing argument 4 of ‘ts_allocate_id’ from incompatible pointer type [-Werror=incompatible-pointer-types]
 4850 |  jit_globals_id = ts_allocate_id(&jit_globals_id, sizeof(zend_jit_globals), (ts_allocate_ctor) zend_jit_globals_ctor, zend_jit_globals_dtor);
      |                                                                                                                       ^~~~~~~~~~~~~~~~~~~~~
      |                                                                                                                       |
      |                                                                                                                       void (*)(zend_jit_globals *) {aka void (*)(struct _zend_jit_globals *)}
In file included from /home/runner/work/php-src/php-src/Zend/zend_portability.h:47,
                 from /home/runner/work/php-src/php-src/Zend/zend_types.h:25,
                 from /home/runner/work/php-src/php-src/Zend/zend.h:27,
                 from /home/runner/work/php-src/php-src/main/php.h:31,
                 from /home/runner/work/php-src/php-src/ext/opcache/jit/zend_jit.c:19:
/home/runner/work/php-src/php-src/Zend/../TSRM/TSRM.h:91:110: note: expected ‘ts_allocate_dtor’ {aka ‘void (*)(void *)’} but argument is of type ‘void (*)(zend_jit_globals *)’ {aka ‘void (*)(struct _zend_jit_globals *)’}
   91 | TSRM_API ts_rsrc_id ts_allocate_id(ts_rsrc_id *rsrc_id, size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor);
      |                                                                                             ~~~~~~~~~~~~~~~~~^~~~
cc1: all warnings being treated as errors

Please fix the compilation error.

@frankdjx
Copy link
Contributor Author

/home/runner/work/php-src/php-src/Zend/../TSRM/TSRM.h:91:110: note: expected ‘ts_allocate_dtor’ {aka ‘void (*)(void *)’} but argument is of type ‘void (*)(zend_jit_globals *)’ {aka ‘void (*)(struct _zend_jit_globals *)’}
   91 | TSRM_API ts_rsrc_id ts_allocate_id(ts_rsrc_id *rsrc_id, size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor);
      |                                                                                             ~~~~~~~~~~~~~~~~~^~~~
cc1: all warnings being treated as errors

Please fix the compilation error.

This error is not related with this commit. Just rebase the PR and then we can see how it's going.

@Girgias
Copy link
Member

Girgias commented Feb 13, 2023

/home/runner/work/php-src/php-src/Zend/../TSRM/TSRM.h:91:110: note: expected ‘ts_allocate_dtor’ {aka ‘void (*)(void *)’} but argument is of type ‘void (*)(zend_jit_globals *)’ {aka ‘void (*)(struct _zend_jit_globals *)’}
   91 | TSRM_API ts_rsrc_id ts_allocate_id(ts_rsrc_id *rsrc_id, size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor);
      |                                                                                             ~~~~~~~~~~~~~~~~~^~~~
cc1: all warnings being treated as errors

Please fix the compilation error.

This error is not related with this commit. Just rebase the PR and then we can see how it's going.

I'll let CI run and merge this if it's mostly green. Sorry for this taking so long to get merged.

@Girgias Girgias merged commit a9437ce into php:master Feb 13, 2023
@frankdjx frankdjx deleted the base64_avx512 branch February 13, 2023 04:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants