Optimization failing on AWS Graviton2 (neoverse-n1) machine #2696

Closed
IanButterworth opened this issue Jul 1, 2020 · 23 comments

@IanButterworth commented Jul 1, 2020

OpenBLAS matrix-multiplication optimization seems to be failing on an AWS EC2 ARM Graviton2 (Neoverse-N1) system with the following Julia setup:

julia> versioninfo()
Julia Version 1.6.0-DEV.341
Commit 8367e441ac* (2020-07-01 18:30 UTC)
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.0 (ORCJIT, neoverse-n1)
Environment:
  JULIA_NUM_THREADS = 16

julia> LinearAlgebra.BLAS.openblas_get_config()
"OpenBLAS 0.3.9 NO_AFFINITY ARMV8 MAX_THREADS=32"

julia> using BenchmarkTools
julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  21.123 ms (2 allocations: 39.14 KiB)

Compared to a Mac:

julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  18.161 μs (2 allocations: 39.14 KiB)
@martin-frbg (Collaborator)

This looks to me as if the CPU was not identified, leading to a generic ARMV8 build (approximately Cortex-A57). I am also not sure whether the Neoverse PR was ever tested with anything other than gcc. (And I notice that Julia itself labels the CPU as "unknown".)
As far as OpenBLAS is concerned, you could try setting the environment variable OPENBLAS_CORETYPE to NEOVERSEN1 and see if that helps. This would override the possibly incorrect autodetection in the library, assuming the Julia binary of OpenBLAS was built with DYNAMIC_ARCH=1 to support multiple CPU types. What is clear from the openblas_get_config output is that OpenBLAS was compiled for a maximum of 32 threads, so it would not be able to make full use of a 64-core machine.
But in any case, the current implementation of Neoverse N1 support is basically a mix of the pre-existing ARMV8 and ThunderX2 targets, so it may not have optimum performance yet.
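
For reference, assuming the bundled library does have DYNAMIC_ARCH support, the override can simply be set in the environment when launching Julia; OPENBLAS_VERBOSE=3 additionally prints which core type was selected:

$ OPENBLAS_CORETYPE=NEOVERSEN1 OPENBLAS_VERBOSE=3 julia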

@IanButterworth (Author) commented Jul 1, 2020

Great. Setting OPENBLAS_CORETYPE=NEOVERSEN1 seems to enable the optimization, but it falls a bit short of the Intel Mac result, as you predicted:

julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  52.135 μs (2 allocations: 39.14 KiB)

@martin-frbg (Collaborator)

Can you get me the output of /proc/cpuinfo, please? Even though the Graviton2 is based on the Neoverse, I suspect it has a different implementer and/or part code, hence the autodetection failure.

@IanButterworth (Author)

Happily:

$ cat /proc/cpuinfo
processor	: 0
BogoMIPS	: 243.75
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

(processors 1-15 report identical values: the same Features line, CPU implementer 0x41, variant 0x3, part 0xd0c, revision 1)

@martin-frbg (Collaborator)

Strange: this shows the manufacturer and model codes of a genuine Neoverse-N1 (implementer 0x41 is Arm, part 0xd0c is the Neoverse N1), just as they are already present in dynamic_arm64.c (and cpuid_arm64.c). Right now I do not see why autodetection would fail (unless different hardware actually provides this EC2 instance type).

@IanButterworth (Author) commented Jul 2, 2020

This instance wasn't one of the "metal" instances, i.e. the core count is restricted, I assume through some level of virtualization. Could that be the issue? OpenBLAS via Julia typically works on non-metal x86 instances like this one, though.

@yuyichao (Contributor) commented Jul 2, 2020

It's probably not the reason here, but note that CPUID should not be used to identify the CPU on ARM: with big.LITTLE and the like, the CPUID result for MIDR_EL1 is unstable, since it depends on which core the code happens to be running on. You can easily get the wrong (or unwanted) result if you happen to run on a little core.

The correct way is to read the /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 file.
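
For example (a sketch; the path assumes a standard Linux sysfs layout on aarch64):

$ cat /sys/devices/system/cpu/cpu0/regs/identification/midr_el1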

@yuyichao (Contributor) commented Jul 2, 2020

And I said this is not the reason here because the result given to me at JuliaLang/julia#36464 (comment) suggests that the instruction does return the correct value in this case, and the cpuinfo suggests that all the cores are identical.

@martin-frbg (Collaborator)

Actually, the runtime detection uses an mrs instruction to get MIDR_EL1 (and should complain with a clear message if that fails); only the cpuid_arm64 code invoked for compile-time detection still reads from /proc/cpuinfo. Do you see anything logged, like either "Kernel lacks cpuid feature support. Auto detection of core type failed !!!" or "Falling back to generic ARMV8 core"?
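
For anyone following along, here is a rough standalone sketch of that runtime path (not OpenBLAS's exact code): check the HWCAP_CPUID auxv bit, read MIDR_EL1 with mrs, and decode the implementer and part fields; implementer 0x41 with part 0xd0c is what the Neoverse N1 entry matches on. aarch64 Linux only, and the file name is made up:

/* mrs_check.c: hypothetical test program, compile with gcc on aarch64 Linux */
#include <stdio.h>
#include <stdint.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void) {
    /* Without HWCAP_CPUID the kernel does not emulate userspace reads of
       the ID registers; this is the case OpenBLAS complains about. */
    if (!(getauxval(AT_HWCAP) & HWCAP_CPUID)) {
        printf("Kernel lacks cpuid feature support\n");
        return 1;
    }
    uint64_t midr;
    __asm__("mrs %0, MIDR_EL1" : "=r"(midr));
    printf("MIDR_EL1    : 0x%016llx\n", (unsigned long long)midr);
    printf("implementer : 0x%02x\n", (unsigned)((midr >> 24) & 0xff));  /* 0x41 = Arm */
    printf("part        : 0x%03x\n", (unsigned)((midr >> 4) & 0xfff));  /* 0xd0c = Neoverse N1 */
    return 0;
}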

@IanButterworth (Author)

On my mac, with julia 1.4.2 & OpenBLAS 0.3.5, starting julia with OPENBLAS_VERBOSE=3 gives:

$ OPENBLAS_VERBOSE=3 julia
Core: Haswell

but nothing is printed if I do the same on the Graviton2 system: no warning, and no "Core:" line.

@IanButterworth (Author)

Perhaps this is indicative... I just got a fast case without having set OPENBLAS_CORETYPE=NEOVERSEN1:

$ ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-DEV.341 (2020-07-01)
 _/ |\__'_|_|_|\__'_|  |  vc/upgrade_llvm_10/8367e441ac* (fork: 1 commits, 1 day)
|__/                   |
julia> using BenchmarkTools
julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  29.982 μs (2 allocations: 39.14 KiB)

I tried a few more julia sessions and it went back to slow speeds.

@yuyichao (Contributor) commented Jul 2, 2020

Do you ever get unstable/different results from the C program I asked you to run, if you run it multiple times?

@IanButterworth (Author)

I just ran it ~40 times and the output didn't change:

$ ./mrsemu
ID_AA64ISAR0_EL1    : 0x0000100010211120
ID_AA64ISAR1_EL1    : 0x0000000000100001
ID_AA64MMFR0_EL1    : 0x00000000ff000000
ID_AA64MMFR1_EL1    : 0x0000000000000000
ID_AA64PFR0_EL1     : 0x0000000000110011
ID_AA64PFR1_EL1     : 0x0000000000000020
ID_AA64DFR0_EL1     : 0x0000000000000006
ID_AA64DFR1_EL1     : 0x0000000000000000
MIDR_EL1            : 0x00000000413fd0c1
MPIDR_EL1           : 0x0000000080000000
REVIDR_EL1          : 0x0000000000000000

@martin-frbg (Collaborator)

Very strange. The code for printing the "Core:" information in verbose mode is identical between the arm64 and x86 versions of DYNAMIC_ARCH support, and I do not see how it can escape printing something.
Occasional fast cases without the CORETYPE variable would support my theory that there are two types of hardware providing your EC2 instance: one with a genuine Neoverse N1 and a "compatible" with an unfamiliar identifier.

@martin-frbg (Collaborator)

Does Julia filter the info that OpenBLAS tries to print? (Unlikely, but just asking for completeness.) I just did a DYNAMIC_ARCH build on my phone (in termux), and there the library complains that my kernel does not support the cpuid feature before announcing a fallback to generic ARMV8 and printing "Core: armv8", as it should.

@IanButterworth (Author)

Just for the record, the same thing happens with Julia 1.4.2 and OpenBLAS 0.3.5:

$ OPENBLAS_VERBOSE=3 ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.2 (2020-05-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/ 

@martin-frbg (Collaborator)

0.3.5 did not have Neoverse support, but it should still print "Falling back to generic ARMV8 core" and "Core: armv8" (or, worst case, stage a dramatic exit with "OpenBLAS: Architecture Initialization failed"). Something here makes no sense at all.

@IanButterworth (Author)

I'm happy to build and run something else to see whether this is Julia-specific, if you have a recommendation. The easier the better.

@martin-frbg (Collaborator)

Any of the test cases included with OpenBLAS would be my choice, or a build of OpenBLAS itself with `make DYNAMIC_ARCH=1`.
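
For example, something along these lines (a rough sketch; the repository URL and the utest binary name assume a default in-tree build):

$ git clone https://github.com/xianyi/OpenBLAS && cd OpenBLAS
$ make DYNAMIC_ARCH=1
$ OPENBLAS_VERBOSE=3 ./utest/openblas_utest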

@giordano (Contributor) commented Jul 2, 2020

This is the configuration of the pre-built OpenBLAS provided by Julia: https://github.com/JuliaPackaging/Yggdrasil/blob/816c103b16c1b63fdb1ef17677be103c3efe3cb8/O/OpenBLAS/common.jl#L73-L83. DYNAMIC_ARCH is only used when building for Intel CPUs; could this be the issue?

@martin-frbg (Collaborator)

If that is what is currently installed on the AWS Neoverse, it should not react to setting OPENBLAS_CORETYPE at all, and should always use the generic ARMV8 support. That would explain the lack of a coretype printout, but not the huge variations in performance.

@IanButterworth (Author)

@giordano made a trial OpenBLAS build with DYNAMIC_ARCH=1 for aarch64, which I swapped in for Julia's bundled libopenblas.so, and we have success:

$ OPENBLAS_VERBOSE=3 ./julia
Core: neoversen1
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-DEV.341 (2020-07-01)
 _/ |\__'_|_|_|\__'_|  |  vc/upgrade_llvm_10/8367e441ac* (fork: 1 commits, 1 day)
|__/                   |
julia> using LinearAlgebra
julia> LinearAlgebra.BLAS.openblas_get_config()
"OpenBLAS 0.3.9 DYNAMIC_ARCH NO_AFFINITY neoversen1 MAX_THREADS=32"
julia> using BenchmarkTools
julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  20.972 μs (2 allocations: 39.14 KiB)

@martin-frbg (Collaborator)

Great, though I am wondering whether the variations in performance will continue. (But if they do, at least it should then be clearer whether the CPU identification changed.) I notice that the author of the PR that added Neoverse support also maintains https://github.com/aws/aws-graviton-getting-started, so choosing the generic ARMV8 GEMM kernels over their ThunderX2-optimized counterparts was probably deliberate. (Actually, I am not sure what performance to expect from 16 Neoverse cores compared to your Haswell-generation Mac.)
