Optimization failing on AWS Graviton2 (neoverse-n1) machine #2696

Closed
IanButterworth opened this issue Jul 1, 2020 · 23 comments

@IanButterworth commented Jul 1, 2020

OpenBLAS matrix-multiplication optimization seems to be failing on an AWS EC2 ARM Graviton2 (Neoverse-N1) system with the following Julia setup:

julia> versioninfo()
Julia Version 1.6.0-DEV.341
Commit 8367e441ac* (2020-07-01 18:30 UTC)
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.0 (ORCJIT, neoverse-n1)
Environment:
  JULIA_NUM_THREADS = 16

julia> LinearAlgebra.BLAS.openblas_get_config()
"OpenBLAS 0.3.9 NO_AFFINITY ARMV8 MAX_THREADS=32"

julia> using BenchmarkTools
julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  21.123 ms (2 allocations: 39.14 KiB)

Compared to a Mac:

julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  18.161 μs (2 allocations: 39.14 KiB)
@martin-frbg (Collaborator)

This looks to me as if the CPU was not identified, leading to a generic ARMV8 build (approximately Cortex-A57). I am also not sure whether the Neoverse PR was ever tested with anything other than gcc. (And I notice that Julia itself labels the CPU as "unknown".)
As far as OpenBLAS is concerned, you could try setting the environment variable OPENBLAS_CORETYPE to NEOVERSEN1 and see if that helps. This would override the possibly incorrect autodetection in the library, assuming the Julia binary of OpenBLAS was built with DYNAMIC_ARCH=1 to support multiple CPU types. What is clear from the openblas_get_config output is that OpenBLAS was compiled for a maximum of 32 threads, so it would not be able to make full use of a 64-core machine.
But in any case, the current implementation of Neoverse N1 support is basically a mix of the pre-existing ARMV8 and ThunderX2 targets, so it may not have optimum performance yet.
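
For reference, assuming the bundled library does have DYNAMIC_ARCH support, the override can simply be set in the environment when launching Julia; OPENBLAS_VERBOSE=3 additionally prints which core type was selected:

$ OPENBLAS_CORETYPE=NEOVERSEN1 OPENBLAS_VERBOSE=3 julia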

@IanButterworth (Author) commented Jul 1, 2020

Great. Setting OPENBLAS_CORETYPE=NEOVERSEN1 seems to enable the optimization, but it falls a bit short of the Intel Mac result, as you predicted:

julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  52.135 μs (2 allocations: 39.14 KiB)

@martin-frbg (Collaborator)

Can you get me the output of /proc/cpuinfo, please? Even though the Graviton2 is based on the Neoverse, I suspect it has a different implementer and/or part code, hence the autodetection failure.

@IanButterworth (Author)

Happily:

$ cat /proc/cpuinfo
processor	: 0
BogoMIPS	: 243.75
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

(processors 1-15 report identical values: the same Features line, CPU implementer 0x41, variant 0x3, part 0xd0c, revision 1)

@martin-frbg (Collaborator)

Strange: this shows the manufacturer and model codes of a genuine Neoverse-N1 (implementer 0x41 is Arm, part 0xd0c is the Neoverse N1), just as they are already present in dynamic_arm64.c (and cpuid_arm64.c). Right now I do not see why autodetection would fail (unless different hardware actually provides this EC2 instance type).

@IanButterworth (Author) commented Jul 2, 2020

This instance wasn't one of the "metal" instances, i.e. the core count is restricted, I assume through some level of virtualization. Could that be the issue? OpenBLAS via Julia typically works on non-metal x86 instances like this one, though.

@yuyichao (Contributor) commented Jul 2, 2020

It's probably not the reason here, but note that CPUID should not be used to identify the CPU on ARM: with big.LITTLE and the like, the CPUID result for MIDR_EL1 is unstable, since it depends on which core the code happens to be running on. You can easily get the wrong (or unwanted) result if you happen to run on a little core.

The correct way is to read the /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 file.
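
For example (a sketch; the path assumes a standard Linux sysfs layout on aarch64):

$ cat /sys/devices/system/cpu/cpu0/regs/identification/midr_el1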

@yuyichao (Contributor) commented Jul 2, 2020

And I said this is not the reason here because the result given to me at JuliaLang/julia#36464 (comment) suggests that the instruction does return the correct value in this case, and the cpuinfo suggests that all the cores are identical.

@martin-frbg (Collaborator)

Actually, the runtime detection uses an mrs instruction to get MIDR_EL1 (and should complain with a clear message if that fails); only the cpuid_arm64 code invoked for compile-time detection still reads from /proc/cpuinfo. Do you see anything logged, like either "Kernel lacks cpuid feature support. Auto detection of core type failed !!!" or "Falling back to generic ARMV8 core"?
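
For anyone following along, here is a rough standalone sketch of that runtime path (not OpenBLAS's exact code): check the HWCAP_CPUID auxv bit, read MIDR_EL1 with mrs, and decode the implementer and part fields; implementer 0x41 with part 0xd0c is what the Neoverse N1 entry matches on. aarch64 Linux only, and the file name is made up:

/* mrs_check.c: hypothetical test program, compile with gcc on aarch64 Linux */
#include <stdio.h>
#include <stdint.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void) {
    /* Without HWCAP_CPUID the kernel does not emulate userspace reads of
       the ID registers; this is the case OpenBLAS complains about. */
    if (!(getauxval(AT_HWCAP) & HWCAP_CPUID)) {
        printf("Kernel lacks cpuid feature support\n");
        return 1;
    }
    uint64_t midr;
    __asm__("mrs %0, MIDR_EL1" : "=r"(midr));
    printf("MIDR_EL1    : 0x%016llx\n", (unsigned long long)midr);
    printf("implementer : 0x%02x\n", (unsigned)((midr >> 24) & 0xff));  /* 0x41 = Arm */
    printf("part        : 0x%03x\n", (unsigned)((midr >> 4) & 0xfff));  /* 0xd0c = Neoverse N1 */
    return 0;
}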

@IanButterworth (Author)

On my mac, with julia 1.4.2 & OpenBLAS 0.3.5, starting julia with OPENBLAS_VERBOSE=3 gives:

$ OPENBLAS_VERBOSE=3 julia
Core: Haswell

but nothing is printed if I do the same on the Graviton2 system: no warning, and no "Core:" line.

@IanButterworth (Author)

Perhaps this is indicative... I just got a fast case without having set OPENBLAS_CORETYPE=NEOVERSEN1:

$ ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-DEV.341 (2020-07-01)
 _/ |\__'_|_|_|\__'_|  |  vc/upgrade_llvm_10/8367e441ac* (fork: 1 commits, 1 day)
|__/                   |
julia> using BenchmarkTools
julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  29.982 μs (2 allocations: 39.14 KiB)

I tried a few more julia sessions and it went back to slow speeds.

@yuyichao (Contributor) commented Jul 2, 2020

Do you ever get unstable/different results from the C program I asked you to run, if you run it multiple times?

@IanButterworth (Author)

I just ran it ~40 times and the output didn't change:

$ ./mrsemu
ID_AA64ISAR0_EL1    : 0x0000100010211120
ID_AA64ISAR1_EL1    : 0x0000000000100001
ID_AA64MMFR0_EL1    : 0x00000000ff000000
ID_AA64MMFR1_EL1    : 0x0000000000000000
ID_AA64PFR0_EL1     : 0x0000000000110011
ID_AA64PFR1_EL1     : 0x0000000000000020
ID_AA64DFR0_EL1     : 0x0000000000000006
ID_AA64DFR1_EL1     : 0x0000000000000000
MIDR_EL1            : 0x00000000413fd0c1
MPIDR_EL1           : 0x0000000080000000
REVIDR_EL1          : 0x0000000000000000

@martin-frbg (Collaborator)

Very strange. The code for printing the "Core:" information in verbose mode is identical between the arm64 and x86 versions of DYNAMIC_ARCH support, and I do not see how it can escape printing something.
Occasional fast cases without the CORETYPE variable would support my theory that there are two types of hardware providing your EC2 instance: one with a genuine Neoverse N1 and a "compatible" with an unfamiliar identifier.

@martin-frbg (Collaborator)

Does Julia filter the info that OpenBLAS tries to print? (Unlikely, but just asking for completeness.) I just did a DYNAMIC_ARCH build on my phone (in termux), and there the library complains that my kernel does not support the cpuid feature before announcing a fallback to generic ARMV8 and printing "Core: armv8", as it should.

@IanButterworth (Author)

Just for the record, the same thing happens with Julia 1.4.2 and OpenBLAS 0.3.5:

$ OPENBLAS_VERBOSE=3 ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.2 (2020-05-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/ 

@martin-frbg (Collaborator)

0.3.5 did not have Neoverse support, but it should still print "Falling back to generic ARMV8 core" and "Core: armv8" (or, worst case, stage a dramatic exit with "OpenBLAS: Architecture Initialization failed"). Something here makes no sense at all.

@IanButterworth (Author)

I'm happy to build and run something else to see whether this is Julia-specific, if you have a recommendation. The easier the better.

@martin-frbg (Collaborator)

Any of the test cases included with OpenBLAS would be my choice, or a build of OpenBLAS itself with `make DYNAMIC_ARCH=1`.
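
For example, something along these lines (a rough sketch; the repository URL and the utest binary name assume a default in-tree build):

$ git clone https://github.com/xianyi/OpenBLAS && cd OpenBLAS
$ make DYNAMIC_ARCH=1
$ OPENBLAS_VERBOSE=3 ./utest/openblas_utest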

@giordano (Contributor) commented Jul 2, 2020

This is the configuration of the pre-built OpenBLAS provided by Julia: https://github.com/JuliaPackaging/Yggdrasil/blob/816c103b16c1b63fdb1ef17677be103c3efe3cb8/O/OpenBLAS/common.jl#L73-L83. DYNAMIC_ARCH is only used when building for Intel CPUs; could this be the issue?

@martin-frbg (Collaborator)

If that is what is currently installed on the AWS Neoverse, it should not react to setting OPENBLAS_CORETYPE at all, and should always use the generic ARMV8 support. That would explain the lack of a coretype printout, but not the huge variations in performance.

@IanButterworth (Author)

@giordano made a trial OpenBLAS build with DYNAMIC_ARCH=1 for aarch64, which I swapped in for Julia's bundled libopenblas.so, and we have success:

$ OPENBLAS_VERBOSE=3 ./julia
Core: neoversen1
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-DEV.341 (2020-07-01)
 _/ |\__'_|_|_|\__'_|  |  vc/upgrade_llvm_10/8367e441ac* (fork: 1 commits, 1 day)
|__/                   |
julia> using LinearAlgebra
julia> LinearAlgebra.BLAS.openblas_get_config()
"OpenBLAS 0.3.9 DYNAMIC_ARCH NO_AFFINITY neoversen1 MAX_THREADS=32"
julia> using BenchmarkTools
julia> @btime x * x setup=(x=rand(Float32, 100, 100));
  20.972 μs (2 allocations: 39.14 KiB)

@martin-frbg (Collaborator)

Great, though I am wondering whether the variations in performance will continue. (But if they do, at least it should then be clearer whether the CPU identification changed.) I notice that the author of the PR that added Neoverse support also maintains https://github.com/aws/aws-graviton-getting-started, so choosing the generic ARMV8 GEMM kernels over their ThunderX2-optimized counterparts was probably deliberate. (Actually, I am not sure what performance to expect from 16 Neoverse cores compared to your Haswell-generation Mac.)
