Optimization failing on AWS Graviton2 (neoverse-n1) machine #2696
Comments
This looks to me as if the CPU was not identified, leading to a generic ARMV8 build (approximately Cortex-A57). Also I am not sure if the Neoverse PR was ever tested with anything other than gcc. (And I notice that Julia itself labels the CPU as "unknown") |
Great. Setting
|
Can you get me the output of /proc/cpuinfo please ? Even if the Graviton2 is based on Neoverse, I suspect it will have a different implementer and/or part code, hence the autodetection failure |
Happily: cat /proc/cpuinfo
|
Strange, this shows the manufacturer and model codes of a genuine Neoverse-N1, just as they are already present in dynamic_arm64.c (and cpuid_arm64.c). Right now I do not see why autodetection would fail (unless there is actually different hardware backing this EC2 instance type) |
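For reference, the Neoverse-N1 values at issue are CPU implementer 0x41 (Arm) and CPU part 0xd0c. Below is a minimal sketch of the kind of implementer/part lookup involved; the real tables live in cpuid_arm64.c and dynamic_arm64.c and may be organised differently.

```c
/* Illustrative only; the actual lookup is in cpuid_arm64.c / dynamic_arm64.c.
 * A Neoverse N1 reports "CPU implementer : 0x41" and "CPU part : 0xd0c"
 * in /proc/cpuinfo. */
#include <stdio.h>

static const char *core_from_midr(unsigned implementer, unsigned part) {
    if (implementer == 0x41) {          /* Arm Ltd. */
        switch (part) {
            case 0xd0c: return "NEOVERSEN1";
            case 0xd07: return "CORTEXA57";
            /* further entries omitted */
        }
    }
    return "ARMV8";                     /* generic fallback */
}

int main(void) {
    printf("%s\n", core_from_midr(0x41, 0xd0c));  /* expect NEOVERSEN1 */
    return 0;
}
```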
This instance wasn’t one of the “metal” instances; that is, the core count is restricted, I assume through some level of virtualization. Could that be the issue? Typically OpenBLAS via Julia works on non-metal instances like this on x86, though. |
It's probably not the reason here, but note that the correct way is to read the |
And I said this is not the reason here because the result given to me at JuliaLang/julia#36464 (comment) suggests that the instruction does return the correct value in this case, and the cpuinfo suggests that all the cores are identical. |
Actually the runtime detection is using an |
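For context, here is a hedged sketch of how user space on Linux/arm64 can read the CPU identification register (MIDR_EL1) at run time; whether this mirrors OpenBLAS's exact code path is an assumption, but the kernel only emulates the MRS access when it advertises HWCAP_CPUID, which matches the "kernel does not support the cpuid feature" warning mentioned later in this thread.

```c
/* Sketch: read MIDR_EL1 from user space on Linux/arm64.
 * The kernel emulates this MRS access only when it sets HWCAP_CPUID,
 * so guard the instruction with a getauxval() check. */
#include <stdio.h>
#include <stdint.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void) {
    if (!(getauxval(AT_HWCAP) & HWCAP_CPUID)) {
        fprintf(stderr, "kernel does not expose CPU ID registers\n");
        return 1;
    }
    uint64_t midr;
    __asm__ volatile("mrs %0, MIDR_EL1" : "=r"(midr));
    unsigned implementer = (midr >> 24) & 0xff;   /* 0x41 = Arm Ltd. */
    unsigned part        = (midr >>  4) & 0xfff;  /* 0xd0c = Neoverse N1 */
    printf("implementer=0x%x part=0x%x\n", implementer, part);
    return 0;
}
```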
On my Mac, with Julia 1.4.2 & OpenBLAS 0.3.5, starting Julia with
but nothing is printed if I do the same on the Graviton2 system. No warning, and no |
Perhaps this is indicative... I just got a fast case, without having set
I tried a few more Julia sessions and it went back to slow speeds. |
Do you ever get unstable/different results from the C program I asked you to run, if you run it multiple times? |
I just ran it ~40 times and it didn't change
|
Very strange. The code for printing the "Core:" information in verbose mode is identical between the arm64 and x86 versions of DYNAMIC_ARCH support, and I do not see how it can escape printing something. |
Does Julia filter the info that OpenBLAS tries to print? (Unlikely, but just asking for completeness.) I just did a DYNAMIC_ARCH build on my phone (in Termux), and there the library complains that my kernel does not support the cpuid feature, before announcing a fallback to generic ARMV8 and printing "Core: armv8" as it should. |
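One way to sidestep the question of whether the printout is being swallowed is to query the library directly. A small sketch is below; openblas_get_config() and openblas_get_corename() are exported by OpenBLAS, but note that the build bundled with 64-bit Julia typically renames its symbols, so this assumes a plain libopenblas build.

```c
/* Query the selected core directly, bypassing any stdout filtering.
 * Build with, e.g.:  gcc corename.c -o corename -lopenblas  (paths may vary) */
#include <stdio.h>

extern char *openblas_get_config(void);    /* build configuration string */
extern char *openblas_get_corename(void);  /* core selected at load time */

int main(void) {
    printf("config: %s\n", openblas_get_config());
    printf("core:   %s\n", openblas_get_corename());
    return 0;
}
```

Running it with and without OPENBLAS_CORETYPE=NEOVERSEN1 set should also show whether a DYNAMIC_ARCH build honours the override.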
Just for the record, the same thing happens in Julia 1.4.2 with OpenBLAS 0.3.5
|
0.3.5 did not have Neoverse support, but should still be printing |
I'm happy to build & run something else to see if this is julia-specific, if you have a recommendation? The easier the better |
Any of the test cases included with OpenBLAS would be my choice, or a build of OpenBLAS itself with `make DYNAMIC_ARCH=1` |
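As a Julia-independent check, something along the following lines could be timed against a locally built libopenblas; the matrix size, iteration count, and link command are arbitrary choices here, not taken from the thread.

```c
/* Rough DGEMM timing (e.g. gcc bench.c -o bench -lopenblas).
 * Only meant to show whether the BLAS itself is slow outside Julia. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    const int n = 2000, iters = 5;
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iters; it++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.2f GFLOPS\n", 2.0 * n * n * n * iters / sec / 1e9);  /* 2*n^3 flops per call */
    free(A); free(B); free(C);
    return 0;
}
```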
This is the configuration of the pre-built OpenBLAS provided by Julia: https://github.com/JuliaPackaging/Yggdrasil/blob/816c103b16c1b63fdb1ef17677be103c3efe3cb8/O/OpenBLAS/common.jl#L73-L83. |
If that is what is currently installed on the AWS Neoverse instance, it should not react to setting OPENBLAS_CORETYPE at all, and always use the generic ARMV8 support. That would explain the lack of a coretype printout, but not the huge variations in performance |
@giordano made a trial OpenBLAS build with
|
Great - though I am wondering if the variations in performance will continue. (But if they do, at least it should be clearer whether the CPU identification changed.) I notice that the author of the PR that added Neoverse support also maintains https://github.com/aws/aws-graviton-gettting-started , so choosing the generic ARMV8 GEMM kernels over their ThunderX2-optimized counterparts was probably deliberate. (Actually I am not sure what performance to expect from 16 Neoverse cores compared to your Haswell-generation Mac.) |
OpenBLAS matrix-multiplication optimization on an AWS EC2 ARM Graviton2 (Neoverse-N1) system with the following Julia setup seems to be failing:
compared to a Mac: