
HWCAP_CPUID should not be used to detect CPU uarch on aarch64 #2715

Closed
yuyichao opened this issue Jul 11, 2020 · 19 comments · Fixed by #5041

Comments

@yuyichao
Contributor

yuyichao commented Jul 11, 2020

Ref #2696 (comment) since the issue was closed.

This is because with big.LITTLE and the like, the CPUID result for MIDR_EL1 is unstable: it depends on which core the code happens to be running on. You can easily get a wrong (or unwanted) result if you happen to run on a little core.

The correct way is to read the /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 file.

This should replace https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/driver/others/dynamic_arm64.c#L150 and should also be preferred over the /proc/cpuinfo reader in https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/cpuid_arm64.c#L123

@martin-frbg
Collaborator

Closed but not forgotten - I am unsure how to handle this at build time; perhaps it is best to do a DYNAMIC_ARCH build for at least the two core types when a big.LITTLE configuration is encountered? (Currently it can happen that a simple make builds for the weaker core, but a subsequent make lapack-test runs on a big core that got woken up by the compiler workload and goes looking for the wrong library name to link to.)
In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we would want to identify the CPU we are currently running on rather than whatever else is hypothetically available? If the task gets rescheduled between big and little cores, I guess we would need a way to catch this and ask anew whenever it happens (?)
In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we'd want to identify the cpu we are currently running on rather than whatever else is hypothetically available as well ? If the task gets rescheduled between big and little cores I guess we'd need to find a way to catch this and ask anew whenever it happens (?)

@yuyichao
Contributor Author

Closed but not forgotten

That issue isn't about this anyway, and I'm not trying to say it shouldn't have been closed. I just want to make sure there is one open issue tracking this.

I am unsure how to handle this at build time, perhaps best to do a DYNAMIC_ARCH build for at least the two types when a big.little configuration is encountered ?

I don't think there's any difference between build time and runtime here. I would either go for one that's optimized for both big and LITTLE or just one that's optimized for the big.

In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we'd want to identify the cpu we are currently running on rather than whatever else is hypothetically available as well ? If the task gets rescheduled between big and little cores I guess we'd need to find a way to catch this and ask anew whenever it happens (?)

AFAICT the moving between cores isn't something userspace should care about. Running on a LITTLE core means the OS doesn't think the process is doing much work, and running on a big core means it probably is. That's why I think it makes sense to simply optimize for the big core; performance matters much less when running on the LITTLE core. With enough workload I assume both the big and the LITTLE cores are going to be used, so there's arguably a slightly different target to optimize for (GCC seems to have some target like that, but I'm not sure there's enough gain to be worth the effort). However, I still think big-core performance is more important in those cases, and my expectation is that the paired cores are going to be "compatible" enough that code optimized for the big core isn't going to be that bad on the LITTLE one.

Also, AFAICT, the detection is done on only one thread, so the result is going to be wrong for some threads anyway if one only makes the decision based on the current core. (And HWCAP_CPUID uses kernel emulation, so I don't think doing detection on worker threads every time is an option...)

@martin-frbg
Collaborator

Ok, I was unsure if code would simply crash when optimized for big and running on little, but that would certainly be an unhealthy platform. OTOH it seems the /sys/devices path depends on kernel version or something - I do not have the path you suggested on a 3.10 kernel (termux on an Android 8 device).

@yuyichao
Contributor Author

simply crash when optimized for big and running on little

The crash can only be caused by a feature mismatch, which should be detected using HWCAP_*. Even accessing the hardware register for feature detection should be fine, since I think the kernel normalizes it (not 100% sure).

Yes, that is a 4.7 kernel feature, and it was added for exactly this reason. If the file doesn't exist, the alternative should be parsing /proc/cpuinfo, which also contains info for all cores.

@martin-frbg
Collaborator

On 3.10, it enumerates the cores but prints out only one (variable) "CPU part" though.

@yuyichao
Contributor Author

If the kernel doesn't provide anything better, then that'll have to be what you work with, of course... It's still better than CPUID since it should be more stable.

Also note that at least for mainline kernel, HWCAP_CPUID is a 4.11 feature.

@martin-frbg
Collaborator

Ah, ok. Iterating over all cores to determine the most capable type sounds like a bit of a PITA, though.

@brada4
Contributor

brada4 commented Jul 11, 2020

REF: https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt
Mainline kernel 4.11 is only exceeded starting with Android 10; it will take 5-10 years for older kernels to exit mainstream usage.

@yuyichao
Contributor Author

Iterating over all cores to determine the most capable type sounds a bit PITA though.

That's the only way to get a stable result.

Mainline kernel 4.11 is only exceeded starting with Android 10; it will take 5-10 years for older kernels to exit mainstream usage.

Well, that's unrelated to this issue. FWIW, what I'm asking for here would lower the kernel version requirement: the feature intended for this is in 4.7, and the fallback of /proc/cpuinfo is there for all versions.

@ognjentodic

I wonder if this should be left to the developer to set instead of trying to automate it under the hood. Since we can already set thread affinity for BLAS threads, maybe the approach is to do this only for the relevant subset (little/big), where what's relevant would be the developer's choice based on the application.

So, the high-level workflow would be to determine which cores you'd want the openblas to run on, set affinity for blas thread to those cores, and then pick the core type accordingly.

It looks like https://github.com/google/cpu_features works mainly via /proc/cpuinfo (based on the README). @yuyichao is there a reason you prefer the approach via /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 ?

@yuyichao
Contributor Author

I wonder if this should be left to the developer to set instead of trying to automate it under the hood

Allowing it via something better than an environment variable is good, but that must not replace autodetection. Most users will not have a better idea of the CPU architectures to feed into the library (I do, but that's the exception).

is there a reason you prefer the approach via /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 ?

Yes, absolutely. This is the whole reason the sysfs file was added. /proc/cpuinfo was never meant to be machine-parsed; it uses a human-readable format that has changed over the years. It's much simpler and more reliable to use the newer interface. A fallback is certainly still OK.

@t-vi

t-vi commented Apr 15, 2021

I'm not entirely sure if it is relevant for this discussion but:
It appears that the current code (after #2952) causes SIGILL on NVIDIA Jetson devices: HWCAP_CPUID seems to have the checked bit set, but get_cpu_ftr triggers SIGILL nonetheless. This is observed e.g. with the Debian testing/unstable packages of openblas. I'd be happy to investigate more, but from what I can see the /sys method would work.

@martin-frbg
Collaborator

That is/was supposed to be fixed with #3060 but Debian has probably not updated beyond 0.3.13 yet.

@t-vi

t-vi commented Apr 15, 2021

@martin-frbg Thank you for the pointer. When I try to look at their 0.3.14 version, it seems to be missing that patch, too?! I'll file a bug with Debian.

@martin-frbg
Collaborator

That fix is definitely in 0.3.14 (which should be superseded by 0.3.15 "shortly"), and according to the numpy ticket linked from #3060 it solved the problem with Jetson (I do not have Jetson hardware).
As a workaround you can set OPENBLAS_CORETYPE=ARMV8 to avoid using the autodetection mechanism at runtime.

@t-vi

t-vi commented Apr 15, 2021

Yes, thank you. It'd be a shame if the upcoming Debian release were missing the fix. I filed it as Debian Bug 986996 on their system.

@martin-frbg
Collaborator

My fault - fix overwritten by a subsequent PR from a different branch. Restoring...

@t-vi

t-vi commented Apr 16, 2021

I'm glad I didn't just waste your time on something that was completely and obviously fixed. Thank you again for your work on OpenBLAS and your help with this.

@martin-frbg
Collaborator

Thanks for alerting me to my blunder. Pity this could not show up in the CI runs.
