
HWCAP_CPUID should not be used to detect CPU uarch on aarch64 #2715

Closed
yuyichao opened this issue Jul 11, 2020 · 19 comments · Fixed by #5041

Comments

@yuyichao
Contributor

yuyichao commented Jul 11, 2020

Ref #2696 (comment) since the issue was closed.

This is because with big.LITTLE and the like, the CPUID result for MIDR_EL1 is unstable: it depends on which core the code happens to be running on. You can easily get a wrong (or unwanted) result if you happen to run on a little core.

The correct way is to read the /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 file.

This should replace https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/driver/others/dynamic_arm64.c#L150 and should also be preferred over the /proc/cpuinfo reader in https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/cpuid_arm64.c#L123

@martin-frbg
Collaborator

Closed but not forgotten - I am unsure how to handle this at build time; perhaps it is best to do a DYNAMIC_ARCH build for at least the two core types when a big.LITTLE configuration is encountered? (Currently it can happen that a simple make builds for the weaker core, but a subsequent make lapack-test runs on a big core that got woken up by the compiler workload and goes looking for the wrong library name to link to.)
In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we would want to identify the CPU we are currently running on rather than whatever else is hypothetically available? If the task gets rescheduled between big and little cores, I guess we would need a way to catch this and ask anew whenever it happens (?)
In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we'd want to identify the cpu we are currently running on rather than whatever else is hypothetically available as well ? If the task gets rescheduled between big and little cores I guess we'd need to find a way to catch this and ask anew whenever it happens (?)

@yuyichao
Contributor Author

Closed but not forgotten

That issue isn't about this anyway, and I'm not trying to say it shouldn't have been closed. I just want to make sure there is one open issue tracking this.

I am unsure how to handle this at build time, perhaps best to do a DYNAMIC_ARCH build for at least the two types when a big.little configuration is encountered ?

I don't think there's any difference between build time and runtime here. I would either go for one that's optimized for both big and LITTLE or just one that's optimized for the big.

In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we'd want to identify the cpu we are currently running on rather than whatever else is hypothetically available as well ? If the task gets rescheduled between big and little cores I guess we'd need to find a way to catch this and ask anew whenever it happens (?)

AFAICT the moving between cores isn't something userspace should care about. Running on a LITTLE core means the OS doesn't think the process is doing much work, and running on a big core means it probably is. That's why I think it makes sense to simply optimize for the big core; performance matters much less when running on the LITTLE core. With enough workload I assume both the big and the LITTLE cores are going to be used, so there's arguably a slightly different target to optimize for (GCC seems to have some target like that, but I'm not sure there's enough gain to be worth the effort). However, I still think big-core performance is more important in those cases, and my expectation is that the paired cores are going to be "compatible" enough that code optimized for the big core isn't going to be that bad on the LITTLE one.

Also, AFAICT, the detection is done on only one thread, so the result is going to be wrong for some threads anyway if one only makes the decision based on the current core. (And HWCAP_CPUID uses kernel emulation, so I don't think doing detection on worker threads every time is an option...)

@martin-frbg
Collaborator

Ok, I was unsure if code would simply crash when optimized for big and running on little, but that would certainly be an unhealthy platform. OTOH it seems the /sys/devices path depends on kernel version or something - I do not have the path you suggested on a 3.10 kernel (termux on an Android 8 device).

@yuyichao
Contributor Author

simply crash when optimized for big and running on little

The crash can only be caused by a feature mismatch, which should be detected using HWCAP_*. Even accessing the hardware register for feature detection should be fine, since I think the kernel normalizes it (not 100% sure).

Yes, that is a 4.7 kernel feature, and it was added for exactly this reason. If the file doesn't exist, the alternative should be parsing /proc/cpuinfo, which also contains info for all cores.

@martin-frbg
Collaborator

On 3.10, it enumerates the cores but prints out only one (variable) "CPU part" though.

@yuyichao
Contributor Author

If the kernel doesn't provide anything better, then that'll have to be what you work with, of course... It's still better than CPUID since it should be more stable.

Also note that at least for mainline kernel, HWCAP_CPUID is a 4.11 feature.

@martin-frbg
Collaborator

Ah, ok. Iterating over all cores to determine the most capable type sounds like a bit of a PITA, though.

@brada4
Contributor

brada4 commented Jul 11, 2020

REF: https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt
Mainline kernel 4.11 is only exceeded starting with Android 10; it will take 5-10 years for older kernels to exit mainstream usage.

@yuyichao
Contributor Author

Iterating over all cores to determine the most capable type sounds a bit PITA though.

That's the only way to get a stable result.

Mainline kernel 4.11 is only exceeded starting with Android 10; it will take 5-10 years for older kernels to exit mainstream usage.

Well, that's unrelated to this issue. FWIW, what I'm asking for here would lower the kernel version requirement: the feature intended for this is in 4.7, and the fallback of /proc/cpuinfo is there for all versions.

@ognjentodic

I wonder if this should be left to the developer to set instead of trying to automate it under the hood. Since we can already set thread affinity for BLAS threads, maybe the approach is to do this only for the relevant subset (little/big), where what's relevant would be the developer's choice based on the application.

So, the high-level workflow would be to determine which cores you'd want the openblas to run on, set affinity for blas thread to those cores, and then pick the core type accordingly.

It looks like https://github.com/google/cpu_features works mainly via /proc/cpuinfo (based on the README). @yuyichao is there a reason you prefer the approach via /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 ?

@yuyichao
Contributor Author

I wonder if this should be left to the developer to set instead of trying to automate it under the hood

Allowing it via something better than an environment variable is good, but that must not replace autodetection. Most users will not have a better idea of the CPU architectures to feed into the library (I do, but that's the exception).

is there a reason you prefer the approach via /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 ?

Yes, absolutely. This is the whole reason the sysfs file was added. /proc/cpuinfo was never meant to be machine-parsed; it uses a human-readable format that has changed over the years. It's much simpler and more reliable to use the newer interface. A fallback is certainly still OK.

@t-vi

t-vi commented Apr 15, 2021

I'm not entirely sure if it is relevant for this discussion but:
It appears that the current code (after #2952) causes SIGILL on NVIDIA Jetson devices: HWCAP_CPUID seems to have the checked bit set, but get_cpu_ftr triggers SIGILL nonetheless. This is observed e.g. with the Debian testing/unstable packages of openblas. I'd be happy to investigate more, but from what I can see the /sys method would work.

@martin-frbg
Collaborator

That is/was supposed to be fixed with #3060 but Debian has probably not updated beyond 0.3.13 yet.

@t-vi

t-vi commented Apr 15, 2021

@martin-frbg Thank you for the pointer. When I try to look at their 0.3.14 version, it seems to be missing that patch, too?! I'll file a bug with Debian.

@martin-frbg
Collaborator

That fix is definitely in 0.3.14 (which should be superseded by 0.3.15 "shortly"), and according to the numpy ticket linked from #3060 it solved the problem with Jetson (I do not have Jetson hardware).
As a workaround you can set OPENBLAS_CORETYPE=ARMV8 to avoid using the autodetection mechanism at runtime.

@t-vi

t-vi commented Apr 15, 2021

Yes, thank you. It'd be a shame if the upcoming Debian release were missing the fix. I filed it as Debian Bug 986996 on their system.

@martin-frbg
Collaborator

My fault - fix overwritten by a subsequent PR from a different branch. Restoring...

@t-vi

t-vi commented Apr 16, 2021

I'm glad I didn't just waste your time on something that was completely and obviously fixed. Thank you again for your work on OpenBLAS and your help with this.

@martin-frbg
Collaborator

Thanks for alerting me to my blunder. Pity this could not show up in the CI runs.
