HWCAP_CPUID should not be used to detect CPU uarch on aarch64 #2715
Closed but not forgotten - I am unsure how to handle this at build time; perhaps it is best to do a DYNAMIC_ARCH build for at least the two core types when a big.LITTLE configuration is encountered? (Currently it can happen that a simple …)
That issue isn't about this anyway, and I'm not trying to say that one shouldn't have been closed. I just want to make sure there is one open issue tracking this.
I don't think there's any difference between build time and runtime here. I would either go for one that's optimized for both big and LITTLE or just one that's optimized for the big.
AFAICT the moving between cores isn't something userspace should care about. Running on a LITTLE core means the OS doesn't think the process is doing much work, and running on a big core means it is probably doing a lot of work. That's why I think it makes sense to simply optimize for the big core; performance matters much less when running on a LITTLE core. With enough workload I assume both the big and the LITTLE cores are going to be used, so there's probably a slightly different target to optimize for (GCC seems to have some target like that, but I'm not sure there's enough gain to be worth the effort). However, I still think the big-core performance is more important in those cases, and my faith is that the paired cores are going to be "compatible" enough that code optimized for the big core isn't going to be that bad on the LITTLE one. Also, AFAICT the detection is done on only one thread, so the result is going to be wrong for some threads anyway if one only makes the decision based on the current core. (And HWCAP_CPUID uses kernel emulation, so I don't think doing detection on worker threads every time is an option...)
OK, I was unsure whether code would simply crash when optimized for the big core and running on the little one, but that would certainly be an unhealthy platform then. OTOH it seems the /sys/devices path depends on the kernel version or something - I do not have the path you suggested on a 3.10 kernel (Termux on an Android 8 device).
The crash can only be caused by a feature mismatch, and feature detection should use the HWCAP feature bits rather than the CPU part number. Yes, that is a 4.7 kernel feature, and it was added for exactly this reason. If the file doesn't exist, the alternative should be parsing of /proc/cpuinfo.
On 3.10, it enumerates the cores but prints out only one (variable) "CPU part" though.
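For reference, a minimal sketch of that /proc/cpuinfo route (an illustration, not OpenBLAS code): scan every "CPU part" line instead of stopping at the first one, since on a big.LITTLE system the file can list different parts for different cores, and on an old kernel like 3.10 it may only show the part of whichever core answered the read.

```c
/* Sketch: collect every "CPU part" value from /proc/cpuinfo.
 * On big.LITTLE the file may list several distinct parts; on some
 * old kernels it only reports one, so treat the result as a hint. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }
    char line[256];
    unsigned part;
    while (fgets(line, sizeof(line), f)) {
        /* Lines look like: "CPU part        : 0xd08" */
        if (sscanf(line, "CPU part : 0x%x", &part) == 1)
            printf("found CPU part 0x%03x\n", part);
    }
    fclose(f);
    return 0;
}
```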
If the kernel doesn't provide anything better, then that'll have to be what you work with, of course... It's still better than CPUID since it should be more stable. Also note that at least for the mainline kernel, …
Ah, OK. Iterating over all cores to determine the most capable type sounds like a bit of a PITA, though.
REF: https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt
That's the only way to get a stable result.
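A minimal sketch of that iteration, assuming a kernel new enough (4.7+) to expose the per-core register files: read each core's MIDR_EL1 copy from sysfs and decode the implementer and part fields, so the caller can then pick whichever part it considers the most capable.

```c
/* Sketch: enumerate /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1
 * (kernel 4.7+) and decode each core's implementer/part fields.
 * MIDR_EL1 layout: implementer in bits [31:24], part number in bits [15:4]. */
#include <stdio.h>

int main(void) {
    char path[128];
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/regs/identification/midr_el1", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;              /* no more cores, or the kernel lacks the file */
        unsigned long midr = 0;
        if (fscanf(f, "%lx", &midr) == 1)
            printf("cpu%d: implementer 0x%02lx, part 0x%03lx\n",
                   cpu, (midr >> 24) & 0xff, (midr >> 4) & 0xfff);
        fclose(f);
    }
    return 0;
}
```

Picking the "most capable" part from that set still needs a table of known part numbers, which is presumably where the PITA factor comes in.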
Well, that's unrelated to this issue. FWIW, what I'm asking for here is to lower the kernel version requirement. The feature that is intended for this is in 4.7, and the fallback of /proc/cpuinfo should cover older kernels.
I wonder if this should be left to the developer to set instead of trying to automate it under the hood. Since we can already set thread affinity for BLAS threads, maybe the approach is that this is done only for the relevant subset (little/big), where what's relevant would be the developer's choice based on the application. So the high-level workflow would be to determine which cores you'd want OpenBLAS to run on, set affinity for the BLAS threads to those cores, and then pick the core type accordingly. It looks like https://github.com/google/cpu_features works mainly via /proc/cpuinfo (based on the README). @yuyichao, is there a reason you prefer the approach via the sysfs midr_el1 files?
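As a rough illustration of that workflow (a sketch only; the core numbers 4-7 are an assumed "big" cluster, not anything OpenBLAS defines): pin the calling thread to the chosen cores with sched_setaffinity() first, so that any per-core detection run afterwards can only observe those cores.

```c
/* Sketch: restrict the current thread to an assumed "big" cluster
 * (cores 4-7 here), then run core-type detection on the pinned thread. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; cpu++)  /* assumption: 4+4 big.LITTLE layout */
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = calling thread */
        perror("sched_setaffinity");
        return 1;
    }
    /* ...run the MIDR/cpuinfo detection here; it now executes on a big core... */
    printf("pinned to cores 4-7\n");
    return 0;
}
```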
Allowing it via something better than an environment variable is good, but that must not replace autodetection. Most users will not have a better list of CPU architectures to feed into the library (I do, but that's the exception).
Yes, absolutely. This is the whole reason the sysfs file was added.
I'm not entirely sure if it is relevant for this discussion, but: …
That is/was supposed to be fixed with #3060 but Debian has probably not updated beyond 0.3.13 yet. |
@martin-frbg Thank you for the pointer. When I try to look at their 0.3.14 version, it seems to be missing that patch, too?! I'll file a bug with Debian. |
That fix is definitely in 0.3.14 (which should be superseded by 0.3.15 "shortly"), and according to the numpy ticket linked from #3060 it solved the problem with Jetson (I do not have Jetson hardware). |
Yes, thank you. It'd be a shame if the upcoming Debian release were missing the fix. I filed it as Debian Bug 986996 on their system.
My fault - fix overwritten by a subsequent PR from a different branch. Restoring... |
I'm glad I didn't just waste your time on something that was completely and obviously fixed. Thank you again for your work on OpenBLAS and your help with this. |
Thanks for alerting me to my blunder. Pity this could not show up in the CI runs.
Ref #2696 (comment) since the issue was closed.
This is because with big.LITTLE and similar configurations, the CPUID result for MIDR_EL1 is unstable: it depends on the core the code is running on. You can easily get the wrong (/unwanted) result if you happen to run on a little core.

The correct way is to read the /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 file.

This should replace https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/driver/others/dynamic_arm64.c#L150 and should also be preferred over the /proc/cpuinfo reader in https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/cpuid_arm64.c#L123
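To make the requested ordering concrete, a hedged sketch (not the eventual OpenBLAS patch): prefer the sysfs midr_el1 file, leave room for a /proc/cpuinfo fallback on older kernels, and only use the emulated mrs read of MIDR_EL1 - which requires HWCAP_CPUID and reflects whichever core the thread happens to be running on - as a last resort.

```c
/* Sketch of the preferred detection order: sysfs midr_el1 first,
 * /proc/cpuinfo second, emulated MRS (per-core, needs HWCAP_CPUID) last.
 * aarch64 Linux only. */
#include <stdio.h>
#include <sys/auxv.h>             /* getauxval, AT_HWCAP */

#ifndef HWCAP_CPUID
#define HWCAP_CPUID (1UL << 11)   /* value from the kernel's asm/hwcap.h */
#endif

static int read_midr_sysfs(int cpu, unsigned long *midr) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/regs/identification/midr_el1", cpu);
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    int ok = (fscanf(f, "%lx", midr) == 1);
    fclose(f);
    return ok;
}

unsigned long get_midr(void) {
    unsigned long midr = 0;
    if (read_midr_sysfs(0, &midr))        /* kernel 4.7+: stable per-core file */
        return midr;
    /* TODO: parse /proc/cpuinfo here for older kernels (see sketch above). */
    if (getauxval(AT_HWCAP) & HWCAP_CPUID) {
        /* Last resort: result depends on the core this thread runs on. */
        __asm__ volatile("mrs %0, MIDR_EL1" : "=r"(midr));
    }
    return midr;
}
```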