{ai}[foss/2022b] PyTorch v2.0.1 #19067
…2.0.1_add-missing-vsx-vector-shift-functions.patch, PyTorch-2.0.1_avoid-test_quantization-failures.patch, PyTorch-2.0.1_disable-test-sharding.patch, PyTorch-2.0.1_fix-numpy-compat.patch, PyTorch-2.0.1_fix-shift-ops.patch, PyTorch-2.0.1_fix-skip-decorators.patch, PyTorch-2.0.1_fix-test_memory_profiler.patch, PyTorch-2.0.1_fix-test-ops-conf.patch, PyTorch-2.0.1_fix-torch.compile-on-ppc.patch, PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch, PyTorch-2.0.1_remove-test-requiring-online-access.patch, PyTorch-2.0.1_skip-diff-test-on-ppc.patch, PyTorch-2.0.1_skip-failing-gradtest.patch, PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch
Test report by @Flamefire |
@boegelbot please test @ jsc-zen2 |
@boegel: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)... - notification for comment with ID 1778706883 processed Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
Test report by @branfosj |
Test report by @SebastianAchilles |
Test report by @boegel |
@boegel @branfosj I see a pattern there (visible thanks to the updated EasyBlock):
Looks like an issue with the |
The The traceback is similar for all of them.
I'm rebuilding now with |
I'd be interested in the output for My current analysis:
Not sure how that can happen though |
In
So, the |
So your system is an AVX512 system? (64 * 8 bit = 512 bit.) If so, there is a bug in the AVX512 vectorized code. I'll see if I can find it. Strange, though, that those tests pass for 2022a. I opened draft PRs for 2.1.0 where I just ported the patches from here and removed those that are no longer required. They are not tested yet, but maybe you can give them a try, just to see whether it compiles and passes those tests at least. |
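The register-width arithmetic in the comment above (64 bytes * 8 bit = 512 bit) can be sketched as follows. This is only an illustration; the helper name `lanes_per_register` is hypothetical and not part of PyTorch or EasyBuild.

```python
def lanes_per_register(register_bytes: int, element_bits: int) -> int:
    """Return how many elements of the given bit width fit in one vector register."""
    register_bits = register_bytes * 8  # 64 bytes -> 512 bits
    if register_bits % element_bits != 0:
        raise ValueError("element width does not evenly divide the register width")
    return register_bits // element_bits


if __name__ == "__main__":
    # A 64-byte register is 512 bits wide, which matches the AVX512 register width.
    print(lanes_per_register(64, 8))   # number of 8-bit lanes
    print(lanes_per_register(64, 32))  # number of 32-bit (float32) lanes
```

A 32-byte register (256 bit, as used by AVX2) would give half as many lanes for the same element type.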
Yes
|
@branfosj or @boegel I'd like to report that to PyTorch as this is likely a bug there. Or in our 2022b toolchain (which I hope not). Does your GCCcore/12.2.0 installation include the If so and you can spare the time I'd like you to try the following (on such an AVX512 node):
I hope this reproduces on the pip-installed 2.0.1 (a sanity check for us) and expect it to also fail on 2.1.0 (a git checkout of the latter is not required, only pip). After this we can blame this on PyTorch and just skip the tests, leaving the bug in the installed module as if it were pip-installed, if that is acceptable. At least I couldn't find anything wrong, so it might be a compiler bug, given that it works for 2022a. |
Test report by @Flamefire |
Test report by @VRehnberg Edit: Full log attached |
Yes.
For 2.0.1 (and 2.1.0 on a Skylake machine; I didn't try on Ice Lake) I'm not seeing any errors for this test after trying on two different machines with the following avx512 flags listed by lscpu (I don't know which of them matter here or what all the differences mean):
|
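For comparing nodes, the avx512 flags can be pulled out of `lscpu` output or `/proc/cpuinfo` with a small script. The following is a minimal sketch, assuming a Linux system whose flags line is labelled `Flags:` (lscpu) or `flags :` (/proc/cpuinfo); the function name `avx512_flags` is hypothetical.

```python
import os
from typing import List


def avx512_flags(cpu_flags_text: str) -> List[str]:
    """Collect the distinct CPU flags starting with 'avx512' from lscpu or /proc/cpuinfo text."""
    found = set()
    for line in cpu_flags_text.splitlines():
        key, sep, value = line.partition(":")
        # lscpu prints "Flags:", /proc/cpuinfo prints "flags\t\t:".
        if sep and key.strip().lower() == "flags":
            found.update(tok for tok in value.split() if tok.startswith("avx512"))
    return sorted(found)


if __name__ == "__main__":
    cpuinfo = "/proc/cpuinfo"
    if os.path.exists(cpuinfo):  # only available on Linux
        with open(cpuinfo) as handle:
            print(" ".join(avx512_flags(handle.read())))
```

Running it on each test node gives one line per machine that can be diffed directly.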
I reviewed the relevant code and the AVX512 intrinsics it uses and haven't found anything wrong with them. Also a Given that this EC and test work with foss/2022a, I'd suspect this to be a bug in GCC 12 -.- Short of disabling AVX512 by patching it out, I don't know what to do... |
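One way to check whether a failure is tied to the AVX512 kernels, without patching anything, might be to cap the CPU capability at runtime. To my knowledge PyTorch's CPU dispatcher reads the `ATEN_CPU_CAPABILITY` environment variable (with values such as `default`, `avx2` and `avx512`); treat that as an assumption and verify it against the installed version. The helper name `env_with_cpu_capability` below is hypothetical.

```python
import os
from typing import Dict, Optional


def env_with_cpu_capability(capability: str,
                            base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with ATEN_CPU_CAPABILITY set.

    Assumption: PyTorch honours ATEN_CPU_CAPABILITY when selecting CPU kernels.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ATEN_CPU_CAPABILITY"] = capability
    return env


if __name__ == "__main__":
    # Show the value that would be handed to a child process.
    print(env_with_cpu_capability("avx2")["ATEN_CPU_CAPABILITY"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching a failing test; if the test passes with `avx2` but fails with `avx512`, that points at the AVX512 code path.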
Test report by @VRehnberg Edit: Full log attached |
Test report by @VRehnberg |
Test report by @VRehnberg |
Test report by @Flamefire |
I am currently trying to do a testbuild but I noticed these modules are also dragged in:
Most of them are build-dependencies so ok I guess but clearly |
I don't understand why we would be shooting ourselves in the foot by loading bzip2. Can you explain what issue you see with that? FWIW: this dependency is pulled in by Python, which requires it for the bz2 package in the Python stdlib according to the comment in that EC. So it looks entirely reasonable to me. |
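Whether a given Python installation was built with bzip2 support can be checked by importing the stdlib `bz2` module and doing a round trip. A minimal sketch (the function name `bz2_roundtrip_ok` is hypothetical):

```python
import bz2  # raises ImportError if Python was built without bzip2 support


def bz2_roundtrip_ok(payload: bytes = b"easybuild" * 100) -> bool:
    """Compress and decompress the payload, returning True if it survives unchanged."""
    return bz2.decompress(bz2.compress(payload)) == payload


if __name__ == "__main__":
    print(bz2_roundtrip_ok())
```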
667d245 to 87d9d70
Test report by @branfosj |
@branfosj Is this the same node as before? Suddenly different tests are failing :-( |
Test report by @Flamefire |
@Flamefire: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/7088418368
bleep, bloop, I'm just a bot (boegelbot v20200716.01) |
daa5968 to f4b48c9
Test report by @Flamefire |
Test report by @Flamefire |
@boegelbot please test @ generoso |
@branfosj: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... - notification for comment with ID 1858766566 processed Message to humans: this is just bookkeeping information for me, |
@boegelbot please test @ jsc-zen2 |
@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)... - notification for comment with ID 1858767086 processed Message to humans: this is just bookkeeping information for me, |
Test report by @Flamefire |
Test report by @branfosj |
Test report by @boegelbot |
@boegelbot please test @ generoso |
@branfosj: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... - notification for comment with ID 1859087345 processed Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
…asyconfigs into 20231024101128_new_pr_PyTorch201
Going in, thanks @Flamefire! |
(created using eb --new-pr)
Requires (including rebuild!)