-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cuTT issue with callback #271
Comments
Hi Victor, I was the one trying to add the option to use Umpire in cuTT. But it never worked out for me, so it was left as an unfinished project. I don't remember what was exactly the issue back then. As you have seen, it is not working properly. In tiledarray, it will always build cuTT without Umpire. https://github.com/ValeevGroup/tiledarray/blob/master/external/cutt.cmake#L68 Regards, |
I compiled the TiledArray fork of cuTT with Umpire support (CUTT_HAS_UMPIRE) for CUDA platform. If I comment out the line cudaStreamAddCallback(stream, cuttDestroy_callback, plan, 0) in cutt.cpp, all cutt_test test successfully pass. So it seems to me that the pair cuTT+Umpire works well in the stand-alone tests. I was just curious if you need to keep that Callback, which causes trouble. |
I think the reason for me to have a cuda callback to delete the plan here https://github.com/ValeevGroup/cutt/blob/master/src/cutt.cpp#L329, is that the original cuTT delete will deallocate GPU memory, which forces a synchronization. so the cuTT tasks are always finished before the deallocation. While with Umpire, this implicit synchronization is gone, so it might deallocate the GPU memory before the cuTT tasks finished. I tried to solve this by adding a cuda callback to deallocate the GPU memory. However, the issue is, cuda callback can only make CPU function calls, and it can not make function calls to deallocate GPU memory. This's where the error message Yes, removing this callback function will work. But it will also lead to memory leak. Since cuTT allocate GPU memory for each plan. |
Hi Chong, Thank you for your explanation! I will proceed with that Callback function disabled in my copy of the cuTT code. I will think about the impact of the memory leak. Thank you, |
Now I remember that I have faced similar situation in TiledArray when deallocating the Tensor with GPU memory in SUMMA. The solution I made was to call the deallocation function from another cpu thread inside the callback. For example, https://github.com/ValeevGroup/tiledarray/blob/master/src/TiledArray/reduce_task.h#L336 Maybe you can try something like this below by updating the function
Sorry that I don't have a GPU device to test this, please try and see if this will fix the exception. |
Hi Chong, This implementation of Callback causes a deadlock at runtime. Looks like the existing synchronization between threads is not working. Thank you, |
Can you try this below. The goal is to do the memory deallocation in the callback from another thread, rather than in the callback directly.
|
It made no difference. This code hangs. |
@victor-anisimov I think per @pchong90 's suggestion you should turn off cuTT's use of Umpire ( p.s. where does the deadlock occur when you use @pchong90's workaround? |
@victor-anisimov #277 will partially fix this, many test are already passing (including permutation tests, e.g. |
@victor-anisimov #279 should fix most of the problems |
@victor-anisimov master contains all fixes now, all tests (incl. CUDA) pass locally. Please try and report. |
Thanks for restoring CUDA unit tests, @evaleev and @asadchev. I downloaded May 31 commit 8c471b9 of TiledArray from the master brunch. No hangs, but I get numerically incorrect results for um_expressions_suite. Here is a sample error message: `/home/vanisimov/tiledarray/tiledarray/tests/expressions_cuda_um.cpp(720): error: in "um_expressions_suite/assign_subblock_permute_block": check result_tile[j] == 2 * (3 * a_tile[j] + 4 * b_tile[j]) has failed [4.6449680328369141 != 6.8127832412719727] *** 2233 failures are detected in the test module "TiledArray Tests"` The TA code is compiled with CUDA/11.2 and GNU/8.2. I get the same errors as above if I use icpx (Clang) compiler from the latest release of oneAPI with CUDA/11.2. The single-rank and multiple-rank tests identically fail. No improvement if I run ta_test manually without mpirun. Also tried compiling TA for CUDA architecture 60 or 70 and that did not change anything. The nodes I run the tests are nVidia V100. |
@victor-anisimov For this build, did you use the default cuTT build by TiledArray or cuTT you built with umpire support? The failing test has a permute operation. I suspect that it might related to cuTT. |
@pchong90 I compiled TiledArray with its own version of Umpire and cuTT, and did not change anything in the way TiledArray compiles those libraries. |
@victor-anisimov did only block permutation tests break? Did any block expression tests succeed? |
gnu.8.2.manual.test.log.gz |
The test group "um_expressions_suite" still produces numerical errors for me when using the latest TA. All cutt unit tests pass. |
finally resolved by #421 |
I do not see how to open an issue against the TiledArray fork of cuTT. With that, I'm describing the problem here.
I extended the integration with Umpire library, which is present in the TiledArray fork of cuTT library, to cuTT tests. The purpose is to make the stand-alone cuTT tests running in the cuTT+Umpire library, so that the integration is complete. That required adding cuttInitialize() and cuttFinalize() calls into main() in cutt_test.cpp. Both Umpire and cuTT have been compiled on V100 node on JLSE. Now, if I start bin/cutt_test, I get Umpire error right after the first micro-test:
terminate called after throwing an instance of 'umpire::util::Exception' what(): ! Umpire Exception [/home/vanisimov/Umpire/umpire-TA/src/umpire/alloc/CudaMallocAllocator.hpp:52]: deallocate cudaFree( ptr = 0x7fb3e7601000 ) failed with error: operation not permitted
I tracked the crash to cudaStreamAddCallback(stream, cuttDestroy_callback, plan, 0) in cutt.cpp (line 329). For some reason, the line "delete plan" causes Umpire exception. Any idea why it happens? Do we actually need the line "delete plan", or perhaps it should be done differently? Any suggestions?
The text was updated successfully, but these errors were encountered: