
cuTT issue with callback #271

Closed
victor-anisimov opened this issue May 4, 2021 · 19 comments

@victor-anisimov
Collaborator

I do not see how to open an issue against the TiledArray fork of cuTT, so I'm describing the problem here.

I extended the Umpire integration that is present in the TiledArray fork of the cuTT library to the cuTT tests. The purpose is to make the stand-alone cuTT tests run against cuTT+Umpire, so that the integration is complete. That required adding cuttInitialize() and cuttFinalize() calls to main() in cutt_test.cpp. Both Umpire and cuTT were compiled on a V100 node on JLSE. Now, if I start bin/cutt_test, I get an Umpire error right after the first micro-test:

terminate called after throwing an instance of 'umpire::util::Exception' what(): ! Umpire Exception [/home/vanisimov/Umpire/umpire-TA/src/umpire/alloc/CudaMallocAllocator.hpp:52]: deallocate cudaFree( ptr = 0x7fb3e7601000 ) failed with error: operation not permitted

I tracked the crash to cudaStreamAddCallback(stream, cuttDestroy_callback, plan, 0) in cutt.cpp (line 329). For some reason, the line "delete plan" causes the Umpire exception. Any idea why that happens? Do we actually need the "delete plan" line, or should this be done differently? Any suggestions?

@pchong90
Member

pchong90 commented May 4, 2021

Hi Victor,

I was the one who tried to add the option to use Umpire in cuTT. It never worked out for me, so it was left as an unfinished project. I don't remember exactly what the issue was back then.

As you have seen, it is not working properly.

TiledArray always builds cuTT without Umpire: https://github.com/ValeevGroup/tiledarray/blob/master/external/cutt.cmake#L68

Regards,
Chong

@victor-anisimov
Collaborator Author

I compiled the TiledArray fork of cuTT with Umpire support (CUTT_HAS_UMPIRE) for the CUDA platform. If I comment out the line cudaStreamAddCallback(stream, cuttDestroy_callback, plan, 0) in cutt.cpp, all cutt_test tests pass. So it seems to me that the cuTT+Umpire pair works well in the stand-alone tests. I was just curious whether you need to keep that callback, which causes the trouble.

@pchong90
Member

pchong90 commented May 4, 2021

I think the reason I added a CUDA callback to delete the plan here https://github.com/ValeevGroup/cutt/blob/master/src/cutt.cpp#L329 is that the original cuTT delete deallocates GPU memory, which forces a synchronization, so the cuTT tasks always finish before the deallocation.

With Umpire, this implicit synchronization is gone, so the GPU memory might be deallocated before the cuTT tasks have finished.

I tried to solve this by adding a CUDA callback to deallocate the GPU memory. However, the issue is that a CUDA callback may only make CPU function calls; it cannot call CUDA API functions to deallocate GPU memory. That is where the error message deallocate cudaFree( ptr = 0x7fb3e7601000 ) failed with error: operation not permitted comes from.

Yes, removing this callback function will work, but it will also leak memory, since cuTT allocates GPU memory for each plan.

@victor-anisimov
Collaborator Author

Hi Chong,

Thank you for your explanation! I will proceed with that Callback function disabled in my copy of the cuTT code. I will think about the impact of the memory leak.

Thank you,
Victor

@pchong90
Member

pchong90 commented May 5, 2021

Now I remember that I faced a similar situation in TiledArray when deallocating a Tensor with GPU memory in SUMMA. The solution I made was to call the deallocation function from another CPU thread inside the callback.

For example, https://github.com/ValeevGroup/tiledarray/blob/master/src/TiledArray/reduce_task.h#L336

Maybe you can try something like the code below, updating the function cuttDestroy_callback in cutt.cpp:

// Requires #include <future>.
void CUDART_CB cuttDestroy_callback(cudaStream_t stream, cudaError_t status, void *userData){
  cuttPlan_t* plan = (cuttPlan_t*) userData;
  auto delete_plan = [plan] () {
    delete plan;
  };
  // Note: the temporary std::future returned by std::async blocks in its
  // destructor, so this statement still waits for delete_plan to finish.
  std::async(std::launch::async, delete_plan);
}

Sorry, I don't have a GPU device to test this; please try it and see if it fixes the exception.

@victor-anisimov
Collaborator Author

Hi Chong,

This implementation of Callback causes a deadlock at runtime. Looks like the existing synchronization between threads is not working.

Thank you,
Victor

@pchong90
Member

pchong90 commented May 5, 2021

Can you try the version below? The goal is to do the memory deallocation from another thread inside the callback, rather than in the callback thread directly.

// Requires #include <future>.
void CUDART_CB cuttDestroy_callback(cudaStream_t stream, cudaError_t status, void *userData){
  cuttPlan_t* plan = (cuttPlan_t*) userData;
  auto delete_plan = [plan] () {
    delete plan;
  };
  // std::launch::async forces delete_plan onto a separate thread; the
  // default policy could run it deferred on this (callback) thread.
  auto result = std::async(std::launch::async, delete_plan);
  result.get();  // blocks this callback until the deletion completes
}

@victor-anisimov
Collaborator Author

It made no difference. This code hangs.

@evaleev
Member

evaleev commented May 6, 2021

@victor-anisimov I think per @pchong90 's suggestion you should turn off cuTT's use of Umpire (CUTT_HAS_UMPIRE=OFF).

p.s. where does the deadlock occur when you use @pchong90's workaround?
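For reference, a sketch of such a configure step, assuming CUTT_HAS_UMPIRE (the flag name used in this thread) is exposed as a CMake option; the source and build paths are placeholders:

```shell
# Configure the TiledArray fork of cuTT without Umpire support
# (paths are placeholders; adjust to your checkout and toolchain).
cmake -S cutt -B cutt-build -DCUTT_HAS_UMPIRE=OFF
cmake --build cutt-build
```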

@evaleev evaleev linked a pull request May 25, 2021 that will close this issue
@evaleev
Member

evaleev commented May 25, 2021

@victor-anisimov #277 will partially fix this; many tests are already passing (including permutation tests, e.g. um_expressions_suite//permute), but not all.

@evaleev evaleev removed a link to a pull request May 26, 2021
@evaleev
Member

evaleev commented May 28, 2021

@victor-anisimov #279 should fix most of the problems

@evaleev
Member

evaleev commented May 29, 2021

@victor-anisimov master contains all fixes now, all tests (incl. CUDA) pass locally. Please try and report.

@victor-anisimov
Collaborator Author

Thanks for restoring the CUDA unit tests, @evaleev and @asadchev. I downloaded the May 31 commit 8c471b9 of TiledArray from the master branch. No hangs, but I get numerically incorrect results for um_expressions_suite. Here is a sample error message:

`/home/vanisimov/tiledarray/tiledarray/tests/expressions_cuda_um.cpp(720): error: in "um_expressions_suite/assign_subblock_permute_block": check result_tile[j] == 2 * (3 * a_tile[j] + 4 * b_tile[j]) has failed [4.6449680328369141 != 6.8127832412719727]

*** 2233 failures are detected in the test module "TiledArray Tests"`

The TA code is compiled with CUDA 11.2 and GNU 8.2. I get the same errors as above if I use the icpx (Clang) compiler from the latest oneAPI release with CUDA 11.2. The single-rank and multi-rank tests fail identically. There is no improvement if I run ta_test manually without mpirun. I also tried compiling TA for CUDA architecture 60 or 70, which did not change anything. The nodes I run the tests on are NVIDIA V100 nodes.

@pchong90
Member

pchong90 commented Jun 2, 2021

@victor-anisimov For this build, did you use the default cuTT built by TiledArray, or the cuTT you built with Umpire support? The failing test has a permute operation, so I suspect it might be related to cuTT.

@victor-anisimov
Collaborator Author

@pchong90 I compiled TiledArray with its own version of Umpire and cuTT, and did not change anything in the way TiledArray compiles those libraries.

@evaleev
Member

evaleev commented Jun 2, 2021

@victor-anisimov did only block permutation tests break? Did any block expression tests succeed?

@victor-anisimov
Collaborator Author

gnu.8.2.manual.test.log.gz
gnu.8.2.tests.log
@evaleev I'm not sure what the block expression tests are called in the test suite, so I'm attaching two log files. The smaller one (gnu.8.2.tests.log) is the stdout of "make check". The gzipped one (gnu.8.2.manual.test.log.gz) is the stdout of "./ta_test --log_level=all", executed manually.

@victor-anisimov
Collaborator Author

The test group "um_expressions_suite" still produces numerical errors for me when using the latest TA. All cuTT unit tests pass.

@evaleev
Member

evaleev commented Sep 29, 2023

finally resolved by #421

@evaleev evaleev closed this as completed Sep 29, 2023