
cuTT issue with callback #271

Closed
victor-anisimov opened this issue May 4, 2021 · 19 comments

@victor-anisimov
Collaborator

I do not see how to open an issue against the TiledArray fork of cuTT, so I'm describing the problem here.

I extended the Umpire integration that is present in the TiledArray fork of the cuTT library to the cuTT tests. The purpose is to make the stand-alone cuTT tests run against cuTT+Umpire, so that the integration is complete. That required adding cuttInitialize() and cuttFinalize() calls to main() in cutt_test.cpp. Both Umpire and cuTT were compiled on a V100 node on JLSE. Now, if I start bin/cutt_test, I get an Umpire error right after the first micro-test:

terminate called after throwing an instance of 'umpire::util::Exception' what(): ! Umpire Exception [/home/vanisimov/Umpire/umpire-TA/src/umpire/alloc/CudaMallocAllocator.hpp:52]: deallocate cudaFree( ptr = 0x7fb3e7601000 ) failed with error: operation not permitted

I tracked the crash to cudaStreamAddCallback(stream, cuttDestroy_callback, plan, 0) in cutt.cpp (line 329). For some reason, the line "delete plan" causes the Umpire exception. Any idea why that happens? Do we actually need the "delete plan" line, or should this be done differently? Any suggestions?

@pchong90
Member

pchong90 commented May 4, 2021

Hi Victor,

I was the one who tried to add the option to use Umpire in cuTT. It never worked out for me, so it was left as an unfinished project. I don't remember exactly what the issue was back then.

As you have seen, it is not working properly.

TiledArray always builds cuTT without Umpire: https://github.com/ValeevGroup/tiledarray/blob/master/external/cutt.cmake#L68

Regards,
Chong

@victor-anisimov
Collaborator Author

I compiled the TiledArray fork of cuTT with Umpire support (CUTT_HAS_UMPIRE) for the CUDA platform. If I comment out the line cudaStreamAddCallback(stream, cuttDestroy_callback, plan, 0) in cutt.cpp, all cutt_test tests pass. So it seems to me that the cuTT+Umpire pair works well in the stand-alone tests. I was just curious whether you need to keep that callback, which causes the trouble.

@pchong90
Member

pchong90 commented May 4, 2021

I think the reason I added a CUDA callback to delete the plan here https://github.com/ValeevGroup/cutt/blob/master/src/cutt.cpp#L329 is that the original cuTT delete deallocates GPU memory, which forces a synchronization, so the cuTT tasks always finish before the deallocation.

With Umpire, this implicit synchronization is gone, so the GPU memory might be deallocated before the cuTT tasks have finished.

I tried to solve this by adding a CUDA callback to deallocate the GPU memory. However, the issue is that a CUDA callback may only make CPU function calls; it cannot call CUDA API functions to deallocate GPU memory. That is where the error message deallocate cudaFree( ptr = 0x7fb3e7601000 ) failed with error: operation not permitted comes from.

Yes, removing this callback function will work, but it will also leak memory, since cuTT allocates GPU memory for each plan.

@victor-anisimov
Collaborator Author

Hi Chong,

Thank you for your explanation! I will proceed with that Callback function disabled in my copy of the cuTT code. I will think about the impact of the memory leak.

Thank you,
Victor

@pchong90
Member

pchong90 commented May 5, 2021

Now I remember that I faced a similar situation in TiledArray when deallocating a Tensor with GPU memory in SUMMA. The solution I made was to call the deallocation function from another CPU thread inside the callback.

For example, https://github.com/ValeevGroup/tiledarray/blob/master/src/TiledArray/reduce_task.h#L336

Maybe you can try something like the code below, updating the function cuttDestroy_callback in cutt.cpp:

// Requires #include <future>.
void CUDART_CB cuttDestroy_callback(cudaStream_t stream, cudaError_t status, void *userData){
  cuttPlan_t* plan = (cuttPlan_t*) userData;
  auto delete_plan = [plan] () {
    delete plan;
  };
  // Note: the temporary std::future returned by std::async blocks in its
  // destructor, so this statement still waits for delete_plan to finish.
  std::async(std::launch::async, delete_plan);
}

Sorry, I don't have a GPU device to test this; please try it and see if it fixes the exception.

@victor-anisimov
Collaborator Author

Hi Chong,

This implementation of Callback causes a deadlock at runtime. Looks like the existing synchronization between threads is not working.

Thank you,
Victor

@pchong90
Member

pchong90 commented May 5, 2021

Can you try the version below? The goal is to do the memory deallocation from another thread inside the callback, rather than in the callback thread directly.

// Requires #include <future>.
void CUDART_CB cuttDestroy_callback(cudaStream_t stream, cudaError_t status, void *userData){
  cuttPlan_t* plan = (cuttPlan_t*) userData;
  auto delete_plan = [plan] () {
    delete plan;
  };
  // std::launch::async forces delete_plan onto a separate thread; the
  // default policy could run it deferred on this (callback) thread.
  auto result = std::async(std::launch::async, delete_plan);
  result.get();  // blocks this callback until the deletion completes
}

@victor-anisimov
Collaborator Author

It made no difference. This code hangs.

@evaleev
Member

evaleev commented May 6, 2021

@victor-anisimov I think per @pchong90 's suggestion you should turn off cuTT's use of Umpire (CUTT_HAS_UMPIRE=OFF).

p.s. where does the deadlock occur when you use @pchong90's workaround?
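For reference, a sketch of such a configure step, assuming CUTT_HAS_UMPIRE (the flag name used in this thread) is exposed as a CMake option; the source and build paths are placeholders:

```shell
# Configure the TiledArray fork of cuTT without Umpire support
# (paths are placeholders; adjust to your checkout and toolchain).
cmake -S cutt -B cutt-build -DCUTT_HAS_UMPIRE=OFF
cmake --build cutt-build
```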

@evaleev evaleev linked a pull request May 25, 2021 that will close this issue
@evaleev
Member

evaleev commented May 25, 2021

@victor-anisimov #277 will partially fix this; many tests are already passing (including permutation tests, e.g. um_expressions_suite//permute), but not all.

@evaleev evaleev removed a link to a pull request May 26, 2021
@evaleev
Member

evaleev commented May 28, 2021

@victor-anisimov #279 should fix most of the problems

@evaleev
Member

evaleev commented May 29, 2021

@victor-anisimov master contains all fixes now, all tests (incl. CUDA) pass locally. Please try and report.

@victor-anisimov
Collaborator Author

Thanks for restoring the CUDA unit tests, @evaleev and @asadchev. I downloaded the May 31 commit 8c471b9 of TiledArray from the master branch. No hangs, but I get numerically incorrect results for um_expressions_suite. Here is a sample error message:

`/home/vanisimov/tiledarray/tiledarray/tests/expressions_cuda_um.cpp(720): error: in "um_expressions_suite/assign_subblock_permute_block": check result_tile[j] == 2 * (3 * a_tile[j] + 4 * b_tile[j]) has failed [4.6449680328369141 != 6.8127832412719727]

*** 2233 failures are detected in the test module "TiledArray Tests"`

The TA code is compiled with CUDA 11.2 and GNU 8.2. I get the same errors as above if I use the icpx (Clang) compiler from the latest oneAPI release with CUDA 11.2. The single-rank and multi-rank tests fail identically. There is no improvement if I run ta_test manually without mpirun. I also tried compiling TA for CUDA architecture 60 or 70, which did not change anything. The nodes I run the tests on are NVIDIA V100 nodes.

@pchong90
Member

pchong90 commented Jun 2, 2021

@victor-anisimov For this build, did you use the default cuTT built by TiledArray, or the cuTT you built with Umpire support? The failing test has a permute operation, so I suspect it might be related to cuTT.

@victor-anisimov
Collaborator Author

@pchong90 I compiled TiledArray with its own version of Umpire and cuTT, and did not change anything in the way TiledArray compiles those libraries.

@evaleev
Member

evaleev commented Jun 2, 2021

@victor-anisimov did only block permutation tests break? Did any block expression tests succeed?

@victor-anisimov
Collaborator Author

gnu.8.2.manual.test.log.gz
gnu.8.2.tests.log
@evaleev I'm not sure what the block expression tests are called in the test suite, so I'm attaching two log files. The smaller one (gnu.8.2.tests.log) is the stdout of "make check". The gzipped one (gnu.8.2.manual.test.log.gz) is the stdout of "./ta_test --log_level=all", executed manually.

@victor-anisimov
Collaborator Author

The test group "um_expressions_suite" still produces numerical errors for me when using the latest TA. All cuTT unit tests pass.

@evaleev
Member

evaleev commented Sep 29, 2023

finally resolved by #421

@evaleev evaleev closed this as completed Sep 29, 2023