Prevent race condition in accelerator copies management #575

bosilca · 2023-09-30T19:37:51Z

Prevent the owner device from repurposing a GPU copy while another GPU is planning to use it as a candidate for a d2d transfer.

devreal · 2023-10-01T01:27:27Z

parsec/mca/device/cuda/device_cuda_module.c

+                /* We are trying to use a candidate that is repurposed by the owner device. Let's find another one */
+                parsec_atomic_fetch_add_int32(&candidate->readers, -1);
+                continue;
+            }


After the increment we also need to check that the candidate has not been detached?

Not here, this is the correction path where we put back the readers we tried to increase. The thread doing this code will not use this candidate, so there is no need for further checks.

That's the path if readers < 0 (the case where we don't use the copy). But what about the alternative path (anything below this block)? We could encounter a copy with readers == 1 that is detached from the data we're looking for, no?

I don't think this scenario is possible anymore. Either the copy is not repurposed in which case the readers is > 0, or is currently repurposed by the owner in which case the atomicity guarantees that either readers is positive and the current thread can use it, is negative and the owner will get it from there.

However, as my comment in another of the comments states, if the current thread is significantly delayed at considering the copy, the owner could have completed the transfer and the copy is now associated with a different data_t. Without a lock I don't know how to avoid that.

I think it is sufficient to check for the data after the increment (properly sequenced with a read memory barrier). There can be three cases:

readers >= 0 and the right data: good, can use

readers < 0: bad, cannot use (another thread is in the process of detaching the copy)

readers >= 0 and a different data: bad, cannot use (another thread has detached the copy and reset readers to 0)

Number 3 would be covered by checking for the data_t after the increment if the readers looked right, i.e., after this block. That is also why the detach needs to be properly sequenced with a write memory barrier (so that the detach cannot occur before the atomic operation on readers).

The readers field serves as a poor-man's rwlock: if negative we have a writer (the thread detaching); if positive we have, well, readers. No need for an extra lock, just have to handle the case where readers has been reset already after we checked the data_t the first time.

devreal · 2023-10-01T01:28:52Z

parsec/mca/device/cuda/device_cuda_module.c

        gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID;
-        gpu_elem->version = 0;
+        gpu_elem->version = UINT_MAX;  /* scrap value for now */
+        gpu_elem->readers = 0;         /* safe to set to no access */


I'm not sure this is safe: there might be someone looking at the copy seeing negative readers and decrementing by one. I think this should be an atomic increment by INT_MAX so that eventually readers reaches 0 in all cases.

Or an atomic increment by -INT_MIN ? (INT_MIN is probably not equal to INT_MAX)

At this point all the protection that could be offered was offered by the atomic update protocols. Now, I know this code remains unsafe for extreme cases, but the only way to fix it is to add a lock. Adding yet another atomic operation will not help.

If another thread is trying to use this copy as a candidate for d2d, and that thread is delayed, it might see the readers set to 0 and assume everything is fine and then move ahead and use the copy. Hopefully that will not happen because the version will be incorrect and the coherency is invalid.

This assignment conflicts with the roll back a potential reader would do here: https://github.com/ICLDisco/parsec/pull/575/files#diff-a42087f8b140e7fc55fd74480d49a4334d21ff2810f51ca57e0d1955534286e8R1415

The sequence would be: Thread A sets readers to INT_MIN; Thread B increments readers and finds it negative; Thread A detaches the copy and sets readers to 0; Thread B does the rollback by atomically decrementing readers. The result would be readers == -1. I don't think that should be valid.

Instead of assigning 0 we should atomically add -INT_MIN which would guarantee readers >= 0.

devreal · 2023-10-01T01:30:46Z

parsec/mca/device/cuda/device_cuda_module.c

                    parsec_atomic_unlock( &oldmaster->lock );
                    goto find_another_data;
                }
+                /* Check if this copy is the last dangling reference to the oldmaster. This is safe to do as we own one of the data refcounts. */
                int do_unlock = oldmaster->super.obj_reference_count != 1;
                parsec_data_copy_detach(oldmaster, lru_gpu_elem, gpu_device->super.device_index);


This needs to be guaranteed sequenced after the CAS so we need a write memory barrier before the detach.

Also, parsec_data_copy_detach() doesn't update data->owner_device, so this detach can lead to a situation where the data->device_copies[data->owner_device] is NULL. Is that a behavior we want to support?

No, there is no need for write memory barrier. There is a single owner of the copy, the thread that will do the CAS and the detach, and the same that will eventually free the copy. Every other threads is limited to updates of the reader field, if they want to manipulate the copy more, they do it via a message to the owner thread.

I don't see why changing the owner_device is needed here, we only change the data_t this copy is associated with, the owner_device will remain the same. The detach will clear the origin so data->device_copies[data->owner_device] will segfault because data will be NULL.

In parsec_data_copy_detach we do data->device_copies[device] = copy->older; (where data is copy->original before we set copy->original to NULL. If copy->older is NULL, then that data_t has a device_copies[owner_device] set to NULL?

That is a bigger problem overall.

NULL == copy->older we have the owner_device pointing to a NULL data. But, we don't know what is legitimately the good version of the remaining copies, so in the generic code we are unable to redirect the owner_device to the correct version.

NULL != copy->older we are setting the owner_device to point to an older version of the data that might exists on the device. Not clear why that version would be the correct one.

therault · 2023-10-02T08:42:00Z

parsec/mca/device/cuda/device_cuda_module.c

                    parsec_atomic_unlock( &oldmaster->lock );
                    goto find_another_data;
                }
+                /* Check if this copy is the last dangling reference to the oldmaster. This is safe to do as we own one of the data refcounts. */
                int do_unlock = oldmaster->super.obj_reference_count != 1;
                parsec_data_copy_detach(oldmaster, lru_gpu_elem, gpu_device->super.device_index);


Also, parsec_data_copy_detach() doesn't update data->owner_device, so this detach can lead to a situation where the data->device_copies[data->owner_device] is NULL. Is that a behavior we want to support?

therault · 2023-10-02T08:45:11Z

parsec/mca/device/cuda/device_cuda_module.c

        gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID;
-        gpu_elem->version = 0;
+        gpu_elem->version = UINT_MAX;  /* scrap value for now */
+        gpu_elem->readers = 0;         /* safe to set to no access */


Or an atomic increment by -INT_MIN ? (INT_MIN is probably not equal to INT_MAX)

therault · 2023-10-02T08:49:46Z

parsec/mca/device/cuda/device_cuda_module.c

@@ -1295,7 +1306,7 @@ parsec_gpu_data_stage_in( parsec_device_cuda_module_t* cuda_device,
    transfer_from = parsec_data_start_transfer_ownership_to_copy(original, gpu_device->super.device_index, (uint8_t)type);

    if( PARSEC_FLOW_ACCESS_WRITE & type && gpu_task->task_type != PARSEC_GPU_TASK_TYPE_PREFETCH ) {
-        gpu_elem->version++;  /* on to the next version */
+        gpu_elem->version = candidate->version + 1;  /* on to the next version */


I agree we need to do this here, but I think we also need to fix at line 1487 (before the patch) / 1506 (after the patch): we do gpu_elem->version = candidate->version; when the transfer (if a transfer is needed) is done, and the version of candidate is one less than what we want, and this will overwrite the effect of this action here.

This doesn't typically fail with GEMM or POTRF as benchmarks / tests because they only send the result back to the host memory once, and then we discard / release the GPU copy, so the fact that the version is wrong doesn't trigger a problem, but if you remove the release_memory from the test this should create a problem.

In the original code, aka. long ago, there was no need to play around with this version until completion of the transfer, because nobody could have used that copy meanwhile. As the code stands today, they can, so the version of the device copy must be carefully handled. At the time where this code executes, the device copy is not yet usable, it will become usable only when the transfer completes, but it needs to be able to be found by others with the correct version.

I removed the release_memory on the tester, and even with checks enabled it still succeed. We might not really understand that code as good as we think we are.

My problem is I don't understand (neither in the code before the PR or inside the PR) what is supposed to increase the version of the GPU copy to what I believe it should be.

I did a run (or the PR, at commit fc82786) with M=1 tile, N=1 tile, K=2 tiles, a single GPU under gdb.

I break on parsec_cuda_kernel_scheduler to see all kernels passed to the GPU, and on parsec_gpu_data_reserve_device_space. When I'm in kernel scheduler, I print the task name. When I'm in reserve_device_space, I go until the GPU copy for C is allocated (line 927), then I install a watchpoint on any change of this copy's version.

Here is what I see:

we start GEMM(0, 0, 0), we allocate the GPU copy for C and set the version to 0 (inside the allocation)

version goes to UINT_MAX at line 1149 according to the patch

it is then set to 2 at line 1311, because it's a write (C) and we take candidate (CPU copy)'s version + 1. transfer_from is 0, everything is logic

next step, it is reset to 1 at line 1512 after we have started the copy. You seem to expect that, OK, why not.

then it remains unmodified until we start scheduling GEMM(0, 0, 1) on the GPU. But definitely here, its version should be 2. Shouldn't it? The CPU version is 2 (has been updated by the DSL at the completion of the task), so it appears like if the GPU copy is too old (which it isn't).

then we reach the call to parsec_data_start_transfer_ownership_to_copy at line 1308:

original->owner_device is set to 2, and the GPU copy is in the state SHARED.

there is no other copy with the state OWNED, and all other copies are in the state SHARED, so we never check if another copy has a version higher than the GPU version.

that explains why we don't consider a new transfer is necessary, and we return -1

then we set the GPU copy version to 3 directly, by line 1311 (candidate->version+1)

we don't schedule the transfer, so we don't reset the version to candidate->version

It works, but I'm surprised that this is the expected behavior. I would have expected the GPU copy to get the version 2 when the transfer is complete, and the transfer_ownership to return -1 because the GPU copy has the highest version. At the time transfer_ownership returns -1, it looks to me it returns -1 for the wrong reason.

This is not what I expect to see here. But, that is another problem for another PR, here we are just fixing the race condition between accelerator copies management.

therault · 2023-10-02T18:25:46Z

I get some asserts triggering with dplasma using this branch for parsec.

dplasma: master (749c912911ed970ef292d7b1d49e6312dd156485)
parsec: latest commit of this branch (4a4fad0)

Compile mode: CMAKE_BUILD_TYPE=Debug, PARSEC_DEBUG_NOISIER=ON and PARSEC_DEBUG_PARANOID=ON

Machine: xsdk (with CUDA enabled)

test: ./tests/testing_dgemm -g 1 -N 1024 -K 1024 -M 1024 -t 1024 -x 0 --nruns 1

Symptom:
testing_dgemm: /home/herault/dplasma/parsec/parsec/data.c:418: parsec_data_start_transfer_ownership_to_copy: Assertion `(!transfer_required) || (data->device_copies[valid_copy]->version >= copy->version)' failed.

Context: trying to stage_in flow 0 of GEMM(0, 0, 0): Flows 0, 1 and 2 of GEMM(0, 0, 0) have been allocated on the device in parsec_gpu_data_reserve_device_space and the version of the new copies set to UINT_MAX at line 1147. Then the code reaches line 1306 in stage_in, calls parsec_data_start_transfer_ownership_to_copy and the assert fails, because the device's copy has version UINT_MAX, which is never smaller than the CPU copy version. Or at least, that's what I understand (and yes, I reckon I don't understand the GPU code... :) ).

therault · 2023-10-02T21:35:39Z

I get some asserts triggering with dplasma using this branch for parsec.

dplasma: master (749c912911ed970ef292d7b1d49e6312dd156485) parsec: latest commit of this branch (4a4fad0)

Compile mode: CMAKE_BUILD_TYPE=Debug, PARSEC_DEBUG_NOISIER=ON and PARSEC_DEBUG_PARANOID=ON

Machine: xsdk (with CUDA enabled)

test: ./tests/testing_dgemm -g 1 -N 1024 -K 1024 -M 1024 -t 1024 -x 0 --nruns 1

Symptom: testing_dgemm: /home/herault/dplasma/parsec/parsec/data.c:418: parsec_data_start_transfer_ownership_to_copy: Assertion `(!transfer_required) || (data->device_copies[valid_copy]->version >= copy->version)' failed.

Context: trying to stage_in flow 0 of GEMM(0, 0, 0): Flows 0, 1 and 2 of GEMM(0, 0, 0) have been allocated on the device in parsec_gpu_data_reserve_device_space and the version of the new copies set to UINT_MAX at line 1147. Then the code reaches line 1306 in stage_in, calls parsec_data_start_transfer_ownership_to_copy and the assert fails, because the device's copy has version UINT_MAX, which is never smaller than the CPU copy version. Or at least, that's what I understand (and yes, I reckon I don't understand the GPU code... :) ).

this is solved

parsec/mca/device/cuda/device_cuda_module.c

devreal · 2023-10-03T19:04:24Z

parsec/mca/device/cuda/device_cuda_module.c

        gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID;
-        gpu_elem->version = 0;
+        gpu_elem->version = UINT_MAX;  /* scrap value for now */
+        gpu_elem->readers = 0;         /* safe to set to no access */


This assignment conflicts with the roll back a potential reader would do here: https://github.com/ICLDisco/parsec/pull/575/files#diff-a42087f8b140e7fc55fd74480d49a4334d21ff2810f51ca57e0d1955534286e8R1415

The sequence would be: Thread A sets readers to INT_MIN; Thread B increments readers and finds it negative; Thread A detaches the copy and sets readers to 0; Thread B does the rollback by atomically decrementing readers. The result would be readers == -1. I don't think that should be valid.

Instead of assigning 0 we should atomically add -INT_MIN which would guarantee readers >= 0.

Signed-off-by: George Bosilca <[email protected]>

Prevent the owner device from repurposing a GPU copy while another GPU is planning to use it as a candidate for a d2d transfer. Signed-off-by: George Bosilca <[email protected]>

therault

LGTM

bosilca requested a review from a team as a code owner September 30, 2023 19:37

devreal requested changes Oct 1, 2023

View reviewed changes

therault reviewed Oct 2, 2023

View reviewed changes

bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch 3 times, most recently from f47b917 to fc82786 Compare October 2, 2023 20:43

devreal requested changes Oct 3, 2023

View reviewed changes

bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch from fc82786 to a9a806e Compare October 5, 2023 20:01

bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch 6 times, most recently from d30f416 to 609bbf2 Compare October 14, 2023 16:49

bosilca added 2 commits October 17, 2023 19:18

Fix the device statistics.

ac4799d

Signed-off-by: George Bosilca <[email protected]>

Prevent race condition in accelerator copies management

f69460c

Prevent the owner device from repurposing a GPU copy while another GPU is planning to use it as a candidate for a d2d transfer. Signed-off-by: George Bosilca <[email protected]>

bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch from 609bbf2 to f69460c Compare October 18, 2023 20:55

therault approved these changes Oct 19, 2023

View reviewed changes

bosilca merged commit 7eba1f4 into ICLDisco:master Oct 19, 2023

devreal mentioned this pull request Nov 29, 2023

Support for oversubscription broken #601

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent race condition in accelerator copies management #575

Prevent race condition in accelerator copies management #575

bosilca commented Sep 30, 2023

devreal Oct 1, 2023

bosilca Oct 2, 2023

devreal Oct 2, 2023

bosilca Oct 2, 2023

devreal Oct 2, 2023

devreal Oct 2, 2023

devreal Oct 1, 2023

therault Oct 2, 2023

bosilca Oct 2, 2023

devreal Oct 3, 2023

devreal Oct 1, 2023

therault Oct 2, 2023

bosilca Oct 2, 2023

therault Oct 2, 2023

bosilca Oct 2, 2023

therault Oct 2, 2023

therault Oct 2, 2023

therault Oct 2, 2023

bosilca Oct 2, 2023

therault Oct 2, 2023

bosilca Oct 2, 2023

therault commented Oct 2, 2023

therault commented Oct 2, 2023

devreal Oct 3, 2023

therault left a comment

Prevent race condition in accelerator copies management #575

Prevent race condition in accelerator copies management #575

Conversation

bosilca commented Sep 30, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

therault commented Oct 2, 2023

therault commented Oct 2, 2023

Choose a reason for hiding this comment

therault left a comment

Choose a reason for hiding this comment