Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent race condition in accelerator copies management #575

Conversation

bosilca
Copy link
Contributor

@bosilca bosilca commented Sep 30, 2023

Prevent the owner device from repurposing a GPU copy while another GPU is planning to use it as a candidate for a d2d transfer.

@bosilca bosilca requested a review from a team as a code owner September 30, 2023 19:37
/* We are trying to use a candidate that is repurposed by the owner device. Let's find another one */
parsec_atomic_fetch_add_int32(&candidate->readers, -1);
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the increment we also need to check that the candidate has not been detached?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not here, this is the correction path where we put back the readers we tried to increase. The thread doing this code will not use this candidate, so there is no need for further checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the path if readers < 0 (the case where we don't use the copy). But what about the alternative path (anything below this block)? We could encounter a copy with readers == 1 that is detached from the data we're looking for, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this scenario is possible anymore. Either the copy is not repurposed in which case the readers is > 0, or is currently repurposed by the owner in which case the atomicity guarantees that either readers is positive and the current thread can use it, is negative and the owner will get it from there.

However, as my comment in another of the comments states, if the current thread is significantly delayed at considering the copy, the owner could have completed the transfer and the copy is now associated with a different data_t. Without a lock I don't know how to avoid that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is sufficient to check for the data after the increment (properly sequenced with a read memory barrier). There can be three cases:

  1. readers >= 0 and the right data: good, can use
  2. readers < 0: bad, cannot use (another thread is in the process of detaching the copy)
  3. readers >= 0 and a different data: bad, cannot use (another thread has detached the copy and reset readers to 0)

Number 3 would be covered by checking for the data_t after the increment if the readers looked right, i.e., after this block. That is also why the detach needs to be properly sequenced with a write memory barrier (so that the detach cannot occur before the atomic operation on readers).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readers field serves as a poor-man's rwlock: if negative we have a writer (the thread detaching); if positive we have, well, readers. No need for an extra lock, just have to handle the case where readers has been reset already after we checked the data_t the first time.

gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID;
gpu_elem->version = 0;
gpu_elem->version = UINT_MAX; /* scrap value for now */
gpu_elem->readers = 0; /* safe to set to no access */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is safe: there might be someone looking at the copy seeing negative readers and decrementing by one. I think this should be an atomic increment by INT_MAX so that eventually readers reaches 0 in all cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or an atomic increment by -INT_MIN ? (INT_MIN is probably not equal to INT_MAX)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point all the protection that could be offered was offered by the atomic update protocols. Now, I know this code remains unsafe for extreme cases, but the only way to fix it is to add a lock. Adding yet another atomic operation will not help.

If another thread is trying to use this copy as a candidate for d2d, and that thread is delayed, it might see the readers set to 0 and assume everything is fine and then move ahead and use the copy. Hopefully that will not happen because the version will be incorrect and the coherency is invalid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assignment conflicts with the roll back a potential reader would do here: https://github.com/ICLDisco/parsec/pull/575/files#diff-a42087f8b140e7fc55fd74480d49a4334d21ff2810f51ca57e0d1955534286e8R1415

The sequence would be: Thread A sets readers to INT_MIN; Thread B increments readers and finds it negative; Thread A detaches the copy and sets readers to 0; Thread B does the rollback by atomically decrementing readers. The result would be readers == -1. I don't think that should be valid.

Instead of assigning 0 we should atomically add -INT_MIN which would guarantee readers >= 0.

parsec_atomic_unlock( &oldmaster->lock );
goto find_another_data;
}
/* Check if this copy is the last dangling reference to the oldmaster. This is safe to do as we own one of the data refcounts. */
int do_unlock = oldmaster->super.obj_reference_count != 1;
parsec_data_copy_detach(oldmaster, lru_gpu_elem, gpu_device->super.device_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be guaranteed sequenced after the CAS so we need a write memory barrier before the detach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, parsec_data_copy_detach() doesn't update data->owner_device, so this detach can lead to a situation where the data->device_copies[data->owner_device] is NULL. Is that a behavior we want to support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there is no need for write memory barrier. There is a single owner of the copy, the thread that will do the CAS and the detach, and the same that will eventually free the copy. Every other threads is limited to updates of the reader field, if they want to manipulate the copy more, they do it via a message to the owner thread.

I don't see why changing the owner_device is needed here, we only change the data_t this copy is associated with, the owner_device will remain the same. The detach will clear the origin so data->device_copies[data->owner_device] will segfault because data will be NULL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In parsec_data_copy_detach we do data->device_copies[device] = copy->older; (where data is copy->original before we set copy->original to NULL. If copy->older is NULL, then that data_t has a device_copies[owner_device] set to NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a bigger problem overall.

  1. NULL == copy->older we have the owner_device pointing to a NULL data. But, we don't know what is legitimately the good version of the remaining copies, so in the generic code we are unable to redirect the owner_device to the correct version.
  2. NULL != copy->older we are setting the owner_device to point to an older version of the data that might exists on the device. Not clear why that version would be the correct one.

parsec_atomic_unlock( &oldmaster->lock );
goto find_another_data;
}
/* Check if this copy is the last dangling reference to the oldmaster. This is safe to do as we own one of the data refcounts. */
int do_unlock = oldmaster->super.obj_reference_count != 1;
parsec_data_copy_detach(oldmaster, lru_gpu_elem, gpu_device->super.device_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, parsec_data_copy_detach() doesn't update data->owner_device, so this detach can lead to a situation where the data->device_copies[data->owner_device] is NULL. Is that a behavior we want to support?

gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID;
gpu_elem->version = 0;
gpu_elem->version = UINT_MAX; /* scrap value for now */
gpu_elem->readers = 0; /* safe to set to no access */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or an atomic increment by -INT_MIN ? (INT_MIN is probably not equal to INT_MAX)

@@ -1295,7 +1306,7 @@ parsec_gpu_data_stage_in( parsec_device_cuda_module_t* cuda_device,
transfer_from = parsec_data_start_transfer_ownership_to_copy(original, gpu_device->super.device_index, (uint8_t)type);

if( PARSEC_FLOW_ACCESS_WRITE & type && gpu_task->task_type != PARSEC_GPU_TASK_TYPE_PREFETCH ) {
gpu_elem->version++; /* on to the next version */
gpu_elem->version = candidate->version + 1; /* on to the next version */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we need to do this here, but I think we also need to fix at line 1487 (before the patch) / 1506 (after the patch): we do gpu_elem->version = candidate->version; when the transfer (if a transfer is needed) is done, and the version of candidate is one less than what we want, and this will overwrite the effect of this action here.

This doesn't typically fail with GEMM or POTRF as benchmarks / tests because they only send the result back to the host memory once, and then we discard / release the GPU copy, so the fact that the version is wrong doesn't trigger a problem, but if you remove the release_memory from the test this should create a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original code, aka. long ago, there was no need to play around with this version until completion of the transfer, because nobody could have used that copy meanwhile. As the code stands today, they can, so the version of the device copy must be carefully handled. At the time where this code executes, the device copy is not yet usable, it will become usable only when the transfer completes, but it needs to be able to be found by others with the correct version.

I removed the release_memory on the tester, and even with checks enabled it still succeed. We might not really understand that code as good as we think we are.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My problem is I don't understand (neither in the code before the PR or inside the PR) what is supposed to increase the version of the GPU copy to what I believe it should be.

I did a run (or the PR, at commit fc82786) with M=1 tile, N=1 tile, K=2 tiles, a single GPU under gdb.

I break on parsec_cuda_kernel_scheduler to see all kernels passed to the GPU, and on parsec_gpu_data_reserve_device_space. When I'm in kernel scheduler, I print the task name. When I'm in reserve_device_space, I go until the GPU copy for C is allocated (line 927), then I install a watchpoint on any change of this copy's version.

Here is what I see:

  • we start GEMM(0, 0, 0), we allocate the GPU copy for C and set the version to 0 (inside the allocation)
  • version goes to UINT_MAX at line 1149 according to the patch
  • it is then set to 2 at line 1311, because it's a write (C) and we take candidate (CPU copy)'s version + 1. transfer_from is 0, everything is logic
  • next step, it is reset to 1 at line 1512 after we have started the copy. You seem to expect that, OK, why not.
  • then it remains unmodified until we start scheduling GEMM(0, 0, 1) on the GPU. But definitely here, its version should be 2. Shouldn't it? The CPU version is 2 (has been updated by the DSL at the completion of the task), so it appears like if the GPU copy is too old (which it isn't).
  • then we reach the call to parsec_data_start_transfer_ownership_to_copy at line 1308:
    • original->owner_device is set to 2, and the GPU copy is in the state SHARED.
    • there is no other copy with the state OWNED, and all other copies are in the state SHARED, so we never check if another copy has a version higher than the GPU version.
    • that explains why we don't consider a new transfer is necessary, and we return -1
  • then we set the GPU copy version to 3 directly, by line 1311 (candidate->version+1)
  • we don't schedule the transfer, so we don't reset the version to candidate->version

It works, but I'm surprised that this is the expected behavior. I would have expected the GPU copy to get the version 2 when the transfer is complete, and the transfer_ownership to return -1 because the GPU copy has the highest version. At the time transfer_ownership returns -1, it looks to me it returns -1 for the wrong reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not what I expect to see here. But, that is another problem for another PR, here we are just fixing the race condition between accelerator copies management.

@therault
Copy link
Contributor

therault commented Oct 2, 2023

I get some asserts triggering with dplasma using this branch for parsec.

dplasma: master (749c912911ed970ef292d7b1d49e6312dd156485)
parsec: latest commit of this branch (4a4fad0)

Compile mode: CMAKE_BUILD_TYPE=Debug, PARSEC_DEBUG_NOISIER=ON and PARSEC_DEBUG_PARANOID=ON

Machine: xsdk (with CUDA enabled)

test: ./tests/testing_dgemm -g 1 -N 1024 -K 1024 -M 1024 -t 1024 -x 0 --nruns 1

Symptom:
testing_dgemm: /home/herault/dplasma/parsec/parsec/data.c:418: parsec_data_start_transfer_ownership_to_copy: Assertion `(!transfer_required) || (data->device_copies[valid_copy]->version >= copy->version)' failed.

Context: trying to stage_in flow 0 of GEMM(0, 0, 0): Flows 0, 1 and 2 of GEMM(0, 0, 0) have been allocated on the device in parsec_gpu_data_reserve_device_space and the version of the new copies set to UINT_MAX at line 1147. Then the code reaches line 1306 in stage_in, calls parsec_data_start_transfer_ownership_to_copy and the assert fails, because the device's copy has version UINT_MAX, which is never smaller than the CPU copy version. Or at least, that's what I understand (and yes, I reckon I don't understand the GPU code... :) ).

@bosilca bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch 3 times, most recently from f47b917 to fc82786 Compare October 2, 2023 20:43
@therault
Copy link
Contributor

therault commented Oct 2, 2023

I get some asserts triggering with dplasma using this branch for parsec.

dplasma: master (749c912911ed970ef292d7b1d49e6312dd156485) parsec: latest commit of this branch (4a4fad0)

Compile mode: CMAKE_BUILD_TYPE=Debug, PARSEC_DEBUG_NOISIER=ON and PARSEC_DEBUG_PARANOID=ON

Machine: xsdk (with CUDA enabled)

test: ./tests/testing_dgemm -g 1 -N 1024 -K 1024 -M 1024 -t 1024 -x 0 --nruns 1

Symptom: testing_dgemm: /home/herault/dplasma/parsec/parsec/data.c:418: parsec_data_start_transfer_ownership_to_copy: Assertion `(!transfer_required) || (data->device_copies[valid_copy]->version >= copy->version)' failed.

Context: trying to stage_in flow 0 of GEMM(0, 0, 0): Flows 0, 1 and 2 of GEMM(0, 0, 0) have been allocated on the device in parsec_gpu_data_reserve_device_space and the version of the new copies set to UINT_MAX at line 1147. Then the code reaches line 1306 in stage_in, calls parsec_data_start_transfer_ownership_to_copy and the assert fails, because the device's copy has version UINT_MAX, which is never smaller than the CPU copy version. Or at least, that's what I understand (and yes, I reckon I don't understand the GPU code... :) ).

this is solved

gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID;
gpu_elem->version = 0;
gpu_elem->version = UINT_MAX; /* scrap value for now */
gpu_elem->readers = 0; /* safe to set to no access */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assignment conflicts with the roll back a potential reader would do here: https://github.com/ICLDisco/parsec/pull/575/files#diff-a42087f8b140e7fc55fd74480d49a4334d21ff2810f51ca57e0d1955534286e8R1415

The sequence would be: Thread A sets readers to INT_MIN; Thread B increments readers and finds it negative; Thread A detaches the copy and sets readers to 0; Thread B does the rollback by atomically decrementing readers. The result would be readers == -1. I don't think that should be valid.

Instead of assigning 0 we should atomically add -INT_MIN which would guarantee readers >= 0.

@bosilca bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch from fc82786 to a9a806e Compare October 5, 2023 20:01
@bosilca bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch 6 times, most recently from d30f416 to 609bbf2 Compare October 14, 2023 16:49
Signed-off-by: George Bosilca <[email protected]>
Prevent the owner device from repurposing a GPU copy while another GPU
is planning to use it as a candidate for a d2d transfer.

Signed-off-by: George Bosilca <[email protected]>
@bosilca bosilca force-pushed the topic/fix_race_condition_in_candidate_selection branch from 609bbf2 to f69460c Compare October 18, 2023 20:55
Copy link
Contributor

@therault therault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bosilca bosilca merged commit 7eba1f4 into ICLDisco:master Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants