-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent race condition in accelerator copies management #575
Prevent race condition in accelerator copies management #575
Conversation
/* We are trying to use a candidate that is repurposed by the owner device. Let's find another one */ | ||
parsec_atomic_fetch_add_int32(&candidate->readers, -1); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the increment we also need to check that the candidate has not been detached?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not here, this is the correction path where we put back the readers we tried to increase. The thread doing this code will not use this candidate, so there is no need for further checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the path if readers < 0
(the case where we don't use the copy). But what about the alternative path (anything below this block)? We could encounter a copy with readers == 1
that is detached from the data we're looking for, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this scenario is possible anymore. Either the copy is not repurposed in which case the readers is > 0
, or is currently repurposed by the owner in which case the atomicity guarantees that either readers is positive and the current thread can use it, is negative and the owner will get it from there.
However, as my comment in another of the comments states, if the current thread is significantly delayed at considering the copy, the owner could have completed the transfer and the copy is now associated with a different data_t. Without a lock I don't know how to avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is sufficient to check for the data after the increment (properly sequenced with a read memory barrier). There can be three cases:
readers >= 0
and the right data: good, can usereaders < 0
: bad, cannot use (another thread is in the process of detaching the copy)readers >= 0
and a different data: bad, cannot use (another thread has detached the copy and reset readers to 0)
Number 3 would be covered by checking for the data_t after the increment if the readers looked right, i.e., after this block. That is also why the detach needs to be properly sequenced with a write memory barrier (so that the detach cannot occur before the atomic operation on readers
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The readers
field serves as a poor-man's rwlock: if negative we have a writer (the thread detaching); if positive we have, well, readers. No need for an extra lock, just have to handle the case where readers
has been reset already after we checked the data_t the first time.
gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID; | ||
gpu_elem->version = 0; | ||
gpu_elem->version = UINT_MAX; /* scrap value for now */ | ||
gpu_elem->readers = 0; /* safe to set to no access */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is safe: there might be someone looking at the copy seeing negative readers and decrementing by one. I think this should be an atomic increment by INT_MAX
so that eventually readers
reaches 0 in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or an atomic increment by -INT_MIN
? (INT_MIN
is probably not equal to INT_MAX
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point all the protection that could be offered was offered by the atomic update protocols. Now, I know this code remains unsafe for extreme cases, but the only way to fix it is to add a lock. Adding yet another atomic operation will not help.
If another thread is trying to use this copy as a candidate for d2d, and that thread is delayed, it might see the readers set to 0 and assume everything is fine and then move ahead and use the copy. Hopefully that will not happen because the version will be incorrect and the coherency is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assignment conflicts with the roll back a potential reader would do here: https://github.com/ICLDisco/parsec/pull/575/files#diff-a42087f8b140e7fc55fd74480d49a4334d21ff2810f51ca57e0d1955534286e8R1415
The sequence would be: Thread A sets readers
to INT_MIN
; Thread B increments readers
and finds it negative; Thread A detaches the copy and sets readers
to 0; Thread B does the rollback by atomically decrementing readers
. The result would be readers == -1
. I don't think that should be valid.
Instead of assigning 0 we should atomically add -INT_MIN
which would guarantee readers >= 0
.
parsec_atomic_unlock( &oldmaster->lock ); | ||
goto find_another_data; | ||
} | ||
/* Check if this copy is the last dangling reference to the oldmaster. This is safe to do as we own one of the data refcounts. */ | ||
int do_unlock = oldmaster->super.obj_reference_count != 1; | ||
parsec_data_copy_detach(oldmaster, lru_gpu_elem, gpu_device->super.device_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be guaranteed sequenced after the CAS so we need a write memory barrier before the detach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, parsec_data_copy_detach()
doesn't update data->owner_device
, so this detach can lead to a situation where the data->device_copies[data->owner_device]
is NULL
. Is that a behavior we want to support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there is no need for write memory barrier. There is a single owner of the copy, the thread that will do the CAS and the detach, and the same that will eventually free the copy. Every other threads is limited to updates of the reader field, if they want to manipulate the copy more, they do it via a message to the owner thread.
I don't see why changing the owner_device
is needed here, we only change the data_t this copy is associated with, the owner_device
will remain the same. The detach will clear the origin
so data->device_copies[data->owner_device]
will segfault because data
will be NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In parsec_data_copy_detach
we do data->device_copies[device] = copy->older;
(where data
is copy->original
before we set copy->original
to NULL
. If copy->older
is NULL, then that data_t
has a device_copies[owner_device]
set to NULL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a bigger problem overall.
NULL == copy->older
we have the owner_device pointing to a NULL data. But, we don't know what is legitimately the good version of the remaining copies, so in the generic code we are unable to redirect theowner_device
to the correct version.NULL != copy->older
we are setting theowner_device
to point to an older version of the data that might exists on the device. Not clear why that version would be the correct one.
parsec_atomic_unlock( &oldmaster->lock ); | ||
goto find_another_data; | ||
} | ||
/* Check if this copy is the last dangling reference to the oldmaster. This is safe to do as we own one of the data refcounts. */ | ||
int do_unlock = oldmaster->super.obj_reference_count != 1; | ||
parsec_data_copy_detach(oldmaster, lru_gpu_elem, gpu_device->super.device_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, parsec_data_copy_detach()
doesn't update data->owner_device
, so this detach can lead to a situation where the data->device_copies[data->owner_device]
is NULL
. Is that a behavior we want to support?
gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID; | ||
gpu_elem->version = 0; | ||
gpu_elem->version = UINT_MAX; /* scrap value for now */ | ||
gpu_elem->readers = 0; /* safe to set to no access */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or an atomic increment by -INT_MIN
? (INT_MIN
is probably not equal to INT_MAX
)
@@ -1295,7 +1306,7 @@ parsec_gpu_data_stage_in( parsec_device_cuda_module_t* cuda_device, | |||
transfer_from = parsec_data_start_transfer_ownership_to_copy(original, gpu_device->super.device_index, (uint8_t)type); | |||
|
|||
if( PARSEC_FLOW_ACCESS_WRITE & type && gpu_task->task_type != PARSEC_GPU_TASK_TYPE_PREFETCH ) { | |||
gpu_elem->version++; /* on to the next version */ | |||
gpu_elem->version = candidate->version + 1; /* on to the next version */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we need to do this here, but I think we also need to fix at line 1487 (before the patch) / 1506 (after the patch): we do gpu_elem->version = candidate->version;
when the transfer (if a transfer is needed) is done, and the version of candidate
is one less than what we want, and this will overwrite the effect of this action here.
This doesn't typically fail with GEMM or POTRF as benchmarks / tests because they only send the result back to the host memory once, and then we discard / release the GPU copy, so the fact that the version is wrong doesn't trigger a problem, but if you remove the release_memory from the test this should create a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original code, aka. long ago, there was no need to play around with this version until completion of the transfer, because nobody could have used that copy meanwhile. As the code stands today, they can, so the version of the device copy must be carefully handled. At the time where this code executes, the device copy is not yet usable, it will become usable only when the transfer completes, but it needs to be able to be found by others with the correct version.
I removed the release_memory
on the tester, and even with checks enabled it still succeed. We might not really understand that code as good as we think we are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My problem is I don't understand (neither in the code before the PR or inside the PR) what is supposed to increase the version of the GPU copy to what I believe it should be.
I did a run (or the PR, at commit fc82786) with M=1 tile, N=1 tile, K=2 tiles, a single GPU under gdb.
I break on parsec_cuda_kernel_scheduler
to see all kernels passed to the GPU, and on parsec_gpu_data_reserve_device_space
. When I'm in kernel scheduler, I print the task name. When I'm in reserve_device_space, I go until the GPU copy for C is allocated (line 927), then I install a watchpoint on any change of this copy's version.
Here is what I see:
- we start GEMM(0, 0, 0), we allocate the GPU copy for C and set the version to 0 (inside the allocation)
- version goes to UINT_MAX at line 1149 according to the patch
- it is then set to 2 at line 1311, because it's a write (C) and we take candidate (CPU copy)'s version + 1. transfer_from is 0, everything is logic
- next step, it is reset to 1 at line 1512 after we have started the copy. You seem to expect that, OK, why not.
- then it remains unmodified until we start scheduling GEMM(0, 0, 1) on the GPU. But definitely here, its version should be 2. Shouldn't it? The CPU version is 2 (has been updated by the DSL at the completion of the task), so it appears like if the GPU copy is too old (which it isn't).
- then we reach the call to
parsec_data_start_transfer_ownership_to_copy
at line 1308:- original->owner_device is set to 2, and the GPU copy is in the state SHARED.
- there is no other copy with the state OWNED, and all other copies are in the state SHARED, so we never check if another copy has a version higher than the GPU version.
- that explains why we don't consider a new transfer is necessary, and we return -1
- then we set the GPU copy version to 3 directly, by line 1311 (candidate->version+1)
- we don't schedule the transfer, so we don't reset the version to candidate->version
It works, but I'm surprised that this is the expected behavior. I would have expected the GPU copy to get the version 2 when the transfer is complete, and the transfer_ownership to return -1 because the GPU copy has the highest version. At the time transfer_ownership returns -1, it looks to me it returns -1 for the wrong reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not what I expect to see here. But, that is another problem for another PR, here we are just fixing the race condition between accelerator copies management.
I get some asserts triggering with dplasma using this branch for parsec. dplasma: master (749c912911ed970ef292d7b1d49e6312dd156485) Compile mode: CMAKE_BUILD_TYPE=Debug, PARSEC_DEBUG_NOISIER=ON and PARSEC_DEBUG_PARANOID=ON Machine: xsdk (with CUDA enabled) test: Symptom: Context: trying to stage_in flow 0 of GEMM(0, 0, 0): Flows 0, 1 and 2 of GEMM(0, 0, 0) have been allocated on the device in |
f47b917
to
fc82786
Compare
this is solved |
gpu_elem->coherency_state = PARSEC_DATA_COHERENCY_INVALID; | ||
gpu_elem->version = 0; | ||
gpu_elem->version = UINT_MAX; /* scrap value for now */ | ||
gpu_elem->readers = 0; /* safe to set to no access */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assignment conflicts with the roll back a potential reader would do here: https://github.com/ICLDisco/parsec/pull/575/files#diff-a42087f8b140e7fc55fd74480d49a4334d21ff2810f51ca57e0d1955534286e8R1415
The sequence would be: Thread A sets readers
to INT_MIN
; Thread B increments readers
and finds it negative; Thread A detaches the copy and sets readers
to 0; Thread B does the rollback by atomically decrementing readers
. The result would be readers == -1
. I don't think that should be valid.
Instead of assigning 0 we should atomically add -INT_MIN
which would guarantee readers >= 0
.
fc82786
to
a9a806e
Compare
d30f416
to
609bbf2
Compare
Signed-off-by: George Bosilca <[email protected]>
Prevent the owner device from repurposing a GPU copy while another GPU is planning to use it as a candidate for a d2d transfer. Signed-off-by: George Bosilca <[email protected]>
609bbf2
to
f69460c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Prevent the owner device from repurposing a GPU copy while another GPU is planning to use it as a candidate for a d2d transfer.