Fix synchronisation problems in the clusterizer #102
be5c775 to d448e03
moduleStart_d,
clusInModule_d, moduleId_d,
clus_d,
wordCounter
);
cudaDeviceSynchronize();
Is this cudaDeviceSynchronize() left over from the debugging?
yes...
#endif

first += threadIdx.x;
if (first >= numElements)
return;
For my education, why is this replaced with "bool active = (first < numElements); if (active) {"?
I am unsure how __syncthreads is affected by threads that have returned, so I have removed all return statements.
I can check whether putting them back reintroduces a problem or not...
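For reference, a minimal sketch of the pattern discussed in this thread, not the actual findClus code. The CUDA programming guide allows __syncthreads() in conditional code only if the condition evaluates identically across the whole thread block, so the sketch keeps every thread alive and guards only the per-element work with an active flag. The kernel blockSum, the host helper sumOnes and the block size kThreads are hypothetical names chosen for the illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // illustrative block size; must be a power of two

// Hypothetical kernel: each block sums its slice of `in` through shared memory.
__global__ void blockSum(const int* in, int* out, int numElements) {
  __shared__ int partial[kThreads];

  int first = blockIdx.x * blockDim.x + threadIdx.x;

  // Instead of `if (first >= numElements) return;`, keep the thread alive and
  // guard only the work, so that every thread reaches every barrier below.
  bool active = (first < numElements);
  partial[threadIdx.x] = active ? in[first] : 0;
  __syncthreads();

  // Tree reduction in shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)
      partial[threadIdx.x] += partial[threadIdx.x + stride];
    __syncthreads();  // outside the `if`: executed by all threads of the block
  }

  if (threadIdx.x == 0)
    out[blockIdx.x] = partial[0];
}

// Hypothetical host helper: sums `numElements` ones on the device.
// Returns the total, or -1 if any CUDA call fails.
int sumOnes(int numElements) {
  if (numElements <= 0) return 0;
  const int blocks = (numElements + kThreads - 1) / kThreads;
  std::vector<int> in_h(numElements, 1), out_h(blocks, 0);

  int* in_d = nullptr;
  int* out_d = nullptr;
  if (cudaMalloc(&in_d, numElements * sizeof(int)) != cudaSuccess) return -1;
  if (cudaMalloc(&out_d, blocks * sizeof(int)) != cudaSuccess) {
    cudaFree(in_d);
    return -1;
  }

  cudaError_t status = cudaMemcpy(in_d, in_h.data(), numElements * sizeof(int),
                                  cudaMemcpyHostToDevice);
  if (status == cudaSuccess) {
    blockSum<<<blocks, kThreads>>>(in_d, out_d, numElements);
    status = cudaGetLastError();  // reports launch errors
  }
  if (status == cudaSuccess) {
    // Blocking copy: also waits for the kernel to finish.
    status = cudaMemcpy(out_h.data(), out_d, blocks * sizeof(int),
                        cudaMemcpyDeviceToHost);
  }
  cudaFree(in_d);
  cudaFree(out_d);
  if (status != cudaSuccess) return -1;

  int total = 0;
  for (int v : out_h) total += v;
  return total;
}
```

Calling sumOnes(1000) launches four blocks, the last of which has only 232 active threads; the remaining 24 threads contribute zeros but still take part in every __syncthreads().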
Ok, thanks for the explanation!
Looks like we just pushed back the crash to event 400+... back to debugging.
177c055 to 0260b3e
@VinInn can you help us look into a related crash?
It should crash after a few events with
:-(
OK, I will need to understand your changes.
Validation summary: reference release CMSSW_10_2_0_pre6 at a674e1f
Improve documentation.
0260b3e to 3c139fc
I think I found the error in this PR:
was not a very useful loop :-(
Thanks for addressing #66.
countModules<<<blocks, threadsPerBlock, 0, stream.id()>>>(moduleInd_d, moduleStart_d, clus_d, wordCounter);
cudaCheck(cudaGetLastError());

// read the number of modules into a data memeber, used by getProduct())
Typo, memeber -> member
Fixed, thanks
// clusters
cudaCheck(cudaMemcpyAsync(clus_h, clus_d, wordCounter*sizeof(uint32_t), cudaMemcpyDefault, stream.id()));
cudaCheck(cudaGetLastError());
I believe the last cudaGetLastError() is not needed after removing the cudaStreamSynchronize(); possible errors in queueing the async copy should already be caught by its return value.
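To illustrate the point about the return value, a hedged sketch: the status returned by cudaMemcpyAsync already reports a failure to queue the copy, so it can be checked directly without a further cudaGetLastError(). Errors that occur while the copy actually executes show up later, for example when the stream is synchronised. Here checkCuda is a hypothetical stand-in for the project's cudaCheck (whose implementation is not shown in this conversation), and copyClustersAsync and roundTrip are made-up names.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for cudaCheck: print the error and report success/failure.
inline bool checkCuda(cudaError_t status, const char* what) {
  if (status != cudaSuccess) {
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
    return false;
  }
  return true;
}

// Queue a device-to-host copy on `stream` without synchronising.
// The return value of cudaMemcpyAsync is the only check needed at this point.
bool copyClustersAsync(uint32_t* clus_h, const uint32_t* clus_d,
                       uint32_t wordCounter, cudaStream_t stream) {
  return checkCuda(cudaMemcpyAsync(clus_h, clus_d, wordCounter * sizeof(uint32_t),
                                   cudaMemcpyDefault, stream),
                   "cudaMemcpyAsync");
}

// Hypothetical driver: fill a device buffer, copy it back asynchronously,
// and synchronise once at the end, where the data are actually needed.
bool roundTrip(uint32_t wordCounter) {
  const size_t bytes = wordCounter * sizeof(uint32_t);
  uint32_t* clus_h = nullptr;
  uint32_t* clus_d = nullptr;
  cudaStream_t stream;
  if (!checkCuda(cudaStreamCreate(&stream), "cudaStreamCreate")) return false;

  bool ok = checkCuda(cudaMallocHost(&clus_h, bytes), "cudaMallocHost") &&
            checkCuda(cudaMalloc(&clus_d, bytes), "cudaMalloc") &&
            checkCuda(cudaMemsetAsync(clus_d, 0xFF, bytes, stream), "cudaMemsetAsync") &&
            copyClustersAsync(clus_h, clus_d, wordCounter, stream) &&
            checkCuda(cudaStreamSynchronize(stream), "cudaStreamSynchronize");

  // After the synchronisation every word should have all bits set.
  for (uint32_t i = 0; ok && i < wordCounter; ++i)
    ok = (clus_h[i] == 0xFFFFFFFFu);

  if (clus_d) cudaFree(clus_d);
  if (clus_h) cudaFreeHost(clus_h);
  cudaStreamDestroy(stream);
  return ok;
}
```

A call such as roundTrip(1024) exercises the whole path; the single cudaStreamSynchronize at the end is where any error from executing the queued work would be reported.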
Fixed, thanks
break;
}
default: if (debug) printf("Cabling check returned unexpected result, status = %i\n", status);
case(1) : {
Now that you "point" to them, I find these parentheses weird.
Fix synchronisation problems in the findClus kernel, and improve its documentation.
Avoid unnecessary cudaStreamSynchronize in makeClustersAsync.