Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osc/pt2pt multithread fixes #2615

Closed
wants to merge 2 commits into from

Conversation

jjhursey
Copy link
Member

Fixes for Issue #2614

Currently a work in progress.

 * The `opal_list_remove_first` is protected by the accumulation lock
   not the module lock. So make sure to use the same lock to protect
   the `opal_list_append` to prevent current access, which can cause
   a segv and/or lost enties in the list.

Signed-off-by: Joshua Hursey <[email protected]>
 * The `ompi_osc_pt2pt_sync_pscw_peer` call needs ot be protected when
   called, and the `sync` structure need to be protected in the `start`
   function.
   - This prevents the `ompi_osc_pt2pt_sync_pscw_peer` function from
     processing while `start` is also in progress on the same window in
     two different threads (e.g., `progress` and `main` thread)
   - This seems to happen when the 'main' thread is part way through
     the `start` function then the progress thread starts processing
     the `post` message received from another peer for this window.
     Both functions try to access the `peer_list` portion of the
     structure and a NULL is stepped on in the `ompi_osc_pt2pt_sync_pscw_peer`
     function.
   - This patch locks the `sync` structure for the duration of the `start`
     and ensures that whenever `ompi_osc_pt2pt_sync_pscw_peer` is called
     the caller is holding the `sync->lock`. This provides exclusivity
     to this structure ensuring that the latter function sees a fully
     updated structure.

Signed-off-by: Joshua Hursey <[email protected]>
@hjelmn
Copy link
Member

hjelmn commented Jan 12, 2017

@jjhursey I am addressed these issues. Looks like I forgot to change some things when I added the synchronization structure. Will have a PR up soon.

@jjhursey
Copy link
Member Author

@hjelmn The sounds good. Once you have your version of the PR we can close this one, and @markalle and I will test/review. Thanks!

@hjelmn
Copy link
Member

hjelmn commented Jan 12, 2017

@jjhursey Sorry it took so long for me to look at this. The commits here helped me narrow down some of the more egregious errors... Did I ever mention I intensely dislike active target? We really need to remove that from the standard. What an odd concept.

@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/ac6a1f2f8a43992f5432c6f872195c8d

@jjhursey
Copy link
Member Author

@markalle @hjelmn Do we need this PR any more?

@jjhursey
Copy link
Member Author

This PR has grown stale. We can reopen if we need it later. @markalle If the problem is still present let's make sure there is an Issue to track progress.

@jjhursey jjhursey closed this Oct 10, 2017
@jjhursey jjhursey deleted the topic/osc-pt2pt-mt-fixes branch October 10, 2017 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants