-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
global symbol name pollution #10708
Comments
I added Severity: critical for v5.0.0. |
@gpaulsen asked me to post how many symbols there are that need changed, based on my run just now it's ~250. Admittedly if we're only being pragmatic and non-strict we could ignore long names like The compromise solution I advocate though is to be strict about requiring some kind of prefix, but pragmatic about letting it be a large list of "acceptable" prefixes. |
Because some of these are so generic, such as "distance" and "dfs", changing this to a blocker. |
I've tried various different things to try and force an error - @markalle do you have a simple reproducer that will show a failure if a user defines one of these global symbols? |
Ah, yes I did make an actual test that shows how redefining one of those global symbols in an MPI user applications breaks OMPI: *** demonstrating why pollution is a problem: I tried a few functions and not everything gets called in every run of course, but one example that I was able to get called was "append_frag_to_ordered_list". A user could have a function with that name in their user app, and if they do, maybe creating an app that looks like this: #include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
static void
myprint(char *mesg)
{
char str[4096];
printf("*** %s -- in App's function\n", mesg);
fflush(stdout);
sprintf(str,
"gdb -q -batch -ex bt -p %d < "
"/dev/null 2>&1 | grep -v "
" -e '^Missing separate debuginfo for ' "
" -e '^Warning:.*No such file or directory' "
" -e 'Inferior .*process .*detached' "
" -e ' zypper install -C ' | "
"sed -e 's/^/[p%d] /'", getpid(), getpid());
system(str);
fflush(stdout);
}
void append_frag_to_ordered_list() { myprint("append_frag_to_ordered_list"); }
int
main(int argc, char *argv[])
{
int myrank, nranks;
char myhost[MPI_MAX_PROCESSOR_NAME];
int len;
char *sbuf, *rbuf;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
MPI_Get_processor_name(myhost, &len);
sbuf = malloc(10000000);
rbuf = malloc(10000000);
MPI_Bcast(sbuf, 100, MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Bcast(sbuf, 1000, MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Bcast(sbuf, 10000, MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Bcast(sbuf, 100000, MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Bcast(sbuf, 1000000, MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Bcast(sbuf, 10000000, MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Sendrecv(sbuf, 10, MPI_CHAR, (myrank+1)%nranks, 99,
rbuf, 10, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(sbuf, 1000, MPI_CHAR, (myrank+1)%nranks, 99,
rbuf, 1000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(sbuf, 100000, MPI_CHAR, (myrank+1)%nranks, 99,
rbuf, 100000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(sbuf, 10000000, MPI_CHAR, (myrank+1)%nranks, 99,
rbuf, 10000000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Finalize();
free(sbuf);
free(rbuf);
printf("past finalize at rank %d/%d (on %s)\n", myrank, nranks, myhost);
exit(0);
} And run with something like
The stack trace comes up as deep in OMPI code where suddenly it thinks it's calling OMPI's "append_frag_to_ordered_list" but it's actually calling the user Application's redefinition.
|
See also #8262. |
Per Open MPI teleconf Sept. 6:
|
See also #10770 |
Purely FWIW: in PMIx and PRRTE we replaced it with |
Sounds like I should eliminate PMIX from the starter doc PR #10770, if for no other reason than to avoid speaking about things beyond just OMPI/OPAL's use of DECLSPEC. |
Ah, I hadn't looked at it - yes, that would be good as we eliminated DECLSPEC from our code. |
I looked at the list of symbols generated by @markalle testcase and tagged them in 3 categories,
After discussion with @gpaulsen, my plan is
There are about 160 symbols in treematch out of the about 250 symbols in this list The list is attached |
In my first pass thru the 3rdparty components, i got the number of symbols that need to be fixed for the following components
|
Make opal/ompi symbols used in only one file 'static' Signed-off-by: David Wootton <[email protected]>
Part 1 of solving this, making symbols in opal and ompi subdirectories static is solved by pull request #10825 |
Part 1 of resolving global symbol name pollution issue #10708
The changes in the two open pull requests associated with this issue, #10859 and #10863 resolve the remaining incorrect symbols other than the following symbols in the 3rd-party/romio341 component.
The test script complains about some symbols, which I renamed treematch_* since the script is not expecting treematch as a valid prefix, but that's not a problem these symbols are renamed as expected. |
Part 2 of resolving global symbol name pollution issue #10708
Part 1 of resolving global symbol name pollution issue #10708
Part 2 of resolving global symbol name pollution issue #10708 for OMPI v5.0.x
Part 3 of resolving global symbol name pollution #10708 - treematch
Fix review comments Signed-off-by: David Wootton <[email protected]> (cherry picked from commit 2a95bb9)
Once pull requests #10825 #10859 and #10863 were merged into the main branch, the test case output listing improperly named external symbols is
Corresponding pull requests for the V5.0.x branch are #10936 #10956 and #10981 |
Fix review comments Signed-off-by: David Wootton <[email protected]> (cherry picked from commit 2a95bb9)
All the remaining symbols are from ROMIO, so we intentionally won't change them. So I think this ticket is complete for After PR #10981 is merged into |
Part 3 of resolving global symbol name pollution #10708 for V5.0.x
…ematch Fix review comments Signed-off-by: David Wootton <[email protected]>
…ematch Fix review comments Signed-off-by: David Wootton <[email protected]>
Closing - @drwootton please post results of v5.0.x here when you have a chance. If those reveal some more items we can re-open. Thanks for pushing on this! |
@awlauria This is the v5.0.x output after all pull requests were merged |
It would be nifty if someone had the time to setup a CI test to make sure we don't regress here. |
Make opal/ompi symbols used in only one file 'static' Fix review comments Signed-off-by: David Wootton <[email protected]>
Rename external symbols with proper ompi_* or opal_* component prefixes Fix review comments Resolve remaining incorrect external symbols and clean up unused functions Rebase to current main External functions should not have _mca_ as part of their function name Signed-off-by: David Wootton <[email protected]>
…ematch Fix review comments Signed-off-by: David Wootton <[email protected]>
You can download testcase files here:
https://github.com/open-mpi/ompi-tests-public/blob/master/packaging/run_nmcheck.c
https://github.com/open-mpi/ompi-tests-public/blob/master/packaging/nmcheck_prefix.pl
With those in place and a build of OMPI available, running the test is just:
And a big list of questionable global exports should get listed.
Or a lighter weight way to get similar output is simply
although I recommend the ompi-tests-public testcase since it checks more libraries and has a longer list of global exports it allows as "acceptable" prefixes. Those prefixes are a somewhat arbitrary decision based on pragmatism -- eg what symbol names are likely to collide with symbols from a user's application or possibly their other support libraries.
When I run the test on branch
v5.0.x
I getNote, when it comes to fixing them sometimes you can make things static, but often the symbols will span multiple files so in that case a prefix can be added to the symbol. Also when symbols like
adapt_topology_cache_item_t_class
show up (things that end with _class) that means there's a line likeand the occurrances of
adapt_topology_cache_item_t
need updated. Eg I'm just noting that you can't search the code foradapt_topology_cache_item_t_class
, you have to just search foradapt_topology_cache_item_t
to find the cause of theadapt_topology_cache_item_t_class
export.The text was updated successfully, but these errors were encountered: