Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

global symbol name pollution #10708

Closed
markalle opened this issue Aug 23, 2022 · 21 comments
Closed

global symbol name pollution #10708

markalle opened this issue Aug 23, 2022 · 21 comments

Comments

@markalle
Copy link
Contributor

You can download testcase files here:
https://github.com/open-mpi/ompi-tests-public/blob/master/packaging/run_nmcheck.c
https://github.com/open-mpi/ompi-tests-public/blob/master/packaging/nmcheck_prefix.pl

With those in place and a build of OMPI available, running the test is just:

#!/bin/sh
export OPAL_PREFIX=/path/to/some/ompi_install
export LD_LIBRARY_PATH=${OPAL_PREFIX}/lib
$OPAL_PREFIX/bin/mpicc -o x run_nmcheck.c
$OPAL_PREFIX/bin/mpirun -np 1 ./x

And a big list of questionable global exports should get listed.

Or a lighter weight way to get similar output is simply

nm ompi_install/lib/libmpi.so | grep ' [TWBCDVR] ' | grep -v -e ' MPI_' -e ' mpi_' -e ' PMPI_' -e ' MPIX_' -e ' MPL_' -e ' ADIOI_' -e ' ADIO_' -e ' ompi_' -e ' OMPI_' -e ' mca_' | less

although I recommend the ompi-tests-public testcase since it checks more libraries and has a longer list of global exports it allows as "acceptable" prefixes. Those prefixes are a somewhat arbitrary decision based on pragmatism -- eg what symbol names are likely to collide with symbols from a user's application or possibly their other support libraries.

When I run the test on branch v5.0.x I get

*** checking /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libopen-pal.so.0
    [error]   common_sm_mpool_create
    [error]   selected_component
    [error]   shmem_posix_shm_open
    [error]   wait_sync_global_wakeup_mt
    [error]   wait_sync_global_wakeup_st
    [error]   wait_sync_list
*** checking /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libpmix.so.0
    [error]   file_caddy_t_class
    [error]   file_tracker_t_class
    [error]   heartbeat_caddy_t_class
*** checking /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libmpi.so.0
    [error]   _dict_free
    [error]   _dict_malloc
    [error]   ad_get_env_vars
    [error]   adapt_module_cached_topology
    [error]   adapt_topology_cache_item_t_class
    [error]   add_edge_3
    [error]   add_to_bucket
    [error]   add_to_list
    [error]   adjacency_asc
    [error]   adjacency_dsc
    [error]   aggregate_obj_weight
    [error]   algo
    [error]   allocate_vertex
    [error]   allocate_vertex2
    [error]   append_frag_to_ordered_list
    [error]   available_components
    [error]   balancing
    [error]   bottom_up_build_tree_from_topology
    [error]   bucket_grouping
    [error]   bucket_id
    [error]   build_cost_matrix
    [error]   build_level_topology
    [error]   build_p_vector
    [error]   build_synthetic_proc_id
    [error]   built_pivot_tree
    [error]   check_bucket
    [error]   check_cantmatch_for_match
    [error]   check_constraints
    [error]   choose
    [error]   clone_tree
    [error]   common_monitoring_translation_ht
    [error]   complete_aff_mat
    [error]   complete_obj_weight
    [error]   complete_tab_node
    [error]   compute_gain
    [error]   compute_nb_leaves_from_level
    [error]   compute_weighted_degree
    [error]   create_dumb_tree
    [error]   create_work
    [error]   delete_group_list
    [error]   depth_first
    [error]   destroy_work
    [error]   destruction
    [error]   dfs
    [error]   dict_destroy
    [error]   dict_int_cmp
    [error]   dict_itor_destroy
    [error]   dict_long_cmp
    [error]   dict_ptr_cmp
    [error]   dict_set_free
    [error]   dict_set_malloc
    [error]   dict_str_cmp
    [error]   dict_uint_cmp
    [error]   dict_ulong_cmp
    [error]   display_bucket
    [error]   display_bucket_list
    [error]   display_grouping
    [error]   display_node
    [error]   display_pivots
    [error]   display_selection
    [error]   display_sol
    [error]   display_sol_sum_com
    [error]   display_tab
    [error]   display_tab_group
    [error]   distance
    [error]   era_agreement_info_t_class
    [error]   era_comm_agreement_specific_t_class
    [error]   era_iagree_request_t_class
    [error]   era_rank_item_t_class
    [error]   era_value_t_class
    [error]   eval_cost
    [error]   eval_cost2
    [error]   eval_grouping
    [error]   eval_sol
    [error]   exchange
    [error]   fast_group
    [error]   fast_grouping
    [error]   fbtl_posix_max_aio_active_reqs
    [error]   fiboTreeConsolidate
    [error]   fiboTreeDel
    [error]   fiboTreeExit
    [error]   fiboTreeFree
    [error]   fiboTreeInit
    [error]   fiboTreeMin
    [error]   file_stats
   ...
    [error]   zero_dte_mapping
*** checking /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libmpi_mpifh.so
*** checking /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libmpi_usempif08.so
*** checking /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libmpi_usempi_ignore_tkr.so

Note, when it comes to fixing them sometimes you can make things static, but often the symbols will span multiple files so in that case a prefix can be added to the symbol. Also when symbols like adapt_topology_cache_item_t_class show up (things that end with _class) that means there's a line like

OBJ_CLASS_INSTANCE(adapt_topology_cache_item_t, ...)

and the occurrances of adapt_topology_cache_item_t need updated. Eg I'm just noting that you can't search the code for adapt_topology_cache_item_t_class, you have to just search for adapt_topology_cache_item_t to find the cause of the adapt_topology_cache_item_t_class export.

@gpaulsen
Copy link
Member

I added Severity: critical for v5.0.0.
#8704 exposed many more incorrectly scoped symbols to the user in libmpi.so than were exposed in v4.x series. This may cause linking issues (or worse?) for our customers, and should be fixed before v5.0.0.

@markalle
Copy link
Contributor Author

@gpaulsen asked me to post how many symbols there are that need changed, based on my run just now it's ~250.

Admittedly if we're only being pragmatic and non-strict we could ignore long names like bottom_up_build_tree_from_topology since even though it lacks any OMPI-reserved prefix, it seems extraordinarily unlikely a user app would just happen to use that same function name. Vs say create_work or distance that would be super easy for an app to also have.

The compromise solution I advocate though is to be strict about requiring some kind of prefix, but pragmatic about letting it be a large list of "acceptable" prefixes.

@awlauria
Copy link
Contributor

Because some of these are so generic, such as "distance" and "dfs", changing this to a blocker.

@awlauria
Copy link
Contributor

I've tried various different things to try and force an error - @markalle do you have a simple reproducer that will show a failure if a user defines one of these global symbols?

@markalle
Copy link
Contributor Author

markalle commented Aug 30, 2022

Ah, yes I did make an actual test that shows how redefining one of those global symbols in an MPI user applications breaks OMPI:

*** demonstrating why pollution is a problem:

I tried a few functions and not everything gets called in every run of course, but one example that I was able to get called was "append_frag_to_ordered_list". A user could have a function with that name in their user app, and if they do, maybe creating an app that looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
static void
myprint(char *mesg)
{
    char str[4096];
    printf("*** %s -- in App's function\n", mesg);
    fflush(stdout);
    sprintf(str,
        "gdb -q -batch -ex bt -p %d < "
        "/dev/null 2>&1 | grep -v "
        "  -e '^Missing separate debuginfo for ' "
        "  -e '^Warning:.*No such file or directory' "
        "  -e 'Inferior .*process .*detached' "
        "  -e ' zypper install -C ' | "
        "sed -e 's/^/[p%d] /'", getpid(), getpid());
    system(str);
    fflush(stdout);
}
void append_frag_to_ordered_list() { myprint("append_frag_to_ordered_list"); }
int
main(int argc, char *argv[])
{
    int   myrank, nranks;
    char  myhost[MPI_MAX_PROCESSOR_NAME];
    int   len;
    char  *sbuf, *rbuf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(myhost, &len);
    sbuf = malloc(10000000);
    rbuf = malloc(10000000);
    MPI_Bcast(sbuf, 100, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 1000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 10000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 100000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 1000000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 10000000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Sendrecv(sbuf, 10, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 10, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sbuf, 1000, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 1000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sbuf, 100000, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 100000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sbuf, 10000000, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 10000000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    free(sbuf);
    free(rbuf);
    printf("past finalize at rank %d/%d (on %s)\n", myrank, nranks, myhost);
    exit(0);
}

And run with something like

export OPAL_PREFIX=/path/to/ompinstall
export LD_LIBRARY_PATH="${OPAL_PREFIX}/lib"
$OPAL_PREFIX/bin/mpicc -o x simplempi.c
hosts=hostA:1,hostB:1
$OPAL_PREFIX/bin/mpirun \
    --host $hosts \
    --mca coll ^hcoll \
    --mca pml ob1 --mca btl tcp,self \
    ./x

The stack trace comes up as deep in OMPI code where suddenly it thinks it's calling OMPI's "append_frag_to_ordered_list" but it's actually calling the user Application's redefinition.

[p295482] #3  0x0000000010000dc8 in myprint ()
[p295482] #4  0x0000000010000e34 in append_frag_to_ordered_list ()
[p295482] #5  0x000020000036b368 in mca_pml_ob1_recv_frag_match () from /myinstall/lib/libmpi.so.0
[p295482] #6  0x0000200000d2a658 in mca_btl_tcp_endpoint_recv_handler () from /myinstall/lib/libopen-pal.so.0
[p295482] #7  0x000020000112a58c in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
[p295482] #8  0x000020000112b020 in event_base_loop () from /lib64/libevent_core-2.1.so.6
[p295482] #9  0x0000200000c736b8 in opal_progress_events () from /myinstall/lib/libopen-pal.so.0
[p295482] #10 0x0000200000c73838 in opal_progress () from /myinstall/lib/libopen-pal.so.0
[p295482] #11 0x000020000013d070 in ompi_request_default_wait () from /myinstall/lib/libmpi.so.0
[p295482] #12 0x00002000001a36e8 in PMPI_Sendrecv () from /myinstall/lib/libmpi.so.0
[p295482] #13 0x00000000100010f4 in main ()

@jsquyres
Copy link
Member

See also #8262.

@jjhursey
Copy link
Member

jjhursey commented Sep 6, 2022

Per Open MPI teleconf Sept. 6:

  • This ticket deals primarily with name prefixing.
  • A question about the use of OPAL_DECLSPEC was noted:
    • We need to add something to the developer docs (maybe a paragraph) on its function and intended usage in the Open MPI codebase.
    • Scenario: Shared library DSO would need to use this attribute if they want a symbol to be usable outside of that library/DSO.
    • It exports this symbol outside of the immediate library scope
    • Related to Windows declspec

@naughtont3
Copy link
Contributor

We need to add something to the developer docs (maybe a paragraph) on its function and intended usage in the Open MPI codebase.

See also #10770

@rhc54
Copy link
Contributor

rhc54 commented Sep 6, 2022

Purely FWIW: in PMIx and PRRTE we replaced it with PMIX_EXPORT for the very reasons you are experiencing - the DECLSPEC notation is too confusing/obtuse and developers kept getting it wrong.

@naughtont3
Copy link
Contributor

naughtont3 commented Sep 6, 2022

Purely FWIW: in PMIx and PRRTE we replaced it with PMIX_EXPORT for the very reasons you are experiencing - the DECLSPEC notation is too confusing/obtuse and developers kept getting it wrong.

Sounds like I should eliminate PMIX from the starter doc PR #10770, if for no other reason than to avoid speaking about things beyond just OMPI/OPAL's use of DECLSPEC.

@rhc54
Copy link
Contributor

rhc54 commented Sep 6, 2022

Ah, I hadn't looked at it - yes, that would be good as we eliminated DECLSPEC from our code.

@drwootton drwootton self-assigned this Sep 7, 2022
@drwootton
Copy link
Contributor

I looked at the list of symbols generated by @markalle testcase and tagged them in 3 categories,

  • >> static: Appear to be referenced in only one .c file and can be made static
  • >> 3rdparty/...: Belong to a 3rd party package, which I did not analyze further to see if they could be made static
  • >> symbol_name: Part of ompi or opal, and the right hand name is the proposed new symbol name

After discussion with @gpaulsen, my plan is

  • deal with static symbols in ompi/opal
  • deal with renaming symbols in ompi/opal as noted
  • deal with symbols in treematch, making them static if possible, otherwise rename to treematch_xxx
  • I will not touch the remaining components in 3rdparty

There are about 160 symbols in treematch out of the about 250 symbols in this list

The list is attached
symbols.txt

@drwootton
Copy link
Contributor

In my first pass thru the 3rdparty components, i got the number of symbols that need to be fixed for the following components

  • treematch: 158
  • openpmix: 3
  • romio341: 8

drwootton pushed a commit that referenced this issue Sep 20, 2022
Make opal/ompi symbols used in only one file 'static'

Signed-off-by: David Wootton <[email protected]>
@drwootton
Copy link
Contributor

Part 1 of solving this, making symbols in opal and ompi subdirectories static is solved by pull request #10825

@drwootton
Copy link
Contributor

The changes in the two open pull requests associated with this issue, #10859 and #10863 resolve the remaining incorrect symbols other than the following symbols in the 3rd-party/romio341 component.

[error]   ad_get_env_vars
[error]   file_stats
[error]   romio_onesided_always_rmw
[error]   romio_onesided_inform_rmw
[error]   romio_onesided_no_rmw
[error]   romio_read_aggmethod
[error]   romio_tunegather
[error]   romio_write_aggmethod

The test script complains about some symbols, which I renamed treematch_* since the script is not expecting treematch as a valid prefix, but that's not a problem these symbols are renamed as expected.

jjhursey added a commit that referenced this issue Oct 19, 2022
Part 2 of resolving global symbol name pollution issue #10708
gpaulsen added a commit that referenced this issue Oct 20, 2022
Part 1 of resolving global symbol name pollution issue #10708
awlauria added a commit that referenced this issue Oct 24, 2022
Part 2 of resolving global symbol name pollution issue #10708 for OMPI v5.0.x
jjhursey added a commit that referenced this issue Oct 25, 2022
Part 3 of resolving global symbol name pollution #10708 - treematch
drwootton pushed a commit that referenced this issue Oct 25, 2022
Fix review comments

Signed-off-by: David Wootton <[email protected]>
(cherry picked from commit 2a95bb9)
@drwootton
Copy link
Contributor

drwootton commented Oct 25, 2022

Once pull requests #10825 #10859 and #10863 were merged into the main branch, the test case output listing improperly named external symbols is

Checking for bad symbol names:
*** checking /home/dave/ompi/lib/libopen-pal.so.80
*** checking /home/dave/ompi/lib/libmpi.so.80
    [error]   ad_get_env_vars
    [error]   file_stats
    [error]   romio_onesided_always_rmw
    [error]   romio_onesided_inform_rmw
    [error]   romio_onesided_no_rmw
    [error]   romio_read_aggmethod
    [error]   romio_tunegather
    [error]   romio_write_aggmethod
*** checking /home/dave/ompi/lib/libmpi_mpifh.so
*** checking /home/dave/ompi/lib/libmpi_usempif08.so
*** checking /home/dave/ompi/lib/libmpi_usempi_ignore_tkr.so

Corresponding pull requests for the V5.0.x branch are #10936 #10956 and #10981

drwootton pushed a commit that referenced this issue Oct 25, 2022
Fix review comments

Signed-off-by: David Wootton <[email protected]>
(cherry picked from commit 2a95bb9)
@jjhursey
Copy link
Member

All the remaining symbols are from ROMIO, so we intentionally won't change them. So I think this ticket is complete for main.

After PR #10981 is merged into v5.0.x then we should re-run the check and confirm that it is also clean.

gpaulsen added a commit that referenced this issue Oct 26, 2022
Part 3 of resolving global symbol name pollution #10708 for V5.0.x
MamziB pushed a commit to MamziB/ompi that referenced this issue Oct 26, 2022
…ematch

Fix review comments

Signed-off-by: David Wootton <[email protected]>
MamziB pushed a commit to MamziB/ompi that referenced this issue Oct 27, 2022
…ematch

Fix review comments

Signed-off-by: David Wootton <[email protected]>
@awlauria
Copy link
Contributor

awlauria commented Oct 28, 2022

Closing - @drwootton please post results of v5.0.x here when you have a chance. If those reveal some more items we can re-open. Thanks for pushing on this!

@drwootton
Copy link
Contributor

@awlauria This is the v5.0.x output after all pull requests were merged
Checking for bad symbol names:
*** checking /u/dwootton/ompi/lib/libmpi.so.80
[error] ad_get_env_vars
[error] file_stats
[error] romio_onesided_always_rmw
[error] romio_onesided_inform_rmw
[error] romio_onesided_no_rmw
[error] romio_read_aggmethod
[error] romio_tunegather
[error] romio_write_aggmethod
*** checking /u/dwootton/ompi/lib/libhwloc.so.15
*** checking /u/dwootton/ompi/lib/libopen-pal.so.80
*** checking /u/dwootton/ompi/lib/libpmix.so.0
[error] fc_pair_t_class
[error] file_caddy_t_class
[error] file_tracker_t_class
[error] heartbeat_caddy_t_class
*** checking /u/dwootton/ompi/lib/libmpi_mpifh.so
*** checking /u/dwootton/ompi/lib/libmpi_usempif08.so
*** checking /u/dwootton/ompi/lib/libmpi_usempi_ignore_tkr.so

@jjhursey
Copy link
Member

It would be nifty if someone had the time to setup a CI test to make sure we don't regress here.

yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024
Make opal/ompi symbols used in only one file 'static'

Fix review comments

Signed-off-by: David Wootton <[email protected]>
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024
Rename external symbols with proper ompi_* or opal_* component prefixes

Fix review comments

Resolve remaining incorrect external symbols and clean up unused functions

Rebase to current main

External functions should not have _mca_ as part of their function name

Signed-off-by: David Wootton <[email protected]>
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024
…ematch

Fix review comments

Signed-off-by: David Wootton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants