Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial symbol versioning #1955

Closed
wants to merge 1 commit into from

Conversation

amckinstry
Copy link

Attached is an initial patch implementing symbol versioning.

@lanl-ompi
Copy link
Contributor

Test FAILed.

@rhc54
Copy link
Contributor

rhc54 commented Aug 12, 2016

Interesting - I'm not claiming to grok all of this, but one thing it definitely shows is that we are exposing a lot of internal symbols that shouldn't be exposed. I'd say we have a lot of cleanup to do! This is good timing as we have a developer's meeting next week, so I'll put this on the agenda.

Thanks!

@lanl-ompi
Copy link
Contributor

Test FAILed.

@hppritcha
Copy link
Member

What is the motivation for this PR?

@amckinstry
Copy link
Author

Maintainability in distributions, specifically Debian. See the notes attached to dup issue #1956.

As a library grows in usage and importance, the practice of "rebuild everything on upgrade of SO Version number" becomes untenable, so symbol versioning was invented. The best example of this is GNU glibc, which has been at version 6 and holding for decades now, despite ABI upgrades.

Current practice basically requires a "recompile and test everything" approach, which in the case of a downstream distro like Debian / Fedora, etc. translates to "nothing works until everything works": moving from openmpi 1.10.3 to openmpi 2.0 (without sym, versioning) means removing all existing, working MPI software from the integration suite (distribution), re-including it over time as it builds and is tested with openmpi2. This breaks hundreds (soon thousands?) of packages for 4-8 weeks while everyone drops everything to upgrade and test against the new openmpi. It makes simultaneous development of those packages impossible.

@jsquyres
Copy link
Member

@amckinstry Thank you for putting together this patch! (and the supporting documentation -- wow!) This makes it much easier to discuss the issue / understand what is being proposed.

A few notes:

  1. With the the current Open MPI versioning scheme, we have guaranteed that all releases of Open MPI v2.x will be backwards compatible with prior Open MPI v2.x releases.
    • Specifically: if you compile your app/library against Open MPI v2.x.y, it will still run correctly with Open MPI v2.a.b (where a.b >= x.y).
    • Hence, we will not be increasing the major .so number (via the Libtool c:r:a scheme) of the the MPI bindings libraries or the OSHMEM bindings libraries.
    • However, this does not mean that mpirun from Open MPI v2.x.y will work with an installation of Open MPI v2.a.b (where a.b != x.y). Specifically: the version of Open MPI used must be the same across run-time instances and all MPI processes in a job.
    • Here's the primary use case we want to support:
      1. Have Open MPI v2.x.y installed on all servers.
      2. Compile my_mpi_app against Open MPI v2.x.y.
      3. Run my_mpi_app on all servers; see that it works properly.
      4. Upgrade Open MPI to v2.a.b (where a.b > x.y) on all servers (to include mpirun and all supporting header files and libraries)
      5. Run my_mpi_app (without recompiling or relinking) on all servers; see that it (still) works properly.
    • Here's a use case that we do not guarantee:
      1. Have Open MPI v2.x.y installed on all servers.
      2. Compile my_mpi_app against Open MPI v2.x.y.
      3. Run my_mpi_app on all servers; see that it works properly.
      4. Upgrade Open MPI to v2.a.b (where a.b > x.y) on only some of the servers
      5. Run my_mpi_app on all servers; it may or may not work
  2. From my understanding of your patch, the main issue that it solves is unintentional leakage of unnecessary symbols to the user. I.e., we should only be exposing the MPI and OSHMEM APIs (and whatever underlying "glue" symbols may be necessary, e.g., the MPI_* constants are actually #defines, so the actual underlying symbols must be exposed, even though they are technically not prefixed with mpi_ or MPI_).
  3. I am surprised to see some symbols listed in your lists -- did you simply check for public symbols in the library from ldd, or did those get generated out of necessity? I.e., are all of those needed -- or just what are currently public?
    • I suspect that those are just the current public symbols, and we need to do some cleaning up (i.e., not all of those should be public). Can you confirm?
  4. Can we really list C++ symbols like you did? I don't think that will work for different C++ compilers, since the symbol munging mechanisms are different. Is there a way to list the actual C++ source code name for the symbol?
    • Ditto for Fortran.

@jsquyres
Copy link
Member

I note two minor problems with this PR:

  1. make distcheck fails because the .map files are not found. Perhaps they weren't included in the tarball...?
  2. make check fails because some of our tests intentionally use internal symbols. A quick fix would be the also add those symbols to the map to expose them (which is apparently what we did with visibility, anyway). Here's one example -- I don't know if there are others:
09:40:22 external32.o: In function `pack_unpack_datatype':
09:40:22 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/test/datatype/external32.c:119: undefined reference to `ompi_datatype_pack_external_size'
09:40:22 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/test/datatype/external32.c:126: undefined reference to `ompi_datatype_pack_external'
09:40:22 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/test/datatype/external32.c:137: undefined reference to `ompi_datatype_unpack_external'
09:40:22 collect2: error: ld returned 1 exit status

@jsquyres jsquyres added this to the v2.x milestone Aug 16, 2016
@amckinstry
Copy link
Author

@jsquyres On (2), the primary benefit for us (Debian + Ubuntu in particular, distro maintainers in general) is that the SOVERSION on the MPI and OSHMEM libraries does not change. In moving from OpenMPI 2.x.y to 2.a.b the version number of the main libs remains 20.. . For internal libraries, open-rte, etc. this is not important, and I expect these will change over 2.x.

This is important because otherwise integrating a new OpenMPI release requires rebuilding and testing all linked libraries and applications as a single piece. (An upgrade of OpenMPI that changes libmpi.so.20 -> libmpi.so.21 means that version20 drops from the integration suite, breaking all binaries that depend on it). This requires a freeze (typically weeks) of all non-related development on the dozens of affected packages, something we dearly wish to avoid.

On the selection of symbols: this is in need of review. I started by whitelisting all MPI__, mpi__, ompi_* , PMPI_* as being assumed ok, then included others such as comm__, netpatterns__ namespaces as necessary. Some seem to be 'leaks': roundup_to_power_radix for example almost certainly shouldn't be public, but OpenMPI fails to compile (make check) unless its visible. As you note, the testing uses hidden internal symbols; maybe a refactor is needed.

On C++ / Fortran names, the mangling scheme is compiler-dependent, while symbol versioning is done at the linker level, so I don't think its possible. It may be necessary to include the mangled versions for each compiler :-(

I copied over by-hand the files from my local git to a github branch, so did I miss something? I do need to do ./autogen.pl to include auto* changes.

@jsquyres
Copy link
Member

@amckinstry 2 issues:

Intent of this PR

Gotcha. I understand that the intent is that you can upgrade Open MPI in Debian/Ubuntu/etc. from 2.0.x to 2.a.b and not have to recompile all packages that are dependent upon it. This is definitely a highly desirable goal. Although I'll say: you should still test the apps that are dependent upon Open MPI when Open MPI is upgraded. 😄

(...after thinking about this for a minute...)

I'm still not sure I understand -- can you clarify for me?

The symbol versioning in this PR is orthogonal to the shared library version number, right?

E.g., if we bump the Libtool c:r:a to 3:0:0, the symbol versioning in this PR won't save you from having to recompile / relink all packages that are dependent upon Open MPI, right?

Auditing of symbols / make check / etc.

I didn't look closely; I'll start a build of this PR and have a look.

@amckinstry
Copy link
Author

The key point is that on upgrade, the linking foo -> libmpi.so.20 should not break.

While MPI / OpenMPI has a better upgrade path and record than most (and expectation that the API will not change incompatibly), note that if an "opaque" struct like MPI_Comm / MPI_Win, etc. were to change 'internally' , the ABI would break and the major version number would need to change, unless symbol versioning were used. So versioning is the "answer" to the problem rather than orthogonal.

On testing, yes :-) we do regular re-compilation and test runs of the whole archive.

@hppritcha
Copy link
Member

bot:lanl:retest

@lanl-ompi
Copy link
Contributor

Test FAILed.

@jsquyres
Copy link
Member

@amckinstry We're actually together for a face-to-face Open MPI engineering meeting this week (https://github.com/open-mpi/ompi/wiki/Meeting-2016-08), and a bunch of us talked about this in detail yesterday.

We think there are 3 issues getting conflated here:

  1. Symbol visibility.
    • Per this PR, we clearly have too many symbols exported. 😦 We need to / will fix that.
    • Note that we do have a symbol visibility mechanism in place -- we apparently just got a little sloppy with it.
  2. Library .so version number.
    • Per Open MPI's versioning scheme, we will ensure that backwards compatibility (which includes ABI) is preserved from v2.x.y to v2.a.b (where a.b > x.y).
    • It seems like this is your highest priority, and one we definitely need to support properly.
  3. Symbol versioning.
    • Doing symbol versioning is one way of cleaning up the visibility issue (i.e., hiding all non-versioned symbols), but our conclusion was that we should just clean up the mechanism we already have.
    • After much discussion, we could not come up with a case where we will need to version MPI or OSHMEM APIs. Specifically: the MPI and OSHMEM APIs are tightly regulated their their own respective specification documents. The standards bodies who maintain these spec documents pay very, very close attention to backwards compatibility. For example, they're absolutely not going to change the arguments/types to MPI_Send(...) between releases of the spec.
    • Finally, symbol versioning is problematic for the MPI profiling interface (i.e., where users can change their LD_LIBRARY_PATH and/or LD_PRELOAD a different MPI library to run-time replace the MPI_*(...) symbols with their own).

That first two bullet points under 3 are probably subjective, and could be argued. E.g., past behavior from the MPI / OSHMEM standards bodies is not necessarily indicative of future behavior. But the last point -- that it interferes with the MPI profiling interface -- is probably the clincher, and the definitive reason why we can't use symbol versioning for the MPI / OSHMEM APIs.

All this being said, let's bring @opoplawski into the conversation -- he's the Fedora packager for Open MPI. Orion: do you have any thoughts on this?

@amckinstry
Copy link
Author

Ok, backing up to (2), and leaving aside symbol versioning:

I understand that the MPI and OSHMEM APIs are tightly regulated and won't change in 2.*, but are the ABIs? For the .so version number to remain constant, the internals of opaque data types that are passed also need to remain constant. Things like ompi_status_public_t make me assume no.

@jsquyres
Copy link
Member

You are correct -- the respective standards bodies do not govern the ABIs of the MPI and OSHMEM APIs.

It's our job as a community to ensure that we maintain our backwards compatibility promises per our version numbering scheme.

And just to be clear: the .so versions will change if the libraries change in successive releases (and if the libraries don't change, the .so version will be identical between successive releases). We update the .so version numbers according to the GNU Libtool guidelines, so the all-important use case is maintained:

  • compile MPI application against Open MPI v2.x.y
  • run, see that it works properly
  • upgrade Open MPI to v2.a.b (where a.b > x.y)
  • update LD_LIBRARY_PATH to point to Open MPI v2.a.b installation
  • run same MPI application (without recompiling or relinking), see that it works properly

(forgive me for being a bit pedantic here -- I just want to make sure that we're all 100% agreeing on precisely what we're talking about)

@opoplawski
Copy link
Contributor

Some comments:

  • You can change the internals of opaque structures without breaking ABI - that's the point of them. As long as code outside of the library doesn't access or know the internals. But ompi_status_public_t is not an opaque type as its contents are known. You could probably append to the structure though without breaking ABI.
  • There are a couple tools available for checking ABI compatibility, see https://fedoraproject.org/wiki/How_to_check_for_ABI_changes_in_a_package
  • Symbol versioning is helpful with making sure that applications compiled against newer versions of the library (with newer symbols) do not attempt to run against older versions of the library (without the needed symbol). rpm can detect these and build them in to the package requires/provides.

@amckinstry
Copy link
Author

@jsquyres thanks for being pedantic. I expect the soversion to change between major releases, eg 2.* -> 3.* . The libtool guidelines point out when it is necessary to increment the version number, the point of the symbol versioning was a technology to make it not necessary (as shown with glibc).

ABI stability is not required by the MPI standard, as you point out; its a decision within the OpenMPI developers, hence my question. Are there agreed policies//expected changes to opaque objects over the 2.* timeframe?

@opoplawski : Not quite true on opaque objects: if they change in size then pre-existing binaries that are expecting opaque objects will crash if a new object of a (e.g larger) size arrives on the stack (or if you were keeping a copy of the object locally in a user struct, etc.) This can be avoided by setting a policy that they won't change size, or adding padding into structs for future expansion). Tools are useful to check this, thanks for the pointers.

On symbol versioning, I think you're confusing library versioning with symbol versioning. The aim is to ensure that binaries built against old library versions will continue to work when a new version is put in place, despite the behaviour of a method or size of a struct having changed, because multiple versions of a symbol (method) will be shipped in the same library. This enables eg. backports, where we can drop in a new version of a library on to a running system, providing new functionality / bugfixes without breaking existing binaries (including user binaries as well as stuff shipped by Debian/ Fedora, etc.).

@amckinstry
Copy link
Author

@jsquyres
Thinking about this while cycling home, I realised: public symbols in OMPI/OSHMEM are likely to be stable for the next 2-3 years, and hence SOVERSION remain 20, except for a potential cleanup you mentioned above!

This matters for a potential transition (20 -> 21) on the timetable for Debian stretch (9.0). For this next release, the deadline for completing transitions is Nov 2016, so if there is a 2.1 release before then with symbol cleanup, it might not make it in.
Do you have any plans at this stage?

bosilca added a commit that referenced this pull request Aug 20, 2016
@bosilca
Copy link
Member

bosilca commented Aug 20, 2016

I went over all the others from the MPI library, and fixed the one I could (e8425eb and b96ec77). For all the others, with the exception of the netpatterns_* and comm_*_pml (both related the ML collective component) they are all legit and must be exposed (so that our own .so can find them).

bosilca added a commit to bosilca/ompi that referenced this pull request Oct 3, 2016
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Signed-off-by line to this PR's commit.

(we just voted this policy in today as a community, so I'm adding this request for changes to all pending PRs)

@ibm-ompi
Copy link

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/6e3fe3e4b03936cb4a2d1c479625e1ac

@jsquyres jsquyres modified the milestones: v2.x, v3.0.0 Dec 5, 2016
@jsquyres jsquyres removed this from the v2.x milestone Dec 5, 2016
@hppritcha hppritcha modified the milestones: v3.1.0, v3.0.0 Mar 14, 2017
@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/89876d57a2677efc2f124b3c0f47076f

@bwbarrett bwbarrett modified the milestones: Future, v3.1.0 Sep 12, 2017
@rhc54
Copy link
Contributor

rhc54 commented Oct 3, 2017

I believe this PR is stale - IBM contributed a test for non-hidden variables, and we have scrubbed the code base to the extent possible (we cannot hide everything as our plugins need to see core library symbols). So I'm going to close this one now - please reopen if you feel something more needs to be done.

@rhc54 rhc54 closed this Oct 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants