libmpitrace : MPI profiling library
This directory has code for an MPI wrapper library : libmpitrace.so .

To build, type ./configure.sh ... and if that succeeds, type make. The script just checks that you have mpicc in your PATH, and that the underlying C compiler is gcc. If both checks pass, the script copies "makefile.in" to "makefile", and then "make" should build the wrapper library. By using your system's gcc compiler, the resulting libmpitrace.so should be compatible with all other compilers on your system.

To use libmpitrace.so :

  (1) on most systems, add one line to your job script :
        export LD_PRELOAD=/path/to/libmpitrace.so
      for IBM Spectrum MPI :
        export OMPI_LD_PRELOAD_POSTPEND=/path/to/libmpitrace.so
  (2) run your code as normal (example : mpirun -np 32 your.exe ...)
  (3) unset LD_PRELOAD (or unset OMPI_LD_PRELOAD_POSTPEND)

The wrapper library intercepts MPI calls and prints a summary in text files : mpi_profile.jobid.rank. Optional features are controlled via environment variables. A list is included in the text file env_variables.txt, and the basic features are described below.

It is recommended to test your build of libmpitrace.so using the examples provided in the "examples" directory, or with your own codes. Extra care may be required for codes that call MPI from a mix of languages that includes Fortran and C or C++, because some MPI implementations handle the profiling interfaces for Fortran and C in ways that could result in double-counting or missed function calls. There are always solutions for such cases, but those solutions are not included in the present repository.

Most Linux systems use shared libraries for MPI. If your system uses statically linked executables, you need to build a static library, libmpitrace.a, and link with it when you build your executable file. For that path, type "make static" to build a static libmpitrace.a .

Note about thread safety : most applications make calls to MPI routines from the master thread. For efficiency, this MPI wrapper library is not thread-safe unless you specifically request it. The library can be made thread-safe by turning on -DUSE_LOCKS as a compiler option in the makefile. Doing that adds mutex lock/unlock calls around updates to static data. This adds overhead, and the resulting profile data will be cumulative over all threads that make MPI calls. The resulting library is thread-safe, but interpretation will be nearly impossible when multiple threads make concurrent calls to MPI, because the total time accumulated in MPI is bounded by the number of threads times the elapsed time, not by the elapsed time for the job. There is no facility for tracking MPI data on a per-thread basis.

========================================================================
Controlling the profiled code region :

By default, libmpitrace.so starts tracking MPI calls when your application calls MPI_Init() and stops with MPI_Finalize(). You can optionally add calls to MPI_Pcontrol() to limit data collection to a specific code block :

  C / C++  : MPI_Pcontrol(1); ... code block ...; MPI_Pcontrol(0);
  Fortran  : call mpi_pcontrol(1); ... code block ...; call mpi_pcontrol(0)

The MPI_Pcontrol() routine is part of the MPI specification, and it is there to provide control for profiling tools like libmpitrace.so. These calls have no effect in the normal MPI library, so it is safe to add them as desired.
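As a point of reference, here is a minimal C sketch of this kind of instrumentation. The allreduce loop is only a placeholder for real application work, not code from this repository :

  /* Minimal sketch : limit profiling to one code block with MPI_Pcontrol().
     The "work" loop below is a placeholder for your application code.     */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, nranks, iter;
      double local = 1.0, global = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      /* MPI calls made before MPI_Pcontrol(1) are not profiled */
      MPI_Barrier(MPI_COMM_WORLD);

      MPI_Pcontrol(1);                    /* start collecting profile data */
      for (iter = 0; iter < 100; iter++) {
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      }
      MPI_Pcontrol(0);                    /* stop collecting profile data  */

      if (rank == 0) printf("sum over %d ranks = %.1f\n", nranks, global);
      MPI_Finalize();
      return 0;
  }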
========================================================================
Output control :

The normal outputs for libmpitrace.so are just a few text files : mpi_profile.jobid.rank for rank 0 and for the ranks with the min, max, and median times in MPI. This limits the number of output files and provides details for several ranks of interest. You can control the ranks that produce output :

  export SAVE_LIST=0,10,20,30   (explicit list of ranks to save)
  export SAVE_ALL_TASKS=yes     (saves output from every rank)

If you use a batch system and have multiple MPI job-steps within the same batch job, the jobid is not sufficient to uniquely label the MPI profile outputs. You can optionally add a timestamp to the names of the output files :

  export ADD_TIMESTAMP=yes

The output files are normally written to the working directory of the MPI ranks that write output. You can optionally specify an output directory :

  export TRACE_DIR=/path/to/your/mpi_profile/output

========================================================================
Communication patterns :

You can optionally track point-to-point messages by destination rank, to get an idea of the overall communication pattern :

  export TRACE_SEND_PATTERN=yes

This produces pattern files for the ranks that output data, plus a sparse matrix that lists the number of bytes and the number of messages sent from each rank to all of its destination ranks. There is a utility, readsparse, that reads the sparse matrix data and writes out text ... see the utility directory. Tracking communication patterns adds some overhead to the MPI wrappers.

========================================================================
Associating MPI calls with source file / line-number :

You can optionally associate MPI calls with their call sites in the application by setting an environment variable :

  export PROFILE_BY_CALL_SITE=yes

For this to be effective, you need -g as a compile option when you build your code. When this option is enabled, the MPI wrapper library saves the instruction address for every MPI call, and prints that information in the mpi_profile outputs. You then use the GNU addr2line utility to translate from instruction address to source file and line number. You can do this one address at a time, or all in one shot :

  grep "site address" mpi_profile.jobid.rank | awk '{print $10}' | addr2line -e your.exe

Sometimes you might want to track an address one or more levels up the call stack, instead of the specific call site of the MPI function. You can do that with an environment variable :

  export TRACEBACK_LEVEL=1   (or 2, or 3, etc.)

========================================================================
Adding barriers to shed light on load imbalance :

It is very common that much of the time spent in MPI is time waiting for other processes to finish their work. This effect frequently shows up in collective MPI routines. You can get a handle on the synchronization time by adding a barrier before the collective call, controlled by environment variables :

  export COLLECTIVE_BARRIER=yes             (adds a barrier before all collective calls)
  export COLLECTIVE_BARRIER=MPI_Allreduce   (adds a barrier before allreduce)

More generally, you can explicitly list the collective routines where you want to insert barriers; MPI_Allreduce above is just one example.
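To illustrate the kind of imbalance this option exposes, here is a small hypothetical C sketch in which higher ranks do more artificial work before an MPI_Allreduce. Run normally, the fast ranks accumulate waiting time inside MPI_Allreduce; run with COLLECTIVE_BARRIER=yes, most of that waiting time moves to the inserted barrier, so the profile separates synchronization time from the reduction itself :

  /* Sketch of an imbalanced collective : ranks do unequal amounts of work,
     so fast ranks wait inside MPI_Allreduce for the slow ones. Running with
     COLLECTIVE_BARRIER=yes moves that waiting time into an inserted barrier. */
  #include <mpi.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
      int rank, nranks;
      double local, global;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      /* artificial imbalance : rank r "computes" for (r mod 8)*100 milliseconds */
      usleep((rank % 8) * 100000);
      local = (double) rank;

      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }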
========================================================================
Profiling by communicator size :

With collective operations, you might want to separate out the time spent in different communicators. For technical reasons, it is simpler to sort by the size of the communicator. You can get that information by setting an environment variable :

  export PROFILE_BY_COMMUNICATOR=yes

The output will then contain sections for each distinct communicator size, as well as a section that is inclusive of all communicators.

========================================================================
Time-window profiling :

The MPI wrapper library normally writes output when the application calls MPI_Finalize. Sometimes it is convenient to have a method that can collect data in a predefined time window, and write it out as soon as the time window closes. You can do that by setting a few environment variables :

  export SAVE_LIST=0,10,20,30     (arbitrary list of ranks)
  export PROFILE_BEGIN_TIME=100   (start collecting data 100 sec after MPI_Init)
  export PROFILE_END_TIME=150     (stop collecting data 150 sec after MPI_Init)

This method works by checking the elapsed time every time an MPI function is called. As a result, it works best when MPI calls occur fairly frequently on the ranks of interest. With this method you must explicitly list the ranks that will produce output, via the SAVE_LIST environment variable.

========================================================================
Event tracing :

The libmpitrace.so library includes support for collecting a timestamped history of all MPI calls for graphical display. A visualization tool is required to display the event records, using the traditional layout where the x-axis is time and the y-axis is MPI rank ... see the traceview directory. This tool is designed to provide simple and efficient visualization ... it is not a comprehensive trace-analysis tool.

The trace files produced by libmpitrace.so are binary files containing a concatenation of records, 48 bytes per record. For successful visualization, one should limit the total volume of trace data to less than a few GiB. A relatively small buffer is reserved in memory to hold trace data for each MPI rank : enough for 10^5 MPI calls per rank, which would total ~1.1 GiB for 256 MPI ranks. You can optionally set the trace buffer size, but it is often best to limit trace collection in time and by MPI rank to keep the volume of data at a manageable level.

The current default is to write out trace data for ranks 0-255, or fewer if your job has fewer than 256 MPI ranks. You can set environment variables to request trace collection from a contiguous block of ranks :

  export TRACE_MIN_RANK=min   (default is 0)
  export TRACE_MAX_RANK=max   (default is 255)

or you can request trace collection from all MPI ranks :

  export TRACE_ALL_TASKS=yes

To collect trace data for a specific block of code, you can add calls to MPI_Pcontrol() with specific arguments, 101 to start and 100 to stop :

  C / C++  : MPI_Pcontrol(101); ... code block ...; MPI_Pcontrol(100);
  Fortran  : call mpi_pcontrol(101); ... code block ...; call mpi_pcontrol(100)

If you have already instrumented your code with MPI_Pcontrol(1/0) to specify a region for cumulative MPI profile data, you can optionally enable event tracing for the same region by setting :

  export ENABLE_TRACING=yes   (default is no)
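A minimal C sketch of a traced region follows; the ring exchange is only a placeholder for real application communication, not code from this repository :

  /* Sketch : collect event-trace data only for a specific code block, using
     MPI_Pcontrol(101) to start tracing and MPI_Pcontrol(100) to stop.       */
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, nranks, iter;
      double sendbuf = 1.0, recvbuf = 0.0;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      MPI_Pcontrol(101);                  /* start event tracing */
      for (iter = 0; iter < 10; iter++) {
          int right = (rank + 1) % nranks;            /* send to the right  */
          int left  = (rank + nranks - 1) % nranks;   /* receive from left  */
          MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, right, 99,
                       &recvbuf, 1, MPI_DOUBLE, left,  99,
                       MPI_COMM_WORLD, &status);
      }
      MPI_Pcontrol(100);                  /* stop event tracing  */

      MPI_Finalize();
      return 0;
  }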
You can optionally enable tracing in a specified time window. For example, to start tracing 100 seconds after MPI_Init and stop 120 seconds after MPI_Init, you can set these variables :

  export TRACE_BEGIN_TIME=100   (integer number of seconds after MPI_Init)
  export TRACE_END_TIME=120

This time-window method can be very convenient because it does not require code modification.

To control the trace buffer size :

  export TRACE_BUFFER_SIZE=value_in_bytes   (default is 4800000 bytes)

Some applications make huge numbers of MPI calls to certain routines. For example, MPI_Iprobe() may be called in a loop, waiting for messages to arrive. That kind of scenario would quickly overflow any reasonable trace buffer, so there is a mechanism to disable tracing for a specified list of MPI routines. For example :

  export TRACE_DISABLE_LIST=MPI_Iprobe,MPI_Comm_rank   (comma-separated list)

The trace data is written to a single file when the application calls MPI_Finalize(). This file is usually named jobid.trc. The same output controls that apply to the MPI profile data apply to the trace file : you can optionally add a timestamp to the file name, and you can specify an output directory by setting the env variable TRACE_DIR. For trace visualization, it is best to transfer the trace file to your workstation or laptop, to ensure fast local rendering. For more information on the trace visualization tool, look in the traceview directory.
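For illustration, here is a hypothetical C sketch of the kind of polling loop that motivates TRACE_DISABLE_LIST. Run with at least two ranks, rank 0 can make a very large number of MPI_Iprobe calls before the message from rank 1 arrives, and each of those calls would otherwise become a trace record :

  /* Sketch of a polling loop that can overflow a trace buffer : rank 0 polls
     with MPI_Iprobe until a message arrives. Setting
     TRACE_DISABLE_LIST=MPI_Iprobe keeps those calls out of the event trace
     while everything else is still traced. Run with at least 2 ranks.       */
  #include <mpi.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
      int rank, flag = 0, payload = 42;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* poll until rank 1 sends; each MPI_Iprobe call is one MPI event */
          while (!flag) MPI_Iprobe(1, 0, MPI_COMM_WORLD, &flag, &status);
          MPI_Recv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
      } else if (rank == 1) {
          sleep(1);  /* delay so that rank 0 spins in the polling loop */
          MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }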