Add profiling with perf chapter OCaml manual

This covers the use of Linux perf to profile OCaml native code including the overlap with frame pointers and DWARF CFI information
tmcgilchrist · Jan 22, 2025 · cff15cf · cff15cf
1 parent 137dd26
commit cff15cf
Show file tree

Hide file tree

Showing 6 changed files with 218 additions and 5 deletions.
diff --git a/Changes b/Changes
@@ -182,6 +182,9 @@ Working version
 - #13694: Fix name for caml_hash_variant in the C interface.
   (Michael Hendricks)
 
+- #?????: Document support for profiling with Linux perf and frame pointers.
+  (Tim McGilchrist, review by ???)
+
 ### Compiler user-interface and warnings:
 
 - #13428: support dump=[source | parsetree | lambda | ... | cmm | ...]

diff --git a/manual/README.md b/manual/README.md
@@ -85,6 +85,8 @@ chapters (or sometimes sections) are mapped to a distinct `.etex` file:
     - Optimisation with Flambda: `flambda.etex`
     - Fuzzing with afl-fuzz: `afl-fuzz.etex`
     - Runtime tracing with Runtime_events: `runtime_tracing.etex`
+    - The “Tail Modulo Constructor” program transformation: `tail-mod-cons.etex`
+    - Runtime detection of data races with ThreadSanitizer: `tsan.etex`
 
 Note that ocamlc,ocamlopt and the toplevel options overlap a lot.
 Consequently, these options are described together in the file
@@ -104,6 +106,8 @@ of `unified-options.etex` contains the relevant information.
     - The runtime_events library: `libruntime_events.etex`
     - The dynlink library: dynamic loading and linking of object files:
       `libdynlink.etex`
+    - Recently removed or moved libraries (Graphics, Bigarray, Num, LablTk):
+      `old.etex`
 
 Latex extensions
 ----------------

diff --git a/manual/src/allfiles.etex b/manual/src/allfiles.etex
@@ -70,6 +70,7 @@ and as a
 \input{ocamldoc.tex}
 \input{debugger.tex}
 \input{profil.tex}
+\input{profile-perf.tex}
 \input{intf-c.tex}
 \input{flambda.tex}
 \input{afl-fuzz.tex}

diff --git a/manual/src/cmds/Makefile b/manual/src/cmds/Makefile
@@ -12,7 +12,8 @@ TRANSF = $(OCAMLRUN) $(TOOLS)/transf
 FILES = comp.tex top.tex runtime.tex native.tex lexyacc.tex intf-c.tex \
   ocamldep.tex profil.tex debugger.tex ocamldoc.tex \
   warnings-help.tex flambda.tex tail-mod-cons.tex \
-  afl-fuzz.tex runtime-tracing.tex unified-options.tex tsan.tex
+  afl-fuzz.tex runtime-tracing.tex unified-options.tex tsan.tex \
+  profile-perf.tex
 
 etex-files: $(FILES)
 all: $(FILES)

diff --git a/manual/src/cmds/profil.etex b/manual/src/cmds/profil.etex
@@ -140,7 +140,5 @@ Display a short usage summary and exit.
 
 Profiling with "ocamlprof" only records execution counts, not the actual
 time spent within each function. There is currently no way to perform
-time profiling on bytecode programs generated by "ocamlc".  For time
-profiling of native code, users are recommended to use standard tools
-such as perf (on Linux), Instruments (on macOS) and DTrace.  Profiling
-with "gprof" is no longer supported.
+time profiling on bytecode programs generated by "ocamlc". For time profiling
+of native code (see chapter~\ref{c:profiler-perf}). 
diff --git a/manual/src/cmds/profile-perf.etex b/manual/src/cmds/profile-perf.etex
@@ -0,0 +1,206 @@
+\chapter{Profiling (perf)} \label{c:profiler-perf}
+%HEVEA\cutname{profiler-perf.html}
+
+This chapter describes how to use \texttt{perf} to profile OCaml programs.
+
+Linux Performance Events (\texttt{perf (1)}) is a suite of tools for performance observability. The main features covered here are \texttt{perf-record(1)} for recording events, and \texttt{perf-report(1)} for printing and visualising recorded events. \texttt{perf} has many additional profiling and visualising features. For more comprehensive documentation, see (\texttt{perf(1)}, \href{https://perfwiki.github.io/main/}{\texttt{perf} wiki} or \href{https://www.brendangregg.com/perf.html}{Brendan Gregg's Blog}).
+
+\section{s:ocamlperf-call-graph}{Background}
+
+CPU profiling is typically performed by sampling the CPU call graph at frequent intervals to gather statistics on the code paths that are consuming CPU resources. To profile OCaml code, \texttt{perf} needs to understand the call graph of OCaml. \texttt{perf} supports multiple options for recording call graphs:
+\begin{itemize}
+\item Frame Pointers, which is the default.
+\item DWARF's Call Frame Information (CFI).
+\item Hardware Last Branch Record (LBR).
+\end{itemize}
+
+Frame pointer based call graphs use a convention where a register called the frame pointer holds the address for the beginning of the stack frame, and the previous values for the frame pointer are stored at a known offset from the current frame pointer. Using this information the call graph for the current function can be calculated purely based on the current stack, a process called unwinding. On x86_64 the register use for storing the frame pointer is \$rbp, while ARM64 uses the register x29. OCaml 5 introduced non-contiguous stacks as part of the implementation of effects, see \href{https://dl.acm.org/doi/10.1145/3453483.3454039}{Retrofitting effect handlers onto OCaml} (Section 5.5), which work better with the copying nature of perf and frame pointers.
+
+DWARF based call graphs use the DWARF CFI information to perform unwinding. However this interacts poorly with the copying nature of perf, often leading to truncated call graphs where not enough of the stack has been copied by \texttt{perf}. It also produces larger trace files that are more costly to capture and process. Finally it requires including CFI debugging information in your program resulting in larger binaries.
+
+Hardware Last Branch Record (LBR) uses a processor provided method to record call graphs. This has the limitations of restricted availability (only on certain Intel CPUs) and a limited stack depth (16 on Haswell and 32 since Skylake).
+
+Of these options, frame pointers are recommended for profiling OCaml code as it has the following advantages:
+\begin{itemize}
+\item Unwinding is faster to calculate.
+\item Tracing data produced is smaller.
+\item Frame pointers provide more complete call graphs, particularly when used with a Linux distribution that supports them.
+\item Frame pointers work better with perf's copying nature and OCaml 5's non-contiguous stacks.
+\end{itemize}
+
+\section{s:ocamlperf-compiling}{Compiling for Profiling}
+
+The process for compiling an executable for CPU profiling depends on the OCaml version. For OCaml versions 4.14 and earlier, either frame pointers or DWARF can be used, while for OCaml 5.0 and later, enabling frame pointers is recommended.
+
+To enable frame pointers, configure the compiler with \texttt{--enable-frame-pointers}. Alternatively, you can install an opam switch with frame pointers enabled, as follows:
+
+\begin{verbatim}
+        opam switch create <YOUR-SWITCH-NAME-HERE> ocaml-option-fp
+\end{verbatim}
+
+Frame pointer support for OCaml is available on x86_64 architecture for Linux starting with OCaml 4.12 and on macOS from OCaml 5.3. ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.
+
+\section{s:ocamlperf-profiling}{Profiling an Execution}
+
+The basic \texttt{perf} command for profiling is:
+\begin{verbatim}
+        perf record -F 99 --call-graph fp <YOUR_EXECUTABLE>
+\end{verbatim}
+
+The \texttt{-F 99} option sets \texttt{perf} to sample at 99Hz, reducing excessive data generation during longer runs and minimising overlap with other periodic activities. The \texttt{--call-graph fp} instructs \texttt{perf} to use frame pointers to get the call graph, followed by the OCaml executable you want to profile. This command creates a \texttt{perf.data} file in the current directory. Alternatively use \texttt{--output} to choose a more descriptive filename.
+
+The \texttt{perf record} command works by copying a segment of the call stack at each sample and recording this data into a \texttt{perf.data} file. These samples can then be processed after recording using \texttt{perf report} to reconstruct the profiled program’s call stack at every sample.
+
+\texttt{perf} uses the symbols present in an OCaml executable, so it helps to understand OCaml's name mangling scheme to map names to OCaml source locations. OCaml 5.3 onwards uses a name mangling scheme of \texttt{caml<MODULE_NAME>\$<FUNCTION_NAME>_<NNN>}, where \texttt{NNN} is a randomly generated number. In OCaml 5.2 the separator is a dot \texttt{caml<MODULE_NAME>.<FUNCTION_NAME>_<NNN>}, and before OCaml 5.1 the separator is a double underscore. All formats are supported by \texttt{perf}.
+
+Consider the following program:
+
+\begin{caml_example*}{verbatim}
+module Compute = struct
+  let rec fib n = 
+    if n = 0 then 0
+    else if n = 1 then 1
+    else fib (n-1) + fib (n-2)
+end
+
+let main () = 
+  let r = Compute.fib 20 in 
+  Printf.printf "fib(20) = %d" r
+
+let _ = main ()
+\end{caml_example*}
+
+This program produces the names \texttt{camlFib\$main_274} for the \texttt{main} function and \texttt{camlFib\$fib_271} for the \texttt{fib} function in the \texttt{Compute} module.
+
+\section{s:ocamlperf-printing}{Printing profiling information}
+
+The \texttt{perf report} command summarises data in the \texttt{perf.data} file.
+The basic \texttt{perf report} command is:
+
+\begin{verbatim}
+        perf report -f --no-children -i perf.data
+\end{verbatim}
+
+This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively \texttt{--stdio} will output similar data using a text based report writing to stdout. Note that if stack traces appear broken, it may be due to software not having frame pointer support enabled.
+
+Consider the following program which calculates the Tak function.
+\begin{caml_example*}{verbatim}
+let (x,y,z) =
+  try
+    let x = int_of_string Sys.argv.(1) in
+    let y = int_of_string Sys.argv.(2) in
+    let z = int_of_string Sys.argv.(3) in
+    (x,y,z)
+  with _ -> (18,12,6)
+
+let rec tak x y z =
+  if x > y then
+    tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y)
+  else z
+
+let main () =
+  let r = tak x y z in
+  Printf.printf "tak %d %d %d = %d\n" x y z r
+
+let _ = main ()
+\end{caml_example*}
+
+The \texttt{perf} report for this might resemble the following.
+\begin{verbatim}
+Samples: 809  of event 'cycles', Event count (approx.): 24701952617
+  Overhead  Command     Shared Object      Symbol
+-  100.00%  tak-fp.exe  tak-fp.exe         [.] Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+     Tak\$tak_402
+   + Tak\$tak_402
+     0.00%  tak-fp.exe  [kernel.kallsyms]  [k] 0xffffb9a5ff79d854
+     0.00%  perf-exec   [kernel.kallsyms]  [k] 0xffffb9a5ff719c34
+\end{verbatim}
+
+For a more interesting program implementing a Sudoku solver using Norvig's Backtracking algorithm, the output might resemble.
+\begin{verbatim}
+Samples: 7K of event 'cycles:P', Event count (approx.): 141874642467
+  Overhead  Command          Shared Object      Symbol
+-   35.90%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig._pow_747               
+   - 35.29% Sudoku_norvig._pow_747                                                       
+      + 17.70% Sudoku_norvig.eliminate_value_760                                         
+      + 15.74% Sudoku_norvig._pow_747                                                    
+      + 1.03% Sudoku_norvig.fun_1435                                                     
+     0.61% Sudoku_norvig.eliminate_value_760                                             
+-   12.99%  sudoku_norvig.e  sudoku_norvig.exe  [.] BatList.findnext_1543                
+   - 12.00% BatList.findnext_1543                                                        
+      - 11.71% BatList.find_all_1539                                                     
+         + 8.07% Sudoku_norvig.fun_1421                                                  
+         + 2.73% Sudoku_norvig.eliminate_948                                             
+         + 0.90% Sudoku_norvig.assign_949                                                
+-   10.63%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig.eliminate_value_760    
+   + 8.29% Sudoku_norvig.eliminate_value_760                                             
+   + 1.93% Sudoku_norvig._pow_747                                                        
++    9.50%  sudoku_norvig.e  [kernel.kallsyms]  [k] asm_exc_nmi                          
++    7.24%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig.fun_1435               
++    5.06%  sudoku_norvig.e  sudoku_norvig.exe  [.] caml_modify                          
++    3.10%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig.fun_1360               
++    2.11%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig.eliminate_948          
++    2.00%  sudoku_norvig.e  sudoku_norvig.exe  [.] Stdlib.List.iter_366                 
++    1.54%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig.fun_1421               
++    1.47%  sudoku_norvig.e  sudoku_norvig.exe  [.] caml_apply2                          
++    1.42%  sudoku_norvig.e  sudoku_norvig.exe  [.] BatList.find_all_1539                
++    1.20%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig.fun_1366               
++    1.09%  sudoku_norvig.e  sudoku_norvig.exe  [.] Stdlib.List.fold_left_380            
+\end{verbatim}
+Note this implementation clearly has some optimsation possible in `_pow_747` a recursive
+Power of 2 function and parts of the OCaml runtime `caml_modify` are present.
+
+Profiling data can also be visualised as Flame Graphs, which highlight the most frequent code paths in stack traces. The original scripts scripts \texttt{stackcollapse-perf.pl} and \texttt{flamegraph.pl} can be found at Brendan Gregg's \href{https://www.brendangregg.com/flamegraphs.html}{Flame Graphs} web page and work as follows:
+
+\begin{verbatim}
+git clone https://github.com/brendangregg/FlameGraph
+cd FlameGraph
+
+# Collect the results into perf.data
+perf record -F 99 --call-graph fp <YOUR_EXECUTABLE>
+perf script -i perf.data | ./stackcollapse-perf.pl out.folded
+flamegraph.pl out.folded > flamegraph.svg ## Create the FlameGraph svg
+\end{verbatim}
+
+Alternatively \href{https://github.com/jonhoo/inferno}{inferno} is a Rust port of the Flame Graphs tools which works in a similar way and is faster to process large perf files.
+
+\begin{verbatim}
+cargo install inferno
+
+# Collect the results into perf.data
+perf script -i perf.data | inferno-collapse-perf > stacks.folded
+cat stacks.folded | inferno-flamegraph > flamegraph.svg
+\end{verbatim}
+
+Some \texttt{perf} tools (e.g., \texttt{perf-report(1)} and \texttt{perf-annotate(1)}) use DWARF debugging information to associate symbols with source code locations, if you need these features, the program and it's dependencies need to be compiled with \texttt{-g} to include debugging information in the executable.
+
+Captured profile data can also be processed using \texttt{perf script} in various ways or with online tools like \href{https://www.speedscope.app}{speedscope.app} and \href{https://profiler.firefox.com/}{profiler.firefox.com}, or any other tool that supports \texttt{perf}-formatted data.
+
+\section{s:ocamlperf-conclusion}{Conclusion}
+
+For CPU profiling of native code, standard tools such as \texttt{perf}, eBPF, DTrace, or Instruments (on macOS) are recommended. Compiling with frame pointers enabled is often necessary for these tools to work effectively. Profiling with \texttt{gprof} is no longer supported.
+
+Enabling frame pointers can impact performance on certain architectures (up to 10\% performance cost on x86_64 has been measured). Users of this feature are encouraged to benchmark their own applications to assess this impact.
+
+\section{s:ocamlperf-glossary}{Glossary}
+
+The following terminology is used in this chapter of the manual.
+
+\begin{itemize}
+\item[{\bf Call graph}] The chain of function calls that have lead to the current function (also referred to as a call stack).
+\item[{\bf Unwinding}] The process of restoring the program's state to how it was before some function(s) were called and possibly giving a profiler or debugger much more information. (also called stack unwinding).
+\item[{\bf Stack frame}] Refers to the portion of the stack allocated to a single function call. (also called an activation frame, activation record or simply frame).
+\end{itemize}