Skip to content

Commit

Permalink
Add profiling with perf chapter OCaml manual
Browse files Browse the repository at this point in the history
This covers the use of Linux perf to profile OCaml native code
including the overlap with frame pointers and DWARF CFI information
  • Loading branch information
tmcgilchrist committed Jan 22, 2025
1 parent 137dd26 commit cff15cf
Show file tree
Hide file tree
Showing 6 changed files with 218 additions and 5 deletions.
3 changes: 3 additions & 0 deletions Changes
Original file line number Diff line number Diff line change
Expand Up @@ -182,6 +182,9 @@ Working version
- #13694: Fix name for caml_hash_variant in the C interface.
(Michael Hendricks)

- #?????: Document support for profiling with Linux perf and frame pointers.
(Tim McGilchrist, review by ???)

### Compiler user-interface and warnings:

- #13428: support dump=[source | parsetree | lambda | ... | cmm | ...]
Expand Down
4 changes: 4 additions & 0 deletions manual/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,8 @@ chapters (or sometimes sections) are mapped to a distinct `.etex` file:
- Optimisation with Flambda: `flambda.etex`
- Fuzzing with afl-fuzz: `afl-fuzz.etex`
- Runtime tracing with Runtime_events: `runtime_tracing.etex`
- The “Tail Modulo Constructor” program transformation: `tail-mod-cons.etex`
- Runtime detection of data races with ThreadSanitizer: `tsan.etex`

Note that ocamlc,ocamlopt and the toplevel options overlap a lot.
Consequently, these options are described together in the file
Expand All @@ -104,6 +106,8 @@ of `unified-options.etex` contains the relevant information.
- The runtime_events library: `libruntime_events.etex`
- The dynlink library: dynamic loading and linking of object files:
`libdynlink.etex`
- Recently removed or moved libraries (Graphics, Bigarray, Num, LablTk):
`old.etex`

Latex extensions
----------------
Expand Down
1 change: 1 addition & 0 deletions manual/src/allfiles.etex
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ and as a
\input{ocamldoc.tex}
\input{debugger.tex}
\input{profil.tex}
\input{profile-perf.tex}
\input{intf-c.tex}
\input{flambda.tex}
\input{afl-fuzz.tex}
Expand Down
3 changes: 2 additions & 1 deletion manual/src/cmds/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,8 @@ TRANSF = $(OCAMLRUN) $(TOOLS)/transf
FILES = comp.tex top.tex runtime.tex native.tex lexyacc.tex intf-c.tex \
ocamldep.tex profil.tex debugger.tex ocamldoc.tex \
warnings-help.tex flambda.tex tail-mod-cons.tex \
afl-fuzz.tex runtime-tracing.tex unified-options.tex tsan.tex
afl-fuzz.tex runtime-tracing.tex unified-options.tex tsan.tex \
profile-perf.tex

etex-files: $(FILES)
all: $(FILES)
Expand Down
6 changes: 2 additions & 4 deletions manual/src/cmds/profil.etex
Original file line number Diff line number Diff line change
Expand Up @@ -140,7 +140,5 @@ Display a short usage summary and exit.

Profiling with "ocamlprof" only records execution counts, not the actual
time spent within each function. There is currently no way to perform
time profiling on bytecode programs generated by "ocamlc". For time
profiling of native code, users are recommended to use standard tools
such as perf (on Linux), Instruments (on macOS) and DTrace. Profiling
with "gprof" is no longer supported.
time profiling on bytecode programs generated by "ocamlc". For time profiling
of native code (see chapter~\ref{c:profiler-perf}).
206 changes: 206 additions & 0 deletions manual/src/cmds/profile-perf.etex
Original file line number Diff line number Diff line change
@@ -0,0 +1,206 @@
\chapter{Profiling (perf)} \label{c:profiler-perf}
%HEVEA\cutname{profiler-perf.html}

This chapter describes how to use \texttt{perf} to profile OCaml programs.

Linux Performance Events (\texttt{perf (1)}) is a suite of tools for performance observability. The main features covered here are \texttt{perf-record(1)} for recording events, and \texttt{perf-report(1)} for printing and visualising recorded events. \texttt{perf} has many additional profiling and visualising features. For more comprehensive documentation, see (\texttt{perf(1)}, \href{https://perfwiki.github.io/main/}{\texttt{perf} wiki} or \href{https://www.brendangregg.com/perf.html}{Brendan Gregg's Blog}).

\section{s:ocamlperf-call-graph}{Background}

CPU profiling is typically performed by sampling the CPU call graph at frequent intervals to gather statistics on the code paths that are consuming CPU resources. To profile OCaml code, \texttt{perf} needs to understand the call graph of OCaml. \texttt{perf} supports multiple options for recording call graphs:
\begin{itemize}
\item Frame Pointers, which is the default.
\item DWARF's Call Frame Information (CFI).
\item Hardware Last Branch Record (LBR).
\end{itemize}

Frame pointer based call graphs use a convention where a register called the frame pointer holds the address for the beginning of the stack frame, and the previous values for the frame pointer are stored at a known offset from the current frame pointer. Using this information the call graph for the current function can be calculated purely based on the current stack, a process called unwinding. On x86_64 the register use for storing the frame pointer is \$rbp, while ARM64 uses the register x29. OCaml 5 introduced non-contiguous stacks as part of the implementation of effects, see \href{https://dl.acm.org/doi/10.1145/3453483.3454039}{Retrofitting effect handlers onto OCaml} (Section 5.5), which work better with the copying nature of perf and frame pointers.

DWARF based call graphs use the DWARF CFI information to perform unwinding. However this interacts poorly with the copying nature of perf, often leading to truncated call graphs where not enough of the stack has been copied by \texttt{perf}. It also produces larger trace files that are more costly to capture and process. Finally it requires including CFI debugging information in your program resulting in larger binaries.

Hardware Last Branch Record (LBR) uses a processor provided method to record call graphs. This has the limitations of restricted availability (only on certain Intel CPUs) and a limited stack depth (16 on Haswell and 32 since Skylake).

Of these options, frame pointers are recommended for profiling OCaml code as it has the following advantages:
\begin{itemize}
\item Unwinding is faster to calculate.
\item Tracing data produced is smaller.
\item Frame pointers provide more complete call graphs, particularly when used with a Linux distribution that supports them.
\item Frame pointers work better with perf's copying nature and OCaml 5's non-contiguous stacks.
\end{itemize}

\section{s:ocamlperf-compiling}{Compiling for Profiling}

The process for compiling an executable for CPU profiling depends on the OCaml version. For OCaml versions 4.14 and earlier, either frame pointers or DWARF can be used, while for OCaml 5.0 and later, enabling frame pointers is recommended.

To enable frame pointers, configure the compiler with \texttt{--enable-frame-pointers}. Alternatively, you can install an opam switch with frame pointers enabled, as follows:

\begin{verbatim}
opam switch create <YOUR-SWITCH-NAME-HERE> ocaml-option-fp
\end{verbatim}

Frame pointer support for OCaml is available on x86_64 architecture for Linux starting with OCaml 4.12 and on macOS from OCaml 5.3. ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.

\section{s:ocamlperf-profiling}{Profiling an Execution}

The basic \texttt{perf} command for profiling is:
\begin{verbatim}
perf record -F 99 --call-graph fp <YOUR_EXECUTABLE>
\end{verbatim}

The \texttt{-F 99} option sets \texttt{perf} to sample at 99Hz, reducing excessive data generation during longer runs and minimising overlap with other periodic activities. The \texttt{--call-graph fp} instructs \texttt{perf} to use frame pointers to get the call graph, followed by the OCaml executable you want to profile. This command creates a \texttt{perf.data} file in the current directory. Alternatively use \texttt{--output} to choose a more descriptive filename.

The \texttt{perf record} command works by copying a segment of the call stack at each sample and recording this data into a \texttt{perf.data} file. These samples can then be processed after recording using \texttt{perf report} to reconstruct the profiled program’s call stack at every sample.

\texttt{perf} uses the symbols present in an OCaml executable, so it helps to understand OCaml's name mangling scheme to map names to OCaml source locations. OCaml 5.3 onwards uses a name mangling scheme of \texttt{caml<MODULE_NAME>\$<FUNCTION_NAME>_<NNN>}, where \texttt{NNN} is a randomly generated number. In OCaml 5.2 the separator is a dot \texttt{caml<MODULE_NAME>.<FUNCTION_NAME>_<NNN>}, and before OCaml 5.1 the separator is a double underscore. All formats are supported by \texttt{perf}.

Consider the following program:

\begin{caml_example*}{verbatim}
module Compute = struct
let rec fib n =
if n = 0 then 0
else if n = 1 then 1
else fib (n-1) + fib (n-2)
end

let main () =
let r = Compute.fib 20 in
Printf.printf "fib(20) = %d" r

let _ = main ()
\end{caml_example*}

This program produces the names \texttt{camlFib\$main_274} for the \texttt{main} function and \texttt{camlFib\$fib_271} for the \texttt{fib} function in the \texttt{Compute} module.

\section{s:ocamlperf-printing}{Printing profiling information}

The \texttt{perf report} command summarises data in the \texttt{perf.data} file.
The basic \texttt{perf report} command is:

\begin{verbatim}
perf report -f --no-children -i perf.data
\end{verbatim}

This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively \texttt{--stdio} will output similar data using a text based report writing to stdout. Note that if stack traces appear broken, it may be due to software not having frame pointer support enabled.

Consider the following program which calculates the Tak function.
\begin{caml_example*}{verbatim}
let (x,y,z) =
try
let x = int_of_string Sys.argv.(1) in
let y = int_of_string Sys.argv.(2) in
let z = int_of_string Sys.argv.(3) in
(x,y,z)
with _ -> (18,12,6)

let rec tak x y z =
if x > y then
tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y)
else z

let main () =
let r = tak x y z in
Printf.printf "tak %d %d %d = %d\n" x y z r

let _ = main ()
\end{caml_example*}

The \texttt{perf} report for this might resemble the following.
\begin{verbatim}
Samples: 809 of event 'cycles', Event count (approx.): 24701952617
Overhead Command Shared Object Symbol
- 100.00% tak-fp.exe tak-fp.exe [.] Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
Tak\$tak_402
+ Tak\$tak_402
0.00% tak-fp.exe [kernel.kallsyms] [k] 0xffffb9a5ff79d854
0.00% perf-exec [kernel.kallsyms] [k] 0xffffb9a5ff719c34
\end{verbatim}

For a more interesting program implementing a Sudoku solver using Norvig's Backtracking algorithm, the output might resemble.
\begin{verbatim}
Samples: 7K of event 'cycles:P', Event count (approx.): 141874642467
Overhead Command Shared Object Symbol
- 35.90% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig._pow_747
- 35.29% Sudoku_norvig._pow_747
+ 17.70% Sudoku_norvig.eliminate_value_760
+ 15.74% Sudoku_norvig._pow_747
+ 1.03% Sudoku_norvig.fun_1435
0.61% Sudoku_norvig.eliminate_value_760
- 12.99% sudoku_norvig.e sudoku_norvig.exe [.] BatList.findnext_1543
- 12.00% BatList.findnext_1543
- 11.71% BatList.find_all_1539
+ 8.07% Sudoku_norvig.fun_1421
+ 2.73% Sudoku_norvig.eliminate_948
+ 0.90% Sudoku_norvig.assign_949
- 10.63% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig.eliminate_value_760
+ 8.29% Sudoku_norvig.eliminate_value_760
+ 1.93% Sudoku_norvig._pow_747
+ 9.50% sudoku_norvig.e [kernel.kallsyms] [k] asm_exc_nmi
+ 7.24% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig.fun_1435
+ 5.06% sudoku_norvig.e sudoku_norvig.exe [.] caml_modify
+ 3.10% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig.fun_1360
+ 2.11% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig.eliminate_948
+ 2.00% sudoku_norvig.e sudoku_norvig.exe [.] Stdlib.List.iter_366
+ 1.54% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig.fun_1421
+ 1.47% sudoku_norvig.e sudoku_norvig.exe [.] caml_apply2
+ 1.42% sudoku_norvig.e sudoku_norvig.exe [.] BatList.find_all_1539
+ 1.20% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig.fun_1366
+ 1.09% sudoku_norvig.e sudoku_norvig.exe [.] Stdlib.List.fold_left_380
\end{verbatim}
Note this implementation clearly has some optimsation possible in `_pow_747` a recursive
Power of 2 function and parts of the OCaml runtime `caml_modify` are present.

Profiling data can also be visualised as Flame Graphs, which highlight the most frequent code paths in stack traces. The original scripts scripts \texttt{stackcollapse-perf.pl} and \texttt{flamegraph.pl} can be found at Brendan Gregg's \href{https://www.brendangregg.com/flamegraphs.html}{Flame Graphs} web page and work as follows:

\begin{verbatim}
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Collect the results into perf.data
perf record -F 99 --call-graph fp <YOUR_EXECUTABLE>
perf script -i perf.data | ./stackcollapse-perf.pl out.folded
flamegraph.pl out.folded > flamegraph.svg ## Create the FlameGraph svg
\end{verbatim}

Alternatively \href{https://github.com/jonhoo/inferno}{inferno} is a Rust port of the Flame Graphs tools which works in a similar way and is faster to process large perf files.

\begin{verbatim}
cargo install inferno

# Collect the results into perf.data
perf script -i perf.data | inferno-collapse-perf > stacks.folded
cat stacks.folded | inferno-flamegraph > flamegraph.svg
\end{verbatim}

Some \texttt{perf} tools (e.g., \texttt{perf-report(1)} and \texttt{perf-annotate(1)}) use DWARF debugging information to associate symbols with source code locations, if you need these features, the program and it's dependencies need to be compiled with \texttt{-g} to include debugging information in the executable.

Captured profile data can also be processed using \texttt{perf script} in various ways or with online tools like \href{https://www.speedscope.app}{speedscope.app} and \href{https://profiler.firefox.com/}{profiler.firefox.com}, or any other tool that supports \texttt{perf}-formatted data.

\section{s:ocamlperf-conclusion}{Conclusion}

For CPU profiling of native code, standard tools such as \texttt{perf}, eBPF, DTrace, or Instruments (on macOS) are recommended. Compiling with frame pointers enabled is often necessary for these tools to work effectively. Profiling with \texttt{gprof} is no longer supported.

Enabling frame pointers can impact performance on certain architectures (up to 10\% performance cost on x86_64 has been measured). Users of this feature are encouraged to benchmark their own applications to assess this impact.

\section{s:ocamlperf-glossary}{Glossary}

The following terminology is used in this chapter of the manual.

\begin{itemize}
\item[{\bf Call graph}] The chain of function calls that have lead to the current function (also referred to as a call stack).
\item[{\bf Unwinding}] The process of restoring the program's state to how it was before some function(s) were called and possibly giving a profiler or debugger much more information. (also called stack unwinding).
\item[{\bf Stack frame}] Refers to the portion of the stack allocated to a single function call. (also called an activation frame, activation record or simply frame).
\end{itemize}

0 comments on commit cff15cf

Please sign in to comment.