
Nbench - Flexible benchmarking of Nimbus internals #641

Merged: 18 commits into devel, Dec 20, 2019
Conversation

@mratsim (Contributor) commented Dec 11, 2019

This lays the groundwork for nbench, a CLI tool to benchmark specific parts of the Nimbus internals.

Any proc can be added to the benchmark by tagging it with the {.nbench.} pragma.

Unfortunately it is not possible to just {.push nbench.} at the moment due to nim-lang/Nim#12867.
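
For illustration, a minimal sketch of the tagging (the proc is hypothetical; bench_lab is the module in this PR that defines the pragma, and the import path assumes compiling from the repository root):

import nbench/bench_lab  # compile with -d:nbench to activate the hooks (see Usage below)

proc addExample(a, b: int): int {.nbench.} =
  ## The body is untouched; the macro injects the entry/exit hooks around it.
  a + b

echo addExample(1, 2)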

WIP, but it can be merged as it mostly lives in its own folder, besides adding {.nbench.} in choice places.

Some highlights:

  • I created macros that hook into the function entry and exit points. This can be reused for Chronicles logging at the trace level, without having to add those calls manually like what was done here:
    https://github.com/status-im/nim-beacon-chain/blob/ffd0c052ee759fdbe0ea7b74588bd470d45ca15a/beacon_chain/spec/state_transition_epoch.nim#L409-L413
    The macro accepts simple templates:
    https://github.com/status-im/nim-beacon-chain/blob/85bc134d06da9ec7bd915835355fcc659f3961e1/nbench/bench_lab.nim#L173-L187
    Those can be extended to our benchmarking needs.

  • The insertion at the exit point is approximate: it doesn't count the final return-statement expression, so if that expression is costly it won't show up at the moment.
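    Conceptually, the rewrite looks like this sketch (not the exact generated code; fnEntry/fnExit are the bench_lab templates):

    proc f(): int {.nbench.} =
      someWork()
      expensiveFinalExpr()   # the return expression

    # becomes roughly:

    proc f(): int =
      fnEntry(id, startTime, startCycle)   # injected: starts timers/counters
      someWork()
      fnExit(id, startTime, startCycle)    # injected before the final expression
      expensiveFinalExpr()                 # evaluated after fnExit: not measured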

  • Ultimately it should be merged with ncli, or at least the precise state_transition configuration should be shared: ncli can only do a full state transition at the moment and cannot be asked to run only process_deposits, for example.

  • I chose not to use nimprof's sampling approach (https://github.com/nim-lang/Nim/blob/v1.0.4/lib/pure/nimprof.nim):

    • It piggybacks on Nim stack traces, which are eaten by templates and are slow.

    • We probably want to be selective instead of including all stack traces.

    • Tools like perf, Apple Instruments, or Intel VTune already do sampling really well, down to the offending line with the assembly in front.

    • What they don't do is:

      • give you domain-specific metrics
      • offer flexible backends
      • allow being enabled for a long time without requiring GBs of data, since they sample every micro- or millisecond instead of per function call
      • trace without debugging symbols
    • I.e. nbench should be a complement that can be enabled flexibly, with perf for drilling down if we detect something unusual.

  • The counters should be migrated to nim-metrics so that they are available on Grafana: they would be valuable for the Metrics/Public Grafana dashboard of the testnet (#637) and probably for catching performance regressions.

  • The hooks could be exported to nim-metrics and used more widely.

  • CSV backend coming (or JSON, or both)

  • Tags to get summaries of subsystems like "crypto", "block processing", "epoch processing" are planned.

  • It requires -std=gnu99 to compile the assembly statement properly, which might conflict with Milagro's -std=c99 (it didn't on my machine, but who knows)

  • Notable things missing:

    • Repeating the bench multiple times to obtain a standard deviation and be less sensitive to noise
    • Memory monitoring
      • Stack / heap usage
      • Page faults / cache misses
    • ARM support
  • Polish: if someone wants to do a pass on the command-line params, feel free

  • The example scenarios are taken from the test suite, but we probably want to craft our own.
    Maybe an external scenario repo that is submoduled is worthwhile. It could also be used for fuzzing
    or for testing specific state transitions that are not part of the EF suite.

  • The clock / cycle counts will be off if the CPU is overclocked; you can run this to check whether they match: https://github.com/status-im/nim-beacon-chain/blob/85bc134d06da9ec7bd915835355fcc659f3961e1/nbench/platforms/x86.nim#L116-L122
    Unfortunately, correcting this / finding the CPU frequency programmatically is non-trivial.
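    A self-contained version of that sanity check (a sketch: getTicks is the RDTSC wrapper from nbench/platforms/x86.nim; the rest is the standard library):

    import monotimes, times, os

    let t0 = getMonoTime()
    let c0 = getTicks()
    sleep(1000)
    let cycles = getTicks() - c0
    let elapsedNs = (getMonoTime() - t0).inNanoseconds
    # cycles per nanosecond is numerically the TSC rate in GHz
    echo "approx TSC rate: ", cycles.float / elapsedNs.float, " GHz"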

Screenshot: [nbench report output, image omitted]

Usage:

nim c -d:const_preset=mainnet -d:nbench -r nbench/nbench.nim
nbench/nbench cmdFullStateTransition -d=nbench/scenarios/mainnet/full_state_transitions/voluntary_exit -q=2

@mratsim mratsim changed the base branch from master to devel December 11, 2019 03:01
@arnetheduck (Member) commented:

keep in mind that any numbers coming out of here will be fraught with issues that other tools have solved already:

  • biased because of injected code
  • can't use stack traces because nim stack traces are broken / extremely inefficient, meaning information is limited
  • doesn't use hardware counters or other platform-specific support causing poor accuracy and performance
  • heat maps, perf diffs and lots of other "surrounding" tooling exists already that will have to be redeveloped
  • the numbers coming out will be misleading at best

the described problems with established tools like perf sound superficial:

  • domain-specific metrics are enabled by using SDKs that come with tools like vtune
  • long time sampling can be controlled using tool options (ie frequency, call stack depth etc)
  • tracing without debugging symbols.. er, this macro injects an inefficient equivalent of debugging symbols
  • flexible backends is a red herring - one would want to use backend-specific benchmarking tools that take advantage of platform specific features, ie hardware perf counters or wasm interpreter metrics

instead of doing this from scratch, would be nice if it could plug into some existing framework like perf, generating appropriate filters, configs, etc - this way there's less wheel being reinvented poorly.

anyway, as a 5-minute job to quickly gain some intuition, sure.. but usually the poor man's profiler (pstack) works for that too.. otherwise, this looks a bit like a time and maintenance sink where one can keep adding just one more feature indefinitely.

# From Intel SDM
# https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

proc getTicks*(): int64 {.inline.} =
Contributor:

You're not measuring CPU cycles here, because modern CPUs give you a constant rate TSC (usually at the max CPU freq, regardless of the actual frequency): https://en.wikipedia.org/wiki/Time_Stamp_Counter

mratsim (Contributor Author):

Yes, I know; I mentioned it here:

I tried to find a way to evaluate the difference, but on Linux there is no equivalent of Windows' QueryPerformanceFrequency that would allow evaluating the ratio.

Contributor:

Not just when it's overclocked, but also when it scales its frequency with the load and when it throttles due to thermal constraints.

This means that, in practice, just reading TSC will never give accurate CPU cycles.

You can get accurate average CPU frequency over a given interval with the APERF/MPERF ratio. These are MSR registers. Some info here: https://lwn.net/Articles/283769/

They're used in cpupower - https://github.com/torvalds/linux/blob/master/tools/power/cpupower/utils/idle_monitor/mperf_monitor.c - to get these average frequencies:

cpupower monitor -m Mperf
    | Mperf              
 CPU| C0   | Cx   | Freq  
   0|  4.70| 95.30|  2163
   1|  4.69| 95.31|  1825
   6|  4.97| 95.03|  2140
   3|  4.93| 95.07|  1986
   7| 15.06| 84.94|  3353
   4|  5.15| 94.85|  1822
   2| 12.13| 87.87|  3304
   5| 13.76| 86.24|  3176
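
For reference, a minimal sketch of reading those MSRs directly on Linux (assumptions: the msr kernel module is loaded, the process runs as root, and the register addresses 0xE7/0xE8 are as documented in the Intel SDM):

import posix, os

const
  IA32_MPERF = 0xE7  # counts at the TSC base frequency
  IA32_APERF = 0xE8  # counts at the actual, scaled/boosted frequency

proc rdmsr(fd: cint, msr: int): uint64 =
  ## An MSR is read as 8 bytes at the file offset equal to its address.
  if pread(fd, result.addr, 8, Off(msr)) != 8:
    raiseOSError(osLastError())

let fd = posix.open("/dev/cpu/0/msr", O_RDONLY)
if fd < 0: raiseOSError(osLastError())
let m0 = rdmsr(fd, IA32_MPERF)
let a0 = rdmsr(fd, IA32_APERF)
os.sleep(1000)  # qualified: posix also exports a sleep(seconds) proc
let mDelta = rdmsr(fd, IA32_MPERF) - m0
let aDelta = rdmsr(fd, IA32_APERF) - a0
# a ratio above 1.0 means the CPU ran above its base frequency on average
echo "APERF/MPERF ratio: ", aDelta.float / mDelta.float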

@mratsim (Contributor Author) commented Dec 11, 2019

keep in mind that any numbers coming out of here will be fraught with issues that other tools have solved already:

Yes, of course; however, I don't agree that they are solved.

* biased because of injected code

Any measurement is biased. The injection is a small constant factor, linear in the number of function calls, plus or minus cache misses, but it should stay hot in the L1/L2 cache. Besides, all frameworks suffer from non-determinism due to other background processes and the kernel potentially not giving them a CPU timeslice because of competing process priorities.

* can't use stack traces because nim stack traces are broken / extremely inefficient, meaning information is limited

I don't have the call graph, sure, but as I said, better to use perf/Instruments/VTune for that.

* doesn't use hardware counters or other platform-specific support causing poor accuracy and performance

The overhead is negligible when it's in the tens of cycles while a BLS pairing is
~20 ms, i.e. about 20,000,000x more costly.

* heat maps, perf diffs and lots of other "surrounding" tooling exists already that will have to be redeveloped

The surrounding tooling is poor: there is Apple Instruments, VTune, KCachegrind. Dumping to CSV and passing that to proper data visualization in R or Python is much better.

* the numbers coming out will be misleading at best

Statistics can always be misleading, but you can draw conclusions from noisy samples (even if it's the absence of conclusion).

the described problems with established tools like perf sound superficial:

* domain-specific metrics are enabled by using SDK's that come with tools like vtune

* long time sampling can be controlled using tool options (ie frequency, call stack depth etc)

* tracing without debugging symbols.. er, this macro injects an inefficient equivalent of debugging symbols

Assume you have a user on Windows who tells you that they seem to have a performance issue:
you can't ask them to install perf, and installing VTune is also very complex due to requiring sign-up and a license. Instead, you can have a metrics-enabled build as part of CD and ask them to run that and give you the perf report.
Lastly, those tools all require elevated rights to trace a process, because of perf_event_paranoid and Yama ptrace restrictions.

* flexible backends is a red herring - one would want to use backend-specific benchmarking tools that take advantage of platform specific features, ie hardware perf counters or wasm interpreter metrics

You can't use backend-specific benchmarking tools without being flexible about your Linux, Windows, and macOS backends.

instead of doing this from scratch, would be nice if it could plug into some existing framework like perf, generating appropriate filters, configs, etc - this way there's less wheel being reinvented poorly.

Nothing prevents that; however, perf is non-portable and the majority of people are on Windows.
Also, #637 is about having some performance metrics on our Grafana dashboard, and I expect that having hooks for specific functions would be valuable for research purposes.

anyway, as a 5-minute job to quickly gain some intuition, sure.. but usually the poor mans profiler (pstack) works for that too.. otherwise, this looks a bit like a time and maintenance sink where on can keep adding just one more feature indefinitely.

The main features to add right now are mentioned in the first post:

  • high-resolution timer for ARM
  • doing a state transition multiple times
  • CSV/JSON output
  • Memory numbers
  • choosing the transition more precisely, e.g. process_deposits, bls_sign, etc.
  • eth-metrics

If we go over that:

  • the high-resolution timer is just a matter of finding and testing the ARM equivalent of RDTSC (see the sketch after this list)
  • running the state transition multiple times is a for loop plus standard deviations. Note that VTune is unable to do that without adding a kernel module and loading it, for some reason
  • CSV/JSON output would have to be done for whichever backend we choose to support, and Python and R have far more tools to analyze and visualize data than any alternative
  • memory numbers: either from the Nim GC's getOccupiedMem or from Linux's getrusage or an equivalent (also sketched after this list)
  • choosing the transition more precisely is something that will also benefit ncli
  • eth-metrics will allow monitoring a node over the long term with the same interface as the rest of the functional metrics
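
Two of those in sketch form (hedged: only standard APIs and documented registers; the wiring is illustrative). For the timer, AArch64 exposes a fixed-rate virtual counter that plays the role RDTSC plays on x86:

proc getTicksArm(): int64 {.inline.} =
  ## Reads cntvct_el0; its frequency is reported by the cntfrq_el0 register.
  {.emit: """__asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(`result`));""".}

And for the memory numbers, the Nim GC plus getrusage (importc used directly so the sketch stands alone):

type Rusage {.importc: "struct rusage", header: "<sys/resource.h>".} = object
  ru_maxrss: clong  # peak RSS: kilobytes on Linux, bytes on macOS

proc getrusage(who: cint, usage: ptr Rusage): cint
  {.importc, header: "<sys/resource.h>".}

const RUSAGE_SELF = cint(0)  # 0 on Linux and macOS

var ru: Rusage
discard getrusage(RUSAGE_SELF, ru.addr)
echo "GC-occupied heap: ", getOccupiedMem(), " bytes"
echo "peak RSS: ", ru.ru_maxrss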

Lastly here are the alternative routes that were evaluated:

  1. Have a benchmark suite similar to our test suite.
     Problem: it's not flexible, and it does not help benchmarking on the testnet when everything is intermixed.
  2. Go Windows-specific + MacOS-specific + Linux-specific. That is a huge upfront cost, and in my opinion it's even more costly to maintain due to needing access to all 3 platforms. It also assumes that those interested in performance have time to set up and understand those tools. Instead, since we have SSZ dumps enabled in the testnet, we can ask people to just put those dumps in a scenario, download nbench, pass the scenario folder as a parameter, and give us the statistics:
     • no explanations on how to set up PerfView, VTune, Instruments, or perf, how to get root access, or how to navigate through the UI
     • we can even automate the workflow or auto-upload the perf figures
     Also, if we wanted a Grafana dashboard of those metrics, it would be a timesink as well.
  3. The nimprof route, piggybacking on Nim stack traces: we have many issues with Nim stack traces, and I didn't want to depend on them.

In terms of usage, we want to measure bottlenecks or regressions; we are not interested in the 1%, but in the 5-7%, and the overhead of 2 atomic increments in an application dominated by network IO and crypto is negligible. As we've seen with our woes with the go-libp2p-daemon, the standard library, PCRE, and RocksDB, the ability to control our dependencies and how we deliver the code to users is incredibly valuable for streamlining our instructions.
Currently I know that I can fit the instructions for benchmarking in fewer than 10 lines in the README, and users can follow them. There is no way to do that with VTune, perf, or Instruments, and the Windows story is even worse.
As we approach multi-client testnets, I'm convinced that the ability for anyone to run Nimbus and get performance metrics easily, without having to install, sign up for, or sudo a third-party tool, will be valuable for the following users:

  • The enthusiast who wants to know whether their hardware is sufficiently powerful and how many peers they can connect to before choking on BLS signatures
  • The (EF) researcher who wants to know the cost of surround-vote checking
  • Protocol researchers who may want to aggregate statistics from thousands of heterogeneous nodes (ARM, x86, Windows, Linux, Mac) with a common output format

To address those, I believe it's much more sound to start generic than to start very specific with hardware/OS-specific tools.

@zah (Contributor) left a review comment:

Perhaps we should store the binary test files in a submodule. Don't we expect the number of test cases to grow over time?

template fnEntry(id: int, startTime, startCycle: untyped): untyped =
  ## Bench tracing to insert on function entry
  {.noSideEffect.}:
    discard BenchMetrics[id].numCalls.atomicInc()
@zah commented Dec 11, 2019:

It's possible to produce very slightly better code by relying on the {.global.} pragma here:

var benchMeta {.global.} = Metadata(...)
{.noSideEffect.}:
  atomicInc benchMeta.numCalls

The optimisation will come from the fact that the address of the metadata object will be known at compile-time and the field will be incremented without additional indirections. You may need an extra proc registerMetadata taking the address of the global variable and adding it to a registry.

mratsim (Contributor Author):

In this case the address of BenchMetrics[id] is also known at compile-time since id is a const.

The registerMetadata would have to be done at runtime, since we don't know the actual address at compile time, so unless I'm missing something we would have:

var benchMeta {.global.} = Metadata(...)
BenchMetrics.registerMetadata(benchMeta)
{.noSideEffect.}:
  atomicInc benchMeta.numCalls

We can check the assembly produced, but since BenchMetrics is a global var it's in the BSS, and all accesses should have no indirection (i.e. BSS start + BenchMetrics offset). Also, given that it will always be hot in cache, there shouldn't be any cache misses.

@zah commented Dec 12, 2019:

But the sequence is on the heap, so you are offsetting an unknown pointer.
There are two ways to do the registration:

var benchMeta {.global.} = Metadata(...)
var dummy {.global.} = register(benchMeta.addr) 

or

var benchMeta {.global.} = createBenchMeta(...)

proc createBenchMeta(...): Metadata =
  ...
  register result.addr

The second one is slightly more tricky, but it should be fine due to the way Nim implements NRVO. You can consult the generated code to be sure.

mratsim (Contributor Author):

Since the size is known at compile-time, it should be transformable to a global array
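
For example (a sketch; NumBenchProcs stands for a hypothetical compile-time constant counting the instrumented procs):

const NumBenchProcs = 64  # hypothetical; computed at compile time in practice
var BenchMetrics: array[NumBenchProcs, Metadata]  # static storage, no heap indirection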

Contributor:

Another thing to consider is that the {.global.} var approach will work differently for generic functions: each instantiation will get its own counter. I'm not completely sure whether this is the better treatment here, but since different generic instantiations can have vastly different code, it seems that measuring them separately is an improvement.

@mratsim (Contributor Author) commented Dec 11, 2019

Perhaps we should store the binary test files in a submodule. Don't we expect the number of test cases to grow over time?

Yes, that's what I proposed during the talk: having a scenario submodule for tests, fuzzing, and benchmarking cases that are not covered by the EF suite.

@zah commented Dec 11, 2019

I agree with most of Jacek's comments, but having an intrusive profiler is still sometimes useful for obtaining very high-level profiling information, where you exploit your knowledge of the structure of the code to measure very specific operations of interest. Such functionality already exists in nim-metrics, though. It can benefit from the added sophistication here when it comes to detecting CPU features and obtaining more "portable" results (e.g. number of cycles).

@zah commented Dec 11, 2019

Mamy also gives very good reasons for having the intrusive profiling as an option for end users. I guess we don't need to stick to a single tool and we can learn to love the plethora of options.

@mratsim (Contributor Author) commented Dec 12, 2019

I've looked into a mature backend that would work on Windows and Mac (i.e. not perf) and would not require signing up with either Apple or Intel (i.e. not Instruments or VTune).

I found LLVM XRay:

https://llvm.org/docs/XRay.html
https://lists.llvm.org/pipermail/llvm-dev/2018-February/121237.html

I can add a -d:benchXray flag that would change the compiler to clang; instead of {.nbench.} inserting symbols at function entries and exits, it would change the function signatures to __attribute__((xray_always_instrument)) to use the XRay backend.

This can be done with codegenDecl, like so {.pragma: xray, codegenDecl: "__attribute__((xray_always_instrument)) $# $#$#".}. Some care is needed if another pragma like inline is already present due to nim-lang/Nim#10682.
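
Roughly, a sketch (the proc is hypothetical; the -fxray-instrument flag and the XRAY_OPTIONS / llvm-xray workflow come from the LLVM docs linked above):

# compile with: nim c --cc:clang --passC:-fxray-instrument --passL:-fxray-instrument xdemo.nim
{.pragma: xray, codegenDecl: "__attribute__((xray_always_instrument)) $# $#$#".}

proc hotPath(n: int): int {.xray.} =
  ## The attribute forces XRay to instrument this proc regardless of its size.
  for i in 0 ..< n:
    result += i

echo hotPath(1_000_000)

Running with XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic" ./xdemo then produces a log that llvm-xray account can summarize.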

Assuming XRay is here to stay in LLVM, I believe it fits both Jacek's requirements and mine:

  • Jacek:
    • Industry grade
    • No overhead / well coded
    • Mature ecosystem
  • Mine:
    • Easy to ship to users
    • No extra dependencies to make it run
    • Custom reporting (besides the Prometheus/eth-metrics bindings)

The reporting side seems underdeveloped and needs to be explored; Trail of Bits seems to like it: https://blog.trailofbits.com/2019/10/03/tsc-frequency-for-all-better-profiling-and-benchmarking/

@zah commented Dec 16, 2019

This will have to be rebased after the latest 0.9.3 spec changes got merged.

@mratsim force-pushed the nbench branch 3 times, most recently from 9d03435 to 26fdf8c on December 18, 2019 18:03
@mratsim (Contributor Author) commented Dec 20, 2019

Can be merged before 0.9.4 cc @tersec to avoid little merge conflicts

@tersec (Contributor) commented Dec 20, 2019

Can be merged before 0.9.4 cc @tersec to avoid little merge conflicts

Sounds good.

@mratsim merged commit 106352a into devel on Dec 20, 2019
The delete-merged-branch bot deleted the nbench branch on December 20, 2019 16:14
zah pushed a commit that referenced this pull request Jan 6, 2020
* nbench PoC

* Remove the yaml files from the example scenarios

* update README with current status

* Add an alternative implementation that uses defer

* Forgot to add the old proc body

* slots-processing

* allow benching state_transition failures

* Add Attestations processing (workaround confutils bugs:
- status-im/nim-confutils#10
- status-im/nim-confutils#11
- status-im/nim-confutils#12)

* Add CLI command in the readme

* Filter report and add notes about CPU cycles

* Report averages

* Add debugecho style time/cycle print

* Report when we skip BLS and state root verification

* Update to 0.9.3

* Generalize scenario parsing

* Support all block processing scenarios

* parallel bench runner PoC

* Better load-issues reporting (the load issues were invalid signatures and were expected to fail)