
Nbench - Flexible benchmarking of Nimbus internals #641

Merged: 18 commits into devel, Dec 20, 2019
Conversation

@mratsim (Contributor) commented Dec 11, 2019

This lays the groundwork for nbench, a CLI tool to benchmark specific parts of the Nimbus internals.

Any proc can be added to the benchmark by tagging it with the {.nbench.} pragma.

Unfortunately it is not possible to just {.push nbench.} at the moment due to nim-lang/Nim#12867.
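
For illustration, a minimal sketch of the tagging (the proc is hypothetical; bench_lab is the module in this PR that defines the pragma, and the import path assumes compiling from the repository root):

import nbench/bench_lab  # compile with -d:nbench to activate the hooks (see Usage below)

proc addExample(a, b: int): int {.nbench.} =
  ## The body is untouched; the macro injects the entry/exit hooks around it.
  a + b

echo addExample(1, 2)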

WIP, but it can be merged as it mostly lives in its own folder, besides adding {.nbench.} in choice places.

Some highlights:

  • I created macros that hook into the function entry and exit points. This can be reused for Chronicles logging at the trace level, without having to add those calls manually like what was done here:
    https://github.com/status-im/nim-beacon-chain/blob/ffd0c052ee759fdbe0ea7b74588bd470d45ca15a/beacon_chain/spec/state_transition_epoch.nim#L409-L413
    The macro accepts simple templates:
    https://github.com/status-im/nim-beacon-chain/blob/85bc134d06da9ec7bd915835355fcc659f3961e1/nbench/bench_lab.nim#L173-L187
    Those can be extended to our benchmarking needs.

  • The insertion at the exit point is approximate: it doesn't count the final return-statement expression, so if that expression is costly it won't show up at the moment.
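    Conceptually, the rewrite looks like this sketch (not the exact generated code; fnEntry/fnExit are the bench_lab templates):

    proc f(): int {.nbench.} =
      someWork()
      expensiveFinalExpr()   # the return expression

    # becomes roughly:

    proc f(): int =
      fnEntry(id, startTime, startCycle)   # injected: starts timers/counters
      someWork()
      fnExit(id, startTime, startCycle)    # injected before the final expression
      expensiveFinalExpr()                 # evaluated after fnExit: not measured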

  • Ultimately it should be merged with ncli, or at least the precise state_transition configuration should be shared: ncli can only do a full state transition at the moment and cannot be asked to run only process_deposits, for example.

  • I chose not to use nimprof's sampling approach (https://github.com/nim-lang/Nim/blob/v1.0.4/lib/pure/nimprof.nim):

    • It piggybacks on Nim stack traces, which are eaten by templates and are slow.

    • We probably want to be selective instead of including all stack traces.

    • Tools like perf, Apple Instruments, or Intel VTune already do sampling really well, down to the offending line with the assembly in front.

    • What they don't do is:

      • give you domain-specific metrics
      • offer flexible backends
      • allow being enabled for a long time without requiring GBs of data, since they sample every micro- or millisecond instead of per function call
      • trace without debugging symbols
    • I.e. nbench should be a complement that can be enabled flexibly, with perf for drilling down if we detect something unusual.

  • The counters should be migrated to nim-metrics so that they are available on Grafana: they would be valuable for the Metrics/Public Grafana dashboard of the testnet (#637) and probably for catching performance regressions.

  • The hooks could be exported to nim-metrics and used more widely.

  • CSV backend coming (or JSON, or both)

  • Tags to get summaries of subsystems like "crypto", "block processing", "epoch processing" are planned.

  • It requires -std=gnu99 to compile the assembly statement properly, which might conflict with Milagro's -std=c99 (it didn't on my machine, but who knows)

  • Notable things missing:

    • Repeating the bench multiple times to obtain a standard deviation and be less sensitive to noise
    • Memory monitoring
      • Stack / heap usage
      • Page faults / cache misses
    • ARM support
  • Polish: if someone wants to do a pass on the command-line params, feel free

  • The example scenarios are taken from the test suite, but we probably want to craft our own.
    Maybe an external scenario repo that is submoduled is worthwhile. It could also be used for fuzzing
    or for testing specific state transitions that are not part of the EF suite.

  • The clock / cycle counts will be off if the CPU is overclocked; you can run this to check whether they match: https://github.com/status-im/nim-beacon-chain/blob/85bc134d06da9ec7bd915835355fcc659f3961e1/nbench/platforms/x86.nim#L116-L122
    Unfortunately, correcting this / finding the CPU frequency programmatically is non-trivial.
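    A self-contained version of that sanity check (a sketch: getTicks is the RDTSC wrapper from nbench/platforms/x86.nim; the rest is the standard library):

    import monotimes, times, os

    let t0 = getMonoTime()
    let c0 = getTicks()
    sleep(1000)
    let cycles = getTicks() - c0
    let elapsedNs = (getMonoTime() - t0).inNanoseconds
    # cycles per nanosecond is numerically the TSC rate in GHz
    echo "approx TSC rate: ", cycles.float / elapsedNs.float, " GHz"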

Screenshot: [nbench report output, image omitted]

Usage:

nim c -d:const_preset=mainnet -d:nbench -r nbench/nbench.nim
nbench/nbench cmdFullStateTransition -d=nbench/scenarios/mainnet/full_state_transitions/voluntary_exit -q=2

@mratsim mratsim changed the base branch from master to devel December 11, 2019 03:01
@arnetheduck (Member) commented:

keep in mind that any numbers coming out of here will be fraught with issues that other tools have solved already:

  • biased because of injected code
  • can't use stack traces because nim stack traces are broken / extremely inefficient, meaning information is limited
  • doesn't use hardware counters or other platform-specific support causing poor accuracy and performance
  • heat maps, perf diffs and lots of other "surrounding" tooling exists already that will have to be redeveloped
  • the numbers coming out will be misleading at best

the described problems with established tools like perf sound superficial:

  • domain-specific metrics are enabled by using SDKs that come with tools like vtune
  • long time sampling can be controlled using tool options (ie frequency, call stack depth etc)
  • tracing without debugging symbols.. er, this macro injects an inefficient equivalent of debugging symbols
  • flexible backends is a red herring - one would want to use backend-specific benchmarking tools that take advantage of platform specific features, ie hardware perf counters or wasm interpreter metrics

instead of doing this from scratch, would be nice if it could plug into some existing framework like perf, generating appropriate filters, configs, etc - this way there's less wheel being reinvented poorly.

anyway, as a 5-minute job to quickly gain some intuition, sure.. but usually the poor man's profiler (pstack) works for that too.. otherwise, this looks a bit like a time and maintenance sink where one can keep adding just one more feature indefinitely.

# From Intel SDM
# https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

proc getTicks*(): int64 {.inline.} =
Contributor:

You're not measuring CPU cycles here, because modern CPUs give you a constant rate TSC (usually at the max CPU freq, regardless of the actual frequency): https://en.wikipedia.org/wiki/Time_Stamp_Counter

mratsim (Contributor Author):

Yes, I know; I mentioned it here:

I tried to find a way to evaluate the difference, but on Linux there is no equivalent of Windows' QueryPerformanceFrequency that would allow evaluating the ratio.

Contributor:

Not just when it's overclocked, but also when it scales its frequency with the load and when it throttles due to thermal constraints.

This means that, in practice, just reading TSC will never give accurate CPU cycles.

You can get accurate average CPU frequency over a given interval with the APERF/MPERF ratio. These are MSR registers. Some info here: https://lwn.net/Articles/283769/

They're used in cpupower - https://github.com/torvalds/linux/blob/master/tools/power/cpupower/utils/idle_monitor/mperf_monitor.c - to get these average frequencies:

cpupower monitor -m Mperf
    | Mperf              
 CPU| C0   | Cx   | Freq  
   0|  4.70| 95.30|  2163
   1|  4.69| 95.31|  1825
   6|  4.97| 95.03|  2140
   3|  4.93| 95.07|  1986
   7| 15.06| 84.94|  3353
   4|  5.15| 94.85|  1822
   2| 12.13| 87.87|  3304
   5| 13.76| 86.24|  3176
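
For reference, a minimal sketch of reading those MSRs directly on Linux (assumptions: the msr kernel module is loaded, the process runs as root, and the register addresses 0xE7/0xE8 are as documented in the Intel SDM):

import posix, os

const
  IA32_MPERF = 0xE7  # counts at the TSC base frequency
  IA32_APERF = 0xE8  # counts at the actual, scaled/boosted frequency

proc rdmsr(fd: cint, msr: int): uint64 =
  ## An MSR is read as 8 bytes at the file offset equal to its address.
  if pread(fd, result.addr, 8, Off(msr)) != 8:
    raiseOSError(osLastError())

let fd = posix.open("/dev/cpu/0/msr", O_RDONLY)
if fd < 0: raiseOSError(osLastError())
let m0 = rdmsr(fd, IA32_MPERF)
let a0 = rdmsr(fd, IA32_APERF)
os.sleep(1000)  # qualified: posix also exports a sleep(seconds) proc
let mDelta = rdmsr(fd, IA32_MPERF) - m0
let aDelta = rdmsr(fd, IA32_APERF) - a0
# a ratio above 1.0 means the CPU ran above its base frequency on average
echo "APERF/MPERF ratio: ", aDelta.float / mDelta.float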

@mratsim (Contributor Author) commented Dec 11, 2019

keep in mind that any numbers coming out of here will be fraught with issues that other tools have solved already:

Yes, of course; however, I don't agree that they are solved.

* biased because of injected code

Any measurement is biased. The injection is a small constant factor, linear in the number of function calls, plus or minus cache misses, but it should stay hot in the L1/L2 cache. Besides, all frameworks suffer from non-determinism due to other background processes and the kernel potentially not giving them a CPU timeslice because of competing process priorities.

* can't use stack traces because nim stack traces are broken / extremely inefficient, meaning information is limited

I don't have the call graph, sure, but as I said, better to use perf/Instruments/VTune for that.

* doesn't use hardware counters or other platform-specific support causing poor accuracy and performance

The overhead is negligible when it's in the tens of cycles while a BLS pairing is
~20 ms, i.e. about 20,000,000x more costly.

* heat maps, perf diffs and lots of other "surrounding" tooling exists already that will have to be redeveloped

The surrounding tooling is poor: there is Apple Instruments, VTune, KCachegrind. Dumping to CSV and passing that to proper data visualization in R or Python is much better.

* the numbers coming out will be misleading at best

Statistics can always be misleading, but you can draw conclusions from noisy samples (even if it's the absence of conclusion).

the described problems with established tools like perf sound superficial:

* domain-specific metrics are enabled by using SDK's that come with tools like vtune

* long time sampling can be controlled using tool options (ie frequency, call stack depth etc)

* tracing without debugging symbols.. er, this macro injects an inefficient equivalent of debugging symbols

Assume you have a user on Windows who tells you that they seem to have a performance issue:
you can't ask them to install perf, and installing VTune is also very complex due to requiring sign-up and a license. Instead, you can have a metrics-enabled build as part of CD and ask them to run that and give you the perf report.
Lastly, those tools all require elevated rights to trace a process, because of perf_event_paranoid and Yama ptrace restrictions.

* flexible backends is a red herring - one would want to use backend-specific benchmarking tools that take advantage of platform specific features, ie hardware perf counters or wasm interpreter metrics

You can't use backend-specific benchmarking tools without being flexible about your Linux, Windows, and macOS backends.

instead of doing this from scratch, would be nice if it could plug into some existing framework like perf, generating appropriate filters, configs, etc - this way there's less wheel being reinvented poorly.

Nothing prevents that; however, perf is non-portable and the majority of people are on Windows.
Also, #637 is about having some performance metrics on our Grafana dashboard, and I expect that having hooks for specific functions would be valuable for research purposes.

anyway, as a 5-minute job to quickly gain some intuition, sure.. but usually the poor mans profiler (pstack) works for that too.. otherwise, this looks a bit like a time and maintenance sink where on can keep adding just one more feature indefinitely.

The main features to add right now are mentioned in the first post:

  • high-resolution timer for ARM
  • doing a state transition multiple times
  • CSV/JSON output
  • Memory numbers
  • choosing the transition more precisely, e.g. process_deposits, bls_sign, etc.
  • eth-metrics

If we go over that:

  • the high-resolution timer is just a matter of finding and testing the ARM equivalent of RDTSC (see the sketch after this list)
  • running the state transition multiple times is a for loop plus standard deviations. Note that VTune is unable to do that without adding a kernel module and loading it, for some reason
  • CSV/JSON output would have to be done for whichever backend we choose to support, and Python and R have far more tools to analyze and visualize data than any alternative
  • memory numbers: either from the Nim GC's getOccupiedMem or from Linux's getrusage or an equivalent (also sketched after this list)
  • choosing the transition more precisely is something that will also benefit ncli
  • eth-metrics will allow monitoring a node over the long term with the same interface as the rest of the functional metrics
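
Two of those in sketch form (hedged: only standard APIs and documented registers; the wiring is illustrative). For the timer, AArch64 exposes a fixed-rate virtual counter that plays the role RDTSC plays on x86:

proc getTicksArm(): int64 {.inline.} =
  ## Reads cntvct_el0; its frequency is reported by the cntfrq_el0 register.
  {.emit: """__asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(`result`));""".}

And for the memory numbers, the Nim GC plus getrusage (importc used directly so the sketch stands alone):

type Rusage {.importc: "struct rusage", header: "<sys/resource.h>".} = object
  ru_maxrss: clong  # peak RSS: kilobytes on Linux, bytes on macOS

proc getrusage(who: cint, usage: ptr Rusage): cint
  {.importc, header: "<sys/resource.h>".}

const RUSAGE_SELF = cint(0)  # 0 on Linux and macOS

var ru: Rusage
discard getrusage(RUSAGE_SELF, ru.addr)
echo "GC-occupied heap: ", getOccupiedMem(), " bytes"
echo "peak RSS: ", ru.ru_maxrss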

Lastly here are the alternative routes that were evaluated:

  1. Have a benchmark suite similar to our test suite.
     Problem: it's not flexible, and it does not help benchmarking on the testnet when everything is intermixed.
  2. Go Windows-specific + MacOS-specific + Linux-specific. That is a huge upfront cost, and in my opinion it's even more costly to maintain due to needing access to all 3 platforms. It also assumes that those interested in performance have time to set up and understand those tools. Instead, since we have SSZ dumps enabled in the testnet, we can ask people to just put those dumps in a scenario, download nbench, pass the scenario folder as a parameter, and give us the statistics:
     • no explanations on how to set up PerfView, VTune, Instruments, or perf, how to get root access, or how to navigate through the UI
     • we can even automate the workflow or auto-upload the perf figures
     Also, if we wanted a Grafana dashboard of those metrics, it would be a timesink as well.
  3. The nimprof route, piggybacking on Nim stack traces: we have many issues with Nim stack traces, and I didn't want to depend on them.

In terms of usage, we want to measure bottlenecks or regressions; we are not interested in the 1%, but in the 5-7%, and the overhead of 2 atomic increments in an application dominated by network IO and crypto is negligible. As we've seen with our woes with the go-libp2p-daemon, the standard library, PCRE, and RocksDB, the ability to control our dependencies and how we deliver the code to users is incredibly valuable for streamlining our instructions.
Currently I know that I can fit the instructions for benchmarking in fewer than 10 lines in the README, and users can follow them. There is no way to do that with VTune, perf, or Instruments, and the Windows story is even worse.
As we approach multi-client testnets, I'm convinced that the ability for anyone to run Nimbus and get performance metrics easily, without having to install, sign up for, or sudo a third-party tool, will be valuable for the following users:

  • The enthusiast who wants to know whether their hardware is sufficiently powerful and how many peers they can connect to before choking on BLS signatures
  • The (EF) researcher who wants to know the cost of surround-vote checking
  • Protocol researchers who may want to aggregate statistics from thousands of heterogeneous nodes (ARM, x86, Windows, Linux, Mac) with a common output format

To address those, I believe it's much more sound to start generic than to start very specific with hardware/OS-specific tools.

@zah (Contributor) left a review comment:

Perhaps we should store the binary test files in a submodule. Don't we expect the number of test cases to grow over time?

template fnEntry(id: int, startTime, startCycle: untyped): untyped =
  ## Bench tracing to insert on function entry
  {.noSideEffect.}:
    discard BenchMetrics[id].numCalls.atomicInc()
@zah commented Dec 11, 2019:

It's possible to produce very slightly better code by relying on the {.global.} pragma here:

var benchMeta {.global.} = Metadata(...)
{.noSideEffect.}:
  atomicInc benchMeta.numCalls

The optimisation will come from the fact that the address of the metadata object will be known at compile-time and the field will be incremented without additional indirections. You may need an extra proc registerMetadata taking the address of the global variable and adding it to a registry.

mratsim (Contributor Author):

In this case the address of BenchMetrics[id] is also known at compile-time since id is a const.

The registerMetadata would have to be done at runtime, since we don't know the actual address at compile time, so unless I'm missing something we would have:

var benchMeta {.global.} = Metadata(...)
BenchMetrics.registerMetadata(benchMeta)
{.noSideEffect.}:
  atomicInc benchMeta.numCalls

We can check the assembly produced, but since BenchMetrics is a global var it's in the BSS, and all accesses should have no indirection (i.e. BSS start + BenchMetrics offset). Also, given that it will always be hot in cache, there shouldn't be any cache misses.

@zah commented Dec 12, 2019:

But the sequence is on the heap, so you are offsetting an unknown pointer.
There are two ways to do the registration:

var benchMeta {.global.} = Metadata(...)
var dummy {.global.} = register(benchMeta.addr) 

or

var benchMeta {.global.} = createBenchMeta(...)

proc createBenchMeta(...): Metadata =
  ...
  register result.addr

The second one is slightly more tricky, but it should be fine due to the way Nim implements NRVO. You can consult the generated code to be sure.

mratsim (Contributor Author):

Since the size is known at compile-time, it should be transformable to a global array
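
For example (a sketch; NumBenchProcs stands for a hypothetical compile-time constant counting the instrumented procs):

const NumBenchProcs = 64  # hypothetical; computed at compile time in practice
var BenchMetrics: array[NumBenchProcs, Metadata]  # static storage, no heap indirection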

Contributor:

Another thing to consider is that the {.global.} var approach will work differently for generic functions: each instantiation will get its own counter. I'm not completely sure whether this is the better treatment here, but since different generic instantiations can have vastly different code, it seems that measuring them separately is an improvement.

@mratsim (Contributor Author) commented Dec 11, 2019

Perhaps we should store the binary test files in a submodule. Don't we expect the number of test cases to grow over time?

Yes, that's what I proposed during the talk: having a scenario submodule for tests, fuzzing, and benchmarking cases that are not covered by the EF suite.

@zah commented Dec 11, 2019

I agree with most of Jacek's comments, but having an intrusive profiler is still sometimes useful for obtaining very high-level profiling information, where you exploit your knowledge of the structure of the code to measure very specific operations of interest. Such functionality already exists in nim-metrics, though. It can benefit from the added sophistication here when it comes to detecting CPU features and obtaining more "portable" results (e.g. number of cycles).

@zah commented Dec 11, 2019

Mamy also gives very good reasons for having the intrusive profiling as an option for end users. I guess we don't need to stick to a single tool and we can learn to love the plethora of options.

@mratsim (Contributor Author) commented Dec 12, 2019

I've looked into a mature backend that would work on Windows and Mac (i.e. not perf) and would not require signing up with either Apple or Intel (i.e. not Instruments or VTune).

I found LLVM XRay:

https://llvm.org/docs/XRay.html
https://lists.llvm.org/pipermail/llvm-dev/2018-February/121237.html

I can add a -d:benchXray flag that would change the compiler to clang; instead of {.nbench.} inserting symbols at function entries and exits, it would change the function signatures to __attribute__((xray_always_instrument)) to use the XRay backend.

This can be done with codegenDecl, like so {.pragma: xray, codegenDecl: "__attribute__((xray_always_instrument)) $# $#$#".}. Some care is needed if another pragma like inline is already present due to nim-lang/Nim#10682.
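
Roughly, a sketch (the proc is hypothetical; the -fxray-instrument flag and the XRAY_OPTIONS / llvm-xray workflow come from the LLVM docs linked above):

# compile with: nim c --cc:clang --passC:-fxray-instrument --passL:-fxray-instrument xdemo.nim
{.pragma: xray, codegenDecl: "__attribute__((xray_always_instrument)) $# $#$#".}

proc hotPath(n: int): int {.xray.} =
  ## The attribute forces XRay to instrument this proc regardless of its size.
  for i in 0 ..< n:
    result += i

echo hotPath(1_000_000)

Running with XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic" ./xdemo then produces a log that llvm-xray account can summarize.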

Assuming XRay is here to stay in LLVM, I believe it fits both Jacek's requirements and mine:

  • Jacek:
    • Industry grade
    • No overhead / well coded
    • Mature ecosystem
  • Mine:
    • Easy to ship to users
    • No extra dependencies to make it run
    • Custom reporting (besides the Prometheus/eth-metrics bindings)

The reporting side seems underdeveloped and needs to be explored; Trail of Bits seems to like it: https://blog.trailofbits.com/2019/10/03/tsc-frequency-for-all-better-profiling-and-benchmarking/

@zah commented Dec 16, 2019

This will have to be rebased after the latest 0.9.3 spec changes got merged.

@mratsim force-pushed the nbench branch 3 times, most recently from 9d03435 to 26fdf8c on December 18, 2019 18:03
@mratsim (Contributor Author) commented Dec 20, 2019

Can be merged before 0.9.4 cc @tersec to avoid little merge conflicts

@tersec (Contributor) commented Dec 20, 2019

Can be merged before 0.9.4 cc @tersec to avoid little merge conflicts

Sounds good.

@mratsim merged commit 106352a into devel on Dec 20, 2019
The delete-merged-branch bot deleted the nbench branch on December 20, 2019 16:14
zah pushed a commit that referenced this pull request Jan 6, 2020
* nbench PoC

* Remove the yaml files from the example scenarios

* update README with current status

* Add an alternative implementation that uses defer

* Forgot to add the old proc body

* slots-processing

* allow benching state_transition failures

* Add Attestations processing (workaround confutils bugs:
- status-im/nim-confutils#10
- status-im/nim-confutils#11
- status-im/nim-confutils#12)

* Add CLI command in the readme

* Filter report and add notes about CPU cycles

* Report averages

* Add debugecho style time/cycle print

* Report when we skip BLS and state root verification

* Update to 0.9.3

* Generalize scenario parsing

* Support all block processing scenarios

* parallel bench runner PoC

* Better load-issues reporting (the load issues were invalid signatures and were expected to fail)