Nbench - Flexible benchmarking of Nimbus internals #641
Conversation
keep in mind that any numbers coming out of here will be fraught with issues that other tools have solved already:
the described problems with established tools like perf sound superficial:
instead of doing this from scratch, it would be nice if it could plug into some existing framework. Anyway, as a 5-minute job to quickly gain some intuition, sure, but usually the poor man's profiler (…)
# From Intel SDM
# https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
proc getTicks*(): int64 {.inline.} =
You're not measuring CPU cycles here, because modern CPUs give you a constant rate TSC (usually at the max CPU freq, regardless of the actual frequency): https://en.wikipedia.org/wiki/Time_Stamp_Counter
Yes I know, I mentioned it here:
- The clock / cycle counts will be off if the CPU is overclocked; you can run this to check whether they match: https://github.com/status-im/nim-beacon-chain/blob/85bc134d06da9ec7bd915835355fcc659f3961e1/nbench/platforms/x86.nim#L116-L122
Unfortunately, correcting for this / finding the CPU frequency programmatically is non-trivial. I tried to find a way to evaluate the difference, but Linux has no equivalent of Windows' QueryPerformanceFrequency that would allow evaluating the ratio.
Not just when it's overclocked, but when it scales its frequency with respect to the load and when it throttles it due to thermal constraints.
This means that, in practice, just reading TSC will never give accurate CPU cycles.
You can get accurate average CPU frequency over a given interval with the APERF/MPERF ratio. These are MSR registers. Some info here: https://lwn.net/Articles/283769/
They're used in cpupower (https://github.com/torvalds/linux/blob/master/tools/power/cpupower/utils/idle_monitor/mperf_monitor.c) to get these average frequencies:
cpupower monitor -m Mperf
      | Mperf
CPU   | C0    | Cx    | Freq
    0 |  4.70 | 95.30 | 2163
    1 |  4.69 | 95.31 | 1825
    6 |  4.97 | 95.03 | 2140
    3 |  4.93 | 95.07 | 1986
    7 | 15.06 | 84.94 | 3353
    4 |  5.15 | 94.85 | 1822
    2 | 12.13 | 87.87 | 3304
    5 | 13.76 | 86.24 | 3176
Yes of course, however I don't agree they are solved.
Any measurement is biased; the injected overhead is a small constant factor, linear in the number of function calls, plus or minus cache misses, though the counters should stay hot in the L1/L2 cache. Beyond that, all frameworks suffer from non-determinism due to background processes and the kernel potentially not granting them a CPU timeslice because of competing process priorities.
I don't have the call graph, sure, but as I said, better to use perf/Instruments/VTune for that.
The overhead is negligible when it's in the tens of cycles while a BLS pairing is in the …
The tooling around it is poor: there are Apple Instruments, VTune and KCachegrind, but dumping to CSV and passing that to proper data-visualization tools in R or Python is much better.
Statistics can always be misleading, but you can draw conclusions from noisy samples (even if it's the absence of conclusion).
Assume you have a user on Windows who tells you they seem to have a performance issue: you can't use backend-specific benchmarking tools without being flexible across Linux, Windows and macOS backends.
Nothing prevents that; however, perf is non-portable and the majority of users are on Windows.
The main features to add right now are mentioned in the first post:
If we go over that:
Lastly here are the alternative routes that were evaluated:
In terms of usage, we want to measure bottlenecks or regressions; we are not interested in the 1% but in the 5-7%, and the overhead of 2 atomic increments in an application dominated by network IO and crypto is negligible. As we've seen with our woes with the go-libp2p-daemon, the standard library, pcre or RocksDB, the ability to control our dependencies and how we deliver the code to users is incredibly valuable to streamline our instructions.
Perhaps we should store the binary test files in a submodule. Don't we expect the number of test cases to grow over time?
template fnEntry(id: int, startTime, startCycle: untyped): untyped =
  ## Bench tracing to insert on function entry
  {.noSideEffect.}:
    discard BenchMetrics[id].numCalls.atomicInc()
It's possible to produce very slightly better code by relying on the {.global.} pragma here:
var benchMeta {.global.} = Metadata(...)
{.noSideEffect.}:
atomicInt benchMeta.numCalls
The optimisation will come from the fact that the address of the metadata object will be known at compile time and the field will be incremented without additional indirections. You may need an extra proc registerMetadata taking the address of the global variable and adding it to a registry.
In this case the address of BenchMetrics[id] is also known at compile time, since id is a const. The registerMetadata call would have to be done at runtime, since we don't know the actual address at compile time, so unless I'm missing something we would have
var benchMeta {.global.} = Metadata(...)
BenchMetrics.registerMetadata(benchMeta)
{.noSideEffect.}:
atomicInt benchMeta.numCalls
We can check the assembly produced, but since BenchMetrics is a global var it's in the BSS and all accesses should have no indirection (i.e. BSS start + BenchMetrics offset). Also, given that it will always be hot in cache, there shouldn't be any cache misses.
But the sequence is on the heap, so you are offsetting an unknown pointer.
There are two ways to do the registration:
var benchMeta {.global.} = Metadata(...)
var dummy {.global.} = register(benchMeta.addr)
or
var benchMeta {.global.} = createBenchMeta(...)
proc createBenchMeta(...): Metadata =
...
register result.addr
The second one is slightly trickier, but it should be fine due to the way Nim implements NRVO. You can consult the generated code to be sure.
Since the size is known at compile-time, it should be transformable to a global array
Another thing to consider is that the {.global.} var approach will work differently for generic functions: each instantiation will get its own counter. I'm not completely sure whether this is the better treatment here, but since different generic instantiations can have vastly different code, it seems that measuring them separately is an improvement.
Yes, that's what I proposed during the talk: having a scenario submodule for tests, fuzzing and benchmarking that are not covered by the EF suite.
I agree with most of Jacek's comments, but having an intrusive profiler is still useful sometimes for obtaining very high-level profiling information, where you exploit your knowledge of the structure of the code to measure very specific operations of interest. Such functionality already exists in nim-metrics, though. It can benefit from the added sophistication here when it comes to detecting CPU features and obtaining more "portable" results (e.g. number of cycles).
Mamy also gives very good reasons for having the intrusive profiling as an option for end users. I guess we don't need to stick to a single tool and we can learn to love the plethora of options.
I've looked into a mature backend that would work on Windows and Mac (i.e. not perf) and would not require signing up with either Apple or Intel (i.e. not Instruments or VTune): I have found LLVM XRay: https://llvm.org/docs/XRay.html. I can add the required function attribute via codegenDecl. Assuming XRay is here to stay in LLVM, I believe it fits both Jacek's and my requirements:
The reporting seems underdeveloped and needs to be explored; Trail of Bits seem to like it: https://blog.trailofbits.com/2019/10/03/tsc-frequency-for-all-better-profiling-and-benchmarking/
This will have to be rebased after the latest 0.9.3 spec changes got merged.
Can be merged before 0.9.4 cc @tersec to avoid little merge conflicts
Sounds good.
* nbench PoC
* Remove the yaml files from the example scenarios
* update README with current status
* Add an alternative implementation that uses defer
* Forgot to add the old proc body
* slots-processing
* allow benching state_transition failures
* Add Attestations processing (workaround confutils bugs: status-im/nim-confutils#10, status-im/nim-confutils#11, status-im/nim-confutils#12)
* Add CLI command in the readme
* Filter report and add notes about CPU cycles
* Report averages
* Add debugecho-style time/cycle print
* Report when we skip BLS and state root verification
* Update to 0.9.3
* Generalize scenario parsing
* Support all block processing scenarios
* parallel bench runner PoC
* Better load issues reporting (the load issues were invalid signatures and expected to fail)
This lays out the groundwork for nbench, a CLI tool to benchmark specific parts of Nimbus internals. Any proc can be added to the benchmark by tagging it with the {.nbench.} pragma. Unfortunately it is not possible to just {.push nbench.} at the moment due to nim-lang/Nim#12867. WIP, but it can be merged as it's mostly living in its own repo, besides adding {.nbench.} in choice places.
Some highlights:
I created macros that hook into the function entry and exit points. This can be reused for Chronicles logging at the tracing level without having to add those manually like what was done here:
https://github.com/status-im/nim-beacon-chain/blob/ffd0c052ee759fdbe0ea7b74588bd470d45ca15a/beacon_chain/spec/state_transition_epoch.nim#L409-L413
The macro accepts simple templates
https://github.com/status-im/nim-beacon-chain/blob/85bc134d06da9ec7bd915835355fcc659f3961e1/nbench/bench_lab.nim#L173-L187
Those can be expanded with our benchmarking need.
The insertion at the exit point is approximate: it doesn't count the last return statement's expression, so if that is costly, it won't show up at the moment.
Ultimately it should be merged with ncli. Or at least the precise state_transition configuration should be shared, as ncli can only do a full state transition at the moment but cannot be asked to do only process_deposits, for example.
I didn't choose to use nimprof's sampling approach (https://github.com/nim-lang/Nim/blob/v1.0.4/lib/pure/nimprof.nim): it piggybacks on Nim stacktraces, which are eaten by templates and are slow.
We probably want to be selective instead of including all stack traces.
Tools like perf, Apple Instruments, or Intel VTune already do sampling really well, down to the offending line with the assembly in front.
What they don't do is
I.e. it should be a complement that can be enabled flexibly with perf for drilling down if we detect something unusual.
The counters should be migrated to nim-metrics so that they are available on Grafana, as they would be valuable for the Metrics/Public Grafana dashboard of the testnet (#637) and probably for catching performance regressions.
The hooks could be exported to nim-metrics and used more widely.
CSV backend coming (or JSON or both)
Tags to get summaries of subsystems like "crypto", "block processing", "epoch processing" are planned.
It requires -std=gnu99 to compile the assembly statement properly, which might conflict with Milagro's -std=c99 (it didn't on my machine, but who knows).
Notable things missing:
Polish: If someone wants to do a pass on the command-line param feel free
The example scenarios are taken from the test suite, but we probably want to craft our own.
Maybe an external scenario repo that is submoduled is worthwhile. It can also be used for fuzzing
or for testing specific state transitions that are not part of the EF suite.
The clock / cycle counts will be off if the CPU is overclocked; you can run this to check whether they match: https://github.com/status-im/nim-beacon-chain/blob/85bc134d06da9ec7bd915835355fcc659f3961e1/nbench/platforms/x86.nim#L116-L122
Unfortunately, correcting for this / finding the CPU frequency programmatically is non-trivial.