The first run is significantly faster even if .warmup() is used #46
Have you tried adding a noop function? |
With a noop task added, the issue is resolved only when .warmup() is also used. So, is it an issue with .warmup()?

import { Bench } from "tinybench"

(async () => {
  const bench = new Bench()
  bench.add("noop", () => { })
  for (const [k, v] of Object.entries({
    "a": (i) => i,
    "b": (i) => i,
    "c": (i) => i,
  })) {
    bench.add(k, () => {
      for (let i = 0; i < 1000; i++) {
        v(i)
      }
    })
  }
  await bench.warmup()
  await bench.run()
  console.table(bench.table())
})()

$ node bench.mjs # with noop
┌─────────┬───────────┬─────────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼─────────────┼────────────────────┼──────────┼─────────┤
│ 0 │ 'noop' │ '3,936,461' │ 254.03525639294233 │ '±1.55%' │ 1968231 │
│ 1 │ 'a' │ '106,091' │ 9425.814974822304 │ '±0.13%' │ 53046 │
│ 2 │ 'b' │ '106,435' │ 9395.372655685504 │ '±0.11%' │ 53218 │
│ 3 │ 'c' │ '108,210' │ 9241.250181165182 │ '±0.09%' │ 54106 │
└─────────┴───────────┴─────────────┴────────────────────┴──────────┴─────────┘
$ node bench.mjs # without noop
┌─────────┬───────────┬───────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼───────────┼────────────────────┼──────────┼─────────┤
│ 0 │ 'a' │ '469,971' │ 2127.7896772307904 │ '±0.62%' │ 234986 │
│ 1 │ 'b' │ '88,462' │ 11304.176861550312 │ '±0.17%' │ 44232 │
│ 2 │ 'c' │ '107,052' │ 9341.195247464535 │ '±0.12%' │ 53527 │
└─────────┴───────────┴───────────┴────────────────────┴──────────┴─────────┘
$ node bench.mjs # with noop but without warmup
┌─────────┬───────────┬─────────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼─────────────┼───────────────────┼──────────┼─────────┤
│ 0 │ 'noop' │ '4,277,324' │ 233.7910108129855 │ '±0.40%' │ 2138663 │
│ 1 │ 'a' │ '1,156,223' │ 864.8847551251565 │ '±0.25%' │ 578112 │
│ 2 │ 'b' │ '107,419' │ 9309.331318347942 │ '±0.29%' │ 53710 │
│ 3 │ 'c' │ '107,332' │ 9316.871354411698 │ '±0.34%' │ 53667 │
└─────────┴───────────┴─────────────┴───────────────────┴──────────┴─────────┘ |
This is a trait many benchmarking tools share, because it's related to the engine and the JIT rather than to the library. I don't have a solution for it currently other than the warmup + noop combination; I need to look into it. Meanwhile, let me know if you find anything. Thank you for the issue. |
This is interesting. Part of it seems to be something with

for (const [k, v] of Object.entries({
  "a": (i) => i,
  "b": (i) => i,
  "c": (i) => i,
})) {
  bench.add(k, () => {
    for (let i = 0; i < 1000; i++) {
      v(i)
    }
  })
}

I get the following result:
which corresponds with your results. Somewhat surprisingly, we get a similar result, although not as 'extreme', with the following:

function a(i: number) {
  return i + 1;
}
for (const [k, v] of Object.entries({
  a: a,
  b: a,
  c: a,
})) {
  bench.add(k, () => {
    for (let i = 0; i < 1000; i++) {
      v(i)
    }
  })
}

and the results:
(would assume …) This is consistent; if you change the order, the first task is always way faster. But if we instead do the following:

bench.add("a", () => {
  for (let i = 0; i < 1000; i++) {
    a(i);
  }
});
bench.add("b", () => {
  for (let i = 0; i < 1000; i++) {
    a(i);
  }
});
bench.add("c", () => {
  for (let i = 0; i < 1000; i++) {
    a(i);
  }
});

the results change:
And when we add a "noop" task (…)

Seems to be consistent (even without …). So it could be fine if we just run a "noop" operation for one second (or 'a while') in every call to … (all tests were run on linux 6.3.9 and node v20.3.1) |
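For what it's worth, a rough sketch of that "run a noop for a while before each task" idea, outside of tinybench. spinNoop, runAll, tasks, and measure are made-up names, and the one-second duration is arbitrary:

// Hypothetical sketch: spin a throwaway noop for ~1 second before measuring
// each task, so the engine is no longer specialised for the previous task.
const now = process.hrtime.bigint;

function spinNoop(durationNs = 1_000_000_000n) {
  const noop = () => {};
  for (const end = now() + durationNs; now() < end; ) {
    noop();
  }
}

async function runAll(tasks, measure) {
  for (const [name, fn] of tasks) {
    spinNoop();               // per-task "reset" phase
    await measure(name, fn);  // whatever actually collects the samples
  }
}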
That could be a good solution to inject into the library, and then remove from the results. |
I encountered this too. I "solved" my problem by doing a bunch of warmup runs before running the actual runs. This shim seemed to work rather well for sync-only stuff:

// untimed warmup pass, ~1 second
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  f();
}
// timed pass, ~1 second
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  start = now();
  f();
  end = now();
  total += end - start;
  count++;
}

In full:

const bench = {
  tasks: [],
  results: {},
  add(n, f) {
    this.tasks.push([n, f]);
  },
  run() {
    let start;
    let end;
    const now = process.hrtime.bigint;
    let n;
    let f;
    let x;
    let y;
    let count;
    let total;
    for (let i = 0; i < this.tasks.length; i++) {
      n = this.tasks[i][0];
      f = this.tasks[i][1];
      // reset the accumulators for each task so samples don't leak between tasks
      count = 0n;
      total = 0n;
      // untimed warmup pass, ~1 second
      for (x = now(), y = x + 1000000000n; x < y; x = now()) {
        f();
      }
      // timed pass, ~1 second
      for (x = now(), y = x + 1000000000n; x < y; x = now()) {
        start = now();
        f();
        end = now();
        total += end - start;
        count++;
      }
      this.results[n] = total / count;
    }
  },
  table() {
    return this.results;
  },
};

function a(i) {
  return i + 1;
}

for (const [k, v] of Object.entries({
  a: a,
  b: a,
  c: a,
})) {
  bench.add(k, () => {
    for (let i = 0; i < 1000; i++) {
      v(i);
    }
  });
}

await bench.run();
console.table(bench.table()); |
I think we should somehow add a noop warmup to benchmarking, so users won't have to do it themselves. |
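One possible shape for that, sketched against the Bench/add/warmup/run/table calls already used in this thread; the wrapper function and the hidden "__noop__" task name are made up, not part of tinybench:

import { Bench } from "tinybench";

// Hypothetical wrapper: inject a hidden noop task before the user's tasks,
// then drop it from the reported results.
async function runWithHiddenNoop(configure) {
  const bench = new Bench();
  bench.add("__noop__", () => {});  // injected decoy/warmup task
  configure(bench);                 // caller adds the real tasks here
  await bench.warmup();
  await bench.run();
  // "Task Name" matches the column shown in the tables earlier in this thread
  return bench.table().filter((row) => row["Task Name"] !== "__noop__");
}

// usage
const rows = await runWithHiddenNoop((bench) => {
  bench.add("a", () => { /* real task body */ });
});
console.table(rows);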
I think this might be fixable by re-working the way you divide between warmup() and run(), doing something like this:

// defaults to true
await bench.run()
await bench.run({ warmup: true })
await bench.run({ warmup: false })

That way you let the JIT optimize per run instead of all upfront, which then gets destroyed once you're done with the first task. Basically this:

// WARMUP PART
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  f();
}
// ACTUAL BENCHMARK
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  start = now();
  f();
  end = now();
}

👆 at least that kind of strategy worked for me 🤷♂️ might be worth a try |
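A minimal sketch of what a per-task warmup inside run() might look like. This is hypothetical, not tinybench's implementation; tasks, task.fn, opts.warmup, and the durations are assumed names and values:

// Hypothetical run({ warmup }) that does an untimed warmup phase per task,
// immediately before that task's timed phase.
const now = process.hrtime.bigint;

async function run(tasks, opts = { warmup: true }) {
  const results = [];
  for (const task of tasks) {
    if (opts.warmup) {
      // untimed spin so the JIT specialises for *this* task's function
      for (const end = now() + 500_000_000n; now() < end; ) {
        task.fn();
      }
    }
    // timed phase
    let total = 0n;
    let count = 0n;
    for (const end = now() + 1_000_000_000n; now() < end; ) {
      const start = now();
      task.fn();
      total += now() - start;
      count++;
    }
    results.push({ name: task.name, avgNs: Number(total / count) });
  }
  return results;
}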
Hello, a Java passerby here. First things first: if you deoptimize the code, there's little sense in benchmarking it. While I don't have full insight into Node, at this point you're either benchmarking a low-quality JIT pass or, even worse, an interpreter. Benchmarking tools are supposed to gather data about real-life scenarios, and in those we expect functions to be optimized. There is no guarantee that if F1 is faster than F2 while deoptimized, the relation is preserved under optimization. Further, benchmarking exists not to get fancy numbers, but to find root causes and explanations (and, of course, there is an explanation for this behavior). I also found it a bit surprising that the issue is being ignored rather than researched - benchmarking tools have to be very certain about the results they provide, and pursuing merely stable output doesn't guarantee it's correct. Whenever a thing like this shows up in the background, it has to be researched, otherwise every result obtained by the tool may be wrong.

Now to the gory explanatory stuff. It turned out to be pretty simple; I've observed the same behavior in toy scripts. The first clue is the fact that only the first benchmark exhibits strange behavior, no matter how you shuffle them. Moreover, it is not some kind of outlier: no matter how long you run it, the numbers and relations stay the same. The problem here is systematic and infrastructure-level; the first question is whether it's the framework or something deeper. Since I reproduced it in toy scripts, we can be certain it's rooted in Node itself.

To continue poking at the problem, note that while the functions are identical, they are not the same object. What if all three benchmarks used the same function? This result
Transforms into
So we're on the right track here: there is an optimization that relies on identical input being used. Are there any additional differences between the invocations of the first and the second benchmark? It's the name,
This will produce an unpleasant few thousand lines of output, but the answer is there. Ctrl+F 'runTask' ->
Task.run itself is being recompiled. If there is only one benchmark, there is a single compilation; if there are two or more, there are three compilations. I don't have time to spend on finding out why the third one appears (if you check it, there's a fun fact), but let's look at the first two
The benchmark function isn't inlined anymore, because, well, this run doesn't include it - and TurboFan has to recompile everything it had previously assembled into a giant blob, keeping in mind that everything except
results:
There isn't much suspense here yet. However, as we now know, the function is inlined. While anything can be expected from an interpreted language (and from others too, just not as often), including excessive machine code in some specific cases, 60ns translates to 240 cycles at 4GHz, which is roughly what my machine should be running at. For an inlined function that does absolutely nothing, that's a bit too much; it should take literally zero instructions. Let's fiddle with
And then we get
That's one CPU cycle per call, which is likely the loop increment (it's not quite that simple, though; a CPU may execute more than one instruction per cycle). Technically, the JIT should have dropped the loop altogether, because this code does virtually nothing, but I guess we're not there yet. Why am I getting at this? If this had been the result from the very beginning, guessing the inlining thing would have been trivial. Of course, this deserves a bit of explanation itself. The timer calls are not free; they are relatively expensive (55ns, as we see), so this benchmark actually measures the timer rather than the passed code. What's happening is that the timer function is …

Dear god. How can this be fixed!?

In Java, we use compiler instructions about what can and can't be inlined. We're a bit blessed with that - Node doesn't seem to have function-specific knobs, only global ones. However, there is an escape hatch (and we use it too, for the same reasons): each benchmark should run in a separate process (and these processes must not run in parallel). In that case, all optimizations are restricted to whatever code is provided by the user, and it's their responsibility to supply homogeneous or heterogeneous input to reflect their real workload. This is still far from perfect - the JIT may decide to inline one thing and not another - but it at least gives a decent chance to compare apples to apples. |
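A minimal sketch of that process-per-benchmark idea, assuming a hypothetical child script (bench-one.mjs) that benchmarks a single named task and prints its result; only the spawnSync call is real Node API:

// Sketch only: run each benchmark in its own Node process so the JIT state of
// one task cannot influence another. "bench-one.mjs" and its task-name
// argument are made up for illustration.
import { spawnSync } from "node:child_process";

const tasks = ["a", "b", "c"];

for (const name of tasks) {
  // processes run one after another, never in parallel
  const child = spawnSync(process.execPath, ["bench-one.mjs", name], {
    encoding: "utf8",
  });
  console.log(name, child.stdout.trim());
}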
I've been comparing the performance of two versions of the same function, and I've noticed that the first one is always significantly faster, possibly due to JIT.
bench.mjs
I've also tried increasing time and using .warmup(), but it didn't help much. Am I missing something? Or could there be an option to run benchmarks in the sequence a, b, c, a, b, c, ...?
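For illustration, an interleaved a, b, c, a, b, c, ... schedule could be hand-rolled like this sketch; it is not an existing tinybench option, and runRoundRobin and its parameters are made-up names:

// Hypothetical round-robin runner: instead of exhausting one task's samples
// before moving on, take one timed sample from each task per round.
const now = process.hrtime.bigint;

function runRoundRobin(tasks, rounds = 100000) {
  const stats = new Map(tasks.map(([name]) => [name, { total: 0n, count: 0n }]));
  for (let r = 0; r < rounds; r++) {
    for (const [name, fn] of tasks) {
      const start = now();
      fn();
      const s = stats.get(name);
      s.total += now() - start;
      s.count++;
    }
  }
  for (const [name, s] of stats) {
    console.log(name, `${s.total / s.count} ns/op (avg)`);
  }
}

// usage (names and bodies mirror the example from bench.mjs)
function work(i) { return i; }
runRoundRobin([
  ["a", () => { for (let i = 0; i < 1000; i++) work(i); }],
  ["b", () => { for (let i = 0; i < 1000; i++) work(i); }],
  ["c", () => { for (let i = 0; i < 1000; i++) work(i); }],
]);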
Versions: