ConsumerWorkService / BatchingWorkPool implementation is a performance bottleneck #251
Comments
@bording both options sound reasonable and it looks like you are aware of the
From what I understand, those sorts of things shouldn't be impacted by a change to
Once we've got both PRs up, I should be able to post some stats that show the performance differences between the two. So far, they appear to be fairly close, and both are much better than the above numbers.
Now that both PRs are up, here are some quick comparison numbers. This is running the same scenario as above:
Those are just some quick comparisons. Each one of those was run twice and I averaged them. A larger number of runs per scenario would be useful, and I'd like to go back and run them on some of the other machines as well, particularly the old Sandy Bridge machine.
@michaelklishin Could you write a few words about your preferences regarding which PR is more likely to be accepted? I'll need to update mine, #253. I'd like to know more before I chase master.
I haven't had a chance to take a look at both of them. From a quick glance I like what I see in #253. In fact, rebasing both would make it easier for @kjnilsson and me to review it as the diff would be much smaller.
@Scooletz it isn't that easy to review the PRs completely until they have been rebased against master after
Thanks for your answers @michaelklishin @kjnilsson. I'll rebase my PR then.
Thank you @Scooletz |
Per #252 (comment), this failing test is expected given that approach's inability to guarantee ordering. Since acking multiple delivery tags assumes you've already seen all earlier tags, I'm now thinking that giving up ordering is going to be an unacceptable trade-off.
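For context, here is a hedged sketch of why multiple-acking relies on ordered dispatch; the queue name and connection details below are placeholders and not taken from either PR. `BasicAck` with `multiple` set to true acknowledges the given delivery tag and every earlier outstanding tag on the channel:

```csharp
// Illustration of the ordering assumption behind multiple-acks; queue name and
// connection details are placeholders, not taken from the PRs.
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class MultipleAckExample
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            var consumer = new EventingBasicConsumer(channel);
            consumer.Received += (sender, ea) =>
            {
                // ... process the message ...

                // Passing true for "multiple" acknowledges ea.DeliveryTag *and every
                // earlier unacknowledged tag on this channel*. If deliveries were
                // dispatched out of order, this could ack a message that has not
                // actually been handled yet.
                channel.BasicAck(ea.DeliveryTag, true);
            };
            channel.BasicConsume("example-queue", false, consumer);

            Console.ReadLine(); // keep the consumer alive for the example
        }
    }
}
```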
#253 is merged. |
Recently, I've been doing some performance profiling of the NServiceBus RabbitMQ transport and I've come across some interesting findings.
When I started my performance investigation, the main machine I was using was an older Sandy Bridge i7-2600K. In the middle of this, I put together a new development machine that has a brand new Skylake i7-6700K. On my new machine, I noticed the throughput numbers I was seeing in my tests were much lower than on my older machine!
I put together a repro project here and was able to run the scenario on a number of different machines:
Those numbers are messages/sec. While the actual numbers aren't that important, the general trend here is problematic. On every CPU I tested that was newer than 2nd-gen Core/Sandy Bridge, the more consumers there are, the worse the performance gets.
After spending some time analyzing some profiler results, my colleague @Scooletz and I came to the conclusion that the current design of the ConsumerWorkService / BatchingWorkPool is the culprit here. There appears to be a lot of lock contention going on, and that seems to be causing a lot of trouble for the newer CPUs for some reason.
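As a rough illustration only, and an assumption rather than the library's actual code, this is the kind of shared, lock-protected work queue pattern that tends to show this contention: every producer and every dispatcher thread serializes on the same lock.

```csharp
// Simplified illustration (assumption, not the library's actual code) of a
// shared work queue guarded by a single lock.
using System;
using System.Collections.Generic;

public class ContendedWorkPool
{
    private readonly object syncRoot = new object();
    private readonly Queue<Action> workItems = new Queue<Action>();

    public void AddWorkItem(Action item)
    {
        lock (syncRoot) // every producing thread serializes here
        {
            workItems.Enqueue(item);
        }
    }

    public bool TryTakeWorkItem(out Action item)
    {
        lock (syncRoot) // every dispatching thread serializes here too
        {
            if (workItems.Count > 0)
            {
                item = workItems.Dequeue();
                return true;
            }
        }
        item = null;
        return false;
    }
}
```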
We have been able to come up with two alternative designs that eliminate this performance problem, though each has some trade-offs, so we wanted to show you both of them.
The first approach is largely focused on the `BatchingWorkPool` itself. All locks have been removed and concurrent collections are used everywhere instead. The main trade-off here is that with this approach we can't guarantee per-channel operation order any more. While I don't think that guarantee is critical, I know it's a behavior change from the current design. PR #252 covers this change.
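To make that trade-off concrete, here is a minimal, hedged sketch of the lock-free direction; the type and member names are illustrative assumptions, not the code in #252.

```csharp
// Minimal sketch of a lock-free work pool built on concurrent collections.
// Names and shapes are illustrative assumptions, not the actual PR #252 code.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class LockFreeWorkPool<TKey>
{
    // One concurrent queue of work items per registered channel/model.
    private readonly ConcurrentDictionary<TKey, ConcurrentQueue<Action>> pools =
        new ConcurrentDictionary<TKey, ConcurrentQueue<Action>>();

    public void RegisterKey(TKey key)
    {
        pools.TryAdd(key, new ConcurrentQueue<Action>());
    }

    public void AddWorkItem(TKey key, Action item)
    {
        ConcurrentQueue<Action> queue;
        if (pools.TryGetValue(key, out queue))
        {
            queue.Enqueue(item);
        }
    }

    // Drain up to maxItems from a channel's queue into the caller's batch.
    // Several worker threads can drain and execute items from the same channel
    // at once, so completion order is no longer guaranteed per channel; this is
    // the ordering trade-off described above.
    public int DrainTo(TKey key, List<Action> batch, int maxItems)
    {
        ConcurrentQueue<Action> queue;
        if (!pools.TryGetValue(key, out queue)) return 0;

        int taken = 0;
        Action item;
        while (taken < maxItems && queue.TryDequeue(out item))
        {
            batch.Add(item);
            taken++;
        }
        return taken;
    }
}
```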
The second approach is focused on changing the `ConsumerWorkService` instead. It no longer uses the `BatchingWorkPool` at all, and instead creates a dedicated thread per model that is responsible for dispatching the work items in a loop. With this approach, per-channel operation order should be maintained, but there will no longer be a way to pass in a custom `TaskScheduler` to limit concurrency. Using a custom scheduler to start the loops could mean that a model would never get to process work at all. @Scooletz should be opening a PR with this change soon.