ConsumerWorkService / BatchingWorkPool implementation is a performance bottleneck #251

Closed · bording opened this issue Aug 26, 2016 · 11 comments

@bording (Collaborator) commented Aug 26, 2016

Recently, I've been doing some performance profiling of the NServiceBus RabbitMQ transport and I've come across some interesting findings.

When I started my performance investigation, the main machine I was using was an older Sandy Bridge i7-2600K. In the middle of this, I put together a new development machine that has a brand new Skylake i7-6700K. On my new machine, I noticed the throughput numbers I was seeing in my tests were much lower than on my older machine!

I put together a repro project here and was able to run the scenario on a number of different machines:

Core 2 Duo E6550
--------------
1 consumer: 6908
32 consumer: 7667

i7-2600K
--------------
1 consumer: 7619
32 consumer: 11124

i7-4870HQ
--------------
1 consumer: 6103
32 consumer: 5237

i5-5200U
--------------
1 consumer: 7178
32 consumer: 1928

i7-6500U
--------------
1 consumer: 2243
32 consumer: 1253

i7-6700K
--------------
1 consumer: 6476
32 consumer: 2213

Those numbers are messages/sec. While the actual numbers aren't that important, the general trend is problematic: on every CPU newer than 2nd-gen Core (Sandy Bridge) that I tested, performance gets worse as the number of consumers increases.

After spending some time analyzing profiler results, my colleague @Scooletz and I came to the conclusion that the current design of the ConsumerWorkService / BatchingWorkPool is the culprit. There is a lot of lock contention, and for some reason it hits the newer CPUs much harder.

We have been able to come up with two alternative designs that eliminate this performance problem. Each has trade-offs, so we wanted to show you both of them.

The first approach is largely focused on the BatchingWorkPool itself. All locks have been removed, and concurrent collections are used everywhere instead. The main trade-off is that with this approach we can't guarantee per-channel operation order any more. While I don't think that guarantee is critical, I know it's a behavior change from the current design. PR #252 covers this change.
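
To make the trade-off concrete, here is a minimal sketch of the lock-free idea, assuming a pool keyed by channel; it is illustrative only and not the actual PR #252 code. Because any worker thread can drain any ready key, two workers can end up running items for the same channel concurrently, which is where per-channel ordering is lost.

```csharp
using System.Collections.Concurrent;

// Minimal sketch of a lock-free work pool keyed by channel (hypothetical, not PR #252).
// Work items live in per-key ConcurrentQueues; a shared "ready" queue signals which keys
// have pending work. No locks are taken, but several workers may drain the same key's
// queue at once, so items for one channel may complete out of order.
public class LockFreeWorkPool<TKey, TWork>
{
    private readonly ConcurrentDictionary<TKey, ConcurrentQueue<TWork>> pools =
        new ConcurrentDictionary<TKey, ConcurrentQueue<TWork>>();
    private readonly ConcurrentQueue<TKey> ready = new ConcurrentQueue<TKey>();

    public void RegisterKey(TKey key)
    {
        pools.TryAdd(key, new ConcurrentQueue<TWork>());
    }

    public void AddWorkItem(TKey key, TWork item)
    {
        ConcurrentQueue<TWork> queue;
        if (pools.TryGetValue(key, out queue))
        {
            queue.Enqueue(item);
            ready.Enqueue(key); // no lock; just flag that this key has work
        }
    }

    // Called by any worker thread; returns false when nothing is pending.
    public bool TryDequeue(out TWork item)
    {
        TKey key;
        while (ready.TryDequeue(out key))
        {
            ConcurrentQueue<TWork> queue;
            if (pools.TryGetValue(key, out queue) && queue.TryDequeue(out item))
            {
                return true;
            }
        }
        item = default(TWork);
        return false;
    }
}
```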

The second approach focuses on changing the ConsumerWorkService instead. It no longer uses the BatchingWorkPool at all; instead it creates a dedicated thread per model that is responsible for dispatching that model's work items in a loop. With this approach, per-channel operation order should be maintained, but there is no longer a way to pass in a custom TaskScheduler to limit concurrency, because using a custom scheduler to start the loops could mean that a model never gets to process work at all. @Scooletz should be opening a PR with this change soon.
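
For comparison, here is a minimal sketch of the dedicated-thread-per-model idea, again illustrative rather than the actual PR code. A single long-running thread drains one channel's queue, so that channel's work items run strictly in enqueue order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of a per-model dispatch loop (hypothetical). Each channel gets its own queue
// and its own long-running thread, so work items for that channel run in FIFO order.
public class ModelWorker : IDisposable
{
    private readonly BlockingCollection<Action> work = new BlockingCollection<Action>();
    private readonly Thread thread;

    public ModelWorker()
    {
        thread = new Thread(Loop) { IsBackground = true };
        thread.Start();
    }

    public void Enqueue(Action action)
    {
        work.Add(action);
    }

    private void Loop()
    {
        // Blocks until an item arrives; exits once CompleteAdding has been called
        // and the queue is drained.
        foreach (var action in work.GetConsumingEnumerable())
        {
            action();
        }
    }

    public void Dispose()
    {
        work.CompleteAdding();
        thread.Join();
    }
}
```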

@michaelklishin (Member)

ConsumerWorkService has to guarantee per-channel dispatch ordering and dispatch in such a way that consumers can use synchronous protocol methods (e.g. queue.declare). As long as we satisfy those two, any changes are on the table.
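
For illustration, the second requirement means a consumer must be able to do something like the following from inside its delivery handler (the queue name here is hypothetical). The work service therefore has to dispatch deliveries on a thread that is not also responsible for reading the broker's replies, or the synchronous call would never receive its response.

```csharp
using RabbitMQ.Client;

// A consumer that performs a synchronous protocol call (queue.declare) from within its
// delivery handler. This only works if deliveries are dispatched off the connection's
// reader thread, otherwise the handler would block the thread that reads the reply.
public class DeclaringConsumer : DefaultBasicConsumer
{
    public DeclaringConsumer(IModel model) : base(model) { }

    public override void HandleBasicDeliver(string consumerTag, ulong deliveryTag,
        bool redelivered, string exchange, string routingKey,
        IBasicProperties properties, byte[] body)
    {
        // Synchronous RPC to the broker from inside the handler.
        Model.QueueDeclare("follow-up-queue", durable: true, exclusive: false,
            autoDelete: false, arguments: null);

        Model.BasicAck(deliveryTag, multiple: false);
    }
}
```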

@michaelklishin (Member)

@bording both options sound reasonable, and it looks like you are aware of the ConsumerWorkService's raison d'être. Can you please post some data to compare to each pull request? Thanks again for stepping up to help!

@bording (Collaborator, Author) commented Aug 26, 2016

> ConsumerWorkService has to guarantee per-channel dispatch ordering and dispatch in such a way that consumers can use synchronous protocol methods (e.g. queue.declare). As long as we satisfy those two, any changes are on the table.

From what I understand, that sort of thing shouldn't be affected by a change to ConsumerWorkService. The ConcurrentConsumerDispatcher is the only class that uses ConsumerWorkService, and it deals with the following callbacks (a rough sketch follows the list):

  • HandleBasicConsumeOk
  • HandleBasicDeliver
  • HandleBasicCancelOk
  • HandleBasicCancel
  • HandleModelShutdown
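
A rough, hypothetical sketch of that relationship, not the actual ConcurrentConsumerDispatcher source: the dispatcher wraps each consumer callback in a work item and hands it to the work service keyed by the owning model (the `enqueue` delegate below stands in for whatever enqueue method the work service exposes).

```csharp
using System;
using RabbitMQ.Client;

// Hypothetical illustration of a consumer dispatcher: each protocol callback is wrapped
// in a work item and enqueued against the owning model, so the work service only ever
// sees opaque actions and the protocol handling itself is unchanged.
public class DispatcherSketch
{
    private readonly IModel model;
    private readonly Action<IModel, Action> enqueue; // stand-in for the work service's enqueue

    public DispatcherSketch(IModel model, Action<IModel, Action> enqueue)
    {
        this.model = model;
        this.enqueue = enqueue;
    }

    public void HandleBasicDeliver(IBasicConsumer consumer, string consumerTag,
        ulong deliveryTag, bool redelivered, string exchange, string routingKey,
        IBasicProperties properties, byte[] body)
    {
        enqueue(model, () => consumer.HandleBasicDeliver(
            consumerTag, deliveryTag, redelivered, exchange, routingKey, properties, body));
    }
}
```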

> Can you please post some data to compare to each pull request?

Once we've got both PRs up, I should be able to post some stats that show the performance differences between the two. So far, they appear to be fairly close, and both are much better than the above numbers.

@bording (Collaborator, Author) commented Aug 30, 2016

Now that both PRs are up, here are some quick comparison numbers. This is running the same scenario as above:

i7-6700K talking to a broker running in a VM on the same machine:

before
--------------
1 consumer: 7325
32 consumer: 2364

PR 252
--------------
1 consumer: 12649
32 consumer: 12129

PR 253
--------------
1 consumer: 12763
32 consumer: 12356

Those are just quick comparisons: each scenario was run twice and the results averaged. A larger number of runs per scenario would be useful, and I'd like to go back and run them on some of the other machines as well, particularly the older Sandy Bridge machine.

@Scooletz (Contributor)

@michaelklishin Could you write a few words about which PR is more likely to be accepted? I'll need to update mine (#253), and I'd like to know more before I chase master.

@michaelklishin (Member)

I haven't had a chance to take a look at both of them. From a quick glance I like what I see in #253.

In fact, rebasing both would make it easier for @kjnilsson and me to review them, as the diffs would be much smaller.

@kjnilsson (Contributor) commented Aug 31, 2016

@Scooletz it isn't that easy to review the PRs completely until they have been rebased against master now that vs-support has been merged in. What I did notice about your PR is that it contains a breaking API change in IConnectionFactory. This I'd like to avoid, at least in the short term. However, @bording's PR currently fails fairly reliably on one of the tests. I need to spend some time reviewing that before I can say whether it is significant or not.

@Scooletz (Contributor)

Thanks for your answers, @michaelklishin and @kjnilsson. I'll rebase my PR then.

@kjnilsson (Contributor)

Thank you @Scooletz

@bording (Collaborator, Author) commented Aug 31, 2016

> This I'd like to avoid, at least in the short term. However, @bording's PR currently fails fairly reliably on one of the tests. I need to spend some time reviewing that before I can say whether it is significant or not.

Per #252 (comment), this failing test is expected, given that approach's inability to guarantee ordering. Since acking multiple delivery tags assumes you've already seen all earlier tags, I'm now thinking that giving up ordering is going to be an unacceptable trade-off.
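
For reference, this is the multi-ack behavior in question (a hypothetical fragment): BasicAck with multiple set to true confirms the given delivery tag and every unacknowledged tag below it, which is only safe when all earlier deliveries have already been handled in order.

```csharp
using RabbitMQ.Client;

// Why dispatch order matters for multi-acks: acknowledging with multiple = true confirms
// the given tag and all outstanding lower tags in a single frame, so it silently acks
// any earlier delivery that an out-of-order dispatcher has not processed yet.
static class AckExample
{
    public static void AckUpTo(IModel channel, ulong latestDeliveryTag)
    {
        channel.BasicAck(latestDeliveryTag, multiple: true);
    }
}
```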

@michaelklishin (Member)

#253 is merged.
