ConsumerWorkService / BatchingWorkPool implementation is a performance bottleneck #251

Closed · bording opened this issue Aug 26, 2016 · 11 comments

@bording (Collaborator) commented Aug 26, 2016

Recently, I've been doing some performance profiling of the NServiceBus RabbitMQ transport and I've come across some interesting findings.

When I started my performance investigation, the main machine I was using was an older Sandy Bridge i7-2600K. In the middle of this, I put together a new development machine that has a brand new Skylake i7-6700K. On my new machine, I noticed the throughput numbers I was seeing in my tests were much lower than on my older machine!

I put together a repro project here and was able to run the scenario on a number of different machines:

Core 2 Duo E6550
--------------
1 consumer: 6908
32 consumer: 7667

i7-2600K
--------------
1 consumer: 7619
32 consumer: 11124

i7-4870HQ
--------------
1 consumer: 6103
32 consumer: 5237

i5-5200U
--------------
1 consumer: 7178
32 consumer: 1928

i7-6500U
--------------
1 consumer: 2243
32 consumer: 1253

i7-6700K
--------------
1 consumer: 6476
32 consumer: 2213

Those numbers are messages/sec. While the actual numbers aren't that important, the general trend is problematic: on every CPU newer than 2nd-gen Core (Sandy Bridge) that I tested, performance gets worse as the number of consumers increases.

After spending some time analyzing profiler results, my colleague @Scooletz and I came to the conclusion that the current design of the ConsumerWorkService / BatchingWorkPool is the culprit. There is a lot of lock contention, and for some reason it hits the newer CPUs much harder.

We have been able to come up with two alternative designs that eliminate this performance problem. Each has trade-offs, so we wanted to show you both of them.

The first approach is largely focused on the BatchingWorkPool itself. All locks have been removed, and concurrent collections are used everywhere instead. The main trade-off is that with this approach we can't guarantee per-channel operation order any more. While I don't think that guarantee is critical, I know it's a behavior change from the current design. PR #252 covers this change.
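
To make the trade-off concrete, here is a minimal sketch of the lock-free idea, assuming a pool keyed by channel; it is illustrative only and not the actual PR #252 code. Because any worker thread can drain any ready key, two workers can end up running items for the same channel concurrently, which is where per-channel ordering is lost.

```csharp
using System.Collections.Concurrent;

// Minimal sketch of a lock-free work pool keyed by channel (hypothetical, not PR #252).
// Work items live in per-key ConcurrentQueues; a shared "ready" queue signals which keys
// have pending work. No locks are taken, but several workers may drain the same key's
// queue at once, so items for one channel may complete out of order.
public class LockFreeWorkPool<TKey, TWork>
{
    private readonly ConcurrentDictionary<TKey, ConcurrentQueue<TWork>> pools =
        new ConcurrentDictionary<TKey, ConcurrentQueue<TWork>>();
    private readonly ConcurrentQueue<TKey> ready = new ConcurrentQueue<TKey>();

    public void RegisterKey(TKey key)
    {
        pools.TryAdd(key, new ConcurrentQueue<TWork>());
    }

    public void AddWorkItem(TKey key, TWork item)
    {
        ConcurrentQueue<TWork> queue;
        if (pools.TryGetValue(key, out queue))
        {
            queue.Enqueue(item);
            ready.Enqueue(key); // no lock; just flag that this key has work
        }
    }

    // Called by any worker thread; returns false when nothing is pending.
    public bool TryDequeue(out TWork item)
    {
        TKey key;
        while (ready.TryDequeue(out key))
        {
            ConcurrentQueue<TWork> queue;
            if (pools.TryGetValue(key, out queue) && queue.TryDequeue(out item))
            {
                return true;
            }
        }
        item = default(TWork);
        return false;
    }
}
```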

The second approach focuses on changing the ConsumerWorkService instead. It no longer uses the BatchingWorkPool at all; instead it creates a dedicated thread per model that is responsible for dispatching that model's work items in a loop. With this approach, per-channel operation order should be maintained, but there is no longer a way to pass in a custom TaskScheduler to limit concurrency, because using a custom scheduler to start the loops could mean that a model never gets to process work at all. @Scooletz should be opening a PR with this change soon.
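
For comparison, here is a minimal sketch of the dedicated-thread-per-model idea, again illustrative rather than the actual PR code. A single long-running thread drains one channel's queue, so that channel's work items run strictly in enqueue order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of a per-model dispatch loop (hypothetical). Each channel gets its own queue
// and its own long-running thread, so work items for that channel run in FIFO order.
public class ModelWorker : IDisposable
{
    private readonly BlockingCollection<Action> work = new BlockingCollection<Action>();
    private readonly Thread thread;

    public ModelWorker()
    {
        thread = new Thread(Loop) { IsBackground = true };
        thread.Start();
    }

    public void Enqueue(Action action)
    {
        work.Add(action);
    }

    private void Loop()
    {
        // Blocks until an item arrives; exits once CompleteAdding has been called
        // and the queue is drained.
        foreach (var action in work.GetConsumingEnumerable())
        {
            action();
        }
    }

    public void Dispose()
    {
        work.CompleteAdding();
        thread.Join();
    }
}
```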

@michaelklishin (Member)

ConsumerWorkService has to guarantee per-channel dispatch ordering and dispatch in such a way that consumers can use synchronous protocol methods (e.g. queue.declare). As long as we satisfy those two, any changes are on the table.
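
For illustration, the second requirement means a consumer must be able to do something like the following from inside its delivery handler (the queue name here is hypothetical). The work service therefore has to dispatch deliveries on a thread that is not also responsible for reading the broker's replies, or the synchronous call would never receive its response.

```csharp
using RabbitMQ.Client;

// A consumer that performs a synchronous protocol call (queue.declare) from within its
// delivery handler. This only works if deliveries are dispatched off the connection's
// reader thread, otherwise the handler would block the thread that reads the reply.
public class DeclaringConsumer : DefaultBasicConsumer
{
    public DeclaringConsumer(IModel model) : base(model) { }

    public override void HandleBasicDeliver(string consumerTag, ulong deliveryTag,
        bool redelivered, string exchange, string routingKey,
        IBasicProperties properties, byte[] body)
    {
        // Synchronous RPC to the broker from inside the handler.
        Model.QueueDeclare("follow-up-queue", durable: true, exclusive: false,
            autoDelete: false, arguments: null);

        Model.BasicAck(deliveryTag, multiple: false);
    }
}
```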

@michaelklishin (Member)

@bording both options sound reasonable, and it looks like you are aware of the ConsumerWorkService's raison d'être. Can you please post some data to compare to each pull request? Thanks again for stepping up to help!

@bording (Collaborator, Author) commented Aug 26, 2016

> ConsumerWorkService has to guarantee per-channel dispatch ordering and dispatch in such a way that consumers can use synchronous protocol methods (e.g. queue.declare). As long as we satisfy those two, any changes are on the table.

From what I understand, that sort of thing shouldn't be affected by a change to ConsumerWorkService. The ConcurrentConsumerDispatcher is the only class that uses ConsumerWorkService, and it deals with the following callbacks (a rough sketch follows the list):

  • HandleBasicConsumeOk
  • HandleBasicDeliver
  • HandleBasicCancelOk
  • HandleBasicCancel
  • HandleModelShutdown
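
A rough, hypothetical sketch of that relationship, not the actual ConcurrentConsumerDispatcher source: the dispatcher wraps each consumer callback in a work item and hands it to the work service keyed by the owning model (the `enqueue` delegate below stands in for whatever enqueue method the work service exposes).

```csharp
using System;
using RabbitMQ.Client;

// Hypothetical illustration of a consumer dispatcher: each protocol callback is wrapped
// in a work item and enqueued against the owning model, so the work service only ever
// sees opaque actions and the protocol handling itself is unchanged.
public class DispatcherSketch
{
    private readonly IModel model;
    private readonly Action<IModel, Action> enqueue; // stand-in for the work service's enqueue

    public DispatcherSketch(IModel model, Action<IModel, Action> enqueue)
    {
        this.model = model;
        this.enqueue = enqueue;
    }

    public void HandleBasicDeliver(IBasicConsumer consumer, string consumerTag,
        ulong deliveryTag, bool redelivered, string exchange, string routingKey,
        IBasicProperties properties, byte[] body)
    {
        enqueue(model, () => consumer.HandleBasicDeliver(
            consumerTag, deliveryTag, redelivered, exchange, routingKey, properties, body));
    }
}
```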

> Can you please post some data to compare to each pull request?

Once we've got both PRs up, I should be able to post some stats that show the performance differences between the two. So far, they appear to be fairly close, and both are much better than the above numbers.

@bording (Collaborator, Author) commented Aug 30, 2016

Now that both PRs are up, here are some quick comparison numbers. This is running the same scenario as above:

i7-6700K talking to a broker running in a VM on the same machine:

before
--------------
1 consumer: 7325
32 consumer: 2364

PR 252
--------------
1 consumer: 12649
32 consumer: 12129

PR 253
--------------
1 consumer: 12763
32 consumer: 12356

Those are just quick comparisons: each scenario was run twice and the results averaged. A larger number of runs per scenario would be useful, and I'd like to go back and run them on some of the other machines as well, particularly the older Sandy Bridge machine.

@Scooletz (Contributor)

@michaelklishin Could you write a few words about which PR is more likely to be accepted? I'll need to update mine (#253), and I'd like to know more before I chase master.

@michaelklishin (Member)

I haven't had a chance to take a look at both of them. From a quick glance I like what I see in #253.

In fact, rebasing both would make it easier for @kjnilsson and me to review them, as the diffs would be much smaller.

@kjnilsson (Contributor) commented Aug 31, 2016

@Scooletz it isn't that easy to review the PRs completely until they have been rebased against master now that vs-support has been merged in. What I did notice about your PR is that it contains a breaking API change in IConnectionFactory. This I'd like to avoid, at least in the short term. However, @bording's PR currently fails fairly reliably on one of the tests. I need to spend some time reviewing that before I can say whether it is significant or not.

@Scooletz (Contributor)

Thanks for your answers, @michaelklishin and @kjnilsson. I'll rebase my PR then.

@kjnilsson (Contributor)

Thank you @Scooletz

@bording (Collaborator, Author) commented Aug 31, 2016

> This I'd like to avoid, at least in the short term. However, @bording's PR currently fails fairly reliably on one of the tests. I need to spend some time reviewing that before I can say whether it is significant or not.

Per #252 (comment), this failing test is expected, given that approach's inability to guarantee ordering. Since acking multiple delivery tags assumes you've already seen all earlier tags, I'm now thinking that giving up ordering is going to be an unacceptable trade-off.
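
For reference, this is the multi-ack behavior in question (a hypothetical fragment): BasicAck with multiple set to true confirms the given delivery tag and every unacknowledged tag below it, which is only safe when all earlier deliveries have already been handled in order.

```csharp
using RabbitMQ.Client;

// Why dispatch order matters for multi-acks: acknowledging with multiple = true confirms
// the given tag and all outstanding lower tags in a single frame, so it silently acks
// any earlier delivery that an out-of-order dispatcher has not processed yet.
static class AckExample
{
    public static void AckUpTo(IModel channel, ulong latestDeliveryTag)
    {
        channel.BasicAck(latestDeliveryTag, multiple: true);
    }
}
```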

@michaelklishin (Member)

#253 is merged.
