[BUG-EXTERNAL] Excessive $NextInner objects (EH and SB) #28235

Closed
anuchandy opened this issue Apr 12, 2022 · 2 comments
Assignees: anuchandy
Labels: Event Hubs, Service Bus, pillar-reliability (the issue is related to reliability, one of our core engineering pillars; includes stress testing), tracking-external-issue (the issue is caused by an external problem, e.g., OS; nothing we can do to fix it directly)

Comments

anuchandy (Member) commented on Apr 12, 2022

Root cause analysis

Cx deployed an application using SB sessions to 10 pods. A heap dump taken from one pod after six days of execution shows excessive allocation of the type "reactor.core.publisher.NextProcessor$NextInner".

[Screenshot: heap histogram]

Out of 1,065,540 instances, 1,065,504 "NextInner" instances retain "MonoCacheTime" and "MonoCacheTime$CoordinatedSubscriber" instances, which also appear at the top of the histogram above.

[Screenshot: inbound references to MonoCacheTime]

Roughly one million instances of these types is a symptom of a leak.

These 1,065,504 "NextInner" objects are retained by one "reactor.core.publisher.NextProcessor" object.

[Screenshots: the NextProcessor instance retaining the NextInner objects]

that "NextProcessor" is retained by the private variable shutdownSignalSink (type=Sink.One) in ReactorConnection.

[Screenshot: shutdownSignalSink retaining the NextProcessor]

While the library subscribes to this shutdownSignalSink in multiple places, the only place that exposes it through an intermediate MonoCacheTime operator (as seen in the second figure) by applying the cache() operator is the ReactorConnection::getShutdownSignals() method:

private final Sinks.One<AmqpShutdownSignal> shutdownSignalSink = Sinks.one();

@Override
public Flux<AmqpShutdownSignal> getShutdownSignals() {
    return shutdownSignalSink.asMono().cache().flux();
}

Scanning the source code for references to the method getShutdownSignals(), the only place it is referenced and subscribed to is the ReactorReceiver constructor:

amqpConnection.getShutdownSignals().flatMap(signal -> {
    logger.verbose("Shutdown signal received.");
    return closeAsync("Connection shutdown.", null);
}).subscribe();

But this subscription is disposed of in the ReactorReceiver close API.
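For illustration only, here is a hedged sketch of that subscribe-then-dispose pattern; the class and member names below are assumptions, not the SDK's actual internals.

import reactor.core.Disposable;
import reactor.core.Disposables;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Illustrative sketch; "ReceiverSketch", "subscriptions", and "closeAsync"
// are assumed names standing in for the real ReactorReceiver internals.
class ReceiverSketch {
    private final Disposable.Composite subscriptions = Disposables.composite();

    ReceiverSketch(Flux<String> shutdownSignals) {
        // The constructor subscribes to the connection's shutdown signals...
        subscriptions.add(shutdownSignals
            .flatMap(signal -> closeAsync())
            .subscribe());
    }

    Mono<Void> closeAsync() {
        // ...and close disposes that subscription, so the receiver itself
        // does not leak even while the sink's inner does.
        subscriptions.dispose();
        return Mono.empty();
    }
}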

Had there been a bug where ReactorReceiver was not closed, there would be ~1M instances of ReactorReceiver; in fact, there are only 250 of them.

[Screenshot: heap dump showing 250 ServiceBusReactorReceiver instances]

This proves that ReactorReceiver instances are not leaking and are disposed of correctly. That led us to check the implementation of Sinks.One, where we identified a problem: it continues to retain the subscriber even after disposal. A ticket, "Memory leak in SinkOneMulticast" (reactor/reactor-core#3001), has been opened in the reactor-core repo, and the fix is coming in reactor-core 3.4.17.
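To see the mechanism in isolation, here is a minimal repro sketch (an assumption-laden illustration, not taken from the issue; it assumes an affected reactor-core version, earlier than 3.4.17, on the classpath). Each subscribe/dispose cycle mirrors one ReactorReceiver lifetime against getShutdownSignals().

import reactor.core.Disposable;
import reactor.core.publisher.Sinks;

public class SinkOneLeakRepro {
    public static void main(String[] args) throws InterruptedException {
        Sinks.One<String> shutdownSignalSink = Sinks.one();
        for (int i = 0; i < 1_000_000; i++) {
            // Mirrors getShutdownSignals(): asMono().cache().flux(), then a
            // receiver-style subscribe followed by an immediate dispose.
            Disposable subscription =
                shutdownSignalSink.asMono().cache().flux().subscribe();
            subscription.dispose(); // on affected versions, the inner stays retained
        }
        // Take a heap dump here: affected versions show ~1M retained inner
        // subscribers on the sink, matching the histogram in this issue.
        Thread.sleep(Long.MAX_VALUE);
    }
}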

Observation_1 (Cx action item)

You might have noticed that the reactor ticket mentions the type SinkOneMulticast, while the heap dump shows a different type, NextProcessor.

The reason is that the Cx application appears to have one or more dependencies that bring in a relatively old version of the reactor-core library (~7 months behind). The Azure Service Bus SDK is defined to use the recent reactor-core 3.4.14.

Back in September 2021, the use of NextProcessor in Sinks.One was replaced with SinkOneMulticast in this commit.

Cx needs to analyze the dependencies (e.g., with mvn dependency:tree) and align the versions of shared libraries (reactor-core, reactor-netty, etc.) by upgrading them, so that the application is ready to pick up reactor-core 3.4.17 when it is available.
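As a quick, hedged way to confirm which reactor-core version the application actually resolved at runtime, a small check like the following can be used (this sketch is not from the issue; it reads Implementation-Version from reactor-core's JAR manifest and may print null if that attribute is absent):

import reactor.core.publisher.Flux;

public class ReactorVersionCheck {
    public static void main(String[] args) {
        // Implementation-Version is taken from reactor-core's MANIFEST.MF;
        // a value like "3.4.7" here would confirm a stale transitive dependency.
        System.out.println("reactor-core: "
            + Flux.class.getPackage().getImplementationVersion());
    }
}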

Observation_2 (Cx action item)

The 1,065,504 instances of "NextProcessor$NextInner" mean that, over the period of 6 days, around ~1M ReactorReceiver objects were created and closed. Six days is 8,640 minutes, so that works out to roughly 123 (~125) ReactorReceiver instances created and disposed per minute.

The only reason for such massive churn of these objects is that the consuming application tries to acquire more sessions from the service than the producing application creates. Because of this, the SB service DETACHes those unnecessary receivers after a 1-minute timeout. This indicates that maxConcurrentSessions in the consumer application is too large; the Cx should tune this configuration to the expected load, as sketched below.
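For reference, a hedged sketch of where this knob lives in the azure-messaging-servicebus session processor builder (the connection string, queue name, and concurrency value are placeholders, not recommendations from this issue):

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;

public class TunedSessionProcessor {
    public static void main(String[] args) {
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
            .connectionString("<connection-string>")  // placeholder
            .sessionProcessor()
            .queueName("<session-queue>")             // placeholder
            // Size this to the number of sessions producers actually create,
            // so idle receivers are not repeatedly DETACHed by the service.
            .maxConcurrentSessions(4)
            .processMessage(context -> context.complete())
            .processError(context -> System.err.println(context.getException()))
            .buildProcessorClient();

        processor.start();
    }
}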

Observation_3 (SDK action item)

The getShutdownSignals() API uses the cache() operator. We don't have to use cache() here; Sinks.One is already capable of remembering the last signal and replaying it. While cache() doesn't directly contribute to the leak, removing it would save allocations.
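A sketch of the proposed simplification, i.e., the method shown earlier with the cache() stage dropped:

@Override
public Flux<AmqpShutdownSignal> getShutdownSignals() {
    // Sinks.One already replays its terminal value to late subscribers,
    // so the per-call cache()/MonoCacheTime allocation is unnecessary.
    return shutdownSignalSink.asMono().flux();
}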

And azure-core should upgrade to reactor-core 3.4.17 once it is released.

anuchandy added the Event Hubs, Service Bus, and pillar-reliability labels on Apr 12, 2022
anuchandy self-assigned this on Apr 12, 2022
anuchandy changed the title from "Excessive $NextInner objects (EH and SB)" to "[BUG-EXTERNAL] Excessive $NextInner objects (EH and SB)" on Apr 26, 2022
conniey added the tracking-external-issue label on Apr 26, 2022
alzimmermsft (Member) commented

@anuchandy, @conniey, azure-core will be releasing this month using Reactor 3.4.17, in case this issue needs to be verified as resolved.

anuchandy (Member, Author) commented

"azure-core:1.28.0" using "reactor-core:3.4.17" is released

github-actions bot locked and limited the conversation to collaborators on Apr 11, 2023