[BUG-EXTERNAL] Excessive $NextInner objects (EH and SB) #28235

Closed
anuchandy opened this issue Apr 12, 2022 · 2 comments
Assignees: anuchandy
Labels: Event Hubs, Service Bus, pillar-reliability (the issue is related to reliability, one of our core engineering pillars; includes stress testing), tracking-external-issue (the issue is caused by an external problem, e.g., OS; nothing we can do to fix it directly)

Comments

anuchandy (Member) commented on Apr 12, 2022

Root cause analysis

Cx deployed an application using SB sessions to 10 pods. A heap dump taken from one pod after six days of execution shows excessive allocation of the type "reactor.core.publisher.NextProcessor$NextInner".

[Screenshot: heap histogram]

Out of 1,065,540 instances, 1,065,504 "NextInner" instances retain "MonoCacheTime" and "MonoCacheTime$CoordinatedSubscriber" instances, which also appear at the top of the histogram above.

[Screenshot: inbound references to MonoCacheTime]

Roughly one million instances of these types is a symptom of a leak.

These 1,065,504 "NextInner" objects are retained by one "reactor.core.publisher.NextProcessor" object.

[Screenshots: the NextProcessor instance retaining the NextInner objects]

that "NextProcessor" is retained by the private variable shutdownSignalSink (type=Sink.One) in ReactorConnection.

[Screenshot: shutdownSignalSink retaining the NextProcessor]

While the library subscribes to this shutdownSignalSink in multiple places, the only place that exposes it through an intermediate MonoCacheTime operator (as seen in the second figure) by applying the cache() operator is the ReactorConnection::getShutdownSignals() method:

private final Sinks.One<AmqpShutdownSignal> shutdownSignalSink = Sinks.one();

@Override
public Flux<AmqpShutdownSignal> getShutdownSignals() {
    return shutdownSignalSink.asMono().cache().flux();
}

Scanning the source code for references to the method getShutdownSignals(), the only place it is referenced and subscribed to is the ReactorReceiver constructor:

amqpConnection.getShutdownSignals().flatMap(signal -> {
    logger.verbose("Shutdown signal received.");
    return closeAsync("Connection shutdown.", null);
}).subscribe();

But this subscription is disposed of in the ReactorReceiver close API.
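For illustration only, here is a hedged sketch of that subscribe-then-dispose pattern; the class and member names below are assumptions, not the SDK's actual internals.

import reactor.core.Disposable;
import reactor.core.Disposables;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Illustrative sketch; "ReceiverSketch", "subscriptions", and "closeAsync"
// are assumed names standing in for the real ReactorReceiver internals.
class ReceiverSketch {
    private final Disposable.Composite subscriptions = Disposables.composite();

    ReceiverSketch(Flux<String> shutdownSignals) {
        // The constructor subscribes to the connection's shutdown signals...
        subscriptions.add(shutdownSignals
            .flatMap(signal -> closeAsync())
            .subscribe());
    }

    Mono<Void> closeAsync() {
        // ...and close disposes that subscription, so the receiver itself
        // does not leak even while the sink's inner does.
        subscriptions.dispose();
        return Mono.empty();
    }
}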

Had there been a bug where ReactorReceiver was not closed, there would be ~1M instances of ReactorReceiver; in fact, there are only 250 of them.

[Screenshot: heap dump showing 250 ServiceBusReactorReceiver instances]

This proves that ReactorReceiver instances are not leaking and are disposed of correctly. That led us to check the implementation of Sinks.One, where we identified a problem: it continues to retain the subscriber even after disposal. A ticket, "Memory leak in SinkOneMulticast" (reactor/reactor-core#3001), has been opened in the reactor-core repo, and the fix is coming in reactor-core 3.4.17.
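To see the mechanism in isolation, here is a minimal repro sketch (an assumption-laden illustration, not taken from the issue; it assumes an affected reactor-core version, earlier than 3.4.17, on the classpath). Each subscribe/dispose cycle mirrors one ReactorReceiver lifetime against getShutdownSignals().

import reactor.core.Disposable;
import reactor.core.publisher.Sinks;

public class SinkOneLeakRepro {
    public static void main(String[] args) throws InterruptedException {
        Sinks.One<String> shutdownSignalSink = Sinks.one();
        for (int i = 0; i < 1_000_000; i++) {
            // Mirrors getShutdownSignals(): asMono().cache().flux(), then a
            // receiver-style subscribe followed by an immediate dispose.
            Disposable subscription =
                shutdownSignalSink.asMono().cache().flux().subscribe();
            subscription.dispose(); // on affected versions, the inner stays retained
        }
        // Take a heap dump here: affected versions show ~1M retained inner
        // subscribers on the sink, matching the histogram in this issue.
        Thread.sleep(Long.MAX_VALUE);
    }
}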

Observation_1 (Cx action item)

You might have noticed that the reactor ticket mentions the type SinkOneMulticast, while the heap dump shows a different type, NextProcessor.

The reason is that the Cx application appears to have one or more dependencies that bring in a relatively old version of the reactor-core library (~7 months behind). The Azure Service Bus SDK is defined to use the recent reactor-core 3.4.14.

Back in September 2021, the use of NextProcessor in Sinks.One was replaced with SinkOneMulticast in this commit.

Cx needs to analyze the dependencies (e.g., with mvn dependency:tree) and align the versions of shared libraries (reactor-core, reactor-netty, etc.) by upgrading them, so that the application is ready to pick up reactor-core 3.4.17 when it is available.
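As a quick, hedged way to confirm which reactor-core version the application actually resolved at runtime, a small check like the following can be used (this sketch is not from the issue; it reads Implementation-Version from reactor-core's JAR manifest and may print null if that attribute is absent):

import reactor.core.publisher.Flux;

public class ReactorVersionCheck {
    public static void main(String[] args) {
        // Implementation-Version is taken from reactor-core's MANIFEST.MF;
        // a value like "3.4.7" here would confirm a stale transitive dependency.
        System.out.println("reactor-core: "
            + Flux.class.getPackage().getImplementationVersion());
    }
}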

Observation_2 (Cx action item)

The 1,065,504 instances of "NextProcessor$NextInner" mean that, over the period of 6 days, around ~1M ReactorReceiver objects were created and closed. Six days is 8,640 minutes, so that works out to roughly 123 (~125) ReactorReceiver instances created and disposed per minute.

The only reason for such massive churn of these objects is that the consuming application tries to acquire more sessions from the service than the producing application creates. Because of this, the SB service DETACHes those unnecessary receivers after a 1-minute timeout. This indicates that maxConcurrentSessions in the consumer application is too large; the Cx should tune this configuration to the expected load, as sketched below.
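For reference, a hedged sketch of where this knob lives in the azure-messaging-servicebus session processor builder (the connection string, queue name, and concurrency value are placeholders, not recommendations from this issue):

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;

public class TunedSessionProcessor {
    public static void main(String[] args) {
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
            .connectionString("<connection-string>")  // placeholder
            .sessionProcessor()
            .queueName("<session-queue>")             // placeholder
            // Size this to the number of sessions producers actually create,
            // so idle receivers are not repeatedly DETACHed by the service.
            .maxConcurrentSessions(4)
            .processMessage(context -> context.complete())
            .processError(context -> System.err.println(context.getException()))
            .buildProcessorClient();

        processor.start();
    }
}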

Observation_3 (SDK action item)

The getShutdownSignals() API uses the cache() operator. We don't have to use cache() here; Sinks.One is already capable of remembering the last signal and replaying it. While cache() doesn't directly contribute to the leak, removing it would save allocations.
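A sketch of the proposed simplification, i.e., the method shown earlier with the cache() stage dropped:

@Override
public Flux<AmqpShutdownSignal> getShutdownSignals() {
    // Sinks.One already replays its terminal value to late subscribers,
    // so the per-call cache()/MonoCacheTime allocation is unnecessary.
    return shutdownSignalSink.asMono().flux();
}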

And azure-core should upgrade to reactor-core 3.4.17 once it is released.

anuchandy added the Event Hubs, Service Bus, and pillar-reliability labels on Apr 12, 2022
anuchandy self-assigned this on Apr 12, 2022
anuchandy changed the title from "Excessive $NextInner objects (EH and SB)" to "[BUG-EXTERNAL] Excessive $NextInner objects (EH and SB)" on Apr 26, 2022
conniey added the tracking-external-issue label on Apr 26, 2022
alzimmermsft (Member) commented

@anuchandy, @conniey, azure-core will be releasing this month using Reactor 3.4.17, in case this issue needs to be verified as resolved.

anuchandy (Member, Author) commented

"azure-core:1.28.0" using "reactor-core:3.4.17" is released

github-actions bot locked and limited the conversation to collaborators on Apr 11, 2023