[BUG-EXTERNAL] Excessive $NextInner objects (EH and SB) #28235
Labels
- Event Hubs
- Service Bus
- pillar-reliability: the issue is related to reliability, one of our core engineering pillars (includes stress testing)
- tracking-external-issue: the issue is caused by an external problem (e.g. OS) - nothing we can do to fix it directly
Root causing
Cx deployed an application using SB sessions to 10 pods. A heap dump was taken from one pod after six days of execution, and it shows excessive allocation of the "reactor.core.publisher.NextProcessor$NextInner" type.
Out of 1,065,540 instances, 1,065,504 of the "NextInner" instances retain "MonoCacheTime" and "MonoCacheTime$CoordinatedSubscriber" instances, which also appear at the top of the above histogram.
Having ~1M instances of these types is a symptom of a leak.
These 1,065,504 "NextInner" objects are retained by one "reactor.core.publisher.NextProcessor" object.
that "NextProcessor" is retained by the private variable
shutdownSignalSink
(type=Sink.One) in ReactorConnection.While there are multiple places library subscribe to this
shutdownSignalSink
, the only place that exposes it via the intermediateMonoCacheTime
operator (as appears in the second figure) by applyingcache()
operator is ReactorConnection::getShutdownSignal() methodScanning the source code for reference to the method
getShutdownSignals()
, the only place it gets referenced and subscribed is in ReactorReceiver constructorBut the subscription is DISPOSED in the ReactorReceiver close api.
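For context, here is a minimal sketch of the pattern described above, with simplified, hypothetical class names (`ConnectionSketch`, `ReceiverSketch`), not the actual SDK source: a `Sinks.One` shutdown sink exposed through an extra `cache()`, subscribed in the receiver's constructor, and disposed on close.

```java
import reactor.core.Disposable;
import reactor.core.publisher.Mono;
import reactor.core.publisher.Sinks;

// Simplified sketch of the pattern described above (not the actual SDK source).
final class ConnectionSketch {
    // The terminal shutdown signal for the connection.
    private final Sinks.One<String> shutdownSignalSink = Sinks.one();

    // Exposes the sink through an extra cache() operator; this is the
    // MonoCacheTime seen in the heap dump.
    Mono<String> getShutdownSignals() {
        return shutdownSignalSink.asMono().cache();
    }

    void close(String signal) {
        shutdownSignalSink.tryEmitValue(signal);
    }
}

final class ReceiverSketch implements AutoCloseable {
    private final Disposable subscription;

    ReceiverSketch(ConnectionSketch connection) {
        // Each receiver subscribes to the connection's shutdown signal on creation...
        this.subscription = connection.getShutdownSignals()
            .subscribe(signal -> System.out.println("shutdown: " + signal));
    }

    @Override
    public void close() {
        // ...and disposes that subscription when the receiver is closed.
        subscription.dispose();
    }
}
```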
Had there been a bug in not closing ReactorReceiver, there would be ~1M instances of ReactorReceiver; in reality, there are only 250 of them.
This proves that ReactorReceivers are not getting leaked and are disposed correctly. That led us to check the implementation of `Sinks.One`, where we identified a problem: the sink continues to retain the subscriber even after disposal. A git-ticket, "Memory leak in SinkOneMulticast" (reactor/reactor-core#3001), has been opened in the reactor-core repo, and the fix is coming in reactor-core 3.4.17.
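A minimal sketch of the churn pattern that exposes this retention (the iteration count and operator chain are illustrative only; the behavior depends on the reactor-core version in use and is fixed by reactor/reactor-core#3001):

```java
import reactor.core.Disposable;
import reactor.core.publisher.Sinks;

// Many short-lived subscriptions against a single, never-terminating Sinks.One.
// On the affected reactor-core versions, the sink kept references to subscribers
// even after dispose(); fixed in reactor-core 3.4.17.
public final class SinkChurnSketch {
    public static void main(String[] args) {
        Sinks.One<String> shutdownSignalSink = Sinks.one();

        for (int i = 0; i < 1_000_000; i++) {
            Disposable subscription = shutdownSignalSink.asMono()
                .cache() // the extra operator seen as MonoCacheTime in the heap dump
                .subscribe(signal -> { /* handle shutdown */ });
            subscription.dispose(); // receiver closed; subscription disposed correctly
        }
        // The sink never emits, so on affected versions the ~1M disposed
        // subscribers remain reachable from the sink (visible in a heap dump).
    }
}
```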
Observation_1 (Cx action item)
You might have noticed that the reactor git-ticket mentions the type `SinkOneMulticast`, but in the heap dump it is a different type, `NextProcessor`. The reason is that the Cx application seems to have one or more dependencies that bring in a relatively old version of the reactor-core library (~7 months behind). The Azure Service Bus SDK is defined to use the recent version, reactor-core 3.4.14. Back in Sept 2021, the use of `NextProcessor` in `Sinks.One` was replaced with `SinkOneMulticast` in this commit.
Cx needs to analyze the dependencies and align the versions of shared libraries (e.g., reactor-core, reactor-netty, etc.) by upgrading dependencies so that the application is ready to pick up reactor-core 3.4.17 when it becomes available.
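As a quick sanity check, the reactor-core version actually resolved on the classpath can be printed at runtime; `mvn dependency:tree` (or the Gradle equivalent) shows the same information at build time. This is a standalone sketch, not an SDK API, and the manifest/code-source values may be null under some class loaders or packaging setups.

```java
import reactor.core.publisher.Mono;

// Prints which reactor-core jar the application actually loaded.
public final class ReactorVersionCheck {
    public static void main(String[] args) {
        // Implementation-Version comes from the jar manifest; may be null.
        System.out.println("reactor-core version: "
            + Mono.class.getPackage().getImplementationVersion());
        // Code-source location shows the jar path; may be null in some environments.
        System.out.println("loaded from: "
            + Mono.class.getProtectionDomain().getCodeSource().getLocation());
    }
}
```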
Observation_2 (Cx action item)
The 1,065,504 instances of "NextProcessor$NextInner" mean that, over the period of 6 days, around ~1M ReactorReceiver objects were created and closed, i.e., roughly 125 ReactorReceiver instances created and disposed per minute (1,065,504 / (6 × 24 × 60) ≈ 123).
The only reason for such a massive churn of these objects is that the consuming application is trying to acquire far more sessions from the service than the producer application creates. Because of this, the SB service DETACHes those unnecessary receivers after a 1-minute timeout. This indicates that maxConcurrentSessions in the consumer application is too large; the Cx should tune this configuration according to the expected load, as sketched below.
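A hedged sketch of where that knob lives on the session processor builder (the connection string, queue name, and the value 8 are placeholders, not recommendations):

```java
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;

public final class SessionProcessorSketch {
    public static void main(String[] args) {
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
            .connectionString("<service-bus-connection-string>")
            .sessionProcessor()
            .queueName("<session-enabled-queue>")
            // Size this to the number of sessions producers actually create;
            // an oversized value leads to idle session receivers that the service
            // detaches after a timeout, producing the create/dispose churn above.
            .maxConcurrentSessions(8)
            .processMessage(context -> System.out.println(
                "session " + context.getMessage().getSessionId()))
            .processError(context -> System.err.println(
                "error: " + context.getException()))
            .buildProcessorClient();

        processor.start(); // keep the process alive while the processor runs
    }
}
```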
Observation_3 (SDK action item)
The getShutdownSignals() API uses the `cache()` operator. We don't have to use the cache operator here; `Sinks.One` is capable of remembering the last signal and replaying it. While `cache()` doesn't directly contribute to any leak, we could remove it and save allocations. Additionally, azure-core should upgrade to reactor-core 3.4.17 once it is released.
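A small standalone sketch (not SDK code) of why the extra `cache()` is redundant: `Sinks.One` already replays its single result to subscribers that arrive after emission.

```java
import reactor.core.publisher.Sinks;

// Sinks.One replays its one result to late subscribers, so asMono() alone is
// enough and avoids allocating the MonoCacheTime machinery added by cache().
public final class SinkReplaySketch {
    public static void main(String[] args) {
        Sinks.One<String> shutdownSignalSink = Sinks.one();
        shutdownSignalSink.tryEmitValue("connection shut down");

        // A late subscriber still receives the value, with no cache() involved.
        shutdownSignalSink.asMono()
            .subscribe(signal -> System.out.println("replayed: " + signal));
    }
}
```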