High performance degradation by instrumentation-type DECORATE_QUEUES #3158

Closed
jnt0r opened this issue Aug 18, 2022 · 7 comments
Labels
status/invalid We don't feel this issue is valid, or the root cause was found outside of Reactor

Comments

@jnt0r

jnt0r commented Aug 18, 2022

We had an issue with the spring-sleuth context not being propagated to every thread, so we switched the instrumentation type to DECORATE_QUEUES, which solved the issue in our tests. In production we then faced high performance degradation, long response times, and failed requests (aborted SocketChannels etc.). The Kubernetes pods requested 2 to 3 times more CPU than the previous version with the default instrumentation type.

We see the following exceptions thrown:

Exception: io.netty.channel.StacklessClosedChannelException
Stacktrace:
io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:619)
io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1352)
io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:622)
io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:606)
io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.close(CombinedChannelDuplexHandler.java:505)
io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:77)
io.netty.channel.CombinedChannelDuplexHandler.close(CombinedChannelDuplexHandler.java:316)
io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:622)
io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:606)
io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:472)
io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:957)
io.netty.channel.AbstractChannel.close(AbstractChannel.java:244)
reactor.netty.http.client.HttpClientOperations.onInboundCancel(HttpClientOperations.java:275)
reactor.netty.channel.FluxReceive.unsubscribeReceiver(FluxReceive.java:476)
reactor.netty.channel.FluxReceive.lambda$
reactor.netty.channel.FluxReceive$$Lambda$.dispose
reactor.netty.channel.FluxReceive.cancelReceiver(FluxReceive.java:194)
reactor.netty.channel.FluxReceive.cancel(FluxReceive.java:97)
reactor.netty.channel.ChannelOperations.discard(ChannelOperations.java:366)
reactor.netty.channel.ChannelOperations.dispose(ChannelOperations.java:201)
reactor.core.publisher.MonoCreate$DefaultMonoSink.onCancel(MonoCreate.java:231)
reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onNext(HttpClientConnect.java:299)
reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onNext(HttpClientConnect.java:284)
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber.lambda$
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber$$Lambda$.run
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber.withActiveSubpath(TracingReactorSubscriber.java:99)
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber.onNext(TracingReactorSubscriber.java:67)
reactor.core.publisher.MonoCreate$DefaultMonoSink.success(MonoCreate.java:165)
reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onStateChange(DefaultPooledConnectionProvider.java:182)
reactor.netty.resources.DefaultPooledConnectionProvider$PooledConnection.onStateChange(DefaultPooledConnectionProvider.java:440)
reactor.netty.channel.ChannelOperationsHandler.channelActive(ChannelOperationsHandler.java:62)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216)
io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:209)
io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelActive(CombinedChannelDuplexHandler.java:412)
io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:69)
io.netty.channel.CombinedChannelDuplexHandler.channelActive(CombinedChannelDuplexHandler.java:211)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216)
io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:209)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelActive(DefaultChannelPipeline.java:1398)
Exception: reactor.netty.channel.AbortedException
Message: Connection has been closed BEFORE send operation
Stacktrace:
reactor.netty.http.client.HttpClientOperations.send(HttpClientOperations.java:408)
org.springframework.cloud.gateway.filter.NettyRoutingFilter.lambda$
org.springframework.cloud.gateway.filter.NettyRoutingFilter$$Lambda$.apply
reactor.netty.http.client.HttpClientConnect$HttpClientHandler.requestWithBody(HttpClientConnect.java:583)
reactor.netty.http.client.HttpClientConnect$HttpIOHandlerObserver.lambda$
reactor.netty.http.client.HttpClientConnect$HttpIOHandlerObserver$$Lambda$.get
reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:44)
reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
reactor.netty.http.client.HttpClientConnect$HttpIOHandlerObserver.onStateChange(HttpClientConnect.java:424)
reactor.netty.ReactorNetty$CompositeConnectionObserver.onStateChange(ReactorNetty.java:677)
reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.run(DefaultPooledConnectionProvider.java:270)
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:829)

Expected Behavior

As mentioned in the documentation, I expected the same functional behaviour with perhaps a slight performance impact (the documentation mentions a LOW performance impact).

Actual Behavior

The CPU consumption of the application increases drastically with this instrumentation type, and we see many exceptions about closed and aborted SocketChannels.

Steps to Reproduce

Switch to instrumentation-type DECORATE_QUEUES and fire roughly 200 requests per second at one instance of the application.
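(Editorial note: in a reproducer this switch would typically be made through Spring Cloud Sleuth's configuration, e.g. by setting spring.sleuth.reactor.instrumentation-type=decorate_queues in application.properties; the exact property name is an assumption based on Sleuth 3.x and is not quoted from this issue. Any HTTP load generator can then drive the roughly 200 requests per second against a single instance.)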

Possible Solution

We took a short look at the implementation and wondered about the use of a doubly linked LinkedHashMap for this feature. Maybe a LinkedHashMap is not the right choice here?
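(Editorial note: the sketch below is an illustration of what decorating operator queues involves in general; it is not the actual Reactor or Sleuth implementation, and every name in it is hypothetical. It only shows why capturing and restoring context on every offer/poll adds per-element allocations and thread-local accesses, which can become noticeable at a few hundred requests per second.)

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only: not the Reactor/Sleuth code. The general idea of
// queue decoration is to capture the submitting thread's context together with
// each element and restore it when the element is drained, so every offer/poll
// pays an extra allocation plus thread-local reads and writes.
final class ContextCapturingQueueSketch<T> {

    // Hypothetical context holder; real instrumentation propagates span/MDC state.
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    private static final class Envelope<U> {
        final U value;
        final String context;
        Envelope(U value, String context) {
            this.value = value;
            this.context = context;
        }
    }

    private final Queue<Envelope<T>> delegate = new ConcurrentLinkedQueue<>();

    boolean offer(T value) {
        // Capture the current context on the producing thread.
        return delegate.offer(new Envelope<>(value, CONTEXT.get()));
    }

    T poll() {
        Envelope<T> e = delegate.poll();
        if (e == null) {
            return null;
        }
        // Restore the captured context on the consuming thread.
        CONTEXT.set(e.context);
        return e.value;
    }
}
```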

Your Environment

The application routes requests to other services depending on some parameters (normally very short request/response times). Some parameters are cached so we don't need to check them every time. When we do need to check them (cache evicted), we explicitly start a new thread because the check consists of a synchronous API call:

.onCacheMissResume(() -> Mono.fromCallable(() -> kubernetesRepository.callTokenReviewEndpoint(authorizationHeader))
        .subscribeOn(Schedulers.boundedElastic()))
....
In these cases the sleuth-context was missing with the default instrumentation-type. With decorate_queues the context is present.
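(Editorial note: a minimal, self-contained sketch of that thread hop, assuming only reactor-core on the classpath. The class name is made up and the ThreadLocal merely stands in for the sleuth context so the loss is observable; the real check calls kubernetesRepository.callTokenReviewEndpoint as in the snippet above.)

```java
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class CacheMissHopSketch {

    // Stand-in for the tracing context that sleuth keeps in thread-local state.
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    public static void main(String[] args) {
        TRACE_ID.set("abc123"); // set on the calling thread

        Mono.fromCallable(() -> "traceId seen on worker thread = " + TRACE_ID.get())
                // The hop to boundedElastic is where plain thread-local state is lost;
                // sleuth's instrumentation performs the analogous capture/restore for
                // its own tracing context, depending on the instrumentation type.
                .subscribeOn(Schedulers.boundedElastic())
                .doOnNext(System.out::println)
                .block();
    }
}
```

Run as-is, this prints null on the worker thread; with working context propagation the equivalent sleuth state would be carried across the hop.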

  • Reactor version(s) used: 3.4.15
  • Other relevant libraries versions (eg. netty, ...): Spring Boot 2.6.4, Spring Cloud 2021.0.1
  • JVM version (java -version): 11.0.16
  • OS and version (eg uname -a): Docker Image ubi8/openjdk-11
@reactorbot reactorbot added the ❓need-triage This issue needs triage, hasn't been looked at by a team member yet label Aug 18, 2022
@jnt0r jnt0r changed the title High performance degredation by instrumentation-type DECORATE_QUEUES High performance degradation by instrumentation-type DECORATE_QUEUES Aug 19, 2022
@OlegDokuka
Contributor

OlegDokuka commented Aug 19, 2022

@jnt0r can you please provide a small isolated application that reflects your environment and demonstrates the mentioned problem?

@OlegDokuka OlegDokuka added status/need-investigation This needs more in-depth investigation area/performance This belongs to the performance theme status/need-user-input This needs user input to proceed and removed ❓need-triage This issue needs triage, hasn't been looked at by a team member yet labels Aug 19, 2022
@jnt0r
Author

jnt0r commented Aug 29, 2022

Hi @OlegDokuka, currently I don't have time to create a sample application. It's basically a Spring Cloud Gateway application with two explicit thread jumps (as described in my initial description) where we switched the instrumentation-type from the default to decorate_queues, which led to a reproducible, huge performance impact (CPU usage, response time, ...).
We think the implementation using a LinkedHashMap is causing the problem, as it requires many CPU-intensive operations.

@chemicL
Member

chemicL commented Aug 29, 2022

We'll keep the issue in the backlog; if you find the time to prepare a reproducer, we'd be happy to have a look and profile. Without an isolated reproducer it will be difficult to address the issue.

@chemicL chemicL added this to the 3.5.x backlog milestone Aug 29, 2022
simonbasle added a commit that referenced this issue Aug 30, 2022
One code snippet was badly closed in the previous PR.
simonbasle added a commit that referenced this issue Aug 30, 2022
@jnt0r
Author

jnt0r commented Oct 6, 2022

We figured out that it was really a documentation mismatch between reactor-core and Spring Cloud Gateway. The defaults for the instrumentation-type differ because Spring Cloud Gateway overrides the default to manual. That makes it clear why we faced such a huge performance impact when switching to decorate_queues: we thought we were switching from decorate_on_each to decorate_queues, but we actually switched from manual to decorate_queues.

@chemicL
Member

chemicL commented Oct 6, 2022

Thank you for following up. That makes sense. I think we can consider closing the issue then?

@jnt0r
Author

jnt0r commented Oct 6, 2022 via email

@OlegDokuka OlegDokuka closed this as not planned Oct 6, 2022
@OlegDokuka OlegDokuka added status/invalid We don't feel this issue is valid, or the root cause was found outside of Reactor and removed status/need-investigation This needs more in-depth investigation area/performance This belongs to the performance theme status/need-user-input This needs user input to proceed labels Oct 6, 2022
@chemicL
Member

chemicL commented Oct 7, 2022

@jnt0r please go ahead and file a report in https://github.com/spring-cloud/spring-cloud-gateway/issues then, thanks. I see @OlegDokuka already closed this issue.

@violetagg violetagg removed this from the 3.5.x backlog milestone Mar 14, 2023