
ZIO2 performance issue while performing streaming calls on version 0.6.x #513

Closed
cipriansofronia opened this issue Jun 15, 2023 · 15 comments


@cipriansofronia
Contributor

We recently migrated to ZIO2, so zio-grpc got bumped to 0.6.0-rc5. We've noticed some performance issues after the upgrade while performing streaming calls. I'm attaching some screenshots from the IntelliJ Profiler.

In the upper half of the screenshot is the timeline of the previous zio-grpc version (0.5.1) running on ZIO1, and below it is zio-grpc 0.6.0-rc5 on ZIO2.
The streaming request is kept alive for a few seconds in both versions, but there are clear differences between the two. On 0.5 there is a small spike when the request is triggered, then the CPU drops and stays low; on 0.6 the CPU load is higher and stays that way for the duration of the streaming request (not shown in the screenshots, but if we trigger more requests the CPU goes even higher).
There are no other changes in our service, only the migration to ZIO2 and the zio-grpc bump, so we are not doing anything extra on the new version for the same streaming call. The only difference I noticed is that 0.6 allocates more resources while performing serverStreamingWithBackpressure, which underneath calls ZIO2 internals. Using ZIO2 streaming in other parts of the app has so far not shown any performance issues.

[Screenshots: IntelliJ Profiler CPU timelines from 2023-06-15 comparing zio-grpc 0.5.1 on ZIO1 with 0.6.0-rc5 on ZIO2]

We isolated the streaming call even further: we created a separate project with zio-grpc 0.6 where we only do:

```scala
ZStream(<zio_grpc_generated_class>.defaultInstance)
  .repeat(Schedule.spaced(10.second))
```

So, nothing CPU-intensive, yet we saw the same behaviour.
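For reference, a self-contained version of that repro looks roughly like this (a sketch only: `Foo` is a hypothetical stand-in for the generated message class, and only the stream is shown, not the gRPC wiring):

```scala
import zio._
import zio.stream._

object Repro {
  // Hypothetical stand-in for the generated message class used in the repro.
  final case class Foo()

  // Emits one element immediately, then repeats the one-element stream every
  // 10 seconds -- essentially no work, yet the 0.6.0-rc5 server kept the CPU
  // busy for as long as this stream was open.
  val responses: UStream[Foo] =
    ZStream(Foo()).repeat(Schedule.spaced(10.seconds))
}
```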

Please let me know if more information is needed.

@regiskuckaertz
Contributor

Have you tried increasing the backpressure queue size? The default is very small, and that may explain this behaviour. IIRC we use a queue size of 512k in our service.
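For intuition, here is a rough sketch in plain ZIO of how a bounded queue acts as a backpressure boundary (illustrative only, not zio-grpc's internals; all names are made up). With a small capacity, the producer suspends on `offer` far more often:

```scala
import zio._
import zio.stream._

// Illustrative only: a bounded queue between a fast producer and a slow
// consumer. Comparing a capacity of 16 with a much larger one shows how
// often the producer has to suspend, which is roughly the effect the
// backpressure queue size has in a streaming server.
object BackpressureSketch extends ZIOAppDefault {
  def run =
    for {
      queue    <- Queue.bounded[Int](16) // try 16 vs. a much larger capacity
      producer <- ZStream
                    .iterate(0)(_ + 1)
                    .mapZIO(n => queue.offer(n)) // suspends while the queue is full
                    .runDrain
                    .fork
      _        <- ZStream
                    .fromQueue(queue)
                    .mapZIO(n => ZIO.sleep(1.millis).as(n)) // deliberately slow consumer
                    .take(1000)
                    .runDrain
      _        <- producer.interrupt
    } yield ()
}
```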

@cipriansofronia
Contributor Author

No, I did not, will give it a try! Thanks! 🙌🏻

@cipriansofronia
Contributor Author

cipriansofronia commented Jun 16, 2023

Unfortunately, it did not help. At first I was not sure the config was actually reaching that backpressure queue, but after debugging the service I could see the new value I had set, and sadly there was no change in CPU usage.
I was able to reproduce the issue with the example service (helloworld) provided in the zio-grpc repo as well; I can push my changes to my fork if that helps.

@Gregory-Berkman-Imprivata

We are running into a very similar performance problem. We are not using streaming, but we notice that as we receive gRPC requests over time, the number of FiberId$Runtime instances continuously increases and never drops. Eventually performance degrades significantly and Kubernetes kills the node.
For unary requests we can see that a forkDaemon is being called (also called for streaming requests) here:


Is the new fiber correctly being released?

@regiskuckaertz
Contributor

@cipriansofronia hello - could you give this one a try: #514? It may have been a mistake to use toQueueOfElements, but I also wonder whether calling isReady that much is contributing to it. The grpc-java internal buffer is currently fixed at 32 KB, so it may kick in very often depending on the workload. Another road to explore would be to use the stream observer instead; I'll look into that later. See grpc/grpc-java#5433
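For context, the observer-driven approach from grpc/grpc-java#5433 reacts to readiness via setOnReadyHandler rather than repeatedly polling isReady. A rough sketch in raw grpc-java (not zio-grpc's implementation; `pull` is a hypothetical callback that produces the next response, or None when the stream is finished):

```scala
import io.grpc.stub.{ServerCallStreamObserver, StreamObserver}

object OnReadySketch {
  def drainWhenReady[A](observer: StreamObserver[A], pull: () => Option[A]): Unit = {
    val call = observer.asInstanceOf[ServerCallStreamObserver[A]]
    call.setOnReadyHandler(new Runnable {
      def run(): Unit = {
        var continue = true
        // Push only while the transport buffer has room, instead of polling
        // isReady in a loop.
        while (continue && call.isReady) {
          pull() match {
            case Some(a) => call.onNext(a)
            case None    => call.onCompleted(); continue = false
          }
        }
      }
    })
  }
}
```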

@Gregory-Berkman-Imprivata that looks like a separate issue, but you are right that we should instead fork in a scope that is closed when the call terminates.
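To illustrate the difference (a sketch only, not zio-grpc's actual code): a fiber started with forkDaemon keeps running and stays referenced after the request that forked it, whereas a fiber started with forkScoped inside ZIO.scoped is interrupted when the scope closes, e.g. when the call terminates.

```scala
import zio._

object ScopingSketch {
  // A daemon fiber keeps running (and stays referenced) until it finishes on
  // its own, regardless of what happens to the request that forked it.
  def handleWithDaemon(work: UIO[Unit]): UIO[Unit] =
    work.forkDaemon.unit

  // A scoped fiber is tied to the surrounding Scope and is interrupted when
  // that scope closes.
  def handleWithScope(work: UIO[Unit]): UIO[Unit] =
    ZIO.scoped {
      work.forkScoped *>      // fiber lives only inside this scope
        ZIO.sleep(1.second)   // stand-in for "the call is in flight"
    }                         // closing the scope interrupts the fiber if still running
}
```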

@cipriansofronia
Contributor Author

@regiskuckaertz, I published your changes from #514 locally and tested them with the helloworld example, and I can confirm that this version no longer stresses the CPU: there is a small spike when the request is made, but then it drops to almost 0 while the request is still active and the stream is emitting elements. I tried performing multiple requests at the same time and the CPU stayed the same. It performs well even with the default queue size of 16. Thank you for looking into it! 🙏🏻
[Screenshots: IntelliJ Profiler CPU timelines from 2023-06-17 with the #514 changes applied, showing near-zero CPU while the streaming request is active]
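For anyone repeating this kind of test, the local setup looks roughly like the following (a sketch; the artifact coordinates are taken from memory of the zio-grpc docs, and the locally published version string is whatever `sbt publishLocal` reports in the zio-grpc checkout):

```scala
// In a zio-grpc checkout of the PR branch, publish the artifacts locally:
//   git fetch origin pull/514/head:pr-514 && git checkout pr-514
//   sbt publishLocal   // note the version it prints
//
// Then, in the test project's build.sbt, point at that locally published
// version (placeholder below; fill in what publishLocal reported):
val zioGrpcVersion = "<locally-published-version>"

libraryDependencies +=
  "com.thesamet.scalapb.zio-grpc" %% "zio-grpc-core" % zioGrpcVersion
```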

@regiskuckaertz
Contributor

Yiihaaa! That is great to hear, thanks for trying it out. It's also weird to come back to something you wrote months ago and think "wow was I on the crack pipe back then? this is way too complex" 😄

@regiskuckaertz
Contributor

@Gregory-Berkman-Imprivata I think this will help #515

@cipriansofronia
Contributor Author

cipriansofronia commented Jun 19, 2023

@thesamet thank you for merging these PRs; unfortunately, it appears there is an issue publishing the snapshots.
edit: Actually, it appears the snapshot was published here, so I'm not sure what that error is about.

@cipriansofronia
Contributor Author

cipriansofronia commented Jun 19, 2023

@thesamet, @regiskuckaertz, could you release another RC with these changes, please?

@thesamet
Contributor

Sure, will cut a release this week.

@cipriansofronia
Contributor Author

@regiskuckaertz and @thesamet, I really appreciate your help and fast replies. The issue is solved now, so I'm closing it. Cheers!

@ghostdogpr
Contributor

We did a round of load testing using RC5 (which reproduced this issue; performance was bad) and then using the latest snapshot, and the latest snapshot gave us great performance (better than our ZIO 1 code!). How about a first official release for ZIO 2, finally? 😄

@Gregory-Berkman-Imprivata

Gregory-Berkman-Imprivata commented Jul 28, 2023

> We are running into a very similar performance problem. We are not using streaming, but we notice that as we receive gRPC requests over time, the number of FiberId$Runtime instances continuously increases and never drops. Eventually performance degrades significantly and Kubernetes kills the node. For unary requests we can see that a forkDaemon is being called (also called for streaming requests) here:
>
> Is the new fiber correctly being released?

@regiskuckaertz I'm not sure this has actually been fixed. I can open a new issue for this, but I am still seeing the number of FiberId$Runtime instances increasing continuously.

This is from running load tests locally on my machine; the number of FiberId instances only increases:

```
/service $ jmap -histo 1 | grep FiberId
   8:         86715        2774880  zio.FiberId$Runtime
```
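One way to corroborate this from inside the app (a sketch, not part of zio-grpc; the workload passed to `probe` is a hypothetical stand-in for the real server under load) is to run the main effect under a tracking Supervisor and log how many fibers it currently supervises:

```scala
import zio._

object FiberCountProbe extends ZIOAppDefault {
  // Runs `app` under a tracking Supervisor and periodically logs the number
  // of fibers the supervisor currently knows about.
  def probe[R, E, A](app: ZIO[R, E, A]): ZIO[R, E, A] =
    for {
      supervisor <- Supervisor.track(true) // weakly referenced tracking
      reporter   <- supervisor.value
                      .flatMap(fibers => ZIO.logInfo(s"live fibers: ${fibers.size}"))
                      .repeat(Schedule.spaced(10.seconds))
                      .forkDaemon
      result     <- app.supervised(supervisor).ensuring(reporter.interrupt)
    } yield result

  // Hypothetical workload standing in for the real gRPC server under load.
  def run = probe(ZIO.sleep(1.minute))
}
```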

@ghostdogpr
Contributor

I think you can close this issue, since this is unrelated to streaming. Let's discuss it in #537.

@thesamet thesamet closed this as completed Aug 1, 2023