Parse LiveQuery Excessive CPU #5106
While this may seem excessive, this 'kind of' makes sense. Would you be able to provide profiling info on the LiveQuery server? Perhaps an obvious optimization can be made. I'm thinking JSON serialization bottlenecks.
Absolutely, what would you recommend? I've never had to do profiling with Node before.
So that's almost easy :) Enable inspect mode on your node process.
Forward the port: by default the debugger interface for your process should start on port 9229; if you're running the process on a remote server, you need to ensure port 9229 is properly open for inspection. Chrome inspector: we use the Chrome inspector to gather the profiles. Open a new Chrome tab and in the URL bar, enter the inspector URL. Locally this is what I see when debugging the parse-live-query-server process; in your case this may be different. Hit the record button to capture a profile.
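If passing --inspect at launch is awkward for your deployment, the inspector can also be opened from inside the process. A minimal sketch (added here for illustration, not part of the original instructions; the port and bind address are placeholders):

```js
// Minimal sketch (assumption, not from the original comment): open the V8
// inspector programmatically instead of passing --inspect on the command line.
// Binding to 0.0.0.0 exposes the debugger to the network, so only do this
// temporarily while profiling.
const inspector = require('inspector');

inspector.open(9229, '0.0.0.0'); // port and host are placeholders
console.log('Inspector listening at', inspector.url());
```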
Awesome, I can get the results posted tomorrow. Thanks! |
Nice. Also, do not profile for too long; profiles are huge :) Given what you describe, we should hopefully be able to see bottlenecks without too much digging.
Good point. Will do. |
Hi @flovilmart, See attached for the 3 node processes running on the live query server. One process is the watchdog/master that spawns the other two node servers for each of the 2 available CPUs on a C5.Large. Each was captured for about 60 seconds of profile data while there was one client connected and the CPU was around 30%. Let me know what else I can capture for you. I really appreciate the help! Wes |
FYI: Your directions were nearly perfect. The one step I had to do on the remote server was to start the node app/Live Query with:
The --inspect=10.0.1.77 flag tells the inspector to attach to its local IP address.
Thanks @wfilleman. Are you able to record a profile when your CPU saturates? At this time the profile doesn't show any particular bottleneck on the live query server besides the fact that we spend a lot of time in the internals of
Hi @flovilmart, I connected 4 clients to my LiveQuery server and was able to capture 2 of the 3 processes for about 60 seconds each before the server crashed. Hopefully, this will give you better insight as to what's happening. I'm pretty sure the one titled Thanks! |
Awesome thanks! I’ll have a look |
Managed to grab that first process after a server reboot. The server looks like it's crashing due to running out of RAM. In watching the metrics on the server this time, the LiveQuery RAM use gets to about 80% of the available RAM (on the C5.Large it's 4GB) and then the server crashes. My guess is the garbage collector can't run due to the excessive CPU. Thanks! |
@flovilmart, One more data point: With no clients connected, the CPU is 1-2%. Stable. Whatever is happening, it's only once there are 1+ clients connected. Wes |
@wfilleman are you able to test against a particular branch like schemaController-performance? I've identified one way things can improve in #5107
@wfilleman the more I look at your trace in Process3, it seems to spend most of its time on the internal connect calls. Are you sure the client connections are steady and stable? If you have your 3 clients connected and no activity on the server, what is your CPU usage like? It looks like all your clients keep connecting and disconnecting all the time. Do you have a reverse proxy in front?
@flovilmart, yes, absolutely. Loading it up now and will report shortly. Wes |
@flovilmart, let me double-check the load balancer. It may be killing the websocket every 10 seconds. When I look at the client logs, I'll see the connection drop and it reconnects occasionally when in the high CPU usage case. Wes |
This may explain a few things if you look at this: your program has spent:
That's about 75% of CPU time just handling the connections, and not doing anything. From the trace we see connect calls every 150ms :)
Wow, ok. In the client logs, I'm logging when the connection to live query is opened and closed and errored. There's a handful of these messages...definitely not one every 150ms. I'm using parse client 2.1.0. Pulling the master branch now to test your change. Wes |
I've also changed the load balancer to keep connections open for 20 minutes to see if that changes anything as well. |
Any client / server pair should work properly. It definitely reaches the node process, but never the server itself. Can you print the result of npm ls?
@flovilmart I pulled the master branch, but I'm seeing this error:
Looks like it may be related to 3.0.1 of Bcrypt: kelektiv/node.bcrypt.js#656 |
@flovilmart Sure thing, here's the result of npm ls with parse-server#main branch |
@wfilleman alright, we're making progress there. If bcrypt is problematic, this seems to be related to a particular version of node.js. What version are you using?
Can you try with node 8.11.2 (what the CI runs with) or 8.11.3 (latest LTS)?
@flovilmart I'm using the latest version available from AWS: 8.11.4. However, I've also tried this with the same results on 8.9.3 (another production box I have running parse server).
@flovilmart Maybe I didn't understand which issue we're looking at. My local dev machine is 8.9.3; are you suggesting upgrading my local node to 8.11.2(3,4)?
Let's keep the focus here: bcrypt is used for hashing passwords in the DB, not in the LiveQuery server, so this doesn't seem related at first look. What are the errors you see in your client? If the connection fails, then this is where it should be fixed. Also, are you sure throng is working fine with websockets?
@flovilmart Agreed. However, I can't run the master branch on 8.9.3 (my local setup) or AWS 8.11.4. I've moved back to release 3.0.0 and removed throng so there's only one node process. Will post the CPU profile shortly. |
bcrypt is an optional dependency as it can explicitly fail to build. Why can't you run locally?
Good idea. I'm also curious if you're using nginx in front; this may cause the errors. Did you make sure it's properly configured? https://www.nginx.com/blog/websocket-nginx/
That should be fine then, but I'd rather have the full parse-server separated, which is the way to scale it up. I would not expect to need to make calls to parse-server on each message. See:
One thing I overlooked: you mentioned your ACLs use roles. Can you do the same with the server running with VERBOSE=1? It is fine to have a 'sidekick' parse-server as you have, to avoid the network round-trips. I would guess that the role queries are making your server slow. Putting them externally may speed things up, but that's not guaranteed. Can you confirm?
I'm deploying with your "lite" live query as you describe... Will report in a few min. Yes, I'm using ACLs and the user's session token to get notified about changes to objects in classes that the user "owns" via their ACL. Wes
Ok, the initial results are looking positive. I have another commitment right now, so I will let this run and will report back later tonight or first thing tomorrow. Thank you so much for your time and help today. I think we're getting very close. Wes |
@wfilleman can you check the branch on #4387's PR (liveQuery-CLP)? It contains a shared cache for roles and may impact your issue in a good way.
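Conceptually, the shared role cache amounts to memoizing role lookups per user for a short window instead of querying roles on every event. A rough sketch (illustrative only, not the actual code in #4387; the names and TTL are placeholders):

```js
// Illustrative sketch of a shared role cache (hypothetical, not the code in #4387).
// Role lookups are memoized per user for a short TTL so that every LiveQuery
// event does not trigger a fresh roles query against the database.
const roleCache = new Map(); // userId -> { roles, expiresAt }
const TTL_MS = 5000;         // placeholder TTL

async function getRolesFor(userId, fetchRolesFromDB) {
  const cached = roleCache.get(userId);
  if (cached && cached.expiresAt > Date.now()) {
    return cached.roles; // served from cache, no DB round-trip
  }
  const roles = await fetchRolesFromDB(userId); // e.g. a query against _Role
  roleCache.set(userId, { roles, expiresAt: Date.now() + TTL_MS });
  return roles;
}
```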
@flovilmart WOW! Just loaded up #4387 and attached my 4 clients. CPU is between 1-3%. Incredible. Confirmed: My issue was with the frequent accessing of the roles saturating the CPU. Your caching approach nailed my issue perfectly! THANK YOU! Is there anything you want me to test? I'll leave these clients running here for a few hours and see how it does and report back. Wes |
Keep those up, and perhaps add more; this PR has been standing for a long time and we need to merge it somehow.
Can you test that all the ACLs and roles are working properly? In the sense that only authorized objects are properly forwarded?
@flovilmart I can confirm that the ACL's and roles are working perfectly so far. Just to give you a sense of scale, my production Parse Server is tracking nearly 1M objects and their status in real time. These are split between approx 2100 roles. My test clients are logged in with a user with one role, and I'm correctly seeing the objects propagate to my clients with the role of the logged in user. So, the remaining 1.9M objects are correctly NOT flowing outside of their assigned roles. Update: I'm going to keep watching this today to make sure the CPU tracks with the object update rate and falls back down tonight with the frequency dropoff of object updates as expected tonight. Tomorrow, once this test is complete, I'll work on standing up more clients to put a stress on this LiveQuery server deployment to see where it starts to max out. So far, so good. Wes |
So on a t3.micro you’re using 10% CPU, it ain’t bad given the quantity of traffic it gets :) |
I'm not sure I can explain the spike in network out from LiveQuery to the outside. Watching this today to see if I can correlate it with anything else. It's definitely related to the CPU doubling at the same time. |
@flovilmart Been running some long timed tests and ran into an issue. That last chart I posted with the jump in network-out traffic is repeatable, and along with the jump in network out, the CPU utilization also spikes. Here's what I'm testing and the results:
Here's the moment that this happens: In the logs of my clients, at the exact time of the CPU/network out spike, new updates of their monitored role objects stop flowing to the clients, but there are no errors. As in, the clients all think they are still connected to LiveQuery, but updates to their role objects stop flowing from LiveQuery to the client. Thoughts/Questions:
I've reset the server and clients just now to reproduce this condition again. Yesterday it took about 8 hours with 2 clients connected to see the sudden spike in network out and CPU metric. If you have suggestions as to what to test or look for, I'm all ears. I'll report back this afternoon if the spike returns again. Wes |
Hi @flovilmart, Had that spike in network/cpu traffic this morning. This time LiveQuery made it 24 hours. Here's the network out graph from the last 2.5 days. You can see where I reboot LiveQuery and reconnect clients. The network out drops back to the 500-750KB range for a period of time: Same effects on the clients. Clients were reporting receiving object updates until the spike in network/cpu and then no object updates after that. No errors in the client. What can I test or get you to help figure out what's going on? Wes |
Again, profiling is your friend here: attach the debugger to the process during the spikes and outside of the spikes. We can probably compare which methods are getting called more.
Ok, I can work on that. Another data point: Instead of rebooting the LiveQuery server, I just disconnected my two test clients and reconnected them. The Network Out dropped back to normal baseline levels (500KB). So, now I'm thinking that maybe the client is getting into a weird state and putting a load on the LiveQuery server. What's the best way to log what's happening in the Parse/React Native LiveQuery client? I'm using Parse 2.1.0. Wes |
The outbound network traffic would mean that the server is sending the same events multiple times; this could occur perhaps if the client connects multiple times and the server keeps all of those connections.
Hi @flovilmart, Took me a couple of days to reset and capture performance profiles, but here they are. Ok, here's the test setup:
With this test setup, I profiled two minutes of nominal activity. CPU was 1-2% and Network Out was around 500KB. This ran stable for 24 hours to the minute before the CPU jumped 4X and Network out went to 4MB (about 8X from baseline). In the React Native clients, at the 24-hour mark, I see where the LiveQuery connection was closed and the client auto-retried to connect but was met with an error stating the "Session token is expired.". No further object updates flowed to the React Native clients (as expected if the session token expires). The React Native client isn't yet set up to detect this condition to re-login again. At this exact moment, the LiveQuery server spiked CPU and NetworkOut and another 2-minute profile was captured. I'll let you take a look, but I'm seeing unhandled promise rejections in the profile capture and I'm wondering if that's due to the expired session token which then causes something in LiveQuery to cause the side-effects I'm seeing with 4X CPU and 8X Network out. Thanks for sticking with me and investigating alongside. Let me know if you'd like me to capture anything else or run another test. Wes |
Digging into the unhandled promise rejection, it looks like line 513 in ParseLiveQueryServer.js may be the source of my issue:
I suspect the expired session token is causing this promise to reject without ever being handled. Wes
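To illustrate the failure mode I suspect (a generic sketch, not the actual ParseLiveQueryServer source; every function here is a stand-in):

```js
// Hypothetical illustration of the failure mode, not the actual
// ParseLiveQueryServer source; all names are stand-ins.
// If the session-token lookup rejects (e.g. "Session token is expired.") and
// the promise chain has no .catch(), each published event produces an
// unhandled rejection and the client silently stops receiving updates.
function getAuthForSessionToken(sessionToken) {
  // Stand-in for the real lookup; behaves like an expired token.
  return Promise.reject(new Error('Session token is expired.'));
}

function onEvent(object, client) {
  getAuthForSessionToken(client.sessionToken).then(auth => {
    // The ACL check and push to the client would happen here.
    client.push(object);
  });
  // Missing: .catch(err => { /* clean up the subscription, notify the client */ });
}

process.on('unhandledRejection', err => console.log('unhandled:', err.message));
onEvent({ id: 'abc' }, { sessionToken: 'expired', push: () => {} });
```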
Why would the session token expire so quickly? |
No idea. How can I change the expiration time? |
Well session tokens have expiration of a year. So that should not be problematic in the first place. When you say ‘you suspect’ did you get any proof of it? I don’t have the time to investigate further unfortunately. |
@wfilleman I see your config has this set:
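For illustration only (the value below is a placeholder, not the actual config): Parse Server's sessionLength option is in seconds and defaults to 31536000 (one year), so a short value expires tokens quickly:

```js
// Hypothetical example of a short session length, not the actual config above.
// sessionLength is in seconds; the Parse Server default is 31536000 (one year).
const { ParseServer } = require('parse-server');

const api = new ParseServer({
  appId: 'myAppId',                             // placeholder
  masterKey: 'myMasterKey',                     // placeholder
  databaseURI: 'mongodb://localhost:27017/dev', // placeholder
  sessionLength: 86400,                         // one day: tokens expire after 24 hours
});
```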
Ah, there we go. I can remove that from production to use the default session length of one year. Thanks, @dplewis. @flovilmart, the profile capture pointed to the exact top-level promise that was throwing the uncaught promise exception. I looked at the source and there are two places inside this promise where this can fail; lines 513 and 534 are marked by me below with **.
Using the default of one year should be a fine workaround, but I wanted to document here what we've uncovered in case there's a security need from other project use cases for a shorter session token length where this might become an issue. Wes
If you believe you have found the source of the issue, you should open a PR :) |
That is a good idea. I'm currently running a different set of tests to load up LiveQuery with a simulated 200 concurrent users producing the kind of traffic my production server sees, to check that box. Assuming this looks good after 24 hours, I'll dig into the session token issue and see what that fix path looks like. Wes
Hi @flovilmart, Everything is looking good. After running my 200 concurrent sessions for about 30 hours, CPU and Network Out metrics are stable at:
I think I am all set to move on with my project using LiveQuery. I saw that the #4387 pull hadn't been merged yet to main, if you were waiting on my test results, I think it's good to go from my perspective. In all my testing I've only ever seen object updates for roles my test users were assigned. Also, this fixes the original CPU issue. I want to thank you for your patience and help to work through this with me. I'll take a look at the session token expiration issue I ran into and open a separate pull request. Otherwise, I'm considering this issue closed with the merging of #4387 into main. Wes |
Thanks for providing the live production data; that was invaluable and awesome! 😎 There is still this bug that screws up everything when the session expires. That would need addressing.
Thanks, @flovilmart! Parse has become my go-to for a stable and dev friendly backend on my (and client) projects. You and the maintenance team are doing great work. Agreed about the session expire issue. It's on my list to investigate deeper but shouldn't hold up closing this original issue, now with the merge to main complete. Wes |
Issue Description
I've got a parse server in production in an IoT environment where it's processing around 100 requests a second. This all works great with CPU around 30% and 25ms response times. The parse server is clustered on AWS on two C5.Large instances.
I'm adding in LiveQuery as a separate C5.Large server. Both Parse server and LiveQuery are using Redis to communicate changes and this is working.
The problem I'm seeing is the LiveQuery server with just one client connected has CPU usage between 20-35%. Two clients connected and this jumps to around 40%. More than 3 clients connected and the server crashes within minutes.
I'm looking for suggestions as to what to try to figure out the cause of the excessive CPU usage and the subsequent server crash. To be clear, subscriptions all work from Parse Server to LiveQuery to the client.
More information:
Here is how Live Query server is configured:
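Roughly this shape (a sketch with placeholder values, not the actual production config): the main Parse Server publishes LiveQuery events through Redis, and the LiveQuery server runs as its own process subscribed to the same Redis:

```js
// Sketch of the setup described above; all values are placeholders, not the
// real production config.
const { ParseServer } = require('parse-server');
const express = require('express');
const http = require('http');

// Main Parse Server instances publish LiveQuery events through Redis:
const api = new ParseServer({
  databaseURI: 'mongodb://db-host:27017/production', // placeholder
  appId: 'myAppId',                                  // placeholder
  masterKey: 'myMasterKey',                          // placeholder
  serverURL: 'https://api.example.com/parse',        // placeholder
  liveQuery: {
    classNames: ['NodeStatus' /* , ...the other 11 classes */],
    redisURL: 'redis://redis-host:6379',             // placeholder
  },
});

// The separate LiveQuery server (its own instance) subscribes to the same Redis:
const app = express();
const httpServer = http.createServer(app);
httpServer.listen(1337);

ParseServer.createLiveQueryServer(httpServer, {
  appId: 'myAppId',                                  // placeholder
  masterKey: 'myMasterKey',                          // placeholder
  serverURL: 'https://api.example.com/parse',        // placeholder
  redisURL: 'redis://redis-host:6379',               // placeholder
});
```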
Steps to reproduce
Configure Parse server/LiveQuery per above with Redis and connect one client that subscribes to all 12 classes. Queries are not complex. Example:
const nodeStatusesQuery = new ParseObj.Query('NodeStatus').limit(10000);
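The client then subscribes to each of these queries, roughly like this (a sketch, not the exact app code; the React Native entry point is assumed):

```js
// Sketch of the client-side subscription (Parse JS SDK ~2.1, React Native entry
// point assumed); not the exact app code.
const ParseObj = require('parse/react-native');

const nodeStatusesQuery = new ParseObj.Query('NodeStatus').limit(10000);
const subscription = nodeStatusesQuery.subscribe(); // LiveQuerySubscription in 2.x

subscription.on('update', nodeStatus => {
  // Only objects the logged-in user can read via their role-based ACLs arrive here.
  console.log('NodeStatus updated:', nodeStatus.id);
});
```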
Observe the jump in CPU usage under high throughput (100 requests per second to the parse server).
Expected Results
I'm not sure what the CPU usage should be, but 30% for one client, 40-50% for two and crashing after that doesn't seem right.
Actual Outcome
LiveQuery server with just one client connected has CPU usage between 20-35%. Two clients connected and this jumps to around 40%. More than 3 clients connected and the server crashes within minutes.
Environment Setup
Server
Database
Logs/Trace