Parse LiveQuery Excessive CPU #5106

Closed · wfilleman opened this issue Oct 7, 2018 · 76 comments

@wfilleman

Issue Description

I've got a parse server in production in an IoT environment where it's processing around 100 requests a second. This all works great with CPU around 30% and 25ms response times. The parse server is clustered on AWS on two C5.Large instances.

I'm adding in LiveQuery as a separate C5.Large server. Both Parse server and LiveQuery are using Redis to communicate changes and this is working.

The problem I'm seeing is that the LiveQuery server with just one client connected has CPU usage between 20-35%. With two clients connected this jumps to around 40%, and with more than three clients connected the server crashes within minutes.

I'm looking for suggestions on how to figure out why the CPU usage is so high and why the server then crashes. To be clear, subscriptions all work from Parse Server to LiveQuery to the client.

More information:

  • Parse Server version: 3.0.0
  • Parse Client: 2.1.0
  • Number of Classes monitored by Live Query: 12
  • Using Role ACL
  • Using Session Token in client subscriptions
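For context, the production Parse Server cluster publishes the monitored classes to Redis roughly along these lines (a sketch, not my exact config; class names other than NodeStatus and all URLs/keys are placeholders):

const { ParseServer } = require('parse-server');

let mainParseApi = new ParseServer({
    databaseURI: 'mongodb://user:pass@host/db', // placeholder
    appId: 'appId',                             // placeholder
    masterKey: 'masterKey',                     // placeholder
    serverURL: 'https://parse.example.com/parse',
    liveQuery: {
        classNames: ['NodeStatus' /* , ...the other 11 monitored classes */],
        redisURL: 'redis://redis.example.com:6379' // same Redis instance the LiveQuery server points at
    }
});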

Here is how Live Query server is configured:

let parseApi = new ParseServer({
    databaseURI: `mongodb://${config.get('/mongo/userName')}:${config.get('/mongo/password')}@${config.get('/mongo/uri')}`, // Connection string for your MongoDB database
    appId: config.get('/parse/appId'),
    masterKey: config.get('/parse/masterKey'), // Keep this key secret!
    serverURL: `http://127.0.0.1:${config.get('/port/webapp')}/parse`,
    logLevel: "ERROR",
    sessionLength: ONE_DAY, // in seconds. Set to 24 hours.
    schemaCacheTTL: ONE_MONTH_MS, //"The TTL for caching the schema for optimizing read/write operations. You should put a long TTL when your DB is in production. default to 5000; set 0 to disable."
    cacheTTL: ONE_DAY_MS, //"Sets the TTL for the in memory cache (in ms), defaults to 5000 (5 seconds)"
    cacheMaxSize: 1000000, //"Sets the maximum size for the in memory cache, defaults to 10000"
    enableSingleSchemaCache: true //"Use a single schema cache shared across requests. Reduces number of queries made to _SCHEMA. Defaults to false, i.e. unique schema cache per request."
});

// Serve the Parse API on the /parse URL prefix
app.use('/parse', parseApi);

let port = config.get('/port/webapp');
let server = app.listen(port);

// Initialize a LiveQuery server instance, app is the express app of your Parse Server
if (config.get('/parseAppServerIsLocal')) {
    debug(`Starting Live Query Server on port ${config.get('/port/parseLiveQuery')}`);
    let httpServer = require('http').createServer(app);
    httpServer.listen(config.get('/port/parseLiveQuery'));
    let liveQueryParams = {
        redisURL: config.get('/server/redis')
    };
    let parseLiveQueryServer = ParseServer.createLiveQueryServer(httpServer,liveQueryParams);
}

Steps to reproduce

Configure Parse server/LiveQuery per above with Redis and connect one client that subscribes to all 12 classes. Queries are not complex. Example:

const nodeStatusesQuery = new ParseObj.Query('NodeStatus').limit(10000);

Observe the jump in CPU usage under high throughput (100 requests per second to the Parse Server).
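For reference, each client subscription looks roughly like this (a sketch; keys and URLs are placeholders, the class name comes from the example above):

const Parse = require('parse/node');
Parse.initialize('appId', 'javascriptKey');                // placeholders
Parse.serverURL = 'https://parse.example.com/parse';       // placeholder: production Parse Server
Parse.liveQueryServerURL = 'wss://live.example.com/parse'; // placeholder: LiveQuery server

async function subscribeToNodeStatuses() {
    const nodeStatusesQuery = new Parse.Query('NodeStatus').limit(10000);
    // The logged-in user's session token is used, so role/ACL checks happen server-side.
    const subscription = await nodeStatusesQuery.subscribe();
    subscription.on('create', object => console.log('created', object.id));
    subscription.on('update', object => console.log('updated', object.id));
    return subscription;
}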

Expected Results

I'm not sure what the CPU usage should be, but 30% for one client, 40-50% for two and crashing after that doesn't seem right.

Actual Outcome

The LiveQuery server with just one client connected has CPU usage between 20-35%. With two clients connected this jumps to around 40%, and with more than three clients connected the server crashes within minutes.

Environment Setup

  • Server

    • parse-server version (Be specific! Don't say 'latest'.) : 3.0.0
    • Operating System: Linux
    • Hardware: AWS ELB C5.Large instance
    • Localhost or remote server? (AWS, Heroku, Azure, Digital Ocean, etc): AWS
  • Database

    • MongoDB version: 3.4.14
    • Storage engine: Not sure
    • Hardware: mLab hosted on AWS
    • Localhost or remote server? (AWS, mLab, ObjectRocket, Digital Ocean, etc): mLab

Logs/Trace

@flovilmart
Contributor

While this may seem excessive, this 'kind of' makes sense. Would you be able to provide profiling info on the LiveQuery server? Perhaps an obvious optimization can be made. I'm thinking JSON serialization bottlenecks.

@wfilleman
Author

Absolutely, what would you recommend? I’ve never had to do profiling with Node before.

@flovilmart
Contributor

So that's almost easy :)

Enable inspect mode on your node process

  • Get the <pid> of your live query process with ps aux | grep node
  • Send the USR1 signal to this process: kill -usr1 <pid> (this will not kill the node process; node is built to enable the inspector when it receives this signal).

Forward the port

By default the debugger interface for your process starts on port 9229. If you're running the process on a remote server, you need to make sure port 9229 is open for inspection.

Chrome inspector

We use the Chrome inspector to gather and process the profiles. Open a new Chrome tab and in the URL bar enter: chrome://inspect

[screenshot: chrome://inspect page]

This is what I see when debugging the parse-live-query-server process locally; in your case it may look different:

[screenshot: the parse-live-query-server process listed as a remote target]

Hit the inspect link and start recording the profile

[screenshot: the DevTools profiler ready to record]

@wfilleman
Author

Awesome, I can get the results posted tomorrow. Thanks!

@flovilmart
Contributor

Nice. Also, do not profile for too long; profiles are huge :) Given what you describe, we should hopefully be able to see the bottlenecks without too much digging.

@wfilleman
Author

Good point. Will do.

@wfilleman
Author

Hi @flovilmart,

See attached for the 3 node processes running on the live query server. One process is the watchdog/master that spawns the other two node servers for each of the 2 available CPUs on a C5.Large.
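For reference, the clustering on that box is done with throng, roughly like this (a sketch; the worker count matches the 2 vCPUs):

const throng = require('throng');

function startLiveQueryWorker(id) {
    // Each worker runs the Express app plus the LiveQuery setup shown in the issue description.
    console.log(`LiveQuery worker ${id} started`);
}

throng({ workers: 2, lifetime: Infinity }, startLiveQueryWorker);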

Each was captured for about 60 seconds of profile data while there was one client connected and the CPU was around 30%.

Let me know what else I can capture for you. I really appreciate the help!

LiveQuery-HighCPU.zip

Wes

@wfilleman
Author

FYI: Your directions were nearly perfect. The one step I had to do on the remote server was to start the node app/Live Query with:

node --inspect=10.0.1.77 server.js

The --inspect=10.0.1.77 flag tells the inspector to bind to the server's local IP address.

@flovilmart
Contributor

Thanks @wfilleman. Are you able to record a profile when your CPU saturates? At this time the profile doesn't show any particular bottleneck on the LiveQuery server besides the fact that we spend a lot of time in the internals of process.nextTick(). See below the time spent in the runMicrotasks internal:

[screenshot: CPU profile showing time spent in runMicrotasks]

@wfilleman
Author

Hi @flovilmart,

I connected 4 clients to my LiveQuery server and was able to capture 2 of the 3 processes for about 60 seconds each before the server crashed. Hopefully, this will give you better insight as to what's happening. I'm pretty sure the one titled Process2-10-8-2018 is the LiveQuery instance. The process3 one is the watchdog.

Thanks!
Wes
LiveQuery-HighCPU-10-8-2018.zip

@flovilmart
Contributor

Awesome thanks! I’ll have a look

@wfilleman
Author

@flovilmart,

Managed to grab that first process after a server reboot. The server looks like it's crashing due to running out of RAM. In watching the metrics on the server this time, the LiveQuery RAM use gets to about 80% of the available RAM (on the C5.Large it's 4GB) and then the server crashes. My guess is the garbage collector can't run due to the excessive CPU.

Thanks!
Wes
Process1-10-8-2018-V2.zip

@wfilleman
Author

@flovilmart, One more data point: With no clients connected, the CPU is 1-2%. Stable. Whatever is happening, it's only once there are 1+ clients connected.

Wes

@flovilmart
Contributor

@wfilleman are you able to test against a particular branch, schemaController-performance? I've identified one way things can improve in #5107.

@flovilmart
Contributor

@wfilleman the more I look at your trace in Process3, the more it seems to spend most of its time on the internal connect calls. Are you sure the client connections are steady and stable? If you have your 3 clients connected and no activity on the server, what is your CPU usage like? It looks like all your clients keep connecting and disconnecting all the time. Do you have a reverse proxy in front?

@wfilleman
Author

@flovilmart, yes, absolutely. Loading it up now and will report shortly.

Wes

@wfilleman
Author

@flovilmart, let me double-check the load balancer. It may be killing the websocket every 10 seconds. When I look at the client logs, I see the connection drop and reconnect occasionally in the high-CPU case.

Wes

@flovilmart
Contributor

It may be killing the websocket every 10 seconds.

This may explain a few things if you look at this:

[screenshot: CPU profile summary]

your program has spent:

  • 36% of its time garbage collecting
  • 26% on connection
  • 6% emitting errors + 3% collecting the stacks :)

That's about 75% of CPU time just handling the connections, and not doing anything.

From the trace we see connect calls every 150ms :)

@wfilleman
Author

Wow, ok. In the client logs, I'm logging when the connection to live query is opened and closed and errored. There's a handful of these messages...definitely not one every 150ms. I'm using parse client 2.1.0.
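Roughly, this is the connection logging I have in place on the client (the log labels are mine):

const Parse = require('parse/react-native'); // 'parse/node' outside React Native
Parse.LiveQuery.on('open', () => console.log('[LiveQuery] connection opened'));
Parse.LiveQuery.on('close', () => console.log('[LiveQuery] connection closed'));
Parse.LiveQuery.on('error', error => console.log('[LiveQuery] connection error', error));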

Pulling the master branch now to test your change.

Wes

@wfilleman
Author

I've also changed the load balancer to keep connections open for 20 minutes to see if that changes anything as well.

@flovilmart
Contributor

Any client/server pair should work properly. It definitely reaches the node process, but never the server itself. Can you print the result of npm ls?

@wfilleman
Author

wfilleman commented Oct 8, 2018

@flovilmart I pulled the master branch, but I'm seeing this error:

dyld: lazy symbol binding failed: Symbol not found: __ZN4node19GetCurrentEventLoopEPN2v87IsolateE
  Referenced from: /LiveQueryServer/node_modules/parse-server/node_modules/bcrypt/lib/binding/bcrypt_lib.node
  Expected in: flat namespace

Looks like it may be related to 3.0.1 of Bcrypt: kelektiv/node.bcrypt.js#656

@wfilleman
Author

can you print the result of npm ls

@flovilmart Sure thing, here's the result of npm ls with parse-server#main branch

Wes
npm ls (with parse-server#master).txt

@flovilmart
Contributor

@wfilleman alright, we're making progress there. If bcrypt is problematic, this seems to be related to a particular version of node.js. What version are you using?

@flovilmart
Contributor

Can you try with node 8.11.2 (what the CI runs with) or 8.11.3 (the latest LTS)?

@wfilleman
Author

@flovilmart I'm using the latest version available from AWS, 8.11.4; however, I've also tried this with the same results on 8.9.3 (another production box I have running Parse Server).

@wfilleman
Author

@flovilmart Maybe I didn't understand which issue we're looking at. My local dev machine is on 8.9.3; are you suggesting upgrading my local node to 8.11.2/3/4?

@flovilmart
Contributor

Let's keep the focus here: bcrypt is used for encrypting passwords in the DB, not in the LiveQuery server, so this doesn't seem related at first look. What are the errors you see in your client? If the connection fails, then that is where it should be fixed. Also, are you sure throng is working fine with websockets?

@wfilleman
Author

@flovilmart Agreed. However, I can't run the master branch on 8.9.3 (my local setup) or AWS 8.11.4. I've moved back to release 3.0.0 and removed throng so there's only one node process. Will post the CPU profile shortly.

@flovilmart
Contributor

However, I can't run the master branch on 8.9.3 (my local setup) or AWS 8.11.4

bcrypt is an optional dependency as it can explicitly fail to build. Why can't you run locally?

removed throng so there's only one node process

Good idea. I'm also curious whether you're using nginx in front; this may cause the errors. Did you make sure it's properly configured? https://www.nginx.com/blog/websocket-nginx/

@flovilmart
Contributor

Just to be clear, the Parse Server that's standing up to support Live Query is connected to the production MongoDB, but it definitely is not exposed to the outside world and cannot accept parse server connections.

That should be fine then, but I'd rather have the full parse-server separated, which is the way to scale it up. I would not expect it to need to make calls to parse-server on each message.

See:

let httpServer = require('http').createServer(express());
httpServer.listen(config.get('/port/parseLiveQuery'));
let liveQueryParams = {
  appId: 'appId',
  javascriptKey: '',
  masterKey: '', 
  serverURL: '...', // public URL for the parse-server to use
  redisURL: config.get('/server/redis')
};
let parseLiveQueryServer = ParseServer.createLiveQueryServer(httpServer, liveQueryParams);

@flovilmart
Contributor

One thing I overlooked: you mentioned the ACLs use roles. Can you do the same test with the server running with VERBOSE=1? It is fine to have a 'sidekick' parse-server as you do, to avoid the extra network hops.

I would guess that the role queries are making your server slow.

Putting them externally may speed things up, but that's not guaranteed. Can you confirm?
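For the verbose run, either set VERBOSE=1 in the environment or raise the logLevel option you already use (a rough sketch with placeholder values, not verified against 3.0.0):

const { ParseServer } = require('parse-server');

let verboseParseApi = new ParseServer({
    databaseURI: 'mongodb://user:pass@host/db', // placeholder
    appId: 'appId',                             // placeholder
    masterKey: 'masterKey',                     // placeholder
    serverURL: 'http://127.0.0.1:1337/parse',   // placeholder
    logLevel: 'verbose'                         // raised from "ERROR" to log the role queries
});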

@wfilleman
Author

I'm deploying with your "lite" live query as you describe...Will report in a few min.

Yes. I'm using ACLs and the user's session token to get notified about changes to objects in classes that the user "owns" via their ACL.

Wes

@wfilleman
Author

Ok, the initial results are looking positive. I have another commitment right now, so I will let this run and will report back later tonight or first thing tomorrow.

Thank you so much for your time and help today. I think we're getting very close.

Wes

@flovilmart
Contributor

@wfilleman can you check the branch on PR #4387 (liveQuery-CLP)? It contains a shared cache for roles and may improve your issue.

@wfilleman
Author

@flovilmart WOW! Just loaded up #4387 and attached my 4 clients. CPU is between 1-3%. Incredible.

Confirmed: my issue was the frequent role lookups saturating the CPU. Your caching approach nailed it perfectly! THANK YOU!

Is there anything you want me to test? I'll leave these clients running here for a few hours and see how it does and report back.

Wes

@flovilmart
Contributor

Keep those up, and perhaps add more; this PR has been standing for a long time and we need to merge it somehow.

@flovilmart
Contributor

Can you test that all the ACLs and roles are working properly, in the sense that only authorized objects are forwarded?

@wfilleman
Author

wfilleman commented Oct 9, 2018

@flovilmart I can confirm that the ACLs and roles are working perfectly so far. Just to give you a sense of scale, my production Parse Server is tracking nearly 1M objects and their status in real time. These are split between approx. 2100 roles.

My test clients are logged in with a user with one role, and I'm correctly seeing the objects propagate to my clients with the role of the logged in user. So, the remaining 1.9M objects are correctly NOT flowing outside of their assigned roles.
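For reference, the objects under test carry role-scoped ACLs along these lines (a sketch with hypothetical names and keys; only users holding the role should receive LiveQuery events for them):

const Parse = require('parse/node');
Parse.initialize('appId', 'javascriptKey', 'masterKey');   // placeholders
Parse.serverURL = 'https://parse.example.com/parse';       // placeholder

async function createRoleScopedStatus() {
    const acl = new Parse.ACL();
    acl.setPublicReadAccess(false);
    acl.setRoleReadAccess('site-1234', true);  // hypothetical role name
    acl.setRoleWriteAccess('site-1234', true);

    const status = new Parse.Object('NodeStatus');
    status.setACL(acl);
    status.set('state', 'online');
    return status.save(null, { useMasterKey: true });
}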

Update:
With the initial good results this morning, I switched back to using the load balancer and reduced the server way down to a t3.micro. I also went back to standing up my helper Parse Server on the LiveQuery server to reduce network traffic to the production Parse Server cluster. Four clients are connected, CPU is around 5-10%, and the CPU is tracking with the overall rate of object updates from my production servers. During the daytime (after 8:00-9:00 am local) the rate of updates is elevated until the evening, when it drops off.

[chart: CPU utilization tracking the object update rate]

I'm going to keep watching this today to make sure the CPU tracks with the object update rate and falls back down tonight as the updates drop off.

Tomorrow, once this test is complete, I'll work on standing up more clients to put stress on this LiveQuery server deployment and see where it starts to max out.

So far, so good.

Wes

@flovilmart
Contributor

So on a t3.micro you're using 10% CPU; it ain't bad given the quantity of traffic it gets :)

@wfilleman
Author

I 100% agree. Here's the volume of network in/out on the LiveQuery Server as reference:

[chart: network in/out on the LiveQuery server]

@wfilleman
Author

wfilleman commented Oct 9, 2018

I'm not sure I can explain the spike in network out from LiveQuery to the outside. Watching this today to see if I can correlate it with anything else. It's definitely related to the CPU doubling at the same time.

@wfilleman
Author

@flovilmart I've been running some longer-duration tests and have run into an issue.

The jump in network-out traffic from the last chart I posted is repeatable, and it comes with a spike in CPU utilization. Here's what I'm testing and the results:

  • I changed the server size to a t3.small to stay under the 20% baseline CPU level as dictated by AWS.
  • The t3.small is only hosting the LiveQuery server and its helper Parse Server. It doesn't do any other work.
  • I had 2 clients connected for 24 hours.
  • CPU starts off at 2-3% but about 8 hours later jumps to 10-18%

Here's the moment that this happens:
[charts: CPU and network out at the moment of the spike]

In my clients' logs, at the exact time of the CPU/network-out spike, new updates to their monitored role objects stop flowing, but there are no errors: the clients all think they are still connected to LiveQuery, yet updates to their role objects stop arriving from LiveQuery.

Thoughts/Questions:

  • In LiveQuery, are there any timeouts around caching that may explain the 8-hour window?
  • In configuring the LiveQuery server, are there any options to adjust cache values, timeout values, etc.? (See the sketch right after the config below.) For reference, this is how I'm configuring the LiveQuery server based on your feedback:
let httpServer = require('http').createServer(express());
httpServer.listen(config.get('/port/parseLiveQuery'));
let liveQueryParams = {
    appId: config.get('/parse/appId'),
    javascriptKey: '',
    masterKey: config.get('/parse/masterKey'),
    serverURL: `http://127.0.0.1:${config.get('/port/webapp')}/parse`,
    redisURL: config.get('/server/redis')
};
let parseLiveQueryServer = ParseServer.createLiveQueryServer(httpServer, liveQueryParams);
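For instance, is something along these lines available? (Sketch only; the cacheTimeout/websocketTimeout option names are taken from the LiveQuery server docs and I haven't verified they apply to this version.)

let liveQueryParamsWithTimeouts = {
    appId: config.get('/parse/appId'),
    masterKey: config.get('/parse/masterKey'),
    serverURL: `http://127.0.0.1:${config.get('/port/webapp')}/parse`,
    redisURL: config.get('/server/redis'),
    cacheTimeout: 60 * 1000,      // ms; how long fetched users/roles stay cached (assumed option)
    websocketTimeout: 10 * 1000   // ms; websocket liveness check interval (assumed option)
};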

I've reset the server and clients just now to reproduce this condition again. Yesterday it took about 8 hours with 2 clients connected to see the sudden spike in network out and CPU metric. If you have suggestions as to what to test or look for, I'm all ears. I'll report back this afternoon if the spike returns again.

Wes

@wfilleman
Author

Hi @flovilmart,

Had that spike in network/CPU traffic this morning. This time LiveQuery made it 24 hours. Here's the network-out graph from the last 2.5 days. You can see where I rebooted LiveQuery and reconnected the clients; the network out drops back to the 500-750KB range for a period of time:
[chart: network out over the last 2.5 days]

Same effect on the clients: they were receiving object updates until the network/CPU spike, and then no object updates after that. No errors in the clients.

What can I test or capture for you to help figure out what's going on?

Wes

@flovilmart
Contributor

Again, profiling is your friend here: attach the debugger to the process both during the spikes and outside of them. We can then compare which methods are getting called more.

@wfilleman
Author

Ok, I can work on that. Another data point: Instead of rebooting the LiveQuery server, I just disconnected my two test clients and reconnected them. The Network Out dropped back to normal baseline levels (500KB).

So, now I'm thinking that maybe the client is getting into a weird state and putting a load on the LiveQuery server. What's the best way to log what's happening in the Parse/React Native LiveQuery client? I'm using Parse 2.1.0.

Wes

@flovilmart
Contributor

The outbound network traffic would mean that the server is sending the same events multiple times; this could occur if the client connects multiple times and the server keeps all of those connections.

@wfilleman
Author

Hi @flovilmart,

Took me a couple of days to reset and capture performance profiles, but here they are. Ok, here's the test setup:

  • AWS t3.small server.
  • LiveQuery Server ONLY. It is set up to connect to my production Parse Server so that we remove any Parse Server CPU cycles from the profile logs.
  • Only one CPU/process.
  • Two React Native clients.

With this test setup, I profiled two minutes of nominal activity. CPU was 1-2% and Network Out was around 500KB. This ran stable for 24 hours to the minute before the CPU jumped 4X and Network out went to 4MB (about 8X from baseline).

In the React Native clients, at the 24-hour mark, I see where the LiveQuery connection was closed and the client auto-retried to connect but was met with an error stating "Session token is expired." No further object updates flowed to the React Native clients (as expected if the session token expires). The React Native client isn't yet set up to detect this condition and log in again.

At this exact moment, the LiveQuery server spiked CPU and network out, and another 2-minute profile was captured. I'll let you take a look, but I'm seeing unhandled promise rejections in the profile capture, and I'm wondering if that's due to the expired session token, which then triggers something in LiveQuery that causes the side effects I'm seeing (4X CPU and 8X network out).

Thanks for sticking with me and investigating alongside. Let me know if you'd like me to capture anything else or run another test.

Wes

LiveQueryProfile - 2 Clients - Spiked CPU NetworkOut.zip

@wfilleman
Author

Digging into the unhandled promise rejection, it looks like line 513 in ParseLiveQueryServer.js may be the source of my issue:

const roleNames = await auth.getUserRoles(); // Finally, see if any of the user's roles allow them read access

I suspect .getUserRoles() is throwing an exception due to the session token expiration. Buried in that call is a REST request to Parse Server, which I could see getting hit repeatedly if the cache is no longer preventing needless network traffic.

Wes

@flovilmart
Contributor

Why would the session token expire so quickly?

@wfilleman
Author

No idea. How can I change the expiration time?

@flovilmart
Contributor

Well, session tokens have an expiration of a year, so that should not be problematic in the first place.

When you say you 'suspect', did you get any proof of it?

I don’t have the time to investigate further unfortunately.

@dplewis
Member

dplewis commented Oct 15, 2018

@wfilleman I see your config has this set

 sessionLength: ONE_DAY, // in seconds. Set to 24 hours.

@wfilleman
Author

Ah, there we go. I can remove that from production to use the default session length of one year. Thanks, @dplewis.
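That lines up with the 24-hour mark exactly:

const ONE_DAY = 24 * 60 * 60;            // 86400 s (the value in my config, passed as sessionLength)
const DEFAULT_SESSION_LENGTH = 31536000; // parse-server default: one year, in seconds
// With sessionLength: ONE_DAY, tokens issued at login expire exactly 24 hours later, which matches
// the spike I see at the 24-hour mark; removing the override falls back to the one-year default.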

@flovilmart, the profile capture pointed to the exact top-level promise that was throwing the uncaught promise exception. I looked at the source, and there are two places inside this promise where await is used without a try/catch around it, which is the likely source. Either line 513 or line 534 in ParseLiveQueryServer.js is throwing an uncaught exception when the session token expires, which leads to the CPU/network-out spike.

513 and 534 are marked by me below with **.

return Promise.resolve().then(async () => {
      // Resolve false right away if the acl doesn't have any roles
      const acl_has_roles = Object.keys(acl.permissionsById).some(key => key.startsWith('role:'));

      if (!acl_has_roles) {
        return false;
      }

      **const roleNames = await auth.getUserRoles(); // Finally, see if any of the user's roles allow them read access**

      for (const role of roleNames) {
        // We use getReadAccess as `role` is in the form `role:roleName`
        if (acl.getReadAccess(role)) {
          return true;
        }
      }

      return false;
    }).then(async isRoleMatched => {
      if (isRoleMatched) {
        return Promise.resolve(true);
      } // Check client sessionToken matches ACL


      const clientSessionToken = client.sessionToken;

      if (clientSessionToken) {
        const {
          userId
        } = **await this.getAuthForSessionToken(clientSessionToken);**
        return acl.getReadAccess(userId);
      } else {
        return isRoleMatched;
      }
    }).catch(() => {
      return false;
    });
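For what it's worth, the kind of guard I have in mind would look roughly like this (a rough sketch, not the actual parse-server code; parameter names are taken from the snippet above):

async function aclHasReadAccess(acl, auth, client, getAuthForSessionToken) {
    try {
        const aclHasRoles = Object.keys(acl.permissionsById).some(key => key.startsWith('role:'));
        if (aclHasRoles) {
            const roleNames = await auth.getUserRoles();
            for (const role of roleNames) {
                if (acl.getReadAccess(role)) {
                    return true;
                }
            }
        }
        const clientSessionToken = client.sessionToken;
        if (clientSessionToken) {
            const { userId } = await getAuthForSessionToken(clientSessionToken);
            return acl.getReadAccess(userId);
        }
        return false;
    } catch (e) {
        // An expired or invalid session token (or any other lookup failure) simply denies access
        // instead of surfacing as an unhandled rejection.
        return false;
    }
}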

Using the default of one year should be a fine workaround, but I wanted to document here what we've uncovered in case other projects have a security need for a shorter session token length, where this might become an issue.

Wes

@flovilmart
Contributor

If you believe you have found the source of the issue, you should open a PR :)

@wfilleman
Author

That is a good idea. I'm currently running a different set of tests to load up LiveQuery with 200 simulated concurrent users generating the kind of traffic my production server produces, to check that box. Assuming this looks good after 24 hours, I'll dig into the session token issue and see what the fix path looks like.

Wes

@wfilleman
Author

wfilleman commented Oct 17, 2018

Hi @flovilmart,

Everything is looking good. After running my 200 concurrent sessions for about 30 hours, CPU and Network Out metrics are stable at:

  • CPU: 15-20%
  • Network Out: 500KB-1MB

I think I am all set to move on with my project using LiveQuery. I saw that the #4387 pull hadn't been merged yet to main; if you were waiting on my test results, it's good to go from my perspective. In all my testing I've only ever seen object updates for roles my test users were assigned. Also, this fixes the original CPU issue.

I want to thank you for your patience and help in working through this with me. I'll take a look at the session token expiration issue I ran into and open a separate pull request. Otherwise, I'm considering this issue closed with the merging of #4387 into main.

Wes

@flovilmart
Contributor

Thanks for providing the live production data; that was invaluable and awesome! 😎

There is still this bug that screws everything up when the session expires. That will need addressing.

@wfilleman
Author

Thanks, @flovilmart! Parse has become my go-to for a stable and dev-friendly backend for my (and my clients') projects. You and the maintenance team are doing great work.

Agreed about the session expiry issue. It's on my list to investigate further, but it shouldn't hold up closing this original issue now that the merge to main is complete.

Wes
