Random periodic crashes #461

Closed

matteosuppo opened this issue May 18, 2017 · 17 comments

@matteosuppo
Contributor

It's a mystery. We have two instances that serve a lot of traffic without problems, but sometimes the database connections spike, the instances stop serving traffic, and I have no idea what causes it.

What could I do to investigate? Wading through the logs seems difficult without an idea of what to look for.

We tried with v0.7.x and 0.8.x, both with and without Docker.

Could it be something related to the database? It's PostgreSQL.

@aeneasr
Member

aeneasr commented May 18, 2017

Is this the same issue you reported previously? I thought it had to do with a bug in the REST writer that caused an infinite loop when certain error conditions were met. However, that would not explain database spikes, and that bug has been resolved in 0.8.x.

One likely cause for database spikes is leaked connections when an error occurs. This can happen in a transaction with multiple statements where the connection isn't released if one of the statements fails. However, that would result in a steady increase of connections over time, not an abrupt spike.
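To illustrate the failure mode I mean, here is a minimal Go sketch (hypothetical code, not taken from hydra): if an error path returns without rolling back, the transaction's connection never goes back into database/sql's pool.

package store

import "database/sql"

// createThing is a hypothetical multi-statement transaction. Without the
// deferred Rollback, any `return err` below would keep the connection
// checked out of the pool; Rollback after a successful Commit is a no-op.
func createThing(db *sql.DB, id, name string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.Exec("INSERT INTO things (id) VALUES ($1)", id); err != nil {
		return err
	}
	if _, err := tx.Exec("INSERT INTO thing_names (id, name) VALUES ($1, $2)", id, name); err != nil {
		return err
	}
	return tx.Commit()
}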

To identify the issue we need to know which API is causing it. Excluding APIs will narrow down the search, so are there endpoints you can rule out? It would also help to know which flows you've enabled.

My prime suspects, in that order, are the warden, token introspection, and token issuance, simply because the warden is the most complex and probably handles the most API calls.

The first thing to look for in the logs is 5xx responses, and then which requests happened right before them. Also important is how long it took for the database connections to leak, and how you've tuned the database connection settings (e.g. max conns).

@aeneasr
Member

aeneasr commented May 18, 2017

If the error happens in both 0.7.x and 0.8.x, ladon is actually not as likely to be the source of this as I thought, because the ladon SQL manager was reworked completely between those versions.

How often does this happen, by the way? Is there a fixed pattern (every 5 hours, for example)?

@aeneasr
Member

aeneasr commented May 18, 2017

Actually, @matteosuppo could you share the CPU, Memory and SQL Connection graphs here, including timestamps?

@matteosuppo
Contributor Author

As soon as I get another crash

@aeneasr
Member

aeneasr commented May 19, 2017

So I started running some tests against the endpoints; I'll summarize what I did here so we get a good idea of what's happening. I checked the number of connections using:

docker exec -t -i 991 /bin/bash
psql --user postgres
SELECT * FROM pg_stat_activity;

PS: everything ran locally here, so there was zero chance of network errors, which could still be the underlying issue in your setup.

Testing oauth2 token flow

while true; do hydra token client; done

Here I was not able to find connection leaks. Connections were properly re-used and closed, and stayed below the number of concurrent clients, which is a good sign. They maxed out at about 5 postgres connections with 15 concurrent clients and returned to 2 idle connections once the load was taken away.

Token validation

I ran the same test with token validation:

while true; do hydra token validate $(hydra token client); done

Same observations as above.

Client creation

while true; do hydra clients create; done

Same observations as above.

Group

Create

while true; do hydra groups create $(date +%s%N); done

Find

while true; do hydra groups find foo; done

Same observations as above.

JWK creation

while true; do hydra keys create -a HS256 $(date +%s%N); done

Same observations as above.

Policy

Create

while true; do hydra policies create -a foo --allow -r foo -s foo --id $(date +%s%N); done

Get

hydra policies create -a foo --allow -r foo -s foo --id 1495199487515477900
while true; do hydra policies get 1495199487515477900; done
while true; do hydra policies get i-do-not-exist; done

Same observations as above.


I haven't checked the warden yet but since policy, token validation and group finding have been tested already, it's unlikely that the warden has a leak.

In general I'm now more confused as to what's causing this. There isn't an obvious SQL connection leak; I also checked all the SQL implementations for missing rows.Close() calls but was unable to find any. Maybe it's caused by some infinite loop that kicks off when some error occurs, but we really need to nail down the API that's causing this in order to find it.
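For reference, the pattern I was auditing for looks roughly like this (a minimal sketch, not a snippet from hydra); an early return before rows.Close() would keep the underlying connection busy:

package store

import "database/sql"

// listPolicyIDs is a hypothetical query helper. The deferred rows.Close()
// releases the connection on every return path; rows.Err() surfaces errors
// that occurred during iteration.
func listPolicyIDs(db *sql.DB) ([]string, error) {
	rows, err := db.Query("SELECT id FROM policies")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}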

@aeneasr
Member

aeneasr commented May 20, 2017

@matteosuppo could you share the database DSN with confidential values redacted, for example: postgres://*:*@*:*/hydra?foo=bar

@aeneasr
Member

aeneasr commented May 20, 2017

Oh and by the way, do both instances stop serving traffic or only one?

@matteosuppo
Contributor Author

matteosuppo commented May 20, 2017 via email

@aeneasr
Member

aeneasr commented May 20, 2017

What I'm interested in are the settings max_conns, max_idle_conns, and max_conn_lifetime, or whether you haven't set them at all. If you haven't, could you try setting max_conn_lifetime to, for example, 1 minute?

postgres://*:*@*:*/hydra?max_conn_lifetime=1m

My guess is that either some connections have issues and then clog up the database, or that connections are capped at a low value (e.g. 4, which happens with 2 logical CPUs) that simply isn't enough. So you could also try:

postgres://*:*@*:*/hydra?max_conn_lifetime=1m&max_conns=20&max_idle_conns=4

which greatly increases the maximum amount of allowed connections.

Of course, upgrading to a larger database might also help. Since the symptoms you describe don't fit a SQL connection leak, but rather either a database load issue or an infinite loop, upgrading would address at least one of the two likely causes.
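For context, this is roughly what those settings map to in Go's database/sql pool; a hedged sketch, since the way hydra parses the DSN parameters internally may differ:

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // postgres driver
)

func main() {
	// Placeholder DSN; substitute your own credentials and host.
	db, err := sql.Open("postgres", "postgres://user:pass@host:5432/hydra?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetConnMaxLifetime(time.Minute) // max_conn_lifetime=1m: recycle connections after one minute
	db.SetMaxOpenConns(20)             // max_conns=20: hard cap on open connections
	db.SetMaxIdleConns(4)              // max_idle_conns=4: connections kept open while idle
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}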

@matteosuppo
Contributor Author

matteosuppo commented May 20, 2017 via email

@matteosuppo
Contributor Author

Those flags don't seem to have changed anything.

I managed to capture some stats of the machine before it died:

http://imgur.com/wqDR4tT
http://imgur.com/8nb3qdr

You can see that there seems to be some memory leak going on, since the allocated memory keeps growing. It reaches 2.5 GB.

My theory is that at that point it hits the per-process memory limit set by Docker, starts swapping, and everything goes south.

We can also see that the disk space used is growing, probably because of the logs. I wonder if those two are connected.

Anyway, I tried monitoring top while running a bunch of load on my local machine with:

hey -n 2000 -c 500 -H "Authorization: bearer z5VepohqLE3hIEfHfCkfpzJCrufJbhXVZcO43SNWG18.FaYVcXhgd5lUSemxQG-Vitg8IxuSSD3JIKrzdJeACXo" -d '{"scopes":["profile:core"],"token":"BKxpqvBicy6YP64JNzRdnu_DAlzq0lQ9q0_KFHNxiu4.ez2I29zoArRYuvHDmlTmCleRkp-xHJ94IfiYuPSRLdI","resource":"user:me","action":"show","context":null}' -m POST http://localhost:4444/warden/token/allowed

And I saw the memory usage grow steadily.

I also tried to make sense of go tool pprof, but I'm not very good at it.
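For reference, the general pattern for profiling a Go service with go tool pprof is sketched below; whether hydra exposes such an endpoint out of the box is an assumption on my part, so treat the port and path as placeholders.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve profiling endpoints on a local, non-public port, then inspect
	// the heap with: go tool pprof http://localhost:6060/debug/pprof/heap
	// and use `top` / `list` inside pprof to see which allocations keep growing.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}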

@aeneasr
Member

aeneasr commented Jun 1, 2017

Thank you, this helps a lot!

@aeneasr
Member

aeneasr commented Jun 1, 2017

Hm, I was not able to reproduce the memory leak with hey. Are you sure that what you observed wasn't caused by Go ramping up memory to deal with 500 concurrent connections? You can test this assumption by waiting ~5 minutes after the test: if the memory usage drops back to a few megabytes, it's not a memory leak.
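One way to tell the two apart from inside the process is to log the runtime's own memory statistics; a hypothetical helper one could embed in a Go service, not something hydra ships:

package monitor

import (
	"log"
	"runtime"
	"time"
)

// LogMemStats periodically prints how much heap is live versus how much
// memory the runtime has reserved from, or already returned to, the OS.
func LogMemStats(interval time.Duration) {
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapAlloc: bytes of live objects; Sys: total obtained from the OS;
		// HeapReleased: memory already handed back to the OS.
		log.Printf("heapAlloc=%dMiB sys=%dMiB heapReleased=%dMiB",
			m.HeapAlloc>>20, m.Sys>>20, m.HeapReleased>>20)
	}
}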

@aeneasr
Member

aeneasr commented Jun 1, 2017

Oh, and did you run this against the in-memory database or a SQL backend? I think I'm seeing the same results as you with the postgres backend enabled.

@aeneasr
Member

aeneasr commented Jun 1, 2017

So I ran about 10 million requests against hydra set up in a Docker container with postgres connectivity. While the memory kept growing slowly at some point, it was freed immediately after the load stopped (checked with docker stats). I was unable to reproduce the growing disk usage (docker ps -s).

I checked one of my production deployments on kubernetes (significantly less load though) which is running hydra 0.7.7-http and it looks like this:

[graph: memory and disk usage of the kubernetes deployment over time]

where the green line is hydra and the disk spikes are caused by the hydra logs (the drops are probably kubernetes cleaning up log files). I'm not sure whether that's the same thing you're experiencing, but it does show some increase in memory consumption over time.

@matteosuppo
Contributor Author

I was running it against a PostgreSQL database. It could indeed be that the memory grows to accommodate the load, but wouldn't it reach a stable point? Instead it grew hour after hour until it had allocated 2.5 GB, right before the crash.

@aeneasr
Member

aeneasr commented Jun 1, 2017

@matteosuppo could you join me on gitter?
