Random periodic crashes #461

Closed

matteosuppo opened this issue May 18, 2017 · 17 comments

@matteosuppo
Contributor

It's a mystery. We have two instances that serve a lot of traffic without problems, but sometimes the database connections spike, the instances stop serving traffic, and I have no idea what causes it.

What could I do to investigate? Wading through the logs seems difficult without an idea of what to look for.

We tried with v0.7.x and 0.8.x, both with and without Docker.

Could it be something related to the database? It's PostgreSQL.

@aeneasr
Member

aeneasr commented May 18, 2017

Is this the same issue you reported previously? I thought it had to do with a bug in the REST writer that caused an infinite loop when certain error conditions were met. However, that would not explain database spikes, and that bug has been resolved in 0.8.x.

One likely cause for database spikes is leaked connections when an error occurs. This can happen in a transaction with multiple statements where the connection isn't released if one of the statements fails. However, that would result in a steady increase of connections over time, not an abrupt spike.
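To illustrate the failure mode I mean, here is a minimal Go sketch (hypothetical code, not taken from hydra): if an error path returns without rolling back, the transaction's connection never goes back into database/sql's pool.

package store

import "database/sql"

// createThing is a hypothetical multi-statement transaction. Without the
// deferred Rollback, any `return err` below would keep the connection
// checked out of the pool; Rollback after a successful Commit is a no-op.
func createThing(db *sql.DB, id, name string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.Exec("INSERT INTO things (id) VALUES ($1)", id); err != nil {
		return err
	}
	if _, err := tx.Exec("INSERT INTO thing_names (id, name) VALUES ($1, $2)", id, name); err != nil {
		return err
	}
	return tx.Commit()
}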

To identify the issue we need to know which API is causing it. Excluding APIs will narrow down the search, so are there endpoints you can rule out? It would also help to know which flows you've enabled.

My prime suspects, in that order, are the warden, token introspection, and token issuance, simply because the warden is the most complex and probably handles the most API calls.

The first thing to look for in the logs is 5xx responses, and then which requests happened right before them. Also important is how long it took for the database connections to leak, and how you've tuned the database connection settings (e.g. max conns).

@aeneasr
Member

aeneasr commented May 18, 2017

If the error happens in both 0.7.x and 0.8.x, ladon is actually not as likely to be the source of this as I thought, because the ladon SQL manager was reworked completely between those versions.

How often does this happen, by the way? Is there a fixed pattern (every 5 hours, for example)?

@aeneasr
Member

aeneasr commented May 18, 2017

Actually, @matteosuppo could you share the CPU, Memory and SQL Connection graphs here, including timestamps?

@matteosuppo
Contributor Author

As soon as I get another crash

@aeneasr
Member

aeneasr commented May 19, 2017

So I started running some tests against the endpoints; I'll summarize what I did here so we get a good idea of what's happening. I checked the number of connections using:

docker exec -t -i 991 /bin/bash
psql --user postgres
SELECT * FROM pg_stat_activity;

PS: everything ran locally here, so there was zero chance of network errors, which could still be the underlying issue in your setup.

Testing oauth2 token flow

while true; do hydra token client; done

Here I was not able to find connection leaks. Connections were properly re-used and closed, and stayed below the number of concurrent clients, which is a good sign. They maxed out at about 5 postgres connections with 15 concurrent clients and returned to 2 idle connections once the load was taken away.

Token validation

I ran the same test with token validation:

while true; do hydra token validate $(hydra token client); done

Same observations as above.

Client creation

while true; do hydra clients create; done

Same observations as above.

Group

Create

while true; do hydra groups create $(date +%s%N); done

Find

while true; do hydra groups find foo; done

Same observations as above.

JWK creation

while true; do hydra keys create -a HS256 $(date +%s%N); done

Same observations as above.

Policy

Create

while true; do hydra policies create -a foo --allow -r foo -s foo --id $(date +%s%N); done

Get

hydra policies create -a foo --allow -r foo -s foo --id 1495199487515477900
while true; do hydra policies get 1495199487515477900; done
while true; do hydra policies get i-do-not-exist; done

Same observations as above.


I haven't checked the warden yet but since policy, token validation and group finding have been tested already, it's unlikely that the warden has a leak.

In general I'm now more confused as to what's causing this. There isn't an obvious SQL connection leak; I also checked all the SQL implementations for missing rows.Close() calls but was unable to find any. Maybe it's caused by some infinite loop that kicks off when some error occurs, but we really need to nail down the API that's causing this in order to find it.
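For reference, the pattern I was auditing for looks roughly like this (a minimal sketch, not a snippet from hydra); an early return before rows.Close() would keep the underlying connection busy:

package store

import "database/sql"

// listPolicyIDs is a hypothetical query helper. The deferred rows.Close()
// releases the connection on every return path; rows.Err() surfaces errors
// that occurred during iteration.
func listPolicyIDs(db *sql.DB) ([]string, error) {
	rows, err := db.Query("SELECT id FROM policies")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}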

@aeneasr
Member

aeneasr commented May 20, 2017

@matteosuppo could you share the database DSN with confidential values redacted, for example: postgres://*:*@*:*/hydra?foo=bar

@aeneasr
Member

aeneasr commented May 20, 2017

Oh and by the way, do both instances stop serving traffic or only one?

@matteosuppo
Contributor Author

matteosuppo commented May 20, 2017 via email

@aeneasr
Member

aeneasr commented May 20, 2017

What I'm interested in are the settings max_conns, max_idle_conns, and max_conn_lifetime, or whether you haven't set them at all. If you haven't, could you try setting max_conn_lifetime to, for example, 1 minute?

postgres://*:*@*:*/hydra?max_conn_lifetime=1m

My guess is that either some connections have issues and then clog up the database, or that connections are capped at a low value (e.g. 4, which happens with 2 logical CPUs) that simply isn't enough. So you could also try:

postgres://*:*@*:*/hydra?max_conn_lifetime=1m&max_conns=20&max_idle_conns=4

which greatly increases the maximum amount of allowed connections.

Of course, upgrading to a larger database might also help. Since the symptoms you describe don't fit a SQL connection leak, but rather either a database load issue or an infinite loop, upgrading would address at least one of the two likely causes.
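For context, this is roughly what those settings map to in Go's database/sql pool; a hedged sketch, since the way hydra parses the DSN parameters internally may differ:

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // postgres driver
)

func main() {
	// Placeholder DSN; substitute your own credentials and host.
	db, err := sql.Open("postgres", "postgres://user:pass@host:5432/hydra?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetConnMaxLifetime(time.Minute) // max_conn_lifetime=1m: recycle connections after one minute
	db.SetMaxOpenConns(20)             // max_conns=20: hard cap on open connections
	db.SetMaxIdleConns(4)              // max_idle_conns=4: connections kept open while idle
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}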

@matteosuppo
Contributor Author

matteosuppo commented May 20, 2017 via email

@matteosuppo
Contributor Author

Those flags don't seem to have changed anything.

I managed to capture some stats of the machine before it died:

http://imgur.com/wqDR4tT
http://imgur.com/8nb3qdr

You can see that there seems to be some memory leak going on, since the allocated memory keeps growing. It reaches 2.5 GB.

My theory is that at that point it hits the per-process memory limit set by Docker, starts swapping, and everything goes south.

We can also see that the disk space used is growing, probably because of the logs. I wonder if those two are connected.

Anyway, I tried monitoring top while running a bunch of load on my local machine with:

hey -n 2000 -c 500 -H "Authorization: bearer z5VepohqLE3hIEfHfCkfpzJCrufJbhXVZcO43SNWG18.FaYVcXhgd5lUSemxQG-Vitg8IxuSSD3JIKrzdJeACXo" -d '{"scopes":["profile:core"],"token":"BKxpqvBicy6YP64JNzRdnu_DAlzq0lQ9q0_KFHNxiu4.ez2I29zoArRYuvHDmlTmCleRkp-xHJ94IfiYuPSRLdI","resource":"user:me","action":"show","context":null}' -m POST http://localhost:4444/warden/token/allowed

And I saw the memory usage grow steadily.

I also tried to make sense of go tool pprof, but I'm not very good at it.
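For reference, the general pattern for profiling a Go service with go tool pprof is sketched below; whether hydra exposes such an endpoint out of the box is an assumption on my part, so treat the port and path as placeholders.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve profiling endpoints on a local, non-public port, then inspect
	// the heap with: go tool pprof http://localhost:6060/debug/pprof/heap
	// and use `top` / `list` inside pprof to see which allocations keep growing.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}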

@aeneasr
Member

aeneasr commented Jun 1, 2017

Thank you, this helps a lot!

@aeneasr
Member

aeneasr commented Jun 1, 2017

Hm, I was not able to reproduce the memory leak with hey. Are you sure that what you observed wasn't caused by Go ramping up memory to deal with 500 concurrent connections? You can test this assumption by waiting ~5 minutes after the test: if the memory usage drops back to a few megabytes, it's not a memory leak.
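One way to tell the two apart from inside the process is to log the runtime's own memory statistics; a hypothetical helper one could embed in a Go service, not something hydra ships:

package monitor

import (
	"log"
	"runtime"
	"time"
)

// LogMemStats periodically prints how much heap is live versus how much
// memory the runtime has reserved from, or already returned to, the OS.
func LogMemStats(interval time.Duration) {
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapAlloc: bytes of live objects; Sys: total obtained from the OS;
		// HeapReleased: memory already handed back to the OS.
		log.Printf("heapAlloc=%dMiB sys=%dMiB heapReleased=%dMiB",
			m.HeapAlloc>>20, m.Sys>>20, m.HeapReleased>>20)
	}
}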

@aeneasr
Member

aeneasr commented Jun 1, 2017

Oh, and did you run this against the in-memory database or a SQL backend? I think I'm seeing the same results as you with the postgres backend enabled.

@aeneasr
Member

aeneasr commented Jun 1, 2017

So I ran about 10 million requests against hydra set up in a Docker container with postgres connectivity. While the memory kept growing slowly at some point, it was freed immediately after the load stopped (checked with docker stats). I was unable to reproduce the growing disk usage (docker ps -s).

I checked one of my production deployments on kubernetes (significantly less load though) which is running hydra 0.7.7-http and it looks like this:

[graph: memory and disk usage of the kubernetes deployment over time]

where the green line is hydra and the disk spikes are caused by the hydra logs (the drops are probably kubernetes cleaning up log files). I'm not sure whether that's the same thing you're experiencing, but it does show some increase in memory consumption over time.

@matteosuppo
Contributor Author

I was running it against a PostgreSQL database. It could indeed be that the memory grows to accommodate the load, but wouldn't it reach a stable point? Instead it grew hour after hour until it had allocated 2.5 GB, right before the crash.

@aeneasr
Member

aeneasr commented Jun 1, 2017

@matteosuppo could you join me on gitter?
