Random periodic crashes #461
Is this the same issue you reported previously? I thought it had to do with a bug in the REST writer which caused an infinite loop if certain error conditions came true. However, that would not explain database spikes, and that error has also been resolved in 0.8.x.

One likely cause for database spikes is leaking connections when an error occurs. This might happen in a transaction with multiple statements that doesn't close the database connection if an error occurs in one of the statements. However, that would result in a steady increase of connections over time, not an abrupt spike.

To identify the issue we need to know which API is causing it. By excluding APIs we can narrow down the search, so are there API endpoints you can exclude? It would also be important to know which flows you've enabled. I think the most likely source is the warden, then token introspection, then token issuance, simply because the warden is the most complex and probably sees a lot of API calls.

The first thing to look for in the logs is the 5xx messages, and then the requests that happened right before them. Also important is how long it took to leak the database connections, and how you've tweaked your database connection settings (e.g. max conns). |
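For illustration, here is a minimal Go sketch of the kind of leak described above, assuming the standard database/sql package; the query and table name are made up, and the same reasoning applies to a transaction that is never rolled back on its error path:

```go
package example

import "database/sql"

// leakyList returns early on a scan error without closing rows, so the
// underlying connection is never returned to the pool.
func leakyList(db *sql.DB) error {
	rows, err := db.Query("SELECT id FROM clients")
	if err != nil {
		return err
	}
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return err // rows is still open here: the connection leaks
		}
	}
	return rows.Err()
}

// fixedList defers rows.Close so every exit path releases the connection.
func fixedList(db *sql.DB) error {
	rows, err := db.Query("SELECT id FROM clients")
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return err
		}
	}
	return rows.Err()
}
```

A leak like this would show up as a slow, steady climb in the connection count rather than an abrupt spike, which is why it only partially fits the symptoms described here.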
If the error happens both in 0.7.x and 0.8.x, ladon is actually not as likely to be the source of this as I thought: the ladon SQL manager was reworked completely, so it's less likely to be the cause. How often does this happen, by the way? Is there a fixed pattern (every 5 hours, for example)? |
Actually, @matteosuppo could you share the CPU, Memory and SQL Connection graphs here, including timestamps? |
As soon as I get another crash |
So I started to run some tests on endpoints; I'll summarize what I did here so we get a good idea of what's happening. I checked the number of open postgres connections during each test (one way of doing that is sketched after this comment).

ps: everything ran locally here, so there is zero chance of network errors, which could be the underlying issue here.

Testing oauth2 token flow

Here, I was not able to find connection leaks. Connections were properly re-used, closed, and lower than the number of concurrent clients, which is a good sign. The connections maxed out at about 5 postgres connections with 15 concurrent clients and returned to 2 idle connections once the load was taken away.

Token validation

I ran the same test with token validation. Same observations as above.

Client creation

Same observations as above.

Group (create and find)

Same observations as above.

JWK creation

Same observations as above.

Policy (create and get)

Same observations as above.

I haven't checked the warden yet, but since policy, token validation and group finding have been tested already, it's unlikely that the warden has a leak. In general I'm now more confused as to what's causing this. There isn't an obvious SQL connection leak, and I also checked all the SQL implementations for missing Close calls. |
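For reference, a small Go sketch of how one might watch the postgres connection count while running such a load test; this is a hypothetical helper (the DSN, database name, and polling interval are placeholders, and github.com/lib/pq is just one possible driver):

```go
package example

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // postgres driver, assumed for this sketch
)

// watchConnections polls pg_stat_activity once per second and prints how
// many server-side connections the given database currently has, so you
// can watch the count rise under load and fall back to idle afterwards.
func watchConnections(dsn, dbname string, samples int) error {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	for i := 0; i < samples; i++ {
		var n int
		err := db.QueryRow(
			"SELECT count(*) FROM pg_stat_activity WHERE datname = $1", dbname,
		).Scan(&n)
		if err != nil {
			return err
		}
		fmt.Printf("%s  %d connections\n", time.Now().Format(time.RFC3339), n)
		time.Sleep(time.Second)
	}
	return nil
}
```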
@matteosuppo could you share the database DSN without confidential things, so for example with user, password and host replaced by *? |
Oh and by the way, do both instances stop serving traffic or only one? |
Both instances. I recreated the database (bigger) on RDS and it seems more stable now. My guess is that the issue was in the database instance, which hung on some queries and caused slowdowns and reconnections. I don't have the URL here since I'm on my phone. It's only accessible from inside the VPN though, so I'm not sure what you could do with it :p
|
What I'm interested in is the settings max_conns, max_idle_conns, and max_conn_lifetime - or whether you haven't set them at all. If you haven't set them, could you try setting max_conn_lifetime to, for example, 1 minute?

postgres://*:*@*:*/hydra?max_conn_lifetime=1m

My guess is that either some connections have issues and then clog up the database, or that connections are capped at a low value (e.g. 4, which happens on 2 logical CPUs) which simply isn't enough. So you could also try:

postgres://*:*@*:*/hydra?max_conn_lifetime=1m&max_conns=20&max_idle_conns=4

which greatly increases the maximum number of allowed connections (the sketch below shows the Go connection-pool knobs these parameters correspond to). Of course, upgrading to a larger database might also help. Since the symptoms you describe don't fit a SQL connection leak, but rather a database load issue or an infinite loop issue, upgrading could resolve them with at least a 50% chance. |
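These DSN parameters line up with the connection-pool limits that Go's standard database/sql package exposes; here is a minimal sketch using the example values above (the lib/pq driver and the exact mapping of parameter names are assumptions of this sketch, not a description of hydra's internals):

```go
package example

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // postgres driver, assumed for this sketch
)

// openPool opens a postgres pool with explicit limits roughly matching
// max_conns=20, max_idle_conns=4 and max_conn_lifetime=1m from the DSN above.
func openPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(20)                 // max_conns: hard cap on open connections
	db.SetMaxIdleConns(4)                  // max_idle_conns: connections kept around between requests
	db.SetConnMaxLifetime(1 * time.Minute) // max_conn_lifetime: recycle connections so stale ones can't clog the pool
	return db, nil
}
```

A finite connection lifetime is what lets the pool shed connections that a struggling database instance has left hanging, which is why it is the first knob suggested above.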
I didn't set up those values, so it may indeed be that. I'll try that on Monday if I encounter issues. Thank you very much.
|
Those flags don't seem to have changed anything. I managed to capture some stats of the machine before it died: http://imgur.com/wqDR4tT

You can see that there seems to be a memory leak going on, since the allocated memory keeps growing; it reaches 2.5 GB. At that point my theory is that it hits the maximum memory per process set by docker, so it tries to go into swap and everything goes south. We can also see that the disk space used is growing, probably because of the logs. I wonder if the two are connected.

Anyway, I tried monitoring top while running a bunch of load on my local machine with

hey -n 2000 -c 500 -H "Authorization: bearer z5VepohqLE3hIEfHfCkfpzJCrufJbhXVZcO43SNWG18.FaYVcXhgd5lUSemxQG-Vitg8IxuSSD3JIKrzdJeACXo" -d '{"scopes":["profile:core"],"token":"BKxpqvBicy6YP64JNzRdnu_DAlzq0lQ9q0_KFHNxiu4.ez2I29zoArRYuvHDmlTmCleRkp-xHJ94IfiYuPSRLdI","resource":"user:me","action":"show","context":null}' -m POST http://localhost:4444/warden/token/allowed

and I saw the memory usage grow steadily. I also tried to make sense of go tool pprof, but I'm not very good at it. |
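If you want to dig into the heap with pprof, a generic way to do it is to expose the standard pprof endpoints in a process you control and point go tool pprof at them; the sketch below is a minimal standalone example, not a claim about how hydra is instrumented (the port 6060 is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, grab a heap profile of the live process via:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// then use `top` or `web` inside pprof to see what is holding memory.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```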
Thank you, this helps a lot! |
Hm, I was not able to reproduce the memory leak with hey. Are you sure that what you observed wasn't caused by Go ramping up memory to be able to deal with 500 concurrent connections? You can test this assumption by waiting ~5 minutes after the test: if the memory usage drops back to a few megabytes, it's not a memory leak. |
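One way to tell the two cases apart, sketched here under the assumption that you can add a few lines to a Go process you build yourself, is to log the runtime's own memory counters before, during, and after the load:

```go
package example

import (
	"log"
	"runtime"
	"time"
)

// logMemStats prints a few runtime counters at a fixed interval. If
// HeapAlloc drops back to a few MB after the load test while Sys stays
// higher, the Go runtime is merely holding on to memory it obtained from
// the OS rather than leaking it; a HeapAlloc that keeps climbing for
// hours points at a real leak.
func logMemStats(interval time.Duration, rounds int) {
	var m runtime.MemStats
	for i := 0; i < rounds; i++ {
		runtime.ReadMemStats(&m)
		log.Printf("HeapAlloc=%dMB HeapInuse=%dMB Sys=%dMB NumGC=%d",
			m.HeapAlloc>>20, m.HeapInuse>>20, m.Sys>>20, m.NumGC)
		time.Sleep(interval)
	}
}
```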
Oh, did you run this against the in-memory database or a SQL backend? I think I'm seeing the same results as you with the postgres backend enabled. |
So I ran about 10 million requests against hydra set up in a docker container with postgres connectivity, and while the memory did keep growing slowly at some point, it was freed up immediately after the load stopped. I also checked one of my production deployments on kubernetes (significantly less load though) which is running hydra 0.7.7-http; in its graphs the green line is hydra, and the disk spikes are caused by the hydra logs (the drops are probably caused by kubernetes cleaning up log files). I'm not sure if that's the same thing you're experiencing, but it does show some increase in memory consumption over time. |
I was running it against a postgresql database. It could indeed be that the memory is growing to accommodate the load, but wouldn't it reach a stable point? Instead it grew hour after hour until it had allocated 2.5 GB before the crash. |
@matteosuppo could you join me on gitter? |
It's a mystery. We have two instances that serve a lot of traffic without problems, but sometimes the database connections spike, the instances stop serving traffic and I have no idea what causes it.
What could I do to investigate? Wading through the logs seems difficult without an idea of what to look for.
We tried with v0.7.x and 0.8.x and with and without docker.
Could it be something related to the database? It's a PostgreSQL database.