CredHub requests are really slow or time out [critical] #46
Hi @kgrodzicki, sorry to hear that. Funnily enough, a few minutes ago I was working on our Concourse-Up-deployed CredHub instance, and it was working just fine. How familiar are you with BOSH and with debugging BOSH-deployed systems?
Hi @DanielJonesEB,
Let us know if you find anything interesting.
We ran `cat credhub.stdout.log`, and the logs in /var/vcap/sys/log/credhub/credhub.log look interesting: many authorization errors. It is only Concourse that calls CredHub at the moment, and it looks like it is effectively DoS'ing it, making it unresponsive.
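To get a feel for how frequent those errors are, a simple grep over the CredHub log helps. The sketch below runs against invented sample lines, since the exact message format wasn't shared in this thread; on a real deployment you would point the same grep at /var/vcap/sys/log/credhub/credhub.log on the web VM instead.

```shell
#!/bin/sh
# The log lines below are invented sample data for illustration only;
# the real log is /var/vcap/sys/log/credhub/credhub.log on the BOSH VM.
cat > /tmp/credhub-sample.log <<'EOF'
2018-06-01T10:00:00Z ERROR authorization error while fetching credential
2018-06-01T10:00:01Z INFO  credential fetched ok
2018-06-01T10:00:02Z ERROR authorization error while fetching credential
EOF

# Count how many lookups are being rejected outright.
grep -c 'authorization error' /tmp/credhub-sample.log
```

A high and steadily growing count suggests rejected lookups rather than plain slowness.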
Hi @kgrodzicki, we're having trouble replicating this. Could you give some more details about your deployment? What flags did you use when you first deployed? Approximately how many pipelines are there on this Concourse? Thanks.
After taking a look at our logs, I don't think the errors listed are necessarily related to your problem. We see similar errors but aren't experiencing CredHub timeouts or slowness. Part of the CredHub integration in Concourse is that credentials can be stored either at team level or at pipeline level, so we also see quite a few failed lookups in our own logs.

Something you could check is what your credit balance/usage is on your RDS instance; you can see this in the AWS console. I have seen the DB get put under extra load after transitioning a number of pipelines over to using CredHub: the credits got exhausted and the CPU started being throttled, which caused a variety of slowness-related issues.
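Besides the AWS console, the RDS credit balance can be read via CloudWatch from the command line. A sketch that only prints the query it would run, rather than executing it; the DB instance identifier and the time window are placeholders, not values from this thread:

```shell
#!/bin/sh
# Build (but do not execute) an AWS CLI query for the RDS CPUCreditBalance
# metric. DB_ID is a placeholder; substitute your actual RDS identifier.
DB_ID="${DB_ID:-my-concourse-rds}"
CMD="aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name CPUCreditBalance \
  --dimensions Name=DBInstanceIdentifier,Value=${DB_ID} \
  --start-time 2018-06-01T00:00:00Z --end-time 2018-06-01T06:00:00Z \
  --period 300 --statistics Average"
printf '%s\n' "$CMD"
```

A balance trending toward zero while pipelines slow down is a strong sign the instance is being throttled.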
Hi @iferunsewe @crsimmons. @iferunsewe, the first installation command was: … @crsimmons: …
After updating the DB instance to xlarge, it looks much better now. Thank you guys for your time, it was really helpful!
Unfortunately, the problem does not seem to be fully fixed: using secrets and variables from CredHub significantly slows down our pipelines. Experiment:

```yaml
jobs:
  # …
```

Using this config, the hello-world task takes 2 minutes and 10 seconds. By removing the CredHub params, the build time drops to 6 seconds.
```yaml
jobs:
  # …
```

For every CredHub param we add to the config, the build time increases. NB: are there any plans to support AWS SSM as an alternative? That would most likely solve our problem and keep us aligned with our policy of using AWS tools.
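The pipeline snippets above were truncated, but a minimal Concourse pipeline of the shape described, with secrets pulled from the credential manager via `((var))` syntax, might look like this (the variable names and task details are assumptions, not the poster's actual config):

```yaml
jobs:
- name: hello-world
  plan:
  - task: say-hello
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: alpine}
      params:
        # each ((…)) reference triggers a credential-manager lookup at build time
        SECRET_ONE: ((secret-one))
        SECRET_TWO: ((secret-two))
      run:
        path: echo
        args: ["hello world"]
```

Removing the `params` block removes the lookups, which is what the 6-second run above corresponds to.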
In addition, after upgrading the RDS instance to extra large, the CPU is constantly hitting the roof at 90% to 100%.
The performance has been gradually degrading; it is not only related to the latest upgrade (see the first post).
I don't think this issue is the result of a problem with Concourse-up. Concourse-up is just a mechanism for deploying Concourse - I can't think of anything special we are doing that would result in performance degradation. If anything it sounds like an issue with Concourse, CredHub, or the integration between the two. You might get better debugging advice if you raise issues with those projects. You could also start a thread over at the Concourse discuss forum to see if anyone else has encountered this.

It's challenging for me to artificially replicate the load generated by many devs on ~70 pipelines, but at clients I have seen Concourse databases slow down tremendously as the number of CredHub lookups from pipelines increases. As far as I can tell, the only real solution from an ops perspective is to scale the instance.

As for your question about AWS SSM: that isn't on our current roadmap, but I have added a note about it to our story planning board.
Yes, we have boosted both web and db to 2xlarge to make it manageable. I guess we will have to migrate away from CredHub to SSM sooner rather than later.
We had this exact problem, and diagnosed it to the t2 instance type being used for the web VM. I strongly recommend not using T2s for the web instance.
We could add more web sizes to Concourse-up: either some m4 ones, or maybe the ability to enable t2 unlimited on the existing instances. I'd be interested to know (from @archgrove or @engrun or anyone else seeing this issue) whether the web VM CPU is running close to max all the time, or only spikes up some of the time. You can see a CPU graph on the included Grafana dashboard.
We are now running 2xlarge for worker, web, and database, and it is finally working as expected (although tasks still spend 10-15 seconds in the pending state before they start).

This reminds me of a previous issue regarding secrets management, where we advocated a pass-through for credential-management configuration, leaving it up to the users of concourse-up to choose their own credential manager. This issue makes that relevant again, as we would like to use AWS SSM instead of CredHub, given the state of CredHub. I think we have two possible solutions in the longer term: 1) wait for SSM support in concourse-up, or 2) host Concourse on EC2 ourselves instead.
@crsimmons We observed T2 credit depletion due to a persistent >30% CPU load on our t2.small instance. SSHing into the box indicated that most of this was due to the CredHub Java processes. As soon as the CPU credits were exhausted, the CPU was pegged at 100% (no surprise on a throttled T2). A t2.large running the same workload persistently sits at around 30% CPU usage, which is fortunately below the threshold at which the credit balance decreases. However, we won't need many more pipelines before even that isn't enough (hence my advice to avoid T2 instance types). Interestingly, our RDS is fine as a t2.small; whatever the workload is, it's not that database-heavy.

I'd love to get SSM support into concourse-up, given that Concourse supports it natively. This would fit with our preferred model of keeping everything that AWS offers "as a service" inside the AWS estate. Until someone (maybe us) finds time to PR that, documenting the web instance sizes and offering non-T2s would probably be a win.

Frankly, I'd also like to find out why the ATC + CredHub use that much resource. Our workload isn't particularly heavy, and it seems excessive to need large instances just to run the CI pipeline-management software - not even the pipeline workload itself.
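The credit arithmetic here is easy to sanity-check. Per AWS documentation, a t2.small earns 12 CPU credits per hour and a t2.large earns 36, where one credit is one vCPU-minute at 100%. A quick sketch of the net credit rate at the 30% load described above (treating the reported 30% as the average across all vCPUs, which is an assumption):

```shell
#!/bin/sh
# Net CPU-credit rate = credits earned per hour minus credits spent per hour.
# Spend rate = cpu% x vCPUs x 60 minutes / 100. Earn rates are from AWS docs
# (t2.small: 12/hour, 1 vCPU; t2.large: 36/hour, 2 vCPUs).
net_rate() {
  cpu_pct=$1; vcpus=$2; earn=$3
  spend=$(( cpu_pct * vcpus * 60 / 100 ))
  echo $(( earn - spend ))
}
echo "t2.small at 30% CPU: $(net_rate 30 1 12) credits/hour"  # negative: depletes
echo "t2.large at 30% CPU: $(net_rate 30 2 36) credits/hour"  # roughly break-even
```

So a t2.small at a sustained 30% load loses about 6 credits an hour and will eventually be throttled, while a t2.large at the same load roughly breaks even, which is consistent with the behaviour described above.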
@engrun @crsimmons One other thing we're trialling is reducing the …
From experience, setting …
@archgrove @crsimmons We have solved …
I know SSM support was recently added to your backlog. Any chance this issue could affect its priority?
Hi all, @iferunsewe and @takeyourhatoff did some profiling and found behaviour in the CredHub client library that is causing massive amounts of inefficiency: cloudfoundry/credhub-cli#45. We've reached out to the CredHub and Concourse teams to get this merged into the right places. Thank you for your patience!
For those interested, you can follow our Concourse issue here: concourse/concourse#2373
Hi,
after updating Concourse to the latest version with concourse-up, we are experiencing that CredHub is really slow, or it actually times out. Can you please give us some tips on how to troubleshoot this?