CredHub requests are really slow or time out [critical] #46
Hi @kgrodzicki, sorry to hear that. Funnily enough, a few minutes ago I was working on our Concourse-Up-deployed CredHub instance, and it was working just fine. How familiar are you with BOSH and with debugging BOSH-deployed systems?
Hi @DanielJonesEB,
Let us know if you find anything interesting.
We ran `cat credhub.stdout.log`, and the logs in /var/vcap/sys/log/credhub/credhub.log look interesting: many authorization errors. It is only Concourse that calls CredHub at the moment, and it looks like it is effectively DoS'ing it, making it unresponsive.
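To get a feel for how frequent those errors are, a simple grep over the CredHub log helps. The sketch below runs against invented sample lines, since the exact message format wasn't shared in this thread; on a real deployment you would point the same grep at /var/vcap/sys/log/credhub/credhub.log on the web VM instead.

```shell
#!/bin/sh
# The log lines below are invented sample data for illustration only;
# the real log is /var/vcap/sys/log/credhub/credhub.log on the BOSH VM.
cat > /tmp/credhub-sample.log <<'EOF'
2018-06-01T10:00:00Z ERROR authorization error while fetching credential
2018-06-01T10:00:01Z INFO  credential fetched ok
2018-06-01T10:00:02Z ERROR authorization error while fetching credential
EOF

# Count how many lookups are being rejected outright.
grep -c 'authorization error' /tmp/credhub-sample.log
```

A high and steadily growing count suggests rejected lookups rather than plain slowness.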
Hi @kgrodzicki, we're having trouble replicating this. Could you give some more details about your deployment? What flags did you use when you first deployed? Approximately how many pipelines are there on this Concourse? Thanks.
After taking a look at our logs, I don't think the errors listed are necessarily related to your problem. We see similar errors but aren't experiencing CredHub timeouts or slowness. Part of the CredHub integration in Concourse is that credentials can be stored either at team level or at pipeline level, so we also see quite a few failed lookups in our own logs.

Something you could check is what your credit balance/usage is on your RDS instance; you can see this in the AWS console. I have seen the DB get put under extra load after transitioning a number of pipelines over to using CredHub: the credits got exhausted and the CPU started being throttled, which caused a variety of slowness-related issues.
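Besides the AWS console, the RDS credit balance can be read via CloudWatch from the command line. A sketch that only prints the query it would run, rather than executing it; the DB instance identifier and the time window are placeholders, not values from this thread:

```shell
#!/bin/sh
# Build (but do not execute) an AWS CLI query for the RDS CPUCreditBalance
# metric. DB_ID is a placeholder; substitute your actual RDS identifier.
DB_ID="${DB_ID:-my-concourse-rds}"
CMD="aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name CPUCreditBalance \
  --dimensions Name=DBInstanceIdentifier,Value=${DB_ID} \
  --start-time 2018-06-01T00:00:00Z --end-time 2018-06-01T06:00:00Z \
  --period 300 --statistics Average"
printf '%s\n' "$CMD"
```

A balance trending toward zero while pipelines slow down is a strong sign the instance is being throttled.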
Hi @iferunsewe @crsimmons. @iferunsewe, the first installation command was: … @crsimmons: …
After updating the DB instance to xlarge, it looks much better now. Thank you guys for your time, it was really helpful!
Unfortunately, the problem does not seem to be fully fixed: using secrets and variables from CredHub significantly slows down our pipelines. Experiment:

```yaml
jobs:
  # …
```

Using this config, the hello-world task takes 2 minutes and 10 seconds. By removing the CredHub params, the build time drops to 6 seconds.
```yaml
jobs:
  # …
```

For every CredHub param we add to the config, the build time increases. NB: are there any plans to support AWS SSM as an alternative? That would most likely solve our problem and keep us aligned with our policy of using AWS tools.
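The pipeline snippets above were truncated, but a minimal Concourse pipeline of the shape described, with secrets pulled from the credential manager via `((var))` syntax, might look like this (the variable names and task details are assumptions, not the poster's actual config):

```yaml
jobs:
- name: hello-world
  plan:
  - task: say-hello
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: alpine}
      params:
        # each ((…)) reference triggers a credential-manager lookup at build time
        SECRET_ONE: ((secret-one))
        SECRET_TWO: ((secret-two))
      run:
        path: echo
        args: ["hello world"]
```

Removing the `params` block removes the lookups, which is what the 6-second run above corresponds to.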
In addition, after upgrading the RDS instance to extra large, the CPU is constantly hitting the roof at 90% to 100%.
The performance has been gradually degrading; it is not only related to the latest upgrade (see the first post).
I don't think this issue is the result of a problem with Concourse-up. Concourse-up is just a mechanism for deploying Concourse - I can't think of anything special we are doing that would result in performance degradation. If anything it sounds like an issue with Concourse, CredHub, or the integration between the two. You might get better debugging advice if you raise issues with those projects. You could also start a thread over at the Concourse discuss forum to see if anyone else has encountered this.

It's challenging for me to artificially replicate the load generated by many devs on ~70 pipelines, but at clients I have seen Concourse databases slow down tremendously as the number of CredHub lookups from pipelines increases. As far as I can tell, the only real solution from an ops perspective is to scale the instance.

As for your question about AWS SSM: that isn't on our current roadmap, but I have added a note about it to our story planning board.
Yes, we have boosted both web and db to 2xlarge to make it manageable. I guess we will have to migrate away from CredHub to SSM sooner rather than later.
We had this exact problem, and diagnosed it to the t2 instance type being used for the web VM. I strongly recommend not using T2s for the web instance.
We could add more web sizes to Concourse-up: either some m4 ones, or maybe the ability to enable t2 unlimited on the existing instances. I'd be interested to know (from @archgrove or @engrun or anyone else seeing this issue) whether the web VM CPU is running close to max all the time, or only spikes up some of the time. You can see a CPU graph on the included Grafana dashboard.
We are now running 2xlarge for worker, web, and database, and it is finally working as expected (although tasks still spend 10-15 seconds in the pending state before they start).

This reminds me of a previous issue regarding secrets management, where we advocated a pass-through for credential-management configuration, leaving it up to the users of concourse-up to choose their own credential manager. This issue makes that relevant again, as we would like to use AWS SSM instead of CredHub, given the state of CredHub. I think we have two possible solutions in the longer term: 1) wait for SSM support in concourse-up, or 2) host Concourse on EC2 ourselves instead.
@crsimmons We observed T2 credit depletion due to a persistent >30% CPU load on our t2.small instance. SSHing into the box indicated that most of this was due to the CredHub Java processes. As soon as the CPU credits were exhausted, the CPU was pegged at 100% (no surprise on a throttled T2). A t2.large running the same workload persistently sits at around 30% CPU usage, which is fortunately below the threshold at which the credit balance decreases. However, we won't need many more pipelines before even that isn't enough (hence my advice to avoid T2 instance types). Interestingly, our RDS is fine as a t2.small; whatever the workload is, it's not that database-heavy.

I'd love to get SSM support into concourse-up, given that Concourse supports it natively. This would fit with our preferred model of keeping everything that AWS offers "as a service" inside the AWS estate. Until someone (maybe us) finds time to PR that, documenting the web instance sizes and offering non-T2s would probably be a win.

Frankly, I'd also like to find out why the ATC + CredHub use that much resource. Our workload isn't particularly heavy, and it seems excessive to need large instances just to run the CI pipeline-management software - not even the pipeline workload itself.
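The credit arithmetic here is easy to sanity-check. Per AWS documentation, a t2.small earns 12 CPU credits per hour and a t2.large earns 36, where one credit is one vCPU-minute at 100%. A quick sketch of the net credit rate at the 30% load described above (treating the reported 30% as the average across all vCPUs, which is an assumption):

```shell
#!/bin/sh
# Net CPU-credit rate = credits earned per hour minus credits spent per hour.
# Spend rate = cpu% x vCPUs x 60 minutes / 100. Earn rates are from AWS docs
# (t2.small: 12/hour, 1 vCPU; t2.large: 36/hour, 2 vCPUs).
net_rate() {
  cpu_pct=$1; vcpus=$2; earn=$3
  spend=$(( cpu_pct * vcpus * 60 / 100 ))
  echo $(( earn - spend ))
}
echo "t2.small at 30% CPU: $(net_rate 30 1 12) credits/hour"  # negative: depletes
echo "t2.large at 30% CPU: $(net_rate 30 2 36) credits/hour"  # roughly break-even
```

So a t2.small at a sustained 30% load loses about 6 credits an hour and will eventually be throttled, while a t2.large at the same load roughly breaks even, which is consistent with the behaviour described above.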
@engrun @crsimmons One other thing we're trialling is reducing the …
From experience, setting …
@archgrove @crsimmons We have solved …
I know SSM support was recently added to your backlog. Any chance this issue could affect its priority?
Hi all, @iferunsewe and @takeyourhatoff did some profiling and found behaviour in the CredHub client library that is causing massive amounts of inefficiency: cloudfoundry/credhub-cli#45. We've reached out to the CredHub and Concourse teams to get this merged into the right places. Thank you for your patience!
For those interested, you can follow our Concourse issue here: concourse/concourse#2373
Hi,
after updating Concourse to the latest version with concourse-up, we are experiencing that CredHub is really slow, or it actually times out. Can you please give us some tips on how to troubleshoot this?