Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teleport-proxy does not serve updated https certificates without restart #3815

Closed
dampersand opened this issue Jun 9, 2020 · 12 comments · Fixed by #19996
Closed

Teleport-proxy does not serve updated https certificates without restart #3815

dampersand opened this issue Jun 9, 2020 · 12 comments · Fixed by #19996
Assignees
Labels
bug c-an Internal Customer Reference c-au Internal Customer Reference c-cc Internal Customer Reference c-cx Internal Customer Reference c-cy Internal Customer Reference c-di Internal Customer Reference c-fi Internal Customer Reference cloud Cloud feature-request Used for new features in Teleport, improvements to current should be #enhancements

Comments

@dampersand
Copy link

dampersand commented Jun 9, 2020

Description

We use short-lived SSL certificates from a private CA in our environment, and we have provisioned our private teleport-proxies to serve these short-lived https certificates. (teleport.yaml directive: proxy_service.https_cert_file)

It appears that when these short-lived certificates are rotated and placed on the server, the teleport service continues to pass the old certificate out. Obvs this is no good - we COULD code 'systemctl restart teleport' into the cert rotation, but much better for teleport to handle automated cert changes (so we don't have to force-disconnect users)!

What happened:
contents of proxy_service.https_cert_file changed, but teleport service did not reflect the change until after a restart

What you expected to happen:
the teleport service to accurately reflect the cert file after any possible updates without requiring a restart

How to reproduce it (as minimally and precisely as possible):
Helps to have your own CA, but you can spoof this, too!

  1. have a testing teleport proxy up-and-running. Be sure you can reach it from a browser.
  2. generate a keypair on teleport-proxy and self-sign the public key into a certificate with some unique metadata (for instance, expiry date of tomorrow)
  3. point your teleport proxy's 'proxy_service.https_cert_file' and 'proxy_service.https_key_file' to the generated key/cert
  4. start the teleport service on the teleport-proxy
  5. Point a browser to the teleport-proxy node and view the certificate that is returned (paying close attention to the unique metadata you've set in step 2)
  6. on the teleport-proxy, regenerate/re-sign the certificate, but vary the unique metadata from step 2 (for instance, change the expiry date to next week)
  7. Confirm that teleport-proxy's teleport.yaml is still pointing to the correct key/cert, but DO NOT restart the teleport service
  8. Point a browser to the teleport proxy node and view the certificate that is returned. Notice that the unique metadata from steps 2/6 have NOT been changed from the previous cert you received!

Environment

  • Teleport version (use teleport version): Teleport Enterprise v4.2.10git:v4.2.10-0-g7fd6c7348 go1.13.2

  • Tsh version (use tsh version): Teleport v4.2.10 git:v4.2.10-0-g7fd6c7348 go1.13.2

  • OS (e.g. from /etc/os-release): CentOS Linux release 7.7.1908 (Core)

  • Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): AWS EC2s, but treat it like baremetal

Browser environment

  • Browser Version (for UI-related issues): Google Chrome, Version 83.0.4103.97 (Official Build) (64-bit)
  • Install tools: NA
  • Others: NA

Relevant Debug Logs If Applicable

NA

gz#3739

@webvictim
Copy link
Contributor

This is the same issue as #3198. There was a fix in #3272 but it wasn't resilient/tested enough.

@dampersand
Copy link
Author

My bad. I actually saw that one and then skipped it when I read 'cert-manager', thinking it was a different situation.

@webvictim
Copy link
Contributor

That's no problem - this explanation is more thorough anyway. I definitely want to see a fix for this issue and will try to see if we can get it worked on.

@webvictim webvictim added the feature-request Used for new features in Teleport, improvements to current should be #enhancements label Jun 9, 2020
@russjones russjones added this to the 5.0 Codename TBD milestone Jun 10, 2020
@webvictim
Copy link
Contributor

webvictim commented Aug 24, 2020

I got hit by this one myself this morning.

error: Get https://teleport.example.com:3080/v1/webapi/ping: x509: certificate has expired or is not yet valid
kex_exchange_identification: Connection closed by remote host

At first I thought I'd failed to set up certbot renew to run properly on my Teleport auth/proxy server, but it actually turned out that the certificates had been renewed just fine - they just hadn't been reloaded due to this exact issue.

My current workaround is to add a post-renew hook to certbot which should tell Teleport to reload the certs after they've been renewed:

# cat /etc/letsencrypt/renewal-hooks/post# cat 001-restart-teleport 
#!/bin/bash
kill -HUP $(pidof /usr/local/bin/teleport)

@russjones
Copy link
Contributor

@fspmarshall This is related to cluster upgrades. State machine that can cover CA rotation, but also other upgrades, like nodes, as well as proxy certificates. Watch file and trigger self-reload.

@pierrebeaucamp
Copy link
Contributor

We're running into this as well with Cloud. Fixing it in Teleport would be preferred as it would remove the need to do any sidecar shenanigans as a workaround.

@alex-kovoy
Copy link
Contributor

@russjones

It seems that you can update the keys without restarting the service and do it in place where you can use TLSConfig.GetCertificate method to pass your own callback.

(without knowing all details) we can add a flag to teleport config to autoupdate the keys. When enabled, we would listen to these files and once updated the TLSConfig.GetCertificate callback will make sure that the new keys will be used for the new connections.

robustirc/robustirc@3b83e51
https://stackoverflow.com/questions/37473201/is-there-a-way-to-update-the-tls-certificates-in-a-net-http-server-without-any-d

@benarent benarent added the cloud Cloud label Dec 14, 2020
@russjones russjones modified the milestones: 6.0 "San Diego", 6.1 Jan 26, 2021
@russjones russjones modified the milestones: 6.1, Runway Milestone Feb 3, 2021
@yjperez yjperez added the c-cy Internal Customer Reference label Jun 21, 2021
@nickbrennan1 nickbrennan1 added the c-cx Internal Customer Reference label Sep 27, 2021
@awnumar
Copy link

awnumar commented Dec 7, 2021

I have a preliminary patchset implementing certificate reloading functionality here: #9271

I would appreciate feedback on this

@deusxanima
Copy link
Contributor

Adding additional customers who've requested this feature, as well as requesting that it work for cert-manager as well for users deploying in a k8s environment.

@deusxanima deusxanima added the c-au Internal Customer Reference label Jan 18, 2022
@taylorsmcclure
Copy link

taylorsmcclure commented Mar 31, 2022

For the time being, we solved this by using https://github.com/stakater/Reloader. Once it's deployed you will need to add the annotation reloader.stakater.com/auto: "true" to your teleport deployment in your cluster.

Once the teleport-tls secret is updated reloader will trigger the termination of the teleport pods. After that the new pods will launch with the updated tls cert via the new secret.

@pschisa pschisa added the c-di Internal Customer Reference label Apr 12, 2022
@ygersie
Copy link

ygersie commented Jun 3, 2022

This is a really important feature that's currently lacking. It is a security recommendation to keep certificates short lived, reason being; otherwise you need clients to do cert verification on massive CRLs. Especially when you have certificates issued by HashiCorp Vault you want to keep certs short lived (hours/days, not weeks) as Vault will keep a copy of the cert until it expires, it will not purge it from the store when it is revoked otherwise it wouldn't be able to serve the CRL. I'm not sure why everyone always wants to come up with difficult solutions, a 30 second refresh interval reading a cert pair from the filesystem would fit anyone's use case, no need to deal with filesystem notifications or other complicated handles.

@strideynet
Copy link
Contributor

strideynet commented Jun 8, 2022

Thank you all for your suggestions and thoughts. I've had a quick look at this and formulated a bit of a plan for how we could implement this.

Firstly, we'll need to detect changes to the certificate files on disk. This is a little more complex than first imagined as we'll need to make sure we gracefully handle the certificate file being updated before the key file and vice versa (e.g not crashing or stopping serving), as I imagine that some users will be manually cp or mving the updated files into the directory. I feel an approach using inotify is probably the most sensible here (as changes made will be reflected immediately), but for this we'll need to perform certain checks that the operating system supports this (e.g that the inotify watch limit has not been reached) and either warn users where it is not supported, or fall back to a polling based approach.

For dynamically updating the certificates used by the listener, we can leverage the GetCertificate callback of tls.Config to allow us to change the certificates being served without restarting the TLS listener. The implementation of the proxy TLS config can be found here. We could also leverage GetConfigForClient, as this would allow us to provide the Certificates field as we already do, and avoid us re-implementing the certificate selection logic (as GetCertificate only allows you to provide a single certificate, and we support multiple being configured). The TLS config produced by setupProxyTLSConfig is re-used by the ALPN router, which means that this change should work when ALPN is enabled or disabled.

As for the actual implementation, I'd imagine a CertificateWatcher struct providing:

  • A GetCertificate (or GetCertificates for use in GetConfigForClient) method that can be placed into tls.Config
  • A Start(ctx context.Context) error method that begins by synchronously loading the initially configured certificates, before starting a Goroutine to watch these files for changes. An error should be thrown if the initial configuration is invalid
  • The certificates and keys should be stored in the struct, protected by a sync.RWMutex
  • Support for multiple certificate-key pairs, as our configuration already allows this.
  • Appropriate logging so it is clear when new certificates start being used, and warnings where newly added certificates are invalid.

To enable this functionality, we probably want to include an option within the configuration file until we know this is stable and behaves as expected. I think eventually it might be nice to automatically enable this by default, unless anyone can think of a reason why you wouldn't want this behaviour enabled.

If anyone has any thoughts, please let me know, especially around whether or not we want to enable this functionality by default.

I'm going to go ahead and assign myself to this issue and take ownership of this getting implemented at some point. I can't promise any real deadlines on this, but I fully agree that this is functionality that is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug c-an Internal Customer Reference c-au Internal Customer Reference c-cc Internal Customer Reference c-cx Internal Customer Reference c-cy Internal Customer Reference c-di Internal Customer Reference c-fi Internal Customer Reference cloud Cloud feature-request Used for new features in Teleport, improvements to current should be #enhancements
Projects
None yet