Upgrade from 2.4.0 to 2.4.2, trusted clusters failed to reconnect #1723

Closed
aashley opened this issue Feb 26, 2018 · 11 comments

aashley commented Feb 26, 2018

What happened: After upgrading from 2.4.0 to 2.4.2, previously connected trusted clusters could no longer connect. Got the following errors in the master logs:

Feb 26 07:26:18 teleport teleport[31744]: WARN [PROXY:SER] this claims to be signed as authDomain %!v(MISSING), but no matching signing keys found remote:180.181.245.242:35200 user:7f79bd80-6eb8-4290-9b8d-069ce2f700aa.od-server-00141 reversetunnel/srv.go:514
Feb 26 07:26:18 teleport teleport[31744]: ERRO [PROXY:SER] failed to retrieve trusted keys, err: parsing time "2018-02-27T03:26:17.15149118ZZ": extra text: Z reversetunnel/srv.go:416
Feb 26 07:26:18 teleport teleport[31744]: WARN [PROXY:SER] failed authenticate host, err: ssh: certificate signed by unrecognized authority remote:210.10.211.14:58960 user:87b691e3-fb39-411f-93b9-64617275cec3.od-server-00182 reversetunnel/srv.go:502
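
The second log line shows the proxy failing to parse a timestamp that ends in a doubled "Z" suffix. A minimal Go sketch (not Teleport code; the timestamp is copied from the log above) reproduces that kind of "extra text" parse failure:

    package main

    import (
    	"fmt"
    	"time"
    )

    func main() {
    	// A well-formed RFC 3339 timestamp parses cleanly.
    	_, err := time.Parse(time.RFC3339Nano, "2018-02-27T03:26:17.15149118Z")
    	fmt.Println("single Z:", err) // <nil>

    	// The doubled "ZZ" suffix from the log leaves one character unconsumed
    	// after the layout is matched, so time.Parse returns an "extra text"
    	// error like the one reported by the proxy above.
    	_, err = time.Parse(time.RFC3339Nano, "2018-02-27T03:26:17.15149118ZZ")
    	fmt.Println("double Z:", err)
    }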

Upgraded the remote cluster to 2.4.2 with the same result. Only 2 of the remote clusters reconnected successfully; all others failed.

Reverted the master to 2.4.0; remote clusters (a mix of 2.4.0 and 2.4.2) are all connecting correctly.

What you expected to happen: 2.4.0 remote nodes connect successfully.

How to reproduce it (as minimally and precisely as possible):
Not sure yet; I upgraded my production cluster from 2.4.0 to 2.4.2. The cluster was originally installed with 2.0, and it looks like most of the nodes producing the above errors were originally installed pre-2.3.0. The others were timing out with nothing more detailed in the logs.

Environment:

  • Teleport version (use teleport version): Teleport v2.4.2 git:v2.4.2-0-g079d345 and Teleport v2.4.0 git:v2.4.0-0-ge9d6645
  • Tsh version (use tsh version): Teleport v2.4.2 git:v2.4.2-0-g079d345 and Teleport v2.4.0 git:v2.4.0-0-ge9d6645
  • OS (e.g. from /etc/os-release): Ubuntu 16.04

@klizhentas (Contributor)

@russjones see this one and #1733

russjones (Contributor) commented Mar 1, 2018

@aashley These issues are occurring for you while using tsh, correct? Did you wipe out ~/.tsh and try to log in again?

aashley (Author) commented Mar 2, 2018

Yes, and I also tried to log in with the web interface in a new browser.

aashley (Author) commented Mar 2, 2018

I was on the phone for the last comment. Most of my testing is with tsh on the console, including clearing the .tsh folder locally. To validate, I tried the web interface as well, since that's what most of my guys use, and if it worked there I'd put up with the console issues for the other fixes, but it failed there as well.

russjones (Contributor) commented Mar 2, 2018

@aashley I've tried reproducing this issue and have not had any luck. I also went through the diff between v2.4.0 and v2.4.2, and nothing stood out relating to the error you are seeing because that part of the code was largely unchanged.

Two questions:

  1. Are you using an HTTP CONNECT proxy to establish a connection between the remote cluster and the main cluster?
  2. How many remote clusters are connecting to the main cluster?

aashley (Author) commented Mar 2, 2018

No proxy for any of the remote clusters.

Approximately 50 remote clusters.

@russjones (Contributor)

@aashley What backend are you using?

aashley (Author) commented Mar 2, 2018

Haven't configured anything, so the default: bolt.

@klizhentas (Contributor)

We have run some tests and tried to reproduce your issue.

The fundamental problem is a caching backend that is not goroutine-safe; in the case of many trusted clusters it overwrites values, corrupting local state.

I think I have fixed this problem in 2.5.0. Is it easy for you to try out a 2.5.0 configuration (2.5.0 stable is coming out on Monday)?

Otherwise, if it's hard to try out, I will think of some alternative solution. Sorry for your troubles with this setup in OSS.
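
To illustrate the kind of problem described above (a shared cache written by many goroutines without synchronization), here is a minimal Go sketch of a mutex-guarded in-memory cache. The certCache type, its fields, and its methods are hypothetical and are not Teleport's actual caching backend; the point is only that without the mutex, ~50 concurrent writers (roughly one per trusted cluster) would race on the map and could overwrite or corrupt entries, which the race detector (-race) would flag:

    package main

    import (
    	"fmt"
    	"sync"
    )

    // certCache is a hypothetical in-memory cache keyed by trusted-cluster name.
    type certCache struct {
    	mu   sync.RWMutex
    	keys map[string][]byte
    }

    // Set stores a signing key for a cluster; the mutex serializes writers.
    func (c *certCache) Set(cluster string, key []byte) {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	c.keys[cluster] = key
    }

    // Get returns the cached key for a cluster, if present.
    func (c *certCache) Get(cluster string) ([]byte, bool) {
    	c.mu.RLock()
    	defer c.mu.RUnlock()
    	k, ok := c.keys[cluster]
    	return k, ok
    }

    func main() {
    	c := &certCache{keys: make(map[string][]byte)}
    	var wg sync.WaitGroup
    	// Roughly one concurrent writer per trusted cluster, mirroring the ~50 clusters above.
    	for i := 0; i < 50; i++ {
    		wg.Add(1)
    		go func(i int) {
    			defer wg.Done()
    			c.Set(fmt.Sprintf("cluster-%d", i), []byte("signing-key"))
    		}(i)
    	}
    	wg.Wait()
    	fmt.Println("cached clusters:", len(c.keys))
    }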

aashley (Author) commented Mar 12, 2018

We tried the upgrade to 2.4.3 and that appears to have resolved the issue. All our nodes are running 2.4.3 now and connecting successfully.

@klizhentas (Contributor)

k, closing this
