Upgrade from 2.4.0 to 2.4.2, trusted clusters failed to reconnect #1723

Closed
aashley opened this issue Feb 26, 2018 · 11 comments

aashley commented Feb 26, 2018

What happened: After upgrading from 2.4.0 to 2.4.2, previously connected trusted clusters could no longer connect. Got the following errors in the master logs:

Feb 26 07:26:18 teleport teleport[31744]: WARN [PROXY:SER] this claims to be signed as authDomain %!v(MISSING), but no matching signing keys found remote:180.181.245.242:35200 user:7f79bd80-6eb8-4290-9b8d-069ce2f700aa.od-server-00141 reversetunnel/srv.go:514
Feb 26 07:26:18 teleport teleport[31744]: ERRO [PROXY:SER] failed to retrieve trusted keys, err: parsing time "2018-02-27T03:26:17.15149118ZZ": extra text: Z reversetunnel/srv.go:416
Feb 26 07:26:18 teleport teleport[31744]: WARN [PROXY:SER] failed authenticate host, err: ssh: certificate signed by unrecognized authority remote:210.10.211.14:58960 user:87b691e3-fb39-411f-93b9-64617275cec3.od-server-00182 reversetunnel/srv.go:502
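
The second log line shows the proxy failing to parse a timestamp that ends in a doubled "Z" suffix. A minimal Go sketch (not Teleport code; the timestamp is copied from the log above) reproduces that kind of "extra text" parse failure:

    package main

    import (
    	"fmt"
    	"time"
    )

    func main() {
    	// A well-formed RFC 3339 timestamp parses cleanly.
    	_, err := time.Parse(time.RFC3339Nano, "2018-02-27T03:26:17.15149118Z")
    	fmt.Println("single Z:", err) // <nil>

    	// The doubled "ZZ" suffix from the log leaves one character unconsumed
    	// after the layout is matched, so time.Parse returns an "extra text"
    	// error like the one reported by the proxy above.
    	_, err = time.Parse(time.RFC3339Nano, "2018-02-27T03:26:17.15149118ZZ")
    	fmt.Println("double Z:", err)
    }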

Upgraded the remote cluster to 2.4.2 with the same result. Only 2 of the remote clusters reconnected successfully; all others failed.

Reverted the master to 2.4.0; remote clusters (a mix of 2.4.0 and 2.4.2) are all connecting correctly.

What you expected to happen: 2.4.0 remote nodes connect successfully.

How to reproduce it (as minimally and precisely as possible):
Not sure yet; I upgraded my production cluster from 2.4.0 to 2.4.2. The cluster was originally installed with 2.0, and it looks like most of the nodes producing the above errors were originally installed pre-2.3.0. The others were timing out with nothing more detailed in the logs.

Environment:

  • Teleport version (use teleport version): Teleport v2.4.2 git:v2.4.2-0-g079d345 and Teleport v2.4.0 git:v2.4.0-0-ge9d6645
  • Tsh version (use tsh version): Teleport v2.4.2 git:v2.4.2-0-g079d345 and Teleport v2.4.0 git:v2.4.0-0-ge9d6645
  • OS (e.g. from /etc/os-release): Ubuntu 16.04

@klizhentas (Contributor)

@russjones see this one and #1733

russjones (Contributor) commented Mar 1, 2018

@aashley These issues are occurring for you while using tsh, correct? Did you wipe out ~/.tsh and try to log in again?

aashley (Author) commented Mar 2, 2018

Yes, and I also tried to log in with the web interface in a new browser.

aashley (Author) commented Mar 2, 2018

I was on the phone for the last comment. Most of my testing is with tsh on the console, including clearing the .tsh folder locally. To validate, I tried the web interface as well, since that's what most of my guys use, and if it worked there I'd put up with the console issues for the other fixes, but it failed there as well.

russjones (Contributor) commented Mar 2, 2018

@aashley I've tried reproducing this issue and have not had any luck. I also went through the diff between v2.4.0 and v2.4.2, and nothing stood out relating to the error you are seeing because that part of the code was largely unchanged.

Two questions:

  1. Are you using an HTTP CONNECT proxy to establish a connection between the remote cluster and the main cluster?
  2. How many remote clusters are connecting to the main cluster?

aashley (Author) commented Mar 2, 2018

No proxy for any of the remote clusters.

Approximately 50 remote clusters.

@russjones (Contributor)

@aashley What backend are you using?

aashley (Author) commented Mar 2, 2018

Haven't configured anything, so the default: bolt.

@klizhentas (Contributor)

We have run some tests and tried to reproduce your issue.

The fundamental problem is a caching backend that is not goroutine-safe; in the case of many trusted clusters it overwrites values, corrupting local state.

I think I have fixed this problem in 2.5.0. Is it easy for you to try out a 2.5.0 configuration (2.5.0 stable is coming out on Monday)?

Otherwise, if it's hard to try out, I will think of some alternative solution. Sorry for your troubles with this setup in OSS.
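
To illustrate the kind of problem described above (a shared cache written by many goroutines without synchronization), here is a minimal Go sketch of a mutex-guarded in-memory cache. The certCache type, its fields, and its methods are hypothetical and are not Teleport's actual caching backend; the point is only that without the mutex, ~50 concurrent writers (roughly one per trusted cluster) would race on the map and could overwrite or corrupt entries, which the race detector (-race) would flag:

    package main

    import (
    	"fmt"
    	"sync"
    )

    // certCache is a hypothetical in-memory cache keyed by trusted-cluster name.
    type certCache struct {
    	mu   sync.RWMutex
    	keys map[string][]byte
    }

    // Set stores a signing key for a cluster; the mutex serializes writers.
    func (c *certCache) Set(cluster string, key []byte) {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	c.keys[cluster] = key
    }

    // Get returns the cached key for a cluster, if present.
    func (c *certCache) Get(cluster string) ([]byte, bool) {
    	c.mu.RLock()
    	defer c.mu.RUnlock()
    	k, ok := c.keys[cluster]
    	return k, ok
    }

    func main() {
    	c := &certCache{keys: make(map[string][]byte)}
    	var wg sync.WaitGroup
    	// Roughly one concurrent writer per trusted cluster, mirroring the ~50 clusters above.
    	for i := 0; i < 50; i++ {
    		wg.Add(1)
    		go func(i int) {
    			defer wg.Done()
    			c.Set(fmt.Sprintf("cluster-%d", i), []byte("signing-key"))
    		}(i)
    	}
    	wg.Wait()
    	fmt.Println("cached clusters:", len(c.keys))
    }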

aashley (Author) commented Mar 12, 2018

We tried the upgrade to 2.4.3 and that appears to have resolved the issue. All our nodes are running 2.4.3 now and connecting successfully.

@klizhentas (Contributor)

k, closing this
