Cannot start Teleport with "distant" DynamoDB backend #31690

Closed
camscale opened this issue Sep 11, 2023 · 1 comment · Fixed by #31729
Labels
aws (Used for AWS Related Issues), bug, test-plan-problem (Issues which have been surfaced by running the manual release test plan)

Comments

@camscale (Contributor) commented:

Expected behavior:
Teleport starts and runs when configured with a DynamoDB backend in a far part of the world.

Current behavior:
Teleport does not start, logging a fatal error -

ERROR: initialization failed
failed to get ".locks/local" (dynamo error)
        RequestCanceled: request context canceled

Bug details:

  • Teleport version: 14.0.0-beta.1
  • Recreation steps
    • Create a simple configuration with a DynamoDB backend:
teleport:
  nodename: local
  data_dir: d
  storage:
    type: dynamodb
    region: ap-southeast-2
    table_name: camh-teleport-backend
    audit_events_uri:  ['dynamodb://camh-teleport-events']
    audit_sessions_uri: 's3://camh-sessions-bucket/records'
    audit_retention_period: 365d
    billing_mode: provisioned
  • Start teleport with this configuration
  • Debug logs
2023-09-11T16:58:54+10:00 INFO [AUTH]      Auth server is running periodic operations. auth/init.go:596
2023-09-11T16:58:54+10:00 DEBU [AUTH]      Ticking with period: 15s. auth/auth.go:937

ERROR REPORT:
Original Error: trace.aggregate failed to get ".locks/local" (dynamo error)
        RequestCanceled: request context canceled
caused by: context deadline exceeded
Stack Trace:
        github.com/gravitational/teleport/lib/backend/helpers.go:228 github.com/gravitational/teleport/lib/backend.RunWhileLocked
        github.com/gravitational/teleport/lib/auth/init.go:281 github.com/gravitational/teleport/lib/auth.Init
        github.com/gravitational/teleport/lib/service/service.go:1719 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthService
        github.com/gravitational/teleport/lib/service/service.go:1086 github.com/gravitational/teleport/lib/service.NewTeleport
        github.com/gravitational/teleport/e/tool/teleport/process/process.go:57 github.com/gravitational/teleport/e/tool/teleport/process.NewTeleport
        github.com/gravitational/teleport/lib/service/service.go:685 github.com/gravitational/teleport/lib/service.Run
        github.com/gravitational/teleport/e/tool/teleport/main.go:28 main.main
        runtime/proc.go:267 runtime.main
        runtime/asm_amd64.s:1650 runtime.goexit
User Message: initialization failed
        failed to get ".locks/local" (dynamo error)
        RequestCanceled: request context canceled
caused by: context deadline exceeded

The root of this problem is that InitCluster() is run under RunWhileLocked() (https://github.com/gravitational/teleport/blob/v14.0.0-beta.1/lib/auth/init.go#L281) with a default lock release timeout of 300ms (https://github.com/gravitational/teleport/blob/v14.0.0-beta.1/lib/backend/helpers.go#L181). When the AWS region is "far" away, 300ms is not enough time for the lock to be released, so the lock release fails even though the process was successfully initialised.

In the above sample configuration, I have the region as ap-southeast-2 (Sydney, Australia). I am in Melbourne, Australia, so this works for me. But if I use us-west-2 (Oregon), I get the above failure trace. The RTT from Australia to the US across the Pacific is typically about 200ms at best, so a 300ms lock release timeout appears to be too short for that.
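
Here is a minimal, self-contained Go sketch of the pattern (illustrative only; these names are not Teleport's actual RunWhileLocked or backend APIs) showing why a 300ms release deadline fails against a distant backend even though the locked work itself succeeded:

package main

import (
	"context"
	"fmt"
	"time"
)

// releaseLock stands in for the backend Delete that drops the ".locks/local" item.
// The 400ms sleep approximates one request/response round trip to a distant region.
func releaseLock(ctx context.Context) error {
	select {
	case <-time.After(400 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("dynamo error: RequestCanceled: %w", ctx.Err())
	}
}

// runWhileLocked mimics the shape of the problem: the release gets its own short
// deadline, independent of whether the locked work already succeeded.
func runWhileLocked(releaseTimeout time.Duration, fn func() error) error {
	// ... lock acquisition elided ...
	if err := fn(); err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), releaseTimeout)
	defer cancel()
	return releaseLock(ctx)
}

func main() {
	// 300ms (the current default): the release times out, so startup reports
	// failure even though fn succeeded.
	fmt.Println(runWhileLocked(300*time.Millisecond, func() error { return nil }))
	// A more generous timeout leaves headroom for distant regions.
	fmt.Println(runWhileLocked(time.Minute, func() error { return nil }))
}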

The error message output for this failure gives no indication at all of what the error is. I needed to run teleport with DEBUG=1 to get a stack trace of the failure, and apply my understanding of "context deadline exceeded" as a Go programmer to be able to figure out what was going on. A non-developer user probably wouldn't have a chance of diagnosing this from the reported error.

camscale added the bug and test-plan-problem labels on Sep 11, 2023
@espadolini (Contributor) commented:

The experience of using Teleport when a backend Delete takes longer than 300ms is going to be somewhat poor regardless, but maybe we could release the lock in the background? Or just give a generous enough time for the lock release to handle all wired connections across the planet, I suppose.

There are other uses of RunWhileLocked in the codebase, none of which alter ReleaseCtxTimeout from its 300ms default, so even if we fixed it for lib/auth.Init, the other ones are still going to error out (athena events processor, remote locks replacement, all SAML IdP and access list handling).
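
For illustration, the background-release idea would look roughly like this (a hypothetical sketch; runWhileLocked and releaseLock are stand-ins, not the real implementation, and it assumes the lock item has an expiry so a failed release cannot hold the lock forever):

package main

import (
	"context"
	"log"
	"time"
)

// releaseLock is a stand-in for the backend Delete of the lock item.
func releaseLock(ctx context.Context) error { return nil }

// runWhileLocked, background-release variant: the caller's result depends only
// on fn, so a slow Delete can no longer turn successful work into a fatal error.
func runWhileLocked(ctx context.Context, fn func(context.Context) error) error {
	// ... lock acquisition elided ...
	err := fn(ctx)
	go func() {
		// Detach the release from the caller's deadline and give it generous time.
		releaseCtx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		if rerr := releaseLock(releaseCtx); rerr != nil {
			log.Printf("background lock release failed; relying on lock expiry: %v", rerr)
		}
	}()
	return err
}

func main() {
	_ = runWhileLocked(context.Background(), func(context.Context) error { return nil })
	time.Sleep(50 * time.Millisecond) // let the demo's background release run before exit
}

The trade-off is that a later caller may have to wait out the lock expiry if the background release fails, whereas simply raising the release timeout keeps the current synchronous behaviour.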

zmb3 added the aws label on Sep 11, 2023
rosstimothy added a commit that referenced this issue Sep 11, 2023
The default release timeout is now a minute to allow slow/distant
connections to the backend to complete releasing the lock.

Closes #31690
github-merge-queue bot, github-actions bot, and hugoShaka pushed further commits referencing this issue (Sep 11–12, 2023 and Apr 18, 2024) with the same change.