Cannot start Teleport with "distant" DynamoDB backend #31690

Closed
camscale opened this issue Sep 11, 2023 · 1 comment · Fixed by #31729
Labels
aws (Used for AWS Related Issues), bug, test-plan-problem (Issues which have been surfaced by running the manual release test plan)

Comments

@camscale (Contributor) commented:

Expected behavior:
Teleport starts and runs when configured with a DynamoDB backend in a far part of the world.

Current behavior:
Teleport does not start, logging a fatal error -

ERROR: initialization failed
failed to get ".locks/local" (dynamo error)
        RequestCanceled: request context canceled

Bug details:

  • Teleport version: 14.0.0-beta.1
  • Recreation steps
    • Create a simple configuration with a DynamoDB backend:
teleport:
  nodename: local
  data_dir: d
  storage:
    type: dynamodb
    region: ap-southeast-2
    table_name: camh-teleport-backend
    audit_events_uri:  ['dynamodb://camh-teleport-events']
    audit_sessions_uri: 's3://camh-sessions-bucket/records'
    audit_retention_period: 365d
    billing_mode: provisioned
  • Start teleport with this configuration
  • Debug logs
2023-09-11T16:58:54+10:00 INFO [AUTH]      Auth server is running periodic operations. auth/init.go:596
2023-09-11T16:58:54+10:00 DEBU [AUTH]      Ticking with period: 15s. auth/auth.go:937

ERROR REPORT:
Original Error: trace.aggregate failed to get ".locks/local" (dynamo error)
        RequestCanceled: request context canceled
caused by: context deadline exceeded
Stack Trace:
        github.com/gravitational/teleport/lib/backend/helpers.go:228 github.com/gravitational/teleport/lib/backend.RunWhileLocked
        github.com/gravitational/teleport/lib/auth/init.go:281 github.com/gravitational/teleport/lib/auth.Init
        github.com/gravitational/teleport/lib/service/service.go:1719 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthService
        github.com/gravitational/teleport/lib/service/service.go:1086 github.com/gravitational/teleport/lib/service.NewTeleport
        github.com/gravitational/teleport/e/tool/teleport/process/process.go:57 github.com/gravitational/teleport/e/tool/teleport/process.NewTeleport
        github.com/gravitational/teleport/lib/service/service.go:685 github.com/gravitational/teleport/lib/service.Run
        github.com/gravitational/teleport/e/tool/teleport/main.go:28 main.main
        runtime/proc.go:267 runtime.main
        runtime/asm_amd64.s:1650 runtime.goexit
User Message: initialization failed
        failed to get ".locks/local" (dynamo error)
        RequestCanceled: request context canceled
caused by: context deadline exceeded

The root of this problem is that InitCluster() is run under RunWhileLocked() (https://github.com/gravitational/teleport/blob/v14.0.0-beta.1/lib/auth/init.go#L281) with a default lock release timeout of 300ms (https://github.com/gravitational/teleport/blob/v14.0.0-beta.1/lib/backend/helpers.go#L181). When the AWS region is "far" away, 300ms is not enough time for the lock to be released, so the lock release fails even though the process was successfully initialised.

In the above sample configuration, I have the region as ap-southeast-2 (Sydney, Australia). I am in Melbourne, Australia, so this works for me. But if I use us-west-2 (Oregon), I get the above failure trace. The RTT from Australia to the US across the Pacific is typically about 200ms at best, so a 300ms lock release timeout appears to be too short for that.
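
Here is a minimal, self-contained Go sketch of the pattern (illustrative only; these names are not Teleport's actual RunWhileLocked or backend APIs) showing why a 300ms release deadline fails against a distant backend even though the locked work itself succeeded:

package main

import (
	"context"
	"fmt"
	"time"
)

// releaseLock stands in for the backend Delete that drops the ".locks/local" item.
// The 400ms sleep approximates one request/response round trip to a distant region.
func releaseLock(ctx context.Context) error {
	select {
	case <-time.After(400 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("dynamo error: RequestCanceled: %w", ctx.Err())
	}
}

// runWhileLocked mimics the shape of the problem: the release gets its own short
// deadline, independent of whether the locked work already succeeded.
func runWhileLocked(releaseTimeout time.Duration, fn func() error) error {
	// ... lock acquisition elided ...
	if err := fn(); err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), releaseTimeout)
	defer cancel()
	return releaseLock(ctx)
}

func main() {
	// 300ms (the current default): the release times out, so startup reports
	// failure even though fn succeeded.
	fmt.Println(runWhileLocked(300*time.Millisecond, func() error { return nil }))
	// A more generous timeout leaves headroom for distant regions.
	fmt.Println(runWhileLocked(time.Minute, func() error { return nil }))
}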

The error message output for this failure gives no indication at all of what the error is. I needed to run teleport with DEBUG=1 to get a stack trace of the failure, and apply my understanding of "context deadline exceeded" as a Go programmer to be able to figure out what was going on. A non-developer user probably wouldn't have a chance of diagnosing this from the reported error.

camscale added the bug and test-plan-problem labels on Sep 11, 2023
@espadolini (Contributor) commented:

The experience of using Teleport when a backend Delete takes longer than 300ms is going to be somewhat poor regardless, but maybe we could release the lock in the background? Or just give a generous enough time for the lock release to handle all wired connections across the planet, I suppose.

There are other uses of RunWhileLocked in the codebase, none of which alter ReleaseCtxTimeout from its 300ms default, so even if we fixed it for lib/auth.Init, the other ones are still going to error out (athena events processor, remote locks replacement, all SAML IdP and access list handling).
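
For illustration, the background-release idea would look roughly like this (a hypothetical sketch; runWhileLocked and releaseLock are stand-ins, not the real implementation, and it assumes the lock item has an expiry so a failed release cannot hold the lock forever):

package main

import (
	"context"
	"log"
	"time"
)

// releaseLock is a stand-in for the backend Delete of the lock item.
func releaseLock(ctx context.Context) error { return nil }

// runWhileLocked, background-release variant: the caller's result depends only
// on fn, so a slow Delete can no longer turn successful work into a fatal error.
func runWhileLocked(ctx context.Context, fn func(context.Context) error) error {
	// ... lock acquisition elided ...
	err := fn(ctx)
	go func() {
		// Detach the release from the caller's deadline and give it generous time.
		releaseCtx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		if rerr := releaseLock(releaseCtx); rerr != nil {
			log.Printf("background lock release failed; relying on lock expiry: %v", rerr)
		}
	}()
	return err
}

func main() {
	_ = runWhileLocked(context.Background(), func(context.Context) error { return nil })
	time.Sleep(50 * time.Millisecond) // let the demo's background release run before exit
}

The trade-off is that a later caller may have to wait out the lock expiry if the background release fails, whereas simply raising the release timeout keeps the current synchronous behaviour.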

zmb3 added the aws label on Sep 11, 2023
rosstimothy added a commit that referenced this issue Sep 11, 2023
The default release timeout is now a minute to allow slow/distant
connections to the backend to complete releasing the lock.

Closes #31690
github-merge-queue bot, github-actions bot, and hugoShaka pushed further commits referencing this issue (Sep 11–12, 2023 and Apr 18, 2024) with the same change.