Teleport 14 Test Plan #31122

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport.
These tests should be run on both a fresh installation of the version to be released
and an upgrade from the previous version of Teleport.

  • Adding nodes to a cluster @bl-nero

    • Adding Nodes via Valid Static Token
    • Adding Nodes via Valid Short-lived Tokens
    • Adding Nodes via Invalid Token Fails
    • Revoking Node Invitation
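
    A minimal sketch for these token tests, assuming a running auth server; the
    addresses and token values are placeholders:

      # Valid short-lived token: generate on the auth server, then join the node.
      tctl tokens add --type=node --ttl=5m
      teleport start --roles=node --token=<token-from-above> --auth-server=auth.example.com:3025

      # Invalid token: joining with a made-up token should fail.
      teleport start --roles=node --token=not-a-real-token --auth-server=auth.example.com:3025

      # Revoking an invitation.
      tctl tokens ls
      tctl tokens rm <token-from-above>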
  • Labels @bl-nero

    • Static Labels
    • Dynamic Labels
  • Trusted Clusters @bl-nero

    • Adding Trusted Cluster Valid Static Token
    • Adding Trusted Cluster Valid Short-lived Token
    • Adding Trusted Cluster Invalid Token
    • Removing Trusted Cluster
    • Changing role map of existing Trusted Cluster
  • RBAC @bl-nero

    Make sure that invalid and valid attempts are reflected in the audit log. Do this with both Teleport and Agentless nodes.

    • Successfully connect to node with correct role
    • Unsuccessfully connect to a node in a role restricting access by label
    • Unsuccessfully connect to a node in a role restricting access by invalid SSH login
    • Allow/deny role option: SSH agent forwarding
    • Allow/deny role option: Port forwarding
    • Allow/deny role option: SSH file copying
  • Verify that custom PAM environment variables are available as expected. @atburke

  • Users @bl-nero

    With every user combination, try to log in and sign up with an invalid
    second factor and an invalid password to see how the system reacts.

    WebAuthn in the release tsh binary is implemented using libfido2 for
    Linux/macOS. Ask for a statically built pre-release binary for realistic
    tests. (tsh fido2 diag should work in our binary.) WebAuthn in the Windows
    build is implemented using webauthn.dll. (tsh webauthn diag with a
    security key selected in the dialog should work.)

    Touch ID requires a signed tsh; ask for a signed pre-release binary so you
    may run the tests.

    Windows WebAuthn requires Windows 10 19H1 and a device capable of Windows
    Hello.

    • Adding Users Password Only

    • Adding Users OTP

    • Adding Users WebAuthn

    • Adding Users via platform authenticator

    • Managing MFA devices

      • Add an OTP device with tsh mfa add
      • Add a WebAuthn device with tsh mfa add
      • Add platform authenticator device with tsh mfa add
      • List MFA devices with tsh mfa ls
      • Remove an OTP device with tsh mfa rm
      • Remove a WebAuthn device with tsh mfa rm
      • Attempt removing the last MFA device on the user
        • with second_factor: on in auth_service, should fail
        • with second_factor: optional in auth_service, should succeed
    • Login Password Only

    • Login with MFA

      • Add an OTP, a WebAuthn and a Touch ID/Windows Hello device with tsh mfa add
      • Login via OTP
      • Login via WebAuthn
      • Login via platform authenticator
      • Login via WebAuthn using an U2F/CTAP1 device
    • Login OIDC

    • Login SAML

    • Login GitHub

    • Deleting Users

  • Backends @capnspacehook

    • Teleport runs with etcd
    • Teleport runs with DynamoDB
      • AWS integration tests are passing
    • Teleport runs with SQLite
    • Teleport runs with Firestore
      • GCP integration tests are passing
    • Teleport runs with Postgres
  • Session Recording @capnspacehook

    • Session recording can be disabled
    • Sessions can be recorded at the node
      • Sessions in remote clusters are recorded in remote clusters
    • Sessions can be recorded at the proxy
      • Sessions on remote clusters are recorded in the local cluster
      • With an OpenSSH server without a Teleport CA signed host certificate:
        • Host key checking enabled rejects connection
        • Host key checking disabled allows connection
  • Enhanced Session Recording @jakule

    • disk, command and network events are being logged.
    • Recorded events can be enforced by the enhanced_recording role option.
    • Enhanced session recording can be enabled on CentOS 7 with kernel 5.8+.
  • Restricted Session @jakule

    • Network requests are allowed when a policy allows them.
    • Network requests are blocked when a policy denies them.
  • Auditd @jakule

    • When auditd is enabled, audit events are recorded —

      // EventType represents an auditd message type.
      // Values come from https://github.com/torvalds/linux/blob/08145b087e4481458f6075f3af58021a3cf8a940/include/uapi/linux/audit.h#L54
      type EventType int

      const (
          AuditGet       EventType = 1000
          AuditUserEnd   EventType = 1106
          AuditUserLogin EventType = 1112
          AuditUserErr   EventType = 1109
      )
      • SSH session start — user login event
      • SSH session end
      • SSH Login failures — SSH auth error
      • SSH Login failures — unknown OS user
      • Session ID is correct (only true when Teleport runs as a systemd service)
      • Teleport user is recorded as an auditd event field
  • Audit Log @Joerger

    • Audit log with dynamodb

      • AWS integration tests are passing
    • Audit log with Firestore

      • GCP integration tests are passing
    • Failed login attempts are recorded

    • Interactive sessions have the correct Server ID

      • server_id is the ID of the node in "session_recording: node" mode
      • server_id is the ID of the node in "session_recording: proxy" mode
      • forwarded_by is the ID of the proxy in "session_recording: proxy" mode

      Node/Proxy ID may be found at /var/lib/teleport/host_uuid on the
      corresponding machine.

      Node IDs may also be queried via tctl nodes ls.

    • Exec commands are recorded

    • scp commands are recorded

    • Subsystem results are recorded

      Subsystem testing may be achieved using both
      Recording Proxy mode
      and
      OpenSSH integration.

      Assuming the proxy is proxy.example.com:3023 and node1 is a node running
      OpenSSH/sshd, you may use the following command to trigger a subsystem audit
      log:

      sftp -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %[email protected] -s proxy:%h:%p" root@node1
  • Interact with a cluster using tsh @lxea

    These commands should ideally be tested in both recording and non-recording modes, as they are implemented in different ways.

    • tsh ssh <regular-node>
    • tsh ssh <node-remote-cluster>
    • tsh ssh <agentless-node>
    • tsh ssh <agentless-node-remote-cluster>
    • tsh ssh -A <regular-node>
    • tsh ssh -A <node-remote-cluster>
    • tsh ssh -A <agentless-node>
    • tsh ssh -A <agentless-node-remote-cluster>
    • tsh ssh <regular-node> ls
    • tsh ssh <node-remote-cluster> ls
    • tsh ssh <agentless-node> ls
    • tsh ssh <agentless-node-remote-cluster> ls
    • tsh join <regular-node>
    • tsh join <node-remote-cluster>
    • tsh play <regular-node>
    • tsh play <node-remote-cluster>
    • tsh play <agentless-node>
    • tsh play <agentless-node-remote-cluster>
    • tsh scp <regular-node>
    • tsh scp <node-remote-cluster>
    • tsh scp <agentless-node>
    • tsh scp <agentless-node-remote-cluster>
    • tsh ssh -L <regular-node>
    • tsh ssh -L <node-remote-cluster>
    • tsh ssh -L <agentless-node>
    • tsh ssh -L <agentless-node-remote-cluster>
    • tsh ls
    • tsh clusters
  • Interact with a cluster using ssh @capnspacehook
    Make sure to test both recording and regular proxy modes.

    • ssh <regular-node>
    • ssh <node-remote-cluster>
    • ssh <agentless-node>
    • ssh <agentless-node-remote-cluster>
    • ssh -A <regular-node>
    • ssh -A <node-remote-cluster>
    • ssh -A <agentless-node>
    • ssh -A <agentless-node-remote-cluster>
    • ssh <regular-node> ls
    • ssh <node-remote-cluster> ls
    • ssh <agentless-node> ls
    • ssh <agentless-node-remote-cluster> ls
    • scp <regular-node>
    • scp <node-remote-cluster>
    • scp <agentless-node>
    • scp <agentless-node-remote-cluster>
    • ssh -L <regular-node>
    • ssh -L <node-remote-cluster>
    • ssh -L <agentless-node>
    • ssh -L <agentless-node-remote-cluster>
  • Verify proxy jump functionality @Joerger
    Log into the leaf cluster via the root cluster, shut down the root proxy, and verify that proxy jump works.

    • tls routing disabled
      • tsh ssh -J <leaf.proxy.example.com:3023>
      • ssh -J <leaf.proxy.example.com:3023>
    • tls routing enabled
      • tsh ssh -J <leaf.proxy.example.com:3080>
      • tsh proxy ssh -J <leaf.proxy.example.com:3080>
  • Interact with a cluster using the Web UI @bl-nero

    • Connect to a Teleport node
    • Connect to an Agentless node
    • Check agent forwarding is correct based on role and proxy mode.
  • tsh CA loading @atburke

    Create a trusted cluster pair with a node in the leaf cluster. Log into the root cluster.

    • load_all_cas on the root auth server is false (default) -
      tsh ssh leaf.node.example.com results in access denied.
    • load_all_cas on the root auth server is true - tsh ssh leaf.node.example.com
      succeeds.
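
    A sketch of the relevant root-cluster setting, assuming the field sits
    under auth_service in teleport.yaml as the checklist names it:

      auth_service:
        load_all_cas: true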
  • X11 Forwarding @Joerger

    • Install xeyes and xclip:
      • Linux: apt install x11-apps xclip
      • Mac: Install and launch XQuartz which comes with xeyes. Then brew install xclip.
    • Enable X11 forwarding for a Node running as root: ssh_service.x11.enabled = yes
    • Successfully X11 forward as both root and non-root user
      • tsh ssh -X user@node xeyes
      • tsh ssh -X root@node xeyes
    • Test untrusted vs trusted forwarding
      • tsh ssh -Y server01 "echo Hello World | xclip -sel c && xclip -sel c -o" should print "Hello World"
      • tsh ssh -X server01 "echo Hello World | xclip -sel c && xclip -sel c -o" should fail with "BadAccess" X error
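
    A minimal teleport.yaml sketch for the node under test, assuming the
    nested form of the ssh_service.x11.enabled setting named above:

      ssh_service:
        enabled: yes
        x11:
          enabled: yes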

User accounting @atburke

  • Verify that active interactive sessions are tracked in /var/run/utmp on Linux.
  • Verify that interactive sessions are logged in /var/log/wtmp on Linux.
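
  A quick spot-check with standard Linux tools (run while a Teleport
  interactive session is open, then again after it closes):

    who    # reads /var/run/utmp; should list the active session's login
    last   # reads /var/log/wtmp; should show the session's login record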

Combinations @marcoandredinis

For some manual testing, many combinations need to be tested. For example,
interactive sessions have the combinations below (12 applicable; the rest are
marked N/A).

  • N/A Connect to an OpenSSH node in a local cluster using OpenSSH.
  • N/A Connect to an OpenSSH node in a local cluster using Teleport.
  • N/A Connect to an OpenSSH node in a local cluster using the Web UI.
  • Connect to an Agentless node in a local cluster using OpenSSH.
  • Connect to an Agentless node in a local cluster using Teleport.
  • Connect to an Agentless node in a local cluster using the Web UI.
  • Connect to a Teleport node in a local cluster using OpenSSH.
  • Connect to a Teleport node in a local cluster using Teleport.
  • Connect to a Teleport node in a local cluster using the Web UI.
  • N/A Connect to an OpenSSH node in a remote cluster using OpenSSH.
  • N/A Connect to an OpenSSH node in a remote cluster using Teleport.
  • N/A Connect to an OpenSSH node in a remote cluster using the Web UI.
  • Connect to an Agentless node in a remote cluster using OpenSSH (see #31281: Access agentless nodes using hostname with ssh as client).
  • Connect to an Agentless node in a remote cluster using Teleport.
  • Connect to an Agentless node in a remote cluster using the Web UI.
  • Connect to a Teleport node in a remote cluster using OpenSSH.
  • Connect to a Teleport node in a remote cluster using Teleport.
  • Connect to a Teleport node in a remote cluster using the Web UI.

Teleport with EKS/GKE @tigrato

  • Deploy Teleport on a single EKS cluster
  • Deploy Teleport on two EKS clusters and connect them via trusted cluster feature
  • Deploy Teleport Proxy outside GKE cluster fronting connections to it (use this script to generate a kubeconfig)
  • Deploy Teleport Proxy outside EKS cluster fronting connections to it (use this script to generate a kubeconfig)

Teleport with multiple Kubernetes clusters @AntonAM

Note: you can use GKE, EKS, or minikube to run Kubernetes clusters.
The only caveat is minikube: it's not reachable publicly, so don't run a proxy there.

  • Deploy combo auth/proxy/kubernetes_service outside a Kubernetes cluster, using a kubeconfig
    • Login with tsh login, check that tsh kube ls has your cluster
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh
    • Verify that the audit log recorded the above request and session
  • Deploy combo auth/proxy/kubernetes_service inside a Kubernetes cluster
    • Login with tsh login, check that tsh kube ls has your cluster
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh
    • Verify that the audit log recorded the above request and session
  • Deploy combo auth/proxy_service outside the Kubernetes cluster and kubernetes_service inside of a Kubernetes cluster, connected over a reverse tunnel
    • Login with tsh login, check that tsh kube ls has your cluster
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh
    • Verify that the audit log recorded the above request and session
  • Deploy a second kubernetes_service inside another Kubernetes cluster, connected over a reverse tunnel
    • Login with tsh login, check that tsh kube ls has both clusters
    • Switch to a second cluster using tsh kube login
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh on the new cluster
    • Verify that the audit log recorded the above request and session
  • Deploy combo auth/proxy/kubernetes_service outside a Kubernetes cluster, using a kubeconfig with multiple clusters in it
    • Login with tsh login, check that tsh kube ls has all clusters
  • Test Kubernetes screen in the web UI (tab is located on left side nav on dashboard):
    • Verify that all kubes registered are shown with correct name and labels
    • Verify that clicking on a row's Connect button renders a dialog with manual instructions whose Step 2 login value matches the row's Name column
    • Verify searching for name or labels in the search bar works
    • Verify you can sort by the Name column
  • Test Kubernetes exec via WebSockets - client

Kubernetes auto-discovery @tigrato

  • Test Kubernetes auto-discovery:
    • Verify that Azure AKS clusters are discovered and enrolled for different Azure Auth configs:
      • Local Accounts only
      • Azure AD
      • Azure RBAC
    • Verify that AWS EKS clusters are discovered and enrolled
    • Verify that GCP GKE clusters are discovered and enrolled
  • Verify dynamic registration.
    • Can register a new Kubernetes cluster using tctl create.
    • Can update registered Kubernetes cluster using tctl create -f.
    • Can delete registered Kubernetes cluster using tctl rm.
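
  A hedged sketch of the dynamic registration flow (kube_cluster v3 resource;
  the name and labels are placeholders, and spec fields depend on how the
  cluster is reached, so they are left empty here):

    cat > kube_cluster.yaml <<EOF
    kind: kube_cluster
    version: v3
    metadata:
      name: example-cluster
      labels:
        env: dev
    spec: {}
    EOF

    tctl create kube_cluster.yaml      # register
    tctl create -f kube_cluster.yaml   # update in place
    tctl rm kube_cluster/example-cluster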

Kubernetes Secret Storage @AntonAM

  • Kubernetes Secret storage for Agent's Identity
    • Install Teleport agent with a short-lived token
      • Validate that Teleport is installed as a Kubernetes StatefulSet
      • Restart the agent after token TTL expires to see if it reuses the same identity.
    • Force cluster CA rotation

Kubernetes RBAC @AntonAM

  • Verify the following scenarios for kubernetes_resources:
    • {"kind":"pod","name":"*","namespace":"*"} - must allow access to every pod.
    • {"kind":"pod","name":"<somename>","namespace":"*"} - must allow access to pod <somename> in every namespace.
    • {"kind":"pod","name":"*","namespace":"<somenamespace>"} - must allow access to any pod in <somenamespace> namespace.
    • Verify support for * wildcards (<some-name>-*) and regexes for the name and namespace fields.
    • Verify support for deleting a pods collection (must use the Go client).
  • Verify scenarios with multiple roles defining kubernetes_resources:
    • Validate that the returned list of pods is the union of the pods allowed by every role.
    • Validate that access to other pods is denied by RBAC.
    • Validate that the Kubernetes Groups/Users are correctly selected depending on the role that applies to the pod.
      • Test with a kubernetes_groups that denies exec into a pod
  • Verify the following scenarios for Resource Access Requests to Pods:
    • Create a valid resource access request and validate that access to other pods is denied.
    • Validate that creating a resource access request with Kubernetes resources denied by search_as_roles is not allowed.
  • Verify kind: namespace scenarios for kubernetes_resources:
    • Validate that the user can list namespaces.
    • Validate that the user has access to all resources within that namespace - including custom resources.
    • Validate that the user cannot list resources from other namespaces and is jailed - including custom resources.
    • Test with a kubernetes_groups that denies exec into a pod from another namespace
  • Verify the following scenarios for Resource Access Requests to Namespaces:
    • Create a valid resource access request and validate that access to other namespaces is denied.
    • Validate that the user can access all resources within the namespace.
  • Verify that kubernetes_resources is capable of restricting verbs:
    • Restrict access to read-only verbs and try to list, update, and delete a resource
    • Ensure that roles created prior to v7 are converted correctly and no compatibility is lost
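
  A hedged role sketch for the verbs scenario above (v7 role schema; the
  group, labels, and namespace pattern are placeholders):

    kind: role
    version: v7
    metadata:
      name: kube-read-only
    spec:
      allow:
        kubernetes_groups: ["view"]
        kubernetes_labels:
          '*': '*'
        kubernetes_resources:
          - kind: pod
            name: '*'
            namespace: 'dev-*'
            verbs: ['get', 'list', 'watch']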

Teleport with FIPS mode @codingllama

  • Perform trusted clusters, Web, and SSH sanity checks with all Teleport components deployed in FIPS mode.

ACME @bl-nero

  • Teleport can fetch a TLS certificate automatically using the ACME protocol.

Migrations @bl-nero

  • Migrate trusted clusters from 2.4.0 to 2.5.0
    • Migrate auth server on main cluster, then rest of the servers on main cluster
      SSH should work for both main and old clusters
    • Migrate auth server on remote cluster, then rest of the remote cluster
      SSH should work

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %[email protected] -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %[email protected] -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %[email protected] -s proxy:%h:%[email protected]" \
  node.foo.com

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport with SSO Providers

  • G Suite install instructions work @camscale
    • G Suite Screenshots are up-to-date
  • Azure Active Directory (AD) install instructions work @gabrielcorado
    • Azure Active Directory (AD) Screenshots are up-to-date
  • Active Directory (ADFS) install instructions work @gabrielcorado
    • Active Directory (ADFS) Screenshots are up-to-date
  • Okta install instructions work @mdwn
    • Okta Screenshots are up-to-date
  • OneLogin install instructions work @hugoShaka
    • OneLogin Screenshots are up-to-date
  • GitLab install instructions work @capnspacehook
    • GitLab Screenshots are up-to-date
  • OIDC install instructions work @camscale
    • OIDC Screenshots are up-to-date
  • All providers with guides in docs are covered in this test plan
  • Login Rules work to transform traits from SSO provider @mdwn
  • SAML IdP guide instructions work @mdwn
    • SAML IdP screenshots are up to date

GitHub External SSO @capnspacehook

  • Teleport OSS
    • GitHub organization without external SSO succeeds
    • GitHub organization with external SSO fails
  • Teleport Enterprise
    • GitHub organization without external SSO succeeds
    • GitHub organization with external SSO succeeds

tctl sso family of commands @tcsc

For help with setting up SSO connectors, check out the Quick GitHub/SAML/OIDC Setup Tips.

tctl sso configure helps to construct a valid connector definition:

  • tctl sso configure github ... creates valid connector definitions
  • tctl sso configure oidc ... creates valid connector definitions
  • tctl sso configure saml ... creates valid connector definitions

tctl sso test tests a provided connector definition, which can be loaded from
a file or piped in from tctl sso configure or tctl get --with-secrets. Valid
connectors are accepted; invalid ones are rejected with sensible error messages.

  • Connectors can be tested with tctl sso test.
    • GitHub
    • SAML
    • OIDC
      • Google Workspace
      • Non-Google IdP
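
For example, a configure-then-test pipeline (flags elided, as in the commands
above):

  tctl sso configure github ... | tctl sso test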

Teleport Plugins @EdwardDowling

  • Test receiving a message via Teleport Slackbot
  • Test receiving a new Jira Ticket via Teleport Jira

AWS Node Joining @atburke

Docs

  • On EC2 instance with ec2:DescribeInstances permissions for local account:
    TELEPORT_TEST_EC2=1 go test ./integration -run TestEC2NodeJoin
  • On EC2 instance with any attached role:
    TELEPORT_TEST_EC2=1 go test ./integration -run TestIAMNodeJoin
  • EC2 Join method in IoT mode with node and auth in different AWS accounts
  • IAM Join method in IoT mode with node and auth in different AWS accounts

Kubernetes Node Joining @hugoShaka

  • Join a Teleport node running in the same Kubernetes cluster via a Kubernetes ProvisionToken
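
A hedged sketch of the Kubernetes ProvisionToken (the token name and the
allowed service account are placeholders):

  kind: token
  version: v2
  metadata:
    name: kube-join-token
  spec:
    roles: ["Node"]
    join_method: kubernetes
    kubernetes:
      allow:
        - service_account: "teleport:teleport-agent"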

Azure Node Joining @tcsc

Docs

  • Join a Teleport node running in an Azure VM

GCP Node Joining @tcsc

Docs

  • Join a Teleport node running in a GCP VM.

Cloud Labels @tcsc

  • Create an EC2 instance with tags in instance metadata enabled
    and with tag foo: bar. Verify that a node running on the instance has label
    aws/foo=bar.
  • Create an Azure VM with tag foo: bar. Verify that a node running on the
    instance has label azure/foo=bar.

Passwordless @codingllama

This feature has additional build requirements, so it should be tested with a pre-release build from Drone (e.g. https://get.gravitational.com/teleport-v10.0.0-alpha.2-linux-amd64-bin.tar.gz).

This section complements "Users -> Managing MFA devices". tsh binaries for
each operating system (Linux, macOS, and Windows) must be tested separately for
FIDO2 items.

  • Diagnostics

    Commands should pass all tests.

    • tsh fido2 diag (macOS/Linux)
    • tsh touchid diag (macOS only)
    • tsh webauthnwin diag (Windows only)
  • Registration

    • Register a passwordless FIDO2 key (tsh mfa add, choose WEBAUTHN and
      passwordless)
      • macOS/Linux
      • Windows
    • Register a platform authenticator
      • Touch ID credential (tsh mfa add, choose TOUCHID)
      • Windows Hello credential (tsh mfa add, choose WEBAUTHN and
        passwordless)
  • Login

    • Passwordless login using FIDO2 (tsh login --auth=passwordless)
      • macOS/Linux
      • Windows
    • Passwordless login using platform authenticator (tsh login --auth=passwordless)
      • Touch ID
      • Windows Hello
    • tsh login --auth=passwordless --mfa-mode=cross-platform uses FIDO2
      • macOS/Linux
      • Windows
    • tsh login --auth=passwordless --mfa-mode=platform uses platform authenticator
      • Touch ID
      • Windows Hello
    • tsh login --auth=passwordless --mfa-mode=auto prefers platform authenticator
      • Touch ID
      • Windows Hello
    • Exercise the credential picker (register credentials for multiple users
      on the same device)
      • FIDO2 macOS/Linux
      • Touch ID
      • Windows
    • Passwordless disable switch works
      (auth_service.authentication.passwordless = false)
    • Cluster in passwordless mode defaults to passwordless
      (auth_service.authentication.connector_name = passwordless)
    • Cluster in passwordless mode allows MFA login
      (tsh login --auth=local)
  • Touch ID support commands

    • tsh touchid ls works
    • tsh touchid rm works (careful, may lock you out!)
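
A hedged sketch of the cluster settings referenced above, assuming the fields
sit under auth_service.authentication as the checklist spells them:

  auth_service:
    authentication:
      type: local
      second_factor: "on"
      webauthn:
        rp_id: example.com
      passwordless: true            # disable switch: set to false
      connector_name: passwordless  # passwordless-by-default mode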

Device Trust @codingllama

Device Trust requires Teleport Enterprise.

This feature has additional build requirements, so it should be tested with a
pre-release build from Drone (e.g.
https://get.gravitational.com/teleport-v10.0.0-alpha.2-linux-amd64-bin.tar.gz).

Client-side enrollment requires a signed tsh for macOS; make sure to use the
tsh binary from tsh.app.

A simple formula for testing device authorization is:

# Before enrollment.
# Replace with other kinds of access, as appropriate (db, kube, etc)
tsh ssh node-that-requires-device-trust
> ERROR: ssh: rejected: administratively prohibited (unauthorized device)

# Register the device.
# Get the serial number from "Apple -> About This Mac".
tctl devices add --os=macos --asset-tag=<SERIAL_NUMBER> --enroll

# Enroll the device.
tsh device enroll --token=<TOKEN_FROM_COMMAND_ABOVE>
tsh logout; tsh login

# After enrollment
tsh ssh node-that-requires-device-trust
> $

  • Inventory management

    • Add device (tctl devices add)
    • Add device and create enrollment token (tctl devices add --enroll)
    • List devices (tctl devices ls)
    • Remove device using device ID (tctl devices rm)
    • Remove device using asset tag (tctl devices rm)
    • Create enrollment token using device ID (tctl devices enroll)
    • Create enrollment token using asset tag (tctl devices enroll)
  • Device enrollment

    • Enroll device on macOS (tsh device enroll)

    • Enroll device on Windows (tsh device enroll)

    • Verify device extensions on TLS certificate

      Note that different accesses have different certificates (Database, Kube,
      etc).

      $ openssl x509 -noout -in ~/.tsh/keys/zarquon/llama-x509.pem -nameopt sep_multiline -subject | grep 1.3.9999.3
      > 1.3.9999.3.1=6e60b9fd-1e3e-473d-b148-27b4f158c2a7
      > 1.3.9999.3.2=AAAAAAAAAAAA
      > 1.3.9999.3.3=661c9340-81b0-4a1a-a671-7b1304d28600
    • Verify device extensions on SSH certificate

      ssh-keygen -L -f ~/.tsh/keys/zarquon/llama-ssh/zarquon-cert.pub | grep teleport-device-
      teleport-device-asset-tag ...
      teleport-device-credential-id ...
      teleport-device-id ...
  • Device authorization

    • device_trust.mode other than "off" or "" is not allowed (OSS)

    • device_trust.mode="off" doesn't impede access (Enterprise and OSS)

    • device_trust.mode="optional" doesn't impede access, but issues device
      extensions on login

    • device_trust.mode="required" enforces enrolled devices

    • device_trust.mode="required" is enforced by processes, and not only by
      Auth APIs

      Testing this requires issuing a certificate without device extensions
      (mode="off"), then changing the cluster configuration to mode="required" and
      attempting to access a process directly, without a login attempt.

    • Role-based authz enforces enrolled devices
      (device_trust.mode="off" or "optional",
      role.spec.options.device_trust_mode="required")

    • Device authorization works correctly for both require_session_mfa=false
      and require_session_mfa=true

    • Device authorization applies to SSH access (all items above)

    • Device authorization applies to Trusted Clusters (root with
      mode="optional" and leaf with mode="required")

    • Device authorization applies to Database access (all items above)

    • Device authorization applies to Kubernetes access (all items above)

    • Device authorization does not apply to App access
      (both cluster-wide and role)

    • Device authorization does not apply to Windows Desktop access
      (both cluster-wide and role) (@ibeckermayer)

  • Device audit (see lib/events/codes.go)

    • Inventory management actions issue events (success only)
    • Device enrollment issues device event (any outcomes)
    • Device authorization issues device event (any outcomes)
    • Events with UserMetadata contain TrustedDevice
      data (for certificates with device extensions)
  • Binary support

    • Non-signed and/or non-notarized tsh for macOS gives a sane error
      message for tsh device enroll attempts.
  • Device support commands

    • tsh device collect (macOS)
    • tsh device asset-tag (macOS)
    • tsh device collect (Windows)
    • tsh device asset-tag (Windows)

Hardware Key Support @jakule

Hardware Key Support is an Enterprise feature and is not available for OSS.

You will need a YubiKey 4.3+ to test this feature.

This feature has additional build requirements, so it should be tested with a pre-release build from Drone (e.g. https://get.gravitational.com/teleport-ent-v11.0.0-alpha.2-linux-amd64-bin.tar.gz).

Server Access

These tests should be carried out sequentially. tsh tests should be carried out on Linux, macOS, and Windows.

  1. tsh login as user with WebAuthn login and no hardware key requirement.
  2. Request a role with role.role_options.require_session_mfa: hardware_key - tsh login --request-roles=hardware_key_required
     • Assuming the role should force automatic re-login with the YubiKey
     • tsh ssh
       • Requires the YubiKey to be connected for re-login
       • Prompts for per-session MFA
  3. Request a role with role.role_options.require_session_mfa: hardware_key_touch - tsh login --request-roles=hardware_key_touch_required
     • Assuming the role should force automatic re-login with the YubiKey
       • Prompts for touch if not cached (last touch within 15 seconds)
     • tsh ssh
       • Requires the YubiKey to be connected for re-login
       • Prompts for touch if not cached
  4. tsh logout and tsh login as the user with no hardware key requirement.
  5. Upgrade auth settings to auth_service.authentication.require_session_mfa: hardware_key
     • Using the existing login session (tsh ls) should force automatic re-login with the YubiKey
     • tsh ssh
       • Requires the YubiKey to be connected for re-login
       • Prompts for per-session MFA
  6. Upgrade auth settings to auth_service.authentication.require_session_mfa: hardware_key_touch
     • Using the existing login session (tsh ls) should force automatic re-login with the YubiKey
       • Prompts for touch if not cached
     • tsh ssh
       • Requires the YubiKey to be connected for re-login
       • Prompts for touch if not cached
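
A hedged sketch of the requestable role used in steps 2-3, assuming the
option lives at spec.options.require_session_mfa, matching the setting named
above:

  kind: role
  version: v7
  metadata:
    name: hardware_key_required
  spec:
    options:
      require_session_mfa: hardware_key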

Other

Set auth_service.authentication.require_session_mfa: hardware_key_touch in your cluster auth settings.

  • Database Access: tsh proxy db --tunnel

HSM Support @tobiaszheller

Docs

  • YubiHSM2 Support (@rosstimothy)
    • Make sure docs/links are up to date
    • New cluster with YubiHSM2 CA works
    • Migrating a software cluster to YubiHSM2 works
    • CA rotation works
  • AWS CloudHSM Support
    • Make sure docs/links are up to date
    • New cluster with CloudHSM CA works
    • Migrating a software cluster to CloudHSM works
    • CA rotation works
  • GCP KMS Support
    • Make sure docs/links are up to date
    • New cluster with GCP KMS CA works
    • Migrating a software cluster to GCP KMS works
    • CA rotation works

Moderated session @tobiaszheller

Using tsh, join an SSH session as two moderators (two separate terminals; the role requires one moderator).

  • t in any terminal should terminate the session for all participants.

Performance @rosstimothy @fspmarshall @espadolini

Scaling Test

Scale up the number of nodes/clusters a few times for each configuration below.

  1. Verify that there are no memory/goroutine/file descriptor leaks
  2. Compare the baseline metrics with the previous release to determine if resource usage has increased
  3. Restart all Auth instances and verify that all nodes/clusters reconnect

Perform reverse tunnel node scaling tests for all backend configurations:

  • etcd - 10k
  • DynamoDB - 10k
  • Firestore - 10k
  • Postgres - 10k

Perform the following additional scaling tests on DynamoDB:

  • 10k direct dial nodes.
  • 500 trusted clusters.

Soak Test

Run a 30-minute soak test directly against direct and tunnel nodes
and via label-based matching. Tests should be run against a Cloud
tenant.

tsh bench ssh --duration=30m user@direct-dial-node ls
tsh bench ssh --duration=30m user@reverse-tunnel-node ls
tsh bench ssh --duration=30m user@foo=bar ls
tsh bench ssh --duration=30m --random user@foo ls

Concurrent Session Test

  • Cluster with 1k reverse tunnel nodes

Run a concurrent session test that will spawn 5 interactive sessions per node in the cluster:

tsh bench web sessions --max=5000 user ls

  • Verify that all 5000 sessions are able to be established.
  • Verify that tsh and the web UI are still functional.

Robustness

  • Connectivity Issues:
    • Verify that a lack of connectivity to Auth does not prevent access, with
      an already issued certificate, to resources that do not require a
      moderated session while in async recording mode.
    • Verify that a lack of connectivity to Auth prevents access, even with an
      already issued certificate, to resources that require a moderated
      session while in async recording mode.
    • Verify that an open session is not terminated when all Auth instances
      are restarted.

Teleport with Cloud Providers

AWS @camscale

GCP @tigrato

  • Deploy Teleport to GCP, using Cloud Firestore & Cloud Storage.
  • Deploy Teleport to GKE (Google Kubernetes Engine).
  • Deploy Teleport Enterprise to GCP.

IBM @hugoShaka

  • Deploy Teleport to IBM Cloud, using IBM Cloud Databases for etcd & IBM Cloud Object Storage.
  • Deploy Teleport to IBM Cloud Kubernetes.
  • Deploy Teleport Enterprise to IBM Cloud.
  • Deploy Teleport to IBM Cloud, using IBM Cloud Databases for Postgres.

Application Access @mdwn

  • Run an application within local cluster.
    • Verify the debug application debug_app: true works.
    • Verify an application can be configured with command line flags.
    • Verify an application can be configured from file configuration.
    • Verify that applications are available at auto-generated addresses name.rootProxyPublicAddr as well as at publicAddr.
  • Run an application within a trusted cluster.
    • Verify that applications are available at auto-generated addresses name.rootProxyPublicAddr.
  • Verify Audit Records.
    • app.session.start and app.session.chunk events are created in the Audit Log.
    • app.session.chunk points to a 5 minute session archive with multiple app.session.request events inside.
    • tsh play <chunk-id> can fetch and print a session chunk archive.
  • Verify JWT using verify-jwt.go.
  • Verify RBAC.
  • Verify CLI access with tsh apps login.
  • Verify AWS console access.
    • Can log into AWS web console through the web UI.
    • Can interact with AWS using tsh commands.
      • tsh aws
      • tsh aws --endpoint-url (this is a hidden flag)
  • Verify Azure CLI access with tsh apps login.
    • Can interact with Azure using tsh az commands.
    • Can interact with Azure using a combination of tsh proxy az and az commands.
  • Verify GCP CLI access with tsh apps login.
    • Can interact with GCP using tsh gcloud commands.
    • Can interact with Google Cloud Storage using tsh gsutil commands.
    • Can interact with GCP/GCS using a combination of tsh proxy gcloud and gcloud/gsutil commands.
  • Verify dynamic registration (see the sketch after this list).
    • Can register a new app using tctl create.
    • Can update registered app using tctl create -f.
    • Can delete registered app using tctl rm.
  • Test Applications screen in the web UI (tab is located on left side nav on dashboard):
    • Verify that all apps registered are shown
    • Verify that clicking on the app icon takes you to another tab
    • Verify Add Application links to documentation.
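
A hedged sketch of the dynamic app registration flow referenced above (v3 app
resource; the name, labels, and URI are placeholders):

  cat > app.yaml <<EOF
  kind: app
  version: v3
  metadata:
    name: example-app
    labels:
      env: dev
  spec:
    uri: http://localhost:8080
  EOF

  tctl create app.yaml      # register
  tctl create -f app.yaml   # update
  tctl rm app/example-app   # delete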

Database Access @smallinsky + team

  • Connect to a database within a local cluster.
  • Connect to a database within a remote cluster via a trusted cluster.
  • Verify auto user provisioning. @Tener
    • Self-hosted Postgres.
    • AWS RDS Postgres.
  • Verify audit events. @GavinFrazar
    • db.session.start is emitted when you connect.
    • db.session.end is emitted when you disconnect.
    • db.session.query is emitted when you execute a SQL query.
  • Verify RBAC. @gabrielcorado
    • tsh db ls shows only databases matching role's db_labels.
    • Can only connect as users from db_users.
    • (Postgres only) Can only connect to databases from db_names.
      • db.session.start is emitted when connection attempt is denied.
    • (MongoDB only) Can only execute commands in databases from db_names.
      • db.session.query is emitted when command fails due to permissions.
    • Can configure per-session MFA.
      • MFA tap is required on each tsh db connect.
  • Verify dynamic registration. @GavinFrazar (see the sketch at the end of this section)
    • Can register a new database using tctl create.
    • Can update registered database using tctl create -f.
    • Can delete registered database using tctl rm.
  • Verify discovery.
    Please configure discovery in Discovery Service instead of Database Service.
    • AWS @greedy52
      • Can detect and register RDS instances.
        • Can detect and register RDS instances in an external AWS account when assume_role_arn and external_id are set.
      • Can detect and register RDS proxies, and their custom endpoints.
      • Can detect and register Aurora clusters, and their reader and custom endpoints.
      • Can detect and register Redshift clusters.
      • Can detect and register Redshift serverless workgroups, and their VPC endpoints.
      • Can detect and register ElastiCache Redis clusters.
      • Can detect and register MemoryDB clusters.
      • Can detect and register OpenSearch domains.
    • Azure @GavinFrazar
      • Can detect and register MySQL and Postgres single-server instances.
      • Can detect and register MySQL and Postgres flexible-server instances.
      • Can detect and register Azure Cache for Redis servers.
      • Can detect and register Azure SQL Servers and Azure SQL Managed Instances.
  • Verify Teleport managed users (password rotation, auto 'auth' on connection, etc.). @greedy52
    • Can detect and manage ElastiCache users
    • Can detect and manage MemoryDB users
  • Test Databases screen in the web UI (tab is located on left side nav on dashboard): @GavinFrazar
    • Verify that all dbs registered are shown with correct name, description, type, and labels
    • Verify that clicking on a row's Connect button renders a dialog with manual instructions whose Step 2 login value matches the row's Name column
    • Verify searching for all columns in the search bar works
    • Verify you can sort by all columns except labels
  • Other @smallinsky
    • MySQL server version reported by Teleport is correct.
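
A hedged sketch of the dynamic database registration flow referenced above
(v3 db resource; the name, protocol, and URI are placeholders):

  cat > db.yaml <<EOF
  kind: db
  version: v3
  metadata:
    name: example-postgres
    labels:
      env: dev
  spec:
    protocol: postgres
    uri: localhost:5432
  EOF

  tctl create db.yaml      # register
  tctl create -f db.yaml   # update
  tctl rm db/example-postgres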

TLS Routing @smallinsky

  • Verify that the teleport proxy v2 configuration starts only a single
    listener for the proxy service, in contrast with the v1 configuration.
    Given this configuration: @smallinsky

    version: v2
    proxy_service:
      enabled: "yes"
      public_addr: ['root.example.com']
      web_listen_addr: 0.0.0.0:3080

    There should be a total of three listeners, with only *:3080 for the proxy
    service. Given the configuration above, 3022 and 3025 will be opened for
    other services.

    lsof -i -P | grep teleport | grep LISTEN
      teleport  ...  TCP *:3022 (LISTEN)
      teleport  ...  TCP *:3025 (LISTEN)
      teleport  ...  TCP *:3080 (LISTEN) # <-- proxy service

    In contrast, the same configuration with version v1 should open the
    additional ports 3023 and 3024.

    lsof -i -P | grep teleport | grep LISTEN
      teleport  ...  TCP *:3022 (LISTEN)
      teleport  ...  TCP *:3025 (LISTEN)
      teleport  ...  TCP *:3023 (LISTEN) # <-- extra proxy service port
      teleport  ...  TCP *:3024 (LISTEN) # <-- extra proxy service port
      teleport  ...  TCP *:3080 (LISTEN) # <-- proxy service
  • Run Teleport Proxy in multiplex mode (auth_service.proxy_listener_mode: "multiplex")
    • Trusted cluster
      • Set up trusted clusters using the single-port setup (web_proxy_addr == tunnel_addr)

        kind: trusted_cluster
        spec:
          ...
          web_proxy_addr: root.example.com:443
          tunnel_addr: root.example.com:443
          ...
  • Database Access
  • Application Access @smallinsky
    • Verify app access through proxy running in multiplex mode
  • SSH Access @gabrielcorado
    • Connect to an OpenSSH server through a local ssh proxy: ssh -o "ForwardAgent yes" -o "ProxyCommand tsh proxy ssh" [email protected]
    • Connect to an OpenSSH server on a leaf cluster through a local ssh proxy: ssh -o "ForwardAgent yes" -o "ProxyCommand tsh proxy ssh --user=%r --cluster=leaf-cluster %h:%p" [email protected]
    • Verify tsh ssh access through proxy running in multiplex mode
  • Kubernetes access: @smallinsky
    • Verify kubernetes access through proxy running in multiplex mode
  • Teleport Proxy single port multiplex mode behind L7 load balancer @Tener
    • Agent can join through Proxy and maintain reverse tunnel
    • tsh login and tctl
    • SSH Access: tsh ssh and tsh config
    • Database Access: tsh proxy db and tsh db connect
    • Application Access: tsh proxy app and tsh aws
    • Kubernetes Access: tsh proxy kube

Assist

Assist is not supported by tsh; the Web UI is the only way to use it.
The Assist test plan is in the core section instead of the Web UI section, as most functionality is implemented in the core.

  • Configuration @xacrimon
    • Assist is disabled by default (OSS, Enterprise)
    • Assist can be enabled in the configuration file.
    • Assist is disabled in the Cloud.
    • Assist is enabled by default in the Cloud Team plan.
    • Assist is always disabled when etcd is used as a backend.
  • Conversations @xacrimon
    • A new conversation can be started.
    • SSH command can be executed on one server.
    • SSH command can be executed on multiple servers.
    • SSH command can be executed on a node with per session MFA enabled.
    • Execution output is explained when it fits the context window.
    • Assist can list all nodes/execute a command on all nodes (using embeddings).
    • Access request can be created.
    • Access request is created when approved.
    • Conversation title is set after the first message.
  • SSH integration @xacrimon
    • Assist icon is visible in WebUI's Terminal
    • A Bash command can be generated in the above window.
    • When output is selected in the Terminal, an "Explain" option is available, and it generates a summary.
Test plan cont'd (due to GitHub's issue description size limit).

Desktop Access @ibeckermayer @probakowski

  • Direct mode (set listen_addr):
    • Can connect to desktop defined in static hosts section.
    • Can connect to desktop discovered via LDAP
  • IoT mode (reverse tunnel through proxy):
    • Can connect to desktop defined in static hosts section.
    • Can connect to desktop discovered via LDAP
  • Connect multiple windows_desktop_services to the same Teleport cluster and
    verify that connections to desktops on different AD domains work. (Attempt
    to connect several times to verify that you are routed to the correct
    windows_desktop_service.)
  • Verify user input
    • Download Keyboard Key Info and
      verify all keys are processed correctly in each supported browser. Known
      issue: F11 cannot be captured by the browser without special
      configuration on macOS.
    • Left click and right click register as Windows clicks. (Right click on
      the desktop should show a Windows menu, not a browser context menu)
    • Vertical and horizontal scroll work.
      Horizontal Scroll Test
  • Locking
    • Verify that placing a user lock terminates an active desktop session.
    • Verify that placing a desktop lock terminates an active desktop session.
    • Verify that placing a role lock terminates an active desktop session.
  • Labeling
    • Set client_idle_timeout to a small value and verify that idle sessions
      are terminated (the session should end and an audit event will confirm it
      was due to idle connection)
    • All desktops have teleport.dev/origin label.
    • Dynamic desktops have additional teleport.dev labels for OS, OS
      Version, DNS hostname.
    • Regexp-based host labeling applies across all desktops, regardless of
      origin.
  • RBAC
    • RBAC denies access to a Windows desktop due to labels
    • RBAC denies access to a Windows desktop with the wrong OS-login.
  • Clipboard Support
    • When a user has a role with clipboard sharing enabled and is using a Chromium-based browser
      • Going to a desktop when clipboard permissions are in "Ask" mode (aka "prompt") causes the browser to show a prompt when you first click or press a key
      • The clipboard icon is highlighted in the top bar
      • After allowing clipboard permission, copy text from local workstation, paste into remote desktop
      • After allowing clipboard permission, copy text from remote desktop, paste into local workstation
      • After disallowing clipboard permission, confirm copying text from local workstation and pasting into remote desktop doesn't work
      • After disallowing clipboard permission, confirm copying text from remote desktop and pasting into local workstation doesn't work
    • When a user has a role with clipboard sharing enabled and is not using a Chromium-based browser
      • The clipboard icon is not highlighted in the top bar and copy/paste does not work
    • When a user has a role with clipboard sharing disabled and is using both a Chromium-based and a non-Chromium-based browser (confirm both)
      • The clipboard icon is not highlighted in the top bar and copy/paste does not work
  • Directory Sharing
    • On supported non-Chromium-based browsers (Firefox/Safari)
      • Attempting to share directory logs a sensible warning in the warning dropdown
    • On supported Chromium-based browsers (Chrome/Edge)
      • Begin sharing works
        • The shared directory icon in the top right of the screen is highlighted when directory sharing is initiated
        • The shared directory appears as a network drive named "<directory_name> on teleport"
        • The share directory menu option disappears from the menu
      • Navigation
        • The folders of the shared directory are navigable (move up and down the directory tree)
      • CRUD
        • A new text file can be created
        • The text file can be written to (saved)
        • The text file can be read (close it, check that it's saved on the local machine, then open it again on the remote)
        • The text file can be deleted
      • File/Folder movement
        • In to out (make at least one of these from a non-top-level-directory)
          • A file from inside the shared directory can be drag-and-dropped outside the shared directory
          • A folder from inside the shared directory can be drag-and-dropped outside the shared directory (and its contents retained)
          • A file from inside the shared directory can be cut-pasted outside the shared directory
          • A folder from inside the shared directory can be cut-pasted outside the shared directory
          • A file from inside the shared directory can be copy-pasted outside the shared directory
          • A folder from inside the shared directory can be copy-pasted outside the shared directory
        • Out to in (make at least one of these overwrite an existing file, and one go into a non-top-level directory)
          • A file from outside the shared directory can be drag-and-dropped into the shared directory
          • A folder from outside the shared directory can be drag-and-dropped into the shared directory (and its contents retained)
          • A file from outside the shared directory can be cut-pasted into the shared directory
          • A folder from outside the shared directory can be cut-pasted into the shared directory
          • A file from outside the shared directory can be copy-pasted into the shared directory
          • A folder from outside the shared directory can be copy-pasted into the shared directory
        • Within
          • A file from inside the shared directory cannot be drag-and-dropped to another folder inside the shared directory: a dismissible "Unsupported Action" dialog is shown
          • A folder from inside the shared directory cannot be drag-and-dropped to another folder inside the shared directory: a dismissible "Unsupported Action" dialog is shown
          • A file from inside the shared directory cannot be cut-pasted to another folder inside the shared directory: a dismissible "Unsupported Action" dialog is shown
          • A folder from inside the shared directory cannot be cut-pasted to another folder inside the shared directory: a dismissible "Unsupported Action" dialog is shown
          • A file from inside the shared directory can be copy-pasted to another folder inside the shared directory
          • A folder from inside the shared directory can be copy-pasted to another folder inside shared directory (and its contents retained)
    • RBAC
      • Give the user one role that explicitly disables directory sharing (desktop_directory_sharing: false) and confirm that the option to share a directory doesn't appear in the menu
  • Per-Session MFA (try webauthn on each of Chrome, Safari, and Firefox; u2f only works with Firefox)
    • Attempting to start a session with no keys registered shows an error message
    • Attempting to start a session with a webauthn registered pops up the "Verify Your Identity" dialog
      • Hitting "Cancel" shows an error message
      • Hitting "Verify" causes your browser to prompt you for MFA
      • Cancelling that browser MFA prompt shows an error
      • Successful MFA verification allows you to connect
  • Session Recording
    • Verify sessions are not recorded if all of a user's roles disable recording
    • Verify sync recording (mode: node-sync or mode: proxy-sync)
    • Verify async recording (mode: node or mode: proxy)
    • Sessions show up in session recordings UI with desktop icon
    • Sessions can be played back, including play/pause functionality
    • Session playback speed can be toggled while it's playing
    • Session playback speed can be toggled while it's paused
    • A session that ends with a TDP error message can be played back, ends by displaying the error message,
      and the progress bar progresses to the end.
    • Attempting to play back a session that doesn't exist (i.e. by entering a non-existing session id in the url) shows
      a relevant error message.
    • RBAC for sessions: ensure users can only see their own recordings when
      using the RBAC rule from our
      docs
  • Audit Events (check these after performing the above tests)
    • windows.desktop.session.start (TDP00I) emitted on start
    • windows.desktop.session.start (TDP00W) emitted when session fails to
      start (due to RBAC, for example)
    • client.disconnect (T3006I) emitted when session is terminated by or fails
      to start due to lock
    • windows.desktop.session.end (TDP01I) emitted on end
    • desktop.clipboard.send (TDP02I) emitted for local copy -> remote
      paste
    • desktop.clipboard.receive (TDP03I) emitted for remote copy -> local
      paste
    • desktop.directory.share (TDP04I) emitted when Teleport starts sharing a directory
    • desktop.directory.read (TDP05I) emitted when a file is read over the shared directory
    • desktop.directory.write (TDP06I) emitted when a file is written to over the shared directory
  • Warnings/Errors
    • Induce the backend to send a TDP Notification of severity warning (1), confirm that a warning is logged in the warning dropdown
    • Induce the backend to send a TDP Notification of severity error (2), confirm that session is terminated and error popup is shown
    • Induce the backend to send a TDP Error, confirm that session is terminated and error popup is shown (confirms backwards compatibility w/ older w_d_s starting in Teleport 12)
  • Trusted Cluster / Tunneling
    • Set up Teleport in a trusted cluster configuration where the root and leaf cluster has a w_d_s connected via tunnel (w_d_s running as a separate process)
      • Confirm that windows desktop sessions can be made on root cluster
      • Confirm that windows desktop sessions can be made on leaf cluster
  • Non-AD setup
    • Installer in GUI mode finishes successfully on an instance that is not part of a domain
    • Installer works correctly when invoked from the command line
    • A non-AD instance can be added to the non_ad_hosts section in the config file and is visible in the UI
    • A non-AD instance can be added as a dynamic resource and is visible in the UI
    • A non-AD instance has the label teleport.dev/ad: false
    • Connecting to a non-AD instance works with OSS if there are no more than 5 non-AD desktops
    • Connecting to a non-AD instance fails with OSS if there are more than 5 non-AD desktops
    • Connecting to a non-AD instance always works with an Enterprise license
    • In the OSS version, if there are more than 5 non-AD desktops, a banner shows up telling you to upgrade
    • The banner goes away if you reduce the number of non-AD desktops to 5 or fewer
    • Installer in GUI mode successfully uninstalls the Authentication Package (logging in is not possible)
    • Installer successfully uninstalls the Authentication Package (logging in is not possible) when invoked from the command line

Binaries compatibility @fheinecke

  • Verify tsh runs on:
    • Windows 10
    • macOS

Machine ID @strideynet

SSH

With a default Teleport instance configured with a SSH node:

  • Verify you are able to create a new bot user with tctl bots add robot --roles=access. Follow the instructions provided in the output to start tbot
  • Verify you are able to connect to the SSH node using openssh with the generated ssh_config in the destination directory
  • Verify that after the renewal period (default 20m, but this can be reduced via configuration), newly generated certificates are placed in the destination directory
  • Verify that sending both SIGUSR1 and SIGHUP to a running tbot process causes a renewal and new certificates to be generated
  • Verify that you are able to make a connection to the SSH node using the ssh_config provided by tbot after each phase of a manual CA rotation.
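
A hedged example of the OpenSSH connection test above (the destination
directory path, user, node, and cluster name are placeholders):

  ssh -F /opt/machine-id/ssh_config user@node.example-cluster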

Ensure the above tests are completed for both:

  • Directly connecting to the auth server
  • Connecting to the auth server via the proxy reverse tunnel

DB Access

With a default Postgres DB instance, a Teleport instance configured with DB access and a bot user configured:

  • Verify you are able to connect to and interact with a database using tbot db while tbot start is running

Host users creation @jakule

Host users creation docs
Host users creation RFD

  • Verify host users creation functionality
    • non-existing users are created automatically
    • users are added to groups
      • non-existing configured groups are created
      • created users are added to the teleport-system group
    • users are cleaned up after their session ends
      • cleanup occurs if a program was left running after session ends
    • sudoers file creation is successful
      • Invalid sudoers files are not created
    • existing host users are not modified
    • setting disable_create_host_user: true stops user creation from occurring
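
A hedged role sketch for these tests, assuming the create_host_user option
and the host_groups/host_sudoers allow fields; logins, groups, and sudoers
entries are placeholders:

  kind: role
  version: v7
  metadata:
    name: auto-host-users
  spec:
    options:
      create_host_user: true
    allow:
      logins: [alice]
      host_groups: [docker]
      host_sudoers: ["ALL=(ALL) NOPASSWD: ALL"]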

CA rotations @espadolini

  • Verify the CA rotation functionality itself (by checking in the backend or with tctl get cert_authority)
    • standby phase: only active_keys, no additional_trusted_keys
    • init phase: active_keys and additional_trusted_keys
    • update_clients and update_servers phases: the certs from the init phase are swapped
    • standby phase: only the new certs remain in active_keys, nothing in additional_trusted_keys
    • rollback phase (second pass, after completing a regular rotation): same content as in the init phase
    • standby phase after rollback: same content as in the previous standby phase
  • Verify functionality in all phases (clients might have to log in again instead of waiting for credentials to expire between phases)
    • SSH session in tsh from a previous phase
    • SSH session in web UI from a previous phase
    • New SSH session with tsh
    • New SSH session with web UI
    • New SSH session in a child cluster on the same major version
    • New SSH session in a child cluster on the previous major version
    • New SSH session from a parent cluster
    • Application access through a browser
    • Application access through curl with tsh apps login
    • kubectl get po after tsh kube login
    • Database access (no configuration change should be necessary if the database CA isn't rotated; other Teleport functionality should not be affected if only the database CA is rotated)
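
A sketch of the manual rotation sequence used to step through the phases above (flags per tctl auth rotate --help; --type=host is just an example CA):

tctl auth rotate --manual --type=host --phase=init
tctl status   # confirm the current rotation phase
tctl auth rotate --manual --type=host --phase=update_clients
tctl auth rotate --manual --type=host --phase=update_servers
tctl auth rotate --manual --type=host --phase=standby
# For the rollback checks, run --phase=rollback after init/update_* instead of advancing.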

Proxy Peering

Proxy Peering docs
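
For reference, a minimal sketch of the tunnel strategy that switches a cluster to proxy peering, assuming the shape from the proxy peering docs (the connection count is a placeholder):

cat >> /etc/teleport.yaml <<'EOF'
auth_service:
  tunnel_strategy:
    type: proxy_peering
    agent_connection_count: 1
EOF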

EC2 Discovery @marcoandredinis

EC2 Discovery docs

  • Verify EC2 instance discovery (a matcher sketch follows this list)
    • Only EC2 instances matching given AWS tags have the installer executed on them
    • Only the IAM permissions mentioned in the discovery docs are required for operation
    • Custom scripts specified in different matchers are executed
    • Custom SSM documents specified in different matchers are executed
    • New EC2 instances with matching AWS tags are discovered and added to the Teleport cluster
      • Large numbers of EC2 instances (51+) are all successfully added to the cluster
    • Nodes that have been discovered do not have the install script run on them multiple times
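
A matcher sketch for the discovery checks above, assuming the shape from the EC2 discovery docs (region, tag, and SSM document name are placeholders):

cat >> /etc/teleport.yaml <<'EOF'
discovery_service:
  enabled: yes
  aws:
    - types: ["ec2"]
      regions: ["us-east-1"]
      tags:
        "teleport": "yes"
      ssm:
        # Custom SSM document used to run the installer.
        document_name: "TeleportDiscoveryInstaller"
EOF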

Azure Discovery @hugoShaka

Azure Discovery docs

GCP Discovery @tcsc

GCP Discovery docs

  • Verify GCP instance discovery
    • Only GCP instances matching given GCP tags have the installer executed on them
    • Only the IAM permissions mentioned in the discovery docs are required for operation
    • Custom scripts specified in different matchers are executed
    • New GCP instances with matching GCP tags are discovered and added to the Teleport cluster
      • Large numbers of GCP instances (51+) are all successfully added to the cluster
    • Nodes that have been discovered do not have the install script run on them multiple times

IP Pinning @AntonAM

Add a role with pin_source_ip: true (requires Enterprise) to test IP pinning.
Testing will require changing your IP address (the one the Teleport Proxy sees).
Docs: IP Pinning
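
A role sketch for the checks below, using pin_source_ip as named in the IP Pinning docs (logins and labels are placeholders):

cat > pinned-role.yaml <<'EOF'
kind: role
version: v5
metadata:
  name: pinned-access
spec:
  options:
    # Pin certificates to the client IP observed at login.
    pin_source_ip: true
  allow:
    logins: [root]
    node_labels:
      '*': '*'
EOF
tctl create pinned-role.yaml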

  • Verify that it works for SSH Access
    • You can access tunnel node with tsh ssh on root cluster
    • You can access direct access node with tsh ssh on root cluster
    • You can access tunnel node from Web UI on root cluster
    • You can access direct access node from Web UI on root cluster
    • You can access tunnel node with tsh ssh on leaf cluster
    • You can access direct access node with tsh ssh on leaf cluster
    • You can access tunnel node from Web UI on leaf cluster
    • You can access direct access node from Web UI on leaf cluster
    • You can download files from nodes in the Web UI (the small arrows at the top left corner)
    • If you change your IP, you can no longer access nodes.
  • Verify that it works for Kube Access
    • You can access Kubernetes cluster through standalone Kube service on root cluster
    • You can access Kubernetes cluster through agent inside Kubernetes on root cluster
    • You can access Kubernetes cluster through standalone Kube service on leaf cluster
    • You can access Kubernetes cluster through agent inside Kubernetes on leaf cluster
    • If you change your IP, you can no longer access Kube clusters.
  • Verify that it works for DB Access
    • You can access DB servers on root cluster
    • You can access DB servers on leaf cluster
    • If you change your IP, you can no longer access DB servers.
  • Verify that it works for App Access
    • You can access App service on root cluster
    • You can access App service on leaf cluster
    • If you change your IP, you can no longer access App services.
  • Verify that it works for Desktop Access
    • You can access Desktop service on root cluster
    • You can access Desktop service on leaf cluster
    • If you change your IP, you can no longer access Desktop services.

Resources

Quick GitHub/SAML/OIDC Setup Tips

@zmb3
Collaborator

zmb3 commented Aug 30, 2023

Looks like passwordless registration broke due to a dependency update: #31187

Edit: fixed

@espadolini
Contributor

espadolini commented Aug 31, 2023

PostgreSQL 10k test

Setup

Azure Database for PostgreSQL Flexible Server 15.3 on GP_Standard_D2ds_v4 (2 vCPU, 8GiB of RAM) with 128GiB of storage and Zone-Redundant HA. The kv table was manually altered to REPLICA IDENTITY FULL before running tests (which has no effect on Teleport 14.0.0-alpha.2 but increases the WAL load somewhat).

AKS with Kubernetes 1.26.6, 15 nodes Standard_D16s_v3 (16 vCPU, 64GiB of RAM), same region as the database (northeurope).

Teleport configured with 3 auths and 3 proxies, PostgreSQL pool_max_conns=50.

10k tunnel nodes

Metrics

Screenshot 2023-08-31 alle 12 14 25 Screenshot 2023-08-31 alle 12 14 05 Screenshot 2023-08-31 alle 12 13 58 Screenshot 2023-08-31 alle 12 13 50

Soak test

Ran from a pod in the same cluster as the control plane and the nodes.

# tsh bench --duration=30m ssh root@agents-5c8876d478-zp7bg-27 ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         142 ms            
50         148 ms            
75         155 ms            
90         163 ms            
95         170 ms            
99         239 ms            
100        2251 ms           

# tsh bench --duration=30m ssh --random root@all ls

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         169 ms            
50         180 ms            
75         215 ms            
90         242 ms            
95         253 ms            
99         294 ms            
100        2191 ms           

# tsh bench --duration=30m ssh root@fullname=agents-5c8876d478-226qm-25 ls 

* Requests originated: 17997
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         256 ms            
50         267 ms            
75         280 ms            
90         298 ms            
95         322 ms            
99         364 ms            
100        1408 ms           

5k sessions (on 1k tunnel nodes) then 10k direct connect nodes

Metrics

Screenshot 2023-09-01 alle 13 42 59 Screenshot 2023-09-01 alle 13 46 47 Screenshot 2023-09-01 alle 13 47 12 Screenshot 2023-09-01 alle 13 48 58

Soak test

Ran from a pod in the same cluster as the control plane and the nodes.

root@ubu2:/# tsh bench --duration=30m ssh root@agents-5c8876d478-zx6k5-23 ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         146 ms            
50         151 ms            
75         157 ms            
90         165 ms            
95         171 ms            
99         189 ms            
100        1854 ms           

root@ubu2:/# tsh bench --duration=30m ssh --random root@all ls

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         171 ms            
50         186 ms            
75         225 ms            
90         248 ms            
95         261 ms            
99         285 ms            
100        2185 ms           

root@ubu2:/# tsh bench --duration 30m ssh root@fullname=agents-5c8876d478-z497h-24 ls

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         295 ms            
50         307 ms            
75         325 ms            
90         350 ms            
95         370 ms            
99         412 ms            
100        2395 ms           

root@ubu2:/# 

Concurrent sessions test

No errors reported by tsh bench web session --max=5000 joe ls; however, the backend metrics show a sizeable latency spike at the end of the test, shown here on three separate attempts:

Screenshot 2023-09-01 alle 13 49 25

This is not an immediate cause for concern, as real workloads will never shut down 5000 sessions at exactly the same time, but it might be possible to improve the behavior with some tuning (reducing the size of the connection pool might help by reducing the actual need for retries due to contention).

@strideynet
Contributor

In addition to the test plan tasks, Machine ID was also tested for Kubernetes Access and Application Access (I will add these to the test plan before T15)

@codingllama
Contributor

codingllama commented Aug 31, 2023

tsh on Windows panics on mfa add: #31333.

Edit: solved.

@atburke
Contributor

atburke commented Sep 1, 2023

Loading all CAs for tsh ssh is broken: #31339

@tcsc
Contributor

tcsc commented Sep 4, 2023

tctl sso configure github is broken (possibly only in Enterprise clusters). See #31396 (fix in #31397)

@lxea
Contributor

lxea commented Sep 5, 2023

tsh join <agentless-node> seems to be broken #31422
This was never supported.

@GavinFrazar
Contributor

I scratched off the items for "Test Databases screen in the web UI" since that screen was removed and replaced by the unified resource view.

I did verify that searching, filtering, sorting, etc. work in the unified resources view for databases, however. Those test plan steps need updating @avatus

@avatus
Contributor

avatus commented Sep 6, 2023

Will do, thanks @GavinFrazar. #31214

@mdwn
Contributor

mdwn commented Sep 6, 2023

teleport app start appears to be broken. #31496

@espadolini
Contributor

espadolini commented Sep 7, 2023

Firestore 10k test

Metrics

Screenshot 2023-09-07 alle 17 51 03 Screenshot 2023-09-07 alle 17 51 15 Screenshot 2023-09-07 alle 17 51 29 Screenshot 2023-09-07 alle 17 51 53

Soak test

Single node, random and label-based (single node) 30min soak tests passed with no errors.

@mdwn
Contributor

mdwn commented Sep 7, 2023

AWS roles do not show up when trying to log in to the AWS console from the new unified resource view in the web UI: #31573

@hugoShaka
Contributor

hugoShaka commented Sep 7, 2023

Azure Discovery keeps running the discovery script on already-joined VMs every 10 minutes, but it seems the bug was already present in 13: #28879

@hugoShaka
Contributor

Azure Discovery permissions are not up to date, and following the docs doesn't allow you to set up a working discovery service: #31602

@rosstimothy
Contributor

Agents running versions older than v14 are not able to connect to a v14 cluster: #31607

@rosstimothy
Contributor

Cloud Load Tests

30k Scaling Test

https://grafana-staging-onprem.platform.teleport.sh/goto/60SnOjzIR?orgId=1

10k Concurrent Sessions

Soak Tests

Origin: us-east-1, Target: us-east-1
kubectl logs -n soaktest -f pod/soaktest-7zpdz-56wqw
+ tsh --proxy=benchmark.cloud.gravitational.io:443 -i /etc/teleport/auth bench ssh --duration=30m root@node-agents-766996b7b9-zv7b2-09 ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         191 ms
50         195 ms
75         201 ms
90         210 ms
95         216 ms
99         270 ms
100        5199 ms

+ tsh --proxy=benchmark.cloud.gravitational.io:443 -i /etc/teleport/auth bench ssh --duration=30m root@fullname=node-agents-766996b7b9-zv7b2-09 ls

* Requests originated: 17996
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         466 ms
50         473 ms
75         481 ms
90         493 ms
95         505 ms
99         551 ms
100        1193 ms

+ tsh --proxy=benchmark.cloud.gravitational.io:443 -i /etc/teleport/auth bench ssh --duration=30m --random root@all ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         188 ms
50         195 ms
75         203 ms
90         215 ms
95         229 ms
99         284 ms
100        9191 ms

https://grafana-staging-onprem.platform.teleport.sh/goto/QDvmOCzIR?orgId=1

Origin: us-west-2, Target: us-east-1

kubectl logs -n soaktest -f pod/soaktest-rkbzk-zvf2z
+ tbot start --data-dir=/var/lib/teleport/bot --destination-dir=/opt/machine-id --token=163cbdb82281e399049c8034ef77219b --join-method=token --auth-server=benchmark.cloud.gravitational.io:443 --certificate-ttl=8h --oneshot
  [TBOT]      INFO Anonymous telemetry is not enabled. Find out more about Machine ID's anonymous telemetry at https://goteleport.com/docs/machine-id/reference/telemetry/ tbot/anonymous_telemetry.go:82
  [TBOT]      INFO Created directory "/var/lib/teleport/bot" config/destination_directory.go:135
  [TBOT]      INFO Created directory "/opt/machine-id" config/destination_directory.go:135
  [TBOT]      INFO Initializing bot identity. tbot/tbot.go:254
  [TBOT]      INFO Loading existing bot identity from store. store:directory: /var/lib/teleport/bot tbot/tbot.go:325
  [TBOT]      INFO No existing bot identity found in store. Bot will join using configured token. tbot/tbot.go:329
  [TBOT]      INFO Fetching bot identity using token. tbot/bot_identity.go:193
  [AUTH]      INFO Attempting registration via proxy server. auth/register.go:278
  [AUTH]      INFO Successfully registered via proxy server. auth/register.go:285
  [TBOT]      INFO Fetched new bot identity. identity:valid: after=2023-09-05T19:16:47Z, before=2023-09-06T03:17:47Z, duration=8h1m0s | kind=tls, renewable=true, disallow-reissue=false, roles=[bot-soaktest-bot], principals=[-teleport-internal-join], generation=1 tbot/tbot.go:298
  [TBOT]      INFO Bot initialization complete. tbot/tbot.go:316
  [TBOT]      INFO One-shot mode enabled. Generating outputs. tbot/tbot.go:118
  [TBOT]      INFO Generating output. output:identity (directory: /opt/machine-id) tbot/impersonated_identity.go:528
  [TBOT]      INFO Generated output. output:identity (directory: /opt/machine-id) tbot/impersonated_identity.go:573
  [TBOT]      INFO Generated outputs. One-shot mode is enabled so exiting. tbot/tbot.go:123
+ tsh --proxy=benchmark.cloud.gravitational.io:443 -i /opt/machine-id/identity bench ssh --duration=30m root@node-agents-5d68d45658-25b9q-00 ls

* Requests originated: 17992
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         881 ms
50         892 ms
75         912 ms
90         930 ms
95         936 ms
99         951 ms
100        5619 ms

+ tsh --proxy=benchmark.cloud.gravitational.io:443 -i /opt/machine-id/identity bench ssh --duration=30m root@fullname=node-agents-5d68d45658-25b9q-00 ls

* Requests originated: 17991
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         903 ms
50         917 ms
75         948 ms
90         1027 ms
95         1039 ms
99         1054 ms
100        1136 ms

https://grafana-staging-onprem.platform.teleport.sh/goto/S80kdCzIR?orgId=1

Origin: us-east-1, Target: us-west-2
kubectl logs -n soaktest -f pod/soaktest-9vg9b-k458s
+ tbot start --data-dir=/var/lib/teleport/bot --destination-dir=/opt/machine-id --token=b29218b11195ef04f77b1ab93b7382fc --join-method=token --auth-server=benchmark.cloud.gravitational.io:443 --certificate-ttl=8h --oneshot
  INFO [TBOT]      Created directory "/var/lib/teleport/bot" config/destination_directory.go:135
  INFO [TBOT]      Anonymous telemetry is not enabled. Find out more about Machine ID's anonymous telemetry at https://goteleport.com/docs/machine-id/reference/telemetry/ tbot/anonymous_telemetry.go:82
  INFO [TBOT]      Created directory "/opt/machine-id" config/destination_directory.go:135
  INFO [TBOT]      Initializing bot identity. tbot/tbot.go:254
  INFO [TBOT]      Loading existing bot identity from store. store:directory: /var/lib/teleport/bot tbot/tbot.go:325
  INFO [TBOT]      No existing bot identity found in store. Bot will join using configured token. tbot/tbot.go:329
  INFO [TBOT]      Fetching bot identity using token. tbot/bot_identity.go:193
  INFO [AUTH]      Attempting registration via proxy server. auth/register.go:278
  INFO [AUTH]      Successfully registered via proxy server. auth/register.go:285
  INFO [TBOT]      Fetched new bot identity. identity:valid: after=2023-09-05T20:25:33Z, before=2023-09-06T04:26:32Z, duration=8h0m59s | kind=tls, renewable=true, disallow-reissue=false, roles=[bot-soaktest-bot], principals=[-teleport-internal-join], generation=1 tbot/tbot.go:298
  INFO [TBOT]      Bot initialization complete. tbot/tbot.go:316
  INFO [TBOT]      One-shot mode enabled. Generating outputs. tbot/tbot.go:118
  INFO [TBOT]      Generating output. output:identity (directory: /opt/machine-id) tbot/impersonated_identity.go:528
  INFO [TBOT]      Generated output. output:identity (directory: /opt/machine-id) tbot/impersonated_identity.go:573
  INFO [TBOT]      Generated outputs. One-shot mode is enabled so exiting. tbot/tbot.go:123
+ tsh --proxy=benchmark.cloud.gravitational.io:443 -i /opt/machine-id/identity bench ssh --duration=30m root@node-agents-5d68d45658-z8nz7-00 ls

* Requests originated: 17992
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         854 ms
50         863 ms
75         868 ms
90         875 ms
95         880 ms
99         910 ms
100        1270 ms

+ tsh --proxy=benchmark.cloud.gravitational.io:443 -i /opt/machine-id/identity bench ssh --duration=30m root@fullname=node-agents-5d68d45658-z8nz7-00 ls

* Requests originated: 17989
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1125 ms
50         1133 ms
75         1143 ms
90         1151 ms
95         1159 ms
99         1191 ms
100        3631 ms

https://grafana-staging-onprem.platform.teleport.sh/goto/S80kdCzIR?orgId=1

@fspmarshall
Contributor

ETCD 10k Loadtest (simulated)

This is a first attempt at creating a "simulated" loadtesting procedure. The procedure is simulated in that it stresses the backend by using tctl loadtest node-heartbeats rather than by creating actual Teleport nodes. The backend and control plane are created as usual.

For this initial attempt the following command was run concurrently on each auth pod (note the 5k count, totaling 10k since two auth pods were created):

tctl loadtest node-heartbeats --count=5000 --ttl=2m --interval=1m --labels=2 --concurrency=32

This method of loadtesting generally produces significantly less load on the control plane, but roughly equivalent load on the backend itself:
Screenshot 2023-09-08 at 7 43 41 PM

Screenshot 2023-09-08 at 7 45 52 PM

In addition to the above metrics, auth logs were specifically monitored for any cache and/or event system related errors. While we don't anticipate such things at 10k these days, such errors would be one of the main signs of regression that might not be immediately obvious.

As a point of future improvement, I think we should start tracking cache resets and watcher buffer overflows as part of our standard suite of backend metrics so that we can better monitor the health of the event system.

@fspmarshall
Contributor

fspmarshall commented Sep 9, 2023

ETCD 30k Loadtest (simulated)

See #31122 (comment) for an explanation of the simulated loadtest procedure. The 30k procedure was identical, except that 15k heartbeats were applied per auth server instead of 5k.

Screenshot 2023-09-08 at 8 25 12 PM Screenshot 2023-09-08 at 8 30 53 PM

As with the 10k procedure, logs were explicitly monitored for cache and event system issues. None were observed, but the metrics improvement thoughts from that comment still stand.

@gabrielcorado
Contributor

Database Access load test (PostgreSQL and MySQL)

Setup

EKS with a single node group:

  • Min: 2, Max: 10 instances.
  • Instance class: m5.4xlarge
  • Kubernetes version: 1.27

Teleport cluster (all deployed on the EKS cluster):

  • DynamoDB backend
  • 3 Auth servers
  • 3 Proxy instances
  • 1 Database Agent

Databases:

  • Single PostgreSQL RDS instance on a db.t4g.xlarge instance class. Accessed through RDS Proxy with single RW endpoint.
  • Single MySQL RDS instance on a db.t4g.xlarge instance class. Accessed through RDS Proxy with single RW endpoint.

Note: Databases were configured using discovery running inside the database agent.

tsh bench commands were executed inside the cluster.

MySQL

10 connections/second

# tsh bench mysql mysql-proxy-rdsproxy --db-user=mysql --db-name=mysql --rate=10 --duration=30m

* Requests originated: 18000
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         60 ms
50         63 ms
75         68 ms
90         75 ms
95         81 ms
99         100 ms
100        3105 ms

50 connections/second

# tsh bench mysql mysql-proxy-rdsproxy --db-user=mysql --db-name=mysql --rate=50 --duration=30m

* Requests originated: 89951
* Requests failed: 81
* Last error: io.ReadFull(header) failed. err EOF: connection was bad

Histogram

Percentile Response Duration
---------- -----------------
25         520 ms
50         709 ms
75         880 ms
90         1036 ms
95         1142 ms
99         1363 ms
100        2289 ms

Notes

The failed connections happened at the end of the benchmark test, where the final connections didn't have a chance to complete because tsh bench canceled them.

PostgreSQL

10 connections/second

# tsh bench postgres postgres-proxy-rdsproxy --db-user=postgres --db-name=postgres --rate=10 --duration=30m

* Requests originated: 18000
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         74 ms
50         77 ms
75         82 ms
90         87 ms
95         93 ms
99         113 ms
100        1112 ms

50 connections/second

# tsh bench postgres postgres-proxy-rdsproxy --db-user=postgres --db-name=postgres --rate=50 --duration=30m

* Requests originated: 89916
* Requests failed: 6813
* Last error: failed to connect to `host=127.0.0.1 user=postgres database=postgres`: server error (: failed to connect to any of the database servers (SQLSTATE ))

Histogram

Percentile Response Duration
---------- -----------------
25         712 ms
50         1080 ms
75         1403 ms
90         1623 ms
95         1743 ms
99         1999 ms
100        2963 ms

Notes

Most of the connection failures were due to the proxy not being able to communicate with the database agent.

Logs
WARN [DB:PROXY]  Failed to dial database DatabaseServer(Name=gabrielcorado-loadtest-postgres-proxy-rdsproxy-us-east-1-278576220453, Version=14.0.0-alpha.2, Hostname=database-agents-0, HostID=203cb416-7a12-4075-a8d8-c92bd0b1c7c1, Database=Database(Name=gabrielcorado-loadtest-postgres-proxy-rdsproxy-us-east-1-278576220453, Type=rdsproxy, Labels=map[account-id:278576220453 engine:POSTGRESQL loadtest:gabrielcorado-loadtest region:us-east-1 teleport.dev/cloud:AWS teleport.dev/origin:cloud teleport.internal/discovered-name:gabrielcorado-loadtest-postgres-proxy vpc-id:vpc-0452b380742d5815c])). error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError Teleport proxy failed to connect to "db" agent "@local-node" over reverse tunnel:
  no tunnel connection found: no db reverse tunnel for 203cb416-7a12-4075-a8d8-c92bd0b1c7c1.gabrielcorado-loadtest.teleportdemo.net found
This usually means that the agent is offline or has disconnected. Check the
agent logs and, if the issue persists, try restarting it or re-registering it
with the cluster.
Stack Trace:
       github.com/gravitational/teleport/lib/reversetunnel/localsite.go:582 github.com/gravitational/teleport/lib/reversetunnel.(*localSite).getConn
       github.com/gravitational/teleport/lib/reversetunnel/localsite.go:306 github.com/gravitational/teleport/lib/reversetunnel.(*localSite).DialTCP
       github.com/gravitational/teleport/lib/reversetunnel/localsite.go:274 github.com/gravitational/teleport/lib/reversetunnel.(*localSite).Dial
       github.com/gravitational/teleport/lib/srv/db/proxyserver.go:471 github.com/gravitational/teleport/lib/srv/db.(*ProxyServer).Connect
       github.com/gravitational/teleport/lib/srv/db/postgres/proxy.go:102 github.com/gravitational/teleport/lib/srv/db/postgres.(*Proxy).handleConnection
       github.com/gravitational/teleport/lib/srv/db/postgres/proxy.go:66 github.com/gravitational/teleport/lib/srv/db/postgres.(*Proxy).HandleConnection
       github.com/gravitational/teleport/lib/srv/db/proxyserver.go:349 github.com/gravitational/teleport/lib/srv/db.(*ProxyServer).handleConnection
       github.com/gravitational/teleport/lib/srv/db/proxyserver.go:303 github.com/gravitational/teleport/lib/srv/db.(*ProxyServer).ServeTLS.func1
       runtime/asm_amd64.s:1650 runtime.goexit
User Message: Teleport proxy failed to connect to "db" agent "@local-node" over reverse tunnel:
  no tunnel connection found: no db reverse tunnel for 203cb416-7a12-4075-a8d8-c92bd0b1c7c1.gabrielcorado-loadtest.teleportdemo.net found
This usually means that the agent is offline or has disconnected. Check the
agent logs and, if the issue persists, try restarting it or re-registering it
with the cluster.] db/proxyserver.go:483

During the entire test, the database agent was logging the warning Failed to emit audit event db.session.query(TDB02I). This server's connection to the auth service appears to be slow. (events/emitter.go:113) for all DB session events.

Worth noting that during the tests a single Audit instance was handling all the audit events, which could cause the delayed processing:

Screenshot 2023-09-06 at 12 42 06

@camscale
Contributor

camscale commented Sep 11, 2023

Teleport fails to start with "distant" DynamoDB backend: #31690

@tigrato
Contributor

tigrato commented Sep 11, 2023

Kubernetes Access load test

Setup

EKS with a single node group:

  • 5 instances.
  • Instance class: m5.4xlarge
  • Kubernetes version: 1.27

Teleport cluster (all deployed on the EKS cluster):

  • DynamoDB backend
  • 3 Auth servers
  • 3 Proxy instances
  • 1 Kubernetes Agent

tsh bench commands were executed inside the cluster.

kubectl get pods

This test involves forwarding the request to the upstream service, unmarshaling the response, filtering it, and returning it to the end user.

10 connections/second

rate10ps

latency10ps

size10ps

tsh bench kube ls my-cluster --rate=10 --duration=30m

* Requests originated: 18000
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         28 ms             
50         29 ms             
75         30 ms             
90         33 ms             
95         36 ms             
99         47 ms             
100        167 ms            

50 connections/second

screenshot_2023-09-11_15:55:17_selection
screenshot_2023-09-11_15:55:22_selection
screenshot_2023-09-11_15:55:27_selection
screenshot_2023-09-11_15:55:31_selection
screenshot_2023-09-11_15:57:19_selection
screenshot_2023-09-11_15:57:27_selection

tsh bench kube ls my-cluster --duration=30m --rate 50

* Requests originated: 89999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         28 ms             
50         29 ms             
75         32 ms             
90         36 ms             
95         43 ms             
99         81 ms             
100        735 ms      

100 connections/second

screenshot_2023-09-11_15:49:28_selection
screenshot_2023-09-11_15:49:35_selection
screenshot_2023-09-11_15:49:40_selection
screenshot_2023-09-11_15:49:47_selection

screenshot_2023-09-11_15:57:55_selection
screenshot_2023-09-11_15:58:01_selection


tsh bench kube ls my-cluster --duration=30m --rate 100

* Requests originated: 179998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         26 ms             
50         28 ms             
75         32 ms             
90         47 ms             
95         72 ms             
99         176 ms            
100        743 ms 

kubectl exec

For Kubernetes exec we are using the same cluster as above, but we are executing the date command inside a Pod.
There are some limitations around the SPDY executor that prevent us from getting higher throughput than 30 commands/second. This is a known issue and we are going to address it by using a different executor for Kubernetes 1.29+.

Until then, we are limited to lower numbers of commands/second.

5 connections/second

tsh bench --duration=10m --rate=5  kube exec my-cluster  ubuntu date

* Requests originated: 3000
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         122 ms            
50         134 ms            
75         145 ms            
90         155 ms            
95         160 ms            
99         173 ms            
100        437 ms      

30 connections/second

This test was achieved by running 3 tsh bench commands in parallel and combining the results.

tsh bench --duration=10m --rate=30  kube exec my-cluster  ubuntu date

* Requests originated: 18000
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         121 ms            
50         134 ms            
75         144 ms            
90         154 ms            
95         161 ms            
99         180 ms            
100        438 ms            


@smallinsky
Contributor

tbot db access backward compatibility issue: #31750

@hugoShaka
Contributor

Not sure if this is a bug, but Azure VMs belonging to Scale Sets are not discovered: #31758

@fspmarshall
Contributor

500 Trusted Clusters (etcd)

Screenshot 2023-09-11 at 9 30 02 PM

@capnspacehook
Contributor

capnspacehook commented Sep 12, 2023

Playing a leaf SSH session recorded at the proxy fails: #31776

@tcsc
Contributor

tcsc commented Sep 13, 2023

GCP Discovery appears totally broken. Existing issue: #31386

@AntonAM
Contributor

AntonAM commented Sep 13, 2023

IP-pinned users can't upload/download files after connecting to nodes in the web UI: #31845

@tigrato
Contributor

tigrato commented Sep 15, 2023

#30248 broke some Kubernetes watch streams when they were not very active or were too slow, e.g. the watcher created by kubectl run -it alpine --image alpine --command sh to receive the pod status.
Fix: #31945

@fspmarshall
Contributor

ETCD Soak Tests

tsh bench ssh --duration=30m root@node ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         103 ms            
50         108 ms            
75         119 ms            
90         125 ms            
95         128 ms            
99         146 ms            
100        326 ms            

tsh bench ssh --duration=30m root@label=value ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         111 ms            
50         114 ms            
75         127 ms            
90         132 ms            
95         135 ms            
99         156 ms            
100        5519 ms           

tsh bench ssh --duration=30m --random root@all ls

* Requests originated: 17999
* Requests failed: 345
* Last error: failed connecting to host ac849c5d-1406-4995-baae-a0f793079190:0: failed to receive cluster details response
	failed to dial target host
	Teleport proxy failed to connect to "node" agent "@local-node" over reverse tunnel:

  no tunnel connection found: no node reverse tunnel for ac849c5d-1406-4995-baae-a0f793079190.fspm-loadtest.teleport-test.com found

This usually means that the agent is offline or has disconnected. Check the
agent logs and, if the issue persists, try restarting it or re-registering it
with the cluster.

Histogram

Percentile Response Duration 
---------- ----------------- 
25         106 ms            
50         117 ms            
75         124 ms            
90         135 ms            
95         138 ms            
99         149 ms            
100        334 ms

Note: the missing node during the --random run was unrelated to the test.

@zmb3 zmb3 closed this as completed Sep 22, 2023