6.0 Test Plan #5654

Closed
russjones opened this issue Feb 22, 2021 · 15 comments

Labels
test-plan A list of tasks required to ship a successful product release.


russjones commented Feb 22, 2021

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport.
These tests should be run on both a fresh install of the version to be released
and an upgrade from the previous version of Teleport.

  • Adding nodes to a cluster @russjones (see the command sketch after this list)

    • Adding Nodes via Valid Static Token
    • Adding Nodes via Valid Short-lived Tokens
    • Adding Nodes via Invalid Token Fails
    • Revoking Node Invitation
  • Labels @xacrimon

    • Static Labels
    • Dynamic Labels
  • Trusted Clusters @awly

    • Adding Trusted Cluster Valid Static Token
    • Adding Trusted Cluster Valid Short-lived Token
    • Adding Trusted Cluster Invalid Token
    • Removing Trusted Cluster
  • RBAC @quinqu

    Make sure that both valid and invalid attempts are reflected in the audit log.

    • Successfully connect to node with correct role
    • Unsuccessfully connect to a node in a role restricting access by label
    • Unsuccessfully connect to a node in a role restricting access by invalid SSH login
    • Allow/deny role option: SSH agent forwarding
    • Allow/deny role option: Port forwarding
  • Users @fspmarshall @awly
    With every user combination, try to log in and sign up with an invalid second factor and an invalid password to see how the system reacts.

    • Adding Users Password Only
    • Adding Users OTP
    • Adding Users U2F
    • Managing MFA devices (see the command sketch after this list)
      • Add an OTP device with tsh mfa add
      • Add a U2F device with tsh mfa add
      • List MFA devices with tsh mfa ls
      • Remove an OTP device with tsh mfa rm
      • Remove a U2F device with tsh mfa rm
      • Attempt removing the last MFA device on the user
    • Login Password Only
    • Login with MFA
    • Login OIDC
    • Login SAML
    • Login GitHub
    • Deleting Users
  • Backends @andrejtokarcik

    • Teleport runs with etcd
    • Teleport runs with dynamodb
    • Teleport runs with dir (sqlite)
    • Teleport runs with Firestore.
  • Session Recording @benarent

    • Session recording can be disabled
    • Sessions can be recorded at the node
      • Sessions in remote clusters are recorded in remote clusters
    • Sessions can be recorded at the proxy
      • Sessions on remote clusters are recorded in the local cluster
      • Enable/disable host key checking.
  • Audit Log @a-palchikov

    • Failed login attempts are recorded
    • Interactive sessions have the correct Server ID
      • Server ID is the ID of the node in regular mode
      • Server ID is randomly generated for proxy node
    • Exec commands are recorded
    • scp commands are recorded
    • Subsystem results are recorded
  • Interact with a cluster using tsh @Joerger

    These commands should ideally be tested in both recording and non-recording modes, as they are implemented in different ways.

    • tsh ssh <regular-node>
    • tsh ssh <node-remote-cluster>
    • tsh ssh -A <regular-node>
    • tsh ssh -A <node-remote-cluster>
    • tsh ssh <regular-node> ls
    • tsh ssh <node-remote-cluster> ls
    • tsh join <regular-node>
    • tsh join <node-remote-cluster>
    • tsh play <regular-node>
    • tsh play <node-remote-cluster>
    • tsh scp <regular-node>
    • tsh scp <node-remote-cluster>
    • tsh ssh -L <regular-node>
    • tsh ssh -L <node-remote-cluster>
    • tsh ls
    • tsh clusters
  • Interact with a cluster using ssh @webvictim
    Make sure to test both recording and regular proxy modes.

    • ssh <regular-node>
    • ssh <node-remote-cluster>
    • ssh -A <regular-node>
    • ssh -A <node-remote-cluster>
    • ssh <regular-node> ls
    • ssh <node-remote-cluster> ls
    • scp <regular-node>
    • scp <node-remote-cluster>
    • ssh -L <regular-node>
    • ssh -L <node-remote-cluster>
  • Interact with a cluster using the Web UI @russjones

    • Connect to a Teleport node
    • Connect to an OpenSSH node
    • Check agent forwarding is correct based on role and proxy mode.
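
A couple of command sketches for items in the list above. First, a minimal sketch of the node invitation checks with tctl; the TTL, auth server address, and token placeholder are examples, and the exact join command is printed by tctl nodes add:

# generate a short-lived node join token on the auth server
tctl nodes add --ttl=5m --roles=node

# list outstanding invitation tokens and revoke one before it is used
tctl tokens ls
tctl tokens rm <token>

# on the node, join with either a static or a short-lived token
teleport start --roles=node --token=<token> --auth-server=auth.example.com:3025

Second, a minimal sketch of the MFA device checks; the device names are examples, and the expected result of removing the last device depends on the cluster's second factor configuration:

tsh mfa add          # follow the prompts to register an OTP device, e.g. "otp-1"
tsh mfa add          # repeat for a U2F device, e.g. "u2f-1"
tsh mfa ls           # both devices should be listed
tsh mfa rm otp-1     # remove the OTP device by name
tsh mfa rm u2f-1     # attempt to remove the last remaining device and note the behaviour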

Combinations @xacrimon

For some manual testing, many combinations need to be tested. For example, for
interactive sessions the 12 combinations (2 node types × 2 cluster locations × 3 clients) are below.

  • Connect to an OpenSSH node in a local cluster using OpenSSH.
  • Connect to an OpenSSH node in a local cluster using Teleport.
  • Connect to an OpenSSH node in a local cluster using the Web UI.
  • Connect to a Teleport node in a local cluster using OpenSSH.
  • Connect to a Teleport node in a local cluster using Teleport.
  • Connect to a Teleport node in a local cluster using the Web UI.
  • Connect to an OpenSSH node in a remote cluster using OpenSSH.
  • Connect to an OpenSSH node in a remote cluster using Teleport.
  • Connect to an OpenSSH node in a remote cluster using the Web UI.
  • Connect to a Teleport node in a remote cluster using OpenSSH.
  • Connect to a Teleport node in a remote cluster using Teleport.
  • Connect to a Teleport node in a remote cluster using the Web UI.

Teleport with EKS/GKE @awly

  • Deploy Teleport on a single EKS cluster
  • Deploy Teleport on two EKS clusters and connect them via trusted cluster feature
  • Deploy Teleport Proxy outside of the GKE cluster, fronting connections to it (use this script to generate a kubeconfig)
  • Deploy Teleport Proxy outside of the EKS cluster, fronting connections to it (use this script to generate a kubeconfig)

Teleport with multiple Kubernetes clusters @quinqu @awly

Note: you can use GKE, EKS, or minikube to run the Kubernetes clusters.
The only caveat is minikube: it is not reachable publicly, so don't run a proxy there.
A command sketch of the verification steps follows the list below.

  • Deploy combo auth/proxy/kubernetes_service outside of a Kubernetes cluster, using a kubeconfig
    • Login with tsh login, check that tsh kube ls has your cluster
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh
    • Verify that the audit log recorded the above request and session
  • Deploy combo auth/proxy/kubernetes_service inside of a Kubernetes cluster
    • Login with tsh login, check that tsh kube ls has your cluster
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh
    • Verify that the audit log recorded the above request and session
  • Deploy combo auth/proxy_service outside of the Kubernetes cluster and kubernetes_service inside of a Kubernetes cluster, connected over a reverse tunnel
    • Login with tsh login, check that tsh kube ls has your cluster
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh
    • Verify that the audit log recorded the above request and session
  • Deploy a second kubernetes_service inside of another Kubernetes cluster, connected over a reverse tunnel
    • Login with tsh login, check that tsh kube ls has both clusters
    • Switch to a second cluster using tsh kube login
    • Run kubectl get nodes, kubectl exec -it $SOME_POD -- sh on the new cluster
    • Verify that the audit log recorded the above request and session
  • Deploy combo auth/proxy/kubernetes_service outside of a Kubernetes cluster, using a kubeconfig with multiple clusters in it
    • Login with tsh login, check that tsh kube ls has all clusters
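
A rough sketch of the verification steps for each of the deployments above; the proxy address, user, and pod name are placeholders:

# log in and list the registered Kubernetes clusters
tsh login --proxy=proxy.example.com --user=<username>
tsh kube ls

# switch to a cluster, then run a one-off command and an interactive session
tsh kube login <kube-cluster-name>
kubectl get nodes
kubectl exec -it $SOME_POD -- sh

# the request and the exec session should then show up in the audit log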

Teleport with FIPS mode @russjones

  • Perform trusted clusters, Web, and SSH sanity checks with all Teleport components deployed in FIPS mode.

Migrations @fspmarshall @russjones

  • Migrate trusted clusters from 5.0 to 6.0.
    • Migrate the auth server on the main cluster, then the rest of the servers on the main cluster
      SSH should work for both the main cluster and the clusters still on the old version
    • Migrate the auth server on the remote cluster, then the rest of the remote cluster
      SSH should work

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy; to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a node in a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com
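
A couple of additional templates that may help with the session replay and scp items in the checklist; the session ID and file paths are examples:

# replay a recorded session by its ID
tsh --proxy=proxy.example.com --user=<username> play <session-id>

# copy a file to a node
tsh --proxy=proxy.example.com --user=<username> scp example.txt node.example.com:/tmp/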

Teleport with SSO Providers @andrejtokarcik @benarent

  • G Suite install instructions work
    • G Suite Screenshots are up to date
  • Active Directory install instructions work
    • Active Directory Screenshots are up to date
  • Okta install instructions work
    • Okta Screenshots are up to date
  • OneLogin install instructions work
    • OneLogin Screenshots are up to date
  • OIDC install instructions work
    • OIDC Screenshots are up to date

Teleport Plugins @benarent

  • Test receiving a message via Teleport Slackbot
  • Test receiving a new Jira Ticket via Teleport Jira

WEB UI @alex-kovoy @kimlisa

Main

For the main view, test with an admin role that has access to all resources.

Top Nav

  • Verify that cluster selector displays all (root + leaf) clusters
  • Verify that user name is displayed
  • Verify that user menu shows logout, help&support, and account settings

Side Nav

  • Verify that each item has an icon
  • Verify that Collapse/Expand works: the collapsed state shows the > icon and the expanded state shows the v icon
  • Verify that it automatically expands and highlights the item on page refresh

Servers aka Nodes

  • Verify that "Servers" table shows all joined nodes
  • Verify that "Connect" button shows a list of available logins
  • Verify that "Hostname", "Address" and "Labels" columns show the current values
  • Verify that "Search" by hostname, address, labels works
  • Verify that terminal opens when clicking on one of the available logins
  • Verify that clicking on Add Server button renders dialogue set to Automatically view
    • Verify clicking on Regenerate Script regenerates token value in the bash command
    • Verify using the bash command successfully adds the server (refresh server list)
    • Verify that clicking on Manually tab renders manual steps
    • Verify that clicking back to Automatically tab renders bash command

Applications

  • Verify that all apps registered are shown
  • Verify that clicking on the app icon takes you to another tab
  • Verify that clicking on Add Application button renders dialogue
    • Verify input validation (prevent empty value and invalid url)
    • Verify after input and clicking on Generate Script, bash command is rendered
    • Verify clicking on Regenerate button regenerates token value in bash command
    • Verify using the bash command successfully adds the application (refresh app list)

Active Sessions

  • Verify that "empty" state is handled
  • Verify that it displays the session when session is active
  • Verify that "Description", "Session ID", "Users", "Nodes" and "Duration" columns show correct values
  • Verify that the "OPTIONS" button allows joining a session

Audit log

  • Verify that time range button is shown and works
  • Verify that clicking on the Session Ended event icon takes the user to the session player
  • Verify that the event details dialogue renders when clicking on an event's details button
  • Verify that searching by type, description, and created (time) works

Access Requests

  1. Create a role with limited permissions (defined below as allow-roles). This role allows you to see the Roles screen and ssh into all nodes.
  2. Create another role with limited permissions (defined below as allow-users). This role's sessions expire in 4 minutes; it allows you to see the Users screen and denies access to all nodes.
  3. Create another role with no permissions other than being able to create requests (defined below as default)
  4. Create a user with the role default assigned
  5. Create a few requests under this user (a command-line sketch follows the verification list below):
  • Update the requests so that there is at least: one pending, two approved (one for each requestable role), and one denied
kind: role
metadata:
  name: allow-roles
spec:
  allow:
    logins:
    - root
    node_labels:
      '*': '*'
    rules:
    - resources:
      - role
      verbs:
      - list
      - read
  options:
    max_session_ttl: 8h0m0s
version: v3

kind: role
metadata:
  name: allow-users
spec:
  allow:
    rules:
    - resources:
      - user
      verbs:
      - list
      - read
  deny:
    node_labels:
      '*': '*'
  options:
    max_session_ttl: 4m0s
version: v3

kind: role
metadata:
  name: default
spec:
  allow:
    request:
      roles:
      - allow-roles
      - allow-users
    rules:
    - resources:
      - access_request
      verbs:
      - list
      - read
      - create
  options:
    max_session_ttl: 8h0m0s
version: v3
  • Verify that requests are shown and that correct states are applied to each request (pending, approved, denied)
  • Verify that creating a new request works
    • Verify that under requestable roles, only allow-roles and allow-users are listed
    • Verify input validation requires at least one role to be selected
  • Verify that assume buttons are only present for approved requests and for the logged-in user
    • Verify that assuming allow-roles allows you to see roles screen and ssh into nodes
    • Verify that after clicking on the assume button, it is disabled
    • After assuming allow-roles, verify that assuming allow-users allows you to see the Users screen and denies access to nodes.
      • Verify that after 4 minutes, the user is automatically logged out
  • Verify that after logging out (or getting logged out automatically) and logging back in, permissions are reset to default, and requests that are approved and not expired are assumable again.
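
A rough command-line sketch of driving the request states above; the role names match the specs earlier in this section, the user and proxy are placeholders, and the tctl requests commands are run as a cluster admin:

# as the user holding only the default role, request the two roles (each creates a pending request)
tsh login --proxy=proxy.example.com --user=<default-user> --request-roles=allow-roles
tsh login --proxy=proxy.example.com --user=<default-user> --request-roles=allow-users

# as an admin, move the requests into the states needed for the test
tctl requests ls
tctl requests approve <request-id>
tctl requests deny <request-id>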

Users

  • Verify that users are shown
  • Verify that creating a new user works
  • Verify that editing user roles works
  • Verify that removing a user works
  • Verify resetting a user's password works
  • Verify search by username, roles, and type works

Auth Connectors

  • Verify that creating OIDC/SAML/GITHUB connectors works
  • Verify that editing OIDC/SAML/GITHUB connectors works
  • Verify that error is shown when saving an invalid YAML
  • Verify that correct hint text is shown on the right side

Auth Connectors Card Icons

  • Verify that GITHUB card has github icon
  • Verify that SAML card has SAML icon
  • Verify that OIDC card has OIDC icon
  • Verify when there are no connectors, empty state renders

Roles

  • Verify that roles are shown
  • Verify that "Create New Role" dialog works
  • Verify that deleting and editing works
  • Verify that error is shown when saving an invalid YAML
  • Verify that correct hint text is shown on the right side

Managed Clusters

  • Verify that it displays a list of clusters (root + leaf)
  • Verify that every menu item works: nodes, apps, audit events, session recordings.

Help&Support

  • Verify that all URLs work and are correct (no 404s)

Access Request Waiting Room

Strategy Reason

Create the following role:

kind: role
metadata:
  name: restrict
spec:
  allow:
    request:
      roles:
      - <some other role to assign user after approval>
  options:
    max_session_ttl: 8h0m0s
    request_access: reason
    request_prompt: <some custom prompt to show in reason dialogue>
version: v3
  • Verify that after login, the reason dialogue is rendered with the prompt set to the request_prompt setting
  • Verify that after clicking send request, the pending dialogue renders
  • Verify that after tctl requests approve <request-id>, the dashboard is rendered
  • Verify the correct role was assigned

Strategy Always

With the previous role you created from Strategy Reason, change request_access to always:
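
A minimal sketch of making that change with tctl, assuming the role above was named restrict and is saved back to a local file:

# export the role, set spec.options.request_access to "always", then re-apply it
tctl get roles/restrict > restrict.yaml
# ... edit restrict.yaml ...
tctl create -f restrict.yaml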

  • Verify that after login, the pending dialogue is rendered
  • Verify that after tctl requests approve <request-id>, the dashboard is rendered
  • Verify that after login and tctl requests deny <request-id>, the access denied dialogue is rendered

Strategy Optional

With the previous role you created from Strategy Reason, change request_access to optional:

  • Verify after login, dashboard is rendered

Account

  • Verify that the Account screen is accessible from the user menu for local users.
  • Verify that changing a local password works (OTP, U2F)

Terminal

  • Verify that top nav has a user menu (Main and Logout)
  • Verify that switching between tabs works on alt+[1...9]

Node List Tab

  • Verify that Cluster selector works (URL should change too)
  • Verify that Quick launcher input works
  • Verify that Quick launcher input handles input errors
  • Verify that "Connect" button shows a list of available logins
  • Verify that "Hostname", "Address" and "Labels" columns show the current values
  • Verify that "Search" by hostname, address, labels works
  • Verify that new tab is created when starting a session

Session Tab

  • Verify that session and browser tabs both show the title with login and node name
  • Verify that terminal resize works
    • Install midnight commander on the node you ssh into: $ sudo apt-get install mc
    • Run the program: $ mc
    • Resize the terminal to see if panels resize with it
  • Verify that session tab shows/updates number of participants when a new user joins the session
  • Verify that tab automatically closes on "$ exit" command
  • Verify that SCP Upload works
  • Verify that SCP Upload handles invalid paths and network errors
  • Verify that SCP Download works
  • Verify that SCP Download handles invalid paths and network errors

Session Player

  • Verify that it can replay a session
  • Verify that when playing, the scroller auto-scrolls to the bottom-most content
  • Verify that when resizing the player to a small screen, the scroller appears and works
  • Verify that an error message is displayed (enter an invalid SID in the URL)

Invite Form

  • Verify that input validates
  • Verify that invite works with 2FA disabled
  • Verify that invite works with OTP enabled
  • Verify that invite works with U2F enabled
  • Verify that error message is shown if an invite is expired/invalid

Login Form

  • Verify that input validates
  • Verify that login works with 2FA disabled
  • Verify that login works with OTP enabled
  • Verify that login works with U2F enabled
  • Verify that login works for Github/SAML/OIDC
  • Verify that account is locked after several unsuccessful attempts
  • Verify that redirect to original URL works after successful login

RBAC

Create a role with no allow.rules defined (a sketch of creating it and a test user with tctl follows the verification list below):

kind: role
metadata:
  name: test
spec:
  allow:
    app_labels:
      '*': '*'
    logins:
    - root
    node_labels:
      '*': '*'
  options:
    max_session_ttl: 8h0m0s
version: v3
  • Verify that a user has access only to: "Servers", "Applications", "Active Sessions" and "Manage Clusters"
  • Verify there is no Add Server button in Server view
  • Verify there is no Add Application button in Applications view
  • Verify only Nodes and Apps are listed under options button in Manage Clusters
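
A minimal sketch of wiring the role above to a fresh test user; the file name, user name, and login are examples:

# save the role spec above to a file, create it, and create a user that only has this role
tctl create -f rbac-test-role.yaml
tctl users add rbac-tester --roles=test --logins=root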

Add the following under spec.allow.rules to enable read access to the audit log:

- resources:
  - event
  verbs:
  - list
  • Verify that the Audit Log and Session Recordings are accessible
  • Verify that playing a recorded session is denied

Add the following to enable read access to recorded sessions

- resources:
  - session
  verbs:
  - read
  • Verify that a user can re-play a session (session.end)

Add the following to enable read access to the roles

- resources:
  - role
  verbs:
  - list
  - read
  • Verify that a user can see the roles
  • Verify that a user cannot create/delete/update a role

Add the following to enable read access to the auth connectors

- resources:
  - auth_connector
  verbs:
  - list
  - read
  • Verify that a user can see the list of auth connectors.
  • Verify that a user cannot create/delete/update the connectors

Add the following to enable read access to users

- resources:
  - user
  verbs:
  - list
  - read
  • Verify that a user can access the "Users" screen
  • Verify that a user cannot create/delete/update a user

Add the following to enable read access to trusted clusters

- resources:
  - trusted_cluster
  verbs:
  - list
  - read
  • Verify that a user can access the "Trust" screen

  • Verify that a user cannot create/delete/update a trusted cluster.

  • Enterprise users have read/create access to access_request resources, regardless of the resource settings above

Performance/Soak Test @fspmarshall @a-palchikov @quinqu

Using the tsh bench tool, perform the soak tests and benchmark tests on the following configurations:

  • Cluster with 10K nodes in normal (non-IOT) node mode with ETCD

  • Cluster with 10K nodes in normal (non-IOT) mode with DynamoDB

  • Cluster with 1K IOT nodes with ETCD

  • Cluster with 1K IOT nodes with DynamoDB

  • Cluster with 500 trusted clusters with ETCD

  • Cluster with 500 trusted clusters with DynamoDB

Soak Tests @fspmarshall @a-palchikov @quinqu

Run a 4-hour soak test with a mix of interactive/non-interactive sessions:

tsh bench --duration=4h user@teleport-monster-6757d7b487-x226b ls
tsh bench -i --duration=4h user@teleport-monster-6757d7b487-x226b ps uax

Observe Prometheus metrics for goroutines, open files, RAM, CPU, and timers, and make sure there are no leaks.

  • Verify that Prometheus metrics are accurate.
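
One way to spot-check those metrics, assuming the auth and proxy processes were started with a diagnostics address such as --diag-addr=127.0.0.1:3000:

curl -s http://127.0.0.1:3000/metrics | grep -E 'go_goroutines|process_open_fds|process_resident_memory_bytes'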

Breaking load tests @fspmarshall @a-palchikov @quinqu

Load the system to capacity with tsh bench and publish the maximum number of concurrent sessions for interactive
and non-interactive tsh bench loads.
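
A rough sketch of stepping up the load, assuming tsh bench accepts a --rate flag for requests per second; the numbers are arbitrary starting points to be increased until sessions begin to fail:

tsh bench --rate=50 --duration=30m user@teleport-monster-6757d7b487-x226b ls
tsh bench -i --rate=50 --duration=30m user@teleport-monster-6757d7b487-x226b ps uax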

Teleport with Cloud Providers

AWS @Joerger @webvictim

GCP @webvictim

  • Deploy Teleport to GCP, using Cloud Firestore & Cloud Storage.
  • Deploy Teleport to GKE (Google Kubernetes Engine).
  • Deploy Teleport Enterprise to GCP.

IBM @webvictim

  • Deploy Teleport to IBM Cloud, using IBM Database for etcd & IBM Object Store.
  • Deploy Teleport to IBM Cloud Kubernetes.
  • Deploy Teleport Enterprise to IBM Cloud.

Application Access @russjones @r0mant

  • Run an application within local cluster.
    • Verify that the debug application (debug_app: true) works.
    • Verify an application can be configured with command line flags.
    • Verify an application can be configured from file configuration.
    • Verify that applications are available at the auto-generated addresses (name.rootProxyPublicAddr) as well as at publicAddr.
  • Run an application within a trusted cluster.
    • Verify that applications are available at auto-generated addresses name.rootProxyPublicAddr.
  • Verify Audit Records.
    • app.session.start and app.session.chunk events are created in the Audit Log.
    • app.session.chunk points to a 5 minute session archive with multiple app.session.request events inside.
    • tsh play <chunk-id> can fetch and print a session chunk archive.
  • Verify JWT using verify-jwt.go.
  • Verify RBAC.

Database Access @r0mant

  • Connect to a database within a local cluster.
    • Self-hosted Postgres.
    • Self-hosted MySQL.
    • AWS Aurora Postgres.
    • AWS Aurora MySQL.
  • Connect to a database within a remote cluster via a trusted cluster.
    • Self-hosted Postgres.
    • Self-hosted MySQL.
    • AWS Aurora Postgres.
    • AWS Aurora MySQL.
  • Verify audit events.
    • db.session.start is emitted when you connect.
    • db.session.end is emitted when you disconnect.
    • db.session.query is emitted when you execute a SQL query.
  • Verify RBAC.
    • tsh db ls shows only databases matching role's db_labels.
    • Can only connect as users from db_users.
    • (Postgres only) Can only connect to databases from db_names.
    • db.session.start is emitted when connection attempt is denied.
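
A rough sketch of the client-side checks above, using the tsh db subcommands shipped in 6.0; the database name is an example:

# list databases allowed by the role's db_labels
tsh db ls

# retrieve certificates for a particular database
tsh db login postgres-prod

# then connect with the native psql/mysql client following the instructions tsh prints,
# using a database user from db_users and (for Postgres) a database name from db_names
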
@russjones russjones added the bug label Feb 22, 2021
@russjones russjones added this to the 6.0 "San Diego" milestone Feb 22, 2021
@webvictim webvictim added test-plan A list of tasks required to ship a successful product release. and removed bug labels Feb 24, 2021

andrejtokarcik commented Feb 24, 2021

Minor issue with etcd support: the (inadvisable) insecure option doesn't seem to work at all. My understanding is that it should cause the tls_ca_file-related checks to be skipped. However, the tls_ca_file is still parsed as a certificate, resulting in an error immediately upon startup (with insecure: true and no tls_ca_file in teleport.yaml):

DEBU [SQLITE]    Connected to: file:.data/proc/sqlite.db?_busy_timeout=10000&_sync=OFF, poll stream period: 1s lite/lite.go:172
DEBU [SQLITE]    Synchronous: 0, busy timeout: 10000 lite/lite.go:217
DEBU [KEYGEN]    SSH cert authority is going to pre-compute 25 keys. native/native.go:99
DEBU [PROC:1]    Using etcd backend. service/service.go:3073

ERROR REPORT:
Original Error: *trace.BadParameterError missing PEM encoded block
Stack Trace:
	/go/src/github.com/gravitational/teleport/lib/tlsca/parsegen.go:158 github.com/gravitational/teleport/lib/tlsca.ParseCertificatePEM
	/go/src/github.com/gravitational/teleport/lib/backend/etcdbk/etcd.go:331 github.com/gravitational/teleport/lib/backend/etcdbk.(*EtcdBackend).reconnect
	/go/src/github.com/gravitational/teleport/lib/backend/etcdbk/etcd.go:227 github.com/gravitational/teleport/lib/backend/etcdbk.New
	/go/src/github.com/gravitational/teleport/lib/service/service.go:3086 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthStorage
	/go/src/github.com/gravitational/teleport/lib/service/service.go:1010 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthService
	/go/src/github.com/gravitational/teleport/lib/service/service.go:694 github.com/gravitational/teleport/lib/service.NewTeleport
	/go/src/github.com/gravitational/teleport/e/lib/pro/process.go:41 github.com/gravitational/teleport/e/lib/pro.NewTeleport
	/go/src/github.com/gravitational/teleport/e/tool/teleport/main.go:35 main.run.func1
	/go/src/github.com/gravitational/teleport/lib/service/service.go:442 github.com/gravitational/teleport/lib/service.Run
	/go/src/github.com/gravitational/teleport/e/tool/teleport/main.go:43 main.run
	/go/src/github.com/gravitational/teleport/e/tool/teleport/main.go:27 main.main
	/opt/go/src/runtime/proc.go:204 runtime.main
	/opt/go/src/runtime/asm_amd64.s:1374 runtime.goexit
User Message: initialization failed
	failed to parse CA certificate
		missing PEM encoded block

@webvictim

Found a regression in scp: #5695


Joerger commented Feb 24, 2021

tsh join and tsh play don't work in proxy recording mode - #5702
I found some messages in Slack saying this may be intended, and this issue from an old test cycle: #3913. Looks like the test plan needs to be updated, unless this is an intended rite of passage...


r0mant commented Feb 25, 2021

Re: the tsh play use case for application access: replaying a local archive with tsh play --format=json /path/to/tar works; tsh play <chunk-id> does not, but there's an open issue (#4943) about this, so it looks like it just wasn't implemented.

@klizhentas

two-auth     | INFO [RBAC]      Access to read db_server in namespace default denied to roles RemoteProxy,default-implicit-role: no allow rule matched. services/role.go:2083

@r0mant I have OSS trusted clusters; that's what I see after the root cluster upgrade (but not the leaf).


r0mant commented Feb 25, 2021

@klizhentas Good catch, this looks like it's because the version check that determines the "old proxy" and sets up caches needs to be updated to account for the new version. Submitted a PR: #5709. cc @russjones


xacrimon commented Feb 25, 2021

Submitted #5693, which fixes:

  • utmp/wtmp support not working on systems where _PATH_UTMP from "types.h" is a symlink.
  • The wrong hostname being written to the utmp entry.
  • The utmp entry not being properly cleared on some distros (had to spin up some VMs to reproduce), since interpretation of the file varies somewhat between different tools.

@klizhentas

@russjones tested upgrades from 5.1.2 to 6.0, both for OSS and Enterprise, with and without trusted clusters. I will write the migration guide.

@xacrimon

#5766 needs to be reviewed & approved, merged, and backported.

@fspmarshall

ETCD (non-IoT)

1K: [graph: 1k-etcd]

10K: [graph: 10k-etcd]

Soak

tsh bench --duration=4h root@loadtest-96994f586-6b5zq ls

* Requests originated: 143977
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         2065 ms           
50         3465 ms           
75         5979 ms           
90         8239 ms           
95         9591 ms           
99         15071 ms          
100        63871 ms
tsh bench --interactive --duration=4h root@loadtest-96994f586-rb455 ps uax

* Requests originated: 143985
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         2075 ms           
50         3449 ms           
75         5863 ms           
90         8035 ms           
95         9431 ms           
99         15287 ms          
100        64959 ms

Notes

  • Response times for the soak test may represent a regression; requires some additional investigation to be certain.

  • Auth server memory usage did not fully return to pre-scale-up levels. Repeated scale-up/scale-down did not continually grow this usage though, so this appears to be a one-time growth rather than a per-scale-up memory leak.


quinqu commented Mar 2, 2021

Scaling up and down with 1K IOT nodes with DynamoDB

[screenshot: 2021-03-02 07-58-52]

Notes

  • Each time I scaled up, the auth server "held onto" previously added nodes; that could be why goroutines, heap, and open file descriptors kept slightly increasing
  • Memory didn't go back down to the baseline and kept increasing with every scale up/down


quinqu commented Mar 3, 2021

Soak

tsh bench --duration 4h root@host ls

* Requests originated: 144000
* Requests failed: 0
Histogram
Percentile Response Duration 
---------- ----------------- 
25         96 ms             
50         99 ms             
75         114 ms            
90         120 ms            
95         127 ms            
99         332 ms            
100        6811 ms 

tsh bench --interactive --duration 4h root@host ps uax

* Requests originated: 143999
* Requests failed: 63
* Last error: failed to authenticate with proxy host:3023: dial tcp: lookup host on 172.31.0.2:53: read udp 172.31.2.94:56467->172.31.0.2:53: i/o timeout
Histogram
Percentile Response Duration 
---------- ----------------- 
25         105 ms            
50         110 ms            
75         127 ms            
90         134 ms            
95         140 ms            
99         347 ms            
100        6215 ms     


fspmarshall commented Mar 3, 2021

ETCD (500 Trusted Clusters)

[graph: 500-tc-etcd]

Note the initial drop in resource consumption around the 21:48 mark. That is the point at which the 500 trusted clusters were taken offline. As you can see, the proxies failed to clean up ~25% of the related goroutines and ~60% of the related heap memory. The second drop around the 21:58 mark is when I manually deleted the remote_cluster resources from the root's backend. This caused the remaining goroutines to be cleaned up, and more than half of the remaining heap memory to be released.

In order to determine if the heap memory increase was an ongoing leak or a one-time capacity increase, I cycled the clusters two more times. After the second cycle, resting (post rc deletion) memory increased by another ~10%. After the third cycle, it returned to the approximate amount seen after the first cycle. I interpret this to mean that whatever the problem is, it isn't as simple as an ever-growing set of clusters being held somewhere in memory (plus side, churn is unlikely to cause memory use to grow indefinitely).

After the initial cycling, attempting to ssh into root cluster nodes started to result in unexpected end of JSON input errors on client and remote cluster "<cluster-name>" is not found regular/sshserver.go:1568 errors on the proxy. My hunch here is that this indicates that the reversetunnel system is falling out of sync with the remote_cluster resources that exist in the backend somehow, which would explain both the lingering heap memory usage, and not found errors for clusters which have already been removed.

edit: Looks like this is not a 6.0 regression, but rather an issue that we missed in 5.0. Requires further investigation to discover the root cause, but the errors are definitely being triggered by reversetunnel.Server holding onto tunnels after their associated remote_cluster resource has been destroyed and all connections closed.


quinqu commented Mar 4, 2021

Soak test with Teleport 5.0 and new bench changes

tsh bench  --interactive -d  --duration 4h root@host ps uax
* Requests originated: 144000
* Requests failed: 0
Histogram
Percentile Response Duration 
---------- ----------------- 
25         105 ms            
50         111 ms            
75         130 ms            
90         143 ms            
95         164 ms            
99         17631 ms          
100        65535 ms   
tsh bench -d  --duration 4h root@host ls
* Requests originated: 144000
* Requests failed: 0
Histogram
Percentile Response Duration 
---------- ----------------- 
25         96 ms             
50         102 ms            
75         118 ms            
90         130 ms            
95         151 ms            
99         17215 ms          
100        65535 ms 

It seems like 6.0 mostly improved in the 99th percentile (compare with the 6.0 soak test results above).

@fspmarshall

ETCD (IoT)

1K: [graph: 1k-etcd-iot]

10K: [graph: 10k-etcd-iot]

Soak

tsh bench --duration=4h root@loadtest-96994f586-crv6q ls

* Requests originated: 143982
* Requests failed: 308
* Last error: timeout

Histogram

Percentile Response Duration 
---------- ----------------- 
25         2279 ms           
50         3885 ms           
75         6383 ms           
90         8567 ms           
95         10647 ms          
99         22975 ms          
100        65375 ms
tsh bench --interactive --duration=4h root@loadtest-96994f586-h6snb ps uax

* Requests originated: 143983
* Requests failed: 306
* Last error: timeout

Histogram

Percentile Response Duration 
---------- ----------------- 
25         2277 ms           
50         3903 ms           
75         6463 ms           
90         8655 ms           
95         10623 ms          
99         23103 ms          
100        65471 ms

Notes

  • Response times for the soak test differ from numbers seen previously, but re-running with old Teleport versions gave me similar numbers. Likely a problem with some other part of my infra rather than Teleport itself.

  • Proxy memory usage did not fully return to pre-scale-up levels. Repeated scale-up/scale-down in combination with manually triggering cleanup logic with tsh ssh did not continually grow this usage though, so this appears to be a one-time growth rather than a per-scale-up memory leak.
