From 33031d87f5cacb0a74daac43ec749f31d04c9922 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 3 Apr 2024 18:36:44 -0400 Subject: [PATCH 01/84] Create 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 271 ++++++++++++++++++++++++++ 1 file changed, 271 insertions(+) create mode 100644 rfd/0169-auto-updates-linux-agents.md diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md new file mode 100644 index 0000000000000..110df9195ae70 --- /dev/null +++ b/rfd/0169-auto-updates-linux-agents.md @@ -0,0 +1,271 @@ +--- +authors: Stephen Levine (stephen.levine@goteleport.com) +state: draft +--- + +# RFD 0169 - Automatic Updates for Linux Agents + +## Required Approvers + +* Engineering: @rjones && @bernardjkim +* Security: @reed + +## What + +This RFD proposes a new mechanism for Teleport agents installed on Linux servers to automatically update to a version set by an operator via tctl. + +The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs: +- Analogous adjustments for Teleport agents installed on Kubernetes +- Phased rollouts of new agent versions for agents connected to an existing cluster +- Signing of agent artifacts via TUF +- Teleport Cloud APIs for updating agents + +This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. + +Additionally, this RFD parallels the auto-update functionality for client tools proposed in https://github.com/gravitational/teleport/pull/39805. + +## Why + +The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users. + +1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades or confusing command output. +2. The use of system package management requires complex logic for each target distribution. +3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. +4. The use of bash to implement the updater makes changes difficult and prone to error. +5. The existing auto-updater has limited automated testing. +6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. +7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). +8. The rollout plan for the new agent version is not fully-configurable using tctl. +9. Agent installation logic is spread between the auto-updater script, install script, auto-discovery script, and documentation. +10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. +11. The existing auto-updater is not self-updating. +12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). + +We must provide a seamless, hands-off experience for auto-updates that is easy to maintain. + +## Details + +We will ship a new auto-updater package written in Go that does not interface with the system package manager. +It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually. +It will read the unauthenticated `/v1/webapi/ping` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. 
+It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. + +### Installation + +```shell +$ apt-get install teleport-ent-updater +$ teleport-update enable --proxy example.teleport.sh + +# if not enabled already, configure teleport and: +$ systemctl enable teleport +``` + +### API + +#### Endpoints + +`/v1/webapi/ping` +```json +{ + "agent_version": "15.1.1", + "agent_auto_update": true, + "agent_update_after": "2024-04-23T18:00:00.000Z", + "agent_update_jitter": 10, +} +``` +Notes: +- Critical updates are achieved by serving `agent_update_after` with the current time. +- The Teleport proxy translates upgrade hours (below) into a specific time after which all agents should be upgraded. +- If an agent misses an upgrade window, it will always update immediately. + +#### Teleport Resources + +```yaml +kind: cluster_maintenance_config +spec: + # agent_auto_update allows turning agent updates on or off at the + # cluster level. Only turn agent automatic updates off if self-managed + # agent updates are in place. + agent_auto_update: on|off + # agent_update_hour sets the hour in UTC at which clients should update their agents. + # The value -1 will set the upgrade time to the current time, resulting in immediate upgrades. + agent_update_hour: -1-23 + # agent_update_jitter sets a duration in which the upgrade will occur after the hour. + # The agent upgrader will pick a random time within this duration in which to upgrade. + agent_update_jitter: 0-MAXINT64 + + [...] +``` +``` +$ tctl autoupdate update --set-agent-auto-update=off +Automatic updates configuration has been updated. +$ tctl autoupdate update --set-agent-update-hour=3 +Automatic updates configuration has been updated. +$ tctl autoupdate update --set-agent-update-jitterr=600 +Automatic updates configuration has been updated. +``` + +```yaml +kind: autoupdate_version +spec: + # agent_version is the version of the agent the cluster will advertise. + # Can be auto (match the version of the proxy) or an exact semver formatted + # version. + agent_version: auto|X.Y.Z + + [...] +``` +``` +$ tctl autoupdate update --set-agent-version=15.1.1 +Automatic updates configuration has been updated. +``` + +Notes: +- These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. + +Questions: +- Should we use a time-only format for specifying the update hour? E.g., `agent_update_time: "18:00:00.000+01` + This would allow users to set an exact time via the CLI, instead of restricting to hours. + +### Filesystem + +``` +$ tree /var/lib/teleport +/var/lib/teleport +└── versions + ├── 15.0.0 + │ ├── bin + │ │ ├── ... + │ │ ├── teleport-updater + │ │ └── teleport + │ └── etc + │ ├── ... + │ └── systemd + │ └── teleport.service + ├── 15.1.1 + │ ├── bin + │ │ ├── ... + │ │ ├── teleport-updater + │ │ └── teleport + │ └── etc + │ ├── ... 
+ │ └── systemd + │ └── teleport.service + └── updates.yaml +$ ls -l /usr/local/bin/teleport +/usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport +$ ls -l /usr/local/bin/teleport +/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/15.0.0/bin/teleport-updater +$ ls -l /usr/local/lib/systemd/system/teleport.service +/usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service +``` + +updates.yaml: +``` +version: v1 +proxy: mytenant.teleport.sh +enabled: true +active_version: 15.1.1 +``` + +### Runtime + +The agent-updater will run as a periodically executing systemd service which runs every 10 minutes. +The systemd service will run: +```shell +$ teleport-updater update +``` + +After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-updater` command: +```shell +$ teleport-updater enable --proxy mytenant.teleport.sh +``` + +If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. + +On servers without Teleport installed already, the `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. +It will also run update teleport immediately, to ensure that subsequent executions succeed. + +The `enable` subcommand will: +1. Configure `updates.yaml` with the current proxy address and set `enabled` to true. +2. Query the `/v1/webapi/ping` endpoint. +3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. +4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12). +5. Download the desired Teleport tarball specified by `agent_version`. +6. Verify the checksum. +7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. +8. Replace any existing binaries or symlinks with symlinks to the current version. +9. Restart the agent if the systemd service is already enabled. +10. Set `active_version` in `updates.yaml` if successful or not enabled. +11. Replace the old symlinks or binaries and quit (exit 1) if unsuccessful. +12. Remove any `teleport` package if installed. +13. Verify the symlinks to the active version still exists. +14. Remove all stored versions of the agent except the current version and last working version. + +The `disable` subcommand will: +1. Configure `updates.yaml` to set `enabled` to false. + +When `update` subcommand is otherwise executed, it will: +1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. +2. Query the `/v1/webapi/ping` endpoint. +3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. +4. If the current version of Teleport is the latest, quit. +5. Wait `random(0, agent_update_jitter)` seconds. +6. Download the desired Teleport tarball specified by `agent_version`. +7. Verify the checksum. +8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. +9. Update symlinks to point at the new version. +10. Restart the agent if the systemd service is already enabled. +11. Set `active_version` in `updates.yaml` if successful or not enabled. +12. Replace the old symlink or binary and quit (exit 1) if unsuccessful. +13. Remove all stored versions of the agent except the current version and last working version. 
+
+To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different.
+The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios.
+
+### Manual Workflow
+
+For use cases that fall outside of the functionality provided by `teleport-updater`, such as JamF or ansible-controlled updates, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint.
+
+Cluster administrators that want to self-manage client tools updates will be
+able to get and watch for changes to agent versions which can then be
+used to trigger other integrations to update the installed version of agents.
+
+```shell
+$ tctl autoupdate watch
+{"agent_version": "1.0.0"}
+{"agent_version": "1.0.1"}
+{"agent_version": "2.0.0"}
+[...]
+```
+
+```shell
+$ tctl autoupdate get
+{"agent_version": "2.0.0"}
+```
+
+### Scripts
+
+All scripts will install the latest updater and run `teleport-updater enable` with the proxy address.
+
+Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport.
+
+This is out-of-scope for this proposal.
+
+## Security
+
+The initial version of automatic updates will rely on TLS to establish
+connection authenticity to the Teleport download server. The authenticity of
+assets served from the download server is out of scope for this RFD. Cluster
+administrators concerned with the authenticity of assets served from the
+download server can use self-managed updates with system package managers which
+are signed.
+
+The Update Framework (TUF) will be used to implement secure updates in the future.
+
+## Execution Plan
+
+1. Implement new auto-updater in Go.
+2. Prep documentation changes.
+3. Release new updater via teleport-ent-updater package.
+4. Release documentation changes.
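Both the `enable` and `update` flows verify a checksum after download, and the Security section above relies on TLS plus that digest check until TUF support lands. A sketch of the verification, assuming a hex-encoded SHA-256 digest is published alongside each tarball; the helper and paths are illustrative:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"os"
)

// verifyTarball compares the SHA-256 digest of the downloaded tarball against
// the expected hex digest fetched over TLS from the download server.
func verifyTarball(path, expectedHex string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if hex.EncodeToString(h.Sum(nil)) != expectedHex {
		return errors.New("checksum mismatch: refusing to install tarball")
	}
	return nil
}

func main() {
	// Placeholder path and digest; real values come from the download step.
	if err := verifyTarball("/tmp/teleport-v15.1.1-linux-amd64-bin.tar.gz", "0000"); err != nil {
		fmt.Println(err)
	}
}
```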
From c4531240f7863417f1a1f54d59e9daf15344913f Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 3 Apr 2024 18:40:03 -0400 Subject: [PATCH 02/84] Fix github handle --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 110df9195ae70..db5639e6eb8af 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -7,7 +7,7 @@ state: draft ## Required Approvers -* Engineering: @rjones && @bernardjkim +* Engineering: @russjones && @bernardjkim * Security: @reed ## What From 796fa9e583f4e3d86dce63d3847ff60c635759cd Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 3 Apr 2024 18:40:36 -0400 Subject: [PATCH 03/84] Fix Github handle --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index db5639e6eb8af..04f1d55512dd5 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -8,7 +8,7 @@ state: draft ## Required Approvers * Engineering: @russjones && @bernardjkim -* Security: @reed +* Security: @reedloden ## What From 1b759418b31f1a06f68ef99d0df67867ec3c83c2 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 16:32:09 -0400 Subject: [PATCH 04/84] Clarify jitter flag --- rfd/0169-auto-updates-linux-agents.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 04f1d55512dd5..576a3d61116e9 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -70,7 +70,7 @@ $ systemctl enable teleport "agent_version": "15.1.1", "agent_auto_update": true, "agent_update_after": "2024-04-23T18:00:00.000Z", - "agent_update_jitter": 10, + "agent_update_jitter_seconds": 10, } ``` Notes: @@ -90,9 +90,9 @@ spec: # agent_update_hour sets the hour in UTC at which clients should update their agents. # The value -1 will set the upgrade time to the current time, resulting in immediate upgrades. agent_update_hour: -1-23 - # agent_update_jitter sets a duration in which the upgrade will occur after the hour. + # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. # The agent upgrader will pick a random time within this duration in which to upgrade. - agent_update_jitter: 0-MAXINT64 + agent_update_jitter_seconds: 0-MAXINT64 [...] ``` @@ -101,7 +101,7 @@ $ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-update-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-update-jitterr=600 +$ tctl autoupdate update --set-agent-update-jitter-seconds=600 Automatic updates configuration has been updated. ``` @@ -210,7 +210,7 @@ When `update` subcommand is otherwise executed, it will: 2. Query the `/v1/webapi/ping` endpoint. 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. -5. Wait `random(0, agent_update_jitter)` seconds. +5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Download the desired Teleport tarball specified by `agent_version`. 7. Verify the checksum. 8. 
Extract the tarball to `/var/lib/teleport/versions/VERSION`. From e2811de6ceb66b1aaecedb039f8684fd7dd234e3 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 16:33:09 -0400 Subject: [PATCH 05/84] Remove time question --- rfd/0169-auto-updates-linux-agents.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 576a3d61116e9..25ae473846d97 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -123,10 +123,6 @@ Automatic updates configuration has been updated. Notes: - These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. -Questions: -- Should we use a time-only format for specifying the update hour? E.g., `agent_update_time: "18:00:00.000+01` - This would allow users to set an exact time via the CLI, instead of restricting to hours. - ### Filesystem ``` From a119c60951ea9ae83426869b5c68140f7bffd699 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 22:16:23 -0400 Subject: [PATCH 06/84] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: Russell Jones --- rfd/0169-auto-updates-linux-agents.md | 1 + 1 file changed, 1 insertion(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 25ae473846d97..2311867e11e37 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -8,6 +8,7 @@ state: draft ## Required Approvers * Engineering: @russjones && @bernardjkim +* Product: @klizhentas || @xinding33 * Security: @reedloden ## What From 2a8cdc7bd7e55f53811b2d367a6e8bd2bb7e7764 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 22:19:47 -0400 Subject: [PATCH 07/84] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: Russell Jones --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2311867e11e37..5c1d0dedc9c2c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -224,7 +224,7 @@ The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid ree For use cases that fall outside of the functionality provided by `teleport-updater`, such as JamF or ansible-controlled updates, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. -Cluster administrators that want to self-manage client tools updates will be +Cluster administrators that want to self-manage agent updates will be able to get and watch for changes to agent versions which can then be used to trigger other integrations to update the installed version of agents. 
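Operators who self-manage updates are not limited to tctl; the same data is available by polling the unauthenticated endpoint. A rough Go poller that approximates `tctl autoupdate watch`, with the interval and trigger left as placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// fetchVersion reads the advertised agent version from the proxy.
func fetchVersion(proxy string) (string, error) {
	resp, err := http.Get("https://" + proxy + "/v1/webapi/ping")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var p struct {
		AgentVersion string `json:"agent_version"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&p); err != nil {
		return "", err
	}
	return p.AgentVersion, nil
}

func main() {
	var last string
	for range time.Tick(10 * time.Minute) {
		v, err := fetchVersion("mytenant.teleport.sh")
		if err != nil || v == last {
			continue // transient error or no change; try again next tick
		}
		last = v
		fmt.Println("new agent version advertised:", v) // trigger JamF/Ansible here
	}
}
```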
From 1f3278d6f34f80b08806792c086a44b691123bb5 Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Thu, 4 Apr 2024 22:23:53 -0400
Subject: [PATCH 08/84] Update rfd/0169-auto-updates-linux-agents.md

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
---
 rfd/0169-auto-updates-linux-agents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 5c1d0dedc9c2c..6d2c770a072ae 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -151,7 +151,7 @@ $ tree /var/lib/teleport
 └── updates.yaml
 $ ls -l /usr/local/bin/teleport
 /usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport
-$ ls -l /usr/local/bin/teleport
+$ ls -l /usr/local/bin/teleport-updater
 /usr/local/bin/teleport-updater -> /var/lib/teleport/versions/15.0.0/bin/teleport-updater
 $ ls -l /usr/local/lib/systemd/system/teleport.service
 /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service

From 05aad9262f9dc581a0e2bef6102b15976b83fa4b Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Fri, 5 Apr 2024 12:00:04 -0400
Subject: [PATCH 09/84] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 6d2c770a072ae..bb760237234bc 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -160,9 +160,14 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service
 updates.yaml:
 ```
 version: v1
-proxy: mytenant.teleport.sh
-enabled: true
-active_version: 15.1.1
+kind: agent_versions
+spec:
+  # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from.
+  proxy: mytenant.teleport.sh
+  # enabled specifies whether auto-updates are enabled, i.e., whether teleport-updater update is allowed to update the agent.
+  enabled: true
+  # active_version specifies the active (symlinked) deployment of the teleport agent.
+  active_version: 15.1.1
 ```

From ed4780dbfcede6c18d958e134810594153500903 Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Fri, 5 Apr 2024 12:24:09 -0400
Subject: [PATCH 10/84] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index bb760237234bc..336a488d25a81 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -227,7 +227,8 @@ The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid ree

 ### Manual Workflow

-For use cases that fall outside of the functionality provided by `teleport-updater`, such as JamF or ansible-controlled updates, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint.
+For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint.
+This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or Ansible).
Cluster administrators that want to self-manage agent updates will be able to get and watch for changes to agent versions which can then be @@ -235,15 +236,15 @@ used to trigger other integrations to update the installed version of agents. ```shell $ tctl autoupdate watch -{"agent_version": "1.0.0"} -{"agent_version": "1.0.1"} -{"agent_version": "2.0.0"} +{"agent_version": "1.0.0", ... } +{"agent_version": "1.0.1, ... } +{"agent_version": "2.0.0", ... } [...] ``` ```shell $ tctl autoupdate get -{"agent_version": "2.0.0"} +{"agent_version": "2.0.0", ... } ``` ### Scripts From 5bb60564dec5a2091bbfc85ed40d74a8ed59bb92 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 5 Apr 2024 12:55:53 -0400 Subject: [PATCH 11/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 336a488d25a81..83f7ecef285f2 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -225,6 +225,19 @@ When `update` subcommand is otherwise executed, it will: To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. +To retrieve known information about agent upgrades, the `status` subcommand will return the following: +```json +{ + "agent_version_installed": "15.1.1", + "agent_version_desired": "15.1.2", + "agent_version_previous": "15.1.0", + "update_time_next": "2020-12-09T16:09:53+00:00", + "update_time_last": "2020-12-10T16:00:00+00:00", + "update_time_jitter": 600, + "updates_enabled": true +} +``` + ### Manual Workflow For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. From 74a452ebe01dd298e484d87fe7784cf883faba47 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 5 Apr 2024 15:38:41 -0400 Subject: [PATCH 12/84] add editions --- rfd/0169-auto-updates-linux-agents.md | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 83f7ecef285f2..476facee68a15 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -68,6 +68,7 @@ $ systemctl enable teleport `/v1/webapi/ping` ```json { + "server_edition": "enterprise", "agent_version": "15.1.1", "agent_auto_update": true, "agent_update_after": "2024-04-23T18:00:00.000Z", @@ -78,6 +79,7 @@ Notes: - Critical updates are achieved by serving `agent_update_after` with the current time. - The Teleport proxy translates upgrade hours (below) into a specific time after which all agents should be upgraded. - If an agent misses an upgrade window, it will always update immediately. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. #### Teleport Resources @@ -193,7 +195,7 @@ The `enable` subcommand will: 2. Query the `/v1/webapi/ping` endpoint. 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. 4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12). -5. 
Download the desired Teleport tarball specified by `agent_version`. +5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Verify the checksum. 7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 8. Replace any existing binaries or symlinks with symlinks to the current version. @@ -213,7 +215,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Download the desired Teleport tarball specified by `agent_version`. +6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 7. Verify the checksum. 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Update symlinks to point at the new version. @@ -231,10 +233,13 @@ To retrieve known information about agent upgrades, the `status` subcommand will "agent_version_installed": "15.1.1", "agent_version_desired": "15.1.2", "agent_version_previous": "15.1.0", - "update_time_next": "2020-12-09T16:09:53+00:00", - "update_time_last": "2020-12-10T16:00:00+00:00", - "update_time_jitter": 600, - "updates_enabled": true + "agent_edition_installed": "enterprise", + "agent_edition_desired": "enterprise", + "agent_edition_previous": "enterprise", + "agent_update_time_next": "2020-12-09T16:09:53+00:00", + "agent_update_time_last": "2020-12-10T16:00:00+00:00", + "agent_update_time_jitter": 600, + "agent_updates_enabled": true } ``` @@ -249,15 +254,15 @@ used to trigger other integrations to update the installed version of agents. ```shell $ tctl autoupdate watch -{"agent_version": "1.0.0", ... } -{"agent_version": "1.0.1, ... } -{"agent_version": "2.0.0", ... } +{"agent_version": "1.0.0", "agent_edition": "enterprise", ... } +{"agent_version": "1.0.1, "agent_edition": "enterprise", ... } +{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } [...] ``` ```shell $ tctl autoupdate get -{"agent_version": "2.0.0", ... } +{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } ``` ### Scripts From 63c9a351df6e084e9a2478f1dc8ab821d3956364 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:12:52 -0400 Subject: [PATCH 13/84] Installers and docs --- rfd/0169-auto-updates-linux-agents.md | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 476facee68a15..0fb526e7609c0 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -265,13 +265,29 @@ $ tctl autoupdate get {"agent_version": "2.0.0", "agent_edition": "enterprise", ... } ``` -### Scripts +### Installers -All scripts will install the latest updater and run `teleport-updater enable` with the proxy address. 
+The following install scripts will install the latest updater and run `teleport-updater enable` with the proxy address: + +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport. -This is out-of-scope for this proposal. +Moving additional logic into the upgrader is out-of-scope for this proposal. + +### Documentation + +The following documentation will need to be updated to cover the new upgrader workflow: +- https://goteleport.com/docs/choose-an-edition/teleport-cloud/downloads +- https://goteleport.com/docs/installation +- https://goteleport.com/docs/upgrading/self-hosted-linux +- https://goteleport.com/docs/upgrading/self-hosted-automatic-agent-updates + +Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions. ## Security From a0a912f78a633384fa40c997d29b40be208e668e Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:13:36 -0400 Subject: [PATCH 14/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 0fb526e7609c0..4bfb444a60b1e 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -267,8 +267,7 @@ $ tctl autoupdate get ### Installers -The following install scripts will install the latest updater and run `teleport-updater enable` with the proxy address: - +The following install scripts will be updated to install the latest updater and run `teleport-updater enable` with the proxy address: - https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl - https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl - https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh From 6371c82ecb050863f1383c044af1b6cfb0b60c01 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:16:28 -0400 Subject: [PATCH 15/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 4bfb444a60b1e..e3eddf662b735 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -268,11 +268,11 @@ $ tctl autoupdate get ### Installers The following install scripts will be updated to install the latest updater and run `teleport-updater enable` with the proxy address: -- 
https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh +- [/api/types/installers/agentless-installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl) +- [/api/types/installers/installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl) +- [/lib/web/scripts/oneoff/oneoff.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh) +- [/lib/web/scripts/node-join/install.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh) +- [/assets/aws/files/install-hardened.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh) Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport. From 102263379aebe851b028db483714495653458e7c Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:18:10 -0400 Subject: [PATCH 16/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e3eddf662b735..69e8521ec8852 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -255,7 +255,7 @@ used to trigger other integrations to update the installed version of agents. ```shell $ tctl autoupdate watch {"agent_version": "1.0.0", "agent_edition": "enterprise", ... } -{"agent_version": "1.0.1, "agent_edition": "enterprise", ... } +{"agent_version": "1.0.1", "agent_edition": "enterprise", ... } {"agent_version": "2.0.0", "agent_edition": "enterprise", ... } [...] ``` From af20fe2ad3eb928d36b4345c21e9948ebce1b703 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:21:21 -0400 Subject: [PATCH 17/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 69e8521ec8852..57cf7d21bf590 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -91,8 +91,10 @@ spec: # agent updates are in place. agent_auto_update: on|off # agent_update_hour sets the hour in UTC at which clients should update their agents. - # The value -1 will set the upgrade time to the current time, resulting in immediate upgrades. - agent_update_hour: -1-23 + agent_update_hour: 0-23 + # agent_update_now overrides agent_update_hour and sets agent update time to the current time. 
+  # This is useful for rolling out critical security updates and bug fixes.
+  agent_update_now: on|off
   # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour.
   # The agent upgrader will pick a random time within this duration in which to upgrade.
   agent_update_jitter_seconds: 0-MAXINT64
 
   [...]
 ```
 ```
@@ -104,6 +106,8 @@ $ tctl autoupdate update --set-agent-auto-update=off
 Automatic updates configuration has been updated.
 $ tctl autoupdate update --set-agent-update-hour=3
 Automatic updates configuration has been updated.
+$ tctl autoupdate update --set-agent-update-now=true
+Automatic updates configuration has been updated.
 $ tctl autoupdate update --set-agent-update-jitter-seconds=600
 Automatic updates configuration has been updated.
 ```

From 7fd207d3ea583964c2b9d29f8c1296f3a43b86d0 Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Mon, 8 Apr 2024 14:39:50 -0400
Subject: [PATCH 18/84] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 57cf7d21bf590..2eabcf3dabc4e 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -282,6 +282,21 @@ Eventually, additional logic from the scripts could be added to `teleport-update
 Moving additional logic into the upgrader is out-of-scope for this proposal.
 
+
+To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted:
+- Install the `teleport-updater` package and defer `teleport-updater enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts.
+  This allows both the proxy address and token to be injected at VM initialization. The VM image may be used with any Teleport cluster.
+  Installer scripts will continue to function, as the package install operation will no-op.
+- Install the `teleport-updater` package and run `teleport-updater enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts.
+  This allows the proxy address to be pre-set in the image. `teleport.yaml` can be partially configured during image creation. At minimum, the token must be injected via cloud-init scripts.
+  Installer scripts would be skipped in favor of the `teleport configure` command.
+
+It is possible for a VM or container image to be created with a baked-in join token.
+We should recommend against this workflow for security reasons, since a long-lived token improperly stored in an image could be leaked.
+
+Alternatively, users may prefer to skip pre-baked agent configuration, and run one of the script-based installers to join VMs to the cluster after the VM is started.
+
+Documentation should be created covering the above workflows.
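Either baking workflow ends with `teleport-updater enable` persisting its state to `updates.yaml`. A sketch of one way to make that write crash-safe during image creation or first boot, using the schema introduced earlier; the helper name and temp-file strategy are assumptions:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeUpdatesConfig persists updates.yaml via write-to-temp-then-rename, so a
// crash mid-enable cannot leave a truncated config behind.
func writeUpdatesConfig(dir, proxy string, enabled bool, activeVersion string) error {
	tmp, err := os.CreateTemp(dir, "updates-*.yaml")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	_, err = fmt.Fprintf(tmp,
		"version: v1\nkind: agent_versions\nspec:\n  proxy: %s\n  enabled: %t\n  active_version: %s\n",
		proxy, enabled, activeVersion)
	if err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) is atomic within a filesystem: readers see either the old
	// updates.yaml or the new one, never a partial file.
	return os.Rename(tmp.Name(), filepath.Join(dir, "updates.yaml"))
}

func main() {
	err := writeUpdatesConfig("/var/lib/teleport/versions", "mytenant.teleport.sh", true, "15.1.1")
	fmt.Println(err)
}
```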
+ ### Documentation The following documentation will need to be updated to cover the new upgrader workflow: From 27774cba0572c208621e0a8baa6d852eb928dbe8 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 15 Apr 2024 17:32:03 -0400 Subject: [PATCH 19/84] Downgrades --- rfd/0169-auto-updates-linux-agents.md | 66 +++++++++++++++++++++------ 1 file changed, 52 insertions(+), 14 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2eabcf3dabc4e..a017e5ff231b6 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -141,10 +141,13 @@ $ tree /var/lib/teleport │ │ ├── ... │ │ ├── teleport-updater │ │ └── teleport - │ └── etc - │ ├── ... - │ └── systemd - │ └── teleport.service + │ ├── etc + │ │ ├── ... + │ │ └── systemd + │ │ └── teleport.service + │ └── backup + │ ├── teleport + │ └── backup.yaml ├── 15.1.1 │ ├── bin │ │ ├── ... @@ -176,6 +179,19 @@ spec: active_version: 15.1.1 ``` +backup.yaml: +``` +version: v1 +kind: config_backup +spec: + # proxy address from the backup + proxy: mytenant.teleport.sh + # version from the backup + version: 15.1.0 + # time the backup was created + creation_time: 2020-12-09T16:09:53+00:00 +``` + ### Runtime The agent-updater will run as a periodically executing systemd service which runs every 10 minutes. @@ -203,12 +219,13 @@ The `enable` subcommand will: 6. Verify the checksum. 7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 8. Replace any existing binaries or symlinks with symlinks to the current version. -9. Restart the agent if the systemd service is already enabled. -10. Set `active_version` in `updates.yaml` if successful or not enabled. -11. Replace the old symlinks or binaries and quit (exit 1) if unsuccessful. -12. Remove any `teleport` package if installed. -13. Verify the symlinks to the active version still exists. -14. Remove all stored versions of the agent except the current version and last working version. +9. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` +10. Restart the agent if the systemd service is already enabled. +11. Set `active_version` in `updates.yaml` if successful or not enabled. +12. Replace the symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +13. Remove any `teleport` package if installed. +14. Verify the symlinks to the active version still exists. +15. Remove all stored versions of the agent except the current version and last working version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -223,10 +240,11 @@ When `update` subcommand is otherwise executed, it will: 7. Verify the checksum. 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Update symlinks to point at the new version. -10. Restart the agent if the systemd service is already enabled. -11. Set `active_version` in `updates.yaml` if successful or not enabled. -12. Replace the old symlink or binary and quit (exit 1) if unsuccessful. -13. Remove all stored versions of the agent except the current version and last working version. +10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. +11. Restart the agent if the systemd service is already enabled. +12. Set `active_version` in `updates.yaml` if successful or not enabled. +13. Replace the old symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +14. 
Remove all stored versions of the agent except the current version and last working version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. @@ -247,6 +265,26 @@ To retrieve known information about agent upgrades, the `status` subcommand will } ``` +### Downgrades + +Downgrades may be necessary in cases where we have rolled out a bug or security vulnerability with critical impact. +Downgrades are challenging, because `/var/lib/teleport` used by newer version of Teleport may not be valid for older versions of Teleport. + +When Teleport is downgraded to a previous version that has a backup of `/var/lib/teleport` present in `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`: +1. `/var/lib/teleport/versions/OLD-VERSION/backup/backup.yaml` is validated to determine if the backup is usable (proxy and version must match, age must be less than cert lifetime, etc.) +2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. +3. If the backup is invalid, we refuse to downgrade. + +Downgrades are still applied with `teleport-upgrader update`. +The above steps modulate the standard workflow in the section above. + +Notes: +- Downgrades can lead to downtime, as Teleport must be fully-stopped to safely replace `/var/lib/teleport`. +- `/var/lib/teleport/versions/` is not included in backups. + +Questions: +- Should we refuse to downgrade in step (3), or risk starting the older version of Teleport with the newer `/var/lib/teleport`? + ### Manual Workflow For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. From 57fc5572d45a2d71f0786cd2bfbe2ba7bbf0e458 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 12:53:04 -0400 Subject: [PATCH 20/84] Feedback --- rfd/0169-auto-updates-linux-agents.md | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index a017e5ff231b6..470acef51445e 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -138,11 +138,12 @@ $ tree /var/lib/teleport └── versions ├── 15.0.0 │ ├── bin - │ │ ├── ... + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries │ │ ├── teleport-updater │ │ └── teleport │ ├── etc - │ │ ├── ... │ │ └── systemd │ │ └── teleport.service │ └── backup @@ -150,14 +151,19 @@ $ tree /var/lib/teleport │ └── backup.yaml ├── 15.1.1 │ ├── bin - │ │ ├── ... + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries │ │ ├── teleport-updater │ │ └── teleport │ └── etc - │ ├── ... │ └── systemd │ └── teleport.service └── updates.yaml +$ ls -l /usr/local/bin/tsh +/usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh +$ ls -l /usr/local/bin/tbot +/usr/local/bin/tbot -> /var/lib/teleport/versions/15.0.0/bin/tbot $ ls -l /usr/local/bin/teleport /usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport $ ls -l /usr/local/bin/teleport-updater @@ -216,13 +222,13 @@ The `enable` subcommand will: 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. 4. 
If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12). 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -6. Verify the checksum. +6. Download and verify the checksum. 7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 8. Replace any existing binaries or symlinks with symlinks to the current version. 9. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` 10. Restart the agent if the systemd service is already enabled. 11. Set `active_version` in `updates.yaml` if successful or not enabled. -12. Replace the symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +12. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. 13. Remove any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. 15. Remove all stored versions of the agent except the current version and last working version. @@ -237,18 +243,20 @@ When `update` subcommand is otherwise executed, it will: 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -7. Verify the checksum. +7. Download and verify the checksum. 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Update symlinks to point at the new version. 10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. 11. Restart the agent if the systemd service is already enabled. 12. Set `active_version` in `updates.yaml` if successful or not enabled. -13. Replace the old symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +13. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. 14. Remove all stored versions of the agent except the current version and last working version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. +If `teleport-updater` fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version. + To retrieve known information about agent upgrades, the `status` subcommand will return the following: ```json { From bc2815016bf2e0037e3bac60a012f2e78a5c2bd2 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 14:49:21 -0400 Subject: [PATCH 21/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 470acef51445e..15dd526180439 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -255,7 +255,15 @@ When `update` subcommand is otherwise executed, it will: To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. 
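A sketch of that reexec handoff, assuming the compiled-in version and the `active_version` from `updates.yaml` are already loaded (Linux-specific, via `syscall.Exec`):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const versionsDir = "/var/lib/teleport/versions"

// maybeReexec replaces the running process with the updater binary shipped in
// the active Teleport version, so the updater effectively self-updates before
// doing any work. currentVersion would be compiled into the binary;
// activeVersion comes from updates.yaml.
func maybeReexec(currentVersion, activeVersion string) error {
	if activeVersion == "" || activeVersion == currentVersion {
		return nil // nothing newer to hand off to
	}
	bin := fmt.Sprintf("%s/%s/bin/teleport-updater", versionsDir, activeVersion)
	if _, err := os.Stat(bin); err != nil {
		return nil // versioned updater missing; keep running this binary
	}
	// syscall.Exec does not return on success; the versioned updater takes
	// over this PID with the same arguments and environment.
	return syscall.Exec(bin, os.Args, os.Environ())
}

func main() {
	if err := maybeReexec("1.0.0", "1.0.1"); err != nil {
		fmt.Fprintln(os.Stderr, "reexec failed:", err)
	}
}
```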
-If `teleport-updater` fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version. +#### Failure Conditions + +If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. + +If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version. + +Known failure conditions caused by intentional configuration (e.g., upgrades disabled) will not trigger retry logic. + +#### Status To retrieve known information about agent upgrades, the `status` subcommand will return the following: ```json From 3da65251bcf77170c9dc1e5a96b5292488aa1a75 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 14:55:05 -0400 Subject: [PATCH 22/84] Remove last working copy of teleport --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 15dd526180439..9404e0c4d44a3 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -231,7 +231,7 @@ The `enable` subcommand will: 12. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. 13. Remove any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. -15. Remove all stored versions of the agent except the current version and last working version. +15. Remove all stored versions of the agent except the current version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -250,7 +250,7 @@ When `update` subcommand is otherwise executed, it will: 11. Restart the agent if the systemd service is already enabled. 12. Set `active_version` in `updates.yaml` if successful or not enabled. 13. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -14. Remove all stored versions of the agent except the current version and last working version. +14. Remove all stored versions of the agent except the current version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. From 4a81d9d95d15c2ec65c828809668ccb69c65c500 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 15:01:41 -0400 Subject: [PATCH 23/84] add step to ensure free disk space --- rfd/0169-auto-updates-linux-agents.md | 44 ++++++++++++++------------- 1 file changed, 23 insertions(+), 21 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 9404e0c4d44a3..9d9764027aed5 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -220,18 +220,19 @@ The `enable` subcommand will: 1. Configure `updates.yaml` with the current proxy address and set `enabled` to true. 2. Query the `/v1/webapi/ping` endpoint. 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. -4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12). -5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -6. Download and verify the checksum. 
-7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. -8. Replace any existing binaries or symlinks with symlinks to the current version. -9. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` -10. Restart the agent if the systemd service is already enabled. -11. Set `active_version` in `updates.yaml` if successful or not enabled. -12. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -13. Remove any `teleport` package if installed. -14. Verify the symlinks to the active version still exists. -15. Remove all stored versions of the agent except the current version. +4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (14). +5. Ensure there is enough free disk space to upgrade Teleport. +6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. +7. Download and verify the checksum (tarball URL suffixed with `.sha256`). +8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. +9. Replace any existing binaries or symlinks with symlinks to the current version. +10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` +11. Restart the agent if the systemd service is already enabled. +12. Set `active_version` in `updates.yaml` if successful or not enabled. +13. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +14. Remove any `teleport` package if installed. +15. Verify the symlinks to the active version still exists. +16. Remove all stored versions of the agent except the current version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -242,15 +243,16 @@ When `update` subcommand is otherwise executed, it will: 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -7. Download and verify the checksum. -8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. -9. Update symlinks to point at the new version. -10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. -11. Restart the agent if the systemd service is already enabled. -12. Set `active_version` in `updates.yaml` if successful or not enabled. -13. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -14. Remove all stored versions of the agent except the current version. +6. Ensure there is enough free disk space to upgrade Teleport. +7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. +8. Download and verify the checksum (tarball URL suffixed with `.sha256`). +9. Extract the tarball to `/var/lib/teleport/versions/VERSION`. +10. Update symlinks to point at the new version. +11. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. +12. Restart the agent if the systemd service is already enabled. +13. Set `active_version` in `updates.yaml` if successful or not enabled. +14. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +15. Remove all stored versions of the agent except the current version. 
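The symlink updates in these steps (step 10 of the `update` flow) can be performed so that `/usr/local/bin/teleport` never dangles mid-swap; one possible approach, not prescribed by this RFD, is symlink-then-rename, since `rename(2)` is atomic on a single filesystem:

```go
package main

import (
	"os"
	"path/filepath"
)

// relink points linkPath at target by creating a temporary symlink and
// renaming it over the old one, so the link is replaced atomically.
func relink(target, linkPath string) error {
	tmp := linkPath + ".tmp"
	os.Remove(tmp) // ignore error: tmp may not exist yet
	if err := os.Symlink(target, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, linkPath)
}

func main() {
	version := "15.1.1" // illustrative; comes from updates.yaml in practice
	target := filepath.Join("/var/lib/teleport/versions", version, "bin", "teleport")
	if err := relink(target, "/usr/local/bin/teleport"); err != nil {
		panic(err)
	}
}
```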
To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. From da278313740f9311c50d42ff3b527ed97160ba7b Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 15:05:30 -0400 Subject: [PATCH 24/84] Typos --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 9d9764027aed5..afefc14255b44 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -255,7 +255,7 @@ When `update` subcommand is otherwise executed, it will: 15. Remove all stored versions of the agent except the current version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. -The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. +The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reexec in most scenarios. #### Failure Conditions @@ -293,7 +293,7 @@ When Teleport is downgraded to a previous version that has a backup of `/var/lib 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. 3. If the backup is invalid, we refuse to downgrade. -Downgrades are still applied with `teleport-upgrader update`. +Downgrades are still applied with `teleport-updater update`. The above steps modulate the standard workflow in the section above. Notes: From 994865d32a07a98b6306178edde1b75ce35898a8 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 23 May 2024 11:31:26 -0400 Subject: [PATCH 25/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index afefc14255b44..82847e65edd60 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -48,7 +48,7 @@ We must provide a seamless, hands-off experience for auto-updates that is easy t We will ship a new auto-updater package written in Go that does not interface with the system package manager. It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually. -It will read the unauthenticated `/v1/webapi/ping` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. +It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. ### Installation @@ -65,7 +65,7 @@ $ systemctl enable teleport #### Endpoints -`/v1/webapi/ping` +`/v1/webapi/find` ```json { "server_edition": "enterprise", @@ -89,7 +89,7 @@ spec: # agent_auto_update allows turning agent updates on or off at the # cluster level. 
Only turn agent automatic updates off if self-managed # agent updates are in place. - agent_auto_update: on|off + agent_auto_update: true|false # agent_update_hour sets the hour in UTC at which clients should update their agents. agent_update_hour: 0-23 # agent_update_now overrides agent_update_hour and sets agent update time to the current time. @@ -97,7 +97,7 @@ spec: agent_update_now: on|off # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. # The agent upgrader will pick a random time within this duration in which to upgrade. - agent_update_jitter_seconds: 0-MAXINT64 + agent_update_jitter_seconds: 0-3600 [...] ``` @@ -218,7 +218,7 @@ It will also run update teleport immediately, to ensure that subsequent executio The `enable` subcommand will: 1. Configure `updates.yaml` with the current proxy address and set `enabled` to true. -2. Query the `/v1/webapi/ping` endpoint. +2. Query the `/v1/webapi/find` endpoint. 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. 4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (14). 5. Ensure there is enough free disk space to upgrade Teleport. @@ -239,7 +239,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. -2. Query the `/v1/webapi/ping` endpoint. +2. Query the `/v1/webapi/find` endpoint. 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. @@ -305,7 +305,7 @@ Questions: ### Manual Workflow -For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. +For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint. This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or ansible). Cluster administrators that want to self-manage agent updates will be From 052c490bb4f50d46817d5711c61663e44566967a Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 23 May 2024 12:26:49 -0400 Subject: [PATCH 26/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 82847e65edd60..f709b48171627 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -213,7 +213,7 @@ $ teleport-updater enable --proxy mytenant.teleport.sh If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. -On servers without Teleport installed already, the `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. +The `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. 
It will also run update teleport immediately, to ensure that subsequent executions succeed. The `enable` subcommand will: @@ -377,6 +377,9 @@ The Upgrade Framework (TUF) will be used to implement secure updates in the futu ## Execution Plan 1. Implement new auto-updater in Go. -2. Prep documentation changes. -3. Release new updater via teleport-ent-updater package. -4. Release documentation changes. +2. Test extensively on all supported Linux distributions. +3. Prep documentation changes. +4. Release new updater via teleport-ent-updater package. +5. Release documentation changes. +6. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. +7. Communicate to all Cloud customers that they must update their updater. From be4956b5cf6e1a89ea4467583654803c3ffb3b44 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 28 May 2024 17:37:01 -0400 Subject: [PATCH 27/84] feedback --- rfd/0169-auto-updates-linux-agents.md | 55 +++++++++++++-------------- 1 file changed, 26 insertions(+), 29 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index f709b48171627..7cff749bae7dd 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -47,7 +47,7 @@ We must provide a seamless, hands-off experience for auto-updates that is easy t ## Details We will ship a new auto-updater package written in Go that does not interface with the system package manager. -It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually. +It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. @@ -71,14 +71,13 @@ $ systemctl enable teleport "server_edition": "enterprise", "agent_version": "15.1.1", "agent_auto_update": true, - "agent_update_after": "2024-04-23T18:00:00.000Z", - "agent_update_jitter_seconds": 10, + "agent_update_jitter_seconds": 10 } ``` Notes: -- Critical updates are achieved by serving `agent_update_after` with the current time. -- The Teleport proxy translates upgrade hours (below) into a specific time after which all agents should be upgraded. -- If an agent misses an upgrade window, it will always update immediately. +- The Teleport proxy translates upgrade hours (below) into a specific time after which the served `agent_version` changes, resulting in all agents being upgraded. +- Critical updates are achieved by serving the desired `agent_version` immediately. +- If an agent misses an upgrade window, it will always update immediately due to the new agent version being served. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. #### Teleport Resources @@ -92,11 +91,11 @@ spec: agent_auto_update: true|false # agent_update_hour sets the hour in UTC at which clients should update their agents. agent_update_hour: 0-23 - # agent_update_now overrides agent_update_hour and sets agent update time to the current time. + # agent_update_now overrides agent_update_hour and serves the new version immediately. 
# This is useful for rolling out critical security updates and bug fixes. agent_update_now: on|off # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. - # The agent upgrader will pick a random time within this duration in which to upgrade. + # The agent upgrader will pick a random time within this duration to wait to upgrade. agent_update_jitter_seconds: 0-3600 [...] @@ -116,9 +115,7 @@ Automatic updates configuration has been updated. kind: autoupdate_version spec: # agent_version is the version of the agent the cluster will advertise. - # Can be auto (match the version of the proxy) or an exact semver formatted - # version. - agent_version: auto|X.Y.Z + agent_version: X.Y.Z [...] ``` @@ -147,7 +144,7 @@ $ tree /var/lib/teleport │ │ └── systemd │ │ └── teleport.service │ └── backup - │ ├── teleport + │ ├── sqlite.db │ └── backup.yaml ├── 15.1.1 │ ├── bin @@ -188,7 +185,7 @@ spec: backup.yaml: ``` version: v1 -kind: config_backup +kind: db_backup spec: # proxy address from the backup proxy: mytenant.teleport.sh @@ -226,13 +223,13 @@ The `enable` subcommand will: 7. Download and verify the checksum (tarball URL suffixed with `.sha256`). 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Replace any existing binaries or symlinks with symlinks to the current version. -10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` +10. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 11. Restart the agent if the systemd service is already enabled. 12. Set `active_version` in `updates.yaml` if successful or not enabled. -13. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +13. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 14. Remove any `teleport` package if installed. 15. Verify the symlinks to the active version still exists. -16. Remove all stored versions of the agent except the current version. +16. Remove all stored versions of the agent except the current version and last working version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -240,7 +237,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. -3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. +3. Check that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Ensure there is enough free disk space to upgrade Teleport. @@ -248,15 +245,17 @@ When `update` subcommand is otherwise executed, it will: 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 10. Update symlinks to point at the new version. -11. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. +11. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 12. Restart the agent if the systemd service is already enabled. 13. Set `active_version` in `updates.yaml` if successful or not enabled. -14. 
Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -15. Remove all stored versions of the agent except the current version. +14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. +15. Remove all stored versions of the agent except the current version and last working version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reexec in most scenarios. +To ensure that SELinux permissions do not prevent the `teleport-updater` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. + #### Failure Conditions If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. @@ -276,7 +275,6 @@ To retrieve known information about agent upgrades, the `status` subcommand will "agent_edition_installed": "enterprise", "agent_edition_desired": "enterprise", "agent_edition_previous": "enterprise", - "agent_update_time_next": "2020-12-09T16:09:53+00:00", "agent_update_time_last": "2020-12-10T16:00:00+00:00", "agent_update_time_jitter": 600, "agent_updates_enabled": true @@ -286,9 +284,9 @@ To retrieve known information about agent upgrades, the `status` subcommand will ### Downgrades Downgrades may be necessary in cases where we have rolled out a bug or security vulnerability with critical impact. -Downgrades are challenging, because `/var/lib/teleport` used by newer version of Teleport may not be valid for older versions of Teleport. +Downgrades are challenging, because `sqlite.db` used by newer version of Teleport may not be valid for older versions of Teleport. -When Teleport is downgraded to a previous version that has a backup of `/var/lib/teleport` present in `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`: +When Teleport is downgraded to a previous version that has a backup of `sqlite.db` present in `/var/lib/teleport/versions/OLD-VERSION/backup/`: 1. `/var/lib/teleport/versions/OLD-VERSION/backup/backup.yaml` is validated to determine if the backup is usable (proxy and version must match, age must be less than cert lifetime, etc.) 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. 3. If the backup is invalid, we refuse to downgrade. @@ -296,17 +294,16 @@ When Teleport is downgraded to a previous version that has a backup of `/var/lib Downgrades are still applied with `teleport-updater update`. The above steps modulate the standard workflow in the section above. -Notes: -- Downgrades can lead to downtime, as Teleport must be fully-stopped to safely replace `/var/lib/teleport`. -- `/var/lib/teleport/versions/` is not included in backups. +Downgrades lead to downtime, as Teleport must be fully-stopped to safely replace `sqlite.db`. -Questions: -- Should we refuse to downgrade in step (3), or risk starting the older version of Teleport with the newer `/var/lib/teleport`? +Teleport CA certificate rotations will break rollbacks. +This may be addressed in the future by additional validation of the agent's client certificate issuer fingerprints. +This would prevent downgrades to backups with invalid certs. 
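A minimal sketch of the `backup.yaml` validation that gates step (1) of the downgrade flow. Only the `proxy` field is shown in this RFD's `backup.yaml` example, so the version and creation-time fields and the 24-hour bound below are assumptions standing in for the "proxy and version must match, age must be less than cert lifetime" checks.

```go
// Minimal sketch of validating backup.yaml before restoring the backup.
// The spec.version and spec.creation_time fields and the maxAge bound
// are assumptions; only spec.proxy appears in this RFD's example.
package main

import (
	"fmt"
	"os"
	"time"

	"gopkg.in/yaml.v3"
)

type backupMetadata struct {
	Version string `yaml:"version"`
	Kind    string `yaml:"kind"`
	Spec    struct {
		Proxy        string    `yaml:"proxy"`
		Version      string    `yaml:"version"`       // assumed field
		CreationTime time.Time `yaml:"creation_time"` // assumed field
	} `yaml:"spec"`
}

func backupUsable(path, wantProxy, wantVersion string, maxAge time.Duration) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	var b backupMetadata
	if err := yaml.Unmarshal(raw, &b); err != nil {
		return err
	}
	switch {
	case b.Spec.Proxy != wantProxy:
		return fmt.Errorf("backup belongs to proxy %q, not %q", b.Spec.Proxy, wantProxy)
	case b.Spec.Version != wantVersion:
		return fmt.Errorf("backup was written by version %q, not %q", b.Spec.Version, wantVersion)
	case time.Since(b.Spec.CreationTime) > maxAge:
		return fmt.Errorf("backup is older than %s; client certs may have expired", maxAge)
	}
	return nil
}

func main() {
	err := backupUsable("/var/lib/teleport/versions/15.1.1/backup/backup.yaml",
		"mytenant.teleport.sh", "15.1.1", 24*time.Hour)
	fmt.Println("backup usable:", err == nil)
}
```

If validation fails, the updater refuses to downgrade rather than risk starting an older Teleport against an incompatible state.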
### Manual Workflow For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint. -This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or ansible). +This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or Ansible). Cluster administrators that want to self-manage agent updates will be able to get and watch for changes to agent versions which can then be From c1784a71e55c2cf49a1b81fabf4510d0ddf50a2f Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 28 May 2024 17:49:58 -0400 Subject: [PATCH 28/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 7cff749bae7dd..dba4c72923fd5 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -284,6 +284,8 @@ To retrieve known information about agent upgrades, the `status` subcommand will ### Downgrades Downgrades may be necessary in cases where we have rolled out a bug or security vulnerability with critical impact. +To initiate a downgrade, `agent_version` is set to an older version than it was previously set to. + Downgrades are challenging, because `sqlite.db` used by newer version of Teleport may not be valid for older versions of Teleport. When Teleport is downgraded to a previous version that has a backup of `sqlite.db` present in `/var/lib/teleport/versions/OLD-VERSION/backup/`: @@ -291,10 +293,12 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. 3. If the backup is invalid, we refuse to downgrade. -Downgrades are still applied with `teleport-updater update`. +Downgrades are applied with `teleport-updater update`, just like upgrades. The above steps modulate the standard workflow in the section above. -Downgrades lead to downtime, as Teleport must be fully-stopped to safely replace `sqlite.db`. +Teleport must be fully-stopped to safely replace `sqlite.db`. +When restarting the agent during an upgrade, `SIGHUP` is used. +When restarting the agent during a downgrade, `systemd stop/start` are used before/after the downgrade. Teleport CA certificate rotations will break rollbacks. This may be addressed in the future by additional validation of the agent's client certificate issuer fingerprints. 
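A sketch of the two restart paths just described, assuming a conventional PID file and the stock `teleport` systemd unit; the helper names and the PID-file path are illustrative, not the updater's actual API.

```go
// Sketch of the two restart paths: SIGHUP for upgrades (in-place restart),
// and a full systemctl stop/start cycle around a downgrade so sqlite.db can
// be replaced while Teleport is not running. Paths/names are assumptions.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"syscall"
)

// reloadForUpgrade signals the running agent after an upgrade.
func reloadForUpgrade(pidFile string) error {
	raw, err := os.ReadFile(pidFile)
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return err
	}
	return syscall.Kill(pid, syscall.SIGHUP)
}

// restartForDowngrade stops the unit, restores the database, then starts it.
func restartForDowngrade(restoreDB func() error) error {
	if err := exec.Command("systemctl", "stop", "teleport").Run(); err != nil {
		return err
	}
	if err := restoreDB(); err != nil {
		return err
	}
	return exec.Command("systemctl", "start", "teleport").Run()
}

func main() {
	if err := reloadForUpgrade("/run/teleport.pid"); err != nil { // assumed path
		fmt.Fprintln(os.Stderr, err)
	}
	_ = restartForDowngrade
}
```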
From 511bf59679f346f9842232e391fc6410c293dd90 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 13:48:46 -0400 Subject: [PATCH 29/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index dba4c72923fd5..bdfe33352d6ea 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -9,7 +9,7 @@ state: draft * Engineering: @russjones && @bernardjkim * Product: @klizhentas || @xinding33 -* Security: @reedloden +* Security: Vendor TBD ## What @@ -301,8 +301,21 @@ When restarting the agent during an upgrade, `SIGHUP` is used. When restarting the agent during a downgrade, `systemd stop/start` are used before/after the downgrade. Teleport CA certificate rotations will break rollbacks. -This may be addressed in the future by additional validation of the agent's client certificate issuer fingerprints. -This would prevent downgrades to backups with invalid certs. +In the future, this could be addressed with additional validation of the agent's client certificate issuer fingerprints. +For now, rolling forward will allow recovery from a broken rollback. + +Given that rollbacks may fail, we must maintain the following invariants: +1. Broken rollbacks can always be reverted by reversing the rollback exactly. +2. Broken versions can always be reverted by rolling back and then skipping the broken version. + +When rolling forward, the backup of the newer version's `sqlite.db` is only restored if that exact version is the roll-forward version. +Otherwise, the older, rollback version of `sqlite.db` is preserved (i.e., the newer version's backup is not used). +This ensures that a version upgrade which broke the database can be recovered with a rollback and a new patch. +It also ensures that a broken rollback is always recoverable by reversing the rollback. + +Example: Given v1, v2, v3 versions of Teleport, where v2 is broken: +1. v1 -> v2 -> v1 -> v3 => DB from v1 is migrated directly to v3, avoiding v2 breakage. +2. v1 -> v2 -> v1 -> v2 -> v3 => DB from v2 is recovered, in case v1 database no longer has a valid certificate. ### Manual Workflow From a1316cd3a1792c4e9c76316a42193beecea4d041 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 14:00:41 -0400 Subject: [PATCH 30/84] apt purge --- rfd/0169-auto-updates-linux-agents.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index bdfe33352d6ea..5430679aa1fbd 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -256,6 +256,9 @@ The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reex To ensure that SELinux permissions do not prevent the `teleport-updater` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. +To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. +Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date. 
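The roll-forward rule above reduces to a single version comparison; a minimal sketch, using the document's v1/v2/v3 example as test cases:

```go
// Sketch of the roll-forward invariant: a backup taken by version V is only
// restored when rolling forward to exactly V; otherwise the rollback
// database is kept and migrated by the new version.
package main

import "fmt"

func shouldRestoreBackup(backupFromVersion, targetVersion string) bool {
	return backupFromVersion == targetVersion
}

func main() {
	// v1 -> v2 -> v1 -> v3: the v2 backup is not restored, so the v1
	// database migrates directly to v3, skipping the v2 breakage.
	fmt.Println(shouldRestoreBackup("v2", "v3")) // false
	// v1 -> v2 -> v1 -> v2: returning to exactly v2 recovers the v2 backup.
	fmt.Println(shouldRestoreBackup("v2", "v2")) // true
}
```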
+
 #### Failure Conditions

 If the new version of Teleport fails to start, the installation of Teleport is reverted as described above.

From f6bab8b6f530529f15d3a755174d42a6ce539951 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Wed, 29 May 2024 14:08:59 -0400
Subject: [PATCH 31/84] Only enable auto-upgrades if successful

---
 rfd/0169-auto-updates-linux-agents.md | 32 +++++++++++++--------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 5430679aa1fbd..dd6d9152c60f5 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -214,22 +214,22 @@ The `enable` subcommand will change the behavior of `teleport-update update` to
 It will also run update teleport immediately, to ensure that subsequent executions succeed.

 The `enable` subcommand will:
-1. Configure `updates.yaml` with the current proxy address and set `enabled` to true.
-2. Query the `/v1/webapi/find` endpoint.
-3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit.
-4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (14).
-5. Ensure there is enough free disk space to upgrade Teleport.
-6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
-7. Download and verify the checksum (tarball URL suffixed with `.sha256`).
-8. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
-9. Replace any existing binaries or symlinks with symlinks to the current version.
-10. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`.
-11. Restart the agent if the systemd service is already enabled.
-12. Set `active_version` in `updates.yaml` if successful or not enabled.
-13. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful.
-14. Remove any `teleport` package if installed.
-15. Verify the symlinks to the active version still exists.
-16. Remove all stored versions of the agent except the current version and last working version.
+1. Query the `/v1/webapi/find` endpoint.
+2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16).
+3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13).
+4. Ensure there is enough free disk space to upgrade Teleport.
+5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
+6. Download and verify the checksum (tarball URL suffixed with `.sha256`).
+7. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
+8. Replace any existing binaries or symlinks with symlinks to the current version.
+9. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`.
+10. Restart the agent if the systemd service is already enabled.
+11. Set `active_version` in `updates.yaml` if successful or not enabled.
+12. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful.
+13. Remove and purge any `teleport` package if installed.
+14. Verify the symlinks to the active version still exist.
+15. Remove all stored versions of the agent except the current version and last working version.
+16.
Configure `updates.yaml` with the current proxy address and set `enabled` to true. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. From 6f5565893fca2f634f04b4f94af023640367c740 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 14:11:37 -0400 Subject: [PATCH 32/84] reentrant lock --- rfd/0169-auto-updates-linux-agents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index dd6d9152c60f5..765fd10fafe4a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -213,6 +213,8 @@ If the proxy address is not provided with `--proxy`, the current proxy address f The `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. It will also run update teleport immediately, to ensure that subsequent executions succeed. +Both `update` and `enable` will maintain a shared lock file preventing any re-entrant executions. + The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). From d3e5b09cd842a9ea2d7a809b1c9a106beffbb619 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 14:25:37 -0400 Subject: [PATCH 33/84] reset --- rfd/0169-auto-updates-linux-agents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 765fd10fafe4a..6089cd975af69 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -109,6 +109,8 @@ $ tctl autoupdate update --set-agent-update-now=true Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-update-jitter-seconds=600 Automatic updates configuration has been updated. +$ tctl autoupdate reset +Automatic updates configuration has been reset to defaults. ``` ```yaml From 3555212ee47858839a277b66cecfde7430344852 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 31 May 2024 19:06:30 -0400 Subject: [PATCH 34/84] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 6089cd975af69..c49ab35aa2503 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -51,6 +51,8 @@ It will be distributed as a separate package from Teleport, and manage the insta It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. +Source code for the updater will live in `integrations/updater`. 
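The shared lock mentioned above could be a simple `flock`-based file lock; a minimal sketch, assuming a lock path under `/var/lib/teleport` (the real path is not specified in this RFD):

```go
// Sketch of the shared lock file that makes update/enable non-reentrant.
// flock with LOCK_EX|LOCK_NB fails immediately when another updater
// process already holds the lock. The lock path is an assumption.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func acquireLock(path string) (release func(), err error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("another update is already running: %w", err)
	}
	return func() {
		syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
		f.Close()
	}, nil
}

func main() {
	release, err := acquireLock("/var/lib/teleport/update.lock") // assumed path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer release()
	// ... perform the enable/update steps while holding the lock ...
}
```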
+
 ### Installation

 ```shell
 $ apt-get install teleport-ent-updater
 $ teleport-update enable --proxy example.teleport.sh

From f820b521ece99ae7d1d6aa815eb44892f7b8f640 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Tue, 4 Jun 2024 16:51:29 -0400
Subject: [PATCH 35/84] add note on backups

---
 rfd/0169-auto-updates-linux-agents.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index c49ab35aa2503..dd026686547f0 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -265,6 +265,8 @@ To ensure that SELinux permissions do not prevent the `teleport-updater` binary
 To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged.
 Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date.

+To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup.
+
 #### Failure Conditions

 If the new version of Teleport fails to start, the installation of Teleport is reverted as described above.

From 88bdda43acdd684a490e512742ccd11292ccef69 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 6 Jun 2024 18:39:20 -0400
Subject: [PATCH 36/84] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index dd026686547f0..cecaae08e9906 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -93,9 +93,6 @@ spec:
   agent_auto_update: true|false
   # agent_update_hour sets the hour in UTC at which clients should update their agents.
   agent_update_hour: 0-23
-  # agent_update_now overrides agent_update_hour and serves the new version immediately.
-  # This is useful for rolling out critical security updates and bug fixes.
-  agent_update_now: on|off
   # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour.
@@ -107,12 +104,17 @@ $ tctl autoupdate update --set-agent-auto-update=off
 Automatic updates configuration has been updated.
 $ tctl autoupdate update --set-agent-update-hour=3
 Automatic updates configuration has been updated.
-$ tctl autoupdate update --set-agent-update-now=true
-Automatic updates configuration has been updated.
 $ tctl autoupdate update --set-agent-update-jitter-seconds=600
 Automatic updates configuration has been updated.
 $ tctl autoupdate reset
 Automatic updates configuration has been reset to defaults.
+$ tctl autoupdate status
+Status: disabled
+Current: v1.2.3
+Desired: v1.2.4 (critical)
+Window: 3
+Jitter: 600s
+
 ```

 ```yaml
 kind: autoupdate_version
 spec:
   # agent_version is the version of the agent the cluster will advertise.
   agent_version: X.Y.Z
+  # agent_critical marks the version as critical.
+  # This overrides agent_update_hour in cluster_maintenance_config and serves the version immediately.
+  # This is useful for rolling out critical security updates and bug fixes.
+  agent_critical: true|false

   [...]
```

From f98258cfa93cdb7c8b55d82a9e7c50cd5aa54ba8 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 6 Jun 2024 18:46:23 -0400
Subject: [PATCH 37/84] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index cecaae08e9906..60647b539926e 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -132,6 +132,8 @@ spec:
 ```
 $ tctl autoupdate update --set-agent-version=15.1.1
 Automatic updates configuration has been updated.
+$ tctl autoupdate update --set-agent-version=15.1.2 --critical
+Automatic updates configuration has been updated.
 ```

From 00a1ea07a53abf6a253da0503140c2035ac3dad2 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 10 Jun 2024 13:48:39 -0400
Subject: [PATCH 38/84] Clarify restore/rollback process and validations

---
 rfd/0169-auto-updates-linux-agents.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 60647b539926e..ea7e96cc2e43a 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -9,7 +9,7 @@ state: draft

 * Engineering: @russjones && @bernardjkim
 * Product: @klizhentas || @xinding33
-* Security: Vendor TBD
+* Security: Doyensec

 ## What

@@ -234,7 +234,7 @@ The `enable` subcommand will:
 4. Ensure there is enough free disk space to upgrade Teleport.
 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
 6. Download and verify the checksum (tarball URL suffixed with `.sha256`).
-7. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
+7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
 8. Replace any existing binaries or symlinks with symlinks to the current version.
 9. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`.
 10. Restart the agent if the systemd service is already enabled.
@@ -257,7 +257,7 @@ When `update` subcommand is otherwise executed, it will:
 6. Ensure there is enough free disk space to upgrade Teleport.
 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
 8. Download and verify the checksum (tarball URL suffixed with `.sha256`).
-9. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
+9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
 10. Update symlinks to point at the new version.
 11. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`.
 12. Restart the agent if the systemd service is already enabled.
@@ -314,6 +314,9 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d
 Downgrades are applied with `teleport-updater update`, just like upgrades.
 The above steps modulate the standard workflow in the section above.

+If the downgraded version is already present, the uncompressed version is used to ensure fast recovery of the exact state before the failed upgrade.
+To ensure that the target version was not corrupted by incomplete extraction, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
+To ensure that the DB backup was not corrupted by incomplete copying, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/backup/backup.yaml` before restoring.

 Teleport must be fully-stopped to safely replace `sqlite.db`.
 When restarting the agent during an upgrade, `SIGHUP` is used.
 When restarting the agent during a downgrade, `systemd stop/start` are used before/after the downgrade.

From 7dd114456b9a934636654505c275369399c293ce Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 10 Jun 2024 14:11:49 -0400
Subject: [PATCH 39/84] Added section on logging

---
 rfd/0169-auto-updates-linux-agents.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index ea7e96cc2e43a..59db915416ce1 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -410,6 +410,13 @@ are signed.

 The Upgrade Framework (TUF) will be used to implement secure updates in the future.

+## Logging
+
+All installation steps will be logged locally, such that they are viewable with `journalctl`.
+Care will be taken to ensure that updater logs are sharable with Teleport Support for debugging and auditing purposes.
+
+When TUF is added, events related to supply chain security may be sent to the Teleport cluster via the Teleport Agent.
+
 ## Execution Plan

 1. Implement new auto-updater in Go.

From 345d10372a6f9fd6278948942a88f4bc31bd700b Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Tue, 9 Jul 2024 14:54:31 -0400
Subject: [PATCH 40/84] Add schedules

---
 rfd/0169-auto-updates-linux-agents.md | 221 +++++++++++++++++---------
 1 file changed, 147 insertions(+), 74 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 59db915416ce1..963c97beb830f 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -3,7 +3,7 @@ authors: Stephen Levine (stephen.levine@goteleport.com)
 state: draft
 ---

-# RFD 0169 - Automatic Updates for Linux Agents
+# RFD 0169 - Automatic Updates for Agents

 ## Required Approvers

@@ -13,13 +13,14 @@ state: draft

 ## What

-This RFD proposes a new mechanism for Teleport agents installed on Linux servers to automatically update to a version set by an operator via tctl.
+This RFD proposes a new mechanism for Teleport agents to automatically update to a version scheduled by an operator via tctl.
+
+All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes.

 The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs:
-- Analogous adjustments for Teleport agents installed on Kubernetes
-- Phased rollouts of new agent versions for agents connected to an existing cluster
 - Signing of agent artifacts via TUF
 - Teleport Cloud APIs for updating agents
+- Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD.

 This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217.

@@ -29,7 +30,7 @@ Additionally, this RFD parallels the auto-update functionality for client tools

 The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users.

-1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades or confusing command output.
+1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc.
that can result in unintentional upgrades. 2. The use of system package management requires complex logic for each target distribution. 3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. 4. The use of bash to implement the updater makes changes difficult and prone to error. @@ -44,30 +45,15 @@ The existing mechanism for automatic agent updates does not provide a hands-off We must provide a seamless, hands-off experience for auto-updates that is easy to maintain. -## Details - -We will ship a new auto-updater package written in Go that does not interface with the system package manager. -It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. -It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. -It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. - -Source code for the updater will live in `integrations/updater`. - -### Installation +## Details - Teleport API -```shell -$ apt-get install teleport-ent-updater -$ teleport-update enable --proxy example.teleport.sh - -# if not enabled already, configure teleport and: -$ systemctl enable teleport -``` +Teleport will be updated to serve the desired version of Teleport from `/v1/webapi/find`. -### API +The version served with be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. -#### Endpoints +### Endpoints -`/v1/webapi/find` +`/v1/webapi/find?host=[host_uuid]` ```json { "server_edition": "enterprise", @@ -77,12 +63,11 @@ $ systemctl enable teleport } ``` Notes: -- The Teleport proxy translates upgrade hours (below) into a specific time after which the served `agent_version` changes, resulting in all agents being upgraded. -- Critical updates are achieved by serving the desired `agent_version` immediately. -- If an agent misses an upgrade window, it will always update immediately due to the new agent version being served. +- The Teleport proxy uses `cluster_maintenance_config` and `autoupdate_config` (below) to determine the time when the served `agent_auto_update` is `true` for the provided host UUID. +- Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -#### Teleport Resources +### Teleport Resources ```yaml kind: cluster_maintenance_config @@ -91,30 +76,94 @@ spec: # cluster level. Only turn agent automatic updates off if self-managed # agent updates are in place. agent_auto_update: true|false - # agent_update_hour sets the hour in UTC at which clients should update their agents. - agent_update_hour: 0-23 - # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. - # The agent upgrader will pick a random time within this duration to wait to upgrade. - agent_update_jitter_seconds: 0-3600 - - [...] + + # agent_auto_update_groups contains both "regular" or "critical" schedules. + # The schedule used is determined by the agent_version_schedule associated + # with the version in autoupdate_version. 
+ agent_auto_update_groups: + # schedule is "regular" or "critical" + regular: + - name: staging-group + # agent_selection defines which agents are included in the group. + agent_selection: + # query selects agents by resource query. + # default: all connected agents + query: 'labels["environment"]=="staging"' + # days specifies the days of the week when the group may be upgraded. + # default: ["*"] (all days) + days: [“Sun”, “Mon”, ... | "*"] + # start_hour specifies the hour when the group may start upgrading. + # default: 0 + start_hour: 0-23 + # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. + # default: 100% + max_in_flight: 0-100% + # timeout_seconds specifies the amount of time, after the specified jitter, after which + # an agent upgrade will be considered timed out if the version does not change. + # default: 60 + timeout_seconds: 30-900 + # failure_seconds specifies the amount of time after which an agent upgrade will be considered + # failed if the agent heartbeat stops before the upgrade is complete. + # default: 0 + failure_seconds: 0-900 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 + # max_failed_before_halt specifies the percentage of clients that may fail before this group + # and all dependent groups are halted. + # default: 0 + max_failed_before_halt: 0-100% + # max_timeout_before_halt specifies the percentage of clients that may time out before this group + # and all dependent groups are halted. + # default: 10% + max_timeout_before_halt: 0-100% + # requires specifies groups that must pass with the current version before this group is allowed + # to run using that version. + requires: ["test-group"] + # ... ``` + +Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX. +This field will remain indefinitely, to cover agents that do not present a known host UUID, as well as connected agents that are not matched to a group. + +```yaml +kind: cluster_maintenance_config +spec: + # ... + + # agent_auto_update contains both "regular" or "critical" schedules. + # The schedule used is determined by the agent_version_schedule associated + # with the version in autoupdate_version. + agent_auto_update: + regular: # or "critical" + # days specifies the days of the week when the group may be upgraded. + # default: ["*"] (all days) + days: [“Sun”, “Mon”, ... | "*"] + # start_hour specifies the hour when the group may start upgrading. + # default: 0 + start_hour: 0-23 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 + # ... ``` -$ tctl autoupdate update --set-agent-auto-update=off + + +```shell +$ tctl autoupdate update--set-agent-auto-update=off Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-update-hour=3 +$ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-update-jitter-seconds=600 +$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=600 Automatic updates configuration has been updated. 
$ tctl autoupdate reset Automatic updates configuration has been reset to defaults. $ tctl autoupdate status Status: disabled -Current: v1.2.3 -Desired: v1.2.4 (critical) -Window: 3 -Jitter: 600s - +Version: v1.2.4 +Schedule: regular ``` ```yaml @@ -122,14 +171,14 @@ kind: autoupdate_version spec: # agent_version is the version of the agent the cluster will advertise. agent_version: X.Y.Z - # agent_critical makes the version as critical. - # This overrides agent_update_hour in cluster_maintenance_config and serves the version immediately. - # This is useful for rolling out critical security updates and bug fixes. - agent_critical: true|false + # agent_version_schedule specifies the rollout schedule associated with the version. + # Currently, only critical and regular schedules are permitted. + agent_version_schedule: critical|regular - [...] -``` + # ... ``` + +```shell $ tctl autoupdate update --set-agent-version=15.1.1 Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-version=15.1.2 --critical @@ -139,6 +188,25 @@ Automatic updates configuration has been updated. Notes: - These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +## Details - Linux Agents + +We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. +It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. +It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. +It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. + +Source code for the updater will live in `integrations/updater`. + +### Installation + +```shell +$ apt-get install teleport-ent-updater +$ teleport-update enable --proxy example.teleport.sh + +# if not enabled already, configure teleport and: +$ systemctl enable teleport +``` + ### Filesystem ``` @@ -251,7 +319,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. -3. Check that `agent_auto_updates` is true. +3. Check that `agent_auto_updates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Ensure there is enough free disk space to upgrade Teleport. @@ -344,22 +412,7 @@ Example: Given v1, v2, v3 versions of Teleport, where v2 is broken: For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint. This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or Ansible). -Cluster administrators that want to self-manage agent updates will be -able to get and watch for changes to agent versions which can then be -used to trigger other integrations to update the installed version of agents. - -```shell -$ tctl autoupdate watch -{"agent_version": "1.0.0", "agent_edition": "enterprise", ... 
} -{"agent_version": "1.0.1", "agent_edition": "enterprise", ... } -{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } -[...] -``` - -```shell -$ tctl autoupdate get -{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } -``` +Cluster administrators that want to self-manage agent updates may manually query the `/v1/webapi/find` endpoint using the host UUID, and implement auto-updates with their own automation. ### Installers @@ -399,6 +452,19 @@ The following documentation will need to be updated to cover the new upgrader wo Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions. + +## Details - Kubernetes Agents + +The Kubernetes agent updater will be updated for compatibility with the new scheduling system. + +This means that it will stop reading upgrade windows using the authenticated connection to the proxy, and instead upgrade when indicated by the `/v1/webapi/find` endpoint. + +Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX and compatibility, will be covered in a future RFD. + +## Migration + +The existing update scheduling system will remain in-place until the old auto-updater is fully deprecated. + ## Security The initial version of automatic updates will rely on TLS to establish @@ -410,6 +476,9 @@ are signed. The Upgrade Framework (TUF) will be used to implement secure updates in the future. +Anyone who possesses a host UUID can determine when that host is scheduled to upgrade by repeatedly querying the public `/v1/webapi/find` endpoint. +It is not possible to discover the current version of that host, only the designated upgrade window. + ## Logging All installation steps will be logged locally, such that they are viewable with `journalctl`. @@ -419,10 +488,14 @@ When TUF is added, that events related to supply chain security may be sent to t ## Execution Plan -1. Implement new auto-updater in Go. -2. Test extensively on all supported Linux distributions. -3. Prep documentation changes. -4. Release new updater via teleport-ent-updater package. -5. Release documentation changes. -6. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. -7. Communicate to all Cloud customers that they must update their updater. +1. Implement Teleport APIs for new scheduling system (without groups and backpressure) +2. Implement new auto-updater in Go. +3. Implement changes to Kubernetes auto-updater. +4. Test extensively on all supported Linux distributions. +5. Prep documentation changes. +6. Release new updater via teleport-ent-updater package. +7. Release documentation changes. +8. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. +9. Communicate to all Cloud customers that they must update their updater. +10. Deprecate old auto-updater endpoints. +11. Add groups and backpressure features. 
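For the manual workflow described above, self-managed automation might poll the find endpoint like this; a minimal sketch using only the JSON fields documented earlier (the proxy hostname and `HOST_UUID` variable are placeholders):

```go
// Sketch of the manual workflow: poll the unauthenticated find endpoint
// with this host's UUID and compare the served version with the installed
// one. Field names match the JSON documented in this RFD.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type findResponse struct {
	ServerEdition   string `json:"server_edition"`
	AgentVersion    string `json:"agent_version"`
	AgentAutoUpdate bool   `json:"agent_auto_update"`
}

func desiredVersion(proxy, hostUUID string) (findResponse, error) {
	var out findResponse
	resp, err := http.Get(fmt.Sprintf("https://%s/v1/webapi/find?host=%s", proxy, hostUUID))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	return out, json.NewDecoder(resp.Body).Decode(&out)
}

func main() {
	out, err := desiredVersion("example.teleport.sh", os.Getenv("HOST_UUID"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("desired %s (%s), update now: %v\n", out.AgentVersion, out.ServerEdition, out.AgentAutoUpdate)
}
```

External automation (e.g., Ansible or JamF) can then install the returned version with its own tooling.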
From a022fd5e08cca1729bcbdcf88db71b5e2c7c7344 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 9 Jul 2024 16:51:32 -0400 Subject: [PATCH 41/84] immediate schedule + note on cycles and chains --- rfd/0169-auto-updates-linux-agents.md | 28 ++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 963c97beb830f..9e0aefce9ea6f 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -77,9 +77,10 @@ spec: # agent updates are in place. agent_auto_update: true|false - # agent_auto_update_groups contains both "regular" or "critical" schedules. + # agent_auto_update_groups contains both "regular" and "critical" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. + # Groups are not configurable with the "immediate" schedule. agent_auto_update_groups: # schedule is "regular" or "critical" regular: @@ -95,6 +96,10 @@ spec: # start_hour specifies the hour when the group may start upgrading. # default: 0 start_hour: 0-23 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. # default: 100% max_in_flight: 0-100% @@ -106,10 +111,6 @@ spec: # failed if the agent heartbeat stops before the upgrade is complete. # default: 0 failure_seconds: 0-900 - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent upgrader client will pick a random time within this duration to wait to upgrade. - # default: 0 - jitter_seconds: 0-60 # max_failed_before_halt specifies the percentage of clients that may fail before this group # and all dependent groups are halted. # default: 0 @@ -124,7 +125,9 @@ spec: # ... ``` -Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX. +Note that cycles and dependency chains longer than a week will be rejected. + +Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with `agent_auto_update` field. This field will remain indefinitely, to cover agents that do not present a known host UUID, as well as connected agents that are not matched to a group. ```yaml @@ -132,10 +135,17 @@ kind: cluster_maintenance_config spec: # ... - # agent_auto_update contains both "regular" or "critical" schedules. + # agent_auto_update contains "regular," "critical," and "immediate" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. agent_auto_update: + # The immediate schedule results in all agents updating simultaneously. + # Only client-side jitter is configurable. + immediate: + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 regular: # or "critical" # days specifies the days of the week when the group may be upgraded. # default: ["*"] (all days) @@ -172,8 +182,8 @@ spec: # agent_version is the version of the agent the cluster will advertise. 
  agent_version: X.Y.Z
   # agent_version_schedule specifies the rollout schedule associated with the version.
-  # Currently, only critical and regular schedules are permitted.
-  agent_version_schedule: critical|regular
+  # Currently, only critical, regular, and immediate schedules are permitted.
+  agent_version_schedule: regular|critical|immediate

   # ...
 ```

From 9e6090f8a2e2f8a29d377cd4d5b17d2740146b53 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Wed, 10 Jul 2024 17:07:24 -0400
Subject: [PATCH 42/84] more details, more tctl commands

---
 rfd/0169-auto-updates-linux-agents.md | 44 +++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 9e0aefce9ea6f..4d24504a9c756 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -47,9 +47,21 @@ We must provide a seamless, hands-off experience for auto-updates that is easy t

 ## Details - Teleport API

-Teleport will be updated to serve the desired version of Teleport from `/v1/webapi/find`.
+Teleport will be updated to serve the desired agent version and edition from `/v1/webapi/find`.
+The version and edition served from that endpoint will be configured using the `cluster_maintenance_config` and `autoupdate_version` resources.
+Whether the updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`.

-The version served with be configured using the `cluster_maintenance_config` and `autoupdate_version` resources.
+To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find` endpoint.
+Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID.
+
+Rollouts are specified as interdependent groups of hosts, selected by resource label.
+A host is eligible to upgrade if the label is present on any of its connected resources.
+
+At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend.
+A fixed number of hosts (`max_in_flight`) are instructed to upgrade via `/v1/webapi/find`.
+Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`.
+The group rollout is halted if timeouts or failures exceed their specified thresholds.
+Group rollouts may be retried with `tctl autoupdate run`.

 ### Endpoints

 `/v1/webapi/find?host=[host_uuid]`
 ```json
 {
   "server_edition": "enterprise",
   "agent_version": "15.1.1",
   "agent_auto_update": true,
   "agent_update_jitter_seconds": 10
 }
 ```
 Notes:
 - The Teleport proxy uses `cluster_maintenance_config` and `autoupdate_config` (below) to determine the time when the served `agent_auto_update` is `true` for the provided host UUID.
 - Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`.
 - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured.
+- The host UUID is read from `/var/lib/teleport` by the updater.

 ### Teleport Resources

 ```yaml
 kind: cluster_maintenance_config
 spec:
   # agent_auto_update allows turning agent updates on or off at the
   # cluster level. Only turn agent automatic updates off if self-managed
   # agent updates are in place.
   agent_auto_update: true|false

   # agent_auto_update_groups contains both "regular" and "critical" schedules.
   # The schedule used is determined by the agent_version_schedule associated
   # with the version in autoupdate_version.
+  # Groups are not configurable with the "immediate" schedule.
   agent_auto_update_groups:
     # schedule is "regular" or "critical"
     regular:
     - name: staging-group
       # agent_selection defines which agents are included in the group.
       agent_selection:
         # query selects agents by resource query.
         # default: all connected agents
         query: 'labels["environment"]=="staging"'
       # days specifies the days of the week when the group may be upgraded.
       # default: ["*"] (all days)
       days: ["Sun", "Mon", ... | "*"]
       # start_hour specifies the hour when the group may start upgrading.
       # default: 0
       start_hour: 0-23
+      # jitter_seconds specifies a maximum jitter duration after the start hour.
+      # The agent upgrader client will pick a random time within this duration to wait to upgrade.
+      # default: 0
+      jitter_seconds: 0-60
       # max_in_flight specifies the maximum number of agents that may be upgraded at the same time.
       # default: 100%
       max_in_flight: 0-100%
       # timeout_seconds specifies the amount of time, after the specified jitter, after which
       # an agent upgrade will be considered timed out if the version does not change.
       # default: 60
       timeout_seconds: 30-900
       # failure_seconds specifies the amount of time after which an agent upgrade will be considered
       # failed if the agent heartbeat stops before the upgrade is complete.
       # default: 0
       failure_seconds: 0-900
-      # jitter_seconds specifies a maximum jitter duration after the start hour.
-      # The agent upgrader client will pick a random time within this duration to wait to upgrade.
-      # default: 0
-      jitter_seconds: 0-60
       # max_failed_before_halt specifies the percentage of clients that may fail before this group
       # and all dependent groups are halted.
       # default: 0
       max_failed_before_halt: 0-100%
       # max_timeout_before_halt specifies the percentage of clients that may time out before this group
       # and all dependent groups are halted.
       # default: 10%
       max_timeout_before_halt: 0-100%
       # requires specifies groups that must pass with the current version before this group is allowed
       # to run using that version.
       requires: ["test-group"]
       # ...
 ```

 Note that cycles and dependency chains longer than a week will be rejected.
+Otherwise, updates could take up to 7 weeks to propagate.

+Changing the version or schedule completely resets progress.
+Releasing new client versions multiple times a week has the potential to starve dependent groups from updates.
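A sketch of the rejection rule above: walking the `requires` graph to detect cycles and chains longer than seven groups (one group per day of the week). The depth bound is inferred from the surrounding text; the real validation logic is not specified here.

```go
// Sketch of validating group dependency graphs: reject cycles and
// "requires" chains longer than 7 groups (one week). Assumptions only.
package main

import "fmt"

// maxChain returns the longest requires-chain ending at group, or an error
// on a cycle. groups maps a group name to the groups it requires.
func maxChain(groups map[string][]string, group string, seen map[string]bool) (int, error) {
	if seen[group] {
		return 0, fmt.Errorf("dependency cycle through %q", group)
	}
	seen[group] = true
	defer delete(seen, group) // backtrack after exploring this path
	depth := 1
	for _, dep := range groups[group] {
		d, err := maxChain(groups, dep, seen)
		if err != nil {
			return 0, err
		}
		if d+1 > depth {
			depth = d + 1
		}
	}
	return depth, nil
}

func validate(groups map[string][]string) error {
	for name := range groups {
		d, err := maxChain(groups, name, map[string]bool{})
		if err != nil {
			return err
		}
		if d > 7 {
			return fmt.Errorf("chain through %q is %d groups long; max is 7 (one week)", name, d)
		}
	}
	return nil
}

func main() {
	groups := map[string][]string{
		"test-group":    nil,
		"staging-group": {"test-group"},
		"prod-group":    {"staging-group"},
	}
	fmt.Println(validate(groups)) // <nil>
}
```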
Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with `agent_auto_update` field.
This field will remain indefinitely, to cover agents that do not present a known host UUID, as well as connected agents that are not matched to a group.
@@ -162,6 +179,7 @@ spec:
 
 ```shell
+# configuration
 $ tctl autoupdate update --set-agent-auto-update=off
 Automatic updates configuration has been updated.
 $ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3
@@ -170,10 +188,31 @@ $ tctl autoupdate update --schedule regular --group staging-group --set-jitter-s
 Automatic updates configuration has been updated.
 $ tctl autoupdate reset
 Automatic updates configuration has been reset to defaults.
+
+# status
 $ tctl autoupdate status
 Status: disabled
 Version: v1.2.4
 Schedule: regular
+
+Groups:
+staging-group: succeeded at 2024-01-03 23:43:22 UTC
+prod-group: scheduled for 2024-01-03 23:43:22 UTC (depends on staging-group)
+other-group: failed at 2024-01-05 22:53:22 UTC
+
+$ tctl autoupdate status --group staging-group
+Status: succeeded
+Date: 2024-01-03 23:43:22 UTC
+Requires: (none)
+
+Upgraded: 230 (90%)
+Unchanged: 10 (4%)
+Failed: 15 (6%)
+Timed-out: 0
+
+# re-running failed group
+$ tctl autoupdate run --group staging-group
+Executing auto-update for group 'staging-group' immediately.
 ```
 
 ```yaml
@@ -462,7 +501,6 @@ The following documentation will need to be updated to cover the new upgrader wo
 
 Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions.
 
-
 ## Details - Kubernetes Agents
 
 The Kubernetes agent updater will be updated for compatibility with the new scheduling system.

From 3f5721c2e1fc82dd38489e4e973ff8afccf08ccd Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 11 Jul 2024 15:39:29 -0400
Subject: [PATCH 43/84] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 4d24504a9c756..52d8d50961f46 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -58,7 +58,7 @@ Rollouts are specified as interdependent groups of hosts, selected by resource l
 A host is eligible to upgrade if the label is present on any of its connected resources.
 
 At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend.
-A fixed number of hosts (`max_in_flight`) are instructed to upgrade via `/v1/webapi/find`.
+An arbitrarily selected fixed number of hosts (`max_in_flight x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
 Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`.
 The group rollout is halted if timeouts or failures exceed their specified thresholds.
 Group rollouts may be retried with `tctl autoupdate run`.
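+To make the `max_in_flight` arithmetic above concrete, a minimal sketch; the function names and integer-percent representation are illustrative only:
+
+```golang
+package rollout
+
+// allowedInFlight converts a max_in_flight percentage into an absolute
+// host count, always permitting at least one host to proceed.
+func allowedInFlight(maxInFlightPercent, total int) int {
+	n := total * maxInFlightPercent / 100
+	if n < 1 {
+		n = 1
+	}
+	return n
+}
+
+// nextBatch reports how many additional hosts may be instructed to
+// upgrade, given the hosts currently upgrading and those already done.
+func nextBatch(maxInFlightPercent, total, upgrading, done int) int {
+	n := allowedInFlight(maxInFlightPercent, total) - upgrading
+	if remaining := total - done - upgrading; n > remaining {
+		n = remaining
+	}
+	if n < 0 {
+		return 0
+	}
+	return n
+}
+```
+
+As earlier hosts complete, time out, or fail, the `upgrading` count shrinks and `nextBatch` releases further hosts, so the number in flight never exceeds the configured ceiling.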
From 46a7a2af5ea0fde48dd51a1aef3ede9860d6b0f2 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 29 Jul 2024 14:23:46 -0400
Subject: [PATCH 44/84] scalability

---
 rfd/0169-auto-updates-linux-agents.md | 59 +++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 52d8d50961f46..036baf7b667bd 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -54,15 +54,57 @@ Whether the updater querying the endpoint is instructed to upgrade (via `agent_a
 To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find`.
 Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID.
 
-Rollouts are specified as interdependent groups of hosts, selected by resource label.
-A host is eligible to upgrade if the label is present on any of its connected resources.
+Rollouts are specified as interdependent groups of hosts, selected by SSH resource or instance label query.
+A host is eligible to upgrade if the selection query returns true.
+Instance labels are a new feature introduced by this RFD that may be used when SSH service is not running or it is undesirable to reuse SSH labels:
+
+```
+teleport:
+  labels:
+    environment: staging
+  commands:
+    # this command will add a label 'arch=x86_64' to an instance
+    - name: arch
+      command: ['/bin/uname', '-p']
+      period: 1h0m0s
+```
+
+Only static and command-based labels may be used.
 
 At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend.
-An arbitrarily selected fixed number of hosts (`max_in_flight x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
+An arbitrary but UUID-deterministic fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
 Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`.
 The group rollout is halted if timeouts or failures exceed their specified thresholds.
 Group rollouts may be retried with `tctl autoupdate run`.
 
+### Scalability
+
+Instance heartbeats will now be cached at both the auth server and the proxy.
+
+All rollout logic is trigger by instance heartbeat backend writes, as changes can only occur on these events.
+The following data related to the rollout are stored in each instance heartbeat:
+- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time
+- `agent_upgrade_group_schedule`: schedule type of group (e.g., critical)
+- `agent_upgrade_group_name`: name of group (e.g., staging)
+- `agent_upgrade_group_start_time`: timestamp of current window start time
+- `agent_upgrade_group_end_time`: timestamp of current window start time
+
+At the start of the window, all queried instance heartbeats are marked with updated values for the `agent_upgrade_group_*` fields.
+Instance heartbeats are included in the current window if all three fields match the window defined in `cluster_maintenance_config`.
+
+On each instance heartbeat write, the auth server looks at instance heartbeats in cache and determines if additional agents should be upgrading.
+If they should, additional instance heartbeats are marked as upgrading by setting `agent_upgrade_start_time` to the current time.
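+A rough sketch of the heartbeat-write hook just described; the heartbeat type is a hypothetical subset of the real resource:
+
+```golang
+package rollout
+
+import "time"
+
+// InstanceHeartbeat is an illustrative subset of the heartbeat fields
+// consulted by the rollout logic.
+type InstanceHeartbeat struct {
+	HostID                string
+	AgentUpgradeStartTime time.Time
+}
+
+// onHeartbeatWrite marks the incoming heartbeat as upgrading when the
+// number of in-window upgrades is still below the allowed in-flight count.
+func onHeartbeatWrite(hb *InstanceHeartbeat, upgradingNow, allowed int) {
+	if !hb.AgentUpgradeStartTime.IsZero() {
+		return // already marked during this window
+	}
+	if upgradingNow < allowed {
+		hb.AgentUpgradeStartTime = time.Now().UTC()
+	}
+}
+```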
+When `agent_upgrade_start_time` is in the group's window, the proxy serves `agent_auto_upgrade: true` when queried via `/v1/webapi/find`. + +To avoid synchronization issues between auth servers, the rollout order is deterministically sorted by UUID. +Two concurrent writes to different auth servers may temporarily result in fewer upgrading instances than desired, but this should be resolved on the next write. + +Upgrading all agents generates the following write load: +- One write of `agent_upgrade_group_*` fields per agent +- One write of `agent_upgrade_start_time` field per agent + +All reads are from cache. + ### Endpoints `/v1/webapi/find?host=[host_uuid]` @@ -98,11 +140,14 @@ spec: # schedule is "regular" or "critical" regular: - name: staging-group - # agent_selection defines which agents are included in the group. - agent_selection: - # query selects agents by resource query. + # agents defines which agents are included in the group. + agents: + # node_labels_expression selects agents by SSH resource query. + # default: all connected agents + node_labels_expression: 'labels["environment"]=="staging"' + # instance_labels_expression selects agents by instance query. # default: all connected agents - query: 'labels["environment"]=="staging"' + instance_labels_expression: 'labels["environment"]=="staging"' # days specifies the days of the week when the group may be upgraded. # default: ["*"] (all days) days: [“Sun”, “Mon”, ... | "*"] From 6f62e3d68641cee27ee7464caa42250a84c5dc27 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 29 Jul 2024 14:41:31 -0400 Subject: [PATCH 45/84] df --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 036baf7b667bd..e1e0ca7a82c8f 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -393,7 +393,7 @@ The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). 3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). -4. Ensure there is enough free disk space to upgrade Teleport. +4. Ensure there is enough free disk space to upgrade Teleport via `df .`. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. @@ -416,7 +416,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check that `agent_auto_updates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Ensure there is enough free disk space to upgrade Teleport. +6. Ensure there is enough free disk space to upgrade Teleport via `df .`. 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. 
From b86a1ce2a88aa1c43962f50300fe53ed62a89d74 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 29 Jul 2024 14:47:02 -0400 Subject: [PATCH 46/84] content-length --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e1e0ca7a82c8f..8af0b7f4b275f 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -393,7 +393,7 @@ The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). 3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). -4. Ensure there is enough free disk space to upgrade Teleport via `df .`. +4. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. @@ -416,7 +416,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check that `agent_auto_updates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Ensure there is enough free disk space to upgrade Teleport via `df .`. +6. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. From 7587fa5a487ee9e0daaa75761cd89acee85beabf Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 29 Jul 2024 15:20:53 -0400 Subject: [PATCH 47/84] cache init --- rfd/0169-auto-updates-linux-agents.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 8af0b7f4b275f..56490acaa1e70 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -104,6 +104,9 @@ Upgrading all agents generates the following write load: - One write of `agent_upgrade_start_time` field per agent All reads are from cache. +If the cache is unhealthy, `agent_auto_update` is still served based on the last available value in cache. +This is safe because `agent_upgrade_start_time` is only written once during the upgrade. +However, this means that timeout thresholds should account for possible cache init time if initialization occurs right after `agent_upgrade_start_time` is written. 
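+Returning to the updater's disk-space check above (`df .` plus the `content-length` of a `HEAD` request), a sketch of the idea; the threefold headroom factor and function shape are assumptions:
+
+```golang
+package updater
+
+import (
+	"fmt"
+	"net/http"
+	"syscall"
+)
+
+// enoughDiskSpace verifies that the filesystem holding dir can fit the
+// tarball advertised by a HEAD request, with headroom for extraction.
+func enoughDiskSpace(dir, tarballURL string) error {
+	resp, err := http.Head(tarballURL)
+	if err != nil {
+		return err
+	}
+	resp.Body.Close()
+	if resp.ContentLength < 0 {
+		return fmt.Errorf("no content-length served for %s", tarballURL)
+	}
+	need := resp.ContentLength * 3 // tarball, unpacked files, and slack
+	var st syscall.Statfs_t
+	if err := syscall.Statfs(dir, &st); err != nil {
+		return err
+	}
+	if free := int64(st.Bavail) * int64(st.Bsize); free < need {
+		return fmt.Errorf("insufficient disk space in %s: have %d, need %d", dir, free, need)
+	}
+	return nil
+}
+```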
### Endpoints

From 0e90455dfcabcd7124bd01ae317656635d53e428 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 29 Jul 2024 16:10:28 -0400
Subject: [PATCH 48/84] binary

---
 rfd/0169-auto-updates-linux-agents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 56490acaa1e70..2ee8e1a3813f0 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -292,7 +292,7 @@ It will be distributed as a separate package from Teleport, and manage the insta
 It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan.
 It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`.
 
-Source code for the updater will live in `integrations/updater`.
+Source code for the updater will live in the main Teleport repository, with the updater binary built from `tools/teleport-update`.
 
 ### Installation

From 3cabeb8c3cc56faff07267ad36ad8c7233d19caa Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Fri, 2 Aug 2024 15:19:48 -0400
Subject: [PATCH 49/84] more rollout mechanism changes

---
 rfd/0169-auto-updates-linux-agents.md | 81 ++++++++++++++++-----------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 2ee8e1a3813f0..d0c2fee3ce8e0 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -54,25 +54,16 @@ Whether the updater querying the endpoint is instructed to upgrade (via `agent_a
 To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find`.
 Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID.
 
-Rollouts are specified as interdependent groups of hosts, selected by SSH resource or instance label query.
-A host is eligible to upgrade if the selection query returns true.
-Instance labels are a new feature introduced by this RFD that may be used when SSH service is not running or it is undesirable to reuse SSH labels:
+Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier.
+A host is eligible to upgrade if the upgrade group identifier matches, set in teleport.yaml:
 
 ```
 teleport:
-  labels:
-    environment: staging
-  commands:
-    # this command will add a label 'arch=x86_64' to an instance
-    - name: arch
-      command: ['/bin/uname', '-p']
-      period: 1h0m0s
+  upgrade_group: staging
 ```
 
-Only static and command-based labels may be used.
-
-At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend.
-An arbitrary but UUID-deterministic fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
+At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend.
+A fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
 Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`.
The group rollout is halted if timeouts or failures exceed their specified thresholds. Group rollouts may be retried with `tctl autoupdate run`. @@ -81,29 +72,41 @@ Group rollouts may be retried with `tctl autoupdate run`. Instance heartbeats will now be cached at both the auth server and the proxy. -All rollout logic is trigger by instance heartbeat backend writes, as changes can only occur on these events. +The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. + The following data related to the rollout are stored in each instance heartbeat: - `agent_upgrade_start_time`: timestamp of individual agent's upgrade time -- `agent_upgrade_group_schedule`: schedule type of group (e.g., critical) -- `agent_upgrade_group_name`: name of group (e.g., staging) -- `agent_upgrade_group_start_time`: timestamp of current window start time -- `agent_upgrade_group_end_time`: timestamp of current window start time +- `agent_upgrade_group_name`: name of auto-update group + +At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. +This plan is protected by optimistic locking, and contains the following data: + +Data key: `[name of group]@[scheduled type]` (e.g., `staging@critical`) -At the start of the window, all queried instance heartbeats are marked with updated values for the `agent_upgrade_group_*` fields. -Instance heartbeats are included in the current window if all three fields match the window defined in `cluster_maintenance_config`. +Data value JSON: +- `group_start_time`: timestamp of current window start time +- `group_end_time`: timestamp of current window start time +- `host_order`: list of UUIDs in randomized order + +At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `group_start_time` to the current time and the desired window. +If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan. +The first auth server to write the plan wins; others will be rejected by the optimistic lock. +Auth servers will only write the plan if their instance heartbeat cache is initialized and recently updated. On each instance heartbeat write, the auth server looks at instance heartbeats in cache and determines if additional agents should be upgrading. If they should, additional instance heartbeats are marked as upgrading by setting `agent_upgrade_start_time` to the current time. When `agent_upgrade_start_time` is in the group's window, the proxy serves `agent_auto_upgrade: true` when queried via `/v1/webapi/find`. -To avoid synchronization issues between auth servers, the rollout order is deterministically sorted by UUID. -Two concurrent writes to different auth servers may temporarily result in fewer upgrading instances than desired, but this should be resolved on the next write. + +The predetermined ordering of hosts avoids cache synchronization issues between auth servers. +Two concurrent heartbeat writes to by auth servers may temporarily result in fewer upgrading instances than desired, but this should be resolved on the next write. Upgrading all agents generates the following write load: -- One write of `agent_upgrade_group_*` fields per agent -- One write of `agent_upgrade_start_time` field per agent +- One write of plan. +- One write of `agent_upgrade_start_time` field per agent. All reads are from cache. 
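+A sketch of the optimistic-locking plan write described above; the `Backend` interface and error name are stand-ins for the real backend API:
+
+```golang
+package rollout
+
+import "errors"
+
+// ErrAlreadyExists is a hypothetical error for a lost conditional write.
+var ErrAlreadyExists = errors.New("key already exists")
+
+// Backend is a hypothetical conditional-write interface.
+type Backend interface {
+	// CreateIfMissing fails with ErrAlreadyExists when the key is taken.
+	CreateIfMissing(key string, value []byte) error
+}
+
+// writePlan attempts to claim the rollout plan key for this window.
+// Exactly one auth server wins; losers back off and reuse the winner's plan.
+func writePlan(b Backend, key string, plan []byte) (won bool, err error) {
+	switch err := b.CreateIfMissing(key, plan); {
+	case err == nil:
+		return true, nil
+	case errors.Is(err, ErrAlreadyExists):
+		return false, nil
+	default:
+		return false, err
+	}
+}
+```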
+Each instance heartbeat write will trigger an eventually consistent cache update on all auth servers and proxies, but not agents. If the cache is unhealthy, `agent_auto_update` is still served based on the last available value in cache. This is safe because `agent_upgrade_start_time` is only written once during the upgrade. However, this means that timeout thresholds should account for possible cache init time if initialization occurs right after `agent_upgrade_start_time` is written. @@ -142,15 +145,8 @@ spec: agent_auto_update_groups: # schedule is "regular" or "critical" regular: + # name of the group - name: staging-group - # agents defines which agents are included in the group. - agents: - # node_labels_expression selects agents by SSH resource query. - # default: all connected agents - node_labels_expression: 'labels["environment"]=="staging"' - # instance_labels_expression selects agents by instance query. - # default: all connected agents - instance_labels_expression: 'labels["environment"]=="staging"' # days specifies the days of the week when the group may be upgraded. # default: ["*"] (all days) days: [“Sun”, “Mon”, ... | "*"] @@ -186,7 +182,7 @@ spec: # ... ``` -Note that cycles and dependency chains longer than a week will be rejected. +Cycles and dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. Changing the version or schedule completely resets progress. @@ -285,6 +281,21 @@ Automatic updates configuration has been updated. Notes: - These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +### Version Promotion + +Maintaining the version of different groups of agents is out-of-scope for this RFD. +This means that groups which employ auto-scaling or ephemeral resources will slowly converge to the latest Teleport version. +This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production. + +To solve this in the future, we can add an additional `--group` flag to `teleport-update`: +```shell +$ teleport-update enable --proxy example.teleport.sh --group staging +``` + +This group name could be provided as a parameter to `/v1/webapi/find`, so that newly added resources may install at the group's designated version. + +This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. + ## Details - Linux Agents We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. @@ -438,6 +449,8 @@ To ensure that SELinux permissions do not prevent the `teleport-updater` binary To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date. +To ensure that `teleport` package removal does not lead to a hard restart of Teleport, the updater will ensure that the package is removed without triggering needrestart or similar services. + To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup. 
#### Failure Conditions From 0d492f89f002a4b3b8e02eee1594e84187f6123a Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 7 Aug 2024 16:07:47 -0400 Subject: [PATCH 50/84] scalability --- rfd/0169-auto-updates-linux-agents.md | 112 ++++++++++++++++++-------- 1 file changed, 80 insertions(+), 32 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index d0c2fee3ce8e0..129e29cf1af7c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -7,7 +7,7 @@ state: draft ## Required Approvers -* Engineering: @russjones && @bernardjkim +* Engineering: @russjones * Product: @klizhentas || @xinding33 * Security: Doyensec @@ -18,9 +18,10 @@ This RFD proposes a new mechanism for Teleport agents to automatically update to All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes. The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs: -- Signing of agent artifacts via TUF +- Signing of agent artifacts (e.g., via TUF) - Teleport Cloud APIs for updating agents - Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD. +- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion). This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -31,13 +32,13 @@ Additionally, this RFD parallels the auto-update functionality for client tools The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users. 1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades. -2. The use of system package management requires complex logic for each target distribution. +2. The use of system package management requires logic that varies significantly by target distribution. 3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. 4. The use of bash to implement the updater makes changes difficult and prone to error. 5. The existing auto-updater has limited automated testing. 6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. 7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). -8. The rollout plan for the new agent version is not fully-configurable using tctl. +8. The rollout plan for new agent versions is not fully-configurable using tctl. 9. Agent installation logic is spread between the auto-updater script, install script, auto-discovery script, and documentation. 10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. 11. The existing auto-updater is not self-updating. @@ -51,7 +52,7 @@ Teleport will be updated to serve the desired agent version and edition from `/v The version and edition served from that endpoint will be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. Whether the updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. 
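+For illustration, the updater's side of that exchange could look like the sketch below; the response struct mirrors the JSON fields in this RFD, while the function shape is an assumption:
+
+```golang
+package updater
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/http"
+	"net/url"
+)
+
+// findResponse mirrors the agent-related fields of /v1/webapi/find.
+type findResponse struct {
+	ServerEdition         string `json:"server_edition"`
+	AgentVersion          string `json:"agent_version"`
+	AgentAutoUpdate       bool   `json:"agent_auto_update"`
+	AgentUpdateJitterSecs int    `json:"agent_update_jitter_seconds"`
+}
+
+// queryFind asks the proxy whether this host should update now.
+func queryFind(proxyAddr, hostUUID string) (*findResponse, error) {
+	u := url.URL{
+		Scheme:   "https",
+		Host:     proxyAddr,
+		Path:     "/v1/webapi/find",
+		RawQuery: url.Values{"host": {hostUUID}}.Encode(),
+	}
+	resp, err := http.Get(u.String())
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode != http.StatusOK {
+		return nil, fmt.Errorf("find endpoint returned %s", resp.Status)
+	}
+	var fr findResponse
+	if err := json.NewDecoder(resp.Body).Decode(&fr); err != nil {
+		return nil, err
+	}
+	return &fr, nil
+}
+```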
-To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find`. +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID. Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier. @@ -70,46 +71,85 @@ Group rollouts may be retried with `tctl autoupdate run`. ### Scalability -Instance heartbeats will now be cached at both the auth server and the proxy. +#### Window Capture -The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. - -The following data related to the rollout are stored in each instance heartbeat: -- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time -- `agent_upgrade_group_name`: name of auto-update group +Instance heartbeats will be cached by auth servers using a dedicated cache. +This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats. +The rate is modulated by the total number of instance heartbeats. +The cache is considered healthy when all instance heartbeats present on the backend have been read in a time period that is also modulated by the total number of heartbeats. At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. This plan is protected by optimistic locking, and contains the following data: -Data key: `[name of group]@[scheduled type]` (e.g., `staging@critical`) +Data key: `/autoupdate/[name of group]/[scheduled type](/[page-id])` (e.g., `/autoupdate/staging/critical/8745823`) Data value JSON: -- `group_start_time`: timestamp of current window start time -- `group_end_time`: timestamp of current window start time -- `host_order`: list of UUIDs in randomized order +- `start_time`: timestamp of current window start time +- `version`: version for which this rollout is valid +- `hosts`: list of UUIDs in randomized order +- `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs +- `expiry`: 2 weeks -At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `group_start_time` to the current time and the desired window. +At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `start_time` to the current time and the desired window. If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan. The first auth server to write the plan wins; others will be rejected by the optimistic lock. -Auth servers will only write the plan if their instance heartbeat cache is initialized and recently updated. +Auth servers will only write the plan if their instance heartbeat cache is healthy. -On each instance heartbeat write, the auth server looks at instance heartbeats in cache and determines if additional agents should be upgrading. -If they should, additional instance heartbeats are marked as upgrading by setting `agent_upgrade_start_time` to the current time. -When `agent_upgrade_start_time` is in the group's window, the proxy serves `agent_auto_upgrade: true` when queried via `/v1/webapi/find`. 
+If the list is greater than 100,000 UUIDs, auth servers will first write pages with a randomly generated suffix, in a linked-link, before the atomic non-suffixed write. +If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages. +If cleanup fails, the unusable pages will expire after 2 weeks. +``` +Winning auth: + WRITE: /autoupdate/staging/critical/4324234 | next_page: null + WRITE: /autoupdate/staging/critical/8745823 | next_page: 4324234 + WRITE: /autoupdate/staging/critical | next_page: 8745823 + +Losing auth: + WRITE: /autoupdate/staging/critical/2342343 | next_page: null + WRITE: /autoupdate/staging/critical/7678686 | next_page: 2342343 + WRITE CONFLICT: /autoupdate/staging/critical | next_page: 7678686 + DELETE: /autoupdate/staging/critical/7678686 + DELETE: /autoupdate/staging/critical/2342343 +``` -The predetermined ordering of hosts avoids cache synchronization issues between auth servers. -Two concurrent heartbeat writes to by auth servers may temporarily result in fewer upgrading instances than desired, but this should be resolved on the next write. +#### Rollout + +The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. + +The following data related to the rollout are stored in each instance heartbeat: +- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time +- `agent_upgrade_version`: current agent version +- `expiry`: expiration time of the heartbeat (extended to 24 hours at `agent_upgrade_start_time`) + +Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. +This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. +Instance heartbeats are considered completed when either `agent_upgrade_version` matches the plan version, or `agent_upgrade_start_time` is past the expiration time. +``` +upgrading := make(map[Rollout][UUID]int) +``` -Upgrading all agents generates the following write load: -- One write of plan. -- One write of `agent_upgrade_start_time` field per agent. +On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading. +This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`. +If the stored number is fewer, `agent_upgrade_start_time` is updated to the current time when the heartbeat is written. -All reads are from cache. -Each instance heartbeat write will trigger an eventually consistent cache update on all auth servers and proxies, but not agents. -If the cache is unhealthy, `agent_auto_update` is still served based on the last available value in cache. -This is safe because `agent_upgrade_start_time` is only written once during the upgrade. -However, this means that timeout thresholds should account for possible cache init time if initialization occurs right after `agent_upgrade_start_time` is written. +The auth server writes the index of the last host that is allowed to upgrade to `/autoupdate/[name of group]/[scheduled type]/progress` (e.g., `/autoupdate/staging/critical/progress`). +Writes are rate-limited such that the progress is only updated every 10 seconds. 
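+Spelled out with explicit types, the bookkeeping above is a nested map; `Rollout` and `UUID` are assumed key types for illustration:
+
+```golang
+package rollout
+
+type (
+	Rollout string // assumed rollout identifier, e.g. "staging@critical"
+	UUID    string // assumed host identifier
+)
+
+// unfinished tracks, per rollout plan, how many pending or ongoing
+// upgrades precede each host in the plan's randomized ordering.
+var unfinished = make(map[Rollout]map[UUID]int)
+
+// unfinishedBefore returns the count for a host; indexing a missing
+// rollout or host safely yields zero.
+func unfinishedBefore(r Rollout, host UUID) int {
+	return unfinished[r][host]
+}
+```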
+ +Proxies read all groups and maintain an in-memory map of host UUID to upgrading status: +``` +upgrading := make(map[UUID]bool) +``` +Proxies watch for changes to `/progress` and update the map accordingly. + +When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_auto_upgrade: true`. + +The predetermined ordering of hosts avoids cache synchronization issues between auth servers. +Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. + +Upgrading all agents generates the following additional backend write load: +- One write per page of the rollout plan per upgrade group. +- One write per auth server every 10 seconds, during rollouts. ### Endpoints @@ -185,6 +225,9 @@ spec: Cycles and dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. +The updater will receive `agent_auto_update: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. +After 24 hours, the upgrade is halted in-place, and the group is considered failed if unfinished. + Changing the version or schedule completely resets progress. Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. @@ -228,7 +271,7 @@ $ tctl autoupdate update--set-agent-auto-update=off Automatic updates configuration has been updated. $ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=600 +$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=60 Automatic updates configuration has been updated. $ tctl autoupdate reset Automatic updates configuration has been reset to defaults. @@ -289,7 +332,7 @@ This could lead to a production outage, as the latest Teleport version may not r To solve this in the future, we can add an additional `--group` flag to `teleport-update`: ```shell -$ teleport-update enable --proxy example.teleport.sh --group staging +$ teleport-update enable --proxy example.teleport.sh --group staging-group ``` This group name could be provided as a parameter to `/v1/webapi/find`, so that newly added resources may install at the group's designated version. 
@@ -315,6 +358,11 @@ $ teleport-update enable --proxy example.teleport.sh $ systemctl enable teleport ``` +For air-gapped Teleport installs, the agent may be configured with a custom tarball path template: +```shell +$ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz' +``` + ### Filesystem ``` From de53461c07f37c4dd612c507c7eeb60cd1fc3d82 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 7 Aug 2024 16:39:22 -0400 Subject: [PATCH 51/84] more scalability --- rfd/0169-auto-updates-linux-agents.md | 33 +++++++++++++-------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 129e29cf1af7c..50c6ad8615918 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -20,8 +20,8 @@ All agent installations are in-scope for this proposal, including agents install The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs: - Signing of agent artifacts (e.g., via TUF) - Teleport Cloud APIs for updating agents -- Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD. -- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion). +- Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD +- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion) This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -44,20 +44,18 @@ The existing mechanism for automatic agent updates does not provide a hands-off 11. The existing auto-updater is not self-updating. 12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). -We must provide a seamless, hands-off experience for auto-updates that is easy to maintain. +We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain. ## Details - Teleport API -Teleport will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. -Whether the updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `cluster_maintenance_config` and `autoupdate_version` resources. +Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID. 
+Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier. -A host is eligible to upgrade if the upgrade group identifier matches, set in teleport.yaml: - ``` teleport: upgrade_group: staging @@ -75,7 +73,7 @@ Group rollouts may be retried with `tctl autoupdate run`. Instance heartbeats will be cached by auth servers using a dedicated cache. This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats. -The rate is modulated by the total number of instance heartbeats. +The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters. The cache is considered healthy when all instance heartbeats present on the backend have been read in a time period that is also modulated by the total number of heartbeats. At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. @@ -86,7 +84,7 @@ Data key: `/autoupdate/[name of group]/[scheduled type](/[page-id])` (e.g., `/au Data value JSON: - `start_time`: timestamp of current window start time - `version`: version for which this rollout is valid -- `hosts`: list of UUIDs in randomized order +- `hosts`: list of host UUIDs in randomized order - `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs - `expiry`: 2 weeks @@ -125,8 +123,8 @@ The following data related to the rollout are stored in each instance heartbeat: Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. Instance heartbeats are considered completed when either `agent_upgrade_version` matches the plan version, or `agent_upgrade_start_time` is past the expiration time. -``` -upgrading := make(map[Rollout][UUID]int) +```golang +unfinished := make(map[Rollout][UUID]int) ``` On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading. @@ -137,7 +135,7 @@ The auth server writes the index of the last host that is allowed to upgrade to Writes are rate-limited such that the progress is only updated every 10 seconds. Proxies read all groups and maintain an in-memory map of host UUID to upgrading status: -``` +```golang upgrading := make(map[UUID]bool) ``` Proxies watch for changes to `/progress` and update the map accordingly. @@ -163,7 +161,6 @@ Upgrading all agents generates the following additional backend write load: } ``` Notes: -- The Teleport proxy uses `cluster_maintenance_config` and `autoupdate_config` (below) to determine the time when the served `agent_auto_update` is `true` for the provided host UUID. - Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is ready from `/var/lib/teleport` by the updater. @@ -222,7 +219,8 @@ spec: # ... ``` -Cycles and dependency chains longer than a week will be rejected. +Dependency cycles are rejected. 
+Dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. The updater will receive `agent_auto_update: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. @@ -328,7 +326,8 @@ Notes: Maintaining the version of different groups of agents is out-of-scope for this RFD. This means that groups which employ auto-scaling or ephemeral resources will slowly converge to the latest Teleport version. -This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production. + +**This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production.** To solve this in the future, we can add an additional `--group` flag to `teleport-update`: ```shell From 34a82cdec79542d209bff87fe1e2ea27de2eb26f Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 8 Aug 2024 17:29:29 -0400 Subject: [PATCH 52/84] use 100kib pages for plan --- rfd/0169-auto-updates-linux-agents.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 50c6ad8615918..291993f274676 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -93,7 +93,10 @@ If a new plan is needed, auth servers will query their cache of instance heartbe The first auth server to write the plan wins; others will be rejected by the optimistic lock. Auth servers will only write the plan if their instance heartbeat cache is healthy. -If the list is greater than 100,000 UUIDs, auth servers will first write pages with a randomly generated suffix, in a linked-link, before the atomic non-suffixed write. +If the resource size is greater than 100 KiB, auth servers will divide the resource into pages no greater than 100 KiB each. +Each page will duplicate all values besides `hosts`, which will be different for each page. +All pages besides the first page will be suffixed with a randomly generated number. +Pages will be written in reverse order, in a linked-link, before the final atomic non-suffixed write of the first page. If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages. If cleanup fails, the unusable pages will expire after 2 weeks. From 5a62d6ba6f33540ef8a0b624d607532baffd0377 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 14:03:19 -0400 Subject: [PATCH 53/84] Add RPCs, tweak API design --- rfd/0169-auto-updates-linux-agents.md | 412 +++++++++++++++++++++++--- 1 file changed, 370 insertions(+), 42 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 291993f274676..d3e693d9a56a6 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -49,8 +49,8 @@ We must provide a seamless, hands-off experience for auto-updates of Teleport Ag ## Details - Teleport API Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `cluster_maintenance_config` and `autoupdate_version` resources. 
-Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `cluster_autoupdate_config` and `autoupdate_version` resources. +Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_autoupdate`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. @@ -79,11 +79,12 @@ The cache is considered healthy when all instance heartbeats present on the back At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. This plan is protected by optimistic locking, and contains the following data: -Data key: `/autoupdate/[name of group]/[scheduled type](/[page-id])` (e.g., `/autoupdate/staging/critical/8745823`) +Data key: `/autoupdate/[name of group](/[page-id])` (e.g., `/autoupdate/staging/8745823`) Data value JSON: - `start_time`: timestamp of current window start time - `version`: version for which this rollout is valid +- `schedule`: type of schedule that triggered the rollout - `hosts`: list of host UUIDs in randomized order - `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs - `expiry`: 2 weeks @@ -102,16 +103,16 @@ If cleanup fails, the unusable pages will expire after 2 weeks. ``` Winning auth: - WRITE: /autoupdate/staging/critical/4324234 | next_page: null - WRITE: /autoupdate/staging/critical/8745823 | next_page: 4324234 - WRITE: /autoupdate/staging/critical | next_page: 8745823 + WRITE: /autoupdate/staging/4324234 | next_page: null + WRITE: /autoupdate/staging/8745823 | next_page: 4324234 + WRITE: /autoupdate/staging | next_page: 8745823 Losing auth: - WRITE: /autoupdate/staging/critical/2342343 | next_page: null - WRITE: /autoupdate/staging/critical/7678686 | next_page: 2342343 - WRITE CONFLICT: /autoupdate/staging/critical | next_page: 7678686 - DELETE: /autoupdate/staging/critical/7678686 - DELETE: /autoupdate/staging/critical/2342343 + WRITE: /autoupdate/staging/2342343 | next_page: null + WRITE: /autoupdate/staging/7678686 | next_page: 2342343 + WRITE CONFLICT: /autoupdate/staging | next_page: 7678686 + DELETE: /autoupdate/staging/7678686 + DELETE: /autoupdate/staging/2342343 ``` #### Rollout @@ -134,55 +135,66 @@ On each instance heartbeat write, the auth server looks at the data structure to This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`. If the stored number is fewer, `agent_upgrade_start_time` is updated to the current time when the heartbeat is written. -The auth server writes the index of the last host that is allowed to upgrade to `/autoupdate/[name of group]/[scheduled type]/progress` (e.g., `/autoupdate/staging/critical/progress`). 
+The auth server writes the following keys to `/autoupdate/[name of group]/status` (e.g., `/autoupdate/staging/status`): +- `last_active_host_index`: index of the last host allowed to upgrade +- `failed_host_count`: failed host count +- `timeout_host_count`: timed-out host count + Writes are rate-limited such that the progress is only updated every 10 seconds. +If the auth server's cached progress is greater than its calculated progress, the auth server declines to update the progress. + +The predetermined ordering of hosts avoids cache synchronization issues between auth servers. +Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. -Proxies read all groups and maintain an in-memory map of host UUID to upgrading status: +Each group rollout is represented by an `agent_rollout_plan` Teleport resource that includes the progress and host count, but not the list of UUIDs. +Proxies use the start time in the resource to determine when to stream the list of UUIDs via a dedicated RPC. +Proxies watch the status section of `agent_rollout_plan` for updates to progress. + +Proxies read all started rollouts and maintain an in-memory map of host UUID to upgrading status: ```golang upgrading := make(map[UUID]bool) ``` -Proxies watch for changes to `/progress` and update the map accordingly. +Proxies watch for changes to the plan and update the map accordingly. When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_auto_upgrade: true`. -The predetermined ordering of hosts avoids cache synchronization issues between auth servers. -Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. - Upgrading all agents generates the following additional backend write load: - One write per page of the rollout plan per upgrade group. - One write per auth server every 10 seconds, during rollouts. -### Endpoints +### REST Endpoints `/v1/webapi/find?host=[host_uuid]` ```json { "server_edition": "enterprise", "agent_version": "15.1.1", - "agent_auto_update": true, + "agent_autoupdate": true, "agent_update_jitter_seconds": 10 } ``` Notes: -- Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`. +- Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is ready from `/var/lib/teleport` by the updater. ### Teleport Resources +#### Scheduling + ```yaml -kind: cluster_maintenance_config +kind: cluster_autoupdate_config spec: - # agent_auto_update allows turning agent updates on or off at the + # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed # agent updates are in place. - agent_auto_update: true|false + agent_autoupdate: true|false - # agent_auto_update_groups contains both "regular" and "critical" schedules. + # agent_group_schedules contains both "regular" and "critical" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. 
# Groups are not configurable with the "immediate" schedule. - agent_auto_update_groups: + agent_group_schedules: # schedule is "regular" or "critical" regular: # name of the group @@ -197,9 +209,6 @@ spec: # The agent upgrader client will pick a random time within this duration to wait to upgrade. # default: 0 jitter_seconds: 0-60 - # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. - # default: 100% - max_in_flight: 0-100% # timeout_seconds specifies the amount of time, after the specified jitter, after which # an agent upgrade will be considered timed out if the version does not change. # default: 60 @@ -208,14 +217,17 @@ spec: # failed if the agent heartbeat stops before the upgrade is complete. # default: 0 failure_seconds: 0-900 - # max_failed_before_halt specifies the percentage of clients that may fail before this group - # and all dependent groups are halted. - # default: 0 - max_failed_before_halt: 0-100% + # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. + # default: 100% + max_in_flight: 0-100% # max_timeout_before_halt specifies the percentage of clients that may time out before this group # and all dependent groups are halted. # default: 10% max_timeout_before_halt: 0-100% + # max_failed_before_halt specifies the percentage of clients that may fail before this group + # and all dependent groups are halted. + # default: 0 + max_failed_before_halt: 0-100% # requires specifies groups that must pass with the current version before this group is allowed # to run using that version. requires: ["test-group"] @@ -226,24 +238,28 @@ Dependency cycles are rejected. Dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. -The updater will receive `agent_auto_update: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. +The updater will receive `agent_autoupdate: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. After 24 hours, the upgrade is halted in-place, and the group is considered failed if unfinished. Changing the version or schedule completely resets progress. Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. -Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with `agent_auto_update` field. -This field will remain indefinitely, to cover agents that do not present a known host UUID, as well as connected agents that are not matched to a group. +Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with `agent_default_schedules` field. +This field will remain indefinitely to cover connected agents that are not matched to a group. ```yaml -kind: cluster_maintenance_config +kind: cluster_autoupdate_config spec: - # ... + # agent_autoupdate allows turning agent updates on or off at the + # cluster level. Only turn agent automatic updates off if self-managed + # agent updates are in place. + agent_autoupdate: true|false - # agent_auto_update contains "regular," "critical," and "immediate" schedules. + # agent_default_schedules contains "regular," "critical," and "immediate" schedules. + # These schedules apply to agents not scheduled by agent_group_schedules. 
# The schedule used is determined by the agent_version_schedule associated - # with the version in autoupdate_version. - agent_auto_update: + # with the agent_version in the autoupdate_version resource. + agent_default_schedules: # The immediate schedule results in all agents updating simultaneously. # Only client-side jitter is configurable. immediate: @@ -265,6 +281,7 @@ spec: # ... ``` +To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `default` `agent_rollout_plan` will be created. ```shell # configuration @@ -274,6 +291,8 @@ $ tctl autoupdate update --schedule regular --group staging-group --set-start-ho Automatic updates configuration has been updated. $ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=60 Automatic updates configuration has been updated. +$ tctl autoupdate update --schedule regular --default --set-jitter-seconds=60 +Automatic updates configuration has been updated. $ tctl autoupdate reset Automatic updates configuration has been reset to defaults. @@ -323,7 +342,32 @@ Automatic updates configuration has been updated. ``` Notes: -- These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +- `autoupdate_version` is separate from `cluster_autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. + +#### Rollout + +```yaml +kind: agent_rollout_plan +spec: + # start time of the rollout + start_time: 0001-01-01T00:00:00Z + # target version of the rollout + version: X.Y.Z + # schedule that triggered the rollout + schedule: regular + # hosts updated by the rollout + host_count: 127 +status: + # current host index in rollout progress + last_active_host_index: 23 + # failed hosts + failed_host_count: 3 + # timed-out hosts + timeout_host_count: 1 +``` + +Notes: +- This resource is stored in a paginated format with separate keys for each page and progress ### Version Promotion @@ -477,7 +521,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. -3. Check that `agent_auto_updates` is true, quit otherwise. +3. Check that `agent_autoupdates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. @@ -624,6 +668,8 @@ Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX The existing update scheduling system will remain in-place until the old auto-updater is fully deprecated. +Eventually, the `cluster_maintenance_config` resource will be deprecated. + ## Security The initial version of automatic updates will rely on TLS to establish @@ -645,6 +691,288 @@ Care will be taken to ensure that updater logs are sharable with Teleport Suppor When TUF is added, that events related to supply chain security may be sent to the Teleport cluster via the Teleport Agent. +## Protobuf API Changes + +Note: all updates use revisions to prevent data loss in case of concurrent access. 
+ +### clusterconfig/v1 + +```protobuf +syntax = "proto3"; + +package teleport.clusterconfig.v1; + +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/clusterconfig/v1;clusterconfigv1"; + +// ClusterConfigService provides methods to manage cluster configuration resources. +service ClusterConfigService { + // ... + + // GetClusterAutoupdateConfig updates the cluster autoupdate config. + rpc GetClusterAutoupdateConfig(GetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); + // CreateClusterAutoupdateConfig creates the cluster autoupdate config. + rpc CreateClusterAutoupdateConfig(CreateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); + // UpdateClusterAutoupdateConfig updates the cluster autoupdate config. + rpc UpdateClusterAutoupdateConfig(UpdateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); + // UpsertClusterAutoupdateConfig overwrites the cluster autoupdate config. + rpc UpsertClusterAutoupdateConfig(UpsertClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); + // ResetClusterAutoupdateConfig restores the cluster autoupdate config to default values. + rpc ResetClusterAutoupdateConfig(ResetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); +} + +// GetClusterAutoupdateConfigRequest requests the contents of the ClusterAutoupdateConfig. +message GetClusterAutoupdateConfigRequest {} + +// CreateClusterAutoupdateConfigRequest requests creation of the the ClusterAutoupdateConfig. +message CreateClusterAutoupdateConfigRequest { + ClusterAutoupdateConfig cluster_autoupdate_config = 1; +} + +// UpdateClusterAutoupdateConfigRequest requests an update of the the ClusterAutoupdateConfig. +message UpdateClusterAutoupdateConfigRequest { + ClusterAutoupdateConfig cluster_autoupdate_config = 1; +} + +// UpsertClusterAutoupdateConfigRequest requests an upsert of the the ClusterAutoupdateConfig. +message UpsertClusterAutoupdateConfigRequest { + ClusterAutoupdateConfig cluster_autoupdate_config = 1; +} + +// ResetClusterAutoupdateConfigRequest requests a reset of the the ClusterAutoupdateConfig to default values. +message ResetClusterAutoupdateConfigRequest {} + +// ClusterAutoupdateConfig holds dynamic configuration settings for cluster maintenance activities. +message ClusterAutoupdateConfig { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + ClusterAutoupdateConfigSpec spec = 7; +} + +// ClusterAutoupdateConfigSpec is the spec for the cluster autoupdate config. +message ClusterAutoupdateConfigSpec { + // agent_autoupdate specifies whether agent autoupdates are enabled. + bool agent_autoupdate = 1; + // agent_default_schedules specifies schedules for upgrades of agents. + // not scheduled by agent_group_schedules. + AgentAutoupdateDefaultSchedules agent_default_schedules = 2; + // agent_group_schedules specifies schedules for upgrades of grouped agents. + AgentAutoupdateGroupSchedules agent_group_schedules = 3; +} + +// AgentAutoupdateDefaultSchedules specifies the default update schedules for non-grouped agent. +message AgentAutoupdateDefaultSchedules { + // regular schedule for non-critical versions. + AgentAutoupdateSchedule regular = 1; + // critical schedule for urgently needed versions. 
+ AgentAutoupdateSchedule critical = 2; + // immediate schedule for versions that must be deployed with no delay. + AgentAutoupdateImmediateSchedule immediate = 3; +} + +// AgentAutoupdateSchedule specifies a default schedule for non-grouped agents. +message AgentAutoupdateSchedule { + // days to run update + repeated Day days = 2; + // start_hour to initiate update + int32 start_hour = 3; + // jitter_seconds to introduce before update as rand([0, jitter_seconds]). + int32 jitter_seconds = 4; +} + +// AgentAutoupdateSchedule specifies a default schedule for non-grouped agents on the immediate scehdule. +message AgentAutoupdateImmediateSchedule { + // jitter to introduce before update as rand([0, jitter_seconds]). + int32 jitter_seconds = 4; +} + +// AgentAutoupdateGroupSchedules specifies update scheduled for grouped agents. +message AgentAutoupdateGroupSchedules { + // regular schedules for non-critical versions. + repeated AgentAutoupdateGroup regular = 1; + // critical schedules for urgently needed versions. + repeated AgentAutoupdateGroup critical = 2; +} + +// AgentAutoupdateGroup specifies the update schedule for a group of agents. +message AgentAutoupdateGroup { + // name of the group + string name = 1; + // days to run update + repeated Day days = 2; + // start_hour to initiate update + int32 start_hour = 3; + // jitter_seconds to introduce before update as rand([0, jitter_seconds]). + int32 jitter_seconds = 4; + // timeout_seconds before an agent is considered time-out (no version change) + int32 timeout_seconds = 5; + // failure_seconds before an agent is considered failed (loses connection) + int32 failure_seconds = 6; + // max_in_flight specifies agents that can be upgraded at the same time, by percent. + string max_in_flight = 7; + // max_timeout_before_halt specifies agents that can timeout before the rollout is halted, by percent. + string max_timeout_before_halt = 8; + // max_failed_before_halt specifies agents that can fail before the rollout is halted, by percent. + string max_failed_before_halt = 9; + // requires specifies rollout groups that must succeed for the current version/schedule before this rollout can run. + repeated string requires = 10; +} + +// Day of the week +enum Day { + ALL = 0; + SUNDAY = 1; + MONDAY = 2; + TUESDAY = 3; + WEDNESDAY = 4; + THURSDAY = 5; + FRIDAY = 6; + SATURDAY = 7; +} +``` + +### autoupdate/v1 + +```protobuf +syntax = "proto3"; + +package teleport.autoupdate.v1; + +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; + +// AutoupdateService serves agent and client automatic version updates. +service AutoupdateService { + // GetAutoupdateVersion returns the autoupdate version. + rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion); + // CreateAutoupdateVersion creates the autoupdate version. + rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion); + // UpdateAutoupdateVersion updates the autoupdate version. + rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion); + // UpsertAutoupdateVersion overwrites the autoupdate version. + rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion); + + // GetAgentRolloutPlan returns the agent rollout plan and current progress. + rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan); + // GetAutoupdateVersion streams the agent rollout plan's list of all hosts. 
+ rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost); +} + +// GetAutoupdateVersionRequest requests the autoupdate_version singleton resource. +message GetAutoupdateVersionRequest {} + +// GetAutoupdateVersionRequest requests creation of the autoupdate_version singleton resource. +message CreateAutoupdateVersionRequest { + // autoupdate_version resource contents + AutoupdateVersion autoupdate_version = 1; +} + +// GetAutoupdateVersionRequest requests an update of the autoupdate_version singleton resource. +message UpdateAutoupdateVersionRequest { + // autoupdate_version resource contents + AutoupdateVersion autoupdate_version = 1; +} + +// GetAutoupdateVersionRequest requests an upsert of the autoupdate_version singleton resource. +message UpsertAutoupdateVersionRequest { + // autoupdate_version resource contents + AutoupdateVersion autoupdate_version = 1; +} + +// AutoupdateVersion holds dynamic configuration settings for autoupdate versions. +message AutoupdateVersion { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AutoupdateVersionSpec spec = 6; +} + +// AutoupdateVersionSpec is the spec for the autoupdate version. +message AutoupdateVersionSpec { + // agent_version is the desired agent version for new rollouts. + string agent_version = 1; + // agent_version schedule is the schedule to use for rolling out the agent_version. + Schedule agent_version_schedule = 2; +} + +// Schedule type for the rollout +enum Schedule { + // REGULAR update schedule + REGULAR = 0; + // CRITICAL update schedule for critical bugs and vulnerabilities + CRITICAL = 1; + // IMMEDIATE update schedule for updating all agents immediately + IMMEDIATE = 2; +} + +// GetAgentRolloutPlanRequest requests an agent_rollout_plan. +message GetAgentRolloutPlanRequest { + // name of the agent_rollout_plan + string name = 1; +} + +// GetAgentRolloutPlanHostsRequest requests the ordered host UUIDs for an agent_rollout_plan. +message GetAgentRolloutPlanHostsRequest { + // name of the agent_rollout_plan + string name = 1; +} + +// AgentRolloutPlan defines a version update rollout consisting a fixed group of agents. +message AgentRolloutPlan { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AgentRolloutPlanSpec spec = 5; + // status is the status of the resource. + AgentRolloutPlanStatus status = 6; +} + +// AutoupdateVersionSpec is the spec for the autoupdate version. +message AgentRolloutPlanSpec { + // start_time of the rollout + google.protobuf.Timestamp start_time = 1; + // version targetted by the rollout + string version = 2; + // schedule that triggered the rollout + string schedule = 3; + // host_count of hosts to update + int64 host_count = 4; +} + +// AutoupdateVersionSpec is the spec for the autoupdate version. +message AgentRolloutPlanStatus { + // last_active_host_index specifies the index of the last host that may be updated. + int64 last_active_host_index = 1; + // failed_host_count specifies the number of failed hosts. 
+ int64 failed_host_count = 2; + // timeout_host_count specifies the number of timed-out hosts. + int64 timeout_host_count = 3; +} + +// AgentRolloutPlanHost identifies an agent by host ID +message AgentRolloutPlanHost { + // host_id of a host included in the rollout + string host_id = 1; +} +``` + ## Execution Plan 1. Implement Teleport APIs for new scheduling system (without groups and backpressure) From 0362cd19e8d8735785dd7fa35963e2daaa563e69 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 14:14:40 -0400 Subject: [PATCH 54/84] clarify wording --- rfd/0169-auto-updates-linux-agents.md | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index d3e693d9a56a6..d57a24b786a55 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -13,7 +13,9 @@ state: draft ## What -This RFD proposes a new mechanism for Teleport agents to automatically update to a version scheduled by an operator via tctl. +This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. + +Users of Teleport will be able to use the tctl CLI to specify desired versions and update schedules. All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes. @@ -43,19 +45,25 @@ The existing mechanism for automatic agent updates does not provide a hands-off 10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. 11. The existing auto-updater is not self-updating. 12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). +13. There is no phased rollout mechanism for updates. +14. There is no way to automatically detect and halt failed updates. -We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain. +We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain and safer for production use. ## Details - Teleport API Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `cluster_autoupdate_config` and `autoupdate_version` resources. -Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_autoupdate`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_version` resource. + +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The schedule defined in the new `cluster_autoupdate_config` resource +- The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier. +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `teleport.yaml` file. 
```
teleport:
  upgrade_group: staging
@@ -65,7 +73,7 @@ At the start of a group rollout, the Teleport auth server captures the desired g
A fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
Additional hosts are instructed to update as earlier updates complete, time out, or fail, never exceeding `max_in_flight`.
The group rollout is halted if timeouts or failures exceed their specified thresholds.
-Group rollouts may be retried with `tctl autoupdate run`.
+Rollouts may be retried with `tctl autoupdate run`.

### Scalability

From 107092701aa24de84a85ddeb60e03d0cbb1cfd6b Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Tue, 13 Aug 2024 14:24:35 -0400
Subject: [PATCH 55/84] wording

---
 rfd/0169-auto-updates-linux-agents.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index d57a24b786a55..a79e7962a56da 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -82,7 +82,7 @@ Rollouts may be retried with `tctl autoupdate run`.

Instance heartbeats will be cached by auth servers using a dedicated cache.
This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats.
The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters.
-The cache is considered healthy when all instance heartbeats present on the backend have been read in a time period that is also modulated by the total number of heartbeats.
+The cache is considered healthy when all instance heartbeats present on the backend have been read within a time period that is also modulated by the total number of heartbeats.

At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key.
This plan is protected by optimistic locking, and contains the following data:
@@ -95,7 +95,8 @@ Data value JSON:
- `schedule`: type of schedule that triggered the rollout
- `hosts`: list of host UUIDs in randomized order
- `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs
-- `expiry`: 2 weeks
+
+Expiration time of each key is 2 weeks.

At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `start_time` to the current time and the desired window.
If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan.
The first auth server to write the plan wins; others will be rejected by the optimistic lock.
Auth servers will only write the plan if their instance heartbeat cache is healthy.

If the resource size is greater than 100 KiB, auth servers will divide the resource into pages no greater than 100 KiB each.
+This is necessary to support backends with a value size limit.
+
Each page will duplicate all values besides `hosts`, which will be different for each page.
All pages besides the first page will be suffixed with a randomly generated number.
Pages will be written in reverse order, in a linked-list, before the final atomic non-suffixed write of the first page.
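As an illustration of that ordering, the sketch below writes the suffixed pages first and publishes the plan with a final write of the non-suffixed key; the backend interface, key scheme, and serialization here are assumptions of this sketch, not the final implementation:

```golang
package main

import (
	"context"
	"fmt"
	"math/rand"
)

// Backend models the single operation this sketch needs.
type Backend interface {
	Put(ctx context.Context, key string, value []byte) error
}

// planPage is a placeholder serialization; next points at the following
// (suffixed) page key, forming the linked list described above.
func planPage(hosts []string, next string) []byte {
	return []byte(fmt.Sprintf("hosts: %v\nnext_page: %q", hosts, next))
}

// writePlanPages writes pages in reverse order so that the final write of the
// non-suffixed first page atomically publishes the whole plan: readers that
// follow next_page links never observe a partially written plan.
func writePlanPages(ctx context.Context, b Backend, group string, pages [][]string) error {
	next := ""
	for i := len(pages) - 1; i >= 1; i-- {
		key := fmt.Sprintf("/autoupdate/%s/plan/%d", group, rand.Int63())
		if err := b.Put(ctx, key, planPage(pages[i], next)); err != nil {
			return err
		}
		next = key
	}
	return b.Put(ctx, "/autoupdate/"+group+"/plan", planPage(pages[0], next))
}
```

If a writer crashes mid-way, only unreachable suffixed pages are left behind, and the per-key expiration described above cleans them up.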
@@ -130,7 +133,8 @@ The rollout logic is progressed by instance heartbeat backend writes, as changes The following data related to the rollout are stored in each instance heartbeat: - `agent_upgrade_start_time`: timestamp of individual agent's upgrade time - `agent_upgrade_version`: current agent version -- `expiry`: expiration time of the heartbeat (extended to 24 hours at `agent_upgrade_start_time`) + +Expiration time of the heartbeat is extended to 24 hours when `agent_upgrade_start_time` is written. Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. @@ -184,7 +188,7 @@ Upgrading all agents generates the following additional backend write load: Notes: - Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The host UUID is ready from `/var/lib/teleport` by the updater. +- The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. ### Teleport Resources @@ -289,7 +293,7 @@ spec: # ... ``` -To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `default` `agent_rollout_plan` will be created. +To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `agent_rollout_plan` named `default` will be employed. ```shell # configuration From 7b384ff3bcefca172204d4c31e1c5125c27feb1e Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 17:30:39 -0400 Subject: [PATCH 56/84] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com> --- rfd/0169-auto-updates-linux-agents.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index a79e7962a56da..16072331ff563 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -838,14 +838,15 @@ message AgentAutoupdateGroup { // Day of the week enum Day { - ALL = 0; - SUNDAY = 1; - MONDAY = 2; - TUESDAY = 3; - WEDNESDAY = 4; - THURSDAY = 5; - FRIDAY = 6; - SATURDAY = 7; + DAY_UNSPECIFIED = 0; + DAY_ALL = 1; + DAY_SUNDAY = 2; + DAY_MONDAY = 3; + DAY_TUESDAY = 4; + DAY_WEDNESDAY = 5; + DAY_THURSDAY = 6; + DAY_FRIDAY = 7; + DAY_SATURDAY = 8; } ``` From 4b03c02551558d84eb182fcb6494ecf10491dbc4 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 17:30:52 -0400 Subject: [PATCH 57/84] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com> --- rfd/0169-auto-updates-linux-agents.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 16072331ff563..6156b3e21fbd9 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -921,12 +921,13 @@ message AutoupdateVersionSpec { // Schedule type for the rollout enum Schedule { + Schedule_UNSPECIFIED = 0; // REGULAR update schedule - REGULAR = 0; + Schedule_REGULAR = 1; // CRITICAL update schedule for critical bugs and vulnerabilities - CRITICAL = 1; + Schedule_CRITICAL = 2; // IMMEDIATE update 
schedule for updating all agents immediately - IMMEDIATE = 2; + Schedule_IMMEDIATE = 3; } // GetAgentRolloutPlanRequest requests an agent_rollout_plan. From beb7c97235687463befdf826164f7c8c6d897e00 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 17:37:59 -0400 Subject: [PATCH 58/84] linting --- rfd/0169-auto-updates-linux-agents.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 6156b3e21fbd9..4299ace12af7a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -921,13 +921,14 @@ message AutoupdateVersionSpec { // Schedule type for the rollout enum Schedule { - Schedule_UNSPECIFIED = 0; + // UNSPECIFIED update schedule + SCHEDULE_UNSPECIFIED = 0; // REGULAR update schedule - Schedule_REGULAR = 1; + SCHEDULE_REGULAR = 1; // CRITICAL update schedule for critical bugs and vulnerabilities - Schedule_CRITICAL = 2; + SCHEDULE_CRITICAL = 2; // IMMEDIATE update schedule for updating all agents immediately - Schedule_IMMEDIATE = 3; + SCHEDULE_IMMEDIATE = 3; } // GetAgentRolloutPlanRequest requests an agent_rollout_plan. From a6403eee651cd112d7bb1f2cfab1aa6279d95636 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 21 Aug 2024 22:04:31 -0400 Subject: [PATCH 59/84] Move all RPCs into autoupdate/v1 --- rfd/0169-auto-updates-linux-agents.md | 121 +++++++++++--------------- 1 file changed, 53 insertions(+), 68 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 4299ace12af7a..3dfe6a35a7c1d 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -57,7 +57,7 @@ The version and edition served from that endpoint will be configured using new ` Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: - The `host=[uuid]` parameter sent to `/v1/webapi/find` -- The schedule defined in the new `cluster_autoupdate_config` resource +- The schedule defined in the new `autoupdate_config` resource - The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. @@ -195,7 +195,7 @@ Notes: #### Scheduling ```yaml -kind: cluster_autoupdate_config +kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed @@ -260,7 +260,7 @@ Note the MVP version of this resource will not support host UUIDs, groups, or ba This field will remain indefinitely to cover connected agents that are not matched to a group. ```yaml -kind: cluster_autoupdate_config +kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed @@ -354,7 +354,7 @@ Automatic updates configuration has been updated. ``` Notes: -- `autoupdate_version` is separate from `cluster_autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +- `autoupdate_version` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. 
#### Rollout @@ -707,54 +707,66 @@ When TUF is added, that events related to supply chain security may be sent to t Note: all updates use revisions to prevent data loss in case of concurrent access. -### clusterconfig/v1 +### autoupdate/v1 ```protobuf syntax = "proto3"; -package teleport.clusterconfig.v1; +package teleport.autoupdate.v1; + +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; -option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/clusterconfig/v1;clusterconfigv1"; +// AutoupdateService serves agent and client automatic version updates. +service AutoupdateService { + // GetAutoupdateConfig updates the autoupdate config. + rpc GetAutoupdateConfig(GetAutoupdateConfigRequest) returns (AutoupdateConfig); + // CreateAutoupdateConfig creates the autoupdate config. + rpc CreateAutoupdateConfig(CreateAutoupdateConfigRequest) returns (AutoupdateConfig); + // UpdateAutoupdateConfig updates the autoupdate config. + rpc UpdateAutoupdateConfig(UpdateAutoupdateConfigRequest) returns (AutoupdateConfig); + // UpsertAutoupdateConfig overwrites the autoupdate config. + rpc UpsertAutoupdateConfig(UpsertAutoupdateConfigRequest) returns (AutoupdateConfig); + // ResetAutoupdateConfig restores the autoupdate config to default values. + rpc ResetAutoupdateConfig(ResetAutoupdateConfigRequest) returns (AutoupdateConfig); -// ClusterConfigService provides methods to manage cluster configuration resources. -service ClusterConfigService { - // ... + // GetAutoupdateVersion returns the autoupdate version. + rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion); + // CreateAutoupdateVersion creates the autoupdate version. + rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion); + // UpdateAutoupdateVersion updates the autoupdate version. + rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion); + // UpsertAutoupdateVersion overwrites the autoupdate version. + rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion); - // GetClusterAutoupdateConfig updates the cluster autoupdate config. - rpc GetClusterAutoupdateConfig(GetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // CreateClusterAutoupdateConfig creates the cluster autoupdate config. - rpc CreateClusterAutoupdateConfig(CreateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // UpdateClusterAutoupdateConfig updates the cluster autoupdate config. - rpc UpdateClusterAutoupdateConfig(UpdateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // UpsertClusterAutoupdateConfig overwrites the cluster autoupdate config. - rpc UpsertClusterAutoupdateConfig(UpsertClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // ResetClusterAutoupdateConfig restores the cluster autoupdate config to default values. - rpc ResetClusterAutoupdateConfig(ResetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); + // GetAgentRolloutPlan returns the agent rollout plan and current progress. + rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan); + // GetAutoupdateVersion streams the agent rollout plan's list of all hosts. + rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost); } -// GetClusterAutoupdateConfigRequest requests the contents of the ClusterAutoupdateConfig. 
-message GetClusterAutoupdateConfigRequest {} +// GetAutoupdateConfigRequest requests the contents of the AutoupdateConfig. +message GetAutoupdateConfigRequest {} -// CreateClusterAutoupdateConfigRequest requests creation of the the ClusterAutoupdateConfig. -message CreateClusterAutoupdateConfigRequest { - ClusterAutoupdateConfig cluster_autoupdate_config = 1; +// CreateAutoupdateConfigRequest requests creation of the the AutoupdateConfig. +message CreateAutoupdateConfigRequest { + AutoupdateConfig autoupdate_config = 1; } -// UpdateClusterAutoupdateConfigRequest requests an update of the the ClusterAutoupdateConfig. -message UpdateClusterAutoupdateConfigRequest { - ClusterAutoupdateConfig cluster_autoupdate_config = 1; +// UpdateAutoupdateConfigRequest requests an update of the the AutoupdateConfig. +message UpdateAutoupdateConfigRequest { + AutoupdateConfig autoupdate_config = 1; } -// UpsertClusterAutoupdateConfigRequest requests an upsert of the the ClusterAutoupdateConfig. -message UpsertClusterAutoupdateConfigRequest { - ClusterAutoupdateConfig cluster_autoupdate_config = 1; +// UpsertAutoupdateConfigRequest requests an upsert of the the AutoupdateConfig. +message UpsertAutoupdateConfigRequest { + AutoupdateConfig autoupdate_config = 1; } -// ResetClusterAutoupdateConfigRequest requests a reset of the the ClusterAutoupdateConfig to default values. -message ResetClusterAutoupdateConfigRequest {} +// ResetAutoupdateConfigRequest requests a reset of the the AutoupdateConfig to default values. +message ResetAutoupdateConfigRequest {} -// ClusterAutoupdateConfig holds dynamic configuration settings for cluster maintenance activities. -message ClusterAutoupdateConfig { +// AutoupdateConfig holds dynamic configuration settings for automatic updates. +message AutoupdateConfig { // kind is the kind of the resource. string kind = 1; // sub_kind is the sub kind of the resource. @@ -764,11 +776,11 @@ message ClusterAutoupdateConfig { // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; // spec is the spec of the resource. - ClusterAutoupdateConfigSpec spec = 7; + AutoupdateConfigSpec spec = 7; } -// ClusterAutoupdateConfigSpec is the spec for the cluster autoupdate config. -message ClusterAutoupdateConfigSpec { +// AutoupdateConfigSpec is the spec for the autoupdate config. +message AutoupdateConfigSpec { // agent_autoupdate specifies whether agent autoupdates are enabled. bool agent_autoupdate = 1; // agent_default_schedules specifies schedules for upgrades of agents. @@ -848,33 +860,6 @@ enum Day { DAY_FRIDAY = 7; DAY_SATURDAY = 8; } -``` - -### autoupdate/v1 - -```protobuf -syntax = "proto3"; - -package teleport.autoupdate.v1; - -option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; - -// AutoupdateService serves agent and client automatic version updates. -service AutoupdateService { - // GetAutoupdateVersion returns the autoupdate version. - rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion); - // CreateAutoupdateVersion creates the autoupdate version. - rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion); - // UpdateAutoupdateVersion updates the autoupdate version. - rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion); - // UpsertAutoupdateVersion overwrites the autoupdate version. 
- rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion); - - // GetAgentRolloutPlan returns the agent rollout plan and current progress. - rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan); - // GetAutoupdateVersion streams the agent rollout plan's list of all hosts. - rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost); -} // GetAutoupdateVersionRequest requests the autoupdate_version singleton resource. message GetAutoupdateVersionRequest {} @@ -908,7 +893,7 @@ message AutoupdateVersion { // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; // spec is the spec of the resource. - AutoupdateVersionSpec spec = 6; + AutoupdateVersionSpec spec = 5; } // AutoupdateVersionSpec is the spec for the autoupdate version. @@ -959,7 +944,7 @@ message AgentRolloutPlan { AgentRolloutPlanStatus status = 6; } -// AutoupdateVersionSpec is the spec for the autoupdate version. +// AutoupdateVersionSpec is the spec for the AgentRolloutPlan. message AgentRolloutPlanSpec { // start_time of the rollout google.protobuf.Timestamp start_time = 1; @@ -971,7 +956,7 @@ message AgentRolloutPlanSpec { int64 host_count = 4; } -// AutoupdateVersionSpec is the spec for the autoupdate version. +// AutoupdateVersionStatus is the status for the AgentRolloutPlan. message AgentRolloutPlanStatus { // last_active_host_index specifies the index of the last host that may be updated. int64 last_active_host_index = 1; From 568e0fef4d59ab11aa53305d7a26301710b26897 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 15:33:49 -0400 Subject: [PATCH 60/84] Move groups to MVP --- rfd/0169-auto-updates-linux-agents.md | 125 +++++++------------------- 1 file changed, 31 insertions(+), 94 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 3dfe6a35a7c1d..70a761bbd0cb2 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -57,16 +57,16 @@ The version and edition served from that endpoint will be configured using new ` Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: - The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` - The schedule defined in the new `autoupdate_config` resource - The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. +Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `teleport.yaml` file. 
-``` -teleport: - upgrade_group: staging +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-updater enable`: +```shell +$ teleport-updater enable --proxy teleport.example.com --group staging ``` At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend. @@ -176,7 +176,7 @@ Upgrading all agents generates the following additional backend write load: ### REST Endpoints -`/v1/webapi/find?host=[host_uuid]` +`/v1/webapi/find?host=[host_uuid]&group=[name]` ```json { "server_edition": "enterprise", @@ -189,6 +189,7 @@ Notes: - Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. +- The group name is read from `/var/lib/teleport/versions/updates.yaml` by the updater. ### Teleport Resources @@ -202,11 +203,11 @@ spec: # agent updates are in place. agent_autoupdate: true|false - # agent_group_schedules contains both "regular" and "critical" schedules. + # agent_schedules contains both "regular" and "critical" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. # Groups are not configurable with the "immediate" schedule. - agent_group_schedules: + agent_schedules: # schedule is "regular" or "critical" regular: # name of the group @@ -256,44 +257,7 @@ After 24 hours, the upgrade is halted in-place, and the group is considered fail Changing the version or schedule completely resets progress. Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. -Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with `agent_default_schedules` field. -This field will remain indefinitely to cover connected agents that are not matched to a group. - -```yaml -kind: autoupdate_config -spec: - # agent_autoupdate allows turning agent updates on or off at the - # cluster level. Only turn agent automatic updates off if self-managed - # agent updates are in place. - agent_autoupdate: true|false - - # agent_default_schedules contains "regular," "critical," and "immediate" schedules. - # These schedules apply to agents not scheduled by agent_group_schedules. - # The schedule used is determined by the agent_version_schedule associated - # with the agent_version in the autoupdate_version resource. - agent_default_schedules: - # The immediate schedule results in all agents updating simultaneously. - # Only client-side jitter is configurable. - immediate: - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent upgrader client will pick a random time within this duration to wait to upgrade. - # default: 0 - jitter_seconds: 0-60 - regular: # or "critical" - # days specifies the days of the week when the group may be upgraded. - # default: ["*"] (all days) - days: [“Sun”, “Mon”, ... | "*"] - # start_hour specifies the hour when the group may start upgrading. - # default: 0 - start_hour: 0-23 - # jitter_seconds specifies a maximum jitter duration after the start hour. 
- # The agent upgrader client will pick a random time within this duration to wait to upgrade. - # default: 0 - jitter_seconds: 0-60 - # ... -``` - -To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `agent_rollout_plan` named `default` will be employed. +Note that the `default` schedule applies to agents that do not specify a group name. ```shell # configuration @@ -383,17 +347,13 @@ Notes: ### Version Promotion -Maintaining the version of different groups of agents is out-of-scope for this RFD. +This RFD only proposed a mechanism to signal when agent auto-updates should occur. +Advertising different target Teleport versions for different groups of agents is out-of-scope for this RFD. This means that groups which employ auto-scaling or ephemeral resources will slowly converge to the latest Teleport version. **This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production.** -To solve this in the future, we can add an additional `--group` flag to `teleport-update`: -```shell -$ teleport-update enable --proxy example.teleport.sh --group staging-group -``` - -This group name could be provided as a parameter to `/v1/webapi/find`, so that newly added resources may install at the group's designated version. +To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-updater enable`) to determine which version should be served. This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. @@ -416,6 +376,11 @@ $ teleport-update enable --proxy example.teleport.sh $ systemctl enable teleport ``` +For grouped upgrades, a group identifier may be configured: +```shell +$ teleport-update enable --proxy example.teleport.sh --group staging +``` + For air-gapped Teleport installs, the agent may be configured with a custom tarball path template: ```shell $ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz' @@ -470,6 +435,8 @@ kind: agent_versions spec: # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from. proxy: mytenant.teleport.sh + # group specifies the update group + group: staging # enabled specifies whether auto-updates are enabled, i.e., whether teleport-updater update is allowed to update the agent. enabled: true # active_version specifies the active (symlinked) deployment of the telepport agent. @@ -499,7 +466,7 @@ $ teleport-updater update After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-updater` command: ```shell -$ teleport-updater enable --proxy mytenant.teleport.sh +$ teleport-updater enable --proxy mytenant.teleport.sh --group staging ``` If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. @@ -525,7 +492,7 @@ The `enable` subcommand will: 13. Remove and purge any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. 15. Remove all stored versions of the agent except the current version and last working version. -16. Configure `updates.yaml` with the current proxy address and set `enabled` to true. +16. Configure `updates.yaml` with the current proxy address and group, and set `enabled` to true. The `disable` subcommand will: 1. 
Configure `updates.yaml` to set `enabled` to false. @@ -783,41 +750,12 @@ message AutoupdateConfig { message AutoupdateConfigSpec { // agent_autoupdate specifies whether agent autoupdates are enabled. bool agent_autoupdate = 1; - // agent_default_schedules specifies schedules for upgrades of agents. - // not scheduled by agent_group_schedules. - AgentAutoupdateDefaultSchedules agent_default_schedules = 2; - // agent_group_schedules specifies schedules for upgrades of grouped agents. - AgentAutoupdateGroupSchedules agent_group_schedules = 3; -} - -// AgentAutoupdateDefaultSchedules specifies the default update schedules for non-grouped agent. -message AgentAutoupdateDefaultSchedules { - // regular schedule for non-critical versions. - AgentAutoupdateSchedule regular = 1; - // critical schedule for urgently needed versions. - AgentAutoupdateSchedule critical = 2; - // immediate schedule for versions that must be deployed with no delay. - AgentAutoupdateImmediateSchedule immediate = 3; -} - -// AgentAutoupdateSchedule specifies a default schedule for non-grouped agents. -message AgentAutoupdateSchedule { - // days to run update - repeated Day days = 2; - // start_hour to initiate update - int32 start_hour = 3; - // jitter_seconds to introduce before update as rand([0, jitter_seconds]). - int32 jitter_seconds = 4; -} - -// AgentAutoupdateSchedule specifies a default schedule for non-grouped agents on the immediate scehdule. -message AgentAutoupdateImmediateSchedule { - // jitter to introduce before update as rand([0, jitter_seconds]). - int32 jitter_seconds = 4; + // agent_schedules specifies schedules for upgrades of grouped agents. + AgentAutoupdateSchedules agent_schedules = 3; } -// AgentAutoupdateGroupSchedules specifies update scheduled for grouped agents. -message AgentAutoupdateGroupSchedules { +// AgentAutoupdateSchedules specifies update scheduled for grouped agents. +message AgentAutoupdateSchedules { // regular schedules for non-critical versions. repeated AgentAutoupdateGroup regular = 1; // critical schedules for urgently needed versions. @@ -975,14 +913,13 @@ message AgentRolloutPlanHost { ## Execution Plan -1. Implement Teleport APIs for new scheduling system (without groups and backpressure) -2. Implement new auto-updater in Go. +1. Implement Teleport APIs for new scheduling system (without backpressure) +2. Implement new Linux server auto-updater in Go. 3. Implement changes to Kubernetes auto-updater. 4. Test extensively on all supported Linux distributions. 5. Prep documentation changes. 6. Release new updater via teleport-ent-updater package. 7. Release documentation changes. -8. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. -9. Communicate to all Cloud customers that they must update their updater. -10. Deprecate old auto-updater endpoints. -11. Add groups and backpressure features. +8. Communicate to users that they should update their updater. +9. Deprecate old auto-updater endpoints. +10. Add groups and backpressure features. 
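Both the `enable` and `update` workflows above verify a `.sha256` companion file before extraction. A hedged sketch of that verification step follows; the helper structure, checksum-file parsing, and error handling are assumptions rather than the final updater code:

```golang
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"io"
	"net/http"
	"strings"
)

// fetch retrieves a URL into memory; a real implementation would also enforce
// size limits, consistent with the free-disk-space check in the steps above.
func fetch(ctx context.Context, url string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// verifiedDownload retrieves the tarball and its checksum file (the tarball
// URL suffixed with ".sha256") and rejects the download on any mismatch.
func verifiedDownload(ctx context.Context, tarballURL string) ([]byte, error) {
	tarball, err := fetch(ctx, tarballURL)
	if err != nil {
		return nil, err
	}
	sumFile, err := fetch(ctx, tarballURL+".sha256")
	if err != nil {
		return nil, err
	}
	// Checksum files commonly lead with the hex digest; treat the first
	// whitespace-separated field as the expected value.
	fields := strings.Fields(string(sumFile))
	digest := sha256.Sum256(tarball)
	if len(fields) == 0 || fields[0] != hex.EncodeToString(digest[:]) {
		return nil, errors.New("checksum mismatch")
	}
	return tarball, nil
}
```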
From 797b7907817b7d5b06b18a86082ff4fa3da9d193 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 17:51:24 -0400 Subject: [PATCH 61/84] note about checksum --- rfd/0169-auto-updates-linux-agents.md | 1 + 1 file changed, 1 insertion(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 70a761bbd0cb2..3e3142b1755b9 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -385,6 +385,7 @@ For air-gapped Teleport installs, the agent may be configured with a custom tarb ```shell $ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz' ``` +(Checksum will use template path + `.sha256`) ### Filesystem From 1b90a34ad4c417f56f7046847dbc9be9fbb97824 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 20:01:19 -0400 Subject: [PATCH 62/84] typos, consistency --- rfd/0169-auto-updates-linux-agents.md | 74 +++++++++++++-------------- 1 file changed, 37 insertions(+), 37 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 3e3142b1755b9..b26efad93c702 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -131,24 +131,24 @@ Losing auth: The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. The following data related to the rollout are stored in each instance heartbeat: -- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time -- `agent_upgrade_version`: current agent version +- `agent_update_start_time`: timestamp of individual agent's upgrade time +- `agent_update_version`: current agent version -Expiration time of the heartbeat is extended to 24 hours when `agent_upgrade_start_time` is written. +Expiration time of the heartbeat is extended to 24 hours when `agent_update_start_time` is written. Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. -Instance heartbeats are considered completed when either `agent_upgrade_version` matches the plan version, or `agent_upgrade_start_time` is past the expiration time. +Instance heartbeats are considered completed when either `agent_update_version` matches the plan version, or `agent_update_start_time` is past the expiration time. ```golang unfinished := make(map[Rollout][UUID]int) ``` On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading. This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`. -If the stored number is fewer, `agent_upgrade_start_time` is updated to the current time when the heartbeat is written. +If the stored number is fewer, `agent_update_start_time` is updated to the current time when the heartbeat is written. 
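A short sketch of that comparison follows, in the same style as the snippet above; the function and parameter names are illustrative, and percentage rounding is left unspecified here:

```golang
// shouldStartUpdate reports whether an agent may begin updating: the number of
// unfinished (pending or ongoing) updates ahead of it in the rollout plan must
// stay below the group's max_in_flight budget.
func shouldStartUpdate(unfinishedBefore, maxInFlightPercent, totalHosts int) bool {
	budget := totalHosts * maxInFlightPercent / 100
	return unfinishedBefore < budget
}
```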
The auth server writes the following keys to `/autoupdate/[name of group]/status` (e.g., `/autoupdate/staging/status`): -- `last_active_host_index`: index of the last host allowed to upgrade +- `last_active_host_index`: index of the last host allowed to update - `failed_host_count`: failed host count - `timeout_host_count`: timed-out host count @@ -168,10 +168,10 @@ upgrading := make(map[UUID]bool) ``` Proxies watch for changes to the plan and update the map accordingly. -When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_auto_upgrade: true`. +When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_autoupdate: true`. -Upgrading all agents generates the following additional backend write load: -- One write per page of the rollout plan per upgrade group. +Updating all agents generates the following additional backend write load: +- One write per page of the rollout plan per update group. - One write per auth server every 10 seconds, during rollouts. ### REST Endpoints @@ -186,7 +186,7 @@ Upgrading all agents generates the following additional backend write load: } ``` Notes: -- Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. +- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. - The group name is read from `/var/lib/teleport/versions/updates.yaml` by the updater. @@ -212,25 +212,25 @@ spec: regular: # name of the group - name: staging-group - # days specifies the days of the week when the group may be upgraded. + # days specifies the days of the week when the group may be updated. # default: ["*"] (all days) days: [“Sun”, “Mon”, ... | "*"] # start_hour specifies the hour when the group may start upgrading. # default: 0 start_hour: 0-23 # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # The agent updater client will pick a random time within this duration to wait to update. # default: 0 jitter_seconds: 0-60 # timeout_seconds specifies the amount of time, after the specified jitter, after which - # an agent upgrade will be considered timed out if the version does not change. + # an agent update will be considered timed out if the version does not change. # default: 60 timeout_seconds: 30-900 - # failure_seconds specifies the amount of time after which an agent upgrade will be considered - # failed if the agent heartbeat stops before the upgrade is complete. + # failure_seconds specifies the amount of time after which an agent update will be considered + # failed if the agent heartbeat stops before the update is complete. # default: 0 failure_seconds: 0-900 - # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. + # max_in_flight specifies the maximum number of agents that may be updated at the same time. # default: 100% max_in_flight: 0-100% # max_timeout_before_halt specifies the percentage of clients that may time out before this group @@ -251,8 +251,8 @@ Dependency cycles are rejected. 
Dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. -The updater will receive `agent_autoupdate: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. -After 24 hours, the upgrade is halted in-place, and the group is considered failed if unfinished. +The updater will receive `agent_autoupdate: true` from the time is it designated for update until the version changes in `autoupdate_version`. +After 24 hours, the update is halted in-place, and the group is considered failed if unfinished. Changing the version or schedule completely resets progress. Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. @@ -288,7 +288,7 @@ Status: succeeded Date: 2024-01-03 23:43:22 UTC Requires: (none) -Upgraded: 230 (95%) +Updated: 230 (95%) Unchanged: 10 (2%) Failed: 15 (3%) Timed-out: 0 @@ -361,7 +361,7 @@ This will require tracking the desired version of groups in the backend, which w We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. -It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. +It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified update plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. Source code for the updater will live in the main Teleport repository, with the updater binary built from `tools/teleport-update`. @@ -376,7 +376,7 @@ $ teleport-update enable --proxy example.teleport.sh $ systemctl enable teleport ``` -For grouped upgrades, a group identifier may be configured: +For grouped updates, a group identifier may be configured: ```shell $ teleport-update enable --proxy example.teleport.sh --group staging ``` @@ -481,7 +481,7 @@ The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). 3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). -4. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. +4. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. @@ -504,7 +504,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check that `agent_autoupdates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. 
6. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request.
7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
8. Download and verify the checksum (tarball URL suffixed with `.sha256`).
9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
@@ -531,13 +531,13 @@ To ensure that backups are consistent, the updater will use the [SQLite backup A

If the new version of Teleport fails to start, the installation of Teleport is reverted as described above.

-If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version.
+If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the update will retry with the older version.

-Known failure conditions caused by intentional configuration (e.g., upgrades disabled) will not trigger retry logic.
+Known failure conditions caused by intentional configuration (e.g., updates disabled) will not trigger retry logic.

#### Status

-To retrieve known information about agent upgrades, the `status` subcommand will return the following:
+To retrieve known information about agent updates, the `status` subcommand will return the following:
```json
{
  "agent_version_installed": "15.1.1",
@@ -567,8 +567,8 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d
Downgrades are applied with `teleport-updater update`, just like upgrades.
The above steps modulate the standard workflow in the section above.
If the downgraded version is already present, the uncompressed version is used to ensure fast recovery of the exact state before the failed upgrade.
-To ensure that the target version is was not corrupted by incomplete extraction, the downgrade checks for the existance of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
-To ensure that the DB backup was not corrupted by incomplete copying, the downgrade checks for the existance of `/var/lib/teleport/versions/TARGET-VERSION/backup/backup.yaml` before restoring.
+To ensure that the target version was not corrupted by incomplete extraction, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
+To ensure that the DB backup was not corrupted by incomplete copying, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/backup/backup.yaml` before restoring.

Teleport must be fully-stopped to safely replace `sqlite.db`.
When restarting the agent during an upgrade, `SIGHUP` is used.
@@ -584,7 +584,7 @@ Given that rollbacks may fail, we must maintain the following invariants:

When rolling forward, the backup of the newer version's `sqlite.db` is only restored if that exact version is the roll-forward version.
Otherwise, the older, rollback version of `sqlite.db` is preserved (i.e., the newer version's backup is not used).
-This ensures that a version upgrade which broke the database can be recovered with a rollback and a new patch.
+This ensures that a version update which broke the database can be recovered with a rollback and a new patch.
It also ensures that a broken rollback is always recoverable by reversing the rollback.
Example: Given v1, v2, v3 versions of Teleport, where v2 is broken: @@ -609,7 +609,7 @@ The following install scripts will be updated to install the latest updater and Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport. -Moving additional logic into the upgrader is out-of-scope for this proposal. +Moving additional logic into the updater is out-of-scope for this proposal. To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted: - Install the `teleport-updater` package and defer `teleport-updater enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts. @@ -628,7 +628,7 @@ Documentation should be created covering the above workflows. ### Documentation -The following documentation will need to be updated to cover the new upgrader workflow: +The following documentation will need to be updated to cover the new updater workflow: - https://goteleport.com/docs/choose-an-edition/teleport-cloud/downloads - https://goteleport.com/docs/installation - https://goteleport.com/docs/upgrading/self-hosted-linux @@ -640,7 +640,7 @@ Additionally, the Cloud dashboard tenants downloads tab will need to be updated The Kubernetes agent updater will be updated for compatibility with the new scheduling system. -This means that it will stop reading upgrade windows using the authenticated connection to the proxy, and instead upgrade when indicated by the `/v1/webapi/find` endpoint. +This means that it will stop reading update windows using the authenticated connection to the proxy, and instead update when indicated by the `/v1/webapi/find` endpoint. Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX and compatibility, will be covered in a future RFD. @@ -659,10 +659,10 @@ administrators concerned with the authenticity of assets served from the download server can use self-managed updates with system package managers which are signed. -The Upgrade Framework (TUF) will be used to implement secure updates in the future. +The Update Framework (TUF) will be used to implement secure updates in the future. -Anyone who possesses a host UUID can determine when that host is scheduled to upgrade by repeatedly querying the public `/v1/webapi/find` endpoint. -It is not possible to discover the current version of that host, only the designated upgrade window. +Anyone who possesses a host UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint. +It is not possible to discover the current version of that host, only the designated update window. ## Logging @@ -751,7 +751,7 @@ message AutoupdateConfig { message AutoupdateConfigSpec { // agent_autoupdate specifies whether agent autoupdates are enabled. bool agent_autoupdate = 1; - // agent_schedules specifies schedules for upgrades of grouped agents. + // agent_schedules specifies schedules for updates of grouped agents. AgentAutoupdateSchedules agent_schedules = 3; } @@ -777,7 +777,7 @@ message AgentAutoupdateGroup { int32 timeout_seconds = 5; // failure_seconds before an agent is considered failed (loses connection) int32 failure_seconds = 6; - // max_in_flight specifies agents that can be upgraded at the same time, by percent. + // max_in_flight specifies agents that can be updated at the same time, by percent. 
string max_in_flight = 7; // max_timeout_before_halt specifies agents that can timeout before the rollout is halted, by percent. string max_timeout_before_halt = 8; From dc20017310d98a52a86aa9a8418ab28cdec46179 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 20:54:13 -0400 Subject: [PATCH 63/84] clarify binary is teleport-update, package is teleport-ent-updater --- rfd/0169-auto-updates-linux-agents.md | 46 +++++++++++++-------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index b26efad93c702..6803de84873aa 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -64,9 +64,9 @@ Whether the Teleport updater querying the endpoint is instructed to upgrade (via To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-updater enable`: +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-update enable`: ```shell -$ teleport-updater enable --proxy teleport.example.com --group staging +$ teleport-update enable --proxy teleport.example.com --group staging ``` At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend. @@ -353,7 +353,7 @@ This means that groups which employ auto-scaling or ephemeral resources will slo **This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production.** -To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-updater enable`) to determine which version should be served. +To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-update enable`) to determine which version should be served. This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. @@ -398,7 +398,7 @@ $ tree /var/lib/teleport │ │ ├── tsh │ │ ├── tbot │ │ ├── ... # other binaries - │ │ ├── teleport-updater + │ │ ├── teleport-update │ │ └── teleport │ ├── etc │ │ └── systemd @@ -411,7 +411,7 @@ $ tree /var/lib/teleport │ │ ├── tsh │ │ ├── tbot │ │ ├── ... 
# other binaries - │ │ ├── teleport-updater + │ │ ├── teleport-update │ │ └── teleport │ └── etc │ └── systemd @@ -423,8 +423,8 @@ $ ls -l /usr/local/bin/tbot /usr/local/bin/tbot -> /var/lib/teleport/versions/15.0.0/bin/tbot $ ls -l /usr/local/bin/teleport /usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport -$ ls -l /usr/local/bin/teleport-updater -/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/15.0.0/bin/teleport-updater +$ ls -l /usr/local/bin/teleport-update +/usr/local/bin/teleport-update -> /var/lib/teleport/versions/15.0.0/bin/teleport-update $ ls -l /usr/local/lib/systemd/system/teleport.service /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service ``` @@ -438,7 +438,7 @@ spec: proxy: mytenant.teleport.sh # group specifies the update group group: staging - # enabled specifies whether auto-updates are enabled, i.e., whether teleport-updater update is allowed to update the agent. + # enabled specifies whether auto-updates are enabled, i.e., whether teleport-update update is allowed to update the agent. enabled: true # active_version specifies the active (symlinked) deployment of the telepport agent. active_version: 15.1.1 @@ -462,12 +462,12 @@ spec: The agent-updater will run as a periodically executing systemd service which runs every 10 minutes. The systemd service will run: ```shell -$ teleport-updater update +$ teleport-update update ``` -After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-updater` command: +After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-update` command: ```shell -$ teleport-updater enable --proxy mytenant.teleport.sh --group staging +$ teleport-update enable --proxy mytenant.teleport.sh --group staging ``` If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. @@ -515,12 +515,12 @@ When `update` subcommand is otherwise executed, it will: 14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 15. Remove all stored versions of the agent except the current version and last working version. -To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. -The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reexec in most scenarios. +To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different. +The `/usr/local/bin/teleport-update` symlink will take precedence to avoid reexec in most scenarios. -To ensure that SELinux permissions do not prevent the `teleport-updater` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. +To ensure that SELinux permissions do not prevent the `teleport-update` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. -To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. 
+To ensure that `teleport` package removal does not interfere with `teleport-update`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged.
 Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date.
 
 To ensure that `teleport` package removal does not lead to a hard restart of Teleport, the updater will ensure that the package is removed without triggering needrestart or similar services.
@@ -531,7 +531,7 @@ To ensure that backups are consistent, the updater will use the [SQLite backup A
 
 If the new version of Teleport fails to start, the installation of Teleport is reverted as described above.
 
-If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the update will retry with the older version.
+If `teleport-update` itself fails with an error, and an older version of `teleport-update` is available, the update will retry with the older version.
 
 Known failure conditions caused by intentional configuration (e.g., updates disabled) will not trigger retry logic.
 
@@ -564,7 +564,7 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d
 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started.
 3. If the backup is invalid, we refuse to downgrade.
 
-Downgrades are applied with `teleport-updater update`, just like upgrades.
+Downgrades are applied with `teleport-update update`, just like upgrades.
 The above steps modulate the standard workflow in the section above.
 If the downgraded version is already present, the uncompressed version is used to ensure fast recovery of the exact state before the failed upgrade.
 To ensure that the target version was not corrupted by incomplete extraction, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
@@ -593,29 +593,29 @@ Example: Given v1, v2, v3 versions of Teleport, where v2 is broken:
 
 ### Manual Workflow
 
-For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint.
-This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or Ansible).
+For use cases that fall outside of the functionality provided by `teleport-update`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint.
+This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-update` because they use their own automation for updates (e.g., JamF or Ansible).
 
 Cluster administrators that want to self-manage agent updates may manually query the `/v1/webapi/find` endpoint using the host UUID, and implement auto-updates with their own automation.
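
As a non-authoritative illustration, such automation could be as small as the Go sketch below. It assumes the `agent_version` and `agent_autoupdate` response fields described in this RFD; the package name and the decision about how to apply the update are left to the customer's own tooling.

```golang
// Package selfmanaged is an illustrative sketch only; it is not part of Teleport.
package selfmanaged

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"
)

type findResponse struct {
	AgentVersion    string `json:"agent_version"`
	AgentAutoUpdate bool   `json:"agent_autoupdate"`
}

// CheckForUpdate queries /v1/webapi/find with this host's UUID and returns
// the advertised agent version and whether an update is currently requested.
func CheckForUpdate(proxyAddr string) (string, bool, error) {
	host, err := os.ReadFile("/var/lib/teleport/host_uuid")
	if err != nil {
		return "", false, err
	}
	q := url.QueryEscape(strings.TrimSpace(string(host)))
	resp, err := http.Get(fmt.Sprintf("https://%s/v1/webapi/find?host=%s", proxyAddr, q))
	if err != nil {
		return "", false, err
	}
	defer resp.Body.Close()

	var find findResponse
	if err := json.NewDecoder(resp.Body).Decode(&find); err != nil {
		return "", false, err
	}
	// The caller hands the result off to its own tooling (JamF, Ansible, etc.).
	return find.AgentVersion, find.AgentAutoUpdate, nil
}
```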
### Installers
 
-The following install scripts will be updated to install the latest updater and run `teleport-updater enable` with the proxy address:
+The following install scripts will be updated to install the latest updater and run `teleport-update enable` with the proxy address:
 - [/api/types/installers/agentless-installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl)
 - [/api/types/installers/installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl)
 - [/lib/web/scripts/oneoff/oneoff.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh)
 - [/lib/web/scripts/node-join/install.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh)
 - [/assets/aws/files/install-hardened.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh)
 
-Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport.
+Eventually, additional logic from the scripts could be added to `teleport-update`, such that `teleport-update` can configure teleport.
 
 Moving additional logic into the updater is out-of-scope for this proposal.
 
 To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted:
-- Install the `teleport-updater` package and defer `teleport-updater enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts.
+- Install the `teleport-ent-updater` package and defer `teleport-update enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts.
   This allows both the proxy address and token to be injected at VM initialization.
   The VM image may be used with any Teleport cluster.
   Installer scripts will continue to function, as the package install operation will no-op.
-- Install the `teleport-updater` package and run `teleport-updater enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts.
+- Install the `teleport-ent-updater` package and run `teleport-update enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts.
   This allows the proxy address to be pre-set in the image.
   `teleport.yaml` can be partially configured during image creation.
   At minimum, the token must be injected via cloud-init scripts.
   Installer scripts would be skipped in favor of the `teleport configure` command.
From c0650607a95b28bba0942bd64af4fd41edcd80b2 Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Tue, 27 Aug 2024 20:53:21 -0400
Subject: [PATCH 64/84] switch from df to unix.Statfs

---
 rfd/0169-auto-updates-linux-agents.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 6803de84873aa..40f727d4f3686 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -481,7 +481,7 @@ The `enable` subcommand will:
 1. Query the `/v1/webapi/find` endpoint.
 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16).
3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13).
-4. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request.
+4. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request.
 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
 6. Download and verify the checksum (tarball URL suffixed with `.sha256`).
 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
@@ -504,7 +504,7 @@ When `update` subcommand is otherwise executed, it will:
 3. Check that `agent_autoupdates` is true, quit otherwise.
 4. If the current version of Teleport is the latest, quit.
 5. Wait `random(0, agent_update_jitter_seconds)` seconds.
-6. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request.
+6. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request.
 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
 8. Download and verify the checksum (tarball URL suffixed with `.sha256`).
 9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
From 9bcd324b2e5364d1936d3371930f25749c0bdc1e Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Wed, 4 Sep 2024 11:25:42 -0400
Subject: [PATCH 65/84] security feedback + naming adjustments

---
 rfd/0169-auto-updates-linux-agents.md | 157 +++++++++++++-------------
 1 file changed, 79 insertions(+), 78 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 40f727d4f3686..d672615b4e72c 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -87,7 +87,7 @@ The cache is considered healthy when all instance heartbeats present on the back
 At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key.
 This plan is protected by optimistic locking, and contains the following data:
 
-Data key: `/autoupdate/[name of group](/[page-id])` (e.g., `/autoupdate/staging/8745823`)
+Data key: `/autoupdate/[name of group](/[page uuid])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`)
 
 Data value JSON:
 - `start_time`: timestamp of current window start time
@@ -95,6 +95,7 @@ Data value JSON:
 - `schedule`: type of schedule that triggered the rollout
 - `hosts`: list of host UUIDs in randomized order
 - `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs
+- `auth_server`: ID of auth server writing the plan
 
 Expiration time of each key is 2 weeks.
 
If the resource size is greater than 100 KiB, auth servers will divide the resource into pages no greater than 100 KiB each.
This is necessary to support backends with a value size limit.

Each page will duplicate all values besides `hosts`, which will be different for each page.
All pages besides the first page will be suffixed with a randomly generated number.
Pages will be written in reverse order, in a linked list, before the final atomic non-suffixed write of the first page.
If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages.
-If cleanup fails, the unusable pages will expire after 2 weeks.
+If cleanup fails, the unusable pages will expire from the backend after 2 weeks.
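
To make the sequencing concrete, a sketch of the paged write is shown below. `Backend` is a hypothetical interface standing in for the real backend client, and only the ordering and cleanup behavior are meant to be illustrative; the write sequence it produces matches the example that follows.

```golang
// Illustrative only: Backend stands in for the real backend client API.
type Backend interface {
	Put(key string, value []byte) error
	// PutIfNotExists fails if the key already exists (the optimistic lock).
	PutIfNotExists(key string, value []byte) error
	Delete(key string) error
}

// planPage pairs a page's random suffix with its serialized contents,
// which include the next_page pointer described above.
type planPage struct {
	suffix string
	data   []byte
}

// writePlan writes suffixed pages in reverse order, then atomically commits
// the non-suffixed first page. On a write conflict, the losing auth server
// deletes its own pages; if cleanup fails, they expire after 2 weeks.
func writePlan(b Backend, group string, first []byte, pages []planPage) error {
	for i := len(pages) - 1; i >= 0; i-- {
		if err := b.Put("/autoupdate/"+group+"/"+pages[i].suffix, pages[i].data); err != nil {
			return err
		}
	}
	if err := b.PutIfNotExists("/autoupdate/"+group, first); err != nil {
		for _, p := range pages {
			_ = b.Delete("/autoupdate/" + group + "/" + p.suffix)
		}
		return err
	}
	return nil
}
```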
```
 Winning auth:
-  WRITE: /autoupdate/staging/4324234 | next_page: null
-  WRITE: /autoupdate/staging/8745823 | next_page: 4324234
-  WRITE: /autoupdate/staging | next_page: 8745823
+  WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56 | next_page: null
+  WRITE: /autoupdate/staging/9ae65c11-35f2-483c-987e-73ef36989d3b | next_page: 58526ba2-c12d-4a49-b5a4-1b694b82bf56
+  WRITE: /autoupdate/staging | next_page: 9ae65c11-35f2-483c-987e-73ef36989d3b
 
 Losing auth:
-  WRITE: /autoupdate/staging/2342343 | next_page: null
-  WRITE: /autoupdate/staging/7678686 | next_page: 2342343
-  WRITE CONFLICT: /autoupdate/staging | next_page: 7678686
-  DELETE: /autoupdate/staging/7678686
-  DELETE: /autoupdate/staging/2342343
+  WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530 | next_page: null
+  WRITE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d | next_page: dd850e65-d2b2-4557-8ffb-def893c52530
+  WRITE CONFLICT: /autoupdate/staging | next_page: dc27497b-ce25-4d85-b537-d0639996110d
+  DELETE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d
+  DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530
 ```

#### Rollout

@@ -210,7 +211,7 @@ spec:
   agent_schedules:
     # schedule is "regular" or "critical"
     regular:
-      # name of the group
+      # name of the group. Must only contain valid backend / resource name characters.
       - name: staging-group
         # days specifies the days of the week when the group may be updated.
         # default: ["*"] (all days)
@@ -684,57 +685,57 @@ package teleport.autoupdate.v1;
 
 option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1";
 
-// AutoupdateService serves agent and client automatic version updates.
-service AutoupdateService {
-  // GetAutoupdateConfig updates the autoupdate config.
-  rpc GetAutoupdateConfig(GetAutoupdateConfigRequest) returns (AutoupdateConfig);
-  // CreateAutoupdateConfig creates the autoupdate config.
-  rpc CreateAutoupdateConfig(CreateAutoupdateConfigRequest) returns (AutoupdateConfig);
-  // UpdateAutoupdateConfig updates the autoupdate config.
-  rpc UpdateAutoupdateConfig(UpdateAutoupdateConfigRequest) returns (AutoupdateConfig);
-  // UpsertAutoupdateConfig overwrites the autoupdate config.
-  rpc UpsertAutoupdateConfig(UpsertAutoupdateConfigRequest) returns (AutoupdateConfig);
-  // ResetAutoupdateConfig restores the autoupdate config to default values.
-  rpc ResetAutoupdateConfig(ResetAutoupdateConfigRequest) returns (AutoupdateConfig);
-
-  // GetAutoupdateVersion returns the autoupdate version.
-  rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion);
-  // CreateAutoupdateVersion creates the autoupdate version.
-  rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion);
-  // UpdateAutoupdateVersion updates the autoupdate version.
-  rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion);
-  // UpsertAutoupdateVersion overwrites the autoupdate version.
-  rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion);
+// AutoUpdateService serves agent and client automatic version updates.
+service AutoUpdateService {
+  // GetAutoUpdateConfig returns the autoupdate config.
+  rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+  // CreateAutoUpdateConfig creates the autoupdate config.
+  rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+  // UpdateAutoUpdateConfig updates the autoupdate config.
+  rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+  // UpsertAutoUpdateConfig overwrites the autoupdate config.
+  rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+  // ResetAutoUpdateConfig restores the autoupdate config to default values.
+  rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+
+  // GetAutoUpdateVersion returns the autoupdate version.
+  rpc GetAutoUpdateVersion(GetAutoUpdateVersionRequest) returns (AutoUpdateVersion);
+  // CreateAutoUpdateVersion creates the autoupdate version.
+  rpc CreateAutoUpdateVersion(CreateAutoUpdateVersionRequest) returns (AutoUpdateVersion);
+  // UpdateAutoUpdateVersion updates the autoupdate version.
+  rpc UpdateAutoUpdateVersion(UpdateAutoUpdateVersionRequest) returns (AutoUpdateVersion);
+  // UpsertAutoUpdateVersion overwrites the autoupdate version.
+  rpc UpsertAutoUpdateVersion(UpsertAutoUpdateVersionRequest) returns (AutoUpdateVersion);
 
   // GetAgentRolloutPlan returns the agent rollout plan and current progress.
   rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan);
-  // GetAutoupdateVersion streams the agent rollout plan's list of all hosts.
+  // GetAgentRolloutPlanHosts streams the agent rollout plan's list of all hosts.
   rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost);
 }
 
-// GetAutoupdateConfigRequest requests the contents of the AutoupdateConfig.
-message GetAutoupdateConfigRequest {}
+// GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig.
+message GetAutoUpdateConfigRequest {}
 
-// CreateAutoupdateConfigRequest requests creation of the the AutoupdateConfig.
-message CreateAutoupdateConfigRequest {
-  AutoupdateConfig autoupdate_config = 1;
+// CreateAutoUpdateConfigRequest requests creation of the AutoUpdateConfig.
+message CreateAutoUpdateConfigRequest {
+  AutoUpdateConfig autoupdate_config = 1;
 }
 
-// UpdateAutoupdateConfigRequest requests an update of the the AutoupdateConfig.
-message UpdateAutoupdateConfigRequest {
-  AutoupdateConfig autoupdate_config = 1;
+// UpdateAutoUpdateConfigRequest requests an update of the AutoUpdateConfig.
+message UpdateAutoUpdateConfigRequest {
+  AutoUpdateConfig autoupdate_config = 1;
 }
 
-// UpsertAutoupdateConfigRequest requests an upsert of the the AutoupdateConfig.
-message UpsertAutoupdateConfigRequest {
-  AutoupdateConfig autoupdate_config = 1;
+// UpsertAutoUpdateConfigRequest requests an upsert of the AutoUpdateConfig.
+message UpsertAutoUpdateConfigRequest {
+  AutoUpdateConfig autoupdate_config = 1;
 }
 
-// ResetAutoupdateConfigRequest requests a reset of the the AutoupdateConfig to default values.
-message ResetAutoupdateConfigRequest {}
+// ResetAutoUpdateConfigRequest requests a reset of the AutoUpdateConfig to default values.
+message ResetAutoUpdateConfigRequest {}
 
-// AutoupdateConfig holds dynamic configuration settings for automatic updates.
-message AutoupdateConfig {
+// AutoUpdateConfig holds dynamic configuration settings for automatic updates.
+message AutoUpdateConfig {
   // kind is the kind of the resource.
   string kind = 1;
   // sub_kind is the sub kind of the resource.
@@ -744,27 +745,27 @@ message AutoupdateConfig {
   // metadata is the metadata of the resource.
   teleport.header.v1.Metadata metadata = 4;
   // spec is the spec of the resource.
-  AutoupdateConfigSpec spec = 7;
+  AutoUpdateConfigSpec spec = 7;
 }
 
-// AutoupdateConfigSpec is the spec for the autoupdate config.
-message AutoupdateConfigSpec {
+// AutoUpdateConfigSpec is the spec for the autoupdate config.
+message AutoUpdateConfigSpec {
   // agent_autoupdate specifies whether agent autoupdates are enabled.
   bool agent_autoupdate = 1;
   // agent_schedules specifies schedules for updates of grouped agents.
-  AgentAutoupdateSchedules agent_schedules = 3;
+  AgentAutoUpdateSchedules agent_schedules = 3;
 }
 
-// AgentAutoupdateSchedules specifies update scheduled for grouped agents.
-message AgentAutoupdateSchedules {
+// AgentAutoUpdateSchedules specifies update schedules for grouped agents.
+message AgentAutoUpdateSchedules {
   // regular schedules for non-critical versions.
-  repeated AgentAutoupdateGroup regular = 1;
+  repeated AgentAutoUpdateGroup regular = 1;
   // critical schedules for urgently needed versions.
-  repeated AgentAutoupdateGroup critical = 2;
+  repeated AgentAutoUpdateGroup critical = 2;
 }
 
-// AgentAutoupdateGroup specifies the update schedule for a group of agents.
-message AgentAutoupdateGroup {
+// AgentAutoUpdateGroup specifies the update schedule for a group of agents.
+message AgentAutoUpdateGroup {
   // name of the group
   string name = 1;
   // days to run update
@@ -800,29 +801,29 @@ enum Day {
   DAY_SATURDAY = 8;
 }
 
-// GetAutoupdateVersionRequest requests the autoupdate_version singleton resource.
-message GetAutoupdateVersionRequest {}
+// GetAutoUpdateVersionRequest requests the autoupdate_version singleton resource.
+message GetAutoUpdateVersionRequest {}
 
-// GetAutoupdateVersionRequest requests creation of the autoupdate_version singleton resource.
-message CreateAutoupdateVersionRequest {
+// CreateAutoUpdateVersionRequest requests creation of the autoupdate_version singleton resource.
+message CreateAutoUpdateVersionRequest {
   // autoupdate_version resource contents
-  AutoupdateVersion autoupdate_version = 1;
+  AutoUpdateVersion autoupdate_version = 1;
 }
 
-// GetAutoupdateVersionRequest requests an update of the autoupdate_version singleton resource.
-message UpdateAutoupdateVersionRequest {
+// UpdateAutoUpdateVersionRequest requests an update of the autoupdate_version singleton resource.
+message UpdateAutoUpdateVersionRequest {
   // autoupdate_version resource contents
-  AutoupdateVersion autoupdate_version = 1;
+  AutoUpdateVersion autoupdate_version = 1;
 }
 
-// GetAutoupdateVersionRequest requests an upsert of the autoupdate_version singleton resource.
-message UpsertAutoupdateVersionRequest {
+// UpsertAutoUpdateVersionRequest requests an upsert of the autoupdate_version singleton resource.
+message UpsertAutoUpdateVersionRequest {
   // autoupdate_version resource contents
-  AutoupdateVersion autoupdate_version = 1;
+  AutoUpdateVersion autoupdate_version = 1;
 }
 
-// AutoupdateVersion holds dynamic configuration settings for autoupdate versions.
-message AutoupdateVersion {
+// AutoUpdateVersion holds dynamic configuration settings for autoupdate versions.
+message AutoUpdateVersion {
   // kind is the kind of the resource.
   string kind = 1;
   // sub_kind is the sub kind of the resource.
@@ -832,11 +833,11 @@ message AutoupdateVersion {
   // metadata is the metadata of the resource.
   teleport.header.v1.Metadata metadata = 4;
   // spec is the spec of the resource.
-  AutoupdateVersionSpec spec = 5;
+  AutoUpdateVersionSpec spec = 5;
 }
 
-// AutoupdateVersionSpec is the spec for the autoupdate version.
-message AutoupdateVersionSpec {
+// AutoUpdateVersionSpec is the spec for the autoupdate version.
+message AutoUpdateVersionSpec {
   // agent_version is the desired agent version for new rollouts.
  string agent_version = 1;
   // agent_version_schedule is the schedule to use for rolling out the agent_version.
@@ -883,7 +884,7 @@ message AgentRolloutPlan {
   AgentRolloutPlanStatus status = 6;
 }
 
-// AutoupdateVersionSpec is the spec for the AgentRolloutPlan.
+// AgentRolloutPlanSpec is the spec for the AgentRolloutPlan.
 message AgentRolloutPlanSpec {
   // start_time of the rollout
   google.protobuf.Timestamp start_time = 1;
@@ -895,7 +896,7 @@ message AgentRolloutPlanSpec {
   int64 host_count = 4;
 }
 
-// AutoupdateVersionStatus is the status for the AgentRolloutPlan.
+// AgentRolloutPlanStatus is the status for the AgentRolloutPlan.
 message AgentRolloutPlanStatus {
   // last_active_host_index specifies the index of the last host that may be updated.
   int64 last_active_host_index = 1;
@@ -914,7 +915,7 @@ message AgentRolloutPlanHost {
 
 ## Execution Plan
 
-1. Implement Teleport APIs for new scheduling system (without backpressure)
+1. Implement Teleport APIs for new scheduling system (without backpressure or group interdependence)
 2. Implement new Linux server auto-updater in Go.
 3. Implement changes to Kubernetes auto-updater.
 4. Test extensively on all supported Linux distributions.
@@ -923,4 +924,4 @@ message AgentRolloutPlanHost {
 7. Release documentation changes.
 8. Communicate to users that they should update their updater.
 9. Deprecate old auto-updater endpoints.
-10. Add groups and backpressure features.
+10. Add group interdependence and backpressure features.
From e748820dbe98bbd8a71b18733f99382b2c868d77 Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Fri, 6 Sep 2024 13:45:30 -0400
Subject: [PATCH 66/84] tweak rollout paging

---
 rfd/0169-auto-updates-linux-agents.md | 43 ++++++++++++++++-----------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index d672615b4e72c..50a7c72be7ecb 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -80,22 +80,27 @@ Rollouts may be retried with `tctl autoupdate run`.
 #### Window Capture
 
 Instance heartbeats will be cached by auth servers using a dedicated cache.
-This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats.
+This cache is initialized from the backend when the auth server starts, and kept up-to-date when the heartbeats are broadcast to all auth servers.
+
+When the auth server is started, the cache is initialized using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats.
 The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters.
-The cache is considered healthy when all instance heartbeats present on the backend have been read within a time period that is also modulated by the total number of heartbeats.
+The cache is considered healthy when all instance heartbeats present on the backend have been read at least once.
+
+Instance heartbeats are currently broadcast to all auth servers.
+The cache will be kept up-to-date when the auth server receives updates.
 
-At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key.
+At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend.
This plan is protected by optimistic locking, and contains the following data: -Data key: `/autoupdate/[name of group](/[page uuid])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`) +Data key: `/autoupdate/[name of group](/[auth ID]/page[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1`) Data value JSON: - `start_time`: timestamp of current window start time - `version`: version for which this rollout is valid - `schedule`: type of schedule that triggered the rollout - `hosts`: list of host UUIDs in randomized order -- `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs -- `auth_server`: ID of auth server writing the plan +- `auth_id`: ID of the auth server writing the plan + Expiration time of each key is 2 weeks. @@ -108,25 +113,27 @@ If the resource size is greater than 100 KiB, auth servers will divide the resou This is necessary to support backends with a value size limit. Each page will duplicate all values besides `hosts`, which will be different for each page. -All pages besides the first page will be suffixed with a randomly generated number. -Pages will be written in reverse order, in a linked-link, before the final atomic non-suffixed write of the first page. -If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages. +All pages besides the first page will be prefixed with the auth server's ID. +Pages will be written in reverse order before the final atomic non-prefixed write of the first page. +If the non-prefixed write fails, the auth server is responsible for cleaning up the unusable pages. If cleanup fails, the unusable pages will expire from the backend after 2 weeks. ``` Winning auth: - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56 | next_page: null - WRITE: /autoupdate/staging/9ae65c11-35f2-483c-987e-73ef36989d3b | next_page: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 - WRITE: /autoupdate/staging | next_page: 9ae65c11-35f2-483c-987e-73ef36989d3b + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page2 + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1 + WRITE: /autoupdate/staging | auth_id: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 Losing auth: - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530 | next_page: null - WRITE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d | next_page: dd850e65-d2b2-4557-8ffb-def893c52530 - WRITE CONFLICT: /autoupdate/staging | next_page: dc27497b-ce25-4d85-b537-d0639996110d - DELETE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 + WRITE CONFLICT: /autoupdate/staging | auth_id: dd850e65-d2b2-4557-8ffb-def893c52530 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 ``` +To read all pages, auth servers read the first page, get the auth server ID from the `auth_id` field, and then range-read the remaining pages. + #### Rollout The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. @@ -169,7 +176,7 @@ upgrading := make(map[UUID]bool) ``` Proxies watch for changes to the plan and update the map accordingly. 
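
A rough sketch of this lookup path is shown below; the handler shape and types are illustrative rather than the actual proxy implementation, and the version value would come from `autoupdate_version`.

```golang
// Illustrative sketch of the proxy-side lookup; not the real handler.
package proxyfind

import (
	"encoding/json"
	"net/http"
	"sync"
)

type rolloutState struct {
	mu        sync.RWMutex
	upgrading map[string]bool // host UUID -> designated for update
}

// handleFind consults the watched rollout state to decide the value of
// agent_autoupdate for the requesting host.
func (s *rolloutState) handleFind(w http.ResponseWriter, r *http.Request) {
	s.mu.RLock()
	update := s.upgrading[r.URL.Query().Get("host")]
	s.mu.RUnlock()

	_ = json.NewEncoder(w).Encode(map[string]any{
		"agent_version":    "15.1.1", // served from autoupdate_version
		"agent_autoupdate": update,
	})
}
```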
-When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_autoupdate: true`. +When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]&group=[name]`, the proxies query the map to determine the value of `agent_autoupdate: true`. Updating all agents generates the following additional backend write load: - One write per page of the rollout plan per update group. From 4f93a7f2310b4b10a56fb3f9af809ac854ffa092 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 6 Sep 2024 13:47:03 -0400 Subject: [PATCH 67/84] tweak rollout paging again --- rfd/0169-auto-updates-linux-agents.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 50a7c72be7ecb..25f58f89994ec 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -92,7 +92,7 @@ The cache will be kept up-to-date when the auth server receives updates. At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend. This plan is protected by optimistic locking, and contains the following data: -Data key: `/autoupdate/[name of group](/[auth ID]/page[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1`) +Data key: `/autoupdate/[name of group](/[auth ID]/[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1`) Data value JSON: - `start_time`: timestamp of current window start time @@ -120,16 +120,16 @@ If cleanup fails, the unusable pages will expire from the backend after 2 weeks. ``` Winning auth: - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page2 - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1 + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/2 + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1 WRITE: /autoupdate/staging | auth_id: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 Losing auth: - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 WRITE CONFLICT: /autoupdate/staging | auth_id: dd850e65-d2b2-4557-8ffb-def893c52530 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 ``` To read all pages, auth servers read the first page, get the auth server ID from the `auth_id` field, and then range-read the remaining pages. From aff1df3b965af7f708368a2313061f042d7901e5 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 9 Sep 2024 22:19:06 -0400 Subject: [PATCH 68/84] feedback --- rfd/0169-auto-updates-linux-agents.md | 43 +++++++++++++++++++++++---- 1 file changed, 38 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 25f58f89994ec..9f4d5a430b868 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -15,7 +15,9 @@ state: draft This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. 
-Users of Teleport will be able to use the tctl CLI to specify desired versions and update schedules. +Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout speed. + +Agents will be updated by a new `teleport-update` binary, built from `tools/teleport-update` in the Teleport repository. All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes. @@ -24,6 +26,7 @@ The following anti-goals are out-of-scope for this proposal, but will be address - Teleport Cloud APIs for updating agents - Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD - Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion) +- Support for progressive rollouts of tbot, when not installed on the same system as a Teleport agent This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -36,7 +39,7 @@ The existing mechanism for automatic agent updates does not provide a hands-off 1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades. 2. The use of system package management requires logic that varies significantly by target distribution. 3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. -4. The use of bash to implement the updater makes changes difficult and prone to error. +4. The use of bash to implement the updater makes long-term maintenance difficult. 5. The existing auto-updater has limited automated testing. 6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. 7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). @@ -437,10 +440,13 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service ``` -updates.yaml: +#### updates.yaml + +This file stores configuration for `teleport-update`. + ``` version: v1 -kind: agent_versions +kind: updates spec: # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from. proxy: mytenant.teleport.sh @@ -452,7 +458,10 @@ spec: active_version: 15.1.1 ``` -backup.yaml: +#### backup.yaml + +This file stores metadata about an individual backup of the Teleport agent's sqlite DB. + ``` version: v1 kind: db_backup @@ -920,6 +929,30 @@ message AgentRolloutPlanHost { } ``` +## Alternatives + +### `teleport update` Subcommand + +`teleport-update` is intended to be a minimal binary, with few dependencies, that is used to bootstrap initial Teleport agent installations. +It may be baked into AMIs or containers. + +If the entirely `teleport` binary were used instead, security scanners would match vulnerabilities all Teleport dependencies, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often. +Deploying these updates is often more disruptive than a soft restart of the agent triggered by the auto-updater. + +`teleport-update` will also handle `tbot` updates in the future, and it would be undesirable to distribute `teleport` with `tbot` just to enable automated updates. 
+ +Finally, `teleport-update`'s API contract with the cluster must remain stable to ensure that outdated agent installations can always be recovered. +The first version of `teleport-update` will need to work with Teleport v14 and all future versions of Teleport. +This contract may be easier to manage with a separate artifact. + +### Mutually-Authenticated RPC for Update Boolean + +Agents will not always have a mutually-authenticated connection to auth to receive update instructions. +For example, the agent may be in a failed state due to a botched upgrade, may be temporarily stopped, or may be newly installed. +In the future, `tbot`-only installations may have expired certificates. + +Making the update boolean instruction available via the `/webapi/find` TLS endpoint reduces complexity as well as the risk of unrecoverable outages. + ## Execution Plan 1. Implement Teleport APIs for new scheduling system (without backpressure or group interdependence) From c91977f8f1b193ac636a10bac356eeafaadafc30 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 10 Sep 2024 17:07:44 -0400 Subject: [PATCH 69/84] adjust update.yaml to match implementation feedback --- rfd/0169-auto-updates-linux-agents.md | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 9f4d5a430b868..48d9d6a53240c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -67,7 +67,7 @@ Whether the Teleport updater querying the endpoint is instructed to upgrade (via To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-update enable`: +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: ```shell $ teleport-update enable --proxy teleport.example.com --group staging ``` @@ -200,7 +200,7 @@ Notes: - Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. -- The group name is read from `/var/lib/teleport/versions/updates.yaml` by the updater. +- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. 
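
For illustration only, the updater's side of this exchange might assemble the find URL as in the sketch below. The field names follow the `update.yaml` spec in this RFD; the YAML library choice (`gopkg.in/yaml.v3` here) and function names are assumptions, not prescribed by this proposal.

```golang
// Illustrative sketch of how the updater could assemble the find URL
// from local state; not the prescribed implementation.
package updater

import (
	"fmt"
	"net/url"
	"os"
	"strings"

	"gopkg.in/yaml.v3"
)

type updateFile struct {
	Spec struct {
		Proxy string `yaml:"proxy"`
		Group string `yaml:"group"`
	} `yaml:"spec"`
}

func findURL() (string, error) {
	raw, err := os.ReadFile("/var/lib/teleport/versions/update.yaml")
	if err != nil {
		return "", err
	}
	var f updateFile
	if err := yaml.Unmarshal(raw, &f); err != nil {
		return "", err
	}
	host, err := os.ReadFile("/var/lib/teleport/host_uuid")
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("https://%s/v1/webapi/find?host=%s&group=%s",
		f.Spec.Proxy,
		url.QueryEscape(strings.TrimSpace(string(host))),
		url.QueryEscape(f.Spec.Group)), nil
}
```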
### Teleport Resources @@ -427,7 +427,7 @@ $ tree /var/lib/teleport │ └── etc │ └── systemd │ └── teleport.service - └── updates.yaml + └── update.yaml $ ls -l /usr/local/bin/tsh /usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh $ ls -l /usr/local/bin/tbot @@ -440,7 +440,7 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service ``` -#### updates.yaml +#### update.yaml This file stores configuration for `teleport-update`. @@ -452,8 +452,11 @@ spec: proxy: mytenant.teleport.sh # group specifies the update group group: staging + # url_template specifies a custom URL template for downloading Teleport. + # url_template: "" # enabled specifies whether auto-updates are enabled, i.e., whether teleport-update update is allowed to update the agent. enabled: true +status: # active_version specifies the active (symlinked) deployment of the telepport agent. active_version: 15.1.1 ``` @@ -505,18 +508,18 @@ The `enable` subcommand will: 8. Replace any existing binaries or symlinks with symlinks to the current version. 9. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 10. Restart the agent if the systemd service is already enabled. -11. Set `active_version` in `updates.yaml` if successful or not enabled. +11. Set `active_version` in `update.yaml` if successful or not enabled. 12. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 13. Remove and purge any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. 15. Remove all stored versions of the agent except the current version and last working version. -16. Configure `updates.yaml` with the current proxy address and group, and set `enabled` to true. +16. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. The `disable` subcommand will: -1. Configure `updates.yaml` to set `enabled` to false. +1. Configure `update.yaml` to set `enabled` to false. When `update` subcommand is otherwise executed, it will: -1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. +1. Check `update.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. 3. Check that `agent_autoupdates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. @@ -528,7 +531,7 @@ When `update` subcommand is otherwise executed, it will: 10. Update symlinks to point at the new version. 11. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 12. Restart the agent if the systemd service is already enabled. -13. Set `active_version` in `updates.yaml` if successful or not enabled. +13. Set `active_version` in `update.yaml` if successful or not enabled. 14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 15. Remove all stored versions of the agent except the current version and last working version. 
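
For illustration, the checksum verification in the download steps above could look like the following sketch. The function name and error handling are assumptions; only the `.sha256` suffix convention is taken from this RFD.

```golang
// Illustrative sketch of the checksum verification step: verify the
// downloaded tarball against the .sha256 file served next to it.
package updater

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func verifyTarball(tarballURL, localPath string) error {
	resp, err := http.Get(tarballURL + ".sha256")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	sum, err := io.ReadAll(io.LimitReader(resp.Body, 1024))
	if err != nil {
		return err
	}
	fields := strings.Fields(string(sum))
	if len(fields) == 0 {
		return fmt.Errorf("empty checksum file")
	}
	want := fields[0] // hex digest is the first field

	f, err := os.Open(localPath)
	if err != nil {
		return err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != want {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, want)
	}
	return nil
}
```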
From ec8d67530e6c334d4d65a805adbdc5247b49ecb6 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 24 Sep 2024 15:49:37 -0700 Subject: [PATCH 70/84] wip - new model --- rfd/0169-auto-updates-linux-agents.md | 618 +++++++++++--------------- 1 file changed, 266 insertions(+), 352 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 48d9d6a53240c..2b8f264ed2899 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -25,7 +25,6 @@ The following anti-goals are out-of-scope for this proposal, but will be address - Signing of agent artifacts (e.g., via TUF) - Teleport Cloud APIs for updating agents - Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD -- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion) - Support for progressive rollouts of tbot, when not installed on the same system as a Teleport agent This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -53,154 +52,9 @@ The existing mechanism for automatic agent updates does not provide a hands-off We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain and safer for production use. -## Details - Teleport API - -Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `autoupdate_version` resource. - -Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: -- The `host=[uuid]` parameter sent to `/v1/webapi/find` -- The `group=[name]` parameter sent to `/v1/webapi/find` -- The schedule defined in the new `autoupdate_config` resource -- The status of past agent upgrades for the given version - -To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. - -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: -```shell -$ teleport-update enable --proxy teleport.example.com --group staging -``` - -At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend. -An fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. -Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`. -The group rollout is halted if timeouts or failures exceed their specified thresholds. -Rollouts may be retried with `tctl autoupdate run`. - -### Scalability - -#### Window Capture - -Instance heartbeats will be cached by auth servers using a dedicated cache. -This cache is initialized from the backend when the auth server starts, and kept up-to-date when the heartbeats are broadcast to all auth servers. 
- -When the auth server is started, the cache is initialized using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats. -The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters. -The cache is considered healthy when all instance heartbeats present on the backend have been read at least once. - -Instance heartbeats are currently broadcast to all auth servers. -The cache will be kept up-to-date when the auth server receives updates. - -At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend. -This plan is protected by optimistic locking, and contains the following data: - -Data key: `/autoupdate/[name of group](/[auth ID]/[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1`) - -Data value JSON: -- `start_time`: timestamp of current window start time -- `version`: version for which this rollout is valid -- `schedule`: type of schedule that triggered the rollout -- `hosts`: list of host UUIDs in randomized order -- `auth_id`: ID of the auth server writing the plan - - -Expiration time of each key is 2 weeks. - -At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `start_time` to the current time and the desired window. -If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan. -The first auth server to write the plan wins; others will be rejected by the optimistic lock. -Auth servers will only write the plan if their instance heartbeat cache is healthy. - -If the resource size is greater than 100 KiB, auth servers will divide the resource into pages no greater than 100 KiB each. -This is necessary to support backends with a value size limit. - -Each page will duplicate all values besides `hosts`, which will be different for each page. -All pages besides the first page will be prefixed with the auth server's ID. -Pages will be written in reverse order before the final atomic non-prefixed write of the first page. -If the non-prefixed write fails, the auth server is responsible for cleaning up the unusable pages. -If cleanup fails, the unusable pages will expire from the backend after 2 weeks. - -``` -Winning auth: - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/2 - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1 - WRITE: /autoupdate/staging | auth_id: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 - -Losing auth: - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 - WRITE CONFLICT: /autoupdate/staging | auth_id: dd850e65-d2b2-4557-8ffb-def893c52530 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 -``` - -To read all pages, auth servers read the first page, get the auth server ID from the `auth_id` field, and then range-read the remaining pages. - -#### Rollout - -The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. - -The following data related to the rollout are stored in each instance heartbeat: -- `agent_update_start_time`: timestamp of individual agent's upgrade time -- `agent_update_version`: current agent version - -Expiration time of the heartbeat is extended to 24 hours when `agent_update_start_time` is written. 
- -Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. -This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. -Instance heartbeats are considered completed when either `agent_update_version` matches the plan version, or `agent_update_start_time` is past the expiration time. -```golang -unfinished := make(map[Rollout][UUID]int) -``` - -On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading. -This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`. -If the stored number is fewer, `agent_update_start_time` is updated to the current time when the heartbeat is written. +## UX -The auth server writes the following keys to `/autoupdate/[name of group]/status` (e.g., `/autoupdate/staging/status`): -- `last_active_host_index`: index of the last host allowed to update -- `failed_host_count`: failed host count -- `timeout_host_count`: timed-out host count - -Writes are rate-limited such that the progress is only updated every 10 seconds. -If the auth server's cached progress is greater than its calculated progress, the auth server declines to update the progress. - -The predetermined ordering of hosts avoids cache synchronization issues between auth servers. -Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. - -Each group rollout is represented by an `agent_rollout_plan` Teleport resource that includes the progress and host count, but not the list of UUIDs. -Proxies use the start time in the resource to determine when to stream the list of UUIDs via a dedicated RPC. -Proxies watch the status section of `agent_rollout_plan` for updates to progress. - -Proxies read all started rollouts and maintain an in-memory map of host UUID to upgrading status: -```golang -upgrading := make(map[UUID]bool) -``` -Proxies watch for changes to the plan and update the map accordingly. - -When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]&group=[name]`, the proxies query the map to determine the value of `agent_autoupdate: true`. - -Updating all agents generates the following additional backend write load: -- One write per page of the rollout plan per update group. -- One write per auth server every 10 seconds, during rollouts. - -### REST Endpoints - -`/v1/webapi/find?host=[host_uuid]&group=[name]` -```json -{ - "server_edition": "enterprise", - "agent_version": "15.1.1", - "agent_autoupdate": true, - "agent_update_jitter_seconds": 10 -} -``` -Notes: -- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. -- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. -- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. +[Hugo to add] ### Teleport Resources @@ -211,74 +65,85 @@ kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed - # agent updates are in place. - agent_autoupdate: true|false + # agent updates are in place. 
Setting this to pause will halt the rollout. + agent_autoupdate: disable|enable|pause - # agent_schedules contains both "regular" and "critical" schedules. - # The schedule used is determined by the agent_version_schedule associated - # with the version in autoupdate_version. - # Groups are not configurable with the "immediate" schedule. + # agent_schedules specifies version rollout schedules for agents. + # The schedule used is determined by the schedule associated + # with the version in the rollout_plan resource. + # For now, only the "regular" strategy is configurable. agent_schedules: - # schedule is "regular" or "critical" + # rollout strategy must be "regular" for now regular: - # name of the group. Must only contain valid backend / resource name characters. - - name: staging-group - # days specifies the days of the week when the group may be updated. - # default: ["*"] (all days) - days: [“Sun”, “Mon”, ... | "*"] - # start_hour specifies the hour when the group may start upgrading. - # default: 0 - start_hour: 0-23 - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent updater client will pick a random time within this duration to wait to update. - # default: 0 - jitter_seconds: 0-60 - # timeout_seconds specifies the amount of time, after the specified jitter, after which - # an agent update will be considered timed out if the version does not change. - # default: 60 - timeout_seconds: 30-900 - # failure_seconds specifies the amount of time after which an agent update will be considered - # failed if the agent heartbeat stops before the update is complete. - # default: 0 - failure_seconds: 0-900 - # max_in_flight specifies the maximum number of agents that may be updated at the same time. - # default: 100% - max_in_flight: 0-100% - # max_timeout_before_halt specifies the percentage of clients that may time out before this group - # and all dependent groups are halted. - # default: 10% - max_timeout_before_halt: 0-100% - # max_failed_before_halt specifies the percentage of clients that may fail before this group - # and all dependent groups are halted. - # default: 0 - max_failed_before_halt: 0-100% - # requires specifies groups that must pass with the current version before this group is allowed - # to run using that version. - requires: ["test-group"] + # name of the group. Must only contain valid backend / resource name characters. + - name: staging + # days specifies the days of the week when the group may be updated. + # default: ["*"] (all days) + days: [ “Sun”, “Mon”, ... | "*" ] + # start_hour specifies the hour when the group may start upgrading. + # default: 0 + start_hour: 0-23 + # wait_days specifies how many days to wait after the previous group finished before starting. + # default: 0 + wait_days: 0-1 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent updater client will pick a random time within this duration to wait to update. + # default: 5 + jitter_seconds: 0-60 + # max_in_flight specifies the maximum number of agents that may be updated at the same time. + # Only valid for the backpressure strategy. + # default: 20% + max_in_flight: 10-100% + # alert_after specifies the duration after which a cluster alert will be set if the rollout has + # not completed. + # default: 4h + alert_after: 1h + # ... 
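+      # For illustration only (these are not additional schema fields), a
+      # later group could chain off staging using the fields defined above:
+      #
+      #   - name: production
+      #     days: ["Mon", "Tue", "Wed", "Thu"]
+      #     start_hour: 4
+      #     wait_days: 1
+      #     jitter_seconds: 60
+      #     max_in_flight: 20%
+      #     alert_after: 2h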
``` +Default resource: +```yaml +kind: autoupdate_config +spec: + agent_autoupdate: enable + agent_schedules: + regular: + - name: default + days: ["*"] + start_hour: 0 + jitter_seconds: 5 + max_in_flight: 20% + alert_after: 4h +``` + Dependency cycles are rejected. Dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. -The updater will receive `agent_autoupdate: true` from the time is it designated for update until the version changes in `autoupdate_version`. -After 24 hours, the update is halted in-place, and the group is considered failed if unfinished. +The update proceeds from the first group to the last group, ensuring that each group successfully updates before allowing the next group to proceed. + +The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. +Changing the `target_version` resets the schedule immediately, clearing all progress. + +Changing the `current_version` in `autoupdate_agent_plan` changes the advertised `current_version` for all unfinished groups. -Changing the version or schedule completely resets progress. -Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. +Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. +However, any changes to `agent_schedules` that occur while a group is active will be rejected. + +Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. Note that the `default` schedule applies to agents that do not specify a group name. ```shell # configuration -$ tctl autoupdate update--set-agent-auto-update=off +$ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3 +$ tctl autoupdate update --group staging-group --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=60 +$ tctl autoupdate update --group staging-group --set-jitter-seconds=60 Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --default --set-jitter-seconds=60 +$ tctl autoupdate update --group default --set-jitter-seconds=60 Automatic updates configuration has been updated. $ tctl autoupdate reset Automatic updates configuration has been reset to defaults. @@ -309,16 +174,46 @@ $ tctl autoupdate run --group staging-group Executing auto-update for group 'staging-group' immediately. ``` +Notes: +- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_agent_plan`, while maintaining control over the rollout. + +#### Rollout + ```yaml -kind: autoupdate_version +kind: autoupdate_agent_plan spec: - # agent_version is the version of the agent the cluster will advertise. - agent_version: X.Y.Z - # agent_version_schedule specifies the rollout schedule associated with the version. - # Currently, only critical, regular, and immediate schedules are permitted. - agent_version_schedule: regular|critical|immediate - - # ... + # current_version is the desired version for agents before their window. + current_version: A.B.C + # target_version is the desired version for agents after their window. 
+ target_version: X.Y.Z + # schedule to use for the rollout + schedule: regular|immediate + # strategy to use for the rollout + # default: backpressure + strategy: backpressure|grouped + # paused specifies whether the rollout is paused + # default: enabled + autoupdate: enabled|disabled|paused +status: + groups: + # name of group + - name: staging + # start_time is the time the upgrade will start + start_time: 2020-12-09T16:09:53+00:00 + # initial_count is the number of connected agents at the start of the window + initial_count: 432 + # missing_count is the number of agents disconnected since the start of the rollout + present_count: 53 + # failed_count is the number of agents rolled-back since the start of the rollout + failed_count: 23 + # progress is the current progress through the rollout + progress: 0.532 + # state is the current state of the rollout (unstarted, active, done, rollback) + state: active + # last_update_time is the time of the previous update for the group + last_update_time: 2020-12-09T16:09:53+00:00 + # last_update_reason is the trigger for the last update + last_update_reason: rollback ``` ```shell @@ -328,45 +223,96 @@ $ tctl autoupdate update --set-agent-version=15.1.2 --critical Automatic updates configuration has been updated. ``` -Notes: -- `autoupdate_version` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +## Details - Teleport API -#### Rollout +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. -```yaml -kind: agent_rollout_plan -spec: - # start time of the rollout - start_time: 0001-01-01T00:00:00Z - # target version of the rollout - version: X.Y.Z - # schedule that triggered the rollout - schedule: regular - # hosts updated by the rollout - host_count: 127 -status: - # current host index in rollout progress - last_active_host_index: 23 - # failed hosts - failed_host_count: 3 - # timed-out hosts - timeout_host_count: 1 +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` +- The schedule defined in the new `autoupdate_config` resource +- The status of past agent upgrades for the given version + +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. +Teleport auth servers use their access to agent heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. + +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: +```shell +$ teleport-update enable --proxy teleport.example.com --group staging ``` -Notes: -- This resource is stored in a paginated format with separate keys for each page and progress +At the start of a group rollout, the Teleport auth servers record the initial number connected agents. +A fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. 
+Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. +Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. + +### Rollout + +Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary. + +The following data related to the rollout are stored in each instance heartbeat: +- `agent_update_start_time`: timestamp of individual agent's upgrade time +- `agent_update_current_version`: current agent version +- `agent_update_rollback`: whether the agent was rolled-back automatically +- `agent_update_uuid`: Auto-update UUID +- `agent_update_group`: Auto-update group name + +Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`). + +Every minute, auth servers persist the version counts: +- `version_counts[group][version]` + - `count`: number of currently connected agents at `version` in `group` + - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade + - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group` + +At the start of each group's window, auth servers persist an initial count: +- `initial_counts[group]` + - `count`: number of connected agents in `group` at start of window -### Version Promotion +Expiration time of the persisted key is 1 hour. -This RFD only proposed a mechanism to signal when agent auto-updates should occur. -Advertising different target Teleport versions for different groups of agents is out-of-scope for this RFD. -This means that groups which employ auto-scaling or ephemeral resources will slowly converge to the latest Teleport version. +To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval. +- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written. +- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead. -**This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production.** +If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. +This prevents double-counting agents when auth servers are killed. -To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-update enable`) to determine which version should be served. +#### Progress Formulas -This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. +Each auth server will calculate the progress as `( max_in_flight * initial_counts[group].count + version_counts[group][target_version].count ) / initial_counts[group].count` and write the progress to `autoupdate_agent_plan` status. 
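+(As a worked example of this formula: with `max_in_flight` = 20%, 1000 agents connected at the start of the window, and 350 already updated, the written progress would be `(0.2 * 1000 + 350) / 1000 = 0.55`, allowing agents in the first 55% of UUID space to update.)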
+This formula determines the progress percentage by adding a `max_in_flight` percentage-window above the number of currently updated agents in the group. + +However, if `as_numeral(version_counts[group][not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead. +This protects against a statistical deadlock, where no UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to update. + +To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction must be imposed for the rollout to proceed: +`version_counts[group][*].count > initial_counts[group].count - max_in_flight * initial_counts[group].count` + +To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one minute will be considered in these formulas. + +#### Proxies + +When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the `autoupdate_agent_plan` status to determine the value of `agent_autoupdate: true`. +The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress percentage for the `group`: +`as_numeral(host_uuid) / as_numeral(max_uuid) < progress` + +### REST Endpoints + +`/v1/webapi/find?host=[uuid]&group=[name]` +```json +{ + "server_edition": "enterprise", + "agent_version": "15.1.1", + "agent_autoupdate": true, + "agent_update_jitter_seconds": 10 +} +``` +Notes: +- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. +- The UUID and group name are read from `/var/lib/teleport/versions/update.yaml` by the updater. ## Details - Linux Agents @@ -380,7 +326,7 @@ Source code for the updater will live in the main Teleport repository, with the ### Installation ```shell -$ apt-get install teleport-ent-updater +$ apt-get install teleport $ teleport-update enable --proxy example.teleport.sh # if not enabled already, configure teleport and: @@ -427,6 +373,16 @@ $ tree /var/lib/teleport │ └── etc │ └── systemd │ └── teleport.service + ├── system # if installed via OS package + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ └── etc + │ └── systemd + │ └── teleport.service └── update.yaml $ ls -l /usr/local/bin/tsh /usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh @@ -444,9 +400,11 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service This file stores configuration for `teleport-update`. +All updates are applied atomically using renameio. + ``` version: v1 -kind: updates +kind: update_config spec: # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from. proxy: mytenant.teleport.sh @@ -457,8 +415,16 @@ spec: # enabled specifies whether auto-updates are enabled, i.e., whether teleport-update update is allowed to update the agent. enabled: true status: - # active_version specifies the active (symlinked) deployment of the telepport agent. + # start_time specifies the start time of the most recent update. + start_time: 2020-12-09T16:09:53+00:00 + # active_version specifies the active (symlinked) deployment of the teleport agent. 
active_version: 15.1.1 + # version_history specifies the previous deployed versions, in order by recency. + version_history: ["15.1.3", "15.0.4"] + # rollback specifies whether the most recent version was deployed by an automated rollback. + rollback: true + # error specifies the last error encounted + error: "" ``` #### backup.yaml @@ -479,7 +445,7 @@ spec: ### Runtime -The agent-updater will run as a periodically executing systemd service which runs every 10 minutes. +The `teleport-update` binary will run as a periodically executing systemd service which runs every 10 minutes. The systemd service will run: ```shell $ teleport-update update @@ -490,7 +456,7 @@ After it is installed, the `update` subcommand will no-op when executed until co $ teleport-update enable --proxy mytenant.teleport.sh --group staging ``` -If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. +If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used, if present. The `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. It will also run update teleport immediately, to ensure that subsequent executions succeed. @@ -498,9 +464,9 @@ It will also run update teleport immediately, to ensure that subsequent executio Both `update` and `enable` will maintain a shared lock file preventing any re-entrant executions. The `enable` subcommand will: -1. Query the `/v1/webapi/find` endpoint. -2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). -3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). +1. If an updater-incompatible version of the Teleport package is installed, fail immediately. +2. Query the `/v1/webapi/find` endpoint. +3. If the current updater-managed version of Teleport is the latest, jump to (14). 4. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). @@ -510,10 +476,8 @@ The `enable` subcommand will: 10. Restart the agent if the systemd service is already enabled. 11. Set `active_version` in `update.yaml` if successful or not enabled. 12. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. -13. Remove and purge any `teleport` package if installed. -14. Verify the symlinks to the active version still exists. -15. Remove all stored versions of the agent except the current version and last working version. -16. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. +13. Remove all stored versions of the agent except the current version and last working version. +14. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. The `disable` subcommand will: 1. Configure `update.yaml` to set `enabled` to false. @@ -535,18 +499,21 @@ When `update` subcommand is otherwise executed, it will: 14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 15. Remove all stored versions of the agent except the current version and last working version. 
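Both subcommands rely on the shared lock described above. A minimal sketch of that re-entrancy guard, assuming an advisory `flock`-based lock and a hypothetical lock path (neither is specified by this RFD):

```golang
// Sketch only: serialize `update` and `enable` with an exclusive advisory lock.
package updater

import (
	"fmt"
	"os"
	"syscall" // Linux-only; the agent updater targets Linux servers.
)

// withUpdateLock runs fn while holding the lock, so concurrent invocations
// of teleport-update cannot interleave filesystem changes.
func withUpdateLock(fn func() error) error {
	// Hypothetical lock path; the real location is an implementation detail.
	f, err := os.OpenFile("/var/lib/teleport/versions/update.lock", os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return fmt.Errorf("opening lock file: %w", err)
	}
	defer f.Close()
	// LOCK_NB makes a second invocation fail fast instead of queueing.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		return fmt.Errorf("another teleport-update invocation is running: %w", err)
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}
```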
-To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different. +To guarantee auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different. The `/usr/local/bin/teleport-update` symlink will take precedence to avoid reexec in most scenarios. To ensure that SELinux permissions do not prevent the `teleport-update` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. -To ensure that `teleport` package removal does not interfere with `teleport-update`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. -Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date. - -To ensure that `teleport` package removal does not lead to a hard restart of Teleport, the updater will ensure that the package is removed without triggering needrestart or similar services. - To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup. +The `teleport` apt and yum packages contain a system installation of Teleport in `/var/lib/teleport/versions/system`. +Post package installation, the `link` subcommand is executed automatically to link the system installation when no auto-updater-managed version of Teleport is linked: +``` +/usr/local/bin/teleport -> /var/lib/teleport/versions/system/bin/teleport +/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/system/bin/teleport-updater +... +``` + #### Failure Conditions If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. @@ -563,9 +530,6 @@ To retrieve known information about agent updates, the `status` subcommand will "agent_version_installed": "15.1.1", "agent_version_desired": "15.1.2", "agent_version_previous": "15.1.0", - "agent_edition_installed": "enterprise", - "agent_edition_desired": "enterprise", - "agent_edition_previous": "enterprise", "agent_update_time_last": "2020-12-10T16:00:00+00:00", "agent_update_time_jitter": 600, "agent_updates_enabled": true @@ -618,6 +582,8 @@ This workflow supports customers that cannot use the auto-update mechanism provi Cluster administrators that want to self-manage agent updates may manually query the `/v1/webapi/find` endpoint using the host UUID, and implement auto-updates with their own automation. +Cluster administrators that choose this path may use the `teleport` package without auto-updates enabled locally. + ### Installers The following install scripts will be updated to install the latest updater and run `teleport-update enable` with the proxy address: @@ -632,10 +598,10 @@ Eventually, additional logic from the scripts could be added to `teleport-update Moving additional logic into the updater is out-of-scope for this proposal. To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted: -- Install the `teleport-ent-updater` package and defer `teleport-update enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts. 
+- Install the `teleport` package and defer `teleport-update enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts. This allows both the proxy address and token to be injected at VM initialization. The VM image may be used with any Teleport cluster. Installers scripts will continue to function, as the package install operation will no-op. -- Install the `teleport-ent-updater` package and run `teleport-update enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts. +- Install the `teleport` package and run `teleport-update enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts. This allows the proxy address to be pre-set in the image. `teleport.yaml` can be partially configured during image creation. At minimum, the token must be injected via cloud-init scripts. Installers scripts would be skipped in favor of the `teleport configure` command. @@ -666,9 +632,12 @@ Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX ## Migration -The existing update scheduling system will remain in-place until the old auto-updater is fully deprecated. +The existing update system will remain in-place until the old auto-updater is fully deprecated. + +Both update systems can co-exist on the same machine. +The old auto-updater will update the system package, which will not affect the `teleport-update`-managed installation. -Eventually, the `cluster_maintenance_config` resource will be deprecated. +Eventually, the `cluster_maintenance_config` resource and `teleport-ent-upgrader` package will be deprecated. ## Security @@ -717,19 +686,14 @@ service AutoUpdateService { // ResetAutoUpdateConfig restores the autoupdate config to default values. rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // GetAutoUpdateVersion returns the autoupdate version. - rpc GetAutoUpdateVersion(GetAutoUpdateVersionRequest) returns (AutoUpdateVersion); - // CreateAutoUpdateVersion creates the autoupdate version. - rpc CreateAutoUpdateVersion(CreateAutoUpdateVersionRequest) returns (AutoUpdateVersion); - // UpdateAutoUpdateVersion updates the autoupdate version. - rpc UpdateAutoUpdateVersion(UpdateAutoUpdateVersionRequest) returns (AutoUpdateVersion); - // UpsertAutoUpdateVersion overwrites the autoupdate version. - rpc UpsertAutoUpdateVersion(UpsertAutoUpdateVersionRequest) returns (AutoUpdateVersion); - - // GetAgentRolloutPlan returns the agent rollout plan and current progress. - rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan); - // GetAutoUpdateVersion streams the agent rollout plan's list of all hosts. - rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost); + // GetAutoUpdateAgentPlan returns the autoupdate plan for agents. + rpc GetAutoUpdateAgentPlan(GetAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // CreateAutoUpdateAgentPlan creates the autoupdate plan for agents. + rpc CreateAutoUpdateAgentPlan(CreateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // UpdateAutoUpdateAgentPlan updates the autoupdate plan for agents. + rpc UpdateAutoUpdateAgentPlan(UpdateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // UpsertAutoUpdateAgentPlan overwrites the autoupdate plan for agents. 
+ rpc UpsertAutoUpdateAgentPlan(UpsertAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); } // GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig. @@ -779,8 +743,6 @@ message AutoUpdateConfigSpec { message AgentAutoUpdateSchedules { // regular schedules for non-critical versions. repeated AgentAutoUpdateGroup regular = 1; - // critical schedules for urgently needed versions. - repeated AgentAutoUpdateGroup critical = 2; } // AgentAutoUpdateGroup specifies the update schedule for a group of agents. @@ -799,12 +761,6 @@ message AgentAutoUpdateGroup { int32 failure_seconds = 6; // max_in_flight specifies agents that can be updated at the same time, by percent. string max_in_flight = 7; - // max_timeout_before_halt specifies agents that can timeout before the rollout is halted, by percent. - string max_timeout_before_halt = 8; - // max_failed_before_halt specifies agents that can fail before the rollout is halted, by percent. - string max_failed_before_halt = 9; - // requires specifies rollout groups that must succeed for the current version/schedule before this rollout can run. - repeated string requires = 10; } // Day of the week @@ -820,29 +776,29 @@ enum Day { DAY_SATURDAY = 8; } -// GetAutoUpdateVersionRequest requests the autoupdate_version singleton resource. -message GetAutoUpdateVersionRequest {} +// GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource. +message GetAutoUpdateAgentPlanRequest {} -// GetAutoUpdateVersionRequest requests creation of the autoupdate_version singleton resource. -message CreateAutoUpdateVersionRequest { - // autoupdate_version resource contents - AutoUpdateVersion autoupdate_version = 1; +// GetAutoUpdateAgentPlanRequest requests creation of the autoupdate_agent_plan singleton resource. +message CreateAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; } -// GetAutoUpdateVersionRequest requests an update of the autoupdate_version singleton resource. -message UpdateAutoUpdateVersionRequest { - // autoupdate_version resource contents - AutoUpdateVersion autoupdate_version = 1; +// GetAutoUpdateAgentPlanRequest requests an update of the autoupdate_agent_plan singleton resource. +message UpdateAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; } -// GetAutoUpdateVersionRequest requests an upsert of the autoupdate_version singleton resource. -message UpsertAutoUpdateVersionRequest { - // autoupdate_version resource contents - AutoUpdateVersion autoupdate_version = 1; +// GetAutoUpdateAgentPlanRequest requests an upsert of the autoupdate_agent_plan singleton resource. +message UpsertAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; } -// AutoUpdateVersion holds dynamic configuration settings for autoupdate versions. -message AutoUpdateVersion { +// AutoUpdateAgentPlan holds dynamic configuration settings for agent autoupdates. +message AutoUpdateAgentPlan { // kind is the kind of the resource. string kind = 1; // sub_kind is the sub kind of the resource. @@ -852,11 +808,13 @@ message AutoUpdateVersion { // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; // spec is the spec of the resource. - AutoUpdateVersionSpec spec = 5; + AutoUpdateAgentPlanSpec spec = 5; + // status is the status of the resource. 
+ AutoUpdateAgentPlanStatus status = 6; } -// AutoUpdateVersionSpec is the spec for the autoupdate version. -message AutoUpdateVersionSpec { +// AutoUpdateAgentPlanSpec is the spec for the autoupdate version. +message AutoUpdateAgentPlanSpec { // agent_version is the desired agent version for new rollouts. string agent_version = 1; // agent_version schedule is the schedule to use for rolling out the agent_version. @@ -869,54 +827,16 @@ enum Schedule { SCHEDULE_UNSPECIFIED = 0; // REGULAR update schedule SCHEDULE_REGULAR = 1; - // CRITICAL update schedule for critical bugs and vulnerabilities - SCHEDULE_CRITICAL = 2; // IMMEDIATE update schedule for updating all agents immediately - SCHEDULE_IMMEDIATE = 3; + SCHEDULE_IMMEDIATE = 2; } -// GetAgentRolloutPlanRequest requests an agent_rollout_plan. -message GetAgentRolloutPlanRequest { - // name of the agent_rollout_plan - string name = 1; -} - -// GetAgentRolloutPlanHostsRequest requests the ordered host UUIDs for an agent_rollout_plan. -message GetAgentRolloutPlanHostsRequest { - // name of the agent_rollout_plan - string name = 1; -} - -// AgentRolloutPlan defines a version update rollout consisting a fixed group of agents. -message AgentRolloutPlan { - // kind is the kind of the resource. - string kind = 1; - // sub_kind is the sub kind of the resource. - string sub_kind = 2; - // version is the version of the resource. - string version = 3; - // metadata is the metadata of the resource. - teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AgentRolloutPlanSpec spec = 5; - // status is the status of the resource. - AgentRolloutPlanStatus status = 6; -} - -// AutoUpdateVersionSpec is the spec for the AgentRolloutPlan. -message AgentRolloutPlanSpec { - // start_time of the rollout - google.protobuf.Timestamp start_time = 1; +// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. +message AutoUpdateAgentPlanStatus { // version targetted by the rollout string version = 2; - // schedule that triggered the rollout - string schedule = 3; - // host_count of hosts to update - int64 host_count = 4; -} - -// AutoUpdateVersionStatus is the status for the AgentRolloutPlan. -message AgentRolloutPlanStatus { + // start_time of the rollout + google.protobuf.Timestamp start_time = 1; // last_active_host_index specifies the index of the last host that may be updated. int64 last_active_host_index = 1; // failed_host_count specifies the number of failed hosts. @@ -924,12 +844,6 @@ message AgentRolloutPlanStatus { // timeout_host_count specifies the number of timed-out hosts. int64 timeout_host_count = 3; } - -// AgentRolloutPlanHost identifies an agent by host ID -message AgentRolloutPlanHost { - // host_id of a host included in the rollout - string host_id = 1; -} ``` ## Alternatives @@ -958,13 +872,13 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo ## Execution Plan -1. Implement Teleport APIs for new scheduling system (without backpressure or group interdependence) +1. Implement Teleport APIs for new scheduling system (without backpressure strategy) 2. Implement new Linux server auto-updater in Go. 3. Implement changes to Kubernetes auto-updater. 4. Test extensively on all supported Linux distributions. 5. Prep documentation changes. -6. Release new updater via teleport-ent-updater package. +6. Release via `teleport` package. 7. Release documentation changes. 8. Communicate to users that they should update their updater. -9. Deprecate old auto-updater endpoints. +9. 
Begin deprecation of old auto-updater resources, packages, and endpoints. 10. Add group interdependence and backpressure features. From 7c89fb6928711e2d139463ef482dcf30c7c97855 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 25 Sep 2024 14:00:48 -0700 Subject: [PATCH 71/84] canaries --- rfd/0169-auto-updates-linux-agents.md | 39 ++++++++++++++++++++------- 1 file changed, 30 insertions(+), 9 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2b8f264ed2899..89527542fe8df 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -90,14 +90,18 @@ spec: # The agent updater client will pick a random time within this duration to wait to update. # default: 5 jitter_seconds: 0-60 + # canary_count specifies the desired number of canaries to update before any other agents + # are updated. + # default: 5 + canaries: 0-10 # max_in_flight specifies the maximum number of agents that may be updated at the same time. # Only valid for the backpressure strategy. # default: 20% max_in_flight: 10-100% # alert_after specifies the duration after which a cluster alert will be set if the rollout has # not completed. - # default: 4h - alert_after: 1h + # default: 4 + alert_after_hours: 1-8 # ... ``` @@ -206,6 +210,8 @@ status: present_count: 53 # failed_count is the number of agents rolled-back since the start of the rollout failed_count: 23 + # canaries is a list of updater UUIDs used for canary deployments + canaries: ["abc123-..."] # progress is the current progress through the rollout progress: 0.532 # state is the current state of the rollout (unstarted, active, done, rollback) @@ -243,8 +249,14 @@ $ teleport-update enable --proxy teleport.example.com --group staging ``` At the start of a group rollout, the Teleport auth servers record the initial number connected agents. -A fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. +The number of updated and non-updated agents is tracked by the auth servers. + +If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. + +If canaries are enabled, a user-specified number of agents are updated first. +These agents must all update successfully for the rollout to proceed to the remaining agents. + Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. ### Rollout @@ -269,11 +281,13 @@ Every minute, auth servers persist the version counts: At the start of each group's window, auth servers persist an initial count: - `initial_counts[group]` - `count`: number of connected agents in `group` at start of window + - `canaries`: list of updater UUIDs to use for canary deployments Expiration time of the persisted key is 1 hour. To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval. - To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written. +- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_plan` status, if not already written. 
- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead. If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. @@ -312,7 +326,8 @@ The boolean is returned as `true` in the case that the provided `host` contains Notes: - Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The UUID and group name are read from `/var/lib/teleport/versions/update.yaml` by the updater. +- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. +- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. ## Details - Linux Agents @@ -520,6 +535,8 @@ If the new version of Teleport fails to start, the installation of Teleport is r If `teleport-update` itself fails with an error, and an older version of `teleport-update` is available, the update will retry with the older version. +If the agent losses its connection to the proxy, `teleport-update` updates the agent to the group's current desired version immediately. + Known failure conditions caused by intentional configuration (e.g., updates disabled) will not trigger retry logic. #### Status @@ -872,13 +889,17 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo ## Execution Plan -1. Implement Teleport APIs for new scheduling system (without backpressure strategy) -2. Implement new Linux server auto-updater in Go. +1. Implement Teleport APIs for new scheduling system (without backpressure strategy, canaries, or completion tracking) +2. Implement new Linux server auto-updater in Go, including systemd-based rollbacks. 3. Implement changes to Kubernetes auto-updater. 4. Test extensively on all supported Linux distributions. 5. Prep documentation changes. -6. Release via `teleport` package. +6. Release via `teleport` package and script for packageless install. 7. Release documentation changes. -8. Communicate to users that they should update their updater. +8. Communicate to users that they should update to the new system. 9. Begin deprecation of old auto-updater resources, packages, and endpoints. -10. Add group interdependence and backpressure features. +10. Add healthcheck endpoint to Teleport agents and incorporate into rollback logic. +10. Add progress and completion checking. +10. Add canary functionality. +10. Add backpressure functionality if necessary. +11. Add DB backups if necessary. From 2b95f8e8f1562a69826a05f203dba2f9392fe730 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 25 Sep 2024 14:23:55 -0700 Subject: [PATCH 72/84] canary 2 --- rfd/0169-auto-updates-linux-agents.md | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 89527542fe8df..999c39053cb8c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -65,15 +65,15 @@ kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed - # agent updates are in place. Setting this to pause will halt the rollout. 
+ # agent updates are in place. Setting this to pause will temporarily halt the rollout. agent_autoupdate: disable|enable|pause # agent_schedules specifies version rollout schedules for agents. # The schedule used is determined by the schedule associated - # with the version in the rollout_plan resource. - # For now, only the "regular" strategy is configurable. + # with the version in the autoupdate_agent_plan resource. + # For now, only the "regular" schedule is configurable. agent_schedules: - # rollout strategy must be "regular" for now + # rollout schedule must be "regular" for now regular: # name of the group. Must only contain valid backend / resource name characters. - name: staging @@ -93,7 +93,7 @@ spec: # canary_count specifies the desired number of canaries to update before any other agents # are updated. # default: 5 - canaries: 0-10 + canary_count: 0-10 # max_in_flight specifies the maximum number of agents that may be updated at the same time. # Only valid for the backpressure strategy. # default: 20% @@ -117,6 +117,7 @@ spec: days: ["*"] start_hour: 0 jitter_seconds: 5 + canary_count: 5 max_in_flight: 20% alert_after: 4h ``` @@ -143,9 +144,9 @@ Note that the `default` schedule applies to agents that do not specify a group n # configuration $ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging-group --set-start-hour=3 +$ tctl autoupdate update --group staging --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging-group --set-jitter-seconds=60 +$ tctl autoupdate update --group staging --set-jitter-seconds=60 Automatic updates configuration has been updated. $ tctl autoupdate update --group default --set-jitter-seconds=60 Automatic updates configuration has been updated. @@ -159,11 +160,11 @@ Version: v1.2.4 Schedule: regular Groups: -staging-group: succeeded at 2024-01-03 23:43:22 UTC -prod-group: scheduled for 2024-01-03 23:43:22 UTC (depends on prod-group) -other-group: failed at 2024-01-05 22:53:22 UTC +staging: succeeded at 2024-01-03 23:43:22 UTC +prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod) +other: failed at 2024-01-05 22:53:22 UTC -$ tctl autoupdate status --group staging-group +$ tctl autoupdate status --group staging Status: succeeded Date: 2024-01-03 23:43:22 UTC Requires: (none) @@ -174,8 +175,8 @@ Failed: 15 (3%) Timed-out: 0 # re-running failed group -$ tctl autoupdate run --group staging-group -Executing auto-update for group 'staging-group' immediately. +$ tctl autoupdate run --group staging +Executing auto-update for group 'staging' immediately. ``` Notes: From ce6de479a3a1a13b7a66782287671a231bf12ba4 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 25 Sep 2024 18:19:44 -0400 Subject: [PATCH 73/84] describe state, transitions, and proxy response --- rfd/0169-auto-updates-linux-agents.md | 91 +++++++++++++++++++++++++-- 1 file changed, 87 insertions(+), 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 999c39053cb8c..80fb19bd6a50a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -66,7 +66,7 @@ spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed # agent updates are in place. Setting this to pause will temporarily halt the rollout. 
- agent_autoupdate: disable|enable|pause + agent_autoupdate_mode: disable|enable|pause # agent_schedules specifies version rollout schedules for agents. # The schedule used is determined by the schedule associated @@ -110,7 +110,7 @@ Default resource: ```yaml kind: autoupdate_config spec: - agent_autoupdate: enable + agent_autoupdate_mode: enable agent_schedules: regular: - name: default @@ -198,7 +198,7 @@ spec: strategy: backpressure|grouped # paused specifies whether the rollout is paused # default: enabled - autoupdate: enabled|disabled|paused + autoupdate_mode: enabled|disabled|paused status: groups: # name of group @@ -242,7 +242,7 @@ Whether the Teleport updater querying the endpoint is instructed to upgrade (via - The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to agent heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. +Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: ```shell @@ -260,6 +260,89 @@ These agents must all update successfully for the rollout to proceed to the rema Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. +### Group states + +Let `v1` be the current version and `v2` the target version. + +A group can be in 5 state: +- unstarted: the group update has not been started yet. +- canary: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update and keep their existing version. +- active: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update to `v2`. +- done: the group has been updated. New agents should run `v2`. +- rolledback: the group has been rolledback. New agents should run `v1`, existing agents should update to `v1`. + +The finite state machine is the following: +```mermaid +flowchart TD + unstarted((unstarted)) + canary((canary)) + active((active)) + done((done)) + rolledback((rolledback)) + + unstarted -->|StartGroup
MaintenanceTriggerOK| canary
    canary -->|canaries came back alive| active
    canary -->|ForceGroup| done
    canary -->|RollbackGroup| rolledback
    active -->|ForceGroup
Success criteria met| done
    done -->|RollbackGroup| rolledback
    active -->|RollbackGroup| rolledback

    canary -->|ResetGroup| canary
    active -->|ResetGroup| active
```

### Agent auto update modes

The agent auto update mode is specified by both Cloud (via `autoupdate_agent_plan`)
and by the customer (via `autoupdate_config`).

The agent update mode can take three values:

1. disabled: Teleport should not manage agent updates
2. paused: updates are temporarily suspended, and the existing rollout state is honored
3. enabled: Teleport can update agents

The cluster agent rollout mode is computed by taking the lowest value.
For example:

- cloud says `enabled` and the customer says `enabled` -> the updates are `enabled`
- cloud says `enabled` and the customer says `paused` -> the updates are `paused`
- cloud says `disabled` and the customer says `paused` -> the updates are `disabled`
- cloud says `disabled` and the customer says `enabled` -> the updates are `disabled`

### Proxy answer

The proxy response contains two parts related to automatic updates:
- the target version for the requested group
- whether the agent should update

#### Rollout status: disabled

| Group state | Version | Should update |
|-------------|---------|---------------|
| *           | v2      | false         |

#### Rollout status: paused

| Group state | Version | Should update |
|-------------|---------|---------------|
| unstarted   | v1      | false         |
| canary      | v1      | false         |
| active      | v2      | false         |
| done        | v2      | false         |
| rolledback  | v1      | false         |

#### Rollout status: enabled

| Group state | Version | Should update              |
|-------------|---------|----------------------------|
| unstarted   | v1      | false                      |
| canary      | v1      | false, except for canaries |
| active      | v2      | true if UUID <= progress   |
| done        | v2      | true                       |
| rolledback  | v1      | true                       |

From 26c43b0da364c0849782eb78b478291ff2d40e41 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Wed, 25 Sep 2024 15:29:28 -0700
Subject: [PATCH 74/84] rpcs

---
 rfd/0169-auto-updates-linux-agents.md | 66 +++++++++++++++++++--------
 1 file changed, 48 insertions(+), 18 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 80fb19bd6a50a..e83670dbea421 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -131,7 +131,7 @@ The update proceeds from the first group to the last group, ensuring that each g

 The updater will receive `agent_autoupdate: true` from the time it is designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes.
 Changing the `target_version` resets the schedule immediately, clearing all progress.

-Changing the `current_version` in `autoupdate_agent_plan` changes the advertised `current_version` for all unfinished groups.
+Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups.

 Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change.
 However, any changes to `agent_schedules` that occur while a group is active will be rejected.
@@ -187,8 +187,8 @@ Notes:
 ```yaml
 kind: autoupdate_agent_plan
 spec:
-  # current_version is the desired version for agents before their window.
- current_version: A.B.C + # start_version is the desired version for agents before their window. + start_version: A.B.C # target_version is the desired version for agents after their window. target_version: X.Y.Z # schedule to use for the rollout @@ -349,7 +349,7 @@ Instance heartbeats will be extended to incorporate and send data that is writte The following data related to the rollout are stored in each instance heartbeat: - `agent_update_start_time`: timestamp of individual agent's upgrade time -- `agent_update_current_version`: current agent version +- `agent_update_start_version`: current agent version - `agent_update_rollback`: whether the agent was rolled-back automatically - `agent_update_uuid`: Auto-update UUID - `agent_update_group`: Auto-update group name @@ -751,7 +751,7 @@ are signed. The Update Framework (TUF) will be used to implement secure updates in the future. -Anyone who possesses a host UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint. +Anyone who possesses a updater UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint. It is not possible to discover the current version of that host, only the designated update window. ## Logging @@ -834,8 +834,8 @@ message AutoUpdateConfig { // AutoUpdateConfigSpec is the spec for the autoupdate config. message AutoUpdateConfigSpec { - // agent_autoupdate specifies whether agent autoupdates are enabled. - bool agent_autoupdate = 1; + // agent_autoupdate_mode specifies whether agent autoupdates are enabled, disabled, or paused. + Mode agent_autoupdate_mode = 1; // agent_schedules specifies schedules for updates of grouped agents. AgentAutoUpdateSchedules agent_schedules = 3; } @@ -854,14 +854,16 @@ message AgentAutoUpdateGroup { repeated Day days = 2; // start_hour to initiate update int32 start_hour = 3; - // jitter_seconds to introduce before update as rand([0, jitter_seconds]). - int32 jitter_seconds = 4; - // timeout_seconds before an agent is considered time-out (no version change) - int32 timeout_seconds = 5; - // failure_seconds before an agent is considered failed (loses connection) - int32 failure_seconds = 6; + // wait_days after last group succeeds before this group can run + int32 wait_days = 4; + // jitter_seconds to introduce before update as rand([0, jitter_seconds]) + int32 jitter_seconds = 5; + // canary_count of agents to use in the canary deployment. + int32 canary_count = 6; + // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. + int32 alert_after_hours = 7; // max_in_flight specifies agents that can be updated at the same time, by percent. - string max_in_flight = 7; + string max_in_flight = 8; } // Day of the week @@ -877,6 +879,18 @@ enum Day { DAY_SATURDAY = 8; } +// Mode of operation +enum Mode { + // UNSPECIFIED update mode + MODE_UNSPECIFIED = 0; + // DISABLE updates + MODE_DISABLE = 1; + // ENABLE updates + MODE_ENABLE = 2; + // PAUSE updates + MODE_PAUSE = 3; +} + // GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource. message GetAutoUpdateAgentPlanRequest {} @@ -916,10 +930,16 @@ message AutoUpdateAgentPlan { // AutoUpdateAgentPlanSpec is the spec for the autoupdate version. message AutoUpdateAgentPlanSpec { - // agent_version is the desired agent version for new rollouts. - string agent_version = 1; - // agent_version schedule is the schedule to use for rolling out the agent_version. 
- Schedule agent_version_schedule = 2; + // start_version is the version to update from. + string start_version = 1; + // target_version is the version to update to. + string target_version = 2; + // schedule to use for the rollout + Schedule schedule = 3; + // strategy to use for the rollout + Strategy strategy = 4; + // autoupdate_mode to use for the rollout + Mode autoupdate_mode = 5; } // Schedule type for the rollout @@ -932,6 +952,16 @@ enum Schedule { SCHEDULE_IMMEDIATE = 2; } +// Strategy type for the rollout +enum Strategy { + // UNSPECIFIED update strategy + STRATEGY_UNSPECIFIED = 0; + // GROUPED update schedule, with no backpressure + STRATEGY_GROUPED = 1; + // BACKPRESSURE update schedule + STRATEGY_BACKPRESSURE = 2; +} + // AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. message AutoUpdateAgentPlanStatus { // version targetted by the rollout From 7d0f618565612c02a356b255274fe0fa3fe75c96 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 25 Sep 2024 15:50:26 -0700 Subject: [PATCH 75/84] finish rpcs --- rfd/0169-auto-updates-linux-agents.md | 63 +++++++++++++++++++++------ 1 file changed, 50 insertions(+), 13 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e83670dbea421..cf3abf8b56a13 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -855,13 +855,13 @@ message AgentAutoUpdateGroup { // start_hour to initiate update int32 start_hour = 3; // wait_days after last group succeeds before this group can run - int32 wait_days = 4; + int64 wait_days = 4; + // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. + int64 alert_after_hours = 5; // jitter_seconds to introduce before update as rand([0, jitter_seconds]) - int32 jitter_seconds = 5; + int64 jitter_seconds = 6; // canary_count of agents to use in the canary deployment. - int32 canary_count = 6; - // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. - int32 alert_after_hours = 7; + int64 canary_count = 7; // max_in_flight specifies agents that can be updated at the same time, by percent. string max_in_flight = 8; } @@ -964,17 +964,54 @@ enum Strategy { // AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. message AutoUpdateAgentPlanStatus { - // version targetted by the rollout - string version = 2; + // name of the group + string name = 0; // start_time of the rollout google.protobuf.Timestamp start_time = 1; - // last_active_host_index specifies the index of the last host that may be updated. - int64 last_active_host_index = 1; - // failed_host_count specifies the number of failed hosts. - int64 failed_host_count = 2; - // timeout_host_count specifies the number of timed-out hosts. - int64 timeout_host_count = 3; + // initial_count is the number of connected agents at the start of the window. + int64 initial_count = 2; + // present_count is the current number of connected agents. + int64 present_count = 3; + // failed_count specifies the number of failed agents. + int64 failed_count = 4; + // canaries is a list of canary agents. + repeated Canary canaries = 5; + // progress is the current progress through the rollout. + float64 progress = 6; + // state is the current state of the rollout. + State state = 7; + // last_update_time is the time of the previous update for this group. 
+ google.protobuf.Timestamp last_update_time = 8; + // last_update_reason is the trigger for the last update + string last_update_reason = 9; +} + +// Canary agent +message Canary { + // update_uuid of the canary agent + string update_uuid = 0; + // host_uuid of the canary agent + string host_uuid = 1; + // hostname of the canary agent + string hostname = 2; +} + +// State of the rollout +enum State { + // UNSPECIFIED state + STATE_UNSPECIFIED = 0; + // UNSTARTED state + STATE_UNSTARTED = 1; + // CANARY state + STATE_CANARY = 2; + // ACTIVE state + STATE_ACTIVE = 3; + // DONE state + STATE_DONE = 4; + // ROLLEDBACK state + STATE_ROLLEDBACK = 5; } + ``` ## Alternatives From 69d758ca36fab37fb585d0db4684a26b53782244 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 27 Sep 2024 07:42:16 -0700 Subject: [PATCH 76/84] minor tweaks --- rfd/0169-auto-updates-linux-agents.md | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index cf3abf8b56a13..f372b7759f9c6 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -357,14 +357,12 @@ The following data related to the rollout are stored in each instance heartbeat: Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`). Every minute, auth servers persist the version counts: -- `version_counts[group][version]` +- `agent_data[group].stats[version]` - `count`: number of currently connected agents at `version` in `group` - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group` - -At the start of each group's window, auth servers persist an initial count: -- `initial_counts[group]` - - `count`: number of connected agents in `group` at start of window + - `count`: number of connected agents at `version` in `group` at start of window +- `agent_data[group]` - `canaries`: list of updater UUIDs to use for canary deployments Expiration time of the persisted key is 1 hour. @@ -379,14 +377,19 @@ This prevents double-counting agents when auth servers are killed. #### Progress Formulas -Each auth server will calculate the progress as `( max_in_flight * initial_counts[group].count + version_counts[group][target_version].count ) / initial_counts[group].count` and write the progress to `autoupdate_agent_plan` status. +Given: +``` +initial_count[group] = sum(agent_data[group].stats[*]).count +``` + +Each auth server will calculate the progress as `( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a `max_in_flight` percentage-window above the number of currently updated agents in the group. -However, if `as_numeral(version_counts[group][not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead. +However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead. 
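+For illustration, here is a hypothetical walk-through of these formulas with invented numbers:
+
+```
+initial_count[group]                = 100
+max_in_flight                       = 20%
+agent_data[group].stats[v2].count   = 15     (agents already on target_version)
+
+progress = (0.20 * 100 + 15) / 100  = 0.35
+
+If the lowest non-updated UUID sits at 0.40 of the UUID space (above 0.35),
+the advertised progress becomes 0.40 so that agent is allowed to update.
+```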
This protects against a statistical deadlock, where no UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to update. To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction must be imposed for the rollout to proceed: -`version_counts[group][*].count > initial_counts[group].count - max_in_flight * initial_counts[group].count` +`agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]` To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one minute will be considered in these formulas. @@ -977,7 +980,7 @@ message AutoUpdateAgentPlanStatus { // canaries is a list of canary agents. repeated Canary canaries = 5; // progress is the current progress through the rollout. - float64 progress = 6; + float progress = 6; // state is the current state of the rollout. State state = 7; // last_update_time is the time of the previous update for this group. @@ -994,6 +997,8 @@ message Canary { string host_uuid = 1; // hostname of the canary agent string hostname = 2; + // success state of the canary agent + bool success = 3; } // State of the rollout From 430b7a4a193d2678008f704382e7e7ad4718d8e5 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Mon, 30 Sep 2024 09:28:18 -0400 Subject: [PATCH 77/84] Add user stories --- rfd/0169-auto-updates-linux-agents.md | 342 +++++++++++++++++++++++++- 1 file changed, 341 insertions(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index f372b7759f9c6..c22abdf66fa35 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -54,7 +54,347 @@ We must provide a seamless, hands-off experience for auto-updates of Teleport Ag ## UX -[Hugo to add] +### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version + +
+Before + +```yaml +kind: autoupdate_agent_plan +spec: + current_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + autoupdate_mode: enabled +status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 + present_count: 103 + failed_count: 2 + progress: 1 + state: active + last_update_time: 2020-12-09T16:09:53+00:00 + last_update_reason: success + - name: staging + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-09T16:09:53+00:00 + last_update_reason: newAgentPlan +``` +
+ +I run +```bash +tctl autoupdate agent new-rollout v3 +# created new rollout from v2 to v3 +``` + +
+After + +```yaml +kind: autoupdate_agent_plan +spec: + current_version: v2 + target_version: v3 + schedule: regular + strategy: grouped + autoupdate_mode: enabled +status: + groups: + - name: dev + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan + - name: staging + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan +``` +
+ +Now, new agents will install v2 by default, and v3 after the maintenance. + +> [!NOTE] +> If the previous maintenance was not finished, I will install v2 on new prod agents while the rest of prod is still running v1. +> This is expected as we don't want to keep track of an infinite number of versions. +> +> If this is an issue I can create a v1 -> v3 rollout instead. +> +> ```bash +> tctl autoupdate agent new-rollout v3 --current-version v1 +> # created new update plan from v2 to v3 +> ``` + +### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99% + +#### Failure mode 1: the new version crashes + +I create a new deployment, with a broken version. The version is deployed to the canaries. +The canaries crash, the updater reverts the update, the agents connect back online and +advertise they rolled-back. The maintenance is stuck until the canaries are running the target version. + +
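+For illustration only, a minimal Go sketch of this revert-on-crash behavior (hypothetical helper
+names, paths, and versions; not the actual `teleport-update` implementation):
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"time"
+)
+
+// switchVersion points the teleport symlink at the requested version
+// directory under /var/lib/teleport/versions.
+func switchVersion(version string) error {
+	target := fmt.Sprintf("/var/lib/teleport/versions/%s/bin/teleport", version)
+	link := "/usr/local/bin/teleport"
+	_ = os.Remove(link) // ignore "does not exist" errors; Symlink fails if the link remains
+	return os.Symlink(target, link)
+}
+
+// healthy waits a grace period, then asks systemd whether the unit is active.
+func healthy() bool {
+	time.Sleep(30 * time.Second) // grace period, arbitrary for this sketch
+	return exec.Command("systemctl", "is-active", "--quiet", "teleport").Run() == nil
+}
+
+// update switches to newVersion and reverts to oldVersion if the agent crashes.
+func update(oldVersion, newVersion string) error {
+	if err := switchVersion(newVersion); err != nil {
+		return err
+	}
+	if exec.Command("systemctl", "restart", "teleport").Run() == nil && healthy() {
+		return nil // update succeeded
+	}
+	// Revert; the agent advertises the rollback once it reconnects.
+	if err := switchVersion(oldVersion); err != nil {
+		return err
+	}
+	return exec.Command("systemctl", "restart", "teleport").Run()
+}
+
+func main() {
+	if err := update("15.4.2", "16.0.0"); err != nil {
+		fmt.Fprintln(os.Stderr, "update failed:", err)
+		os.Exit(1)
+	}
+}
+```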
+Autoupdate agent plan + +```yaml +kind: autoupdate_agent_plan +spec: + current_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + autoupdate_mode: enabled +status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 + present_count: 100 + failed_count: 0 + progress: 0 + state: canaries + canaries: + - updater_id: abc + host_id: def + hostname: foo.example.com + success: false + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: canaryTesting + - name: staging + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan +``` +
+
+I and the customer get an alert if the canary testing has not succeeded after an hour.
+Teleport cloud operators and the user can access the canary hostname and hostid
+to troubleshoot the failed canaries.
+
+Once the canaries are healthy again, the rollout resumes.
+
+#### Failure mode 1 bis: the new version crashes, but not on the canaries
+
+This scenario is the same as the previous one, but the Teleport agent bug only manifests on select agents.
+For example: [the agent fails to read cloud-provider specific metadata and crashes](TODO add link).
+
+The canaries might not select one of the affected agents and allow the update to proceed.
+All agents are updated, and all agents hosted on the affected cloud provider crash.
+The updaters of the affected agents will attempt to self-heal by reverting to the previous version.
+
+Once the previous Teleport version is running, the agent will advertise that its update failed and that it had to roll back.
+If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+groups from the faulty updates.
+
+#### Failure mode 2: the new version crashes, and the old version cannot start
+
+I create a new deployment, with a broken version. The version is deployed to the canaries.
+The canaries attempt the update, and the new Teleport instance crashes.
+The updater fails to self-heal as the old version does not start anymore.
+
+This is typically caused by external sources like a full disk, faulty networking, or resource exhaustion.
+This can also be caused by the Teleport control plane not being available.
+
+The group update is stuck until the canary comes back online and runs the latest version.
+
+The customer and Teleport cloud receive an alert. The customer and Teleport cloud can retrieve the
+hostid and hostname of the faulty canaries. With this information they can go troubleshoot the failed agents.
+
+#### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries
+
+This scenario is the same as the previous one, but the Teleport agent bug only manifests on select agents.
+For example: a clock drift blocks agents from re-connecting to Teleport.
+
+The canaries might not select one of the affected agents and allow the update to proceed.
+All agents are updated, and the affected agents fail.
+The updater fails to self-heal as the old version does not start anymore.
+
+If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+groups from the faulty updates.
+
+In this case, it's hard to identify which agents dropped.
+
+#### Failure mode 3: shadow failure
+
+Teleport cloud deploys a new version. Agents from the first group get updated.
+The agents are seemingly running properly, but some functions are impaired.
+For example, host user creation is failing.
+
+Some user tries to access a resource served by the agent, it fails, and the user
+notices the disruption.
+
+The customer can observe the agent update status and see that a recent update
+might have caused this:
+
+```shell
+tctl auto-update agent status
+# Rollout plan created the YYYY-MM-DD
+# Previous version: v2
+# New version: v3
+# Status: enabled
+#
+# Group Name  Status             Update Start Time  Connected Agents  Up-to-date agents  failed updates
+# ----------  -----------------  -----------------  ----------------  -----------------  --------------
+# dev         complete           YYYY-MM-DD HHh     120               115                2
+# staging     in progress (53%)  YYYY-MM-D2 HHh     20                10                 0
+# prod        not started                           234               0                  0
+```
+
+Then, the customer or Teleport Cloud team can suspend the rollout:
+
+```shell
+tctl auto-update agent suspend
+# Automatic updates suspended
+# No existing agent will get updated. New agents might install the new version
+# depending on their group.
+```
+
+At this point, no new agent is updated to reduce the service disruption.
+The customer can investigate, and get help from Teleport's support via a support ticket.
+If the update is really the cause of the issue, the customer or Teleport cloud can perform a rollback:
+
+```shell
+tctl auto-update agent rollback
+# Rolledback groups: [dev, staging]
+# Warning: the automatic agent updates are suspended.
+# Agents will not rollback until you run:
+# $> tctl auto-update agent resume
+```
+
+> [!NOTE]
+> By default, all groups not in the "unstarted" state are rolledback.
+> It is also possible to rollback only specific groups.
+
+The new state looks like
+```shell
+tctl auto-update agent status
+# Rollout plan created the YYYY-MM-DD
+# Previous version: v2
+# New version: v3
+# Status: suspended
+#
+# Group Name  Status      Update Start Time  Connected Agents  Up-to-date agents  failed updates
+# ----------  ----------  -----------------  ----------------  -----------------  --------------
+# dev         rolledback  YYYY-MM-DD HHh     120               115                2
+# staging     rolledback  YYYY-MM-D2 HHh     20                10                 0
+# prod        not started                    234               0                  0
+```
+
+Finally, when the user is happy with the new plan, they can resume the updates.
+This will trigger the rollback.
+
+```shell
+tctl auto-update agent resume
+```
+
+### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version
+
+I connect to the node and look up its status:
+```shell
+teleport-updater status
+# Running version v16.2.5
+# Automatic updates enabled.
+# Proxy: example.teleport.sh
+# Group: staging
+```
+
+I try to set a specific version:
+```shell
+teleport-updater use-version v16.2.3
+# Error: the instance is enrolled into automatic updates.
+# You must specify --disable-automatic-updates to opt this agent out of automatic updates and manually control the version.
+```
+
+I acknowledge that I am leaving automatic updates:
+```shell
+teleport-updater use-version v16.2.3 --disable-automatic-updates
+# Disabling automatic updates for the node. You can enable them back by running `teleport-updater enable`
+# Downloading version 16.2.3
+# Restarting teleport
+# Cleaning up old binaries
+```
+
+When the issue is fixed, I can enroll back into automatic updates:
+
+```shell
+teleport-updater enable
+# Enabling automatic updates
+# Proxy: example.teleport.sh
+# Group: staging
+```
+
+### As a Teleport user I want to fast-track a group update
+
+I have a new rollout, completely unstarted, and my current maintenance schedule updates over several days.
+However, the new version contains something that I need as soon as possible (e.g. 
a fix for a bug that affects me). + +
+Before: + +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 0 0 +# staging not started 20 0 0 +# prod not started 234 0 0 +``` +
+ +I can trigger the dev group immediately using the command: + +```shell +tctl auto-update agent trigger-group dev +# Dev group update triggered +``` + +[TODO: how to deal with the canary vs active vs done states?] + +
+After: + +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 0 0 +# staging not started 20 0 0 +# prod not started 234 0 0 +``` +
### Teleport Resources From 4ac0e9ce1ad8b829dd4c91c3cf60209415eba374 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Mon, 30 Sep 2024 11:15:26 -0400 Subject: [PATCH 78/84] Put new requirements at the top + edit UX + add TODOs --- rfd/0169-auto-updates-linux-agents.md | 237 +++++++++++++++++--------- 1 file changed, 152 insertions(+), 85 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index c22abdf66fa35..e83eece92b5d3 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -33,26 +33,72 @@ Additionally, this RFD parallels the auto-update functionality for client tools ## Why -The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users. - -1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades. -2. The use of system package management requires logic that varies significantly by target distribution. -3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. -4. The use of bash to implement the updater makes long-term maintenance difficult. -5. The existing auto-updater has limited automated testing. -6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. -7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). -8. The rollout plan for new agent versions is not fully-configurable using tctl. -9. Agent installation logic is spread between the auto-updater script, install script, auto-discovery script, and documentation. -10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. -11. The existing auto-updater is not self-updating. -12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). -13. There is no phased rollout mechanism for updates. -14. There is no way to automatically detect and halt failed updates. - -We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain and safer for production use. - -## UX +1. We want customers always running the latest release of Teleport to always be secure, have access to the latest + features, and not deal with the pain of updating the agents. +2. Reduce Teleport Cloud operational costs of contacting customers with old agents. + Make updating easier for self-hosted customers so we don't have to provide support for older Teleport versions. +3. Increase reliability to 99.99%. + +The current systemd updater does not meet those requirements: +- its use of package managers leads users to accidentally upgrade Teleport +- the installation process is complex and users end up installing the wrong version of Teleport +- the current update process does not provide safeties to protect against broken updates +- many customers are not adopting the existing updater because they want to control when updates happen +- we don't offer a ni + +## Product requirements + +1. Phased rollout for our tenants. We should be able to control the agent version per-tenant. + +2. Bucketed rollout that tenants have control over. + - Control the bucket update day + - Control the bucket update hour + - Ability to pause a rollout + +3. Customers should be able to run "apt-get update" without updating Teleport. 
+ + Installation from a package manager should be possible, but the version should be controlled by Teleport. + +4. Self-managed updates should be a first class citizen. Teleport must advertise the desired agent and client version. + +5. Self-hosted customers should be supported, for example, customers whose their own internal customer is running a Teleport agent. + +6. Upgrading a leaf cluster is out of scope. + +7. Rolling back after a broken update should be supported. Roll forward get's you 99.9%, we need rollback for 99.99%. + +8. We should have high quality metrics that report the version they are running and if they are running automatic + updates. For users and us. + +9. Best effort should be made so automatic updates should be applied in a way that sessions are not terminated. (Currently only supported for SSH) + +10. All backends should be supported. + +11. Teleport Discover installation (curl one-liner) should be supported. + +12. We need to support repo mirrors. + +13. I should be able to install Teleport via whatever mechanism I want to. + +14. If new nodes join a bucket outside the upgrade window and you are within your compat. window, wait until your next group update start. + If you are not within your compat. window attempt to upgrade right away. + +15. If an agent comes back online after some period of time and is still compat. with + control plane, it wait until the next upgrade window when it will be upgraded. + +16. Regular cloud tenant update schedule should run in les than a week. + Select tenants might support longer schedules. + +17. A cloud customer should be able to pause, resume, and rollback and existing rollout schedule. + A cloud customer should not be able to create new rollout schedules. + + Teleport can create as many rollout schedules as it wants. + +18. A user on the host, should be able to turn autoupdate off or select a version for that particular host. + +19. Operating system packages should be supported. + +## User Stories ### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version @@ -288,7 +334,9 @@ tctl auto-update agent rollback > By default, all groups not in the "unstarted" state are rolledback. > It is also possible to rollback only specific groups. -The new state looks like +
+After: + ```shell tctl auto-update agent status # Rollout plan created the YYYY-MM-DD @@ -302,6 +350,7 @@ tctl auto-update agent status # staging rolledback YYYY-MM-D2 HHh 20 10 0 # prod not started 234 0 0 ``` +
Finally, when the user is happy with the new plan, they can resume the updates. This will trigger the rollback. @@ -396,9 +445,16 @@ tctl auto-update agent status ``` -### Teleport Resources -#### Scheduling +## Teleport Resources + +### Scheduling + +This resource is owned by the Teleport cluster user. +This is how Teleport customers can specify their automatic update preferences such as: +- if automatic updates are enabled, disabled, or temporarily suspended +- in which order their agents should be updated (`dev` before `staging` before `prod`) +- when should the updates start ```yaml kind: autoupdate_config @@ -419,6 +475,7 @@ spec: - name: staging # days specifies the days of the week when the group may be updated. # default: ["*"] (all days) + # TODO: explicit the supported values based on the customer QoS days: [ “Sun”, “Mon”, ... | "*" ] # start_hour specifies the hour when the group may start upgrading. # default: 0 @@ -426,6 +483,7 @@ spec: # wait_days specifies how many days to wait after the previous group finished before starting. # default: 0 wait_days: 0-1 + # TODO: is this needed? In which case a customer would need to set a custom jitter? # jitter_seconds specifies a maximum jitter duration after the start hour. # The agent updater client will pick a random time within this duration to wait to update. # default: 5 @@ -438,7 +496,7 @@ spec: # Only valid for the backpressure strategy. # default: 20% max_in_flight: 10-100% - # alert_after specifies the duration after which a cluster alert will be set if the rollout has + # alert_after specifies the duration after which a cluster alert will be set if the group update has # not completed. # default: 4 alert_after_hours: 1-8 @@ -454,7 +512,7 @@ spec: agent_schedules: regular: - name: default - days: ["*"] + days: ["*"] # TODO: restrict to work week? Minus Friday? start_hour: 0 jitter_seconds: 5 canary_count: 5 @@ -462,15 +520,14 @@ spec: alert_after: 4h ``` -Dependency cycles are rejected. -Dependency chains longer than a week will be rejected. -Otherwise, updates could take up to 7 weeks to propagate. The update proceeds from the first group to the last group, ensuring that each group successfully updates before allowing the next group to proceed. +By default, only 5 agent groups are allowed, this mitigates very long rollout plans. The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. Changing the `target_version` resets the schedule immediately, clearing all progress. +[TODO: What is the use-case for this? can we do like with target_version and reset all instead of trying to merge the state] Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups. Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. @@ -479,9 +536,12 @@ However, any changes to `agent_schedules` that occur while a group is active wil Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. Note that the `default` schedule applies to agents that do not specify a group name. +[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] ```shell # configuration +# TODO: "tctl autoudpate update" is bad UX, especially as this doen't even trigger agent update but updates the AU resource. 
+# We should chose a user-friendly signature $ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. $ tctl autoupdate update --group staging --set-start-hour=3 @@ -520,9 +580,22 @@ Executing auto-update for group 'staging' immediately. ``` Notes: -- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_agent_plan`, while maintaining control over the rollout. +- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating + `autoupdate_agent_plan`, while maintaining control over the rollout. + +### Rollout -#### Rollout +The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator. +In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local +admin socket (tctl on local machine). + +> [!NOTE] +> This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport. +> However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be +> granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations. +> This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl. +> +> Solving this problem is out of the scope of this RFD. ```yaml kind: autoupdate_agent_plan @@ -570,42 +643,12 @@ $ tctl autoupdate update --set-agent-version=15.1.2 --critical Automatic updates configuration has been updated. ``` -## Details - Teleport API - -Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. - -Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: -- The `host=[uuid]` parameter sent to `/v1/webapi/find` -- The `group=[name]` parameter sent to `/v1/webapi/find` -- The schedule defined in the new `autoupdate_config` resource -- The status of past agent upgrades for the given version - -To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. - -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: -```shell -$ teleport-update enable --proxy teleport.example.com --group staging -``` - -At the start of a group rollout, the Teleport auth servers record the initial number connected agents. -The number of updated and non-updated agents is tracked by the auth servers. - -If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. -Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. - -If canaries are enabled, a user-specified number of agents are updated first. -These agents must all update successfully for the rollout to proceed to the remaining agents. 
- -Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. - ### Group states Let `v1` be the current version and `v2` the target version. A group can be in 5 state: -- unstarted: the group update has not been started yet. +- unstarted: the group update has not been started yet. - canary: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update and keep their existing version. - active: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update to `v2`. - done: the group has been updated. New agents should run `v2`. @@ -651,37 +694,35 @@ For example: - cloud says `disabled` and the customer says `suspended` -> the updates are `disabled` - cloud says `disabled` and the customer says `enabled` -> the updates are `disabled` -### Proxy answer +## Details - Teleport API -The proxy response contains two parts related to automatic updates: -- the target version of the requested group -- if the agent should be updated +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. -#### Rollout status: disabled +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` +- The schedule defined in the new `autoupdate_config` resource +- The status of past agent upgrades for the given version -| Group state | Version | Should update | -|-------------|---------|---------------| -| * | v2 | false | +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. +Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -#### Rollout status: paused +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: +```shell +$ teleport-update enable --proxy teleport.example.com --group staging +``` -| Group state | Version | Should update | -|-------------|---------|---------------| -| unstarted | v1 | false | -| canary | v1 | false | -| active | v2 | false | -| done | v2 | false | -| rolledback | v1 | false | +At the start of a group rollout, the Teleport auth servers record the initial number connected agents. +The number of updated and non-updated agents is tracked by the auth servers. -#### Rollout status: enabled +If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. +Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. -| Group state | Version | Should update | -|-------------|---------|----------------------------| -| unstarted | v1 | false | -| canary | v1 | false, except for canaries | -| active | v2 | true if UUID <= progress | -| done | v2 | true | -| rolledback | v1 | true | +If canaries are enabled, a user-specified number of agents are updated first. 
+These agents must all update successfully for the rollout to proceed to the remaining agents. + +Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. ### Rollout @@ -739,6 +780,32 @@ When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name] The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress percentage for the `group`: `as_numeral(host_uuid) / as_numeral(max_uuid) < progress` +##### Rollout status: disabled + +| Group state | Version | Should update | +|-------------|---------|---------------| +| * | v2 | false | + +##### Rollout status: paused + +| Group state | Version | Should update | +|-------------|---------|---------------| +| unstarted | v1 | false | +| canary | v1 | false | +| active | v2 | false | +| done | v2 | false | +| rolledback | v1 | false | + +##### Rollout status: enabled + +| Group state | Version | Should update | +|-------------|---------|----------------------------| +| unstarted | v1 | false | +| canary | v1 | false, except for canaries | +| active | v2 | true if UUID <= progress | +| done | v2 | true | +| rolledback | v1 | true | + ### REST Endpoints `/v1/webapi/find?host=[uuid]&group=[name]` From e87e3dc2e70ea5497adac241a098e55befa97151 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 2 Oct 2024 12:34:20 -0400 Subject: [PATCH 79/84] Edition work --- rfd/0169-auto-updates-linux-agents.md | 659 +++++++++++++++----------- 1 file changed, 385 insertions(+), 274 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e83eece92b5d3..0d5ce9d964b2a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -15,7 +15,7 @@ state: draft This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. -Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout speed. +Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout strategy. Agents will be updated by a new `teleport-update` binary, built from `tools/teleport-update` in the Teleport repository. @@ -44,9 +44,103 @@ The current systemd updater does not meet those requirements: - the installation process is complex and users end up installing the wrong version of Teleport - the current update process does not provide safeties to protect against broken updates - many customers are not adopting the existing updater because they want to control when updates happen -- we don't offer a ni - -## Product requirements +- we don't offer a nice user experience for self-hosted users, this ends up in a marginal automatic updates + adoption and does not reduce the cost of upgrading self-hosted clusters. + +## How + +The new agent automatic updates will rely on a separate updating binary controlling which Teleport version is +installed. 
The automatic updates will be implemented via incremental improvements over the existing mechanism: + +- Phase 1: introduce a new updater binary not relying on package managers +- Phase 2: introduce the concept of agent update groups and make the users chose in which order groups are updated +- Phase 3: add the ability for the agent updater to immediately revert a faulty update +- Phase 4: add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status +- Phase 5: add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated +- Phase 6: add the ability to perform slow and incremental version rollouts within an agent update group + +The updater will be usable after phase 1, and will gain new capabilities after each phase. +Future phases might change as we are working on the implementation and collecting real-world feedback and experience. + +### Resources + +We will introduce 2 user-facing resources: + +1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure: + - if automatic updates are enabled, disabled, or temporarily suspended + - in which order their agents should be updated (`dev` before `staging` before `prod`) + - when should the updates start + + The resource will look like: + ```yaml + kind: autoupdate_config + spec: + agent_autoupdate_mode: enable + agent_schedules: + regular: + - name: dev + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 + - name: prod + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + wait_days: 1 # update this group at least 1 day after the previous one + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 + ``` + +2. The `autoupdate_agent_plan` resource, its spec is owned by the Teleport cluster administrator (e.g. Teleport Cloud team). + Its status is owned by Teleport and contains the current rollout status. Some parts of the status can be changed via + select RPCs, for example fast-tracking a group update. + ```yaml + kind: autoupdate_agent_plan + spec: + current_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + autoupdate_mode: enabled + status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 # part of phase 4 + present_count: 100 # part of phase 4 + failed_count: 0 # part of phase 4 + progress: 0 + state: canaries + canaries: # part of phase 5 + - updater_id: abc + host_id: def + hostname: foo.example.com + success: false + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: canaryTesting + - name: prod + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan + ``` + +You can find more details about each resource field [in the dedicated resource section](#teleport-resources). + +## Details + +This section contains the proposed implementation details and is mainly relevant for Teleport developers and curious +users who want to know the motivations behind this specific design. + +### Product requirements + +Those are the requirements coming from engineering, product, and cloud teams: 1. Phased rollout for our tenants. We should be able to control the agent version per-tenant. 
@@ -84,7 +178,7 @@ The current systemd updater does not meet those requirements: If you are not within your compat. window attempt to upgrade right away. 15. If an agent comes back online after some period of time and is still compat. with - control plane, it wait until the next upgrade window when it will be upgraded. + control lane, it wait until the next upgrade window when it will be upgraded. 16. Regular cloud tenant update schedule should run in les than a week. Select tenants might support longer schedules. @@ -98,41 +192,25 @@ The current systemd updater does not meet those requirements: 19. Operating system packages should be supported. -## User Stories +### User Stories -### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version +#### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version
Before -```yaml -kind: autoupdate_agent_plan -spec: - current_version: v1 - target_version: v2 - schedule: regular - strategy: grouped - autoupdate_mode: enabled -status: - groups: - - name: dev - start_time: 2020-12-09T16:09:53+00:00 - initial_count: 100 - present_count: 103 - failed_count: 2 - progress: 1 - state: active - last_update_time: 2020-12-09T16:09:53+00:00 - last_update_reason: success - - name: staging - start_time: 0000-00-00 - initial_count: 0 - present_count: 0 - failed_count: 0 - progress: 0 - state: unstarted - last_update_time: 2020-12-09T16:09:53+00:00 - last_update_reason: newAgentPlan +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v1 +# New version: v2 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev complete YYYY-MM-DD HHh 120 115 2 +# staging complete YYYY-MM-D2 HHh 20 20 0 +# prod not started 234 0 0 ```
@@ -145,35 +223,20 @@ tctl autoupdate agent new-rollout v3
After -```yaml -kind: autoupdate_agent_plan -spec: - current_version: v2 - target_version: v3 - schedule: regular - strategy: grouped - autoupdate_mode: enabled -status: - groups: - - name: dev - start_time: 0000-00-00 - initial_count: 0 - present_count: 0 - failed_count: 0 - progress: 0 - state: unstarted - last_update_time: 2020-12-10T16:09:53+00:00 - last_update_reason: newAgentPlan - - name: staging - start_time: 0000-00-00 - initial_count: 0 - present_count: 0 - failed_count: 0 - progress: 0 - state: unstarted - last_update_time: 2020-12-10T16:09:53+00:00 - last_update_reason: newAgentPlan +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 115 2 +# staging not started 20 20 0 +# prod not started 234 0 0 ``` +
Now, new agents will install v2 by default, and v3 after the maintenance. @@ -186,12 +249,12 @@ Now, new agents will install v2 by default, and v3 after the maintenance. > > ```bash > tctl autoupdate agent new-rollout v3 --current-version v1 -> # created new update plan from v2 to v3 +> # created new update plan from v1 to v3 > ``` -### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99% +#### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99% -#### Failure mode 1: the new version crashes +##### Failure mode 1: the new version crashes I create a new deployment, with a broken version. The version is deployed to the canaries. The canaries crash, the updater reverts the update, the agents connect back online and @@ -242,7 +305,7 @@ to The rollout resumes. -#### Failure mode 1 bis: the new version crashes, but not on the canaries +##### Failure mode 1 bis: the new version crashes, but not on the canaries This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents. For example: [the agent fails to read cloud-provider specific metadata and crashes](TODO add link). @@ -255,7 +318,7 @@ Once the previous Teleport version is running, the agent will advertise its upda If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future groups from the faulty updates. -#### Failure mode 2: the new version crashes, and the old version cannot start +##### Failure mode 2: the new version crashes, and the old version cannot start I create a new deployment, with a broken version. The version is deployed to the canaries. The canaries attempt the update, and the new Teleport instance crashes. @@ -269,7 +332,7 @@ The group update is stuck until the canary comes back online and runs the latest The customer and Teleport cloud receive an alert. The customer and Teleport cloud can retrieve the hostid and hostname of the faulty canaries. With this information they can go troubleshoot the failed agents. -#### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries +##### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents. For example: a clock drift blocks agents from re-connecting to Teleport. @@ -283,7 +346,7 @@ groups from the faulty updates. In this case, it's hard to identify which agent dropped. -#### Failure mode 3: shadow failure +##### Failure mode 3: shadow failure Teleport cloud deploys a new version. Agents from the first group get updated. The agents are seemingly running properly, but some functions are impaired. @@ -359,7 +422,7 @@ This will trigger the rollback. 
tctl auto-update agent resume ``` -### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version +#### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version I connect to the node and lookup its status: ```shell @@ -395,7 +458,7 @@ teleport-updater enable # Group: staging ``` -### As a Teleport user I want to fast-track a group update +#### As a Teleport user I want to fast-track a group update I have a new rollout, completely unstarted, and my current maintenance schedule updates over seevral days. However, the new version contains something that I need as soon s possible (e.g. a fix for a bug that affects me). @@ -404,7 +467,7 @@ However, the new version contains something that I need as soon s possible (e.g. Before: ```shell -tctl auto-update agent status +tctl auto-updates agent status # Rollout plan created the YYYY-MM-DD # Previous version: v2 # New version: v3 @@ -421,11 +484,14 @@ tctl auto-update agent status I can trigger the dev group immediately using the command: ```shell -tctl auto-update agent trigger-group dev -# Dev group update triggered +tctl auto-updates agent start-update dev --no-canary +# Dev group update triggered (canary or active) ``` -[TODO: how to deal with the canary vs active vs done states?] +Alternatively +```shell +tctl auto-update agent force-done dev +```
After: @@ -445,16 +511,12 @@ tctl auto-update agent status ```
+### Teleport Resources -## Teleport Resources - -### Scheduling +#### Autoupdate Config This resource is owned by the Teleport cluster user. -This is how Teleport customers can specify their automatic update preferences such as: -- if automatic updates are enabled, disabled, or temporarily suspended -- in which order their agents should be updated (`dev` before `staging` before `prod`) -- when should the updates start +This is how Teleport customers can specify their automatic update preferences. ```yaml kind: autoupdate_config @@ -483,11 +545,6 @@ spec: # wait_days specifies how many days to wait after the previous group finished before starting. # default: 0 wait_days: 0-1 - # TODO: is this needed? In which case a customer would need to set a custom jitter? - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent updater client will pick a random time within this duration to wait to update. - # default: 5 - jitter_seconds: 0-60 # canary_count specifies the desired number of canaries to update before any other agents # are updated. # default: 5 @@ -512,7 +569,7 @@ spec: agent_schedules: regular: - name: default - days: ["*"] # TODO: restrict to work week? Minus Friday? + days: ["Mon", "Tue", "Wed", "Thu"] start_hour: 0 jitter_seconds: 5 canary_count: 5 @@ -520,70 +577,7 @@ spec: alert_after: 4h ``` - -The update proceeds from the first group to the last group, ensuring that each group successfully updates before allowing the next group to proceed. -By default, only 5 agent groups are allowed, this mitigates very long rollout plans. - -The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. -Changing the `target_version` resets the schedule immediately, clearing all progress. - -[TODO: What is the use-case for this? can we do like with target_version and reset all instead of trying to merge the state] -Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups. - -Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. -However, any changes to `agent_schedules` that occur while a group is active will be rejected. - -Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. - -Note that the `default` schedule applies to agents that do not specify a group name. -[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] - -```shell -# configuration -# TODO: "tctl autoudpate update" is bad UX, especially as this doen't even trigger agent update but updates the AU resource. -# We should chose a user-friendly signature -$ tctl autoupdate update --set-agent-auto-update=off -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-start-hour=3 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group default --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate reset -Automatic updates configuration has been reset to defaults. 
- -# status -$ tctl autoupdate status -Status: disabled -Version: v1.2.4 -Schedule: regular - -Groups: -staging: succeeded at 2024-01-03 23:43:22 UTC -prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod) -other: failed at 2024-01-05 22:53:22 UTC - -$ tctl autoupdate status --group staging -Status: succeeded -Date: 2024-01-03 23:43:22 UTC -Requires: (none) - -Updated: 230 (95%) -Unchanged: 10 (2%) -Failed: 15 (3%) -Timed-out: 0 - -# re-running failed group -$ tctl autoupdate run --group staging -Executing auto-update for group 'staging' immediately. -``` - -Notes: -- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating - `autoupdate_agent_plan`, while maintaining control over the rollout. - -### Rollout +#### Autoupdate agent plan The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator. In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local @@ -594,7 +588,7 @@ admin socket (tctl on local machine). > However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be > granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations. > This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl. -> +> > Solving this problem is out of the scope of this RFD. ```yaml @@ -636,49 +630,16 @@ status: last_update_reason: rollback ``` -```shell -$ tctl autoupdate update --set-agent-version=15.1.1 -Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-version=15.1.2 --critical -Automatic updates configuration has been updated. -``` - -### Group states +### Backend logic to progress the rollout -Let `v1` be the current version and `v2` the target version. +The update proceeds from the first group to the last group, ensuring that each group successfully updates before +allowing the next group to proceed. By default, only 5 agent groups are allowed, this mitigates very long rollout plans. -A group can be in 5 state: -- unstarted: the group update has not been started yet. -- canary: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update and keep their existing version. -- active: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update to `v2`. -- done: the group has been updated. New agents should run `v2`. -- rolledback: the group has been rolledback. New agents should run `v1`, existing agents should update to `v1`. - -The finite state machine is the following: -```mermaid -flowchart TD - unstarted((unstarted)) - canary((canary)) - active((active)) - done((done)) - rolledback((rolledback)) - - unstarted -->|StartGroup
MaintenanceTriggerOK| canary - canary -->|canary came back alive| active - canary -->|ForceGroup| done - canary -->|RollbackGroup| rolledback - active -->|ForceGroup
Success criteria met| done - done -->|RollbackGroup| rolledback - active -->|RollbackGroup| rolledback - - canary -->|ResetGroup| canary - active -->|ResetGroup| active -``` - -### Agent auto update modes +#### Agent update mode The agent auto update mode is specified by both Cloud (via `autoupdate_agent_plan`) -and by the customer (via `autoupdate_config`). +and by the customer (via `autoupdate_config`). The agent update mode control whether +the cluster in enrolled into automatic agent updates. The agent update mode can take 3 values: @@ -694,92 +655,225 @@ For example: - cloud says `disabled` and the customer says `suspended` -> the updates are `disabled` - cloud says `disabled` and the customer says `enabled` -> the updates are `disabled` -## Details - Teleport API +The Teleport cluster only progresses the rollout if the mode is `enabled`. -Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. +#### Group States -Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: -- The `host=[uuid]` parameter sent to `/v1/webapi/find` -- The `group=[name]` parameter sent to `/v1/webapi/find` -- The schedule defined in the new `autoupdate_config` resource -- The status of past agent upgrades for the given version +Let `v1` be the previous version and `v2` the target version. -To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. +A group can be in 5 states: +- `unstarted`: the group update has not been started yet. +- `canary`: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update + and keep their existing version. +- `active`: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update + to `v2`. +- `done`: the group has been updated. New agents should run `v2`. +- `rolledback`: the group has been rolledback. New agents should run `v1`, existing agents should update to `v1`. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: -```shell -$ teleport-update enable --proxy teleport.example.com --group staging -``` +The finite state machine is the following: -At the start of a group rollout, the Teleport auth servers record the initial number connected agents. -The number of updated and non-updated agents is tracked by the auth servers. +```mermaid +flowchart TD + unstarted((unstarted)) + canary((canary)) + active((active)) + done((done)) + rolledback((rolledback)) -If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. -Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. + unstarted -->|TriggerGroupRPC
Start conditions are met| canary + canary -->|Canary came back alive| active + canary -->|ForceGroupRPC| done + canary -->|RollbackGroupRPC| rolledback + active -->|ForceGroupRPC
Success criteria met| done + done -->|RollbackGroupRPC| rolledback + active -->|RollbackGroupRPC| rolledback + + canary -->|ResetGroupRPC| canary + active -->|ResetGroupRPC| active +``` -If canaries are enabled, a user-specified number of agents are updated first. -These agents must all update successfully for the rollout to proceed to the remaining agents. +#### Starting a group -Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. +A group can be started if the following criteria are met +- all of its previous group are in the `done` state +- it has been at least `wait_days` until the previous group update started +- the current week day is in the `days` list +- the current hours equals the `hour` field -### Rollout +When all hose criteria are met, the auth will transition the group into a new state. +If `canary_count` is not null, the group transitions to the `canary` state. +Else it transitions to the `active` state. -Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary. +In phase 4, at the start of a group rollout, the Teleport auth servers record the initial number connected agents. +The number of updated and non-updated agents is tracked by the auth servers. This will be used later to evaluate the +update success criteria. -The following data related to the rollout are stored in each instance heartbeat: -- `agent_update_start_time`: timestamp of individual agent's upgrade time -- `agent_update_start_version`: current agent version -- `agent_update_rollback`: whether the agent was rolled-back automatically -- `agent_update_uuid`: Auto-update UUID -- `agent_update_group`: Auto-update group name +#### Canary testing (phase 5) -Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`). +A group in `canary` state will get assigned canaries. +The proxies will instruct those canaries to update now. +During each reconciliation loop, the auth will lookup the instance healthcheck in the backend of the canaries. -Every minute, auth servers persist the version counts: -- `agent_data[group].stats[version]` - - `count`: number of currently connected agents at `version` in `group` - - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade - - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group` - - `count`: number of connected agents at `version` in `group` at start of window -- `agent_data[group]` - - `canaries`: list of updater UUIDs to use for canary deployments +Once all canaries have a healthcheck containing the new version (the healthcheck must not be older than 20 minutes), +they successfully came back online and the group can transition to the `active` state. -Expiration time of the persisted key is 1 hour. +If canaries never update, report rollback, or disappear, the group will stay stuck in `canary` state. +An alert will eventually fire, warning the user about the stuck update. -To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval. 
-- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written. -- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_plan` status, if not already written. -- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead. +#### Updating a group -If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. -This prevents double-counting agents when auth servers are killed. +A group in `active` mode is currently being updated. The conditions to leave te `active` mode and transition to the +`done` mode will vary based on the phase and rollout strategy. + +- Phase 2: we don't have any information about agents. The group transitions to `done` 60 minutes after its start. +- Phase 4: we know about the connected agent count and the connected agent versions. The group transitions to `done` if: + - at least `(100 - max_in_flight)%` of the agents are still connected + - at least `(100 - max_in_flight)%` of the agents are running the new version +- Phase 6: we incrementally update the progress, this adds a new criteria: the group progress is at 100% -#### Progress Formulas +The phase 6 backpressure update is the following: Given: ``` initial_count[group] = sum(agent_data[group].stats[*]).count ``` -Each auth server will calculate the progress as `( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and write the progress to `autoupdate_agent_plan` status. -This formula determines the progress percentage by adding a `max_in_flight` percentage-window above the number of currently updated agents in the group. +Each auth server will calculate the progress as +`( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and +write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a +`max_in_flight` percentage-window above the number of currently updated agents in the group. -However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead. -This protects against a statistical deadlock, where no UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to update. +However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the +calculated progress, that progress value will be used instead. This protects against a statistical deadlock, where no +UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to +update. 
-To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction must be imposed for the rollout to proceed: +To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction +must be imposed for the rollout to proceed: `agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]` -To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one minute will be considered in these formulas. +To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one +minute will be considered in these formulas. + +### Manually interacting with the rollout + + +#### RPCs +Users and administrators can interact with the rollout plan using the following RPCs: + +```protobuf +``` + +#### CLI + +### Editing the plan + +The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. +Changing the `target_version` resets the schedule immediately, clearing all progress. + +[TODO: What is the use-case for this? can we do like with target_version and reset all instead of trying to merge the state] +Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups. + +Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. +However, any changes to `agent_schedules` that occur while a group is active will be rejected. + +Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. + +Note that the `default` schedule applies to agents that do not specify a group name. +[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] + +```shell +# configuration +# TODO: "tctl autoudpate update" is bad UX, especially as this doen't even trigger agent update but updates the AU resource. +# We should chose a user-friendly signature +$ tctl autoupdate update --set-agent-auto-update=off +Automatic updates configuration has been updated. +$ tctl autoupdate update --group staging --set-start-hour=3 +Automatic updates configuration has been updated. +$ tctl autoupdate update --group staging --set-jitter-seconds=60 +Automatic updates configuration has been updated. +$ tctl autoupdate update --group default --set-jitter-seconds=60 +Automatic updates configuration has been updated. +$ tctl autoupdate reset +Automatic updates configuration has been reset to defaults. + +# status +$ tctl autoupdate status +Status: disabled +Version: v1.2.4 +Schedule: regular + +Groups: +staging: succeeded at 2024-01-03 23:43:22 UTC +prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod) +other: failed at 2024-01-05 22:53:22 UTC + +$ tctl autoupdate status --group staging +Status: succeeded +Date: 2024-01-03 23:43:22 UTC +Requires: (none) + +Updated: 230 (95%) +Unchanged: 10 (2%) +Failed: 15 (3%) +Timed-out: 0 + +# re-running failed group +$ tctl autoupdate run --group staging +Executing auto-update for group 'staging' immediately. +``` + +Notes: +- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating + `autoupdate_agent_plan`, while maintaining control over the rollout. 
+ +### Updater APIs + +#### Update requests + +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. + +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is +dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` +- The group state from the `autoupdate_agent_plan` status -#### Proxies +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via +unauthenticated requests to `/v1/webapi/find`. Teleport proxies modulate the `/v1/webapi/find` response given the host +UUID and group name. -When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the `autoupdate_agent_plan` status to determine the value of `agent_autoupdate: true`. -The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress percentage for the `group`: +When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the +`autoupdate_agent_plan` status to determine the value of `agent_autoupdate: true`. +The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress +percentage for the `group`: `as_numeral(host_uuid) / as_numeral(max_uuid) < progress` +The returned JSON looks like: + +`/v1/webapi/find?host=[uuid]&group=[name]` +```json +{ + "server_edition": "enterprise", + "agent_version": "15.1.1", + "agent_autoupdate": true, + "agent_update_jitter_seconds": 10 +} +``` + +Notes: + +- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of + the value in `agent_autoupdate`. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. +- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. +- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. +- the jitter is served by the teleport cluster and depends on the rollout strategy (60 sec by default, 10sec when using + the backpressure strategy). + +Let `v1` be the previous version and `v2` the target version, the response matrix is the following: + ##### Rollout status: disabled | Group state | Version | Should update | @@ -806,24 +900,41 @@ The boolean is returned as `true` in the case that the provided `host` contains | done | v2 | true | | rolledback | v1 | true | -### REST Endpoints +#### Updater status reporting -`/v1/webapi/find?host=[uuid]&group=[name]` -```json -{ - "server_edition": "enterprise", - "agent_version": "15.1.1", - "agent_autoupdate": true, - "agent_update_jitter_seconds": 10 -} -``` -Notes: -- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. -- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. -- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. 
+Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary. + +The following data related to the rollout are stored in each instance heartbeat: +- `agent_update_start_time`: timestamp of individual agent's upgrade time +- `agent_update_start_version`: current agent version +- `agent_update_rollback`: whether the agent was rolled-back automatically +- `agent_update_uuid`: Auto-update UUID +- `agent_update_group`: Auto-update group name + +[TODO: mention that we'll also send this info in the hello and store it in the auth invenotry] + +Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`). + +Every minute, auth servers persist the version counts: +- `agent_data[group].stats[version]` + - `count`: number of currently connected agents at `version` in `group` + - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade + - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group` + - `count`: number of connected agents at `version` in `group` at start of window +- `agent_data[group]` + - `canaries`: list of updater UUIDs to use for canary deployments + +Expiration time of the persisted key is 1 hour. + +To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval. +- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written. +- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_plan` status, if not already written. +- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead. + +If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. +This prevents double-counting agents when auth servers are killed. -## Details - Linux Agents +### Linux Agents We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. @@ -832,7 +943,7 @@ It will download the correct version of Teleport as a tarball, unpack it in `/va Source code for the updater will live in the main Teleport repository, with the updater binary built from `tools/teleport-update`. -### Installation +#### Installation ```shell $ apt-get install teleport @@ -853,7 +964,7 @@ $ teleport-update enable --proxy example.teleport.sh --template 'https://example ``` (Checksum will use template path + `.sha256`) -### Filesystem +#### Filesystem ``` $ tree /var/lib/teleport @@ -905,7 +1016,7 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service ``` -#### update.yaml +##### update.yaml This file stores configuration for `teleport-update`. 
@@ -936,7 +1047,7 @@ status: error: "" ``` -#### backup.yaml +##### backup.yaml This file stores metadata about an individual backup of the Teleport agent's sqlite DB. @@ -952,7 +1063,7 @@ spec: creation_time: 2020-12-09T16:09:53+00:00 ``` -### Runtime +#### Runtime The `teleport-update` binary will run as a periodically executing systemd service which runs every 10 minutes. The systemd service will run: @@ -1133,7 +1244,7 @@ The following documentation will need to be updated to cover the new updater wor Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions. -## Details - Kubernetes Agents +### Details - Kubernetes Agents The Kubernetes agent updater will be updated for compatibility with the new scheduling system. @@ -1462,7 +1573,7 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo 8. Communicate to users that they should update to the new system. 9. Begin deprecation of old auto-updater resources, packages, and endpoints. 10. Add healthcheck endpoint to Teleport agents and incorporate into rollback logic. -10. Add progress and completion checking. -10. Add canary functionality. -10. Add backpressure functionality if necessary. -11. Add DB backups if necessary. +11. Add progress and completion checking. +12. Add canary functionality. +13. Add backpressure functionality if necessary. +14. Add DB backups if necessary. From fecefc750ae672314f123ef641302bee605b1951 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 13:19:28 -0400 Subject: [PATCH 80/84] cleanup + swap phases 1 and 2 --- rfd/0169-auto-updates-linux-agents.md | 80 ++++++++++++++++----------- 1 file changed, 49 insertions(+), 31 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 0d5ce9d964b2a..c791815db0055 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -15,7 +15,7 @@ state: draft This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. -Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout strategy. +Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and a rollout strategy. Agents will be updated by a new `teleport-update` binary, built from `tools/teleport-update` in the Teleport repository. @@ -33,43 +33,49 @@ Additionally, this RFD parallels the auto-update functionality for client tools ## Why -1. We want customers always running the latest release of Teleport to always be secure, have access to the latest - features, and not deal with the pain of updating the agents. -2. Reduce Teleport Cloud operational costs of contacting customers with old agents. - Make updating easier for self-hosted customers so we don't have to provide support for older Teleport versions. -3. Increase reliability to 99.99%. +1. We want customers to run the latest release of Teleport so that they are secure and have access to the latest + features. +2. We do not want customers to deal with the pain of updating agents installed on their own infrastructure. +3. We want to reduce the operational cost of customers running old agents. + For Cloud customers, this will allow us to support fewer simultaneous cluster versions and reduce support load. + For self-hosted customers, this will reduce support load associated with debugging old versions of Teleport. +4. 
Providing 99.99% availability for customers requires us to maintain that level of availability at the agent-level + as well as the cluster-level. The current systemd updater does not meet those requirements: -- its use of package managers leads users to accidentally upgrade Teleport -- the installation process is complex and users end up installing the wrong version of Teleport -- the current update process does not provide safeties to protect against broken updates -- many customers are not adopting the existing updater because they want to control when updates happen -- we don't offer a nice user experience for self-hosted users, this ends up in a marginal automatic updates +- Its use of package managers leads users to accidentally upgrade Teleport. +- Its installation process is complex and users end up installing the wrong version of Teleport. +- Its update process does not provide safeties to protect against broken updates. +- Customers are not adopting the existing updater because they want to control when updates happen. +- We do not offer a nice user experience for self-hosted users. This results in a marginal automatic updates adoption and does not reduce the cost of upgrading self-hosted clusters. ## How -The new agent automatic updates will rely on a separate updating binary controlling which Teleport version is -installed. The automatic updates will be implemented via incremental improvements over the existing mechanism: +The new agent automatic updates will rely on a separate `teleport-update` binary controlling which Teleport version is +installed. Automatic updates will be implemented via incrementally: -- Phase 1: introduce a new updater binary not relying on package managers -- Phase 2: introduce the concept of agent update groups and make the users chose in which order groups are updated -- Phase 3: add the ability for the agent updater to immediately revert a faulty update -- Phase 4: add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status -- Phase 5: add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated -- Phase 6: add the ability to perform slow and incremental version rollouts within an agent update group +- Phase 1: Introduce a new updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents. +- Phase 2: Add the ability for the agent updater to immediately revert a faulty update. +- Phase 3: Introduce the concept of agent update groups and make users chose in which order groups are updated. +- Phase 4: Add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status. +- Phase 5: Add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated. +- Phase 6: Add the ability to perform slow and incremental version rollouts within an agent update group. The updater will be usable after phase 1, and will gain new capabilities after each phase. +After phase 2, the new updater will have feature-parity with the old updater. +The existing auto-updates mechanism will remain unchanged throughout the process, and deprecated in the future. + Future phases might change as we are working on the implementation and collecting real-world feedback and experience. ### Resources -We will introduce 2 user-facing resources: +We will introduce two user-facing resources: 1. The `autoupdate_config` resource, owned by the Teleport user. 
This resource allows Teleport users to configure: - - if automatic updates are enabled, disabled, or temporarily suspended - - in which order their agents should be updated (`dev` before `staging` before `prod`) - - when should the updates start + - Whether automatic updates are enabled, disabled, or temporarily suspended + - The order in which their agents should be updated (`dev` before `staging` before `prod`) + - When updates should start The resource will look like: ```yaml @@ -157,7 +163,7 @@ Those are the requirements coming from engineering, product, and cloud teams: 5. Self-hosted customers should be supported, for example, customers whose their own internal customer is running a Teleport agent. -6. Upgrading a leaf cluster is out of scope. +6. Upgrading a leaf cluster is out-of-scope. 7. Rolling back after a broken update should be supported. Roll forward get's you 99.9%, we need rollback for 99.99%. @@ -174,17 +180,17 @@ Those are the requirements coming from engineering, product, and cloud teams: 13. I should be able to install Teleport via whatever mechanism I want to. -14. If new nodes join a bucket outside the upgrade window and you are within your compat. window, wait until your next group update start. +14. If new nodes join a bucket outside the upgrade window, and you are within your compatibility window, wait until your next group update start. If you are not within your compat. window attempt to upgrade right away. -15. If an agent comes back online after some period of time and is still compat. with +15. If an agent comes back online after some period of time, and it is still compatible with control lane, it wait until the next upgrade window when it will be upgraded. -16. Regular cloud tenant update schedule should run in les than a week. +16. Regular cloud tenant update schedule should run in less than a week. Select tenants might support longer schedules. -17. A cloud customer should be able to pause, resume, and rollback and existing rollout schedule. - A cloud customer should not be able to create new rollout schedules. +17. A Cloud customer should be able to pause, resume, and rollback and existing rollout schedule. + A Cloud customer should not be able to create new rollout schedules. Teleport can create as many rollout schedules as it wants. @@ -618,8 +624,20 @@ status: present_count: 53 # failed_count is the number of agents rolled-back since the start of the rollout failed_count: 23 - # canaries is a list of updater UUIDs used for canary deployments - canaries: ["abc123-..."] + # canaries is a list of agents used for canary deployments + canaries: # part of phase 5 + # updater_id is the updater UUID + - updater_id: abc123-... + # host_id is the agent host UUID + host_id: def534-... + # hostname of the agent + hostname: foo.example.com + # success status + success: false + # last_update_time is [TODO: what does this represent?] + last_update_time: 2020-12-10T16:09:53+00:00 + # last_update_reason is [TODO: what does this represent?] 
+ last_update_reason: canaryTesting # progress is the current progress through the rollout progress: 0.532 # state is the current state of the rollout (unstarted, active, done, rollback) From e7b1c1005c72102ccb76f205eecdeda47dfeecec Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 2 Oct 2024 16:21:03 -0400 Subject: [PATCH 81/84] Move protobuf --- rfd/0169-auto-updates-linux-agents.md | 542 ++++++++++++++------------ 1 file changed, 285 insertions(+), 257 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index c791815db0055..2a9048d628ad7 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -68,8 +68,6 @@ The existing auto-updates mechanism will remain unchanged throughout the process Future phases might change as we are working on the implementation and collecting real-world feedback and experience. -### Resources - We will introduce two user-facing resources: 1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure: @@ -648,6 +646,291 @@ status: last_update_reason: rollback ``` +#### Protobuf + +```protobuf +syntax = "proto3"; + +package teleport.autoupdate.v1; + +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; + +// AutoUpdateService serves agent and client automatic version updates. +service AutoUpdateService { + // GetAutoUpdateConfig updates the autoupdate config. + rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // CreateAutoUpdateConfig creates the autoupdate config. + rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // UpdateAutoUpdateConfig updates the autoupdate config. + rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // UpsertAutoUpdateConfig overwrites the autoupdate config. + rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // ResetAutoUpdateConfig restores the autoupdate config to default values. + rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // GetAutoUpdateAgentPlan returns the autoupdate plan for agents. + rpc GetAutoUpdateAgentPlan(GetAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // CreateAutoUpdateAgentPlan creates the autoupdate plan for agents. + rpc CreateAutoUpdateAgentPlan(CreateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // UpdateAutoUpdateAgentPlan updates the autoupdate plan for agents. + rpc UpdateAutoUpdateAgentPlan(UpdateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // UpsertAutoUpdateAgentPlan overwrites the autoupdate plan for agents. + rpc UpsertAutoUpdateAgentPlan(UpsertAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + + // TriggerAgentGroup changes the state of an agent group from `unstarted` to `active` or `canary`. + rpc TriggerAgentGroup(TriggerAgentGroupRequest) returns (AutoUpdateAgentPlan); + // ForceAgentGroup changes the state of an agent group from `unstarted`, `canary`, or `active` to the `done` state. + rpc ForceAgentGroup(ForceAgentGroupRequest) returns (AutoUpdateAgentPlan); + // ResetAgentGroup resets the state of an agent group. + // For `canary`, this means new canaries are picked + // For `active`, this means the initial node count is computed again. 
+ rpc ResetAgentGroup(ResetAgentGroupRequest) returns (AutoUpdateAgentPlan); + // RollbackAgentGroup changes the state of an agent group to `rolledback`. + rpc RollbackAgentGroup(RollbackAgentGroupRequest) returns (AutoUpdateAgentPlan); +} + +// GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig. +message GetAutoUpdateConfigRequest {} + +// CreateAutoUpdateConfigRequest requests creation of the the AutoUpdateConfig. +message CreateAutoUpdateConfigRequest { + AutoUpdateConfig autoupdate_config = 1; +} + +// UpdateAutoUpdateConfigRequest requests an update of the the AutoUpdateConfig. +message UpdateAutoUpdateConfigRequest { + AutoUpdateConfig autoupdate_config = 1; +} + +// UpsertAutoUpdateConfigRequest requests an upsert of the the AutoUpdateConfig. +message UpsertAutoUpdateConfigRequest { + AutoUpdateConfig autoupdate_config = 1; +} + +// ResetAutoUpdateConfigRequest requests a reset of the the AutoUpdateConfig to default values. +message ResetAutoUpdateConfigRequest {} + +// AutoUpdateConfig holds dynamic configuration settings for automatic updates. +message AutoUpdateConfig { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AutoUpdateConfigSpec spec = 7; +} + +// AutoUpdateConfigSpec is the spec for the autoupdate config. +message AutoUpdateConfigSpec { + // agent_autoupdate_mode specifies whether agent autoupdates are enabled, disabled, or paused. + Mode agent_autoupdate_mode = 1; + // agent_schedules specifies schedules for updates of grouped agents. + AgentAutoUpdateSchedules agent_schedules = 3; +} + +// AgentAutoUpdateSchedules specifies update scheduled for grouped agents. +message AgentAutoUpdateSchedules { + // regular schedules for non-critical versions. + repeated AgentAutoUpdateGroup regular = 1; +} + +// AgentAutoUpdateGroup specifies the update schedule for a group of agents. +message AgentAutoUpdateGroup { + // name of the group + string name = 1; + // days to run update + repeated Day days = 2; + // start_hour to initiate update + int32 start_hour = 3; + // wait_days after last group succeeds before this group can run + int64 wait_days = 4; + // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. + int64 alert_after_hours = 5; + // jitter_seconds to introduce before update as rand([0, jitter_seconds]) + int64 jitter_seconds = 6; + // canary_count of agents to use in the canary deployment. + int64 canary_count = 7; + // max_in_flight specifies agents that can be updated at the same time, by percent. + string max_in_flight = 8; +} + +// Day of the week +enum Day { + DAY_UNSPECIFIED = 0; + DAY_ALL = 1; + DAY_SUNDAY = 2; + DAY_MONDAY = 3; + DAY_TUESDAY = 4; + DAY_WEDNESDAY = 5; + DAY_THURSDAY = 6; + DAY_FRIDAY = 7; + DAY_SATURDAY = 8; +} + +// Mode of operation +enum Mode { + // UNSPECIFIED update mode + MODE_UNSPECIFIED = 0; + // DISABLE updates + MODE_DISABLE = 1; + // ENABLE updates + MODE_ENABLE = 2; + // PAUSE updates + MODE_PAUSE = 3; +} + +// GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource. +message GetAutoUpdateAgentPlanRequest {} + +// GetAutoUpdateAgentPlanRequest requests creation of the autoupdate_agent_plan singleton resource. 
+message CreateAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; +} + +// GetAutoUpdateAgentPlanRequest requests an update of the autoupdate_agent_plan singleton resource. +message UpdateAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; +} + +// GetAutoUpdateAgentPlanRequest requests an upsert of the autoupdate_agent_plan singleton resource. +message UpsertAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; +} + +// AutoUpdateAgentPlan holds dynamic configuration settings for agent autoupdates. +message AutoUpdateAgentPlan { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AutoUpdateAgentPlanSpec spec = 5; + // status is the status of the resource. + AutoUpdateAgentPlanStatus status = 6; +} + +// AutoUpdateAgentPlanSpec is the spec for the autoupdate version. +message AutoUpdateAgentPlanSpec { + // start_version is the version to update from. + string start_version = 1; + // target_version is the version to update to. + string target_version = 2; + // schedule to use for the rollout + Schedule schedule = 3; + // strategy to use for the rollout + Strategy strategy = 4; + // autoupdate_mode to use for the rollout + Mode autoupdate_mode = 5; +} + +// Schedule type for the rollout +enum Schedule { + // UNSPECIFIED update schedule + SCHEDULE_UNSPECIFIED = 0; + // REGULAR update schedule + SCHEDULE_REGULAR = 1; + // IMMEDIATE update schedule for updating all agents immediately + SCHEDULE_IMMEDIATE = 2; +} + +// Strategy type for the rollout +enum Strategy { + // UNSPECIFIED update strategy + STRATEGY_UNSPECIFIED = 0; + // GROUPED update schedule, with no backpressure + STRATEGY_GROUPED = 1; + // BACKPRESSURE update schedule + STRATEGY_BACKPRESSURE = 2; +} + +// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. +message AutoUpdateAgentPlanStatus { + // name of the group + string name = 0; + // start_time of the rollout + google.protobuf.Timestamp start_time = 1; + // initial_count is the number of connected agents at the start of the window. + int64 initial_count = 2; + // present_count is the current number of connected agents. + int64 present_count = 3; + // failed_count specifies the number of failed agents. + int64 failed_count = 4; + // canaries is a list of canary agents. + repeated Canary canaries = 5; + // progress is the current progress through the rollout. + float progress = 6; + // state is the current state of the rollout. + State state = 7; + // last_update_time is the time of the previous update for this group. 
+ google.protobuf.Timestamp last_update_time = 8; + // last_update_reason is the trigger for the last update + string last_update_reason = 9; +} + +// Canary agent +message Canary { + // update_uuid of the canary agent + string update_uuid = 0; + // host_uuid of the canary agent + string host_uuid = 1; + // hostname of the canary agent + string hostname = 2; + // success state of the canary agent + bool success = 3; +} + +// State of the rollout +enum State { + // UNSPECIFIED state + STATE_UNSPECIFIED = 0; + // UNSTARTED state + STATE_UNSTARTED = 1; + // CANARY state + STATE_CANARY = 2; + // ACTIVE state + STATE_ACTIVE = 3; + // DONE state + STATE_DONE = 4; + // ROLLEDBACK state + STATE_ROLLEDBACK = 5; +} + +message TriggerAgentGroupRequest { + // group is the agent update group name whose maintenance should be triggered. + string group = 1; + // desired_state describes the desired start state. + // Supported values are STATE_UNSPECIFIED, STATE_CANARY, and STATE_ACTIVE. + // When left empty, defaults to canary if they are supported. + State desired_state = 2; +} + +message ForceAgentGroupRequest { + // group is the agent update group name whose state should be forced to `done`. + string group = 1; +} + +message ResetAgentGroupRequest { + // group is the agent update group name whose state should be reset. + string group = 1; +} + +message RollbackAgentGroupRequest { + // group is the agent update group name whose state should change to `rolledback`. + string group = 1; +} +``` + ### Backend logic to progress the rollout The update proceeds from the first group to the last group, ensuring that each group successfully updates before @@ -1300,261 +1583,6 @@ Care will be taken to ensure that updater logs are sharable with Teleport Suppor When TUF is added, that events related to supply chain security may be sent to the Teleport cluster via the Teleport Agent. -## Protobuf API Changes - -Note: all updates use revisions to prevent data loss in case of concurrent access. - -### autoupdate/v1 - -```protobuf -syntax = "proto3"; - -package teleport.autoupdate.v1; - -option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; - -// AutoUpdateService serves agent and client automatic version updates. -service AutoUpdateService { - // GetAutoUpdateConfig updates the autoupdate config. - rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // CreateAutoUpdateConfig creates the autoupdate config. - rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // UpdateAutoUpdateConfig updates the autoupdate config. - rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // UpsertAutoUpdateConfig overwrites the autoupdate config. - rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // ResetAutoUpdateConfig restores the autoupdate config to default values. - rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - - // GetAutoUpdateAgentPlan returns the autoupdate plan for agents. - rpc GetAutoUpdateAgentPlan(GetAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // CreateAutoUpdateAgentPlan creates the autoupdate plan for agents. - rpc CreateAutoUpdateAgentPlan(CreateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // UpdateAutoUpdateAgentPlan updates the autoupdate plan for agents. 
- rpc UpdateAutoUpdateAgentPlan(UpdateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // UpsertAutoUpdateAgentPlan overwrites the autoupdate plan for agents. - rpc UpsertAutoUpdateAgentPlan(UpsertAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); -} - -// GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig. -message GetAutoUpdateConfigRequest {} - -// CreateAutoUpdateConfigRequest requests creation of the the AutoUpdateConfig. -message CreateAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} - -// UpdateAutoUpdateConfigRequest requests an update of the the AutoUpdateConfig. -message UpdateAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} - -// UpsertAutoUpdateConfigRequest requests an upsert of the the AutoUpdateConfig. -message UpsertAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} - -// ResetAutoUpdateConfigRequest requests a reset of the the AutoUpdateConfig to default values. -message ResetAutoUpdateConfigRequest {} - -// AutoUpdateConfig holds dynamic configuration settings for automatic updates. -message AutoUpdateConfig { - // kind is the kind of the resource. - string kind = 1; - // sub_kind is the sub kind of the resource. - string sub_kind = 2; - // version is the version of the resource. - string version = 3; - // metadata is the metadata of the resource. - teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AutoUpdateConfigSpec spec = 7; -} - -// AutoUpdateConfigSpec is the spec for the autoupdate config. -message AutoUpdateConfigSpec { - // agent_autoupdate_mode specifies whether agent autoupdates are enabled, disabled, or paused. - Mode agent_autoupdate_mode = 1; - // agent_schedules specifies schedules for updates of grouped agents. - AgentAutoUpdateSchedules agent_schedules = 3; -} - -// AgentAutoUpdateSchedules specifies update scheduled for grouped agents. -message AgentAutoUpdateSchedules { - // regular schedules for non-critical versions. - repeated AgentAutoUpdateGroup regular = 1; -} - -// AgentAutoUpdateGroup specifies the update schedule for a group of agents. -message AgentAutoUpdateGroup { - // name of the group - string name = 1; - // days to run update - repeated Day days = 2; - // start_hour to initiate update - int32 start_hour = 3; - // wait_days after last group succeeds before this group can run - int64 wait_days = 4; - // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. - int64 alert_after_hours = 5; - // jitter_seconds to introduce before update as rand([0, jitter_seconds]) - int64 jitter_seconds = 6; - // canary_count of agents to use in the canary deployment. - int64 canary_count = 7; - // max_in_flight specifies agents that can be updated at the same time, by percent. - string max_in_flight = 8; -} - -// Day of the week -enum Day { - DAY_UNSPECIFIED = 0; - DAY_ALL = 1; - DAY_SUNDAY = 2; - DAY_MONDAY = 3; - DAY_TUESDAY = 4; - DAY_WEDNESDAY = 5; - DAY_THURSDAY = 6; - DAY_FRIDAY = 7; - DAY_SATURDAY = 8; -} - -// Mode of operation -enum Mode { - // UNSPECIFIED update mode - MODE_UNSPECIFIED = 0; - // DISABLE updates - MODE_DISABLE = 1; - // ENABLE updates - MODE_ENABLE = 2; - // PAUSE updates - MODE_PAUSE = 3; -} - -// GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource. -message GetAutoUpdateAgentPlanRequest {} - -// GetAutoUpdateAgentPlanRequest requests creation of the autoupdate_agent_plan singleton resource. 
-message CreateAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} - -// GetAutoUpdateAgentPlanRequest requests an update of the autoupdate_agent_plan singleton resource. -message UpdateAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} - -// GetAutoUpdateAgentPlanRequest requests an upsert of the autoupdate_agent_plan singleton resource. -message UpsertAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} - -// AutoUpdateAgentPlan holds dynamic configuration settings for agent autoupdates. -message AutoUpdateAgentPlan { - // kind is the kind of the resource. - string kind = 1; - // sub_kind is the sub kind of the resource. - string sub_kind = 2; - // version is the version of the resource. - string version = 3; - // metadata is the metadata of the resource. - teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AutoUpdateAgentPlanSpec spec = 5; - // status is the status of the resource. - AutoUpdateAgentPlanStatus status = 6; -} - -// AutoUpdateAgentPlanSpec is the spec for the autoupdate version. -message AutoUpdateAgentPlanSpec { - // start_version is the version to update from. - string start_version = 1; - // target_version is the version to update to. - string target_version = 2; - // schedule to use for the rollout - Schedule schedule = 3; - // strategy to use for the rollout - Strategy strategy = 4; - // autoupdate_mode to use for the rollout - Mode autoupdate_mode = 5; -} - -// Schedule type for the rollout -enum Schedule { - // UNSPECIFIED update schedule - SCHEDULE_UNSPECIFIED = 0; - // REGULAR update schedule - SCHEDULE_REGULAR = 1; - // IMMEDIATE update schedule for updating all agents immediately - SCHEDULE_IMMEDIATE = 2; -} - -// Strategy type for the rollout -enum Strategy { - // UNSPECIFIED update strategy - STRATEGY_UNSPECIFIED = 0; - // GROUPED update schedule, with no backpressure - STRATEGY_GROUPED = 1; - // BACKPRESSURE update schedule - STRATEGY_BACKPRESSURE = 2; -} - -// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. -message AutoUpdateAgentPlanStatus { - // name of the group - string name = 0; - // start_time of the rollout - google.protobuf.Timestamp start_time = 1; - // initial_count is the number of connected agents at the start of the window. - int64 initial_count = 2; - // present_count is the current number of connected agents. - int64 present_count = 3; - // failed_count specifies the number of failed agents. - int64 failed_count = 4; - // canaries is a list of canary agents. - repeated Canary canaries = 5; - // progress is the current progress through the rollout. - float progress = 6; - // state is the current state of the rollout. - State state = 7; - // last_update_time is the time of the previous update for this group. 
- google.protobuf.Timestamp last_update_time = 8; - // last_update_reason is the trigger for the last update - string last_update_reason = 9; -} - -// Canary agent -message Canary { - // update_uuid of the canary agent - string update_uuid = 0; - // host_uuid of the canary agent - string host_uuid = 1; - // hostname of the canary agent - string hostname = 2; - // success state of the canary agent - bool success = 3; -} - -// State of the rollout -enum State { - // UNSPECIFIED state - STATE_UNSPECIFIED = 0; - // UNSTARTED state - STATE_UNSTARTED = 1; - // CANARY state - STATE_CANARY = 2; - // ACTIVE state - STATE_ACTIVE = 3; - // DONE state - STATE_DONE = 4; - // ROLLEDBACK state - STATE_ROLLEDBACK = 5; -} - -``` - ## Alternatives ### `teleport update` Subcommand From 2a5515e7d8b013a54b7c1a98c364decb8487f467 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 2 Oct 2024 18:28:27 -0400 Subject: [PATCH 82/84] Add installation scenarios --- rfd/0169-auto-updates-linux-agents.md | 119 ++++++++++++++------------ 1 file changed, 66 insertions(+), 53 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2a9048d628ad7..a0f73b35f5caf 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -515,6 +515,71 @@ tctl auto-update agent status ``` +#### As a Teleport user, I want to install a new agent automatically updated + +The manual way: + +```bash +wget https://cdn.teleport.dev/teleport-updater-- +chmod +x teleport-updater +./teleport-updater enable example.teleport.sh --group production +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed +# Enabling automatic updates, the agent is part of the "production" update group. +# You can now configure the teleport agent with `teleport configure` or by writing your own `teleport.yaml`. +# When the configuration is done, enable and start teleport by running: +# `systemctl start teleport && systemctl enable teleport` +``` + +The one-liner: + +``` +curl https://cdn.teleport.dev/auto-install | bash -s example.teleport.sh +# Downloading the teleport updater +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed +# Enabling automatic updates, the agent is part of the "default" update group. +# You can now configure the teleport agent with `teleport configure` or by writing your own `teleport.yaml`. +# When the configuration is finished, enable and start teleport by running: +# `systemctl start teleport && systemctl enable teleport` +``` + +I can also install teleport using the package manager, then enroll the agent into AUs. See the section below: + +#### As a Teleport user I want to enroll my existing agent into AUs + +I have an agent, installed from a package manager or by manually unpacking the tarball. +I have the teleport updater installed and available in my path. +I run: + +```shell +teleport-updater enable --group production +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed, reloading the service. 
+# Enabling automatic updates, the agent is part of the "production" update group. +``` + +> [!NOTE] +> The updater saw the teleport unit running and the existing teleport configuration. +> It used the configuration to pick the right proxy address. As teleport is already running, the teleport service is +> reloaded to use the new binary. + + ### Teleport Resources #### Autoupdate Config @@ -1058,14 +1123,7 @@ minute will be considered in these formulas. ### Manually interacting with the rollout - -#### RPCs -Users and administrators can interact with the rollout plan using the following RPCs: - -```protobuf -``` - -#### CLI +[TODO add cli commands] ### Editing the plan @@ -1083,51 +1141,6 @@ Releasing new agent versions multiple times a week has the potential to starve d Note that the `default` schedule applies to agents that do not specify a group name. [TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] -```shell -# configuration -# TODO: "tctl autoudpate update" is bad UX, especially as this doen't even trigger agent update but updates the AU resource. -# We should chose a user-friendly signature -$ tctl autoupdate update --set-agent-auto-update=off -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-start-hour=3 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group default --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate reset -Automatic updates configuration has been reset to defaults. - -# status -$ tctl autoupdate status -Status: disabled -Version: v1.2.4 -Schedule: regular - -Groups: -staging: succeeded at 2024-01-03 23:43:22 UTC -prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod) -other: failed at 2024-01-05 22:53:22 UTC - -$ tctl autoupdate status --group staging -Status: succeeded -Date: 2024-01-03 23:43:22 UTC -Requires: (none) - -Updated: 230 (95%) -Unchanged: 10 (2%) -Failed: 15 (3%) -Timed-out: 0 - -# re-running failed group -$ tctl autoupdate run --group staging -Executing auto-update for group 'staging' immediately. -``` - -Notes: -- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating - `autoupdate_agent_plan`, while maintaining control over the rollout. - ### Updater APIs #### Update requests From 4b33f2a4065f5a7615bb5d26a7ea014ed7f36cd3 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:09:53 -0400 Subject: [PATCH 83/84] cleanup + move backpressure formulas --- rfd/0169-auto-updates-linux-agents.md | 228 +++++++++++++------------- 1 file changed, 114 insertions(+), 114 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index a0f73b35f5caf..0c73af930fa50 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -1,5 +1,5 @@ --- -authors: Stephen Levine (stephen.levine@goteleport.com) +authors: Stephen Levine (stephen.levine@goteleport.com) & Hugo Hervieux (hugo.hervieux@goteleport.com) state: draft --- @@ -45,24 +45,24 @@ Additionally, this RFD parallels the auto-update functionality for client tools The current systemd updater does not meet those requirements: - Its use of package managers leads users to accidentally upgrade Teleport. 
- Its installation process is complex and users end up installing the wrong version of Teleport. -- Its update process does not provide safeties to protect against broken updates. +- Its update process does not provide sufficient safeties to protect against broken updates. - Customers are not adopting the existing updater because they want to control when updates happen. - We do not offer a nice user experience for self-hosted users. This results in a marginal automatic updates adoption and does not reduce the cost of upgrading self-hosted clusters. ## How -The new agent automatic updates will rely on a separate `teleport-update` binary controlling which Teleport version is -installed. Automatic updates will be implemented via incrementally: +The new agent automatic updates system will rely on a separate `teleport-update` binary controlling which Teleport version is +installed. Automatic updates will be implemented incrementally: -- Phase 1: Introduce a new updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents. +- Phase 1: Introduce a new, self-updating updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents. - Phase 2: Add the ability for the agent updater to immediately revert a faulty update. -- Phase 3: Introduce the concept of agent update groups and make users chose in which order groups are updated. +- Phase 3: Introduce the concept of agent update groups and make users chose the order in which groups are updated. - Phase 4: Add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status. - Phase 5: Add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated. - Phase 6: Add the ability to perform slow and incremental version rollouts within an agent update group. -The updater will be usable after phase 1, and will gain new capabilities after each phase. +The updater will be usable after phase 1 and will gain new capabilities after each phase. After phase 2, the new updater will have feature-parity with the old updater. The existing auto-updates mechanism will remain unchanged throughout the process, and deprecated in the future. @@ -72,8 +72,9 @@ We will introduce two user-facing resources: 1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure: - Whether automatic updates are enabled, disabled, or temporarily suspended - - The order in which their agents should be updated (`dev` before `staging` before `prod`) - - When updates should start + - The order in which agents should be updated (`dev` before `staging` before `prod`) + - Times when agent updates should start + - Configuration for client auto-updates (e.g., `tsh` and `tctl`), which are out-of-scope for this RFD The resource will look like: ```yaml @@ -97,9 +98,9 @@ We will introduce two user-facing resources: max_in_flight: 20% # added in phase 6 ``` -2. The `autoupdate_agent_plan` resource, its spec is owned by the Teleport cluster administrator (e.g. Teleport Cloud team). - Its status is owned by Teleport and contains the current rollout status. Some parts of the status can be changed via - select RPCs, for example fast-tracking a group update. +2. The `autoupdate_agent_plan` resource, with `spec` owned by the Teleport cluster administrator (e.g. Teleport Cloud team). + Its `status` is owned by Teleport and contains the current rollout status. 
+   select RPCs (for example, an RPC to fast-track a group update).
   ```yaml
   kind: autoupdate_agent_plan
   spec:
@@ -118,12 +119,12 @@ We will introduce two user-facing resources:
        progress: 0
        state: canaries
        canaries: # part of phase 5
-        - updater_id: abc
-          host_id: def
+        - updater_uuid: abc
+          host_uuid: def
          hostname: foo.example.com
          success: false
-          last_update_time: 2020-12-10T16:09:53+00:00
-          last_update_reason: canaryTesting
+          last_update_time: 2020-12-10T16:09:53+00:00
+          last_update_reason: canaryTesting
      - name: prod
        start_time: 0000-00-00
        initial_count: 0
@@ -144,29 +145,29 @@ users who want to know the motivations behind this specific design.

### Product requirements

-Those are the requirements coming from engineering, product, and cloud teams:
+These are the requirements coming from engineering, product, and Cloud teams:

-1. Phased rollout for our tenants. We should be able to control the agent version per-tenant.
+1. Phased rollout for Cloud tenants. We should be able to control the agent version per-tenant.

-2. Bucketed rollout that tenants have control over.
+2. Bucketed rollout that customers have control over.
   - Control the bucket update day
   - Control the bucket update hour
   - Ability to pause a rollout

-3. Customers should be able to run "apt-get update" without updating Teleport.
+3. Customers should be able to run "apt-get upgrade" without updating Teleport.

-   Installation from a package manager should be possible, but the version should be controlled by Teleport.
+   Installation from a package manager should be possible, but the version should still be controlled by Teleport.

4. Self-managed updates should be a first-class citizen. Teleport must advertise the desired agent and client version.

-5. Self-hosted customers should be supported, for example, customers whose their own internal customer is running a Teleport agent.
+5. Self-hosted customers should be supported, for example, customers whose own internal customers run Teleport agents.

-6. Upgrading a leaf cluster is out-of-scope.
+6. Upgrading leaf clusters is out-of-scope.

-7. Rolling back after a broken update should be supported. Roll forward get's you 99.9%, we need rollback for 99.99%.
+7. Rolling back after a broken update should be supported. Roll forward gets you 99.9%; we need rollback for 99.99%.

8. We should have high-quality metrics that report the version agents are running and whether they are running automatic
-   updates. For users and us.
+   updates. For both users and us.

9. Best effort should be made to apply automatic updates in a way that does not terminate sessions.
   (Currently only supported for SSH)

@@ -174,27 +175,25 @@

11. Teleport Discover installation (curl one-liner) should be supported.

-12. We need to support repo mirrors.
+12. We need to support Docker image repository mirrors and Teleport artifact mirrors.

-13. I should be able to install Teleport via whatever mechanism I want to.
+13. I should be able to install an auto-updating deployment of Teleport via whatever mechanism I want to, including OS packages such as apt and yum.

14. If new nodes join a bucket outside the upgrade window, and you are within your compatibility window,
    wait until the next group update starts.
-    If you are not within your compat. window attempt to upgrade right away.
+    If you are not within your compatibility window, attempt to upgrade right away.

15. If an agent comes back online after some period of time, and it is still compatible with
-   control lane, it wait until the next upgrade window when it will be upgraded.
+   the control plane, it should wait until the next upgrade window to be upgraded.

-16. Regular cloud tenant update schedule should run in less than a week.
-    Select tenants might support longer schedules.
+16. Regular agent updates for Cloud tenants should complete in less than a week.
+    (Select tenants may support longer schedules, at the Cloud team's discretion.)

17. A Cloud customer should be able to pause, resume, and roll back an existing rollout schedule.
    A Cloud customer should not be able to create new rollout schedules.
    Teleport can create as many rollout schedules as it wants.

-18. A user on the host, should be able to turn autoupdate off or select a version for that particular host.
-
-19. Operating system packages should be supported.
+18. A user logged in to the agent host should be able to disable agent auto-updates and pin a version for that particular host.

### User Stories

@@ -204,7 +203,7 @@

Before

```shell
-tctl auto-update agent status
+tctl autoupdate agent status
# Rollout plan created on YYYY-MM-DD
# Previous version: v1
# New version: v2
@@ -224,11 +223,14 @@
tctl autoupdate agent new-rollout v3
# created new rollout from v2 to v3
```

+TODO(sclevine): What about `update` or `target` instead of `new-rollout`?
+  `new-rollout` seems like we're creating a new resource, not changing target version.
+
After

```shell
-tctl auto-update agent status
+tctl autoupdate agent status
# Rollout plan created on YYYY-MM-DD
# Previous version: v2
# New version: v3
@@ -256,13 +258,13 @@

Now, new agents will install v2 by default, and v3 after the maintenance.

> [!NOTE]
> ...
> # created new update plan from v1 to v3
> ```

-#### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99%
+#### As Teleport Cloud I want to minimize damage caused by broken versions to ensure we maintain 99.99% availability

-##### Failure mode 1: the new version crashes
+##### Failure mode 1(a): the new version crashes

-I create a new deployment, with a broken version. The version is deployed to the canaries.
+I create a new deployment with a broken version. The version is deployed to the canaries.
The canaries crash, the updater reverts the update, the agents connect back online and
-advertise they rolled-back. The maintenance is stuck until the canaries are running the target version.
+advertise they have rolled back. The maintenance is stuck until the canaries are running the target version.
Autoupdate agent plan @@ -285,10 +287,10 @@ status: progress: 0 state: canaries canaries: - - updater_id: abc - host_id: def - hostname: foo.example.com - success: false + - updater_uuid: abc + host_uuid: def + hostname: foo.example.com + success: false last_update_time: 2020-12-10T16:09:53+00:00 last_update_reason: canaryTesting - name: staging @@ -304,25 +306,25 @@ status:
Both the customer and I get an alert if the canary testing has not succeeded after an hour.
-Teleport cloud operators and the user can access the canary hostname and hostid
-to
+Teleport Cloud operators and the customer can access the canary hostname and host_uuid
+to identify the broken agent.
The rollout resumes.

-##### Failure mode 1 bis: the new version crashes, but not on the canaries
+##### Failure mode 1(b): the new version crashes, but not on the canaries

This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents.
For example: [the agent fails to read cloud-provider specific metadata and crashes](TODO add link).

-The canaries might not select one of the affected agent and allow the update to proceed.
+The canaries might not select one of the affected agents and allow the update to proceed.
All agents are updated, and all agents hosted on the cloud provider affected by the bug crash.

The updaters of the affected agents will attempt to self-heal by reverting to the previous version.
-Once the previous Teleport version is running, the agent will advertise its update failed and it had to rollback.
+Once the previous Teleport version is running, the agent will advertise the update failed and that it had to roll back.
-If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+If too many agents fail, this will block the group from transitioning from `active` to `done`, protecting future
groups from the faulty updates.

-##### Failure mode 2: the new version crashes, and the old version cannot start
+##### Failure mode 2(a): the new version crashes, and the old version cannot start

I create a new deployment with a broken version. The version is deployed to the canaries.
The canaries attempt the update, and the new Teleport instance crashes.
@@ -336,7 +338,7 @@ The group update is stuck until the canary comes back online and runs the latest
The customer and Teleport Cloud receive an alert. The customer and Teleport Cloud can retrieve the host UUID and
hostname of the faulty canaries. With this information, they can troubleshoot the failed agents.

-##### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries
+##### Failure mode 2(b): the new version crashes, and the old version cannot start, but not on the canaries

This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents.
For example: a clock drift blocks agents from re-connecting to Teleport.

The canaries might not select one of the affected agents and allow the update to proceed.
All agents are updated, and all agents hosted on the cloud provider affected by the bug crash.
The updater fails to self-heal as the old version does not start anymore.

-If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+If too many agents fail, this will block the group from transitioning from `active` to `done`, protecting future
groups from the faulty updates.

In this case, it's hard to identify which agent dropped.

@@ -605,8 +607,8 @@ spec:
      # name of the group. Must only contain valid backend / resource name characters.
    - name: staging
      # days specifies the days of the week when the group may be updated.
+      # mandatory value for most Cloud customers: ["Mon", "Tue", "Wed", "Thu"]
      # default: ["*"] (all days)
-      # TODO: explicit the supported values based on the customer QoS
      days: [ "Sun", "Mon", ... | "*" ]
      # start_hour specifies the hour when the group may start upgrading.
      # default: 0
@@ -649,13 +651,13 @@ spec:

#### Autoupdate agent plan

The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator.
-In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local
+In Teleport Cloud, this is the Cloud operations team. For self-hosted setups, this is the user with access to the local
admin socket (tctl on the local machine).

> [!NOTE]
> This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport.
> However, Teleport does not provide any role or permission that we could reserve for Teleport Cloud operations and that cannot be
-> granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations.
+> granted to users. To do away with local admin rights, we need a way to have Cloud-only or admin-only operations.
> This would also improve Cloud team operations by allowing the team to interact with the Teleport API rather than executing tctl locally.
>
> Solving this problem is out of the scope of this RFD.

@@ -689,18 +691,14 @@ status:
  failed_count: 23
  # canaries is a list of agents used for canary deployments
  canaries: # part of phase 5
-    # updater_id is the updater UUID
-    - updater_id: abc123-...
-      # host_id is the agent host UUID
-      host_id: def534-...
+    # updater_uuid is the updater UUID
+    - updater_uuid: abc123-...
+      # host_uuid is the agent host UUID
+      host_uuid: def534-...
      # hostname of the agent
      hostname: foo.example.com
      # success status
      success: false
-      # last_update_time is [TODO: what does this represent?]
-      last_update_time: 2020-12-10T16:09:53+00:00
-      # last_update_reason is [TODO: what does this represent?]
-      last_update_reason: canaryTesting
  # progress is the current progress through the rollout
  progress: 0.532
  # state is the current state of the rollout (unstarted, active, done, rollback)
@@ -922,37 +920,37 @@ enum Strategy {
// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan.
message AutoUpdateAgentPlanStatus {
  // name of the group
-  string name = 0;
+  string name = 1;
  // start_time of the rollout
-  google.protobuf.Timestamp start_time = 1;
+  google.protobuf.Timestamp start_time = 2;
  // initial_count is the number of connected agents at the start of the window.
-  int64 initial_count = 2;
+  int64 initial_count = 3;
  // present_count is the current number of connected agents.
-  int64 present_count = 3;
+  int64 present_count = 4;
  // failed_count specifies the number of failed agents.
-  int64 failed_count = 4;
+  int64 failed_count = 5;
  // canaries is a list of canary agents.
-  repeated Canary canaries = 5;
+  repeated Canary canaries = 6;
  // progress is the current progress through the rollout.
-  float progress = 6;
+  float progress = 7;
  // state is the current state of the rollout.
-  State state = 7;
+  State state = 8;
  // last_update_time is the time of the previous update for this group.
-  google.protobuf.Timestamp last_update_time = 8;
+  google.protobuf.Timestamp last_update_time = 9;
  // last_update_reason is the trigger for the last update
-  string last_update_reason = 9;
+  string last_update_reason = 10;
}

// Canary agent
message Canary {
  // update_uuid of the canary agent
-  string update_uuid = 0;
+  string update_uuid = 1;
  // host_uuid of the canary agent
-  string host_uuid = 1;
+  string host_uuid = 2;
  // hostname of the canary agent
-  string hostname = 2;
+  string hostname = 3;
  // success state of the canary agent
-  bool success = 3;
+  bool success = 4;
}

// State of the rollout
@@ -999,12 +997,12 @@ message RollbackAgentGroupRequest {

### Backend logic to progress the rollout

The update proceeds from the first group to the last group, ensuring that each group successfully updates before
-allowing the next group to proceed. By default, only 5 agent groups are allowed, this mitigates very long rollout plans.
+allowing the next group to proceed. By default, only 5 agent groups are allowed. This mitigates very long rollout plans.

#### Agent update mode

The agent auto update mode is specified by both Cloud (via `autoupdate_agent_plan`)
-and by the customer (via `autoupdate_config`). The agent update mode control whether
+and by the customer (via `autoupdate_config`). The agent update mode controls whether
the cluster is enrolled in automatic agent updates.

The agent update mode can take 3 values:

@@ -1016,10 +1014,10 @@ The agent update mode can take 3 values:

The cluster agent rollout mode is computed by taking the lowest value. For example:

-- cloud says `enabled` and the customer says `enabled` -> the updates are `enabled`
-- cloud says `enabled` and the customer says `suspended` -> the updates are `suspended`
-- cloud says `disabled` and the customer says `suspended` -> the updates are `disabled`
-- cloud says `disabled` and the customer says `enabled` -> the updates are `disabled`
+- Cloud says `enabled` and the customer says `enabled` -> the updates are `enabled`
+- Cloud says `enabled` and the customer says `suspended` -> the updates are `suspended`
+- Cloud says `disabled` and the customer says `suspended` -> the updates are `disabled`
+- Cloud says `disabled` and the customer says `enabled` -> the updates are `disabled`

The Teleport cluster only progresses the rollout if the mode is `enabled`.

@@ -1062,9 +1060,9 @@ flowchart TD

A group can be started if the following criteria are met:

- all of its previous groups are in the `done` state
-- it has been at least `wait_days` until the previous group update started
+- it has been at least `wait_days` since the previous group update started
- the current week day is in the `days` list
-- the current hours equals the `hour` field
+- the current hour equals the `hour` field

When all those criteria are met, the auth will transition the group into a new state.
If `canary_count` is not null, the group transitions to the `canary` state.
Else it transitions to the `active` state.

#### Canary testing (phase 5)

@@ -1078,9 +1076,9 @@ update success criteria.

A group in `canary` state will get assigned canaries. The proxies will instruct those canaries to update now.

-During each reconciliation loop, the auth will lookup the instance healthcheck in the backend of the canaries.
+During each reconciliation loop, the auth will look up the instance heartbeat of each canary in the backend.
-Once all canaries have a healthcheck containing the new version (the healthcheck must not be older than 20 minutes),
+Once all canaries have a heartbeat containing the new version (the heartbeat must not be older than 20 minutes),
they successfully came back online and the group can transition to the `active` state.

If canaries never update, report rollback, or disappear, the group will stay stuck in `canary` state.
An alert will eventually fire, warning the user about the stuck update.

#### Updating a group

-A group in `active` mode is currently being updated. The conditions to leave te `active` mode and transition to the
+A group in `active` mode is currently being updated. The conditions to leave `active` mode and transition to the
`done` mode will vary based on the phase and rollout strategy.

-- Phase 2: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
+- Phase 3: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
- Phase 4: we know about the connected agent count and the connected agent versions. The group transitions to `done` if:
  - at least `(100 - max_in_flight)%` of the agents are still connected
  - at least `(100 - max_in_flight)%` of the agents are running the new version
- Phase 6: we incrementally update the progress; this adds a new criterion: the group progress is at 100%

-The phase 6 backpressure update is the following:
-
-Given:
-```
-initial_count[group] = sum(agent_data[group].stats[*]).count
-```
-
-Each auth server will calculate the progress as
-`( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and
-write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a
-`max_in_flight` percentage-window above the number of currently updated agents in the group.
-
-However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the
-calculated progress, that progress value will be used instead. This protects against a statistical deadlock, where no
-UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to
-update.
-
-To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction
-must be imposed for the rollout to proceed:
-`agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]`
-
-To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one
-minute will be considered in these formulas.
+The phase 6 backpressure calculations are covered in the Backpressure Calculations section below.

### Manually interacting with the rollout

@@ -1225,7 +1201,7 @@ The following data related to the rollout are stored in each instance heartbeat:
- `agent_update_uuid`: Auto-update UUID
- `agent_update_group`: Auto-update group name

-[TODO: mention that we'll also send this info in the hello and store it in the auth invenotry]
+[TODO: mention that we'll also send this info in the hello and store it in the auth inventory]

Auth servers use their local instance inventory to calculate rollout statistics and write them to
`/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`).
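To make this aggregation step concrete, here is a minimal Go sketch of how an auth server could count its connected agents per group and version and publish the result under its own key. The type names and the `Backend` interface below are assumptions for illustration, not Teleport's actual inventory or backend APIs.

```go
// Illustrative sketch only: Heartbeat, Backend, and the key layout are
// placeholders for this example, not Teleport's real inventory APIs.
package rollout

import (
	"context"
	"encoding/json"
	"fmt"
	"time"
)

// Heartbeat is the subset of instance-heartbeat fields needed for rollout stats.
type Heartbeat struct {
	Group       string    // agent_update_group
	Version     string    // currently running agent version
	ConnectedAt time.Time // when the agent connected to this auth server
}

// Backend is a minimal view of the cluster-state backend.
type Backend interface {
	Put(ctx context.Context, key string, value []byte, ttl time.Duration) error
}

// WriteGroupStats counts this auth server's connected agents per (group, version)
// and writes one stats key per group under /autoupdate/[group]/[auth ID].
func WriteGroupStats(ctx context.Context, authID string, hbs []Heartbeat, b Backend) error {
	now := time.Now()
	stats := make(map[string]map[string]int) // group -> version -> count
	for _, hb := range hbs {
		// Only count agents connected for at least one minute, so that an
		// agent reconnecting between auth servers is not counted twice.
		if now.Sub(hb.ConnectedAt) < time.Minute {
			continue
		}
		if stats[hb.Group] == nil {
			stats[hb.Group] = make(map[string]int)
		}
		stats[hb.Group][hb.Version]++
	}
	for group, versions := range stats {
		payload, err := json.Marshal(versions)
		if err != nil {
			return err
		}
		key := fmt.Sprintf("/autoupdate/%s/%s", group, authID)
		// A short TTL lets stats from dead auth servers age out instead of
		// being summed into the rollout calculations forever.
		if err := b.Put(ctx, key, payload, time.Minute); err != nil {
			return err
		}
	}
	return nil
}
```

The short TTL mirrors the staleness rule below: stats written by an auth server that has since died expire on their own rather than being double-counted when the rollout progresses.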
@@ -1248,6 +1224,30 @@ To progress the rollout, auth servers will range-read keys from `/autoupdate/[gr
If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents.
This prevents double-counting agents when auth servers are killed.

+#### Backpressure Calculations
+
+Given:
+```
+initial_count[group] = sum(agent_data[group].stats[*]).count
+```
+
+Each auth server will calculate the progress as
+`( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and
+write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a
+`max_in_flight` percentage-window above the number of currently updated agents in the group.
+
+However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the
+calculated progress, that progress value will be used instead. This protects against a statistical deadlock, where no
+UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to
+update.
+
+To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an additional restriction
+must be imposed for the rollout to proceed:
+`agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]`
+
+To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for at
+least one minute will be considered in these formulas.
+
### Linux Agents

We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager.

From 0a0d6589108ce04f7f833d77a56c4142355ee179 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Wed, 2 Oct 2024 21:17:57 -0400
Subject: [PATCH 84/84] more cleanup

---
 rfd/0169-auto-updates-linux-agents.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 0c73af930fa50..e46d776989380 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -223,8 +223,8 @@ tctl autoupdate agent new-rollout v3
 # created new rollout from v2 to v3
 ```
 
-TODO(sclevine): What about `update` or `target` instead of `new-rollout`?
-  `new-rollout` seems like we're creating a new resource, not changing target version.
+[TODO(sclevine): What about `update` or `target` instead of `new-rollout`?
+  `new-rollout` seems like we're creating a new resource, not changing target version.]
After

@@ -1114,8 +1114,8 @@ However, any changes to `agent_schedules` that occur while a group is active wil

Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates.

-Note that the `default` schedule applies to agents that do not specify a group name.
-[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?]
+Note that the `default` group applies to agents that do not specify a group name.
+If a `default` group is not present, the last group is treated as the default.

### Updater APIs

@@ -1156,10 +1156,10 @@ Notes:
- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of
  the value in `agent_autoupdate`.
-- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured.
+- The edition served is the cluster edition (enterprise, enterprise-fips, or oss) and cannot be configured.
- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater.
- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing.
-- the jitter is served by the teleport cluster and depends on the rollout strategy (60 sec by default, 10sec when using
+- The jitter is served by the Teleport cluster and depends on the rollout strategy (60s by default, 10s when using
  the backpressure strategy).

Let `v1` be the previous version and `v2` the target version; the response matrix is as follows:

@@ -1192,7 +1192,7 @@ Let `v1` be the previous version and `v2` the target version, the response matri

#### Updater status reporting

-Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary.
+Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` and `/tmp/teleport_update_uuid` by the `teleport-update` binary.

The following data related to the rollout are stored in each instance heartbeat:
- `agent_update_start_time`: timestamp of individual agent's upgrade time

@@ -1627,7 +1627,7 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo
3. Implement changes to Kubernetes auto-updater.
4. Test extensively on all supported Linux distributions.
5. Prep documentation changes.
-6. Release via `teleport` package and script for packageless install.
+6. Release via `teleport` package and script for package-less installation.
7. Release documentation changes.
8. Communicate to users that they should update to the new system.
9. Begin deprecation of old auto-updater resources, packages, and endpoints.