[Upgrade Details] Ensure details report UPG_WATCHING
for the entire time that the upgrade is being watched
#3827
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
// - the marker was just created and the upgrade is about to start
//   (marker.details.state should not be empty), or
// - the upgrade was rolled back (marker.details.state should be UPG_ROLLBACK)
Is there a third option we haven't considered? The case where the agent is restarted in the middle of an upgrade, for example when it is in the UPG_EXTRACTING state. The obvious example would be the host system powering off.
What happens in this case? What is reported in the upgrade details?
If we just rolled back, is the previous version of the agent guaranteed to be next to the currently running agent in the data path? That is, is the state of the filesystem always something like:
data/
elastic-agent-current/
elastic-agent-next/
If this is true, can we look to see if there's another agent in the data directory next to us to detect if there was a roll back?
We will likely have both versions next to each other if the upgrade is interrupted immediately after the artifact is extracted, so that may not be 100% either.
One more thing, in the case that the host system powers off we should start calling https://pkg.go.dev/os#File.Sync on the marker file. It doesn't look like we do this today.
elastic-agent/internal/pkg/agent/application/upgrade/marker_access_other.go
Lines 26 to 30 in 97e8217
// On non-Windows platforms, writeMarkerFile simply writes the marker file.
// See marker_access_windows.go for behavior on Windows platforms.
func writeMarkerFile(markerFile string, markerBytes []byte) error {
	return os.WriteFile(markerFilePath(), markerBytes, 0600)
}
This might only be worth doing around critical transitions, like when the agent process re-execs or the watcher first starts up.
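For reference, a minimal sketch of what fsync-ing the marker file could look like (the writeMarkerFileWithSync name and the error-wrapping style here are illustrative, not the actual change that ended up in #3836):

```go
package upgrade

import (
	"fmt"
	"os"
)

// writeMarkerFileWithSync is a hypothetical variant of writeMarkerFile that
// flushes the marker contents to stable storage before returning, so the
// marker survives an abrupt power loss.
func writeMarkerFileWithSync(markerFile string, markerBytes []byte) error {
	f, err := os.OpenFile(markerFile, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0600)
	if err != nil {
		return fmt.Errorf("failed to open marker file: %w", err)
	}

	if _, err := f.Write(markerBytes); err != nil {
		f.Close()
		return fmt.Errorf("failed to write marker file: %w", err)
	}

	// os.File.Sync forces the written bytes out of the OS page cache onto disk.
	if err := f.Sync(); err != nil {
		f.Close()
		return fmt.Errorf("failed to sync marker file: %w", err)
	}
	return f.Close()
}
```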
One more thing, in the case that the host system powers off we should start calling https://pkg.go.dev/os#File.Sync on the marker file...
I've implemented the fsync change in its own PR, since it's not strictly related to this PR here: #3836
Is there a third option we haven't considered? The case where the agent is restarted in the middle of an upgrade, for example when it is in the UPG_EXTRACTING state. The obvious example would be the host system powering off.
What happens in this case? What is reported in the upgrade details?
It depends on whether the state transition happens before or after the upgrade marker file comes into existence.
The following states occur before the upgrade marker file comes into existence and, as such, are never persisted in it: UPG_REQUESTED, UPG_SCHEDULED, UPG_DOWNLOADING, UPG_EXTRACTING. Additionally, the UPG_RESTARTING state is also currently not being persisted to the upgrade marker file, mostly because of where this state transition happens in the code vs. where the upgrade marker file is being created. With some refactoring, we could start persisting this state to the upgrade marker file as well. So if the Agent were to restart during one of these states, the upgrade details that are stored in the Coordinator state (and from there sent to Fleet) would get reset to nothing and the upgrade state would be lost.
The following states occur right before or after the upgrade marker file comes into existence and, as such, do get persisted to it: UPG_REPLACING, UPG_WATCHING, UPG_ROLLBACK. So if the Agent were to restart during one of these states, the upgrade details from the upgrade marker would be restored to the Coordinator state (and from there sent to Fleet).
We may want to consider either persisting upgrade details in their own file throughout the upgrade process OR creating the upgrade marker file at the start of the upgrade process instead of where it's being created now (right before the Upgrade Watcher is invoked from the old Agent).
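As a rough illustration of the first option (the type, field names, and file path handling below are assumptions made for the sketch, not the actual details schema), persisting the details on every state transition could be as simple as:

```go
package upgrade

import (
	"encoding/json"
	"os"
	"time"
)

// upgradeDetails mirrors the kind of information discussed above; the field
// names are illustrative only.
type upgradeDetails struct {
	State         string    `json:"state"`
	TargetVersion string    `json:"target_version"`
	UpdatedAt     time.Time `json:"updated_at"`
}

// persistDetails sketches the "persist upgrade details in their own file
// throughout the upgrade" idea: every state transition is written out
// immediately, so a restart in any state (even UPG_DOWNLOADING or
// UPG_EXTRACTING) could restore it.
func persistDetails(path string, d upgradeDetails) error {
	d.UpdatedAt = time.Now().UTC()
	b, err := json.Marshal(d)
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0600)
}
```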
If we just rolled back, is the previous version of the agent guaranteed to be next to the currently running agent in the data path? That is, is the state of the filesystem always something like:
data/
elastic-agent-current/
elastic-agent-next/
If this is true, can we look to see if there's another agent in the data directory next to us to detect if there was a roll back?
Two folders will exist the moment we go past the UPG_EXTRACTING state and, yes, two folders will also exist when we are in the UPG_ROLLBACK state. So I'm not sure if checking whether the number of folders is > 1 is sufficient to determine if we're about to upgrade or if we've just rolled back.
But I think there might be another solution to detecting if we're about to upgrade: the code that creates the Upgrade Marker file runs in the same process as the code that's watching the Upgrade Marker file for changes. As such, the former code can communicate to the latter in memory that we're about to upgrade. In the case of a rollback, this communication will not happen.
Let me explore this solution in a separate PR as it's not strictly related to this PR here.
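To make the in-memory idea concrete, here is a hedged sketch of one way that handoff could work (the type and method names are hypothetical, not necessarily what a follow-up PR would implement):

```go
package upgrade

// upgradeSignal lets the code that creates the Upgrade Marker tell the marker
// watcher, in memory, that an upgrade is about to start. If the watcher later
// sees a marker without having received this signal, a rollback is the more
// likely explanation. All names here are hypothetical.
type upgradeSignal struct {
	started chan struct{}
}

func newUpgradeSignal() *upgradeSignal {
	return &upgradeSignal{started: make(chan struct{}, 1)}
}

// MarkUpgradeStarting is called by the code that creates the Upgrade Marker.
func (s *upgradeSignal) MarkUpgradeStarting() {
	select {
	case s.started <- struct{}{}:
	default: // already signaled
	}
}

// UpgradeStarting is checked by the marker watcher when it notices a new marker.
func (s *upgradeSignal) UpgradeStarting() bool {
	select {
	case <-s.started:
		return true
	default:
		return false
	}
}
```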
The following states occur right before or after the upgrade marker file comes into existence and, as such, do get persisted to it: UPG_REPLACING, UPG_WATCHING, UPG_ROLLBACK. So if the Agent were to restart during one of these states, the upgrade details from the upgrade marker would be restored to the Coordinator state (and from there sent to Fleet).
What happens to the upgrade action if an upgrade is interrupted after it is started but before it is completed? Does the agent start over from the beginning? Does it acknowledge it as if the upgrade had happened even though it didn't? Does it never get acknowledged?
What does the watcher do if it starts up, sees an upgrade marker, but the version of the agent that is currently running isn't the version that should be running?
It would be surprising for a user to see an upgrade stuck in UPG_REPLACING, for example. If the upgrade is essentially aborted by a host reboot, then UPG_ROLLBACK could be considered the correct state. The UPG_WATCHING state always clears itself when the watcher stops, but I'm not sure what happens in this state in this situation today.
I think the ideal thing to happen in this situation is that the upgrade restarts from the beginning when the agent host system comes back online.
What happens to the upgrade action if an upgrade is interrupted after it is started but before it is completed? Does the agent start over from the beginning? Does it acknowledge it as if the upgrade had happened even though it didn't? Does it never get acknowledged?
Looking at the code, once the Upgrade Marker has been created, if the Agent restarts, it will acknowledge the upgrade with Fleet even if the upgrade may not have completed. In fact, it's entirely possible that Agent acknowledges the upgrade with Fleet while the Upgrade Watcher is still running and then the Upgrade Watcher decides to roll back the Agent. As far as Fleet is concerned, the upgrade would've been reported as successful, and then within 10 minutes, the previous version of Agent would start showing again without any explanation as to why.
With Upgrade Details, in the above rollback scenario, the Upgrade Watcher will record the state as UPG_ROLLBACK in the Upgrade Marker file, which will get picked up by the main Agent process and sent to Fleet.
What does the watcher do if it starts up, sees an upgrade marker, but the version of the agent that is currently running isn't the version that should be running?
Again, looking at the code...
First, if the Upgrade Watcher was running before an interruption stopped/killed it, and then the Upgrade Watcher was restarted, it will exit immediately because the watcher lock file, watcher.lock, will still exist. In this case, there are two possibilities as to which state will be reported in Upgrade Details to Fleet:
- If the Upgrade Marker contains Upgrade Details, the state recorded in it will be reported to Fleet. This should be the UPG_REPLACING state, as it's the last state that's persisted to the Upgrade Marker before the old Agent's upgrade code restarts the new Agent.
- If, for some reason, the Upgrade Marker does not contain Upgrade Details and the version of the running Agent is the same as the previous version recorded in the Upgrade Marker, UPG_ROLLBACK will be reported to Fleet.
If the Upgrade Watcher wasn't running yet when the upgrade process was interrupted, the Upgrade Watcher will start monitoring the Agent regardless of what version of Agent is currently running or what version is recorded in the Upgrade Marker file. In this case, UPG_WATCHING will be reported to Fleet. If the watch succeeds, the Upgrade Details will stop being reported to Fleet; note that the version of Agent being reported to Fleet will already be the new one in this case. If the watch fails and Agent has to be rolled back, UPG_ROLLBACK will be reported to Fleet.
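Condensed into a hedged Go sketch (the types and helper below are made up for illustration; this is not the actual watcher code), the behavior described above looks roughly like this:

```go
package upgrade

type upgradeState string

const (
	stateReplacing upgradeState = "UPG_REPLACING"
	stateWatching  upgradeState = "UPG_WATCHING"
	stateRollback  upgradeState = "UPG_ROLLBACK"
)

type markerDetails struct{ State upgradeState }

type upgradeMarker struct {
	PrevVersion string
	Details     *markerDetails
}

// decideReportedState condenses the decision logic described above into one function.
func decideReportedState(lockExists bool, marker upgradeMarker, runningVersion string) upgradeState {
	if lockExists {
		// A previous watcher instance still holds watcher.lock, so this one exits early.
		if marker.Details != nil {
			return marker.Details.State // typically UPG_REPLACING
		}
		if runningVersion == marker.PrevVersion {
			return stateRollback
		}
	}
	// Otherwise the watcher monitors the new Agent and reports UPG_WATCHING;
	// if the watch later fails and the Agent rolls back, UPG_ROLLBACK is reported.
	return stateWatching
}
```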
I think the ideal thing to happen in this situation is that the upgrade restarts from the beginning when the agent host system comes back online.
Agreed but I think restarting the upgrade from the beginning after a crash is beyond the scope of this PR so I've created #3860 to track this improvement.
Agreed, fixing this is outside the scope of this PR given your explanation.
I also think we acknowledge the upgrade too early since it isn't synchronized with the watcher, but that is also out of scope.
I tested it using the air-gapped test and it works. There is still what I believe is something similar to #3821, but it's because it's upgrading from one snapshot to another. Thus, all good for this PR.
// error context added by checkUpgradeDetailsState
	return err
}
Maybe we want to check for a final state of the upgrade, like COMPLETED or FAILED?
We're checking this a few lines later in this same file: https://github.com/elastic/elastic-agent/pull/3827/files#diff-4941bb4aa7d421d6eb86b572bc58968a061897540ef25923d5b2460bfdce10e8R289-R295
Just a small comment about checking the final state of an upgrade before asserting that the upgrade details disappear
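A hypothetical shape for that assertion (the fetchState parameter stands in for however the test reads the current upgrade details state; this is not the actual test helper):

```go
package upgradetest

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// waitForTerminalUpgradeState waits until the upgrade reports a terminal state
// before the test goes on to assert that the upgrade details disappear.
func waitForTerminalUpgradeState(t *testing.T, fetchState func() string) {
	require.Eventually(t, func() bool {
		switch fetchState() {
		case "UPG_COMPLETED", "UPG_FAILED", "UPG_ROLLBACK":
			return true
		}
		return false
	}, 2*time.Minute, 10*time.Second, "upgrade never reached a terminal state")
}
```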
This pull request is now in conflicts. Could you fix it? 🙏
This pull request is now in conflicts. Could you fix it? 🙏
Tested manually and confirmed it works, thanks!
All upgrade-related integration tests are passing now. But all Endpoint-related integration tests are failing now with the same symptom:
[EDIT] Downloaded a failing test's diagnostic and looked at the Endpoint logs within it. The earliest errors in the logs say this:
Right before those errors, there are these two logs about proxy URLs; not sure if those are relevant to the errors or not:
Failures don't seem related to the changes in this PR.
buildkite test this
… time that the upgrade is being watched (#3827) * Don't set upgrade details when creating upgrade marker * Set UPG_WATCHING state right before starting to watch upgrade * Log upgrade details whenever they're set on the coordinator * Fix logging location * Revert "Don't set upgrade details when creating upgrade marker" This reverts commit 6821832. * Fix logic with assuming UPG_ROLLBACK state * Add FIXME * Correctly observe upgrade details changes * Update unit test * Include upgrade details in status output * Check upgrade details state before and after upgrade watcher starts * Check that upgrade details have been cleared out upon successful upgrade * Update unit test * Fixing up upgrade integration tests * Add unit test + fix details object being used * Define AgentStatusOutput.IsZero() and use it * Make sure Marker Watcher accounts for `UPG_COMPLETED` state * Fix location of assertion * Fix error message * Join errors for wrapping * Debugging why TestStandaloneDowngradeToSpecificSnapshotBuild is failing * Cast string to details.State * Remove version override debugging * Wrap bugfix assertions in version checks * Introduce upgradetest.WithDisableUpgradeWatcherUpgradeDetailsCheck option * Call option function * Debugging * Fixing version check logic * Remove debugging statements (cherry picked from commit ad7e1b5)
… time that the upgrade is being watched (#3827) (#3927) * Don't set upgrade details when creating upgrade marker * Set UPG_WATCHING state right before starting to watch upgrade * Log upgrade details whenever they're set on the coordinator * Fix logging location * Revert "Don't set upgrade details when creating upgrade marker" This reverts commit 6821832. * Fix logic with assuming UPG_ROLLBACK state * Add FIXME * Correctly observe upgrade details changes * Update unit test * Include upgrade details in status output * Check upgrade details state before and after upgrade watcher starts * Check that upgrade details have been cleared out upon successful upgrade * Update unit test * Fixing up upgrade integration tests * Add unit test + fix details object being used * Define AgentStatusOutput.IsZero() and use it * Make sure Marker Watcher accounts for `UPG_COMPLETED` state * Fix location of assertion * Fix error message * Join errors for wrapping * Debugging why TestStandaloneDowngradeToSpecificSnapshotBuild is failing * Cast string to details.State * Remove version override debugging * Wrap bugfix assertions in version checks * Introduce upgradetest.WithDisableUpgradeWatcherUpgradeDetailsCheck option * Call option function * Debugging * Fixing version check logic * Remove debugging statements (cherry picked from commit ad7e1b5) Co-authored-by: Shaunak Kashyap <[email protected]>
What does this PR do?
This PR fixes the upgrade details such that they are in the UPG_WATCHING state the entire time the Agent upgrade is being watched by the Upgrade Watcher.
Why is it important?
So the state of the upgrade is accurately reported.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in ./changelog/fragments using the changelog tool (Bug was never released.)
How to test this PR locally
- Verify the upgrade details report UPG_WATCHING.
- Verify the upgrade details report UPG_WATCHING after UPG_RESTARTING.
Related issues