-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[A2-798] fix v2.1 upgrade issue for multi-node settings #501
Conversation
9a9f656
to
416e567
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 looks cleaner now :)
@@ -1338,12 +1338,29 @@ func (p *pg) Pristine(ctx context.Context) error { | |||
return p.recordMigrationStatus(ctx, enumPristine) | |||
} | |||
|
|||
func (p *pg) recordMigrationStatusAndNotifyPG(ctx context.Context, ms string) error { | |||
tx, err := p.db.BeginTx(ctx, nil /* use driver default */) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a naive question: don't you need a "defer some-kind-of-rollback" on this transaction? I don't see it anywhere else this construct is used, so I'm guessing the answer is no, but why is that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all in the ctx
, see this note:
automate/components/authz-service/storage/v2/postgres/postgres.go
Lines 58 to 63 in 02fcff1
// Note(sr): we're using BeginTx with the context that'll be cancelled in a | |
// `defer` when the function ends. This should rollback transactions that | |
// haven't been committed -- what would happen when any of the following | |
// `err != nil` cases return early. | |
// However, I haven't played with this extensively, so there's a bit of a | |
// chance that this understanding is just plain wrong. |
Now, whether not cancelling this myself is a problem or not (I haven't put a ctx, cancel := context.WithCancel(ctx); defer cancel()
in this function), I'll figure out now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: We don't need to cancel ourselves, grpc takes care of this -- see this toy experiment https://gist.github.com/srenatus/ce12b31ea517f16c024e4f8736fa5f2b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
329e8ef Added a cleanup accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And reverted the cleanup. Experiment must have been flawed, we need the context.WithCancel
.
// Engine updates need unfiltered access to all data. | ||
ctx = auth_context.ContextWithoutProjects(ctx) | ||
|
||
vsn, err := refresher.getIAMVersion(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not following all the intricacies of this PR, but I think this line is the crux: it retrieves the version from the DB instead of the local server state so that in a multi-node situation it will be updating the correct store. Is that right? If so, please add an in-code comment to that effect; thx!
d983380
to
557787c
Compare
will merge when green |
…nge event Since there's two different stores for these now, it matters in a multi-node setting which version the server is operating under. Since the version is derived from the migration status, which in turn is stored in the database, we make a change to that a "policy change event": any instance that is connected to the database, but hasn't been the one processing the migration GRPC call, will be informed that it needs to update its store. Figuring out which store to update is done by querying the database for its migration status; it no longer is retrieved from local state of the server instance. Note: this does not notify other nodes in unhappy paths, since a failure to migrate v1 policies should prohibit a switch to v2 (or v2.1). Signed-off-by: Stephan Renatus <[email protected]>
Signed-off-by: Stephan Renatus <[email protected]>
…geID Signed-off-by: Stephan Renatus <[email protected]>
Signed-off-by: Stephan Renatus <[email protected]>
Signed-off-by: Stephan Renatus <[email protected]>
Signed-off-by: Stephan Renatus <[email protected]>
557787c
to
c9eb738
Compare
|
Hrm maybe my experiment re: |
…el's" This reverts commit 892e989.
I believe my previous hypothesis was wrong, and not properly validated by the experiment. I've rolled that back in the reverted commit, and will investigate this a little more later. Signed-off-by: Stephan Renatus <[email protected]>
will merge when green |
🔩 Description
Since there's two different stores for these now, it matters in a
multi-node setting which version the server is operating under. Since
the version is derived from the migration status, which in turn is
stored in the database, we make a change to that a "policy change
event": any instance that is connected to the database, but hasn't been
the one processing the migration GRPC call, will be informed that it
needs to update its store.
Figuring out which store to update is done by querying the database for
its migration status; it no longer is retrieved from local state of the
server instance.
Note: this does not notify other nodes in unhappy paths, since a failure
to migrate v1 policies should prohibit a switch to v2 (or v2.1).
👍 Definition of Done
Stuff still works single-node. -- This is a best effort fix for getting switches between v2 and v2.1 in multi-node settings work, I don't believe we have any proper testing in place for this (I don't even know how to spin it up!)... 😅
👟 Demo Script / Repro Steps
NA
⛓️ Related Resources
✅ Checklist