
[A2-798] fix v2.1 upgrade issue for multi-node settings #501

Merged
merged 8 commits into from
Jun 6, 2019

Conversation

srenatus
Contributor

@srenatus srenatus commented Jun 5, 2019

🔩 Description

Since there are two different stores for these now, it matters in a
multi-node setting which version the server is operating under. Since
the version is derived from the migration status, which in turn is
stored in the database, we make a change to that status a "policy
change event": any instance that is connected to the database, but
hasn't been the one processing the migration gRPC call, will be
informed that it needs to update its store.

Figuring out which store to update is done by querying the database for
its migration status; it no longer is retrieved from local state of the
server instance.
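For illustration, that derivation boils down to a pure mapping from persisted migration state to an IAM version. A minimal sketch of the idea (the status names and version strings here are assumptions, not the actual A2 identifiers):

```go
package main

import "fmt"

// Hypothetical migration-status values -- the real enum lives in the
// authz-service database schema and its names may differ.
const (
	statusPristine       = "pristine"
	statusSuccessful     = "successful"         // v2 migration completed
	statusSuccessfulBeta = "successful-beta2.1" // v2.1 migration completed
)

// iamVersionFromStatus derives the IAM version from the persisted
// migration status. The point of the PR is that this derivation starts
// from database state, not from a server instance's local memory.
func iamVersionFromStatus(ms string) string {
	switch ms {
	case statusSuccessful:
		return "v2.0"
	case statusSuccessfulBeta:
		return "v2.1"
	default:
		return "v1.0"
	}
}

func main() {
	fmt.Println(iamVersionFromStatus(statusSuccessful)) // v2.0
}
```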

Note: this does not notify other nodes in unhappy paths, since a failure
to migrate v1 policies should prohibit a switch to v2 (or v2.1).

👍 Definition of Done

Stuff still works single-node. -- This is a best-effort fix for making switches between v2 and v2.1 work in multi-node settings; I don't believe we have any proper testing in place for this (I don't even know how to spin it up!)... 😅

👟 Demo Script / Repro Steps

NA

⛓️ Related Resources

✅ Checklist

  • Necessary tests added/updated?
  • Necessary docs added/updated?
  • Code actually executed?
  • Vetting performed (unit tests, lint, etc.)?

@srenatus srenatus added bug 🐛 Something isn't working automate-auth labels Jun 5, 2019
@srenatus srenatus self-assigned this Jun 5, 2019
@srenatus srenatus force-pushed the sr/a2-798/fix-v2.1-upgrade-issue branch 5 times, most recently from 9a9f656 to 416e567 Compare June 5, 2019 14:35
Contributor

@bcmdarroch bcmdarroch left a comment


👍 looks cleaner now :)

@@ -1338,12 +1338,29 @@ func (p *pg) Pristine(ctx context.Context) error {
return p.recordMigrationStatus(ctx, enumPristine)
}

func (p *pg) recordMigrationStatusAndNotifyPG(ctx context.Context, ms string) error {
tx, err := p.db.BeginTx(ctx, nil /* use driver default */)
Contributor


Probably a naive question: don't you need a "defer some-kind-of-rollback" on this transaction? I don't see it anywhere else this construct is used, so I'm guessing the answer is no, but why is that?

Contributor Author


It's all in the ctx, see this note:

// Note(sr): we're using BeginTx with the context that'll be cancelled in a
// `defer` when the function ends. This should rollback transactions that
// haven't been committed -- what would happen when any of the following
// `err != nil` cases return early.
// However, I haven't played with this extensively, so there's a bit of a
// chance that this understanding is just plain wrong.

Now, whether not cancelling this myself is a problem or not (I haven't put a `ctx, cancel := context.WithCancel(ctx); defer cancel()` in this function), I'll figure out now.

Contributor Author


Update: We don't need to cancel ourselves, gRPC takes care of this -- see this toy experiment: https://gist.github.com/srenatus/ce12b31ea517f16c024e4f8736fa5f2b

Contributor Author


Added a cleanup accordingly in 329e8ef.

Contributor Author


And reverted the cleanup. The experiment must have been flawed; we need the `context.WithCancel`.

// Engine updates need unfiltered access to all data.
ctx = auth_context.ContextWithoutProjects(ctx)

vsn, err := refresher.getIAMVersion(ctx)
Contributor


Not following all the intricacies of this PR, but I think this line is the crux: it retrieves the version from the DB instead of the local server state so that in a multi-node situation it will be updating the correct store. Is that right? If so, please add an in-code comment to that effect; thx!

@srenatus srenatus force-pushed the sr/a2-798/fix-v2.1-upgrade-issue branch 2 times, most recently from d983380 to 557787c Compare June 6, 2019 12:17
@srenatus
Contributor Author

srenatus commented Jun 6, 2019

will merge when green

srenatus added 6 commits June 6, 2019 16:16
…nge event

Since there are two different stores for these now, it matters in a
multi-node setting which version the server is operating under. Since
the version is derived from the migration status, which in turn is
stored in the database, we make a change to that status a "policy
change event": any instance that is connected to the database, but
hasn't been the one processing the migration gRPC call, will be
informed that it needs to update its store.

Figuring out which store to update is done by querying the database for
its migration status; it no longer is retrieved from local state of the
server instance.

Note: this does not notify other nodes in unhappy paths, since a failure
to migrate v1 policies should prohibit a switch to v2 (or v2.1).

Signed-off-by: Stephan Renatus <[email protected]>
@srenatus srenatus force-pushed the sr/a2-798/fix-v2.1-upgrade-issue branch from 557787c to c9eb738 Compare June 6, 2019 14:16
@srenatus
Contributor Author

srenatus commented Jun 6, 2019

⚠️ Unit tests for authz-service keep timing out because they don't finish within 10 minutes. This could be a symptom of some issue. TODO

@srenatus
Contributor Author

srenatus commented Jun 6, 2019

Hrm maybe my experiment re: context.WithCancel was misleading 🤔

srenatus added 2 commits June 6, 2019 16:43
I believe my previous hypothesis was wrong, and not properly validated
by the experiment.

I've rolled that back in the reverted commit, and will investigate this
a little more later.

Signed-off-by: Stephan Renatus <[email protected]>
@srenatus
Contributor Author

srenatus commented Jun 6, 2019

will merge when green

@srenatus srenatus merged commit 938518e into master Jun 6, 2019
@chef-ci chef-ci deleted the sr/a2-798/fix-v2.1-upgrade-issue branch June 6, 2019 15:15
@susanev susanev added the auth-team anything that needs to be on the auth team board label Jul 20, 2019