
[A2-798] fix v2.1 upgrade issue for multi-node settings #501

Merged
merged 8 commits into from
Jun 6, 2019

Conversation

srenatus
Contributor

@srenatus srenatus commented Jun 5, 2019

🔩 Description

Since there are two different stores for these now, it matters in a
multi-node setting which version the server is operating under. Since
the version is derived from the migration status, which in turn is
stored in the database, we make a change to that status a "policy
change event": any instance that is connected to the database, but
hasn't been the one processing the migration gRPC call, will be
informed that it needs to update its store.

Figuring out which store to update is done by querying the database for
its migration status; it no longer is retrieved from local state of the
server instance.
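For illustration, that derivation boils down to a pure mapping from persisted migration state to an IAM version. A minimal sketch of the idea (the status names and version strings here are assumptions, not the actual A2 identifiers):

```go
package main

import "fmt"

// Hypothetical migration-status values -- the real enum lives in the
// authz-service database schema and its names may differ.
const (
	statusPristine       = "pristine"
	statusSuccessful     = "successful"         // v2 migration completed
	statusSuccessfulBeta = "successful-beta2.1" // v2.1 migration completed
)

// iamVersionFromStatus derives the IAM version from the persisted
// migration status. The point of the PR is that this derivation starts
// from database state, not from a server instance's local memory.
func iamVersionFromStatus(ms string) string {
	switch ms {
	case statusSuccessful:
		return "v2.0"
	case statusSuccessfulBeta:
		return "v2.1"
	default:
		return "v1.0"
	}
}

func main() {
	fmt.Println(iamVersionFromStatus(statusSuccessful)) // v2.0
}
```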

Note: this does not notify other nodes in unhappy paths, since a failure
to migrate v1 policies should prohibit a switch to v2 (or v2.1).

👍 Definition of Done

Stuff still works single-node. -- This is a best-effort fix for making switches between v2 and v2.1 work in multi-node settings; I don't believe we have any proper testing in place for this (I don't even know how to spin it up!)... 😅

👟 Demo Script / Repro Steps

NA

⛓️ Related Resources

✅ Checklist

  • Necessary tests added/updated?
  • Necessary docs added/updated?
  • Code actually executed?
  • Vetting performed (unit tests, lint, etc.)?

@srenatus srenatus added bug 🐛 Something isn't working automate-auth labels Jun 5, 2019
@srenatus srenatus self-assigned this Jun 5, 2019
@srenatus srenatus force-pushed the sr/a2-798/fix-v2.1-upgrade-issue branch 5 times, most recently from 9a9f656 to 416e567 Compare June 5, 2019 14:35
Contributor

@bcmdarroch bcmdarroch left a comment


👍 looks cleaner now :)

@@ -1338,12 +1338,29 @@ func (p *pg) Pristine(ctx context.Context) error {
return p.recordMigrationStatus(ctx, enumPristine)
}

func (p *pg) recordMigrationStatusAndNotifyPG(ctx context.Context, ms string) error {
tx, err := p.db.BeginTx(ctx, nil /* use driver default */)
Contributor


Probably a naive question: don't you need a "defer some-kind-of-rollback" on this transaction? I don't see it anywhere else this construct is used, so I'm guessing the answer is no, but why is that?

Contributor Author


It's all in the ctx, see this note:

// Note(sr): we're using BeginTx with the context that'll be cancelled in a
// `defer` when the function ends. This should rollback transactions that
// haven't been committed -- what would happen when any of the following
// `err != nil` cases return early.
// However, I haven't played with this extensively, so there's a bit of a
// chance that this understanding is just plain wrong.

Now, whether not cancelling this myself is a problem or not (I haven't put a `ctx, cancel := context.WithCancel(ctx); defer cancel()` in this function), I'll figure out now.

Contributor Author


Update: We don't need to cancel ourselves, gRPC takes care of this -- see this toy experiment: https://gist.github.com/srenatus/ce12b31ea517f16c024e4f8736fa5f2b

Contributor Author


Added a cleanup accordingly in 329e8ef.

Contributor Author


And reverted the cleanup. The experiment must have been flawed; we need the `context.WithCancel`.

// Engine updates need unfiltered access to all data.
ctx = auth_context.ContextWithoutProjects(ctx)

vsn, err := refresher.getIAMVersion(ctx)
Contributor


Not following all the intricacies of this PR, but I think this line is the crux: it retrieves the version from the DB instead of the local server state so that in a multi-node situation it will be updating the correct store. Is that right? If so, please add an in-code comment to that effect; thx!

@srenatus srenatus force-pushed the sr/a2-798/fix-v2.1-upgrade-issue branch 2 times, most recently from d983380 to 557787c Compare June 6, 2019 12:17
@srenatus
Contributor Author

srenatus commented Jun 6, 2019

will merge when green

srenatus added 6 commits June 6, 2019 16:16
…nge event

Since there are two different stores for these now, it matters in a
multi-node setting which version the server is operating under. Since
the version is derived from the migration status, which in turn is
stored in the database, we make a change to that status a "policy
change event": any instance that is connected to the database, but
hasn't been the one processing the migration gRPC call, will be
informed that it needs to update its store.

Figuring out which store to update is done by querying the database for
its migration status; it no longer is retrieved from local state of the
server instance.

Note: this does not notify other nodes in unhappy paths, since a failure
to migrate v1 policies should prohibit a switch to v2 (or v2.1).

Signed-off-by: Stephan Renatus <[email protected]>
@srenatus srenatus force-pushed the sr/a2-798/fix-v2.1-upgrade-issue branch from 557787c to c9eb738 Compare June 6, 2019 14:16
@srenatus
Contributor Author

srenatus commented Jun 6, 2019

⚠️ Unit tests for authz-service keep timing out because they don't finish within 10 minutes. This could be a symptom of some issue. TODO

@srenatus
Contributor Author

srenatus commented Jun 6, 2019

Hrm maybe my experiment re: context.WithCancel was misleading 🤔

srenatus added 2 commits June 6, 2019 16:43
I believe my previous hypothesis was wrong, and not properly validated
by the experiment.

I've rolled that back in the reverted commit, and will investigate this
a little more later.

Signed-off-by: Stephan Renatus <[email protected]>
@srenatus
Contributor Author

srenatus commented Jun 6, 2019

will merge when green

@srenatus srenatus merged commit 938518e into master Jun 6, 2019
@chef-ci chef-ci deleted the sr/a2-798/fix-v2.1-upgrade-issue branch June 6, 2019 15:15
@susanev susanev added the auth-team anything that needs to be on the auth team board label Jul 20, 2019