Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[A2-798] split partial result update calls for v2 and v2.1 #467

Merged
merged 26 commits into from
Jun 5, 2019

Conversation

srenatus
Copy link
Contributor

@srenatus srenatus commented May 31, 2019

🔩 Description

authz-service uses OPA's partial eval feature to save time when authorizing requests by "pre-calculating" everything it can figure out without knowing the (variable) input data. However, that process itself takes time, any change to the relevant data (policies, roles, projects) requires a rebuild of that entire state.

When introducing IAM v2.1, we've taken a bit of a shortcut, and have rebuilt both the v2 and the v2.1 queries' partial eval state. This PR is about fixing that.

The need to fix it arises from the rebuild time, which is dependent on the data it's built with, increasing when the data increases (linearly? haven't checked) -- and it's become noticable, and annoying.

ℹ️ Notes

  • This PR also expands the current upgrade-to-v2-and-reset-to-v1 tests by going some extra loops through v2.1 as well, see integration/tests/upgrade_reset_iam_v2.sh
  • If you look at the commits, you'll find that I had to revert some of the cleanups I thought necessary. The need here was ingest-service, it seems to query the v2.1-related projects-rules stuff on ingest, regardless of whether A2 is using IAM v1, v2 or v2.1.

👍 Definition of Done

When using IAM v2.1, only the v2.1 partial eval result is rebuilt on updated data, and when using v2, only the v2 state is.

👟 Demo Script / Repro Steps

  • rebuild components/authz-service
  • upgrade to IAM v2.1 (chef-automate iam upgrade-to-v2 --beta2.1)
  • create two projects and 30 policies, measuring the time it takes:
    • projects foo and bar: curl -kH "api-token: $TOK" https://localhost/apis/iam/v2beta/projects -d "$(jq -n '{ id: "foo", name: "my foo project" }')" and curl -kH "api-token: $TOK" https://localhost/apis/iam/v2beta/projects -d "$(jq -n '{ id: "bar", name: "my bar project" }')"
[118][default:/src:0]# time for i in $(seq 1 30); do curl -kH "api-token: $TOK" https://localhost/apis/iam/v2beta/policies -d "$(sed "s/ID/$i/" pol.json)"; done
real    0m5.211s
user    0m0.252s
sys     0m0.060s

this is with pol.json containing

{
  "name": "testpolicy",
  "id": "testpolicy-id-ID",
  "members": [
    "team:local:viewers"
  ],
  "statements": [
    {
      "effect": "ALLOW",
      "actions": [],
      "role": "viewer",
      "projects": [
        "*"
      ]
    }
  ],
  "projects": [ "foo", "bar" ]
}
  • delete those policies,
[119][default:/src:2]# chef-automate dev psql chef_authz_service -- -c "delete from iam_policies where id ilike 'testpolicy-id-%'"
DELETE 30
  • downgrade to v2, redo timing the 30 inserts:
[121][default:/src:130]# chef-automate iam upgrade-to-v2

Upgrading to IAM v2
Migrating v1 policies...
Creating default teams Editors and Viewers...
Skipped: Editors team already exists
Skipped: Viewers team already exists

Migrating existing teams...

Success: Enabled IAM v2
[122][default:/src:0]# time for i in $(seq 1 30); do curl -kH "api-token: $TOK" https://localhost/apis/iam/v2beta/policies -d "$(sed "s/ID/$i/" _dump/pol.json)"; done
[output omitted]
real    8m41.688s
user    0m0.300s
sys     0m0.112s

⚠️ This is odd, isn't it? So, this doesn't solve the issue for IAM v2 customers; but it does promise us a bright and fast future in IAM v2.1. Now I don't really understand (yet) why v2.1 is so much faster to rebuild, but at least we don't drag v2 around 😅. Also, not updating v2.1 when updating v2 takes some of the work away; combined with not updating the engine store twice, this hopefully helps a little.

⛓️ Related Resources

  • A2-798 👈 this is about the perceivable delay 🐛 this hopes to improve.

✅ Checklist

  • Necessary tests added/updated?
  • Necessary docs added/updated?
  • Code actually executed?
  • Vetting performed (unit tests, lint, etc.)?

@srenatus srenatus added automate-auth iamv2 This issue or pull request applies to iamv2 work for Automate tech debt This issue addresses tech debt in our code base labels May 31, 2019
@srenatus srenatus self-assigned this May 31, 2019
@srenatus srenatus force-pushed the sr/a2-798/split-upgrade-calls-for-v2-and-v2.1 branch 5 times, most recently from 24cec04 to 25e452c Compare June 3, 2019 14:01
Copy link
Contributor Author

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some reviewer notes inside

@@ -704,16 +705,12 @@ func (s *policyServer) EngineUpdateInterceptor() grpc.UnaryServerInterceptor {
// do nothing
}

s.log.Debugf("Initiating store update for %s", info.FullMethod)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 drive-by bug fix

@@ -119,6 +116,10 @@ func NewPoliciesServer(
}
srv.setVersion(v)

if err := srv.updateEngineStore(ctx); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ this needs to happen after setVersion, so the proper store will be updated in the process (v2 vs v2.1). (v1 is done per-call in its corresponding handlers)

}
refresher.refreshRequests <- m
return m.Err()
}

// TODO: handle version: v2 vs v2.1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix that when we get there? 😉

@srenatus srenatus force-pushed the sr/a2-798/split-upgrade-calls-for-v2-and-v2.1 branch 2 times, most recently from 2bbafaf to f0b1792 Compare June 3, 2019 14:56
@srenatus srenatus added the WIP label Jun 3, 2019
@srenatus srenatus force-pushed the sr/a2-798/split-upgrade-calls-for-v2-and-v2.1 branch 2 times, most recently from f334564 to dcf74c0 Compare June 3, 2019 18:12
@@ -185,10 +185,10 @@ func (refresher *policyRefresher) updateEngineStore(ctx context.Context, vsn api
}

switch {
case vsn.Minor == api.Version_V1: // v2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E_IMANIDIOT

@srenatus srenatus force-pushed the sr/a2-798/split-upgrade-calls-for-v2-and-v2.1 branch 2 times, most recently from aa61614 to d46aaed Compare June 4, 2019 12:21
@srenatus srenatus requested a review from a team as a code owner June 4, 2019 13:34
@srenatus srenatus removed the WIP label Jun 4, 2019
return refresher.engine.V2SetPolicies(ctx, policyMap, roleMap, ruleMap)
}
// Note 2019/06/04 (sr): v1?! Yes, IAM v1. Our POC code depends on this query to be
// answered regardless of whether IAM is v1, v2 or v2.1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helpful comment!

@@ -295,3 +313,7 @@ func (refresher *policyRefresher) getRuleMap(_ context.Context) (map[string][]in
}
return data, nil
}

func pretty(vsn api.Version) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tenor-62368220

Copy link
Contributor

@msorens msorens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and the world is just a little bit better now.

if err != nil {
return err
}
// Need a compiler for use by regular queries, too, so store it here.
s.compiler = compiler

// Partial eval for authzV2Query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest delete this comment, as well as L194-195; otherwise, make them more parallel to each other (on the same relative line, etc.).

return dumpData(ctx, s.store, s.log)
}

func dumpData(ctx context.Context, store storage.Store, l logger.Logger) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest move dumpData below DumpDataV2p1 so DumpData* are all contiguous.

@@ -235,17 +223,28 @@ func (s *State) initCompiler() (*ast.Compiler, error) {
// DumpData is a bit fast-and-loose when it comes to error checking; it's not meant
// to be used in production
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a further comment mentioning where you might inject any of the DumpData* calls?

@@ -636,6 +636,20 @@ func (s *State) V2SetPolicies(
return s.initPartialResultV2(ctx)
}

// V2p1SetPolicies replaces OPA's data with a new set of policies and roles,
// and resets the partial evaluation cache for v2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/v2/v2p1/

s.v2Store = store
s, err := New(ctx, l)
require.NoError(b, err, "init state")
s.v2Store = store
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move the store outside the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to only benchmark the relevant bits, not what happens in s.New. Setting the s.v2Store is not relevant, either, but also likely not a thing that influences measurements too much.

if err := srv.updateEngineStore(ctx); err != nil {
return nil, errors.Wrapf(err, "initialize engine storage (%v)", v)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I infer it is better to updateEngineStore after ApplyV2DataMigrations...? If so, consider adding a brief in-code comment mentioning why.

curPolicyID, err := refresher.store.GetPolicyChangeID(ctx)
if err != nil {
refresher.log.WithError(err).Warn("Failed to get current policy change ID")
return lastPolicyID, err
}
if curPolicyID != lastPolicyID {
if curPolicyID != lastPolicyID || forceUpdate {
refresher.log.WithFields(logrus.Fields{
"lastPolicyID": lastPolicyID,
"curPolicyID": curPolicyID,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not from your PR, but please rename curPolicyID and lastPolicyID to curPolicyChangeID and lastPolicyChangeID...
( I saw this code and was thinking "What the heck is a 'current policy' and why do we care?" 🤔 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll touch this code soon to fix the multi-node issue mentioned in the comments. I'll probably do this, then.

return setupV2(t, nil, writer, nil, nil)
writer engine.V2pXWriter) testSetup {
return setupV2WithMigrationState(t, nil, writer, nil, make(chan api_v2.Version, 1),
func(s storage.MigrationStatusProvider) error { return s.Success(context.Background()) }) // IAM v2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the comment on the line communicates anything additional here...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does. I'll elaborate.

_ context.Context, policies map[string]interface{},
ctx context.Context, policies map[string]interface{},
roles map[string]interface{}, rules map[string][]interface{}) error {
return te.V2p1SetPolicies(ctx, policies, roles, rules)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a comment explaining why it is OK to use the same test code for v2 and v2p1 here.

@@ -33,41 +33,28 @@ hab_curl() {
do_test_deploy() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@srenatus
Copy link
Contributor Author

srenatus commented Jun 5, 2019

So this most likely has issues with switching between v2 and v2.1 in a multi-node setting. I'm not sure how this is currently tested, so thinking hard is the only method we have right now for figuring this out. I'm torn between adding a card outlining the problematic scenario; and adding some commits on top of this PR. I'd rather not do the latter, but since I have no idea who depends on this functionality when, I'm probably opting for that anyways.

srenatus added 6 commits June 5, 2019 10:59
Also: simplify module override logic a bit.

Signed-off-by: Stephan Renatus <[email protected]>
Signed-off-by: Stephan Renatus <[email protected]>
This should allows us to shave a bit off our policy creation/update
times, by only running what we should need.

I'm not completely certain that this captures all issues potentially
arising in the multi-node setup, but it should be fine for all-in-one.

Signed-off-by: Stephan Renatus <[email protected]>
Signed-off-by: Stephan Renatus <[email protected]>
srenatus added 18 commits June 5, 2019 10:59
The stores are empty -- any responses based on these will be wrong.
So, don't bother initializing.

Signed-off-by: Stephan Renatus <[email protected]>
...as when the version changes:

Without this flag, we'd get into the following situation when flipping
from v2 to v2.1 or vice-versa:

1. the server's state would flip (say, v2 -> v2.1)
2. the policy refresher would notice that nothing changed in the policy data
3. and hence would not update the engine store

as a result, the v2.1 engine store would remain non-functional.

This would lead to the `chef-automate iam upgrade-to-v2 --beta2.1` command
failing to list teams,

    APIError: An API error occurred during execution: Failed to retrieve
    team "editors": Failed to retrieve admins team: rpc error: code =
    PermissionDenied desc = error authorizing action "iam:teams:list" on
    resource "iam:teams" for subjects
    ["tls:service:deployment-service:40adf15d875d3190de6c24d0862804cf0d656be656f60234a8714a44563d5518"]:
    rpc error: code = Internal desc = error in query evaluation: cannot
    evaluate empty query

I'm not sure if this is the best way, or if the IAM version should be
reflected in the last change ID...

Signed-off-by: Stephan Renatus <[email protected]>
This is a robustness measure -- without this, we might be answering the
wrong versioned queries when we really shouldn't.

Only relevant for scenarios where we upgrade, though, as usually, only
one of the three should be initialized on startup.

Signed-off-by: Stephan Renatus <[email protected]>
Plain v2 knows nothing about rules, and I think that's how it should
stay.

Signed-off-by: Stephan Renatus <[email protected]>
yeah well if it's wrong it doesn't work.

Signed-off-by: Stephan Renatus <[email protected]>
On v2 server init, the version wasn't picked up: it depended on both the
version channel being non-nil, AND on the migration status NOT having
transitioned to failure. In the v2 memstore implementation, that was
possible.

Signed-off-by: Stephan Renatus <[email protected]>
This reverts commit 6f92330.

Turns out the project rules stuff depends on this to work with IAM v2.
Signed-off-by: Stephan Renatus <[email protected]>
@srenatus srenatus force-pushed the sr/a2-798/split-upgrade-calls-for-v2-and-v2.1 branch from 1cecfc9 to 36c4481 Compare June 5, 2019 08:59
@srenatus srenatus force-pushed the sr/a2-798/split-upgrade-calls-for-v2-and-v2.1 branch from 36c4481 to 069a19d Compare June 5, 2019 09:06
@srenatus srenatus merged commit 8df4b34 into master Jun 5, 2019
@chef-ci chef-ci deleted the sr/a2-798/split-upgrade-calls-for-v2-and-v2.1 branch June 5, 2019 10:16
@susanev susanev added the auth-team anything that needs to be on the auth team board label Jul 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auth-team anything that needs to be on the auth team board automate-auth iamv2 This issue or pull request applies to iamv2 work for Automate tech debt This issue addresses tech debt in our code base
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants