feat(flags): add quota limiting for feature flags served up with /decide and to local_evaluation #28564
base: master
Conversation
PR Summary
This PR adds quota limiting to the /decide endpoint to restrict feature flag access for teams exceeding usage limits, with controlled rollout capabilities.
- Added the DECIDE_FEATURE_FLAG_QUOTA_CHECK setting in /posthog/settings/web.py to control the quota limiting rollout
- Extracted flag computation into compute_feature_flags() in /posthog/api/decide.py for better maintainability
- Added FEATURE_FLAGS to the QuotaResource enum in /ee/billing/quota_limiting.py with a zero overage buffer
- Returns a structured response with empty flags when quota limited: {"quotaLimited": ["feature_flags"], "featureFlags": {}}
- Added comprehensive test coverage in /posthog/api/test/test_decide.py for quota limiting scenarios
5 files reviewed, 1 comment
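Taken together, the gating described in the summary looks roughly like the sketch below. compute_feature_flags and the setting name come from this PR's summary; list_limited_team_attributes, its signature, and the wrapper function are assumptions about the existing quota-limiting helpers, not a verbatim excerpt from the diff.

```python
# Hedged sketch only: names outside this PR's summary are assumptions.
from django.conf import settings

from ee.billing.quota_limiting import (
    QuotaLimitingCaches,
    QuotaResource,
    list_limited_team_attributes,  # assumed existing helper returning limited tokens
)
from posthog.api.decide import compute_feature_flags  # extracted in this PR; signature assumed


def feature_flags_for_decide(team, data, api_version):
    if settings.DECIDE_FEATURE_FLAG_QUOTA_CHECK:
        limited_tokens = list_limited_team_attributes(
            QuotaResource.FEATURE_FLAGS, QuotaLimitingCaches.QUOTA_LIMITER_CACHE_KEY
        )
        if team.api_token in limited_tokens:
            # Over quota: serve no flags and tell the SDK why.
            return {"quotaLimited": ["feature_flags"], "featureFlags": {}}
    # Under quota (or rollout variable off): evaluate flags as usual.
    return compute_feature_flags(team, data, api_version)
```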
ee/billing/quota_limiting.py (outdated)
```diff
@@ -56,6 +57,7 @@ class QuotaLimitingCaches(Enum):
     QuotaResource.EVENTS: 0,
     QuotaResource.RECORDINGS: 1000,
     QuotaResource.ROWS_SYNCED: 0,
+    QuotaResource.FEATURE_FLAGS: 0,  # TODO: should we have a buffer here?
```
I'm open to calling this feature_flags_evaluated instead; that might be clearer (I like how rows_synced is more obvious than just events).
I think FEATURE_FLAGS_EVALUATED makes sense. That way it's not confused with how many feature flags a team has, which I assume is NOT quota limited?
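For context on the buffer question, here is a toy illustration of what the overage buffer does; the dict and function names are made up for the example and are not the real limiter code, though the values mirror the diff above.

```python
# Toy example only: names are hypothetical, values mirror the diff above.
OVERAGE_BUFFER = {
    "events": 0,
    "recordings": 1000,
    "rows_synced": 0,
    "feature_flags": 0,  # zero buffer: limiting starts as soon as the quota is exceeded
}


def is_over_quota(resource: str, usage: int, limit: int) -> bool:
    # Recordings get 1000 units of slack beyond the paid limit; flags get none.
    return usage > limit + OVERAGE_BUFFER[resource]
```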
```python
    active_flags = {key: value for key, value in feature_flags.items() if value}

    if api_version == 2:
        flags_response["featureFlags"] = active_flags
    elif api_version >= 3:
        # v3 returns all flags, not just active ones, as well as if there was an error computing all flags
        flags_response["featureFlags"] = feature_flags
        flags_response["errorsWhileComputingFlags"] = errors
        flags_response["featureFlagPayloads"] = feature_flag_payloads
    else:
        # default v1
        flags_response["featureFlags"] = list(active_flags.keys())

    # metrics for feature flags
    team_id_label = label_for_team_id_to_track(team.pk)
    FLAG_EVALUATION_COUNTER.labels(
        team_id=team_id_label,
        errors_computing=errors,
        has_hash_key_override=bool(data.get("$anon_distinct_id")),
    ).inc()

    if is_request_sampled_for_logging:
        logger.warn(
            "DECIDE_REQUEST_SUCCEEDED",
            team_id=team.id,
            distinct_id=distinct_id,
            errors_while_computing=errors or False,
            has_hash_key_override=bool(data.get("$anon_distinct_id")),
        )
else:
    flags_response["featureFlags"] = {}
```
this code was all moved to get_feature_flags_response
I wish GitHub supported semantic diffs to make it easier to review changes like this. 😆
yeah, that'd be so nice. Sometimes I find the side-by-side view to help with it.
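For readers skimming the moved block above, the per-version response shapes look roughly like this; the flag names, variants, and payload values are invented for illustration.

```python
# Illustrative only: flag names, variants, and payloads are made up.
v1_response = {"featureFlags": ["beta-feature", "pricing-test"]}  # active flag keys only
v2_response = {"featureFlags": {"beta-feature": True, "pricing-test": "control"}}  # active flags with variants
v3_response = {
    "featureFlags": {"beta-feature": True, "pricing-test": "control", "inactive-flag": False},
    "errorsWhileComputingFlags": False,
    "featureFlagPayloads": {"beta-feature": {"accent": "blue"}},
}
```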
posthog/api/decide.py (outdated)
```python
# NOTE: Whenever you add something to decide response, update this test:
# `test_decide_doesnt_error_out_when_database_is_down`
# which ensures that decide doesn't error out when the database is down

if feature_flags:
    # Billing analytics for decide requests with feature flags
    # Don't count if all requests are for survey targeting flags only.
    if not all(flag.startswith(SURVEY_TARGETING_FLAG_PREFIX) for flag in feature_flags.keys()):
        # Sample no. of decide requests with feature flags
        if settings.DECIDE_BILLING_SAMPLING_RATE and random() < settings.DECIDE_BILLING_SAMPLING_RATE:
            count = int(1 / settings.DECIDE_BILLING_SAMPLING_RATE)
            increment_request_count(team.pk, count)
```
this code was also moved to compute_feature_flags, since it's only relevant if we actually evaluated flags.
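As a worked example of the sampling math in that block (the 0.1 rate below is an assumption, not the production setting):

```python
from random import random

DECIDE_BILLING_SAMPLING_RATE = 0.1  # assumed value for illustration


def billing_increment_for_one_decide_request() -> int:
    # Roughly 1 in 10 requests passes the random() check, and each sampled
    # request counts as int(1 / 0.1) == 10, so the expected total tracks the
    # true request volume without incrementing on every request.
    if DECIDE_BILLING_SAMPLING_RATE and random() < DECIDE_BILLING_SAMPLING_RATE:
        return int(1 / DECIDE_BILLING_SAMPLING_RATE)
    return 0
```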
```diff
@@ -60,6 +60,9 @@
 # if `true` we disable session replay if over quota
 DECIDE_SESSION_REPLAY_QUOTA_CHECK = get_from_env("DECIDE_SESSION_REPLAY_QUOTA_CHECK", False, type_cast=str_to_bool)

+# if `true` we disable feature flags if over quota
+DECIDE_FEATURE_FLAG_QUOTA_CHECK = get_from_env("DECIDE_FEATURE_FLAG_QUOTA_CHECK", False, type_cast=str_to_bool)
```
this could be an array of team IDs to either explicitly limit or explicitly not limit.
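A possible shape for that suggestion, sketched with a hypothetical variable name and helper (neither is part of this PR):

```python
import os


def _team_ids_from_env(var_name: str) -> set[int]:
    # "2,42,1337" -> {2, 42, 1337}; empty or unset -> empty set
    raw = os.environ.get(var_name, "")
    return {int(part) for part in raw.split(",") if part.strip()}


# Hypothetical setting: only these teams would be quota checked (or excluded,
# depending on which direction we pick).
DECIDE_FEATURE_FLAG_QUOTA_CHECK_TEAMS = _team_ids_from_env("DECIDE_FEATURE_FLAG_QUOTA_CHECK_TEAMS")
```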
Just curious, why wouldn't we control this with our own feature flag?
I guess we could! I was just following the pattern of the other environment variable controls. We've generally avoided using feature flags that are evaluated within /decide itself, though (this variable is passed into get_decide).
> We've generally avoided using feature flags that are evaluated within /decide itself

Ha! I guess that could get recursive. I'm not suggesting making this change. Just curious.
ee/billing/quota_limiting.py (outdated)
```diff
@@ -623,8 +646,9 @@ def update_all_orgs_billing_quotas(
     for org_id in orgs_with_changes:
         properties = {
             "quota_limited_events": quota_limited_orgs["events"].get(org_id, None),
-            "quota_limited_recordings": quota_limited_orgs["events"].get(org_id, None),
+            "quota_limited_recordings": quota_limited_orgs["recordings"].get(org_id, None),
```
Is this a bug fix? Has the recordings quota limit been wrong this whole time?
oh wow, I didn't even realize I did this; it was a Cursor tab-complete while I was adding the quota_limited_flags stuff. I'm gonna tag the replay team so they know about this, since I don't want to change it without them signing off, but it's also possible they weren't even aware this was broken. @PostHog/team-replay you folks want to take a look at this?
🙈 that wouldn't surprise me.
I've never touched quota limiting and don't know how I'd check if it was broken 🤔
Scarier still, there were no failing tests before or after the change 😱
turns out it's been that way since being introduced. That PR talks about tracking, so is this just what we're reporting to ourselves, or the actual value we use for limiting?
lol, I wonder if it's because in the one test where we check the value of that entry, we also assert that it's the same value as the quota_limited_events field:

```python
assert org_action_call.kwargs.get("properties") == {
    "quota_limited_events": 1612137599,
    "quota_limited_recordings": 1612137599,
    "quota_limited_rows_synced": None,
}
```
🙈
you nailed it though; that object is just used in report_organization_action, so it's just for internal reporting:

```python
for org_id in orgs_with_changes:
    properties = {
        "quota_limited_events": quota_limited_orgs["events"].get(org_id, None),
        "quota_limited_recordings": quota_limited_orgs["recordings"].get(org_id, None),
        "quota_limited_rows_synced": quota_limited_orgs["rows_synced"].get(org_id, None),
        "quota_limited_feature_flags": quota_limited_orgs["feature_flags"].get(org_id, None),
    }
    report_organization_action(
        orgs_by_id[org_id],
        "organization quota limits changed",
        properties=properties,
        group_properties=properties,
    )
```
So, no actual quota limiting bug, but a quota limit reporting bug!
are you ok to fix the test as well here? I don't understand the impact, but I'm guessing quota limiting on the correct count is better than not, regardless of whether it's just reporting.
oh yeah, I'm on it; there's a bunch of other tests I want to fix too. I need to assert the presence of this new quota limit in way more places lol
awesome... thanks!
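For illustration, the updated assertion might simply grow the new key; the values below are placeholders mirroring the existing test, and using distinct per-resource values in the test setup would also have caught the events/recordings mix-up.

```python
# Placeholder values only; the quota_limited_feature_flags key is from this PR.
assert org_action_call.kwargs.get("properties") == {
    "quota_limited_events": 1612137599,
    "quota_limited_recordings": 1612137599,
    "quota_limited_rows_synced": None,
    "quota_limited_feature_flags": None,
}
```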
Whoops, I meant to add some comments to my review. It looks good to me, but obviously I'm still learning the system. This gives me an idea for a possible future feature: a "rate-limited" feature flag. Basically a feature flag with a counter per group, and when it's hit, the flag returns …
I don't have a lot of context on the decide/flag stuff, but the quota limiting stuff is good 👍. Nice fix on the recordings issue.
```python
    QuotaResource.FEATURE_FLAGS, QuotaLimitingCaches.QUOTA_LIMITER_CACHE_KEY
)
if self.team.api_token in limited_tokens_flags:
    return Response(
```
Should we be responding with an error here so they know they are being limited?
yeah, I was on the fence about this too. I know we can't fail decide because of quota limiting and instead have to return a different permutation of the decide response, but maybe with local eval I can be more fail-y. I haven't really looked at the SDKs to see how they handle it, but we're already pretty happy to fail local eval with a 429 if we hit the customer rate limits, so perhaps it's worth failing for a quota limit, too. Good take. Let me try this and see how it feels.
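A hypothetical sketch of that "more fail-y" option for local_evaluation; the payload shape and the choice of 429 are assumptions about what this could look like, not the merged behavior.

```python
from rest_framework import status
from rest_framework.response import Response


def quota_limited_local_evaluation_response() -> Response:
    # Hypothetical: mirror the existing customer rate-limit behavior (429) so
    # SDKs that already back off on 429s handle quota limiting the same way.
    return Response(
        {
            "type": "quota_limited",
            "code": "feature_flags_quota_exceeded",
            "detail": "Feature flag evaluation is over the billing quota for this project.",
        },
        status=status.HTTP_429_TOO_MANY_REQUESTS,
    )
```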
```python
if feature_flags:
    # Billing analytics for decide requests with feature flags
    # Don't count if all requests are for survey targeting flags only.
    if not all(flag.startswith(SURVEY_TARGETING_FLAG_PREFIX) for flag in feature_flags.keys()):
        # Sample no. of decide requests with feature flags
        if settings.DECIDE_BILLING_SAMPLING_RATE and random() < settings.DECIDE_BILLING_SAMPLING_RATE:
            count = int(1 / settings.DECIDE_BILLING_SAMPLING_RATE)
            increment_request_count(team.pk, count)
```
this, and other logging, is moved into _record_feature_flag_metrics
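Based on the counter and logging code shown earlier in the diff, the helper might look roughly like this; the exact signature is an assumption.

```python
def _record_feature_flag_metrics(team, data, errors, distinct_id, is_request_sampled_for_logging):
    # Prometheus counter for flag evaluations (moved from the inline decide code).
    team_id_label = label_for_team_id_to_track(team.pk)
    FLAG_EVALUATION_COUNTER.labels(
        team_id=team_id_label,
        errors_computing=errors,
        has_hash_key_override=bool(data.get("$anon_distinct_id")),
    ).inc()

    # Sampled structured log for successful decide requests.
    if is_request_sampled_for_logging:
        logger.warn(
            "DECIDE_REQUEST_SUCCEEDED",
            team_id=team.id,
            distinct_id=distinct_id,
            errors_while_computing=errors or False,
            has_hash_key_override=bool(data.get("$anon_distinct_id")),
        )
```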
Problem
Part 1 of 2 of issue #28223: this change adds quota limiting to the /decide and /local_evaluation endpoints so that we stop serving flags to users who are not paying for them. A few major things to note here before I explain more:

Changes
This change uses a lot of the existing infrastructure for quota limiting recordings, so not a lot of code needed to be changed. I just did the following:
- FEATURE_FLAGS_EVALUATED as a new quota-limited resource
- A check in /decide that, if we pass in an environment variable that turns on quota limiting, returns a quota-limited flag response

I also refactored the logical flow in /decide a bit (which mostly consisted of pulling the flag computation logic out into a separate function). This makes it IMO a bit easier to read, and it'll also make it easier for me to redirect the flag computation to a separate service when I'm testing the new service in production.