
Further improve how we minimize the possibility of GitHub AbuseException #188

Open
4 tasks
Jericho opened this issue Mar 19, 2021 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@Jericho
Member

Jericho commented Mar 19, 2021

We seem to be triggering GitHub's abuse detection more frequently as of late. As a result, our automated process raises an issue to add/modify/delete a given yaml file but never submits the corresponding PR.

It could be caused by another process running under the cake-contrib-bot user and making GitHub API calls around the same time the discoverer is scheduled (need to ask the @cake-contrib/cake-team if any new process is running), or maybe GitHub has changed the heuristics they use to determine if a process is abusing their API. To my knowledge, they don't disclose what these heuristics are, so there's not much we can do about it. Either @devlead or @gep13 (it's been a long time, so I don't remember who it was) talked to someone at GitHub about it at some point, and they acknowledged having heuristics in place above and beyond the publicly disclosed "rate limits", but they have not disclosed what they are.

That's why I added some defense mechanisms in the discoverer such as:

  • don't delete more than 75 yaml files per scheduled run
  • don't modify more than 75 yaml files per scheduled run
  • don't create more than 75 yaml files per scheduled run
  • delay after processing each yaml file. This delay used to be a fixed 1 second; it was randomized (between 600 milliseconds and 2 seconds) last month. Maybe the fact that the delay can now be as low as 600 milliseconds is part of the problem???
  • check the result of Octokit's .GetLastApiInfo() and stop processing if there are fewer than an arbitrary but reasonable (as determined by me) number of calls remaining. A rough sketch of these last two mechanisms follows this list.
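For illustration, this is roughly the shape of those two safeguards with Octokit.net. It's a simplified sketch, not the actual AddinDiscoverer code; the class, method and threshold names are made up for this example.

```csharp
// Simplified sketch of the randomized delay and the GetLastApiInfo() check.
// Not the actual AddinDiscoverer code; names and values are illustrative.
using System;
using System.Threading.Tasks;
using Octokit;

public static class ThrottlingSketch
{
    private const int MinimumRemainingApiCalls = 250; // the current threshold mentioned above
    private static readonly Random _random = new Random();

    // Returns true if it is safe to keep processing yaml files, false if we should stop.
    public static async Task<bool> ThrottleAndCheckAsync(GitHubClient client)
    {
        // Randomized delay between yaml files (currently 600 ms to 2 seconds).
        await Task.Delay(TimeSpan.FromMilliseconds(_random.Next(600, 2000))).ConfigureAwait(false);

        // GetLastApiInfo() reflects the rate limit headers from the most recent API response.
        var remaining = client.GetLastApiInfo()?.RateLimit?.Remaining;

        // Stop processing when we get too close to the hourly limit.
        return !remaining.HasValue || remaining.Value > MinimumRemainingApiCalls;
    }
}
```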

But evidently, these are not sufficient, so I propose the following:

  • don't process more than 75 yaml files TOTAL per scheduled run (as opposed to 75 delete, 75 modify and 75 create). I don't think this will make a big difference because we rarely delete/modify/add more than 75 files at a time but better safe than sorry.
  • double the random delay (1200 milliseconds to 4 seconds)
  • increase the "minimum number of remaining calls" threshold. Currently 250, maybe increase it to 500???
  • display more info in the log when we trigger AbuseException. e.g.: what was the remaining number of API calls prior to the exception, how many yaml files were processed prior to the exception, etc. Maybe this could help us figure out why we are triggering it.

Now that I think about it, I should implement the change to display additional info in the log before changing anything else. Maybe this new information will help me understand the root cause of this problem.
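To make that concrete, here is a minimal sketch of what the extra logging could look like. It assumes Octokit.net's AbuseException (which exposes a RetryAfterSeconds value); the processing delegate, the counter and the use of Console.WriteLine as the "logger" are placeholders, not actual AddinDiscoverer code.

```csharp
// Hypothetical sketch of the extra diagnostics: catch AbuseException and record
// the rate limit state plus how many yaml files had been processed so far.
using System;
using System.Threading.Tasks;
using Octokit;

public static class AbuseDiagnosticsSketch
{
    public static async Task ProcessWithDiagnosticsAsync(
        GitHubClient client,
        Func<Task> processOneYamlFile,   // placeholder for the real per-file work
        int filesProcessedSoFar)
    {
        try
        {
            await processOneYamlFile().ConfigureAwait(false);
        }
        catch (AbuseException e)
        {
            var rateLimit = client.GetLastApiInfo()?.RateLimit;
            Console.WriteLine(
                $"Abuse detection triggered after {filesProcessedSoFar} yaml files. " +
                $"Remaining API calls: {rateLimit?.Remaining} of {rateLimit?.Limit}, " +
                $"reset at {rateLimit?.Reset:u}. Retry-After: {e.RetryAfterSeconds} seconds.");
            throw;
        }
    }
}
```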

@Jericho Jericho added the enhancement New feature or request label Mar 19, 2021
@Jericho Jericho added this to the 3.72.0 milestone Mar 19, 2021
@Jericho Jericho self-assigned this Mar 19, 2021
@augustoproiete
Member

Maybe the fact that the delay can now be as low as 600 milliseconds is part of the problem???

Probably yes, according to GitHub's docs:

"If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request."

So maybe a random wait between 1.5 and 4 seconds would trigger the rate limit less often?


One other thing you might want to consider is to read the Retry-After response header when you get the exception, and pause until the time they say it's OK to try again. Maybe Polly policies to retry? (see the sketch below)

From the docs:

"When you have been limited, use the Retry-After response header to slow down. The value of the Retry-After header will always be an integer, representing the number of seconds you should wait before making requests again. For example, Retry-After: 30 means you should wait 30 seconds before sending more requests".

https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-abuse-rate-limits
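Something along these lines, perhaps. This is just a sketch assuming Octokit.net's AbuseException and Polly; the retry count and the 60-second fallback are arbitrary values chosen for illustration.

```csharp
// Retry GitHub calls that trip the abuse detection, honouring Retry-After when provided.
using System;
using System.Threading.Tasks;
using Octokit;
using Polly;

public static class RetryAfterSketch
{
    public static Task<T> ExecuteWithRetryAsync<T>(Func<Task<T>> gitHubCall)
    {
        return Policy
            .Handle<AbuseException>()
            .WaitAndRetryAsync(
                3,                                  // arbitrary number of retries
                (attempt, exception, context) =>    // how long to wait before each retry
                {
                    var abuse = exception as AbuseException;
                    return TimeSpan.FromSeconds(abuse?.RetryAfterSeconds ?? 60);
                },
                (exception, sleep, attempt, context) => Task.CompletedTask)
            .ExecuteAsync(gitHubCall);
    }
}
```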

@Jericho
Member Author

Jericho commented Mar 19, 2021

So maybe a random wait between 1.5 and 4 seconds would trigger the rate limit less often?

<sarcasm>Thanks for suggesting something I already proposed</sarcasm> 😜

@devlead
Member

devlead commented Mar 19, 2021

@patriksvensson correct me if I'm wrong, but I think the headers returned from the GitHub API contain rate limit info, so one might not need to resort to random delays, but instead know exactly when to retry.

Edit: saw @augustoproiete already linked to those docs.

@patriksvensson
Member

@devlead Correct. If we're doing many calls against the same endpoint and resource, we can also cache the ETag between calls. A conditional GitHub API call that comes back 304 Not Modified counts as 0 (zero) against the rate limit.
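Something like this, for example. It's a sketch using HttpClient directly rather than Octokit, with an in-memory dictionary standing in for a real cache, and it assumes the client is already configured with the auth token and a User-Agent header.

```csharp
// Sketch of an ETag/conditional-request cache for GET calls.
// A 304 Not Modified response does not count against the rate limit.
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ConditionalRequestSketch
{
    // ETag and body cached per URL. Real code would need persistence and thread-safety.
    private static readonly Dictionary<string, (string ETag, string Body)> _cache =
        new Dictionary<string, (string ETag, string Body)>();

    public static async Task<string> GetAsync(HttpClient client, string url)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (_cache.TryGetValue(url, out var cached))
        {
            // Ask GitHub to return 304 if the resource hasn't changed since the last fetch.
            request.Headers.TryAddWithoutValidation("If-None-Match", cached.ETag);
        }

        using var response = await client.SendAsync(request).ConfigureAwait(false);
        if (response.StatusCode == HttpStatusCode.NotModified)
        {
            return cached.Body; // Served from the cache; cost 0 against the rate limit.
        }

        var body = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
        var etag = response.Headers.ETag?.Tag;
        if (etag != null)
        {
            _cache[url] = (etag, body);
        }

        return body;
    }
}
```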

@Jericho
Member Author

Jericho commented Mar 19, 2021

Yes, the headers contain rate limit info and, as I previously indicated, I already respect them. Also, there's nothing to cache because we trigger the abuse detection mostly when committing, creating an issue or submitting a PR. The issue is not retrieving from the API, it's creating, committing, etc.

I wish the exception from GitHub said something like: You triggered our abuse detection because .... That would tell us exactly how to fix and prevent this problem.

@devlead
Member

devlead commented Mar 19, 2021

Maybe a solution is to run it more often so fewer PRs are created at a time?
I.e. what would happen if it ran every hour, every half hour, etc.?

@Jericho
Member Author

Jericho commented Mar 20, 2021

@devlead that might indeed be a good solution to fall back on if I can't prevent AbuseExceptions: reduce the number of yaml files synchronized per run but run the process more often. GitHub's rate limits are on a "per-hour" basis, therefore we couldn't run our process more often than once an hour. Maybe 4 times a day would be more reasonable?

I still want to figure out why we get this exception and how to prevent it in the first place. It's bothering me!

By the way, one theory I mentioned in my original comment is that there may be another process running under the same cake-contrib-bot identity and consuming GitHub API calls which could interfere with the AddinDiscoverer. Anybody aware of any such process? Maybe a build pipeline, CI, etc.???

@devlead
Member

devlead commented Mar 20, 2021

Could having a separate addin discover user be an alternative?

@Jericho Jericho modified the milestones: 3.72.0, 3.73.0 Mar 26, 2021
@Jericho Jericho modified the milestones: 3.73.0, 3.74.0, 3.75.0 Mar 31, 2021
@Jericho Jericho modified the milestones: 3.75.0, 3.76.0 Apr 18, 2021
@Jericho Jericho removed this from the 3.76.0 milestone Jul 26, 2021