
Further improve how we minimize the possibility of GitHub AbuseException #188

Open
4 tasks
Jericho opened this issue Mar 19, 2021 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@Jericho
Member

Jericho commented Mar 19, 2021

We seem to be triggering GitHub's abuse detection more frequently as of late. As a result, our automated process raises an issue to add/modify/delete a given yaml file but never submits the corresponding PR.

It could be caused by another process running under the cake-contrib-bot user and making GitHub API calls around the same time the discoverer is scheduled (need to ask the @cake-contrib/cake-team if any new process is running), or maybe GitHub has changed the heuristics they use to determine if a process is abusing their API. To my knowledge, they don't disclose what these heuristics are, so there's not much we can do about it. Either @devlead or @gep13 (it's been a long time, so I don't remember who it was) talked to someone at GitHub about it at some point, and they acknowledged having heuristics in place above and beyond the publicly disclosed "rate limits", but they have not disclosed what they are.

That's why I added some defense mechanisms in the discoverer such as:

  • don't delete more than 75 yaml files per scheduled run
  • don't modify more than 75 yaml files per scheduled run
  • don't create more than 75 yaml files per scheduled run
  • delay after processing each yaml file. This delay used to be a fixed 1 second; it was randomized (between 600 milliseconds and 2 seconds) last month. Maybe the fact that the delay can now be as low as 600 milliseconds is part of the problem???
  • check the result of Octokit's .GetLastApiInfo() and stop processing if there are fewer than an arbitrary but reasonable (as determined by me) number of calls remaining. A rough sketch of these last two mechanisms follows this list.
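For illustration, this is roughly the shape of those two safeguards with Octokit.net. It's a simplified sketch, not the actual AddinDiscoverer code; the class, method and threshold names are made up for this example.

```csharp
// Simplified sketch of the randomized delay and the GetLastApiInfo() check.
// Not the actual AddinDiscoverer code; names and values are illustrative.
using System;
using System.Threading.Tasks;
using Octokit;

public static class ThrottlingSketch
{
    private const int MinimumRemainingApiCalls = 250; // the current threshold mentioned above
    private static readonly Random _random = new Random();

    // Returns true if it is safe to keep processing yaml files, false if we should stop.
    public static async Task<bool> ThrottleAndCheckAsync(GitHubClient client)
    {
        // Randomized delay between yaml files (currently 600 ms to 2 seconds).
        await Task.Delay(TimeSpan.FromMilliseconds(_random.Next(600, 2000))).ConfigureAwait(false);

        // GetLastApiInfo() reflects the rate limit headers from the most recent API response.
        var remaining = client.GetLastApiInfo()?.RateLimit?.Remaining;

        // Stop processing when we get too close to the hourly limit.
        return !remaining.HasValue || remaining.Value > MinimumRemainingApiCalls;
    }
}
```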

But evidently, these are not sufficient, so I propose the following:

  • don't process more than 75 yaml files TOTAL per scheduled run (as opposed to 75 delete, 75 modify and 75 create). I don't think this will make a big difference because we rarely delete/modify/add more than 75 files at a time but better safe than sorry.
  • double the random delay (1200 milliseconds to 4 seconds)
  • increase the "minimum number of remaining calls" threshold. Currently 250, maybe increase it to 500???
  • display more info in the log when we trigger AbuseException. e.g.: what was the remaining number of API calls prior to the exception, how many yaml files were processed prior to the exception, etc. Maybe this could help us figure out why we are triggering it.

Now that I think about it, I should implement the change to display additional info in the log before changing anything else. Maybe this new information will help me understand the root cause of this problem.
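To make that concrete, here is a minimal sketch of what the extra logging could look like. It assumes Octokit.net's AbuseException (which exposes a RetryAfterSeconds value); the processing delegate, the counter and the use of Console.WriteLine as the "logger" are placeholders, not actual AddinDiscoverer code.

```csharp
// Hypothetical sketch of the extra diagnostics: catch AbuseException and record
// the rate limit state plus how many yaml files had been processed so far.
using System;
using System.Threading.Tasks;
using Octokit;

public static class AbuseDiagnosticsSketch
{
    public static async Task ProcessWithDiagnosticsAsync(
        GitHubClient client,
        Func<Task> processOneYamlFile,   // placeholder for the real per-file work
        int filesProcessedSoFar)
    {
        try
        {
            await processOneYamlFile().ConfigureAwait(false);
        }
        catch (AbuseException e)
        {
            var rateLimit = client.GetLastApiInfo()?.RateLimit;
            Console.WriteLine(
                $"Abuse detection triggered after {filesProcessedSoFar} yaml files. " +
                $"Remaining API calls: {rateLimit?.Remaining} of {rateLimit?.Limit}, " +
                $"reset at {rateLimit?.Reset:u}. Retry-After: {e.RetryAfterSeconds} seconds.");
            throw;
        }
    }
}
```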

@Jericho Jericho added the enhancement New feature or request label Mar 19, 2021
@Jericho Jericho added this to the 3.72.0 milestone Mar 19, 2021
@Jericho Jericho self-assigned this Mar 19, 2021
@augustoproiete
Member

Maybe the fact that the delay can now be as low as 600 milliseconds is part of the problem???

Probably yes, according to GitHub's docs:

"If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request."

So maybe a random wait between 1.5 and 4 seconds would trigger the rate limit less often?


One other thing you might want to consider is to read the Retry-After response header when you get the exception, and pause until the time they say it's OK to try again. Maybe Polly policies to retry? (see the sketch below)

From the docs:

"When you have been limited, use the Retry-After response header to slow down. The value of the Retry-After header will always be an integer, representing the number of seconds you should wait before making requests again. For example, Retry-After: 30 means you should wait 30 seconds before sending more requests".

https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-abuse-rate-limits
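Something along these lines, perhaps. This is just a sketch assuming Octokit.net's AbuseException and Polly; the retry count and the 60-second fallback are arbitrary values chosen for illustration.

```csharp
// Retry GitHub calls that trip the abuse detection, honouring Retry-After when provided.
using System;
using System.Threading.Tasks;
using Octokit;
using Polly;

public static class RetryAfterSketch
{
    public static Task<T> ExecuteWithRetryAsync<T>(Func<Task<T>> gitHubCall)
    {
        return Policy
            .Handle<AbuseException>()
            .WaitAndRetryAsync(
                3,                                  // arbitrary number of retries
                (attempt, exception, context) =>    // how long to wait before each retry
                {
                    var abuse = exception as AbuseException;
                    return TimeSpan.FromSeconds(abuse?.RetryAfterSeconds ?? 60);
                },
                (exception, sleep, attempt, context) => Task.CompletedTask)
            .ExecuteAsync(gitHubCall);
    }
}
```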

@Jericho
Member Author

Jericho commented Mar 19, 2021

So maybe a random wait between 1.5 and 4 seconds would trigger the rate limit less often?

<sarcasm>Thanks for suggesting something I already proposed</sarcasm> 😜

@devlead
Member

devlead commented Mar 19, 2021

@patriksvensson correct me if I'm wrong, but I think the headers returned from the GitHub API contain rate limit info, so one might not need to resort to random delays, but instead know exactly when to retry.

Edit: saw @augustoproiete already linked to those docs.

@patriksvensson
Member

@devlead Correct. If we're doing many calls against the same endpoint and resource, we can also cache the ETag between calls. A conditional GitHub API call that comes back 304 Not Modified counts as 0 (zero) against the rate limit.
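Something like this, for example. It's a sketch using HttpClient directly rather than Octokit, with an in-memory dictionary standing in for a real cache, and it assumes the client is already configured with the auth token and a User-Agent header.

```csharp
// Sketch of an ETag/conditional-request cache for GET calls.
// A 304 Not Modified response does not count against the rate limit.
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ConditionalRequestSketch
{
    // ETag and body cached per URL. Real code would need persistence and thread-safety.
    private static readonly Dictionary<string, (string ETag, string Body)> _cache =
        new Dictionary<string, (string ETag, string Body)>();

    public static async Task<string> GetAsync(HttpClient client, string url)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (_cache.TryGetValue(url, out var cached))
        {
            // Ask GitHub to return 304 if the resource hasn't changed since the last fetch.
            request.Headers.TryAddWithoutValidation("If-None-Match", cached.ETag);
        }

        using var response = await client.SendAsync(request).ConfigureAwait(false);
        if (response.StatusCode == HttpStatusCode.NotModified)
        {
            return cached.Body; // Served from the cache; cost 0 against the rate limit.
        }

        var body = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
        var etag = response.Headers.ETag?.Tag;
        if (etag != null)
        {
            _cache[url] = (etag, body);
        }

        return body;
    }
}
```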

@Jericho
Member Author

Jericho commented Mar 19, 2021

Yes, the headers contain rate limit info and, as I previously indicated, I already respect them. Also, there's nothing to cache because we trigger the abuse detection mostly when committing, creating an issue or submitting a PR. The issue is not retrieving from the API, it's creating, committing, etc.

I wish the exception from GitHub said something like: You triggered our abuse detection because .... That would tell us exactly how to fix and prevent this problem.

@devlead
Member

devlead commented Mar 19, 2021

Maybe a solution is to run it more often so fewer PRs are created at a time?
I.e. what would happen if it ran every hour, every half hour, etc.?

@Jericho
Member Author

Jericho commented Mar 20, 2021

@devlead that might indeed be a good solution to fall back on if I can't prevent AbuseExceptions: reduce the number of yaml files synchronized per run but run the process more often. GitHub's rate limits are on a "per-hour" basis, therefore we couldn't run our process more often than once an hour. Maybe 4 times a day would be more reasonable?

I still want to figure out why we get this exception and how to prevent it in the first place. It's bothering me!

By the way, one theory I mentioned in my original comment is that there may be another process running under the same cake-contrib-bot identity and consuming GitHub API calls which could interfere with the AddinDiscoverer. Anybody aware of any such process? Maybe a build pipeline, CI, etc.???

@devlead
Member

devlead commented Mar 20, 2021

Could having a separate addin discover user be an alternative?

@Jericho Jericho modified the milestones: 3.72.0, 3.73.0 Mar 26, 2021
@Jericho Jericho modified the milestones: 3.73.0, 3.74.0, 3.75.0 Mar 31, 2021
@Jericho Jericho modified the milestones: 3.75.0, 3.76.0 Apr 18, 2021
@Jericho Jericho removed this from the 3.76.0 milestone Jul 26, 2021