
feat: filter GitHub workflows via query parameter for better queue count accuracy #6519

Open · wants to merge 4 commits into main
Conversation

@silviu-dinu commented Feb 1, 2025

The current GitHub runner scaler implementation tries to determine the workflow queue length by fetching the latest 30 workflow runs from the GitHub API and then filtering the results client-side by status (queued, in_progress). This is inaccurate when queued workflow runs are older than the latest 30 items (the GitHub API's default page size), so those queued jobs are never picked up. The issue usually manifests when re-running older jobs.

The proposed solution is to filter workflow runs on the server side by passing the ?status=queued/in_progress query parameter to the /actions/runs API call. Additionally, this PR sets ?per_page=100 (the maximum value) when calling the /actions/runs/{run_id}/jobs API, instead of the default limit of 30.
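For illustration, a minimal Go sketch of the proposed request shapes (the helper names here are hypothetical, not the scaler's actual code):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildRunsURL filters workflow runs by status on the server side, so
// queued runs older than the first page of results are not missed.
func buildRunsURL(ownerRepo, status string) string {
	q := url.Values{}
	q.Set("status", status) // "queued" or "in_progress"
	return fmt.Sprintf("https://api.github.com/repos/%s/actions/runs?%s", ownerRepo, q.Encode())
}

// buildJobsURL requests up to 100 jobs per run instead of the default 30.
func buildJobsURL(ownerRepo string, runID int64) string {
	return fmt.Sprintf("https://api.github.com/repos/%s/actions/runs/%d/jobs?per_page=100", ownerRepo, runID)
}

func main() {
	fmt.Println(buildRunsURL("kedacore/keda", "queued"))
	fmt.Println(buildRunsURL("kedacore/keda", "in_progress"))
	fmt.Println(buildJobsURL("kedacore/keda", 42))
}
```

With server-side filtering, the queue count is derived from two small, targeted responses rather than from whatever happens to be in the latest 30 runs.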

Checklist

  • When introducing a new scaler, I agree with the scaling governance policy
  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added
  • Changelog has been updated and is aligned with our changelog requirements
  • (N/A) A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
  • (N/A) A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6519

@vogonistic

I've run into the same issue, where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?
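For reference, a rough Go sketch of that pagination idea (the `jobsPage` type and `fetchAllJobStatuses` helper are made up for illustration, and authentication is omitted): keep requesting pages until total_count jobs have been collected.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// jobsPage mirrors the fields of the "list jobs for a workflow run"
// response that matter here.
type jobsPage struct {
	TotalCount int `json:"total_count"`
	Jobs       []struct {
		Status string `json:"status"`
	} `json:"jobs"`
}

// fetchAllJobStatuses pages through /actions/runs/{run_id}/jobs until
// total_count jobs have been seen.
func fetchAllJobStatuses(ownerRepo string, runID int64) ([]string, error) {
	var statuses []string
	for page := 1; ; page++ {
		u := fmt.Sprintf(
			"https://api.github.com/repos/%s/actions/runs/%d/jobs?per_page=100&page=%d",
			ownerRepo, runID, page)
		resp, err := http.Get(u)
		if err != nil {
			return nil, err
		}
		var p jobsPage
		err = json.NewDecoder(resp.Body).Decode(&p)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		for _, j := range p.Jobs {
			statuses = append(statuses, j.Status)
		}
		// Stop once every job has been seen, or when a page comes back empty.
		if len(statuses) >= p.TotalCount || len(p.Jobs) == 0 {
			return statuses, nil
		}
	}
}

func main() {
	statuses, err := fetchAllJobStatuses("kedacore/keda", 42)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(statuses), "jobs seen")
}
```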

@silviu-dinu
Author

> I've run into the same issue, where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?

@vogonistic Actually, there is a status filter on the /actions/runs API endpoint. That is the call I'm changing in this PR, which should solve the issue.

The API you're referring to is /actions/runs/{run_id}/jobs. That one doesn't have a status filter, but we don't need it anyway, since it's unlikely that a single run has more than 30 jobs.

I think adding this query parameter should work, but I'll need some time to understand and fix the failing tests.

@vogonistic

> @vogonistic Actually, there is a status filter on the /actions/runs API endpoint. That is the call I'm changing in this PR, which should solve the issue.
>
> The API you're referring to is /actions/runs/{run_id}/jobs. That one doesn't have a status filter, but we don't need it anyway, since it's unlikely that a single run has more than 30 jobs.

@silviu-dinu Nice! I missed that you switched APIs. While this solves my problem, it does add a known limitation that others might run into, so it should probably either be documented or solved with pagination as well.

@silviu-dinu
Author

> @silviu-dinu Nice! I missed that you switched APIs. While this solves my problem, it does add a known limitation that others might run into, so it should probably either be documented or solved with pagination as well.

@vogonistic Not sure I understand. Which known limitation is the change in this PR adding?

@vogonistic

> @vogonistic Not sure I understand. Which known limitation is the change in this PR adding?

I’m talking about this part:

> but we don't need it anyway, since it's unlikely that a single run has more than 30 jobs.

I'm guessing someone thought the same in the initial implementation, but I've spent several workdays figuring out where the problem stems from. So if it'll never return more than 30 queued + 30 in_progress runs, that's worth documenting, in my opinion. Am I misunderstanding how it'll work?

@silviu-dinu
Author

silviu-dinu commented Feb 6, 2025

> I'm guessing someone thought the same in the initial implementation, but I've spent several workdays figuring out where the problem stems from. So if it'll never return more than 30 queued + 30 in_progress runs, that's worth documenting, in my opinion. Am I misunderstanding how it'll work?

@vogonistic You're right, it will only scale up to 30 + 30 runners maximum, but that is per polling cycle. KEDA keeps calling GitHub (e.g., every minute) and will spin up more runners if it still finds pending jobs. So I don't see an issue with this, except maybe a slight delay when there are lots of pending jobs, depending on the configured polling interval.

> I've run into the same issue, where I have a workflow that has >30 jobs

I also increased the job page size query parameter to its maximum value (100) in this commit for now, since it's much easier to implement than pagination.

@silviu-dinu silviu-dinu marked this pull request as ready for review February 6, 2025 18:43
@silviu-dinu silviu-dinu requested a review from a team as a code owner February 6, 2025 18:43
@JorTurFer
Member

JorTurFer commented Feb 8, 2025

> So I don't see an issue with this, except maybe a slight delay when there are lots of pending jobs, depending on the configured polling interval.

If someone is using KEDA at scale, this could be a significant limitation :( As the API looks paginated, what about walking through the pages?
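For completeness, one way to browse the pages is to follow the rel="next" URL that GitHub returns in the Link response header; a rough Go sketch, with hypothetical helper names and no auth shown:

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"
)

// nextPageURL extracts the rel="next" URL from a GitHub Link header,
// or returns "" when there is no further page.
func nextPageURL(linkHeader string) string {
	re := regexp.MustCompile(`<([^>]+)>;\s*rel="next"`)
	if m := re.FindStringSubmatch(linkHeader); m != nil {
		return m[1]
	}
	return ""
}

// walkPages issues GET requests, following Link: rel="next" until the
// last page; visit would decode and tally each page in the real scaler.
func walkPages(start string, visit func(*http.Response) error) error {
	for u := start; u != ""; {
		resp, err := http.Get(u)
		if err != nil {
			return err
		}
		err = visit(resp)
		next := nextPageURL(resp.Header.Get("Link"))
		resp.Body.Close()
		if err != nil {
			return err
		}
		u = next
	}
	return nil
}

func main() {
	_ = walkPages // wiring into the scaler's HTTP client is out of scope here
	fmt.Println(nextPageURL(`<https://api.github.com/x?page=2>; rel="next"`))
}
```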
