
Apache Kafka Scaler: Implementation for Excluding Persistent Lag #3905

Closed
wants to merge 35 commits

Conversation

josephangbc
Contributor

@josephangbc commented Nov 23, 2022

Summary

Add implementation for excluding consumer lag from partitions with persistent lag.

Use Case

In situations where a consumer is unable to process or consume from a partition (due to errors, etc.), the committed offset will not change, and the consumer lag on that partition will keep increasing and never decrease. KEDA then triggers scaling towards maxReplicaCount.

If a partition's lag is deemed persistent, excluding its consumer lag allows KEDA to trigger scaling appropriately based on the consumer lag observed on the other topics and partitions, unaffected by lag that scaling cannot resolve.

Logic

Upon each polling cycle, check whether the current consumer offset is the same as the previous consumer offset (see the sketch after this list).

Different: return endOffset - consumerOffset (no different from the current implementation)
Same: return 0 (to exclude this partition's consumer lag from the total lag)
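
A minimal Go sketch of this check, assuming the scaler keeps a previousOffsets map keyed by topic and partition (all names below are illustrative, not necessarily the PR's exact identifiers):

```go
package main

// Sketch only: field and method names are assumptions, not the PR's code.
type kafkaScaler struct {
	excludePersistentLag bool
	// previousOffsets holds the committed offset observed for each
	// topic/partition during the previous polling cycle. It must be
	// initialized (e.g. to an empty map) when the scaler is created.
	previousOffsets map[string]map[int32]int64
}

// getLagForPartition returns the lag that should count towards scaling
// for a single partition.
func (s *kafkaScaler) getLagForPartition(topic string, partition int32, consumerOffset, endOffset int64) int64 {
	if !s.excludePersistentLag {
		return endOffset - consumerOffset
	}
	if s.previousOffsets[topic] == nil {
		s.previousOffsets[topic] = map[int32]int64{}
	}
	prev, seen := s.previousOffsets[topic][partition]
	s.previousOffsets[topic][partition] = consumerOffset
	if seen && prev == consumerOffset {
		// Offset has not moved since the last cycle: treat the lag as
		// persistent and exclude it from the total.
		return 0
	}
	return endOffset - consumerOffset
}
```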

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)
  • Tests have been added
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Changelog has been updated and is aligned with our changelog requirements

Relates to #3904
Relates to kedacore/keda-docs#984

Member

@JorTurFer left a comment


I'm not really sure about this feature; in general we try to avoid maintaining information about the execution inside the scaler (previousOffsets). This is because the scaler can be recreated whenever needed, and behaviour based on information from previous cycles could behave unpredictably.
WDYT @zroubalik ?

CHANGELOG.md: review comment (outdated, resolved)
@zroubalik
Member

zroubalik commented Nov 28, 2022

/run-e2e kafka*
Update: You can check the progress here

@zroubalik
Member

This test is failing:

=== RUN   TestGetBrokers
    kafka_scaler_test.go:168: Expected success but got error error parsing excludePersistentLag: strconv.ParseBool: parsing "notvalid": invalid syntax
    kafka_scaler_test.go:192: Expected success but got error error parsing excludePersistentLag: strconv.ParseBool: parsing "notvalid": invalid syntax
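
The failure comes from strconv.ParseBool rejecting the "notvalid" value. A self-contained sketch of the kind of parsing involved (the metadata key and error text match the log above; the helper itself is illustrative, not the PR's exact code):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseExcludePersistentLag mirrors the parsing that produces the error
// in the test output; this helper is an assumption, not the PR's code.
func parseExcludePersistentLag(metadata map[string]string) (bool, error) {
	val, ok := metadata["excludePersistentLag"]
	if !ok {
		return false, nil // optional flag; defaults to off
	}
	b, err := strconv.ParseBool(val)
	if err != nil {
		return false, fmt.Errorf("error parsing excludePersistentLag: %w", err)
	}
	return b, nil
}

func main() {
	_, err := parseExcludePersistentLag(map[string]string{"excludePersistentLag": "notvalid"})
	fmt.Println(err)
	// error parsing excludePersistentLag: strconv.ParseBool: parsing "notvalid": invalid syntax
}
```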

@zroubalik
Member

I'm not really sure about this feature; in general we try to avoid maintaining information about the execution inside the scaler (previousOffsets). This is because the scaler can be recreated whenever needed, and behaviour based on information from previous cycles could behave unpredictably. WDYT @zroubalik ?

That's a very good point @JorTurFer, even though I think that it should be okay for this specific feature. @JosephABC as @JorTurFer mentioned, the scaler could be recreated during its lifetime; would it cause any problems? I think that it will reconcile.

@zroubalik
Member

I'd love to see an e2e test for this.

@tomkerkhove
Member

We are going to release KEDA v2.9 on Thursday. Do you think you can complete the open work by Wednesday @JosephABC?

@josephangbc
Contributor Author

@zroubalik If the scaler is recreated, I would expect the previousOffsets map to be recreated anew, so the recording of previous offsets starts from a blank state. The downside is that the scaler will not be able to identify the partitions with persistent lag in the next reconciliation cycle and may trigger a scale out of the scaling target. In the following cycles, the scaler will consult the previousOffsets map and scale accordingly (excluding persistent lag where deemed necessary), as illustrated below.

@tomkerkhove I will certainly try to complete by Wednesday. If not, I will target to complete before the next KEDA release.
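
To make that behaviour concrete, here is a toy main for the sketch in the Logic section above (same assumed names and an illustrative "orders" topic; the two snippets compile together as one package). The first cycle after a recreation still counts the stalled partition, and later cycles exclude it:

```go
package main

import "fmt"

// Assumes the kafkaScaler type and getLagForPartition method from the
// sketch in the Logic section; together the two snippets form one program.
func main() {
	s := &kafkaScaler{
		excludePersistentLag: true,
		// Fresh, empty map: the state right after the scaler is (re)created.
		previousOffsets: map[string]map[int32]int64{},
	}

	// Cycle 1: no history yet, so the stalled partition's lag still counts
	// and may briefly trigger a scale out.
	fmt.Println(s.getLagForPartition("orders", 0, 100, 500)) // 400

	// Cycle 2: committed offset unchanged since the last cycle, so the lag
	// is treated as persistent and excluded.
	fmt.Println(s.getLagForPartition("orders", 0, 100, 650)) // 0

	// A healthy partition whose offset advances keeps reporting real lag.
	fmt.Println(s.getLagForPartition("orders", 1, 200, 260)) // 60
}
```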

@josephangbc
Contributor Author

@zroubalik The e2e tests seem to have been queued for quite a long time already. Is this normal?

@zroubalik
Member

@zroubalik The e2e tests seem to have been queued for quite a long time already. Is this normal?

What do you mean? Only maintainers can trigger them. Do you mean this comment #3905 (comment)?

The e2e tests passed on this one (it is marked with emojis)

@JorTurFer
Member

The e2e tests passed on this one (it is marked with emojis)

And you can also see it in the latest commit checks when the execution was triggered

@josephangbc
Contributor Author

The checks for the latest commit all passed, except for the e2e test which shows as "queued". Does that need to be run and completed as well?

@JorTurFer
Member

The checks for the latest commit all passed, except for the e2e test which shows as "queued". Does that need to be run and completed as well?

We need to trigger them manually, like here:
[screenshot of the manual e2e trigger]

Could you add an e2e test to cover this new feature? We try to cover all the scalers with e2e tests; in this case, you could just add your feature as a test case here.

@JorTurFer
Member

JorTurFer commented Dec 4, 2022

/run-e2e kafka*
Update: You can check the progress here

@zroubalik
Member

@tobiaskrause would you mind looking at this as well, since you are doing a Kafka PR in parallel?

@zroubalik
Member

@JosephABC rebase please

@tobiaskrause
Contributor

@tobiaskrause would you mind looking at this as well, since you are doing a Kafka PR in parallel?

@zroubalik, I don't see any interference with my PR.

penghuazhou and others added 25 commits December 7, 2022 00:43
[commit list omitted: unrelated upstream commits pulled into the branch]
@josephangbc reopened this Dec 6, 2022
@zroubalik
Member

@JosephABC could you please rebase your PR to contain only relevant commits? Thanks!

@josephangbc closed this Dec 6, 2022