Fix flaky TopicReplicasChangeST#testKafkaTopicReplicaChangePositiveRoundTrip #10339

Merged on Jul 17, 2024 (3 commits)

Conversation

fvaleri
Contributor

@fvaleri fvaleri commented Jul 15, 2024

Type of change

  • Bugfix

Description

This change is an attempt to fix the flaky TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip. Looking at the logs, I can confirm that the timeout is indeed caused by Cruise Control's cluster model not being ready in time.

The Cruise Control setup proposed here seems to make cluster model generation faster. It reduces the partition and replica counts of Cruise Control's topics and shortens the metrics window. On my machine, the whole test suite went down from 26 minutes to 13 minutes, a 50% reduction. It would be good if someone else could confirm this by running TopicReplicasChangeST before and after this change.
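
For readers who want a concrete picture of this kind of setup, here is a minimal sketch of Cruise Control overrides that shrink the sample store topics and the metrics windows. The class name, the method name, and every value are hypothetical and are not taken from this PR's diff. The keys are written as I understand them from the Cruise Control configuration reference, so please verify them against the version you run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: not the code changed in this PR.
final class CruiseControlTestConfig {

    // Returns Cruise Control overrides intended to make the cluster model
    // available sooner in short-lived test clusters. All values are assumptions.
    static Map<String, Object> fastModelConfig() {
        Map<String, Object> config = new LinkedHashMap<>();

        // Shorter and fewer metrics windows, so a valid window fills quickly.
        config.put("partition.metrics.window.ms", 10_000);
        config.put("num.partition.metrics.windows", 1);
        config.put("broker.metrics.window.ms", 10_000);
        config.put("num.broker.metrics.windows", 1);
        config.put("metric.sampling.interval.ms", 5_000);

        // Smaller internal sample store topics: one partition, one replica.
        config.put("sample.store.topic.replication.factor", 1);
        config.put("partition.sample.store.topic.partition.count", 1);
        config.put("broker.sample.store.topic.partition.count", 1);

        return config;
    }
}
```

In a Strimzi deployment, entries like these would typically go under `spec.cruiseControl.config` of the Kafka custom resource. Such small values only make sense for tests, since they trade model accuracy and durability for startup speed.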

With this change, I wasn't able to trigger the issue locally after many runs. I suggest running the regression tests 2 or 3 times to confirm that this really helps.

Should fix #10295.

Checklist

  • Make sure all tests pass

@fvaleri fvaleri requested review from ppatierno, see-quick and kyguy July 15, 2024 05:54
@fvaleri fvaleri added this to the 0.43.0 milestone Jul 15, 2024
@Frawless
Member

/azp run regression

Azure Pipelines successfully started running 1 pipeline(s).

@Frawless
Member

/packit test --labels regression

@fvaleri fvaleri requested a review from Frawless July 15, 2024 07:37
@fvaleri
Contributor Author

fvaleri commented Jul 15, 2024

As a side effect, all CC tests should become faster if this works.

@scholzj
Member

scholzj commented Jul 15, 2024

What does this have to do with DumpLogSegmentsTest#testDumpRemoteLogMetadataNonZeroStartingOffset? Is that just a bad copy paste?

@fvaleri
Contributor Author

fvaleri commented Jul 15, 2024

> What does this have to do with DumpLogSegmentsTest#testDumpRemoteLogMetadataNonZeroStartingOffset? Is that just a bad copy paste?

Yes, sorry, let me fix it. Working on too many flaky tests :D

@fvaleri fvaleri changed the title Fix flaky DumpLogSegmentsTest#testDumpRemoteLogMetadataNonZeroStartingOffset Flaky test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip Jul 15, 2024
@fvaleri fvaleri changed the title Flaky test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip Flaky test TopicReplicasChangeST#testKafkaTopicReplicaChangePositiveRoundTrip Jul 15, 2024
Signed-off-by: Federico Valeri <[email protected]>
@fvaleri fvaleri changed the title Flaky test TopicReplicasChangeST#testKafkaTopicReplicaChangePositiveRoundTrip Fix flaky TopicReplicasChangeST#testKafkaTopicReplicaChangePositiveRoundTrip Jul 15, 2024
@fvaleri fvaleri force-pushed the flaky-positive-trip branch from a7ca402 to 7206661 Compare July 15, 2024 08:09
Signed-off-by: Federico Valeri <[email protected]>
@Frawless
Member

/azp run regression

Azure Pipelines successfully started running 1 pipeline(s).

@Frawless
Member

/packit test --labels regression

@fvaleri
Contributor Author

fvaleri commented Jul 16, 2024

Looks like it worked. Wdyt?

@Frawless
Member

The results are good: one flake on Azure, in ReconciliationST.testPauseReconciliationInKafkaRebalanceAndTopic, which failed while waiting for rebalance readiness. I'm not sure whether it is connected to this change or whether the test could be improved. TF has 0 failures connected to the changes in this PR.

Signed-off-by: Federico Valeri <[email protected]>
Member

@see-quick see-quick left a comment

Thanks for this @fvaleri 👍 Good job

@fvaleri
Contributor Author

fvaleri commented Jul 17, 2024

@kyguy @ppatierno are you good with these changes?

@Frawless Frawless merged commit 10a39d6 into strimzi:main Jul 17, 2024
13 checks passed
@fvaleri fvaleri deleted the flaky-positive-trip branch July 17, 2024 12:51
Successfully merging this pull request may close these issues.

[Bug]: Flaky test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip
5 participants