-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Flaky test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip #10295
Comments
It does not seem to be a problem in the Azure pipelines. |
I know, but have you tried locally? I would be good to have some confirmation besides me. |
Funny enough, I saw it fail once in the next ST run after I wrote that it doesn't seem to fail. :-( That said, I still think that in the CI it seems to fail only rarely. There are several tests that are significantly more flaky. The general issue with this is that unless it is a real problem, nobody is going to jump on it and fix it. So if you think it is flaky, it will usually be up to you to deal with it.
Interesting, I ran it 5 times and it failed twice. Normally my local is much more reliable than Azure, not the other way around. |
Sure, and we should log them so that they can eventually be fixed. This one is tricky as you also found. The test looks correct, and we applied some optimization to make the CC model generation faster. There must be some race condition going on, possibly related to the testing infrastructure. |
I think all of the flaky tests are related to Cruise Control. It is really unpredictable. Sometimes you have a proposal and rebalance done in 30 seconds. Sometimes you wait 5 minutes for a proposal. |
That's true. In this class I added some optimizations to make it faster and a bit more predictable. If I remember correctly, the flakiness was already there before that. |
Correct, not sure if increasing timeout would solve a race there. There have been countless problems with the CC, where many race conditions happen.
This is true...and even with our optimizations is still un-predictable sometimes. So not sure how we could improve it more. |
First thing we have to confirm is that this issue is actually related to the cluster model not being ready in time, which I'm not sure at the moment. I'll try to have a look if I find some time. |
Test executor should collect logs from operands and CO so I guess the problem should be visible in CC log or maybe in operator/kafka log. |
Discussed on the community call on 10.7.2024: @fvaleri plans to look at this and try to improve the stability. |
It took me many runs to hit the timeout a couple of times. # test logs
2024-07-13 08:05:07 [main] INFO [KafkaTopicUtils:315] Starting wait for condition: replica change status not present
io.strimzi.test.WaitException: Timeout after 300000 ms waiting for replica change status not present
at io.strimzi.test.TestUtils.waitFor(TestUtils.java:182)
at io.strimzi.test.TestUtils.waitFor(TestUtils.java:133)
at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForCondition(KafkaTopicUtils.java:316)
at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForReplicaChangeStatusNotPresent(KafkaTopicUtils.java:377)
at io.strimzi.systemtest.operators.topic.TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip(TopicReplicasChangeST.java:170)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
....
# TO logs
2024-07-13 08:04:57,18740 INFO [LoopRunnable-0] ReplicasChangeHandler:176 - Replicas change pending, Topics: [my-topic-1976557641-236539828]
...
2024-07-13 08:10:11,05290 INFO [LoopRunnable-0] ReplicasChangeHandler:184 - Replicas change ongoing, Topics: [my-topic-1976557641-236539828]
# CC logs
2024-07-13 08:05:09 WARN OperationRunnable:50 - Received exception when trying to execute runnable for "Update Topic Configuration"
com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There are only 0 valid windows when aggregating in range [-1, 1720857909207] for aggregation options (minValidEntityRatio=1.00, minValidEntityGroupRatio=0.00, minValidWindows=1, numEntitiesToInclude=68, granularity=ENTITY)
at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.computeResult(GoalBasedOperationRunnable.java:167) ~[cruise-control-2.5.137.jar:?]
at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:157) ~[cruise-control-2.5.137.jar:?]
at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:73) ~[cruise-control-2.5.137.jar:?]
at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.OperationRunnable.run(OperationRunnable.java:45) ~[cruise-control-2.5.137.jar:?]
at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.run(GoalBasedOperationRunnable.java:36) ~[cruise-control-2.5.137.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
.... I opened a new PR which should help to make the model generation faster. |
Bug Description
The system test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip often fails with a timeout when running locally the entire class, while it always works when running it in isolation. I think this was already observed before by other people.
Steps to reproduce
mvn verify -B -Pall -pl systemtest -Dit.test="TopicReplicasChangeST" \ -DexcludedGroups=flaky,loadbalancer -Dmaven.javadoc.skip -Dskip.surefire.tests
Expected behavior
The test should always work.
Strimzi version
main
Kubernetes version
1.30
The text was updated successfully, but these errors were encountered: