[Failing Test]: dataflow runner worker project test stuck causing Java PreCommit time out #28957

Closed
1 of 16 tasks
Abacn opened this issue Oct 12, 2023 · 16 comments

Comments

@Abacn
Contributor

Abacn commented Oct 12, 2023

What happened?

Java PreCommit timeouts appear to have become more frequent recently.

The test report of the successful run that followed a timed-out run lists the following tests as newly added, meaning they were missing from the timed-out run. This suggests the stuck project is runners:google-cloud-dataflow-java:worker:

| Package | Duration | Fail (diff) | Skip (diff) | Pass (diff) | Total (diff) |
| --- | --- | --- | --- | --- | --- |
| org.apache.beam.runners.dataflow.harness.test | 10 sec | 0 | 0 | 9 +9 | 9 +9 |
| org.apache.beam.runners.dataflow.options | 15 sec | 0 | 0 | 25 | 25 |
| org.apache.beam.runners.dataflow.transforms | 9.9 sec | 0 | 0 | 11 | 11 |
| org.apache.beam.runners.dataflow.util | 1.9 sec | 0 | 0 | 144 | 144 |
| org.apache.beam.runners.dataflow.worker | 6 min 7 sec | 0 | 2 +2 | 493 +493 | 495 +495 |
| org.apache.beam.runners.dataflow.worker.apiary | 1 ms | 0 | 0 | 6 +6 | 6 +6 |
| org.apache.beam.runners.dataflow.worker.counters | 12 ms | 0 | 0 | 22 +22 | 22 +22 |
| org.apache.beam.runners.dataflow.worker.graph | 98 ms | 0 | 0 | 33 +33 | 33 +33 |
| org.apache.beam.runners.dataflow.worker.logging | 0.24 sec | 0 | 0 | 37 +37 | 37 +37 |
| org.apache.beam.runners.dataflow.worker.profiler | 3 ms | 0 | 0 | 4 +4 | 4 +4 |
| org.apache.beam.runners.dataflow.worker.status | 40 sec | 0 | 0 | 7 +7 | 7 +7 |
| org.apache.beam.runners.dataflow.worker.streaming | 10 sec | 0 | 0 | 18 +18 | 18 +18 |
| org.apache.beam.runners.dataflow.worker.testing | 2 ms | 0 | 0 | 3 +3 | 3 +3 |
| org.apache.beam.runners.dataflow.worker.util | 1 min 3 sec | 0 | 1 +1 | 52 +52 | 53 +53 |
| org.apache.beam.runners.dataflow.worker.util.common | 22 ms | 0 | 0 | 6 +6 | 6 +6 |
| org.apache.beam.runners.dataflow.worker.util.common.worker | 1.4 sec | 0 | 0 | 84 +84 | 84 +84 |
| org.apache.beam.runners.dataflow.worker.windmill | 9 ms | 0 | 0 | 9 +9 | 9 +9 |
| org.apache.beam.runners.dataflow.worker.windmill.grpcclient | 11 sec | 0 | 0 | 13 +13 | 13 +13 |
| org.apache.beam.runners.dataflow.worker.windmill.state | 5 min 7 sec | 0 | 0 | 127 +127 | 127 +127 |

see https://ci-beam.apache.org/view/PostCommit/job/beam_PreCommit_Java_Cron/7455/testReport/

Issue Failure

Failure: Test is flaky

Issue Priority

Priority: 2 (backlog / disabled test but we think the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Abacn
Contributor Author

Abacn commented Dec 12, 2023

The originally stuck test was fixed. Now there appears to be another stuck test:

testOnNewWorkerMetadata_correctlyRemovesStaleWindmillServers (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest) failed
runners/google-cloud-dataflow-java/worker/build/test-results/test/TEST-org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.xml [took 10m 36s]

org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.waitForWorkerMetadataToBeConsumed(StreamingEngineClientTest.java:349)
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.testOnNewWorkerMetadata_correctlyRemovesStaleWindmillServers(StreamingEngineClientTest.java:287)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:750)

It was added in #28835
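
The stack trace above shows the test blocked inside a wait helper until JUnit's 600-second timeout fired. As a minimal sketch of one way to keep such a wait from stalling the whole PreCommit, the helper below bounds the wait and reports failure quickly. The class and method names are hypothetical, and this is not the actual body of `waitForWorkerMetadataToBeConsumed`, which the thread does not show.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not code from the Beam repository.
class BoundedWaitSketch {
  /** Waits for the latch and returns false on timeout instead of blocking indefinitely. */
  public static boolean awaitConsumed(CountDownLatch consumed, long timeout, TimeUnit unit)
      throws InterruptedException {
    return consumed.await(timeout, unit);
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch consumed = new CountDownLatch(1);
    // Stand-in for the component under test signalling that it consumed the metadata.
    Thread consumer = new Thread(consumed::countDown);
    consumer.start();
    if (!awaitConsumed(consumed, 5, TimeUnit.SECONDS)) {
      throw new AssertionError("worker metadata was not consumed within 5 seconds");
    }
    consumer.join();
    System.out.println("consumed");
  }
}
```

With a short bound, a stuck wait fails in seconds with a descriptive message rather than consuming ten minutes of the job's time budget.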

@Abacn
Contributor Author

Abacn commented Dec 12, 2023

Another occurrence: https://github.com/apache/beam/pull/29723/checks?check_run_id=19571889999

This happens at a moderate frequency (>10%).

@Abacn Abacn added P1 and removed P2 labels Dec 12, 2023
@Abacn
Contributor Author

Abacn commented Dec 12, 2023

Bumping to P1, as this is fairly frequent and it is not yet clear whether it indicates a regression in the streaming worker.

@scwhittle
Contributor

@Abacn Can the test be disabled or sickbayed before it is investigated? This code is not yet used in pipelines.

@Abacn
Contributor Author

Abacn commented Dec 12, 2023

If it is not used in actual pipelines, it's lower risk, and yes, it could be disabled or sickbayed.

Update: downgraded to P2 for now.

@kennknowles
Member

Yea let's disable. This has caused weeks of delay for some major changes.

@Abacn
Contributor Author

Abacn commented Dec 18, 2023

The test is currently disabled. Leaving this bug open to track the TODO of fixing the test.

@Abacn
Contributor Author

Abacn commented Dec 28, 2023

StreamingEngineClientTest.testScheduledBudgetRefresh is also flaky: https://github.com/apache/beam/runs/19993891236

also added in #28835

@scwhittle
Contributor

@m-trieu Martin can you look into fixing the test flakiness?

@Abacn
Contributor Author

Abacn commented Feb 14, 2024

testStreamsStartCorrectly (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest) failed

org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
	at org.apache.beam.runners.dataflow.worker.windmill.client.AbstractWindmillStream.close(AbstractWindmillStream.java)
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClient.finish(StreamingEngineClient.java:236)
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.cleanUp(StreamingEngineClientTest.java:172)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

https://github.com/apache/beam/pull/30245/checks?check_run_id=21501494900
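
In this trace the hang is in the test's cleanup path, where `close` never returns. A minimal sketch of bounding a potentially blocking close call follows, assuming the close logic can be wrapped in a `Runnable`. The class and method names are hypothetical, and this is not how `StreamingEngineClient.finish` is implemented.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not code from the Beam repository.
class BoundedCloseSketch {
  /** Runs close on a helper thread; returns false if it does not finish in time. */
  public static boolean closeWithin(Runnable close, long timeout, TimeUnit unit)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> result = executor.submit(close);
    try {
      result.get(timeout, unit);
      return true;
    } catch (TimeoutException e) {
      return false;
    } finally {
      // Interrupts the close task if it is still running.
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(closeWithin(() -> {}, 1, TimeUnit.SECONDS)); // true

    // Stand-in for a close call that never completes on its own.
    Runnable stuck = () -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    System.out.println(closeWithin(stuck, 100, TimeUnit.MILLISECONDS)); // false
  }
}
```

This only helps if the blocked call responds to interruption. A close that ignores interrupts would still leak its thread, so the underlying hang would need fixing either way.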

@m-trieu
Contributor

m-trieu commented Feb 14, 2024

will take a look today

@m-trieu
Contributor

m-trieu commented Feb 15, 2024

have a potential fix #30322

@Abacn
Contributor Author

Abacn commented Mar 6, 2024

Other flaky tests:

testLatencyAttributionToQueuedState: https://github.com/apache/beam/runs/22270690743

java.lang.AssertionError: expected:<PT0S> but was:<PT1S>
	at org.junit.Assert.assertEquals(Assert.java:146)
	at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testLatencyAttributionToQueuedState(StreamingDataflowWorkerTest.java:3444)

testInvalidateStuckCommits: https://github.com/apache/beam/runs/22276370706

Wanted but not invoked:
forComputation.invalidate(
    <ByteString@3e931ad0 size=0 contents="">,
    0L
);
-> at org.apache.beam.runners.dataflow.worker.windmill.work.refresh.DispatchedActiveWorkRefresherTest.testInvalidateStuckCommits(DispatchedActiveWorkRefresherTest.java:231)
Actually, there were zero interactions with this mock.

	at org.apache.beam.runners.dataflow.worker.windmill.work.refresh.DispatchedActiveWorkRefresherTest.testInvalidateStuckCommits(DispatchedActiveWorkRefresherTest.java:231)
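
The `expected:<PT0S> but was:<PT1S>` failure in testLatencyAttributionToQueuedState has the shape of a wall-clock dependency: an assertion on an exact duration can fail when the machine is slow. The thread does not establish that this is the cause here. As a sketch of the general remedy, the hypothetical class below takes an injectable `java.time.Clock`, so a test can pin time and get a deterministic duration. None of these names come from the Beam code base.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical illustration; not code from the Beam repository.
class QueuedLatencySketch {
  private final Clock clock;
  private final Instant queuedAt;

  public QueuedLatencySketch(Clock clock) {
    this.clock = clock;
    this.queuedAt = clock.instant();
  }

  /** Time spent queued so far, truncated to whole seconds. */
  public Duration queuedFor() {
    return Duration.ofSeconds(Duration.between(queuedAt, clock.instant()).getSeconds());
  }

  public static void main(String[] args) {
    // A fixed clock never advances, so the measured duration is always zero.
    Clock fixed = Clock.fixed(Instant.parse("2024-03-06T00:00:00Z"), ZoneOffset.UTC);
    QueuedLatencySketch item = new QueuedLatencySketch(fixed);
    System.out.println(item.queuedFor()); // PT0S
  }
}
```

With the system clock in place of the fixed one, the same measurement would depend on how long the test thread was delayed between the two reads.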

@kennknowles
Member

The flakiness is pretty close to perma-red. I think it is best to roll back while we fix it if #30322 does not work right away.

@Abacn
Contributor Author

Abacn commented Jan 27, 2025

testConsumedWorkItems_itemsSplitAcrossResponses (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.GrpcDirectGetWorkStreamTest) failed

org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds

testMultimapLazyIterateHugeEntriesResult (org.apache.beam.runners.dataflow.worker.windmill.state.WindmillStateInternalsTest) failed

java.lang.reflect.InaccessibleObjectException: Unable to make field private final java.lang.String java.lang.module.ModuleDescriptor.name accessible: module java.base does not "opens java.lang.module" to unnamed module @675d8c96
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:340)

testConsumedWorkItems_itemsSplitAcrossResponses (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.GrpcDirectGetWorkStreamTest) failed

org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds

testUnboundedSourcesDrain[0: [streamingEngine=false]] (org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest) failed

java.lang.AssertionError:
Expected: iterable containing [<0>]
but: no item was <0>

@Amar3tto
Collaborator

This issue was opened too long ago (October 12, 2023). We decided to track all failing tests in corresponding 'flaky_test' issues. For example, for Java PreCommit: #30683.

@github-actions github-actions bot added this to the 2.64.0 Release milestone Mar 12, 2025