Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race condition in the tests #8

Merged
merged 1 commit into from
Nov 19, 2024
Merged

Conversation

JamesWrigley
Copy link
Collaborator

Quick explanation of what the GC tests for RemoteChannels test does:

  1. Create RemoteChannels rr and fstore on worker 1 and worker 2 respectively from the master process. At this point only the master process knows about rr and fstore.
  2. Master process calls put!(fstore, rr), i.e. we remotecall worker 2 and put rr (which is owned worker 1 but is currently only known about by the master) into fstore.
  3. Remotecall into worker 1 and check that it knows about rr.

Step 3 should succeed despite us never previously explicitly communicating with worker 1, because serialize(::ClusterSerializer, ::RemoteChannel) will send a message to the owner of the RemoteChannel to inform them of its existence (see send_add_client()). This happens asynchronously in step 2, and on rare occasions worker 1 would not process that message before step 3, causing the test to fail.

Now we give the check 10s to succeed.

Should fix failures seen in CI: https://github.com/JuliaParallel/DistributedNext.jl/actions/runs/11880658178/job/33104808773

Test Summary:        | Pass  Total  Time
GC tests for Futures |   42     42  4.0s
GC tests for RemoteChannels: Test Failed at /home/runner/work/DistributedNext.jl/DistributedNext.jl/test/distributed_exec.jl:280
  Expression: remotecall_fetch((k->begin
                #= /home/runner/work/DistributedNext.jl/DistributedNext.jl/test/distributed_exec.jl:280 =#
                haskey(DistributedNext.PGRP.refs, k)
            end), wid1, rrid) == true
   Evaluated: false == true

Stacktrace:
 [1] top-level scope
   @ ~/work/DistributedNext.jl/DistributedNext.jl/test/distributed_exec.jl:256
 [2] macro expansion
   @ /opt/hostedtoolcache/julia/nightly/x86/share/julia/stdlib/v1.12/Test/src/Test.jl:1700 [inlined]
 [3] macro expansion
   @ ~/work/DistributedNext.jl/DistributedNext.jl/test/distributed_exec.jl:280 [inlined]
 [4] macro expansion
   @ /opt/hostedtoolcache/julia/nightly/x86/share/julia/stdlib/v1.12/Test/src/Test.jl:679 [inlined]
Test Summary:               | Pass  Fail  Total  Time
GC tests for RemoteChannels |   10     1     11  2.8s
ERROR: LoadError: Some tests did not pass: 10 passed, 1 failed, 0 errored, 0 broken.
in expression starting at /home/runner/work/DistributedNext.jl/DistributedNext.jl/test/distributed_exec.jl:255
in expression starting at /home/runner/work/DistributedNext.jl/DistributedNext.jl/test/runtests.jl:18
Package DistributedNext errored during testing
Error: Process completed with exit code 1.

I was able to reproduce the failure roughly every 30 test executions. Checked this fix by running the tests locally 100 times.

Quick explanation of what the `GC tests for RemoteChannels` test does:
1. Create `RemoteChannel`s `rr` and `fstore` on worker 1 and worker 2
   respectively from the master process. At this point only the master process
   knows about `rr` and `fstore`.
2. Master process calls `put!(fstore, rr)`, i.e. we remotecall worker 2 and put
   `rr` (which is owned worker 1 but is currently only known about by the
   master) into `fstore`.
3. Remotecall into worker 1 and check that it knows about `rr`.

Step 3 should succeed despite us never previously explicitly communicating with
worker 1, because `serialize(::ClusterSerializer, ::RemoteChannel)` will send a
message to the owner of the `RemoteChannel` to inform them of its existence (see
`send_add_client()`). This happens asynchronously in step 2, and on rare
occasions worker 1 would not process that message before step 3, causing the
test to fail.

Now we give the check 10s to succeed.
@JamesWrigley JamesWrigley self-assigned this Nov 17, 2024
@codecov-commenter
Copy link

codecov-commenter commented Nov 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.37%. Comparing base (e990442) to head (4c00a27).
Report is 26 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master       #8   +/-   ##
=======================================
  Coverage   86.37%   86.37%           
=======================================
  Files          10       10           
  Lines        1982     1982           
=======================================
  Hits         1712     1712           
  Misses        270      270           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jpsamaroo jpsamaroo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find! Gotta love how much of a pain it is to test async behavior...

@JamesWrigley JamesWrigley merged commit 2e52ffe into master Nov 19, 2024
11 checks passed
@JamesWrigley JamesWrigley deleted the tests-racecondition branch November 19, 2024 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants