Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread-safety improvements. Add tests with multiple threads. CI improvements. #122

Merged
merged 16 commits into from
Jan 26, 2025

Conversation

JamesWrigley
Copy link
Collaborator

@JamesWrigley JamesWrigley commented Jan 22, 2025

Co-authored by @IanButterworth

  • 2c5eee0 upstreams a fix for hangs from DistributedNext
  • 0928d68 includes some threadsafety changes from reviewing Moar threadsafe moar better #101 on DistributedNext
  • f7ba3ca enables multiple threads in CI to shake out problems
  • [multiple commits] reverts @spawn to @async to get tests passing. Moving back to @spawn will take more work, but it should be low priority because the main thread, where Distributed usually runs, will already be sticky anyway, so the gain is minor.
  • fbf8c12 (#122) is necessary to avoid each worker in those tests re-precompiling any stdlibs they use. Speeds up macos runs by ~1 minute
  • 97fc237 (#122) adds a sigusr1/siginfo profile before killing a worker that won't shut down, to aid debugging

Fixes #124

TODO

CC @jpsamaroo

@JamesWrigley JamesWrigley self-assigned this Jan 22, 2025
@JamesWrigley
Copy link
Collaborator Author

The test failures are coming from replacing @async with Threads.@spawn, we didn't merge that commit in DistributedNext because it would've required more testing: JuliaParallel/DistributedNext.jl#4 (comment)

I think it's best to be conservative with this update so I'm going to revert that commit.

Copy link

codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 82.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 79.19%. Comparing base (ff34f4c) to head (8494bbd).
Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
src/cluster.jl 76.92% 6 Missing ⚠️
src/managers.jl 90.00% 1 Missing ⚠️
src/messages.jl 50.00% 1 Missing ⚠️
src/process_messages.jl 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #122      +/-   ##
==========================================
- Coverage   79.29%   79.19%   -0.10%     
==========================================
  Files          10       10              
  Lines        1922     1923       +1     
==========================================
- Hits         1524     1523       -1     
- Misses        398      400       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JamesWrigley
Copy link
Collaborator Author

As discussed offline, added back the commits replacing @async with Threads.@spawn and fixed the CI failure in 877be4f.

@IanButterworth
Copy link
Member

Again discussed on slack. We decided to revert to @async for now, and get tests passing with -4,4 then go from there.

@IanButterworth IanButterworth changed the title Thread-safety improvements Thread-safety improvements. Test with multiple threads. Jan 24, 2025
JamesWrigley and others added 5 commits January 24, 2025 22:47
This should fix an exception seen in CI from the lingering timeout task:
```
 Test Summary:                                | Pass  Total  Time
Deserialization error recovery and include() |   11     11  3.9s
      From worker 4:	Unhandled Task ERROR: EOFError: read end of file
      From worker 4:	Stacktrace:
      From worker 4:	 [1] wait
      From worker 4:	   @ .\asyncevent.jl:159 [inlined]
      From worker 4:	 [2] sleep(sec::Float64)
      From worker 4:	   @ Base .\asyncevent.jl:265
      From worker 4:	 [3] (::DistributedNext.var"#34#37"{DistributedNext.Worker, Float64})()
      From worker 4:	   @ DistributedNext D:\a\DistributedNext.jl\DistributedNext.jl\src\cluster.jl:213
```
@IanButterworth IanButterworth force-pushed the threadsafety branch 2 times, most recently from f6f1d38 to 190a6b6 Compare January 25, 2025 03:09
@IanButterworth IanButterworth changed the title Thread-safety improvements. Test with multiple threads. Thread-safety improvements. Also test with multiple threads. Jan 25, 2025
@IanButterworth IanButterworth force-pushed the threadsafety branch 2 times, most recently from e7212a9 to dcd89eb Compare January 25, 2025 04:06
@IanButterworth IanButterworth changed the title Thread-safety improvements. Also test with multiple threads. Thread-safety improvements. Add tests with multiple threads. CI improvements. Jan 25, 2025
@IanButterworth IanButterworth force-pushed the threadsafety branch 7 times, most recently from ddb8ad4 to 836bed1 Compare January 25, 2025 23:24
@IanButterworth IanButterworth force-pushed the threadsafety branch 9 times, most recently from 3d6f794 to 54575e2 Compare January 26, 2025 05:16
@@ -274,7 +274,7 @@ end
const any_gc_flag = Threads.Condition()
function start_gc_msgs_task()
errormonitor(
Threads.@spawn begin
@async begin
Copy link
Member

@IanButterworth IanButterworth Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fixed the race (#124). It seems to have always been a @spawn hence why I didn't initially convert it to @async.

Copy link
Member

@IanButterworth IanButterworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to help the process along given review is required for merge

.github/workflows/ci.yml Show resolved Hide resolved
test/distributed_exec.jl Outdated Show resolved Hide resolved
@JamesWrigley JamesWrigley merged commit 1c7eb92 into master Jan 26, 2025
8 checks passed
@JamesWrigley JamesWrigley deleted the threadsafety branch January 26, 2025 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test when multithreading
2 participants