Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use SIMD.jl directly instead of LV.jl for fast_findmin() #84

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Moelf
Copy link
Member

@Moelf Moelf commented Oct 27, 2024

fix #83

Copy link

codecov bot commented Oct 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.91%. Comparing base (eb1e269) to head (b0464df).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #84      +/-   ##
==========================================
+ Coverage   71.52%   71.91%   +0.38%     
==========================================
  Files          17       17              
  Lines        1166     1182      +16     
==========================================
+ Hits          834      850      +16     
  Misses        332      332              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@graeme-a-stewart
Copy link
Member

I will test this, but initial indications are that the performance on Apple silicon is rather bad

@Moelf
Copy link
Member Author

Moelf commented Oct 28, 2024

hmm, I don't have access to Apple Silicon, but luckily we don't run JetReco on aarch64 in HEP yet

@graeme-a-stewart
Copy link
Member

Yeah, but it's where I develop! And people testing FCC workflows, for example, do use their Mac laptops...

@graeme-a-stewart
Copy link
Member

graeme-a-stewart commented Oct 28, 2024

Yeah, it's quite a significant regression on Apple silicon. On my M2...

main

~/.julia/dev/JetReconstruction/examples/ [main*] julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 170.6484634375 ± 4.057617987022607 μs
Lowest time per event 166.17292 μs

exorcise_lv

~/.julia/dev/JetReconstruction/examples/ [exorcise_lv] julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 201.44234250000002 ± 3.8225514883321954 μs
Lowest time per event 197.07167 μs

I need to test also on x86 and follow-up with further benchmarks of suggestions which came up in the Discourse thread.

@graeme-a-stewart
Copy link
Member

Benchmarks for x86_64, AMD Ryzen 7 5700G.

main

pc-sft-fa-0a :: dev/JetReconstruction/examples ‹main› » julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 184.5104475 ± 6.358035705629633 μs
Lowest time per event 176.97703 μs

exorcise_lv

pc-sft-fa-0a :: dev/JetReconstruction/examples ‹exorcise_lv› » julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 184.13992875 ± 6.5172848018939655 μs
Lowest time per event 177.66260999999997 μs

So it's really doing a good job on x86. We just have to find a way to make it also work well on Apple silicon.

@Moelf
Copy link
Member Author

Moelf commented Oct 28, 2024

do you ever have NaN in this function? because:

julia> @fastmath foldl(min, [1.0, NaN, 0.5])
0.5

julia> foldl(min, [1.0, NaN, 0.5])
NaN

@graeme-a-stewart
Copy link
Member

graeme-a-stewart commented Oct 28, 2024

No, there can't be a NaN there. I think if you can have NaN then fast math's assumptions are violated and bad things happen.

@Moelf
Copy link
Member Author

Moelf commented Oct 28, 2024

ok yeah if you don't have NaN and don't care about -0.0 vs. 0.0 I think you can use ~fastmath

@graeme-a-stewart
Copy link
Member

ok yeah if you don't have NaN and don't care about -0.0 vs. 0.0 I think you can use ~fastmath

There shouldn't be any zeros at all, so -0.0 vs. +0.0 should be moot!

@Moelf
Copy link
Member Author

Moelf commented Oct 29, 2024

honestly, I might need to buy a Mac Mini just so I can test aarch64 performance....

@graeme-a-stewart
Copy link
Member

honestly, I might need to buy a Mac Mini just so I can test aarch64 performance....

They are quite sweet little machines (especially now they have M4s) - I am thinking about getting one myself!

@graeme-a-stewart
Copy link
Member

graeme-a-stewart commented Oct 30, 2024

So I was thinking that if we do not find a generic solution, we can tolerate two ways to implement fast_findmin, switching on Sys.KERNEL (specifically isapple() and friends).

Though it would be good to understand if it's OS we should switch on or CPU arch - @aoanla would be able to help us run a few tests.

@carstenbauer
Copy link

honestly, I might need to buy a Mac Mini just so I can test aarch64 performance....

Poor man's workaround: ssh to a aarch64 MacOS GitHub Runner. ;)

@graeme-a-stewart graeme-a-stewart self-requested a review November 4, 2024 16:11
@graeme-a-stewart graeme-a-stewart marked this pull request as draft November 22, 2024 09:10
@graeme-a-stewart graeme-a-stewart added the Internals Changes that affect the internals of the package, but not the public API label Nov 22, 2024
@Moelf
Copy link
Member Author

Moelf commented Nov 22, 2024

btw, here's what LV.jl is doing:
https://gist.github.com/Moelf/7432073603f10d7718ef82552d5362dd

@Moelf
Copy link
Member Author

Moelf commented Jan 6, 2025

I know have a M4 mac mini to play with, and I can see the difference:

(thisPR) [1]> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3.gz
1×5 DataFrame
 Row │ File_path                          File                             mean_particles  n_samples  time_per_event
     │ Any                                String                           Int64           Int64      Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc…  events-pp-13TeV-20GeV.hepmc3.gz              -1         16         143.213

(main)> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3.gz
1×5 DataFrame
 Row │ File_path                          File                             mean_particles  n_samples  time_per_event
     │ Any                                String                           Int64           Int64      Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc…  events-pp-13TeV-20GeV.hepmc3.gz              -1         16         131.523

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Changes that affect the internals of the package, but not the public API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test removal of LoopVectorization.jl
3 participants