
KernelAbstractions support #147

Merged
merged 26 commits into from
Feb 3, 2025

Conversation

leios
Contributor

@leios leios commented Sep 7, 2023

The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.

Notes:

  1. This builds off of adding preliminary AMDGPU support #99 and should replace it entirely
  2. CUDA has been removed and replaced with KernelAbstractions and GPUArrays. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (ie isa(a, AbstractGPUArray)).
  3. If this is merged, other GPU types (Metal, AMD, Intel) will also be supported, but I can only test on AMD (and maybe Metal if I can get someone to try it with a Mac).
  4. I need to add in the changes from Non-atomic pairwise force summation kernels. #133. If there is something we are missing on the KernelAbstractions side, I can try to add it in, but I think we are good to go.
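
As a minimal illustration of note 2 (a sketch, not code from this PR; assumes GPUArrays.jl is available, and `on_gpu` is a made-up helper name), the boolean GPU flag can be replicated with a type check:

```julia
using GPUArrays: AbstractGPUArray

# Made-up helper replicating the old boolean GPU flag via a type check;
# every backend's device array type subtypes AbstractGPUArray.
on_gpu(a) = isa(a, AbstractGPUArray)

on_gpu(zeros(Float32, 4))   # false: a plain Array lives on the CPU
```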

@leios
Contributor Author

leios commented Sep 7, 2023

Ah, I guess while I'm here, I'll briefly explain the differences with CUDA syntactically:

  1. Indexing is easier: @index(Global / Group / Local, Linear / NTuple / CartesianIndex) vs (blockIdx().x - 1) * blockDim().x + threadIdx().x for CUDA
  2. Kernels run off of an ndrange for the range of elements (OpenCL inspired syntax)
  3. Launching kernels requires configuration with a backend, see: https://github.com/leios/Molly.jl/blob/KA_support/src/kernels.jl#L21
  4. Certain functions now execute on the backend: CUDA.zeros(...) -> zeros(backend, args...)
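
To make those four points concrete, here is a toy copy kernel (a sketch assuming KernelAbstractions.jl is installed; not a kernel from this PR):

```julia
using KernelAbstractions

# Point 1: @index replaces the manual blockIdx/blockDim/threadIdx arithmetic.
@kernel function copy_kernel!(dst, @Const(src))
    i = @index(Global, Linear)
    dst[i] = src[i]
end

src = rand(Float32, 1024)
dst = similar(src)

# Point 3: launching requires configuring with a backend (CPU here; a
# CuArray or ROCArray would return its GPU backend from get_backend).
backend = get_backend(src)
kernel! = copy_kernel!(backend, 64)     # 64 = workgroup size
kernel!(dst, src; ndrange=length(src))  # Point 2: ndrange sets the element range
KernelAbstractions.synchronize(backend)
# dst now matches src
```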

The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl I gotta figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698

@jgreener64
Collaborator

Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package.

I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.

@leios
Contributor Author

leios commented Sep 8, 2023

I think I can finish this up today or else early next week (emphasis on think), but to quickly answer the questions:

  1. KA essentially just writes vendor-specific code (ie CUDA) from the generic code input, so if we don't have identical performance to CUDA, then that's a bug. I'll do the performance testing similar to Non-atomic pairwise force summation kernels. #133 once the code is cleaned up.
  2. Enzyme should also not be an issue; however, there are some reports of error handling being an issue: Enzyme + KA Stalls on Error instead of reporting it EnzymeAD/Enzyme.jl#365

@jgreener64
Collaborator

Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g. warpsize() and sync_warp()?

@leios
Contributor Author

leios commented Sep 8, 2023

That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also just need to figure out what that corresponds to on parallel CPU.

I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html
So they are available; it's just a matter of exposing them in KA and figuring out what they correspond to on different backends.

Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere. I didn't do that in this PR because that brings up design questions related to Molly internals.

@jgreener64
Collaborator

I didn't do that in this PR because that brings up design questions related to Molly internals.

Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit.

Judging from discussion on the linked PR there is not currently warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.

@leios
Contributor Author

leios commented Sep 9, 2023

Ok, so a couple of quick notes here:

  1. There are a few host calls that are not yet supported by AMDGPU (such as findall). My understanding was that such calls would eventually be ported to GPUArrays, but I don't think that has happened yet. Note that some of the stalling here is because we are waiting to get KA into GPUArrays (Transition GPUArrays to KernelAbstractions JuliaGPU/GPUArrays.jl#451). At least for findall, the kernel is not that complex: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/indexing.jl#L23, so we could put it into AMDGPU or something for now; however, we are stuck on an older version of AMDGPU due to some package conflicts. The quick fix would be to do it the ol' fashioned way and just stick the necessary kernels in Molly under a file like kernel_hacks.jl or something. Such issues were also what stalled adding preliminary AMDGPU support #99.
  2. Non-atomic pairwise force summation kernels. #133 seems to only use warpsize and warp_sync for warp-level semantics. The KA kernel would probably get the warpsize on the host and then pass it in as a parameter. warp_sync is a bit more interesting because, well, at least in the old days warps didn't need any synchronizing. It seems that things changed in Volta and most people missed the memo. Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel. We could also add in warp-level semantics to KA, but that would take some time to propagate to all the independent GPU APIs and (as mentioned in 1) we are kinda stuck on older versions of AMDGPU and CUDA because of compatibility with other packages.
  3. I am realizing that there is a greater conflict with this PR. Namely, I don't know if I have the bandwidth to do any sort of maintenance on Molly after this PR is in. I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave. On the other hand, getting this to work on AMD would be great and really useful. Let me think on that.
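
For point 2, a hedged sketch of passing the warp size in from the host (lane_kernel! is illustrative, not from #133; on the CUDA backend the host query would be something like CUDA.warpsize(device())):

```julia
using KernelAbstractions

# Toy kernel: each work-item records its lane index within a warp-sized tile.
# The warp size ws arrives as an ordinary kernel argument queried on the host.
@kernel function lane_kernel!(out, ws)
    i = @index(Global, Linear)
    out[i] = (i - 1) % ws + 1
end

ws = 32                      # host-side query on a GPU backend; fixed here
out = zeros(Int, 64)
backend = get_backend(out)
lane_kernel!(backend, 64)(out, ws; ndrange=length(out))
KernelAbstractions.synchronize(backend)
# out[1] == 1, out[32] == 32, out[33] == 1 (lane indices wrap per tile)
```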

@jgreener64
Collaborator

Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel.

That is okay.

I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave.

I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.

@leios
Contributor Author

leios commented Sep 10, 2023

Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get the KA in GPUArrays first and then use that to implement any missing features on the GPUArrays level.

Would it be better for me to separate this PR then? Maybe one doing the generic Array stuff and then another with the KA support?

@jgreener64
Collaborator

I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support.

If KA is here for the long haul then there is a benefit to switching the kernels even if only CUDA works currently. Because then when changes happen elsewhere, AMDGPU will work without any changes required in Molly.


codecov bot commented Jun 28, 2024

Codecov Report

Attention: Patch coverage is 17.70833% with 316 lines in your changes missing coverage. Please review.

Project coverage is 67.28%. Comparing base (636c9da) to head (1ccb627).
Report is 41 commits behind head on master.

Files with missing lines Patch % Lines
src/kernels.jl 0.00% 187 Missing ⚠️
src/interactions/implicit_solvent.jl 6.12% 46 Missing ⚠️
ext/MollyCUDAExt.jl 0.00% 38 Missing ⚠️
src/force.jl 8.33% 11 Missing ⚠️
src/neighbors.jl 41.17% 10 Missing ⚠️
src/spatial.jl 33.33% 8 Missing ⚠️
src/types.jl 85.71% 5 Missing ⚠️
src/coupling.jl 42.85% 4 Missing ⚠️
src/energy.jl 33.33% 4 Missing ⚠️
ext/MollyPythonCallExt.jl 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
+ Coverage   67.01%   67.28%   +0.26%     
==========================================
  Files          35       37       +2     
  Lines        5526     5489      -37     
==========================================
- Hits         3703     3693      -10     
+ Misses       1823     1796      -27     


@leios
Contributor Author

leios commented Sep 30, 2024

Getting around to this and noticed a bunch of segfaults in the CPU tests. I then found that there's a strange conflict between AMDGPU and Molly. Even on the master branch, this script will create a segfault:

using Molly

n_atoms = 100
atom_mass = 10.0f0u"g/mol"
boundary = CubicBoundary(2.0f0u"nm")
temp = 100.0f0u"K"
cpu_coords = place_atoms(n_atoms, boundary; min_dist=0.3u"nm")
cpu_atoms = Array([Atom(mass=atom_mass, σ=0.3f0u"nm", ϵ=0.2f0u"kJ * mol^-1") for i in 1:n_atoms])
cpu_velocities = Array([random_velocity(atom_mass, temp) for i in 1:n_atoms])
cpu_simulator = VelocityVerlet(dt=0.002f0u"ps")

cpu_sys = System(
    atoms=cpu_atoms,
    coords=cpu_coords,
    boundary=boundary,
    velocities=cpu_velocities,
    pairwise_inters=(LennardJones(),),
    loggers=(
        temp=TemperatureLogger(typeof(1.0f0u"K"), 10),
        coords=CoordinateLogger(typeof(1.0f0u"nm"), 10),
    ),
)

simulate!(deepcopy(cpu_sys), cpu_simulator, 20) # Compile function
simulate!(cpu_sys, cpu_simulator, 2000)

But only if AMDGPU is loaded before include("cpu.jl"). Not sure how to go about debugging this one, but I'm writing it down so it is documented somewhere. The segfault:


julia> include("tmp/cpu.jl")
System with 100 atoms, boundary CubicBoundary{Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[2.0f0 nm, 2.0f0 nm, 2.0f0 nm])

julia> 
[leios@noema Molly.jl]$ julia --project -t 12
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU

julia> include("tmp/cpu.jl")
[1727708527.809644] [noema:37885:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1727708527.809644] [noema:37885:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:0:37894] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:1:37897] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809644] [noema:37885:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:3:37893] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809649] [noema:37885:2]           debug.c:1297 UCX  WARN  ucs_debug_disable_signal: signal 1 was not set in ucs
[noema:37885:2:37892] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809644] [noema:37885:4]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:4:37889] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:6:37890] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809728] [noema:37885:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:7:37888] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:8:37891] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:9:37895] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809730] [noema:37885:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1727708527.809735] [noema:37885:5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:5:37898] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809741] [noema:37885:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:10:37896] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
==== backtrace (tid:  37894) ====
 0 0x000000000004d212 ucs_event_set_fd_get()  ???:0
 1 0x000000000004d3dd ucs_event_set_fd_get()  ???:0
 2 0x000000000003d1d0 __sigaction()  ???:0
 3 0x00000000000845d4 ijl_process_events()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jl_uv.c:277
 4 0x0000000000097f8d ijl_task_get_next()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/partr.c:524
 5 0x0000000001cb0bd8 julia_poptask_75383()  ./task.jl:985
 6 0x0000000001cb0bd8 julia_poptask_75383()  ./task.jl:987
 7 0x0000000000997f72 julia_wait_74665()  ./task.jl:994
 8 0x0000000000962c1c julia_task_done_hook_75296()  ./task.jl:675
 9 0x0000000001443a97 jfptr_task_done_hook_75297.1()  :0
10 0x0000000000046a0e _jl_invoke()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894
11 0x0000000000069c17 jl_apply()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982
12 0x0000000000069d9e start_task()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:1249
=================================

[37885] signal (11.-6): Segmentation fault
in expression starting at /home/leios/projects/CESMIX/Molly.jl/tmp/cpu.jl:25
ijl_process_events at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jl_uv.c:277
ijl_task_get_next at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/partr.c:524
poptask at ./task.jl:985
wait at ./task.jl:994
task_done_hook at ./task.jl:675
jfptr_task_done_hook_75297.1 at /home/leios/builds/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_finish_task at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:320
start_task at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:1249
Allocations: 39536048 (Pool: 39464158; Big: 71890); GC: 46
Segmentation fault (core dumped)

st:

(Molly) pkg> st
Project Molly v0.21.1
Status `~/projects/CESMIX/Molly.jl/Project.toml`
  [a9b6321e] Atomix v0.1.0
⌅ [a963bdd2] AtomsBase v0.3.5
  [a3e0e189] AtomsCalculators v0.2.2
  [de9282ab] BioStructures v4.2.0
⌃ [052768ef] CUDA v5.4.3
  [69e1c6dd] CellListMap v0.9.6
  [082447d4] ChainRules v1.71.0
  [d360d2e6] ChainRulesCore v1.25.0
  [46823bd8] Chemfiles v0.10.41
  [861a8166] Combinatorics v1.0.2
  [864edb3b] DataStructures v0.18.20
  [b4f34e82] Distances v0.10.11
  [31c24e10] Distributions v0.25.112
⌅ [7da242da] Enzyme v0.12.36
  [8f5d6c58] EzXML v1.2.0
  [cc61a311] FLoops v0.2.2
  [f6369f11] ForwardDiff v0.10.36
  [86223c79] Graphs v1.12.0
  [5ab0869b] KernelDensity v0.6.9
  [b8a86587] NearestNeighbors v0.4.20
  [7b2266bf] PeriodicTable v1.2.1
  [189a3867] Reexport v1.2.2
⌅ [64031d72] SimpleCrystals v0.2.0
  [90137ffa] StaticArrays v1.9.7
  [1986cc42] Unitful v1.21.0
  [a7773ee8] UnitfulAtomic v1.0.0
  [f31437dd] UnitfulChainRules v0.1.2
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [e88e6eb3] Zygote v0.6.71
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`

Note that using a single thread "fixes" the issue. It seems to be a UCX / MPI issue, but I am not loading them and neither is in the Manifest.

@vchuravy

vchuravy commented Sep 30, 2024

@leios
Contributor Author

leios commented Sep 30, 2024

The fix mentioned there seems to work:

[leios@noema Molly.jl]$ export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"
[leios@noema Molly.jl]$ julia --project -t 12
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU

julia> include("tmp/cpu.jl")
System with 100 atoms, boundary CubicBoundary{Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[2.0f0 nm, 2.0f0 nm, 2.0f0 nm])

julia> 

Note that st -m shows no MPI or UCX:

(Molly) pkg> st -m
Project Molly v0.21.1
Status `~/projects/CESMIX/Molly.jl/Manifest.toml`
  [621f4979] AbstractFFTs v1.5.0
  [7d9f7c33] Accessors v0.1.38
  [79e6a3ab] Adapt v4.0.4
  [66dad0bd] AliasTables v1.1.3
  [dce04be8] ArgCheck v2.3.0
  [ec485272] ArnoldiMethod v0.4.0
  [a9b6321e] Atomix v0.1.0
⌅ [a963bdd2] AtomsBase v0.3.5
  [a3e0e189] AtomsCalculators v0.2.2
  [13072b0f] AxisAlgorithms v1.1.0
  [ab4f0b2a] BFloat16s v0.5.0
  [198e06fe] BangBang v0.4.3
  [9718e550] Baselet v0.1.1
  [47718e42] BioGenerics v0.1.5
  [de9282ab] BioStructures v4.2.0
  [3c28c6f8] BioSymbols v5.1.3
  [fa961155] CEnum v0.5.0
⌃ [052768ef] CUDA v5.4.3
  [1af6417a] CUDA_Runtime_Discovery v0.3.5
  [69e1c6dd] CellListMap v0.9.6
  [082447d4] ChainRules v1.71.0
  [d360d2e6] ChainRulesCore v1.25.0
  [46823bd8] Chemfiles v0.10.41
  [944b1d66] CodecZlib v0.7.6
  [3da002f7] ColorTypes v0.11.5
  [5ae59095] Colors v0.12.11
  [861a8166] Combinatorics v1.0.2
  [bbf7d656] CommonSubexpressions v0.3.1
  [34da2185] Compat v4.16.0
  [a33af91c] CompositionsBase v0.1.2
  [187b0558] ConstructionBase v1.5.8
  [6add18c4] ContextVariablesX v0.1.3
  [a8cc5b0e] Crayons v4.1.1
  [9a962f9c] DataAPI v1.16.0
  [a93c6f00] DataFrames v1.7.0
  [864edb3b] DataStructures v0.18.20
  [e2d170a0] DataValueInterfaces v1.0.0
  [244e2a9f] DefineSingletons v0.1.2
  [163ba53b] DiffResults v1.1.0
  [b552c78f] DiffRules v1.15.1
  [b4f34e82] Distances v0.10.11
  [31c24e10] Distributions v0.25.112
  [ffbed154] DocStringExtensions v0.9.3
⌅ [7da242da] Enzyme v0.12.36
⌅ [f151be2c] EnzymeCore v0.7.8
  [e2ba6199] ExprTools v0.1.10
  [8f5d6c58] EzXML v1.2.0
  [7a1cc6ca] FFTW v1.8.0
  [cc61a311] FLoops v0.2.2
  [b9860ae5] FLoopsBase v0.1.1
  [1a297f60] FillArrays v1.13.0
  [53c48c17] FixedPointNumbers v0.8.5
  [1fa38f19] Format v1.3.7
  [f6369f11] ForwardDiff v0.10.36
  [0c68f7d7] GPUArrays v10.3.1
  [46192b85] GPUArraysCore v0.1.6
⌅ [61eb1bfa] GPUCompiler v0.26.7
  [86223c79] Graphs v1.12.0
  [34004b35] HypergeometricFunctions v0.3.24
  [7869d1d1] IRTools v0.4.14
  [d25df0c9] Inflate v0.1.5
  [22cec73e] InitialValues v0.3.1
  [842dd82b] InlineStrings v1.4.2
  [a98d9a8b] Interpolations v0.15.1
  [3587e190] InverseFunctions v0.1.17
  [41ab1584] InvertedIndices v1.3.0
  [92d709cd] IrrationalConstants v0.2.2
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.6.0
  [b14d175d] JuliaVariables v0.2.4
⌃ [63c18a36] KernelAbstractions v0.9.26
  [5ab0869b] KernelDensity v0.6.9
⌅ [929cbde3] LLVM v8.1.0
  [8b046642] LLVMLoopInfo v1.0.0
  [b964fa9f] LaTeXStrings v1.3.1
  [2ab3a3ac] LogExpFunctions v0.3.28
  [d8e11817] MLStyle v0.4.17
  [1914dd2f] MacroTools v0.5.13
  [128add7d] MicroCollections v0.2.0
  [e1d29d7a] Missings v1.2.0
  [5da4648a] NVTX v0.3.4
  [77ba4419] NaNMath v1.0.2
  [71a1bf82] NameResolution v0.1.5
  [b8a86587] NearestNeighbors v0.4.20
  [d8793406] ObjectFile v0.4.2
  [6fe1bfb0] OffsetArrays v1.14.1
  [bac558e1] OrderedCollections v1.6.3
  [90014a1f] PDMats v0.11.31
  [d96e819e] Parameters v0.12.3
  [7b2266bf] PeriodicTable v1.2.1
  [2dfb63ee] PooledArrays v1.4.3
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [8162dcfd] PrettyPrint v0.2.0
  [08abe8d2] PrettyTables v2.4.0
  [92933f4c] ProgressMeter v1.10.2
  [43287f4e] PtrArrays v1.2.1
  [1fd47b50] QuadGK v2.11.1
  [74087812] Random123 v1.7.0
  [e6cf234a] RandomNumbers v1.6.0
  [c84ed2f1] Ratios v0.4.5
  [c1ae055f] RealDot v0.1.0
  [3cdcf5f2] RecipesBase v1.3.4
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [79098fc4] Rmath v0.8.0
  [6c6a2e73] Scratch v1.2.1
  [91c51154] SentinelArrays v1.4.5
  [efcf1570] Setfield v1.1.1
⌅ [64031d72] SimpleCrystals v0.2.0
  [699a6c99] SimpleTraits v0.9.4
  [a2af1166] SortingAlgorithms v1.2.1
  [dc90abb0] SparseInverseSubset v0.1.2
  [276daf66] SpecialFunctions v2.4.0
  [171d559e] SplittablesBase v0.1.15
  [90137ffa] StaticArrays v1.9.7
  [1e83bf80] StaticArraysCore v1.4.3
  [82ae8749] StatsAPI v1.7.0
  [2913bbd2] StatsBase v0.34.3
  [4c63d2b9] StatsFuns v1.3.2
  [892a3eda] StringManipulation v0.4.0
  [09ab397b] StructArrays v0.6.18
  [53d494c1] StructIO v0.3.1
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.0
  [1c621080] TestItems v1.0.0
  [a759f4b9] TimerOutputs v0.5.24
  [3bb67fe8] TranscodingStreams v0.11.2
  [28d57a85] Transducers v0.4.82
  [3a884ed6] UnPack v1.0.2
  [1986cc42] Unitful v1.21.0
  [a7773ee8] UnitfulAtomic v1.0.0
  [f31437dd] UnitfulChainRules v0.1.2
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [efce3f68] WoodburyMatrices v1.0.0
  [e88e6eb3] Zygote v0.6.71
  [700de1a5] ZygoteRules v0.2.5
⌅ [4ee394cb] CUDA_Driver_jll v0.9.2+0
⌅ [76a88914] CUDA_Runtime_jll v0.14.1+0
  [78a364fa] Chemfiles_jll v0.10.4+0
⌅ [7cc45869] Enzyme_jll v0.0.148+0
  [f5851436] FFTW_jll v3.3.10+1
  [1d5cc7b8] IntelOpenMP_jll v2024.2.1+0
  [9c1d0b0a] JuliaNVTXCallbacks_jll v0.2.1+0
⌅ [dad2f222] LLVMExtra_jll v0.0.31+0
  [94ce4f54] Libiconv_jll v1.17.0+0
  [856f044c] MKL_jll v2024.2.0+0
  [e98f9f5b] NVTX_jll v3.1.0+2
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [f50d1b31] Rmath_jll v0.5.1+0
  [02c8fc9c] XML2_jll v2.13.3+0
  [1317d2d5] oneTBB_jll v2021.12.0+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
  [4607b0f0] SuiteSparse
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.1.0+0
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+4
  [05823500] OpenLibm_jll v0.8.1+2
  [bea87d4a] SuiteSparse_jll v7.2.1+1
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated -m`

@vchuravy

vchuravy commented Oct 1, 2024

Wild... What does Libc.dllist() show? Who loads this darn library?

@leios
Contributor Author

leios commented Oct 1, 2024

julia> Libc.Libdl.dllist()
32-element Vector{String}:
 "linux-vdso.so.1"
 "/usr/lib/libdl.so.2"
 "/usr/lib/libpthread.so.0"
 "/usr/lib/libc.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/libjulia.so.1.10"
 "/lib64/ld-linux-x86-64.so.2"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgcc_s.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libopenlibm.so"
 "/usr/lib/libstdc++.so.6"
 "/usr/lib/libm.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libjulia-internal.so.1.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libunwind.so.8"
 "/usr/lib/librt.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libz.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libatomic.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libjulia-codegen.so.1.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libLLVM-15jl.so"
 "/home/leios/builds/julia-1.10.2/lib/julia/sys.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libpcre2-8.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgmp.so.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmpfr.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgfortran.so.5"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libquadmath.so.0"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libopenblas64_.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libblastrampoline.so.5"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedcrypto.so.7"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedtls.so.14"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedx509.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libssh2.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgit2.so.1.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libnghttp2.so.14"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libcurl.so.4"

Is it a Linux thing, like libpthread?

# This triggers an error but it isn't printed
# See https://discourse.julialang.org/t/error-handling-in-cuda-kernels/79692
# for how to throw a more meaningful error
error("wrong force unit returned, was expecting $F but got $(unit(f[1]))")


The interpolation here is particularly tricky. I would avoid that if at all possible.

Contributor Author

@leios leios Oct 21, 2024


To be clear: are you referring to the error(...) call in an inlined function within an @kernel? Or carrying the units through to this stage in the first place?


error("Oops") is fine; error("Oops, $F") sadly is not, since string interpolation is really tough on the GPU compiler.
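
A small sketch of the workaround (illustrative names, not code from this branch): keep the kernel-side message a compile-time constant, and raise any interpolated, descriptive error on the host.

```julia
# GPU-safe: the message is a compile-time constant string.
kernel_side_check(ok::Bool) = ok || error("wrong force unit returned")

# Host-side: string interpolation is fine here, outside the kernel.
host_side_check(expected, got) =
    expected == got || error("wrong force unit returned, was expecting $expected but got $got")
```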

Collaborator


We can remove that, no problem.

@leios
Contributor Author

leios commented Oct 24, 2024

Right, having an issue right now with the zygote tests. Not really sure how to go about debugging it, so I'll paste it here and then think about it.

Differentiable simulation: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/zygote.jl:30
  Got exception outside of a @test
  MethodError: _pullback(::Zygote.Context{false}, ::typeof(Base.Broadcast.broadcasted), ::CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, ::typeof(mass), ::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}) is ambiguous.
  
  Candidates:
    _pullback(__context__::ZygoteRules.AContext, var"639"::typeof(Base.Broadcast.broadcasted), var"640"::AbstractGPUArrayStyle, f, args...)
      @ Zygote ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:66
    _pullback(__context__::ZygoteRules.AContext, var"387"::typeof(Base.Broadcast.broadcasted), var"388"::Base.Broadcast.AbstractArrayStyle, f::typeof(mass), args...)
      @ Molly ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:66
  
  Possible fix, define
    _pullback(::ZygoteRules.AContext, ::typeof(Base.Broadcast.broadcasted), ::AbstractGPUArrayStyle, ::typeof(mass), ::Vararg{Any})
  
  Stacktrace:
    [1] _apply(::Function, ::Vararg{Any})
      @ Core ./boot.jl:838
    [2] adjoint
      @ ~/.julia/packages/Zygote/NRp5C/src/lib/lib.jl:203 [inlined]
    [3] _pullback
      @ ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:67 [inlined]
    [4] broadcasted
      @ ./broadcast.jl:1341 [inlined]
    [5] #System#3
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:565 [inlined]
    [6] _pullback(::Zygote.Context{false}, ::Molly.var"##System#3", ::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}, ::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, ::CubicBoundary{Float64}, ::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, ::Vector{Any}, ::Nothing, ::Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, ::Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Float32, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}}, ::Tuple{}, ::Tuple{}, ::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, ::Tuple{}, ::Unitful.FreeUnits{(), NoDims, nothing}, ::Unitful.FreeUnits{(), NoDims, nothing}, ::Float64, ::Nothing, ::Type{System})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
    [7] System
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:485 [inlined]
    [8] _pullback(::Zygote.Context{false}, ::typeof(Core.kwcall), ::@NamedTuple{atoms::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}, coords::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, boundary::CubicBoundary{Float64}, velocities::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, pairwise_inters::Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, specific_inter_lists::Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Float32, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}}, general_inters::Tuple{}, neighbor_finder::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, force_units::Unitful.FreeUnits{(), NoDims, nothing}, energy_units::Unitful.FreeUnits{(), NoDims, nothing}}, ::Type{System})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
    [9] loss
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:140 [inlined]
   [10] _pullback(::Zygote.Context{false}, ::var"#loss#39"{UnionAll, Bool, Bool, Bool, Bool, DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, Vector{Float64}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Int32, 1, CUDA.DeviceMemory}, Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, Vector{SVector{3, ForwardDiff.Dual{Nothing, Float64, 1}}}, Vector{SVector{3, ForwardDiff.Dual{Nothing, Float64, 1}}}, Vector{SVector{3, Float64}}, Vector{SVector{3, Float64}}, VelocityVerlet{Float64, RescaleThermostat{Float64}}, CubicBoundary{Float64}, Float64, Int64, Int64, var"#mean_min_separation#28"{var"#abs2_vec#27"}}, ::Float64, ::Float64)
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
   [11] pullback(::Function, ::Zygote.Context{false}, ::Float64, ::Vararg{Float64})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:90
   [12] pullback(::Function, ::Float64, ::Float64)
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:88
   [13] gradient(::Function, ::Float64, ::Vararg{Float64})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:147
   [14] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:202 [inlined]
   [15] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [16] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:31
   [17] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [18] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:110
   [19] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [20] top-level scope
      @ none:6
   [21] eval
      @ ./boot.jl:385 [inlined]
   [22] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [23] _start()
      @ Base ./client.jl:552

@leios
Contributor Author

leios commented Oct 24, 2024

As a note, it looks like #182 is close-ish to being done? If so, it might be best to rework this PR once those changes are in.

As it stands, this PR can work as a branch for any non-CUDA GPU (As long as the user does not need differentiable sims).

@jgreener64
Collaborator

Thanks for keeping going on this. Yes, #182 will get merged soon; I was waiting on full Enzyme GPU broadcasting support to avoid breaking GPU differentiable simulations, but I might just merge it and wait for that to arrive later. Consequently I wouldn't worry about the Zygote error.

I'm happy to help update this PR after #182, I realise that the ground has been moving under you.

@leios
Contributor Author

leios commented Oct 24, 2024

Yeah, no worries. I should go ahead and close the #99 PR since this one supersedes it.

On my end, I am happy to wait a little longer and rebase when you are happy with #182. I have other projects to work on in the meantime.

Could you link the Enzyme issue? Is it this one? EnzymeAD/Enzyme.jl#1672

@jgreener64
Collaborator

Okay great. The example I posted on JuliaGPU/CUDA.jl#2471 doesn't work as it stands; that and similar issues around reduction and broadcasting are the problem.

@vchuravy

@jgreener64 if there isn't an open issue on Enzyme.jl associated with it, it is very likely that Billy and I will lose track of it.

@jgreener64
Collaborator

jgreener64 commented Oct 24, 2024

Billy has favoured moving CUDA-specific issues to CUDA.jl, e.g. EnzymeAD/Enzyme.jl#1454 (comment). In either case I can put together some MWEs of what is erroring over the next few days and make sure they are tracked in issues.

@jgreener64
Collaborator

#182 is now merged.

@jgreener64
Collaborator

The no-neighbor-list potential energy kernel is the same as the neighbor-list kernel; the NoNeighborList object handles the iteration over pairs.
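The idea, roughly, is that an all-pairs object can expose the same indexable interface as a neighbor list, so one kernel serves both cases. A hypothetical CPU-side sketch of that pattern (the `AllPairs` name and layout are illustrative, not Molly's actual `NoNeighborList`):

```julia
# Illustrative stand-in for a NoNeighborList-style object: lazily maps
# a linear index k to the k-th (i, j) pair with i < j, so kernels can
# loop `for k in 1:length(pairs)` without materializing the pair list.
struct AllPairs
    n::Int  # number of atoms
end

Base.length(ap::AllPairs) = ap.n * (ap.n - 1) ÷ 2

function Base.getindex(ap::AllPairs, k::Int)
    # walk down the rows of the upper triangle until k lands in row i
    i, remaining = 1, k
    while remaining > ap.n - i
        remaining -= ap.n - i
        i += 1
    end
    return (i, i + remaining)
end
```

With this, a kernel written against "a list of pairs" runs unchanged whether it is handed a real neighbor list or an `AllPairs`-style object.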

I made some changes and fixes to the branch, hope you don't mind. I also reviewed it as I went and I think it is looking strong.

Next week I'll work on fixing the tests, since I broke something during my changes, and we can merge.

@leios
Contributor Author

leios commented Jan 24, 2025

Ah, I see what happened. Thanks for looking at this. Please take your time with the review. It's a lot of changed lines over a long period of time.

I'm happy to let you do your magic for a bit. Let me know when / if you want me to look at it again for final touches.

@jgreener64
Collaborator

Okay I think this is ready from my end.

I did notice wrong results with forces on Mac but that can be dealt with later.

Would you be able to run the tests on an AMDGPU? I don't have access to one.

@leios
Contributor Author

leios commented Jan 30, 2025

Great work! Seems like there was a lot of cleanup.

AMD tests are broken somehow, but I have seen the error before, so I think it's manageable to fix tomorrow.

It is breaking here (neighbors.jl L175):

    pairs = findall(nf.neighbors)

So there is something funky about the findall(bools::...) function in AMDGPU for this case.
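If it helps narrow things down, the stack trace suggests the failure path can be hit without Molly at all; a hypothetical minimal reproducer (the array shape and fill are made up, and this needs a working ROCm setup, so it is a sketch rather than something I have run in this exact form):

```julia
using AMDGPU  # assumes a functional ROCm install

# DistanceNeighborFinder stores neighbor flags as a Bool matrix;
# findall on a GPU Bool array lowers to a cumsum-based scan kernel
# (AMDGPU.partial_scan in the trace above), which is what throws.
nb = ROCArray(rand(Bool, 512, 512))
pairs = findall(nb)  # CartesianIndex{2} positions of the `true` entries
```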

Error:

Spatial: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/basic.jl:1
  Got exception outside of a @test
  GPU Kernel Exception
  Stacktrace:
    [1] error(s::String)
      @ Base ./error.jl:35
    [2] throw_if_exception(dev::HIPDevice)
      @ AMDGPU ~/.julia/packages/AMDGPU/MtLT2/src/exception_handler.jl:125
    [3] synchronize(stm::HIPStream; blocking::Bool, stop_hostcalls::Bool)
      @ AMDGPU ~/.julia/packages/AMDGPU/MtLT2/src/highlevel.jl:40
    [4] synchronize(stm::HIPStream)
      @ AMDGPU ~/.julia/packages/AMDGPU/MtLT2/src/highlevel.jl:36
    [5] device_synchronize()
      @ AMDGPU.HIP ~/.julia/packages/AMDGPU/MtLT2/src/hip/HIP.jl:90
    [6] HIPModule
      @ ~/.julia/packages/AMDGPU/MtLT2/src/hip/module.jl:5 [inlined]
    [7] hiplink(job::GPUCompiler.CompilerJob, compiled::@NamedTuple{obj::Vector{UInt8}, entry::String, global_hostcalls::Vector{Symbol}})
      @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/MtLT2/src/compiler/codegen.jl:231
    [8] actual_compilation(cache::Dict{Any, AMDGPU.HIP.HIPFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.HIPCompilerParams}, compiler::typeof(AMDGPU.Compiler.hipcompile), linker::typeof(AMDGPU.Compiler.hiplink))
      @ GPUCompiler ~/.julia/packages/GPUCompiler/Nxf8r/src/execution.jl:262
    [9] cached_compilation(cache::Dict{Any, AMDGPU.HIP.HIPFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.HIPCompilerParams}, compiler::Function, linker::Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/Nxf8r/src/execution.jl:151
   [10] macro expansion
      @ ~/.julia/packages/AMDGPU/MtLT2/src/compiler/codegen.jl:161 [inlined]
   [11] macro expansion
      @ ./lock.jl:267 [inlined]
   [12] hipfunction(f::typeof(AMDGPU.partial_scan), tt::Type{Tuple{typeof(Base.add_sum), AMDGPU.Device.ROCDeviceVector{Int64, 1}, AMDGPU.Device.ROCDeviceVector{Bool, 1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{0, Tuple{}}, CartesianIndices{0, Tuple{}}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Int64, Nothing, Val{true}}}; kwargs::@Kwargs{})
      @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/MtLT2/src/compiler/codegen.jl:155
   [13] hipfunction(f::typeof(AMDGPU.partial_scan), tt::Type{Tuple{typeof(Base.add_sum), AMDGPU.Device.ROCDeviceVector{Int64, 1}, AMDGPU.Device.ROCDeviceVector{Bool, 1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{0, Tuple{}}, CartesianIndices{0, Tuple{}}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Int64, Nothing, Val{true}}})
      @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/MtLT2/src/compiler/codegen.jl:154
   [14] macro expansion
      @ ~/.julia/packages/AMDGPU/MtLT2/src/highlevel.jl:153 [inlined]
   [15] scan!(f::Function, output::ROCArray{Int64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, input::ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}; dims::Int64, init::Nothing, neutral::Int64)
      @ AMDGPU ~/.julia/packages/AMDGPU/MtLT2/src/kernels/accumulate.jl:52
   [16] scan!
      @ ~/.julia/packages/AMDGPU/MtLT2/src/kernels/accumulate.jl:31 [inlined]
   [17] _accumulate!
      @ ~/.julia/packages/AMDGPU/MtLT2/src/kernels/accumulate.jl:17 [inlined]
   [18] #accumulate!#897
      @ ./accumulate.jl:348 [inlined]
   [19] accumulate!
      @ ./accumulate.jl:345 [inlined]
   [20] _cumsum!
      @ ./accumulate.jl:63 [inlined]
   [21] #cumsum!#889
      @ ./accumulate.jl:53 [inlined]
   [22] cumsum!
      @ ./accumulate.jl:51 [inlined]
   [23] cumsum(A::ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}; dims::Int64)
      @ Base ./accumulate.jl:115
   [24] cumsum
      @ ./accumulate.jl:113 [inlined]
   [25] cumsum
      @ ./accumulate.jl:146 [inlined]
   [26] findall(bools::ROCArray{Bool, 2, AMDGPU.Runtime.Mem.HIPBuffer})
      @ AMDGPU ~/.julia/packages/AMDGPU/MtLT2/src/kernels/indexing.jl:10
   [27] find_neighbors(sys::System{3, ROCArray, Float64, ROCArray{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, ROCArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, Vector{AtomData}, MolecularTopology, Tuple{LennardJones{DistanceCutoff{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2, Unitful.FreeUnits{(nm^2,), 𝐋^2, nothing}}, Quantity{Float64, 𝐋^-2, Unitful.FreeUnits{(nm^-2,), 𝐋^-2, nothing}}}, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Float64}, CoulombReactionField{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Float64, Float64, Quantity{Float64, 𝐋^3 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm, mol^-1), 𝐋^3 𝐌 𝐍^-1 𝐓^-2, nothing}}}}, Tuple{InteractionList2Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{HarmonicBond{Quantity{Float64, 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-2, mol^-1), 𝐌 𝐍^-1 𝐓^-2, nothing}}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}}, InteractionList3Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{HarmonicAngle{Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}, Float64}, 1, AMDGPU.Runtime.Mem.HIPBuffer}}, InteractionList4Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{PeriodicTorsion{6, Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, 
AMDGPU.Runtime.Mem.HIPBuffer}}, InteractionList4Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{PeriodicTorsion{6, Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}}}, Tuple{}, Tuple{}, DistanceNeighborFinder{ROCArray{Bool, 2, AMDGPU.Runtime.Mem.HIPBuffer}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, Tuple{}, Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, Unitful.FreeUnits{(kJ, K^-1, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, nothing}}, ROCArray{Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, Nothing}, nf::DistanceNeighborFinder{ROCArray{Bool, 2, AMDGPU.Runtime.Mem.HIPBuffer}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, current_neighbors::Nothing, step_n::Int64, force_recompute::Bool; kwargs::@Kwargs{})
      @ Molly ~/projects/CESMIX/Molly.jl/src/neighbors.jl:175
   [28] find_neighbors (repeats 2 times)
      @ ~/projects/CESMIX/Molly.jl/src/neighbors.jl:156 [inlined]
   [29] find_neighbors(sys::System{3, ROCArray, Float64, ROCArray{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, ROCArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, Vector{AtomData}, MolecularTopology, Tuple{LennardJones{DistanceCutoff{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2, Unitful.FreeUnits{(nm^2,), 𝐋^2, nothing}}, Quantity{Float64, 𝐋^-2, Unitful.FreeUnits{(nm^-2,), 𝐋^-2, nothing}}}, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Float64}, CoulombReactionField{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Float64, Float64, Quantity{Float64, 𝐋^3 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm, mol^-1), 𝐋^3 𝐌 𝐍^-1 𝐓^-2, nothing}}}}, Tuple{InteractionList2Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{HarmonicBond{Quantity{Float64, 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-2, mol^-1), 𝐌 𝐍^-1 𝐓^-2, nothing}}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}}, InteractionList3Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{HarmonicAngle{Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}, Float64}, 1, AMDGPU.Runtime.Mem.HIPBuffer}}, InteractionList4Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{PeriodicTorsion{6, Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, 
AMDGPU.Runtime.Mem.HIPBuffer}}, InteractionList4Atoms{ROCArray{Int32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{PeriodicTorsion{6, Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}}}, Tuple{}, Tuple{}, DistanceNeighborFinder{ROCArray{Bool, 2, AMDGPU.Runtime.Mem.HIPBuffer}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, Tuple{}, Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, Unitful.FreeUnits{(kJ, K^-1, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, nothing}}, ROCArray{Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, 1, AMDGPU.Runtime.Mem.HIPBuffer}, Nothing})
      @ Molly ~/projects/CESMIX/Molly.jl/src/neighbors.jl:42
   [30] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/basic.jl:199 [inlined]
   [31] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [32] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/basic.jl:2
   [33] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [34] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:115
   [35] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [36] top-level scope
      @ none:6
   [37] eval
      @ ./boot.jl:385 [inlined]
   [38] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [39] _start()
      @ Base ./client.jl:552
Test Summary: | Pass  Error  Total   Time
Spatial       |   95      1     96  56.4s
ERROR: LoadError: Some tests did not pass: 95 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/basic.jl:1
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/runtests.jl:106
ERROR: Package Molly errored during testing

@leios
Contributor Author

leios commented Jan 31, 2025

Well, I am still not sure about the AMD errors, but I am getting the following errors on CUDA:

   	ERROR: a BoundsError was thrown during kernel execution on thread (396, 1, 1) in block (156, 1, 1).
Out-of-bounds array access
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).

Neighbor lists: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/basic.jl:216
  Got exception outside of a @test
  KernelException: exception thrown during kernel execution on device Tesla V100S-PCIE-32GB
  Stacktrace:
    [1] check_exceptions()
      @ CUDA ~/.julia/packages/CUDA/1kIOw/src/compiler/exceptions.jl:39
    [2] device_synchronize(; blocking::Bool, spin::Bool)
      @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/synchronization.jl:191
    [3] device_synchronize
      @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/synchronization.jl:178 [inlined]
    [4] checked_cuModuleLoadDataEx(_module::Base.RefValue{Ptr{CUDA.CUmod_st}}, image::Ptr{UInt8}, numOptions::Int64, options::Vector{CUDA.CUjit_option_enum}, optionValues::Vector{Ptr{Nothing}})
      @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/module.jl:18
    [5] CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
      @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/module.jl:60
    [6] CuModule
      @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/module.jl:49 [inlined]
    [7] link(job::GPUCompiler.CompilerJob, compiled::@NamedTuple{image::Vector{UInt8}, entry::String})
      @ CUDA ~/.julia/packages/CUDA/1kIOw/src/compiler/compilation.jl:409
    [8] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
      @ GPUCompiler ~/.julia/packages/GPUCompiler/Nxf8r/src/execution.jl:262
    [9] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/Nxf8r/src/execution.jl:151
   [10] macro expansion
      @ ~/.julia/packages/CUDA/1kIOw/src/compiler/execution.jl:380 [inlined]
   [11] macro expansion
      @ ./lock.jl:267 [inlined]
   [12] cufunction(f::typeof(CUDA.partial_scan), tt::Type{Tuple{typeof(Base.add_sum), CuDeviceVector{Int64, 1}, CuDeviceVector{Bool, 1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{0, Tuple{}}, CartesianIndices{0, Tuple{}}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Int64, Nothing, Val{true}}}; kwargs::@Kwargs{})
      @ CUDA ~/.julia/packages/CUDA/1kIOw/src/compiler/execution.jl:375
   [13] cufunction(f::typeof(CUDA.partial_scan), tt::Type{Tuple{typeof(Base.add_sum), CuDeviceVector{Int64, 1}, CuDeviceVector{Bool, 1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{0, Tuple{}}, CartesianIndices{0, Tuple{}}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Int64, Nothing, Val{true}}})
      @ CUDA ~/.julia/packages/CUDA/1kIOw/src/compiler/execution.jl:372
   [14] macro expansion
      @ ~/.julia/packages/CUDA/1kIOw/src/compiler/execution.jl:112 [inlined]
   [15] scan!(f::Function, output::CuArray{Int64, 1, CUDA.DeviceMemory}, input::CuArray{Bool, 1, CUDA.DeviceMemory}; dims::Int64, init::Nothing, neutral::Int64)
      @ CUDA ~/.julia/packages/CUDA/1kIOw/src/accumulate.jl:152
   [16] scan!
      @ ~/.julia/packages/CUDA/1kIOw/src/accumulate.jl:135 [inlined]
   [17] _accumulate!
      @ ~/.julia/packages/CUDA/1kIOw/src/accumulate.jl:203 [inlined]
   [18] #accumulate!#897
      @ ./accumulate.jl:348 [inlined]
   [19] accumulate!
      @ ./accumulate.jl:345 [inlined]
   [20] _cumsum!
      @ ./accumulate.jl:63 [inlined]
   [21] #cumsum!#889
      @ ./accumulate.jl:53 [inlined]
   [22] cumsum!
      @ ./accumulate.jl:51 [inlined]
   [23] cumsum(A::CuArray{Bool, 1, CUDA.DeviceMemory}; dims::Int64)
      @ Base ./accumulate.jl:115
   [24] cumsum
      @ ./accumulate.jl:113 [inlined]
   [25] cumsum
      @ ./accumulate.jl:146 [inlined]
   [26] findall(bools::CuArray{Bool, 2, CUDA.DeviceMemory})
      @ CUDA ~/.julia/packages/CUDA/1kIOw/src/indexing.jl:28
   [27] find_neighbors(sys::System{3, CuArray, Float64, CuArray{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}, CuArray{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, CUDA.DeviceMemory}, CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, CuArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, CUDA.DeviceMemory}, Vector{AtomData}, MolecularTopology, Tuple{LennardJones{DistanceCutoff{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2, Unitful.FreeUnits{(nm^2,), 𝐋^2, nothing}}, Quantity{Float64, 𝐋^-2, Unitful.FreeUnits{(nm^-2,), 𝐋^-2, nothing}}}, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Float64}, CoulombReactionField{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Float64, Float64, Quantity{Float64, 𝐋^3 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm, mol^-1), 𝐋^3 𝐌 𝐍^-1 𝐓^-2, nothing}}}}, Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Quantity{Float64, 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-2, mol^-1), 𝐌 𝐍^-1 𝐓^-2, nothing}}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, 
Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}}}, Tuple{}, Tuple{}, GPUNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, Tuple{}, Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, Unitful.FreeUnits{(kJ, K^-1, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, nothing}}, CuArray{Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, 1, CUDA.DeviceMemory}, Nothing}, nf::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, current_neighbors::Nothing, step_n::Int64, force_recompute::Bool; kwargs::@Kwargs{})
      @ Molly ~/projects/CESMIX/Molly.jl/src/neighbors.jl:175
   [28] find_neighbors
      @ ~/projects/CESMIX/Molly.jl/src/neighbors.jl:156 [inlined]
   [29] find_neighbors(sys::System{3, CuArray, Float64, CuArray{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}, CuArray{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, CUDA.DeviceMemory}, CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, CuArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, CUDA.DeviceMemory}, Vector{AtomData}, MolecularTopology, Tuple{LennardJones{DistanceCutoff{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2, Unitful.FreeUnits{(nm^2,), 𝐋^2, nothing}}, Quantity{Float64, 𝐋^-2, Unitful.FreeUnits{(nm^-2,), 𝐋^-2, nothing}}}, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Float64}, CoulombReactionField{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Float64, Float64, Quantity{Float64, 𝐋^3 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm, mol^-1), 𝐋^3 𝐌 𝐍^-1 𝐓^-2, nothing}}}}, Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Quantity{Float64, 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-2, mol^-1), 𝐌 𝐍^-1 𝐓^-2, nothing}}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, 
Float64, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}}}, Tuple{}, Tuple{}, GPUNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, Tuple{}, Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, Unitful.FreeUnits{(kJ, K^-1, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, nothing}}, CuArray{Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, 1, CUDA.DeviceMemory}, Nothing}, nf::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}})
      @ Molly ~/projects/CESMIX/Molly.jl/src/neighbors.jl:156
   [30] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/basic.jl:328 [inlined]
   [31] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [32] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/basic.jl:217
   [33] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [34] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:115
   [35] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [36] top-level scope
      @ none:6
   [37] eval
      @ ./boot.jl:385 [inlined]
   [38] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [39] _start()
      @ Base ./client.jl:552
Test Summary:  | Pass  Error  Total     Time
Neighbor lists |   28      1     29  2m51.3s
ERROR: LoadError: Some tests did not pass: 28 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/basic.jl:216
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/runtests.jl:106
ERROR: Package Molly errored during testing

@leios
Contributor Author

leios commented Jan 31, 2025

So this error is not triggered when I throw each line individually into the REPL. It only happens with ]test, so I'm actually a bit lost on how to tackle it.

We are using the same versions for AMDGPU (1.1.7) in both cases, and I cannot think of other packages that might be influencing the result.

More than that, the error is actually an error about an error: something went wrong, but the message shown only says that the real error couldn't be displayed.

It runs on Metal and CUDA for you?

@jgreener64
Collaborator

CUDA and Metal seem to work okay for me (bar the separate issue with Metal forces I mentioned).

Could it be a GPUArrays issue, given that it worked for you before when the version was pinned to 10?

Only throwing in the test suite could be due to @inbounds, since bounds checking is always enabled during testing. The error suggests a bounds issue.
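To spell out the @inbounds point with a generic sketch (not Molly code): `Pkg.test` runs Julia with `--check-bounds=yes`, which overrides `@inbounds` annotations, so an access that is silently unchecked in the REPL can throw only under `]test`:

```julia
# Hypothetical example: the out-of-bounds access below is hidden by
# @inbounds in a normal session, but the test runner forces bounds
# checks back on and the same call then throws a BoundsError.
function unsafe_last(v::Vector{Int})
    @inbounds return v[length(v) + 1]  # one past the end
end

# In an ordinary REPL this is undefined behavior (may return garbage);
# under `julia --check-bounds=yes` it throws BoundsError instead.
```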

@jgreener64
Collaborator

Also, I just reverted a change that I made for Metal compatibility because I noticed it was giving wrong indices due to precision errors. I don't see how that could affect the findall line, but it might be worth trying with the latest commit anyway.

@leios
Contributor Author

leios commented Jan 31, 2025

The updates actually fixed my errors as well. I think it must have been an odd interaction with bounds checking in the distance_neighbor_finder_kernel! that was somehow causing an error in the subsequent findall call.

I think there are also correctness issues in the force kernels, since some simulation tests are failing:

Lennard-Jones 2D: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:53
  Expression: all((all(c .> 0.0 * u"nm") for c = final_coords))

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:53 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:2
Lennard-Jones 2D: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:54
  Expression: all((all(c .< boundary) for c = final_coords))

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:54 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:2
Lennard-Jones 2D: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1
  Got exception outside of a @test
  ArgumentError: quantiles are undefined in presence of NaNs or missing values
  Stacktrace:
    [1] _quantilesort!(v::Vector{Float64}, sorted::Bool, minp::Float64, maxp::Float64)
      @ Statistics ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Statistics/src/Statistics.jl:994
    [2] #quantile!#49
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Statistics/src/Statistics.jl:964 [inlined]
    [3] quantile!
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Statistics/src/Statistics.jl:960 [inlined]
    [4] quantile(v::Vector{Float64}, p::Vector{Float64}; sorted::Bool, alpha::Float64, beta::Float64)
      @ Statistics ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Statistics/src/Statistics.jl:1089
    [5] quantile
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Statistics/src/Statistics.jl:1089 [inlined]
    [6] default_bandwidth(data::Vector{Float64}, alpha::Float64)
      @ KernelDensity ~/.julia/packages/KernelDensity/uv9BT/src/univariate.jl:39
    [7] default_bandwidth
      @ ~/.julia/packages/KernelDensity/uv9BT/src/univariate.jl:34 [inlined]
    [8] rdf(coords::Vector{SVector{2, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}}, boundary::RectangularBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}; npoints::Int64)
      @ MollyKernelDensityExt ~/projects/CESMIX/Molly.jl/ext/MollyKernelDensityExt.jl:15
    [9] rdf(coords::Vector{SVector{2, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}}, boundary::RectangularBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}})
      @ MollyKernelDensityExt ~/projects/CESMIX/Molly.jl/ext/MollyKernelDensityExt.jl:9
   [10] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:57 [inlined]
   [11] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [12] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:2
   [13] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [14] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:118
   [15] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [16] top-level scope
      @ none:6
   [17] eval
      @ ./boot.jl:385 [inlined]
   [18] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [19] _start()
      @ Base ./client.jl:552
Test Summary:    | Pass  Fail  Error  Total   Time
Lennard-Jones 2D |   10     2      1     13  38.2s

@jgreener64
Collaborator

Interesting. It seems like there are NaN values somewhere.

I'm going to see if my institute has any AMD GPUs lying around so I can test too. I'll also try and debug the Metal forces issue, since that might be related.

@leios
Contributor Author

leios commented Feb 1, 2025

Yeah, I am sure I did something screwy with the force kernels then. I guess a small precision issue could eventually lead to NaNs if left unchecked for a bit, so the Metal and AMD issues might still be related.
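As a purely hypothetical illustration of the kind of snowballing described above (the names and values here are made up and are not from Molly's kernels), a Float32 round-off can collapse a small difference to exactly zero; a later division then produces `Inf`, and any `Inf - Inf` in a force accumulation yields a `NaN` that poisons everything downstream:

```julia
x = 1.0f0 + 1.0f-8   # increment is below eps(1.0f0), so x rounds to 1.0f0
d = x - 1.0f0        # exactly 0.0f0 after rounding
invr = 1.0f0 / d     # division by zero gives Inf32
f = invr - invr      # Inf - Inf is NaN32
acc = 0.0f0 + f      # the NaN now propagates through the accumulator
isnan(acc)           # true
```

Once a single such `NaN` enters the coordinates, every subsequent step inherits it, which would explain both the out-of-boundary coordinates and the `rdf` quantile error above.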

@leios
Contributor Author

leios commented Feb 3, 2025

What a catch. Tests seem to be passing on my end until:

Monte Carlo membrane barostat: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1229
  Expression: 260.0 * u"K" < mean(values(sys.loggers.temperature)) < 300.0 * u"K"
   Evaluated: 260.0 K < 301.41174903814704 K < 300.0 K

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1229 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1179
Monte Carlo membrane barostat: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1233
  Expression: mean(values(sys.loggers.potential_energy)) < 0.0 * u"kJ * mol^-1"
   Evaluated: 0.015475547091773527 kJ mol^-1 < 0.0 kJ mol^-1

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1233 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1179
Test Summary:                 | Pass  Fail  Total   Time
Monte Carlo membrane barostat |   28     2     30  55.3s
ERROR: LoadError: Some tests did not pass: 28 passed, 2 failed, 0 errored, 0 broken.
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1177
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/runtests.jl:106
ERROR: Package Molly errored during testing

@jgreener64
Collaborator

Same for me. I'm also getting an occasional precision error in the protein tests. Given that they are small errors, I am going to disable those tests on non-CUDA backends and merge this.

Thanks for all your work on this James, I'll look to get the next release out soon.
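A backend-conditional skip of the kind mentioned above could be sketched as follows. This is only an illustration, not Molly's actual test code: `cuda_available` stands in for however the suite detects a CUDA device (e.g. `CUDA.functional()`), and the tolerance check is a placeholder.

```julia
using Test

# Hypothetical flag: in a real suite this would come from the
# backend detection logic, e.g. CUDA.functional().
cuda_available = false

@testset "precision-sensitive checks" begin
    if cuda_available
        # Strict tolerance check runs only on CUDA backends.
        @test 1.0 + 1e-12 ≈ 1.0 atol=1e-9
    else
        # Recorded as skipped/broken rather than failed.
        @test_skip false
    end
end
```

This keeps the test visible in the summary (as broken rather than failed) so it is not silently forgotten on the other backends.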

@jgreener64 jgreener64 merged commit 1124a83 into JuliaMolSim:master Feb 3, 2025
3 of 7 checks passed
@leios
Contributor Author

leios commented Feb 3, 2025

Great! Glad we finally got this through. Thanks for the patience and for taking it over at the end.
