New regional example for the ACC #142

Draft
wants to merge 18 commits into
base: main

francispoulin

Purpose

Following up on #106, this is a first attempt to create a regional model with ECCO-derived restoring at the boundaries. We decided to focus on the ACC in the Southern Ocean.

To-do

It does not run yet, but after it does, it would be good to know if people agree this is a good example to include. If yes, then we need to turn this into an example.

Content

This is the only file that is different from main at the moment.
https://github.com/CliMA/ClimaOcean.jl/blob/fjp/acc_regional_model/examples/acc_regional_simulation.jl


  • [x] I have read and checked the items on the review checklist.

@francispoulin
Author

The error that I get is copied below.

@simone-silvestri @glwagner

[ Info: In-painting ecco salinity
[ Info: In-painting ecco salinity
ERROR: a bounds error was thrown during kernel execution on thread (65, 1, 1) in block (3, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).

ERROR: LoadError: KernelException: exception thrown during kernel execution on device NVIDIA A100-SXM4-40GB
Stacktrace:
  [1] check_exceptions()
    @ CUDA ~/.julia/packages/CUDA/Tl08O/src/compiler/exceptions.jl:39
  [2] device_synchronize(; blocking::Bool, spin::Bool)

@simone-silvestri
Collaborator

What if you use the CPU and start Julia with --check-bounds=yes?

@francispoulin
Author

Thanks @simone-silvestri for the suggestion. Will try it now.

@francispoulin
Author

francispoulin commented Aug 15, 2024

Things go a lot further, but there is a problem with the line that defines coupled_model. It seems that the size of the matrix it has and the size it expects do not match.

It seems this is with assemble_atmosphere_ocean_fluxes.

julia> include("acc_regional_simulation.jl")
[ Info: Regridding bathymetry from existing file /u/fpoulin/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/Bathymetry/ETOPO_2022_v1_60s_N90W180_surface.nc.
┌ Warning: The westernmost meridian of `target_grid` 0.0 does not coincide with the closest meridian of the bathymetry grid, -1.4210854715202004e-14.
└ @ ClimaOcean.Bathymetry ~/software/ClimaOcean.jl/src/Bathymetry.jl:147
[ Info: In-painting ecco temperature
[ Info: In-painting ecco temperature
[ Info: In-painting ecco salinity
[ Info: In-painting ecco salinity
┌ Warning: This simulation will run forever as stop iteration = stop time = wall time limit = Inf.
└ @ Oceananigans.Simulations ~/.julia/packages/Oceananigans/dvdXO/src/Simulations/simulation.jl:55
[ Info: In-painting ecco temperature
[ Info: In-painting ecco salinity
ERROR: LoadError: BoundsError: attempt to access 21×46 Matrix{Float64} at index [22, 40]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:14 [inlined]
  [2] net_downwelling_radiation
    @ ~/software/ClimaOcean.jl/src/OceanSeaIceModels/CrossRealmFluxes/tabulated_albedo.jl:156 [inlined]
  [3] macro expansion
    @ ~/software/ClimaOcean.jl/src/OceanSeaIceModels/CrossRealmFluxes/atmosphere_ocean_fluxes.jl:275 [inlined]
  [4] cpu__assemble_atmosphere_ocean_fluxes!
    @ ~/.julia/packages/KernelAbstractions/QE5mt/src/macros.jl:287 [inlined]
  [5] cpu__assemble_atmosphere_ocean_fluxes!(__ctx__::KernelAbstractions.CompilerMetadata{…}, centered_velocity_fluxes::@NamedTuple{…}, net_tracer_fluxes::@NamedTuple{…}, grid::ImmersedBoundaryGrid{…}, clock::Clock{…}, ocean_temperature::SubArray{…}, ocean_salinity::SubArray{…}, ocean_temperature_units::ClimaOcean.OceanSeaIceModels.CrossRealmFluxes.DegreesCelsius, similarity_theory_fields::@NamedTuple{…}, downwelling_radiation::@NamedTuple{…}, prescribed_freshwater_flux::@NamedTuple{…}, atmos_grid::Oceananigans.Grids.ZRegularLLG{…}, atmos_times::StepRangeLen{…}, atmos_backend::JRA55NetCDFBackend, atmos_time_indexing::Oceananigans.OutputReaders.Cyclical{…}, runoff_args::Tuple{…}, radiation_properties::Radiation{…}, ocean_reference_density::Float64, ocean_heat_capacity::Float64, freshwater_density::Float64)
    @ ClimaOcean.OceanSeaIceModels.CrossRealmFluxes ./none:0
  [6] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.DynamicCheck)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/QE5mt/src/cpu.jl:140
  [7] __run(obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.DynamicCheck, static_threads::Bool)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/QE5mt/src/cpu.jl:107
  [8] (::KernelAbstractions.Kernel{…})(::@NamedTuple{…}, ::Vararg{…}; ndrange::Nothing, workgroupsize::Nothing)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/QE5mt/src/cpu.jl:46
  [9] (::KernelAbstractions.Kernel{…})(::@NamedTuple{…}, ::Vararg{…})
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/QE5mt/src/cpu.jl:39
 [10] launch!(::CPU, ::ImmersedBoundaryGrid{…}, ::Oceananigans.Utils.KernelParameters{…}, ::typeof(ClimaOcean.OceanSeaIceModels.CrossRealmFluxes._assemble_atmosphere_ocean_fluxes!), ::@NamedTuple{…}, ::Vararg{…}; include_right_boundaries::Bool, reduced_dimensions::Tuple{}, location::Nothing, active_cells_map::Nothing, kwargs::@Kwargs{})
    @ Oceananigans.Utils ~/.julia/packages/Oceananigans/dvdXO/src/Utils/kernel_launching.jl:168
 [11] launch!(::CPU, ::ImmersedBoundaryGrid{…}, ::Oceananigans.Utils.KernelParameters{…}, ::Function, ::@NamedTuple{…}, ::Vararg{…})
    @ Oceananigans.Utils ~/.julia/packages/Oceananigans/dvdXO/src/Utils/kernel_launching.jl:154
 [12] compute_atmosphere_ocean_fluxes!(coupled_model::OceanSeaIceModel{…})
    @ ClimaOcean.OceanSeaIceModels.CrossRealmFluxes ~/software/ClimaOcean.jl/src/OceanSeaIceModels/CrossRealmFluxes/atmosphere_ocean_fluxes.jl:77
 [13] update_state!(coupled_model::OceanSeaIceModel{…}, callbacks::Vector{…}; compute_tendencies::Bool)
    @ ClimaOcean.OceanSeaIceModels ~/software/ClimaOcean.jl/src/OceanSeaIceModels/ocean_only_model.jl:42
 [14] update_state!
    @ ClimaOcean.OceanSeaIceModels ~/software/ClimaOcean.jl/src/OceanSeaIceModels/ocean_only_model.jl:30 [inlined]
 [15] update_state!(coupled_model::OceanSeaIceModel{…})
    @ ClimaOcean.OceanSeaIceModels ~/software/ClimaOcean.jl/src/OceanSeaIceModels/ocean_only_model.jl:30
 [16] OceanSeaIceModel(ocean::Simulation{…}, sea_ice::ClimaOcean.OceanSeaIceModels.MinimumTemperatureSeaIce{…}; atmosphere::ClimaOcean.OceanSeaIceModels.PrescribedAtmospheres.PrescribedAtmosphere{…}, radiation::Radiation{…}, similarity_theory::Nothing, ocean_reference_density::Float64, ocean_heat_capacity::Float64, clock::Clock{…})
    @ ClimaOcean.OceanSeaIceModels ~/software/ClimaOcean.jl/src/OceanSeaIceModels/ocean_sea_ice_model.jl:82
 [17] top-level scope
    @ REPL[15]:1
Some type information was truncated. Use `show(err)` to see complete types.

@francispoulin
Author

@simone-silvestri , any advice on what is going wrong here?

@simone-silvestri
Collaborator

It looks like there is a bug in the TabulatedAlbedo function, i.e., there is no check to make sure that we stay within bounds when interpolating in the table.
I will open a PR to fix this issue. In the meantime, if you want to proceed with the implementation without running into this problem, you can use radiation = Radiation(ocean_albedo = LatitudeDependentAlbedo())
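
For illustration only, here is a minimal sketch of the kind of guard described above: clamping the interpolation indices so that a lookup can never step outside the table. This is not the ClimaOcean implementation. The function `clamped_bilinear`, the axes `xs` and `ys`, and the table contents are all hypothetical; only the 21×46 size is taken from the BoundsError in the log.

```julia
# Hypothetical sketch: bilinear lookup in a 2D table with index clamping.
# `clamped_bilinear`, `xs`, `ys`, and `table` are illustrative names, not ClimaOcean API.

function clamped_bilinear(table::AbstractMatrix, xs::AbstractRange, ys::AbstractRange, x, y)
    Nx, Ny = size(table)

    # Fractional (1-based) position along each axis, clamped to [1, N]
    fi = clamp((x - first(xs)) / step(xs) + 1, 1, Nx)
    fj = clamp((y - first(ys)) / step(ys) + 1, 1, Ny)

    # Lower corner, capped at N - 1 so that i + 1 and j + 1 stay in bounds
    i = min(floor(Int, fi), Nx - 1)
    j = min(floor(Int, fj), Ny - 1)

    # Interpolation weights in [0, 1]
    wi = fi - i
    wj = fj - j

    return (1 - wi) * (1 - wj) * table[i,     j    ] +
                wi  * (1 - wj) * table[i + 1, j    ] +
           (1 - wi) *      wj  * table[i,     j + 1] +
                wi  *      wj  * table[i + 1, j + 1]
end

xs = range(0, 1, length = 21)
ys = range(0, 1, length = 46)
table = [x + y for x in xs, y in ys]   # made-up data: a linear function of (x, y)

# Unguarded indexing one row past the end throws, as in the log above
threw_bounds_error = try
    table[22, 40]
    false
catch err
    err isa BoundsError
end

inside  = clamped_bilinear(table, xs, ys, 0.5, 0.5)   # interior query
outside = clamped_bilinear(table, xs, ys, 1.2, 0.5)   # x past the last node: the edge value is used
```

Clamping silently extends the edge values outside the table. Whether that is the right physical choice for an albedo table is a separate question; the sketch only shows how the out-of-bounds access can be avoided.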

@francispoulin
Author

Thanks @simone-silvestri , I will give that a try!

Make changes so that it runs
@francispoulin
Author

@simone-silvestri : I tried it and it seems like a function is not defined.

I added this at the beginning and now it seems to be running!

using ClimaOcean.OceanSeaIceModels.CrossRealmFluxes: LatitudeDependentAlbedo

@simone-silvestri
Collaborator

Ah nice. I think we can export that type.


codecov bot commented Aug 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (b3ae3fe) to head (8cc2719).
Report is 5 commits behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main    #142   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         34      34           
  Lines       1962    1983   +21     
=====================================
- Misses      1962    1983   +21     


@francispoulin
Author

It's running on a CPU (i.e. slow) and still on the initial time step.

I made all these changes on the branch and can revert back to what we had previously as other fixes come along.

Maybe I'll have something to share tomorrow.

@francispoulin
Author

I started the job yesterday and it hasn't updated the output files in over 24 hours. I think something has gone wrong. Below is the current display that I have. It hasn't stopped and is still running on a CPU. Maybe we need to try it on a GPU, or add more output to see what has gone wrong? Any suggestions?

julia> include("acc_regional_simulation.jl")
[ Info: Regridding bathymetry from existing file /u/fpoulin/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/Bathymetry/ETOPO_2022_v1_60s_N90W180_surface.nc.
┌ Warning: The westernmost meridian of `target_grid` 0.0 does not coincide with the closest meridian of the bathymetry grid, -1.4210854715202004e-14.
└ @ ClimaOcean.Bathymetry ~/software/ClimaOcean.jl/src/Bathymetry.jl:147
[ Info: In-painting ecco temperature
[ Info: In-painting ecco temperature
[ Info: In-painting ecco salinity
[ Info: In-painting ecco salinity
┌ Warning: This simulation will run forever as stop iteration = stop time = wall time limit = Inf.
└ @ Oceananigans.Simulations ~/.julia/packages/Oceananigans/dvdXO/src/Simulations/simulation.jl:55
[ Info: In-painting ecco temperature
[ Info: In-painting ecco salinity
[ Info: Initializing simulation...
[ Info: Time: 0 seconds, Iteration 0, Δt 5 minutes, max(vel): (0.00e+00, 0.00e+00, 0.00e+00), max(T): 29.70, min(T): -1.94, wtime: 2.679 minutes 
[ Info:     ... simulation initialization complete (14.029 seconds)
[ Info: Executing initial time step...
┌ Warning: Simulation stopped during initialization.
└ @ Oceananigans.Simulations ~/.julia/packages/Oceananigans/dvdXO/src/Simulations/run.jl:129

@francispoulin
Author

Correction: it is still running on one CPU. It has reached 4 simulated days after 7 days of computing. Not a great ratio.

What needs to be done to run this on a GPU? @simone-silvestri

@simone-silvestri
Collaborator

Wow, that seems quite slow! What if you move it on the GPU?

@francispoulin
Author

Wow, that seems quite slow! What if you move it on the GPU?

Sorry @simone-silvestri for the late reply.

I am happy to try it again on a GPU but last time there was an error. I can try it again and let you know what the error is.

@francispoulin
Author

@simone-silvestri
I ran it on a GPU and found the following error. It suggests I try passing -g2 when I run Julia. I can try that but I believe I tried this before and didn't see much more.

ERROR: a bounds error was thrown during kernel execution on thread (225, 1, 1) in block (7, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).

@francispoulin
Author

I stand corrected, there is more information. To me, this actually looks quite different.

julia> include("acc_regional_simulation.jl")
Precompiling Oceananigans
  166 dependencies successfully precompiled in 109 seconds
Precompiling ClimaOcean
        Info Given ClimaOcean was explicitly requested, output will be shown live 
WARNING: using Units.day in module ECCO conflicts with an existing identifier.
  204 dependencies successfully precompiled in 143 seconds. 168 already precompiled.
  2 dependencies had output during precompilation:
┌ ClimaOcean
│  [Output was shown above]
└  
┌ Accessors → AccessorsUnitfulExt
│  [pid 760940] waiting for IO to finish:
│   Handle type        uv_handle_t->data
│   fs_event           0x25d5fe0->0x7fdd5f8fbeb0
│   timer              0x2385440->0x7fdd5f8fbee0
│  This means that a package has started a background task or event source that has not finished running. For precompilation to complete successfully, the event source needs to be closed explicitly. See the developer documentation on fixing precompilation hangs for more help.
└  
[ Info: Regridding bathymetry from existing file /u/fpoulin/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/Bathymetry/ETOPO_2022_v1_60s_N90W180_surface.nc.
┌ Warning: The westernmost meridian of `target_grid` 0.0 does not coincide with the closest meridian of the bathymetry grid, -1.4210854715202004e-14.
└ @ ClimaOcean.Bathymetry ~/software/ClimaOcean.jl/src/Bathymetry.jl:147
[ Info: In-painting ecco temperature
[ Info: In-painting ecco temperature
[ Info: In-painting ecco salinity
[ Info: In-painting ecco salinity
ERROR: a bounds error was thrown during kernel execution on thread (1, 1, 1) in block (3, 1, 1).
Stacktrace:
 [1] indexed_iterate at ./tuple.jl:92
 [2] indexed_iterate at ./tuple.jl:92
 [3] stateindex at /u/fpoulin/software/ClimaOcean.jl/src/ClimaOcean.jl:40
 [4] ECCORestoring at /u/fpoulin/software/ClimaOcean.jl/src/DataWrangling/ecco_restoring.jl:210
 [5] DiscreteForcing at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Forcings/discrete_forcing.jl:51
 [6] hydrostatic_free_surface_tracer_tendency at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_tendency_kernel_functions.jl:133
 [7] macro expansion at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/compute_hydrostatic_free_surface_tendencies.jl:240
 [8] gpu_compute_hydrostatic_free_surface_Gc! at /u/fpoulin/.julia/packages/KernelAbstractions/QE5mt/src/macros.jl:95
 [9] gpu_compute_hydrostatic_free_surface_Gc! at ./none:0

ERROR: a bounds error was thrown during kernel execution on thread (1, 1, 1) in block (67, 1, 1).
Stacktrace:
 [1] indexed_iterate at ./tuple.jl:92
 [2] indexed_iterate at ./tuple.jl:92
 [3] stateindex at /u/fpoulin/software/ClimaOcean.jl/src/ClimaOcean.jl:40
 [4] ECCORestoring at /u/fpoulin/software/ClimaOcean.jl/src/DataWrangling/ecco_restoring.jl:210
 [5] DiscreteForcing at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Forcings/discrete_forcing.jl:51
 [6] hydrostatic_free_surface_tracer_tendency at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_tendency_kernel_functions.jl:133
 [7] macro expansion at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/compute_hydrostatic_free_surface_tendencies.jl:240
Unhandled Task ERROR: KernelException: exception thrown during kernel execution on device NVIDIA A100-SXM4-40GB

@glwagner
Member

Can you make an MWE for this and open an issue?

@francispoulin
Author

Can you make an MWE for this and open an issue?

I will certainly give it a try and see what part of it is causing the issue. This will likely take me a day or two to get to.

@francispoulin
Author

@simone-silvestri, I realize it's been a few months, but I am still keen to get this example up and running.

I can try this all again this week but if you had time to meet for an hour, I wonder if that would help?

@simone-silvestri
Collaborator

Sure, I'll text on Slack.

@francispoulin
Author

I'm happy to say that @simone-silvestri and I got this example working on a GPU.

I ran it on a coarse grid and it looks reasonable. I'm now running it on a finer grid. I hope to have some results to share with everyone tomorrow.

If anyone has any suggestions about the code please let me know. @glwagner. We currently include -80 to -20 degrees but it doesn't have to be that.

@francispoulin
Author

The good news is that things are running on a GPU. Below are the results for the first 30 days.

The bad news is that things are very slow. For comparison, the near-global ocean model simulates 1 day in 16 minutes, whereas this coupled regional model takes 78 minutes per simulated day. That is more than 4 times slower.

I will keep this running as long as I can, but can someone tell me how much slower we should expect the coupled model to be? I saw that it used a time step of 10 seconds, but I didn't have any output for the global uncoupled model and don't know what time step it took.

@simone-silvestri @glwagner

near_global_ocean_surface_e.mp4
near_global_ocean_surface_s.mp4
near_global_ocean_surface_T.mp4

@glwagner
Member

What GPU?

@francispoulin
Author

I'm running it on A100's, so it should be fast. @simone-silvestri suggested I change the time step to 10 minutes instead of 10 seconds. Working on that now. That should solve the problem.

@glwagner
Member

Haha yeah, that will give you a factor of 60x right away!
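
As a back-of-the-envelope check of that factor, the sketch below uses the two time steps and the 78 minutes per simulated day quoted in this thread. It assumes that the cost per time step stays the same and that wall time scales with the number of steps, ignoring output and other overheads, so the resulting wall time is only a rough estimate. The variable names are illustrative.

```julia
# Rough estimate only: assumes a fixed cost per time step and ignores I/O and other overheads.

Δt_old = 10          # seconds, the time step reported in the thread
Δt_new = 10 * 60     # seconds, the suggested 10-minute time step

# Factor by which the number of time steps drops
step_reduction = Δt_new / Δt_old

# Time steps needed per simulated day (86400 seconds)
steps_per_day_old = 86400 ÷ Δt_old
steps_per_day_new = 86400 ÷ Δt_new

# Reported wall time: 78 minutes per simulated day with the 10-second step
wall_minutes_per_day_old = 78
wall_minutes_per_day_new = wall_minutes_per_day_old / step_reduction

println("time steps per simulated day: ", steps_per_day_old, " -> ", steps_per_day_new)
println("step reduction factor: ", step_reduction)
println("estimated wall minutes per simulated day: ", wall_minutes_per_day_new)
```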

@francispoulin
Author

Changing the time step made a world of difference. I used 1 minute for the first 10 days and then 10 minutes for the remaining two years. Here are the results for the temperature and turbulent kinetic energy. I can't post the speed as it's larger than 100 MB.

One observation is that there are a lot of unbalanced motions. But on the bright side, we do see eddies forming and interesting stuff is happening.

I presume the next step is to spin this up. If yes, for how long? I have heard people mention 100 years, but I'm not sure what is standard.

Any thoughts are welcome!

near_global_ocean_surface_e.mp4
near_global_ocean_surface_T.mp4

@simone-silvestri
Collaborator

simone-silvestri commented Dec 21, 2024

Very nice. I am wondering if the north restoring is a bit too strong.
The change in density is quite visible and it seems to spin up baroclinic eddies due to instabilities at the edge of the sponge region during winter periods. Maybe we should reduce the sponge region? Or use a higher-order mask rather than a linear one?

Edit: I think this effect is caused by using only 5 months for the restoring. If we correct that bug, we probably will not see this strange behavior. However, we probably still want to reduce the restoring rate a bit.
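
To make the mask question concrete, here is a small sketch comparing a linear ramp with a smoother cubic ramp ("smoothstep") across a northern sponge. It is not taken from the PR script: the northern edge at -20 degrees matches the domain mentioned earlier in the thread, but `φ_start`, the function names, and the form of `restoring` are assumptions made for illustration.

```julia
# Hypothetical sketch of two sponge masks near a northern boundary.
# φ_start and all function names are made up for illustration, not taken from the PR script.

const φ_start = -25.0   # latitude where the sponge begins (degrees), assumed
const φ_end   = -20.0   # northern edge of the domain (degrees)

# Normalized distance into the sponge: 0 at φ_start, 1 at φ_end
sponge_coordinate(φ) = clamp((φ - φ_start) / (φ_end - φ_start), 0, 1)

# Linear ramp: continuous, but its slope jumps at the sponge edge
linear_mask(φ) = sponge_coordinate(φ)

# Cubic "smoothstep" ramp: zero slope at both ends of the sponge
function smooth_mask(φ)
    s = sponge_coordinate(φ)
    return s^2 * (3 - 2s)
end

# Restoring tendency of the form rate * mask * (target - value)
restoring(φ, value, target; rate, mask = linear_mask) = rate * mask(φ) * (target - value)

for φ in (-26.0, -24.0, -22.5, -21.0, -20.0)
    println("φ = ", φ, "  linear = ", linear_mask(φ), "  smooth = ", smooth_mask(φ))
end
```

Just inside the sponge the smooth mask is weaker than the linear one and it has no kink at the sponge edge, which is one way a higher-order mask might soften the transition. Narrowing the sponge corresponds to moving `φ_start` toward `φ_end`, and a weaker restoring corresponds to a smaller `rate`.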

@francispoulin
Author

Thanks @simone-silvestri. I ran this for 10 years and it seems to work fine, but the animations are getting rather large.

Do you want me to make an animation outputting every week or so?

Or I could try running it for 100 years and output it only a few times a week.

Let me know what you would prefer to see.

@simone-silvestri
Collaborator

Would a video of surface quantities every 2 days be too large? Every week is probably too infrequent.

@francispoulin
Author

This is a plot of the temperature, outputting every day. It's the smallest of the three files, only 48 MB. I can share more if you like.

near_global_ocean_surface_T.mp4

Do you think the sponge is any better here?

@simone-silvestri
Collaborator

simone-silvestri commented Jan 10, 2025

It seems a little better; at least I cannot spot sizeable baroclinic instabilities on the sponge boundary like before. However, it does not seem good enough: I can easily see the demarcation line where the sponge is active. We should probably reduce either the restoring rate or the size of the sponge region, which is quite large.

@francispoulin
Author

Thanks @simone-silvestri . Did you want to change the script however you think is best and then I can run it? We should probably output less often if we want to run it for 10 years.

@simone-silvestri
Collaborator

Sure, I can reduce the restoring rate and sponge region and output once every 5 days

@simone-silvestri
Collaborator

I have reduced the restoring rate a bit, reduced the extent of the northern sponge region, and changed the output writer schedule to 5 days.

@francispoulin
Author

Thanks @simone-silvestri . I just updated my repo and will run it right away. I hope to have some results by tomorrow, maybe.

@francispoulin
Author

When I tried running it I got an error. Maybe this is related to other updates?

ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.

@navidcy
Collaborator

navidcy commented Jan 13, 2025

When I tried running it I got an error. Maybe this is related to other updates?

ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.

Yeah, see #318 (comment), CliMA/Oceananigans.jl#4037, CliMA/Oceananigans.jl#4036

We need to fix it!

@francispoulin
Author

Thanks @navidcy for pointing this out. I will look to see if there is a band-aid fix in the meantime.

@navidcy
Collaborator

navidcy commented Jan 13, 2025

Thanks @navidcy for pointing this out. I will look to see if there is a band-aid fix in the meantime.

CliMA/Oceananigans.jl#4036 (comment) as bandaid 🩹?

@navidcy navidcy changed the title creating new example for the ACC New regional example for the ACC Jan 13, 2025
@simone-silvestri
Collaborator

I think we need to update ClimaOcean to Oceananigans version 0.95.6 to make sure the GPUArrays version that comes with it is correct.

@francispoulin
Author

Thanks @navidcy for pointing this out. I tried the example and can get the same error, but I don't see the fix anywhere in the comment. Is there an easy temporary fix?

4 participants