
Extremely high first call latency Julia 1.6 versus 1.5 with multiphysics PDE solver #43206

Closed
amartinhuertas opened this issue Nov 24, 2021 · 12 comments
Labels
compiler:latency (Compiler latency regression), regression (Regression in behavior compared to a previous version)

Comments


amartinhuertas commented Nov 24, 2021

Dear all,

Apologies if this is not the appropriate forum to report the following.

As suggested by the issue template, I already tried the Julia Discourse without much success, so I am afraid the problem may have gone unnoticed by the Julia compiler experts. The Discourse thread is available here: https://discourse.julialang.org/t/extremely-high-first-call-latency-julia-1-6-versus-1-5-with-multiphysics-pde-solver/71692.

For convenience let me copy here the contents of the main post there.

Dear all,

We are experiencing extremely high first-call latency with Julia 1.6 when running our geophysical-flows multiphysics finite element PDE solver (on the order of 15 hours on a supercomputer node with 192 GB of memory; on a local laptop the process is killed after some hours by the OS due to memory consumption). Obviously, this significantly jeopardizes development.

On the other hand, it is surprising that with Julia 1.5 the first-call latency of the very same code is only around 15 minutes on the local laptop referred to above.

For the record, the code is publicly available in the following Git repo:

https://github.com/gridapapps/GridapGeosciences.jl

It can be executed as follows (the first four commands in a shell, the last two lines in the Julia REPL):

git clone https://github.com/gridapapps/GridapGeosciences.jl
cd GridapGeosciences.jl
git checkout remove_dependency_pardiso
julia --project=.
using Pkg; Pkg.instantiate()
include("test/Williamson2ThetaMethodFullNewtonTests.jl")

We would like to understand the cause of the problem. It seems that some feature(s) in our code challenge the Julia compiler's performance, but it is really hard to say without a deep understanding of the underlying machinery. It would be ideal to come up with an MWE, but the code's complexity and the fact that it stalls with Julia 1.6 do not make the task easy.

Is there anyone in the community experiencing similar issues? Which performance analysis tools could we use to detect the cause of the problem? We suspect this issue might be related to https://discourse.julialang.org/t/how-to-cut-down-compile-time-when-inference-is-not-the-problem/59077, but we are not certain.

Thanks in advance for your help!
Regards,
Alberto.


oscardssmith added the compiler:latency (Compiler latency regression) and regression (Regression in behavior compared to a previous version) labels Nov 24, 2021
@barucden
Contributor

barucden commented Nov 25, 2021

It seems that a conclusion has been reached in the referenced Discourse thread.

@amartinhuertas
Author

It seems that a conclusion has been reached in the referenced Discourse thread.

Not really. Still under investigation. :-)

@goerch

goerch commented Nov 27, 2021

Analyzing the problem with --trace-compile, we identified a critical function whose compilation was fast on Julia 1.5.4 and very slow on 1.6.4:

precompile(Tuple{typeof(Gridap.FESpaces._numeric_loop_vector!)
[cut 300 kB type signature]
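
For reference, a trace like the one above can be collected roughly as follows (a minimal sketch; the output file name and the choice of test script are assumptions, not the exact commands used):

# emit one precompile(...) statement per compiled specialization into trace.jl
julia -O2 --project=. --trace-compile=trace.jl test/Williamson2ThetaMethodFullNewtonTests.jl
# the entries with the longest type signatures are the usual suspects
awk '{ print length, $0 }' trace.jl | sort -rn | head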

Adding a @nospecialize to the critical function like so

@noinline function _numeric_loop_vector!(vec,caches,cell_vals,cell_rows)
  @nospecialize
  add_cache, vals_cache, rows_cache = caches
  @assert length(cell_vals) == length(cell_rows)
  add! = AddEntriesMap(+)
  for cell in 1:length(cell_rows)
    rows = getindex!(rows_cache,cell_rows,cell)
    vals = getindex!(vals_cache,cell_vals,cell)
    evaluate!(add_cache,add!,vec,vals,rows)
  end
end

the compile time on Julia 1.6.4 was comparable to 1.5.4 again. Inspired by a different thread, I also checked the MWE without @nospecialize but with -O1 instead of -O2. This again led to the compile time on Julia 1.6.4 being comparable to 1.5.4.
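
For reference, a minimal sketch of the two forms @nospecialize can take (the functions below are made up for illustration and are not the Gridap code):

# a bare @nospecialize at the top of the body applies to all arguments
function sum_lengths(xs, ys)
    @nospecialize
    length(xs) + length(ys)
end

# per-argument form, useful when only some arguments produce huge type signatures
function sum_lengths_partial(xs, @nospecialize(ys))
    length(xs) + length(ys)
end

# roughly speaking, the annotated arguments are compiled as if they were of type Any,
# so the cost of inferring and compiling huge concrete signatures is avoided
sum_lengths((1, 2.0), (3, "a"))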

I currently conclude:

  • -O1 and -O2 generate different specializations
  • -O2 occasionally runs into problems specializing very complex type signatures, which can only be combated by identifying these (with tools like --trace-compile or SnoopCompile) and @nospecialize-ing them
  • 1.5.4 -O2 behaves like 1.6.4 -O1 in these cases

Is this intended behaviour?

@KristofferC
Member

How is the behavior on 1.7? #42263 fixed one of the latency problems with Gridap and is in 1.7.

@goerch

goerch commented Nov 27, 2021

The following MWE tries to mimic the problem

using BenchmarkTools

zero = nothing
succ(n) = (n,)

unary(n) = n == 0 ? zero : succ(unary(n - 1))
binary(n::Nothing) = 0
binary(n) = begin @nospecialize; 1 + binary(n[1]); end

add(x::Nothing, y) = y
add(x, y) = begin @nospecialize; succ(add(x[1], y)); end

mul(x::Nothing, y) = nothing
mul(x, y) = begin @nospecialize; add(y, mul(x[1], y)); end

double(n) = begin @nospecialize; add(n, n); end
square(n) = begin @nospecialize; mul(n, n); end

versioninfo()
for n in 0:10:50
    print("compile ", n, " ")
    @time binary(square(unary(n)))
    print("run ", n, " ")
    @btime binary(square(unary($n)))
end

When run on 1.7.0-rc3 -O2 I see

compile 0   0.008525 seconds (11.99 k allocations: 745.920 KiB, 98.05% compilation time)
run 0   14.314 ns (0 allocations: 0 bytes)
compile 10   0.010630 seconds (5.88 k allocations: 395.744 KiB, 96.80% compilation time)
run 10   56.200 μs (0 allocations: 0 bytes)
compile 20   0.001936 seconds (7.37 k allocations: 506.453 KiB, 25.99% compilation time)
run 20   282.500 μs (0 allocations: 0 bytes)
compile 30   0.004797 seconds (10.70 k allocations: 712.703 KiB, 13.73% compilation time)
run 30   710.200 μs (389 allocations: 6.08 KiB)
compile 40   0.006381 seconds (13.92 k allocations: 895.000 KiB, 11.14% compilation time)
run 40   1.124 ms (1089 allocations: 17.02 KiB)
compile 50   0.014616 seconds (19.46 k allocations: 1.197 MiB, 7.90% compilation time)
run 50   2.626 ms (1989 allocations: 31.08 KiB)

Removing @nospecialize I see

compile 0   0.004973 seconds (3.17 k allocations: 200.032 KiB, 99.10% compilation time)
run 0   15.115 ns (0 allocations: 0 bytes)
compile 10   0.035212 seconds (334.22 k allocations: 20.271 MiB, 99.91% compilation time)
run 10   248.705 ns (0 allocations: 0 bytes)
compile 20   0.353602 seconds (1.04 M allocations: 62.770 MiB, 2.10% gc time, 99.99% compilation time)
run 20   544.149 ns (0 allocations: 0 bytes)
compile 30   2.232014 seconds (1.87 M allocations: 112.422 MiB, 0.78% gc time, 100.00% compilation time)
run 30   900.000 ns (0 allocations: 0 bytes)
compile 40   8.972075 seconds (2.79 M allocations: 167.371 MiB, 0.33% gc time, 100.00% compilation time)
run 40   1.450 μs (0 allocations: 0 bytes)
compile 50  24.454485 seconds (3.98 M allocations: 237.567 MiB, 0.21% gc time, 100.00% compilation time)
run 50   1.860 μs (0 allocations: 0 bytes)

Unfortunately I can't reproduce the -O1 behavior for this example with 1.7.0-rc3 and am therefore unsure how good an MWE this is.

But at least this indicates that the compiler should stop generating specializations at some point?
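
One way to see how many specializations the MWE actually generates per function (a sketch; it assumes the MethodAnalysis.jl package, which is not part of the MWE above):

using MethodAnalysis  # assumed extra dependency providing methodinstances

# after running the loop above, count the compiled MethodInstances per function
for f in (binary, add, mul, square)
    println(nameof(f), ": ", length(methodinstances(f)))
end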

@goerch

goerch commented Nov 27, 2021

How is the behavior on 1.7? #42263 fixed one of the latency problems with Gridap and is in 1.7.

Checked 1.7.0-rc3 -O1/-O2: behavior looks similar to 1.6.4.

@KristofferC
Member

Thanks for checking and that's a great MWE.

@goerch

goerch commented Nov 27, 2021

OK, if you like this stuff, then the obvious continuation

using BenchmarkTools

zero = nothing
succ(n) = (n,)

unary(n) = n == 0 ? zero : succ(unary(n - 1))
binary(n::Nothing) = 0
binary(n) = begin @nospecialize; 1 + binary(n[1]); end

add(x::Nothing, y) = y
add(x, y) = begin @nospecialize; succ(add(x[1], y)); end

double(n) = begin @nospecialize; add(n, n); end

mul(x::Nothing, y) = nothing
mul(x, y) = begin @nospecialize; add(y, mul(x[1], y)); end

square(n) = begin @nospecialize; mul(n, n); end

exp(x, y::Nothing) = x
exp(x, y) = begin @nospecialize; mul(x, exp(x, y[1])); end

ack(x::Nothing, y) = begin @nospecialize; succ(y); end
ack(x, y::Nothing) = begin @nospecialize; ack(x[1], succ(zero)); end
ack(x, y) = begin @nospecialize; ack(x[1], ack(x, y[1])); end

versioninfo()
for n in 1:5
    print("compile")
    @time binary(exp(unary(n), unary(n)))
    print("run")
    @btime binary(exp(unary($n), unary($n)))
end

# print("compile")
# @time binary(ack(unary(4), unary(1)))
# print("run")
# @btime binary(ack(unary(4), unary(1)))

works

compile  0.011635 seconds (16.15 k allocations: 1000.735 KiB, 96.52% compilation time)
run  1.410 μs (0 allocations: 0 bytes)
compile  0.000279 seconds (2.09 k allocations: 145.500 KiB, 71.23% compilation time)
run  8.533 μs (0 allocations: 0 bytes)
compile  0.000519 seconds (4.33 k allocations: 299.750 KiB, 58.67% compilation time)
run  66.800 μs (0 allocations: 0 bytes)
compile  0.002059 seconds (8.70 k allocations: 551.469 KiB, 21.24% compilation time)
run  838.200 μs (513 allocations: 8.02 KiB)
compile  0.026189 seconds (55.24 k allocations: 2.536 MiB, 3.29% compilation time)
run  16.290 ms (15114 allocations: 236.16 KiB)

and crashes the compiler without @nospecialize.

@JeffBezanson
Member

  • -O1 and -O2 generate different specializations

FYI -O1 does not actually change which specializations are generated, but @nospecialize might have a similar effect due to giving LLVM worse code that some optimization passes can't handle, causing them to be skipped.
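
(As a side note, the optimization level of a running session can be double-checked; a small sketch relying on the internal Base.JLOptions, whose layout may change between versions:)

# opt_level is 2 for the default -O2 and 1 when Julia was started with -O1
Base.JLOptions().opt_level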

@JeffBezanson
Member

If you have time, a bisect might help here.

@goerch

goerch commented Nov 30, 2021

I'm no longer convinced my reproducers are any good. Bisecting with the 12-hour compile time of the original MWE might take quite some time. While trying to reduce the original MWE I stumbled upon quite a few type instabilities, and I have the distinct impression that there is a relation between type instabilities and longer compile times on -O2. Therefore I filed gridapapps/GridapGeosciences.jl#28 and gridapapps/GridapGeosciences.jl#29.

Edit: to demonstrate the differences from the original MWE:

using BenchmarkTools
using JET

_zero = nothing
succ(n) = (n,)

unary(n) = n == 0 ? _zero : succ(unary(n - 1))
binary(n::Nothing) = 0
binary(n) = begin 1 + binary(n[1]); end

add(x::Nothing, y) = y
add(x, y) = begin succ(add(x[1], y)); end

double(n) = begin add(n, n); end

@time binary(double(unary(500)))
@btime binary(double(unary(500)))
@report_opt binary(double(unary(100)))

yields on Julia 1.8.0-DEV -O2

7.644946 seconds (2.47 M allocations: 150.754 MiB, 0.79% gc time, 100.00% compilation time)
  31.200 μs (0 allocations: 0 bytes)
No errors !

and on Julia 1.5.4 -O2 without JET

  4.251926 seconds (2.37 M allocations: 146.511 MiB, 0.61% gc time)
  28.299 μs (0 allocations: 0 bytes)

So I see a slight compile-time regression, but not the orders of magnitude reported, and the example does not contain type instabilities. It might be related (and an issue in itself), but it is not a clear reproducer.

@fverdugo

fverdugo commented Dec 1, 2021

We were able to solve the problem by removing quite a lot of @inline statements in our code. I am afraid we were inlining A LOT of code into the same function. See the analysis I have performed here (even though it is for a different example than the one originally posted in this issue): gridap/Gridap.jl#723
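
As an illustration of the kind of change (a hypothetical sketch, not the actual Gridap code): forcing a large kernel to be inlined into every caller inflates the amount of code LLVM has to optimize per caller, whereas dropping the annotation keeps each caller small and lets the kernel be compiled once.

# before: the whole body is inlined into every caller
@inline function assemble_cell!(vec, vals, rows)
    for (i, r) in enumerate(rows)
        vec[r] += vals[i]
    end
    return vec
end

assemble_cell!(zeros(3), [1.0, 2.0], [1, 3])  # returns [1.0, 0.0, 2.0]

# after: simply delete the @inline annotation; the body stays unchanged and is
# reached through a normal call instead of being duplicated at each call site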

Hope this provides some clues.

cc @amartinhuertas

vtjnash closed this as completed Aug 10, 2022