Tell people to stop using global models #1085

Open · wants to merge 1 commit into master

Conversation

oxinabox
Member

People keep doing this and it is bad.
Having it in the performance tips should at least help.

But throughout the docs we violate this rule nearly constantly.
My bad example was basically copied straight from https://fluxml.ai/Flux.jl/stable/training/training/#Loss-Functions-1

But this PR is just going to add it to the performance tips.
It's beyond the scope of this PR to rewrite every example that violates it.
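For illustration (this snippet is not from the PR), a minimal sketch of the anti-pattern the tip targets next to the `const` fix, assuming the usual small Flux MLP:

```
using Flux

# Anti-pattern: `m` is a non-constant global, so `loss` is not type-stable.
m = Chain(Dense(784, 32, σ), Dense(32, 10), softmax)
loss(x, y) = Flux.mse(m(x), y)

# One fix: declare the model (and any other globals the loss reads) `const`.
const const_m = Chain(Dense(784, 32, σ), Dense(32, 10), softmax)
const_loss(x, y) = Flux.mse(const_m(x), y)
```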


#### Make the loss function actually close over `m`.
Closures can be very useful.
Contributor


Suggested change
Closures can be very useful.
[Closures](https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-captured-1) can be very useful.

Similarly anything else that is a non-constant global that is used in functions should also be made constant
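For illustration only, a hedged sketch of the closure pattern this hunk recommends, with the data also made `const`; the names (`make_loss`, `xs`, `ys`) are illustrative, not the PR's:

```
using Flux

# Build the loss as a closure over a local `m` instead of reading a global:
function make_loss(m)
    return (x, y) -> Flux.mse(m(x), y)   # `m` here is the captured argument, a local
end

m = Chain(Dense(784, 32, σ), Dense(32, 10), softmax)
loss = make_loss(m)

# Any other globals used inside functions (e.g. the data) should be `const` too:
const xs = randn(Float32, 784, 100)
const ys = softmax(randn(Float32, 10, 100))
loss(xs, ys)
```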

#### Put everything in a main function:
For more flexibility, you could even make this take `m` as a argument -- it doesn't matter of `m` was originally declared as a non-const global once it has been passed in as a argument because it then becomes a local variable.
Member


Suggested change
For more flexibility, you could even make this take `m` as a argument -- it doesn't matter of `m` was originally declared as a non-const global once it has been passed in as a argument because it then becomes a local variable.
For more flexibility, you could even make this take `m` as a argument -- it doesn't matter if `m` was originally declared as a non-const global once it has been passed in as a argument because it then becomes a local variable.
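Again for illustration (names are illustrative, not the PR's), a sketch of the "main function" pattern described above, with `m` passed in as an argument:

```
using Flux

function main(m, data)
    # Inside `main`, `m` and `data` are local variables, even if they were
    # non-const globals at the call site, so the closure below is type-stable.
    loss(x, y) = Flux.mse(m(x), y)
    Flux.train!(loss, Flux.params(m), data, Descent(0.1))
end

m = Chain(Dense(784, 32, σ), Dense(32, 10), softmax)   # non-const is fine here
x, y = randn(Float32, 784, 100), softmax(randn(Float32, 10, 100))
main(m, [(x, y)])
```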

@DhairyaLGandhi
Member

Do we link to https://docs.julialang.org/en/v1/manual/performance-tips/ appropriately too?


```
Flux.train!(loss, Flux.params(m), data, Descent(0.1))
```
Similarly anything else that is a non-constant global that is used in functions should also be made constant
Contributor


Perhaps explicitly call it data? Or inputs and outputs. That way most bases are covered; I can't imagine anything other than models and data being used in forward passes.

@bhvieira
Contributor

I'll be frank here: I didn't know declaring the models constant helped, since the parameters themselves are type-stable. I declare models inside functions, so I'm not hit by this when training, but I've been running some post-hoc analysis on trained models, so this should benefit me when loading them into memory, right? I'll try to benchmark this later.
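For illustration, a sketch of that workflow with a hypothetical `load_trained_model` standing in for whatever deserialization is actually used (BSON, JLD2, ...): either bind the loaded model to a `const`, or pass it into the analysis function so it becomes a local.

```
using Flux

# Hypothetical stand-in for however the trained model is deserialized:
load_trained_model() = Chain(Dense(784, 32, σ), Dense(32, 10), softmax)

# Either bind the loaded model to a `const`...
const trained = load_trained_model()

# ...or pass it into the analysis function, where it becomes a local variable:
analyze(m, x) = sum(abs2, m(x))

x = randn(Float32, 784, 100)
analyze(trained, x)
```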

@bhvieira
Contributor

bhvieira commented Mar 27, 2020

```
using Flux
using BenchmarkTools

m = Chain(Dense(784, 32, σ), Dense(32, 10), softmax);
const const_m = Chain(Dense(784, 32, σ), Dense(32, 10), softmax);
loss(x, y) = Flux.mse(m(x), y);
const_loss(x, y) = Flux.mse(const_m(x), y);
x, y = (randn(Float32, 784, 100), softmax(randn(Float32, 10, 100)));
const const_x, const_y = (randn(Float32, 784, 100), softmax(randn(Float32, 10, 100)));

#tests without `$` interpolation, I guess it makes sense since we want to measure the effect of globals
@benchmark loss(x, y)

BenchmarkTools.Trial:
memory estimate: 46.88 KiB
allocs estimate: 20

minimum time: 228.200 μs (0.00% GC)
median time: 249.301 μs (0.00% GC)
mean time: 279.954 μs (1.77% GC)
maximum time: 5.604 ms (94.28% GC)

samples: 10000
evals/sample: 1

@benchmark loss(const_x, const_y)

BenchmarkTools.Trial:
memory estimate: 46.88 KiB
allocs estimate: 20

minimum time: 227.500 μs (0.00% GC)
median time: 242.500 μs (0.00% GC)
mean time: 266.806 μs (1.45% GC)
maximum time: 4.452 ms (94.59% GC)

samples: 10000
evals/sample: 1

@benchmark const_loss(x, y)

BenchmarkTools.Trial:
memory estimate: 46.88 KiB
allocs estimate: 20

minimum time: 228.600 μs (0.00% GC)
median time: 241.200 μs (0.00% GC)
mean time: 272.915 μs (1.30% GC)
maximum time: 3.870 ms (92.49% GC)

samples: 10000
evals/sample: 1

@benchmark const_loss(const_x, const_y)

BenchmarkTools.Trial:
memory estimate: 46.86 KiB
allocs estimate: 19

minimum time: 230.500 μs (0.00% GC)
median time: 242.700 μs (0.00% GC)
mean time: 280.045 μs (0.89% GC)
maximum time: 3.766 ms (0.00% GC)

samples: 10000
evals/sample: 1


@benchmark Flux.train!(loss, Flux.params(m), [(x,y)], Descent(0.1))

BenchmarkTools.Trial:
memory estimate: 744.63 KiB
allocs estimate: 10388

minimum time: 1.133 ms (0.00% GC)
median time: 1.239 ms (0.00% GC)
mean time: 1.414 ms (4.25% GC)
maximum time: 7.640 ms (0.00% GC)

samples: 3516
evals/sample: 1

@benchmark Flux.train!(const_loss, Flux.params(const_m), [(const_x,const_y)], Descent(0.1))

BenchmarkTools.Trial:
memory estimate: 744.59 KiB
allocs estimate: 10387

minimum time: 1.134 ms (0.00% GC)
median time: 1.233 ms (0.00% GC)
mean time: 1.437 ms (4.32% GC)
maximum time: 6.885 ms (0.00% GC)

samples: 3458
evals/sample: 1
```

@jondeuce
Contributor

jondeuce commented Jun 28, 2020

Echoing @bhvieira's thoughts, I'm not seeing much of a performance hit in practice, with the exception of very small models. Consider the following examples:

```
m_large = Flux.Chain(Flux.Dense(768,1024), Flux.Dense(1024,10), Flux.softmax)
m_small = Flux.Chain(Flux.Dense(1,1), Flux.Dense(1,1), Flux.softmax)
const const_m_large = deepcopy(m_large)
const const_m_small = deepcopy(m_small)

loss_large(x,y) = Flux.mse(m_large(x), y)
loss_small(x,y) = Flux.mse(m_small(x), y)
grad_large(x,y) = Flux.gradient(() -> loss_large(x,y), Flux.params(m_large))
grad_small(x,y) = Flux.gradient(() -> loss_small(x,y), Flux.params(m_small))
loss_const_large(x,y) = Flux.mse(const_m_large(x), y)
loss_const_small(x,y) = Flux.mse(const_m_small(x), y)
grad_const_large(x,y) = Flux.gradient(() -> loss_const_large(x,y), Flux.params(const_m_large))
grad_const_small(x,y) = Flux.gradient(() -> loss_const_small(x,y), Flux.params(const_m_small))

makexy(in, out, batch) = rand(Float32, in, batch), rand(Float32, out, batch)

@btime loss_large(x,y)       setup = ((x,y) = makexy(768, 10, 1024)); # 6.491 ms (19 allocations: 8.20 MiB)
@btime loss_const_large(x,y) setup = ((x,y) = makexy(768, 10, 1024)); # 6.515 ms (18 allocations: 8.20 MiB)

@btime grad_large(x,y)       setup = ((x,y) = makexy(768, 10, 1024)); # 18.681 ms (161 allocations: 18.54 MiB)
@btime grad_const_large(x,y) setup = ((x,y) = makexy(768, 10, 1024)); # 18.459 ms (157 allocations: 18.54 MiB)

@btime loss_small(x,y)       setup = ((x,y) = makexy(1, 1, 1)); # 1.059 μs (12 allocations: 944 bytes)
@btime loss_const_small(x,y) setup = ((x,y) = makexy(1, 1, 1)); # 1.021 μs (11 allocations: 928 bytes)

@btime grad_small(x,y)       setup = ((x,y) = makexy(1, 1, 1)); # 15.880 μs (133 allocations: 5.81 KiB)
@btime grad_const_small(x,y) setup = ((x,y) = makexy(1, 1, 1)); # 12.370 μs (129 allocations: 5.72 KiB) <--- Performance hit
```

Interestingly, the ~3 μs performance hit in the gradient of the small model is fairly robust to changes in model size.

To be honest, I'm not exactly sure of the point I'm trying to make; of course, declaring models as const is strictly better than not, and reinforcing this in the documentation (especially since it is good general Julia practice) is a good idea. Perhaps, however, it could be noted that for large computations the use of global models is unlikely to seriously degrade performance, since the model's internal parameters are themselves type-stable.

@bhvieira
Contributor

@jondeuce Yeah, I think that's the take-home message here. The microsecond difference might be due to the few extra allocations we get without const.

@CarloLucibello
Member

Given the above benchmarks, the tone here is too alarmist and should be moderated.
