Tell people to stop using global models #1085
base: master
Conversation
#### Make the loss function actually close over `m`.
Closures can be very useful.
Suggested change (link "Closures" to the Julia performance tips):
[Closures](https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-captured-1) can be very useful.
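For context, a minimal sketch of what "make the loss close over `m`" can look like. The model, layer sizes, and data shapes below are illustrative assumptions, not taken from the PR:

```
using Flux

m = Flux.Chain(Flux.Dense(10, 5, Flux.relu), Flux.Dense(5, 2))

# The anonymous function captures the `let`-local `m`, so evaluating the
# loss no longer looks up a non-constant global binding on every call.
loss = let m = m
    (x, y) -> Flux.mse(m(x), y)
end

loss(rand(Float32, 10, 16), rand(Float32, 2, 16))
```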
Similarly anything else that is a non-constant global that is used in functions should also be made constant
#### Put everything in a main function:
For more flexibility, you could even make this take `m` as a argument -- it doesn't matter of `m` was originally declared as a non-const global once it has been passed in as a argument because it then becomes a local variable.
Suggested change ("of" → "if"):
For more flexibility, you could even make this take `m` as a argument -- it doesn't matter if `m` was originally declared as a non-const global once it has been passed in as a argument because it then becomes a local variable.
Do we link to https://docs.julialang.org/en/v1/manual/performance-tips/ appropriately too?
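A rough sketch of the "main function" variant that takes `m` as an argument, reusing the same `Flux.train!` signature as the snippet under review. The model, data, and sizes here are placeholders:

```
using Flux

function main(m)
    # `m` is a local variable here, even if the caller created it as a
    # non-const global, so the usual globals penalty does not apply.
    loss(x, y) = Flux.mse(m(x), y)
    data = [(rand(Float32, 10, 16), rand(Float32, 2, 16))]
    Flux.train!(loss, Flux.params(m), data, Flux.Descent(0.1))
    return m
end

m = Flux.Chain(Flux.Dense(10, 5, Flux.relu), Flux.Dense(5, 2))
main(m)
```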
```
Flux.train!(loss, Flux.params(m), data, Descent(0.1))
```
Similarly anything else that is a non-constant global that is used in functions should also be made constant |
Perhaps explicitly call it `data`? Or `inputs` and `outputs`. That way most bases are covered; I can't imagine other things being used in forward passes besides models and data.
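To illustrate that point with hypothetical names: any other non-constant global used inside the loss or training loop, such as the training data, can be declared `const` as well:

```
# Illustrative names only: declare the data globals `const` so the loss
# and training loop don't pay for non-constant global lookups.
const inputs  = rand(Float32, 10, 16)
const outputs = rand(Float32, 2, 16)
const data    = [(inputs, outputs)]
```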
I'll be frank here: I didn't know declaring the models constant helped, because the parameters are type-stable. I declare models inside functions, so I'm not hit by this when training, but I've been running some post-hoc analysis on the trained models, and this should benefit me when loading trained models into memory, right? I'll try to benchmark this later.
Echoing @bhvieira's thoughts, I'm not seeing much of a performance hit in practice, with the exception of very small models. Consider the following examples:

```
using Flux, BenchmarkTools

m_large = Flux.Chain(Flux.Dense(768,1024), Flux.Dense(1024,10), Flux.softmax)
m_small = Flux.Chain(Flux.Dense(1,1), Flux.Dense(1,1), Flux.softmax)
const const_m_large = deepcopy(m_large)
const const_m_small = deepcopy(m_small)

loss_large(x,y) = Flux.mse(m_large(x), y)
loss_small(x,y) = Flux.mse(m_small(x), y)
grad_large(x,y) = Flux.gradient(() -> loss_large(x,y), Flux.params(m_large))
grad_small(x,y) = Flux.gradient(() -> loss_small(x,y), Flux.params(m_small))

loss_const_large(x,y) = Flux.mse(const_m_large(x), y)
loss_const_small(x,y) = Flux.mse(const_m_small(x), y)
grad_const_large(x,y) = Flux.gradient(() -> loss_const_large(x,y), Flux.params(const_m_large))
grad_const_small(x,y) = Flux.gradient(() -> loss_const_small(x,y), Flux.params(const_m_small))

makexy(in, out, batch) = rand(Float32, in, batch), rand(Float32, out, batch)

@btime loss_large(x,y)       setup = ((x,y) = makexy(768, 10, 1024)); # 6.491 ms  (19 allocations: 8.20 MiB)
@btime loss_const_large(x,y) setup = ((x,y) = makexy(768, 10, 1024)); # 6.515 ms  (18 allocations: 8.20 MiB)
@btime grad_large(x,y)       setup = ((x,y) = makexy(768, 10, 1024)); # 18.681 ms (161 allocations: 18.54 MiB)
@btime grad_const_large(x,y) setup = ((x,y) = makexy(768, 10, 1024)); # 18.459 ms (157 allocations: 18.54 MiB)

@btime loss_small(x,y)       setup = ((x,y) = makexy(1, 1, 1)); # 1.059 μs  (12 allocations: 944 bytes)
@btime loss_const_small(x,y) setup = ((x,y) = makexy(1, 1, 1)); # 1.021 μs  (11 allocations: 928 bytes)
@btime grad_small(x,y)       setup = ((x,y) = makexy(1, 1, 1)); # 15.880 μs (133 allocations: 5.81 KiB)
@btime grad_const_small(x,y) setup = ((x,y) = makexy(1, 1, 1)); # 12.370 μs (129 allocations: 5.72 KiB) <--- Performance hit
```

Interestingly, the ~3 μs performance hit in the gradient of the small model is fairly robust to changes in model size. To be honest, I'm not exactly sure of the point I'm trying to make; of course, declaring models as `const` …
@jondeuce Yeah, I think that's the take-home message here. The microsecond difference might be due to those few extra allocations we get without `const`.
Given the above benchmarks, the tone here is too alarmist and should be moderated.
People keep doing this and it is bad.
Having it in the performance tips should at least help.
But throughout the docs we violate this rule nearly constantly.
My bad example was basically copied straight from https://fluxml.ai/Flux.jl/stable/training/training/#Loss-Functions-1
But this PR is just going to add it to the performance tips.
It's beyond the scope of this PR to rewrite every example that violates this.