Optimise a subset of parameters #35

Closed
mcabbott opened this issue Jan 28, 2022 · 7 comments · Fixed by #36

Comments

@mcabbott (Member) commented Jan 28, 2022

Flux's trainable works like this:

julia> Flux.trainable(BatchNorm(2, relu))  # this avoids half the parameters
(Float32[0.0, 0.0], Float32[1.0, 1.0])

julia> Functors.children(BatchNorm(2, relu))   # this sees them all, for |> gpu
(λ = NNlib.relu, β = Float32[0.0, 0.0], γ = Float32[1.0, 1.0], μ = Float32[0.0, 0.0], σ² = Float32[1.0, 1.0], ϵ = 1.0f-5, momentum = 0.1f0, affine = true, track_stats = true, active = nothing, chs = 2)

This doesn't seem great: it relies on objectid to know which parameters those really are. So this:

function _trainable_walk(f, x)
  func, re = functor(x)    # NamedTuple of all children, plus the re-constructor
  nb = trainable(x)        # only the trainable children, as values
  re(map(c -> c in nb ? f(c) : c, func))  # apply f only to children found in nb
end

will not work correctly when, say, β === SA[0.0, 0.0] === μ.
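
A minimal sketch (not from the original comment) of that failure mode, assuming StaticArrays is available:

using StaticArrays

β = SA[0.0, 0.0]
μ = SA[0.0, 0.0]            # separate literals, yet β === μ, since SVectors are isbits
nb = (β, SA[1.0, 1.0])      # pretend these are the trainable children (β, γ)
map(c -> c in nb, (β, μ))   # (true, true): μ would wrongly be treated as trainable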

How should it work?

  • One idea would be to clone the @functor macro to have @trainable BatchNorm (β, γ)? In fact this case is even worse: it checks a runtime value (the affine flag) here, but we could probably move affine into the type.

  • Another idea would be just to have trainable(::BatchNorm) = (:β, :γ), returning the field names as Symbols. That's much easier to write and perhaps less mysterious. Might be slower; do we care? Or it might not be, if the symbols are known from the type. It would be easy here to allow Flux-style tuples as a fallback, detecting NTuple{Symbol} etc., making it easier to have both old- and new-style at once. (A sketch of how a walk might consume the Symbols follows this list.)
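
A hypothetical sketch (the name _trainable_walk_by_name and its body are not from the thread) of how a Symbol-returning trainable could be consumed, assuming functor(x) returns a NamedTuple of children as it does for @functor structs:

# assumes `using Functors: functor` and the Symbol-returning trainable above
function _trainable_walk_by_name(f, x)
  func, re = functor(x)          # NamedTuple of all children, plus the rebuilder
  names = trainable(x)           # e.g. (:β, :γ)
  mapped = map(keys(func)) do k
    k in names ? f(func[k]) : func[k]   # transform only the named children
  end
  re(NamedTuple{keys(func)}(mapped))
end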

This would be used during setup, just one pass. After that, the tree of optimiser states should tell you whether or not to update a given array, so update need never call this.

What might call it more often is destructure, which I think should walk only the trainable parameters, and which will sometimes be called in a loop.

@darsnack (Member)

_trainable_walk seems correct assuming fmap's behavior for shared parameters. I feel like this is more evidence that fmap should not be caching nodes for the sake of sharing.

@mcabbott (Member, Author) commented Jan 28, 2022

It should work fine with Arrays. With SArrays... right now fmap will conclude that β and μ are shared, but I think the map(c -> ...) runs on the children before it's checked that they're shared... and then I don't quite know, I guess it may depend on the order of the fields in the struct?

But anyway the scope here is narrower. I think you are agreeing that this trainable doesn't capture the right information. So what should replace it?

@darsnack (Member)

You're right, I completely misread the issue. We definitely agree on what's wrong here.

Why not just have trainable return a NamedTuple instead? Or the NTuple{Symbol} option is okay too.

@darsnack (Member)

> This would be used during setup, just one pass. After that, the tree of optimiser states should tell you whether or not to update a given array, so update need never call this.

That's an option, though I'd suggest we don't introduce state we can get for free. What I mean is that if trainable returns the correct thing, then fmap(..., walk = walk_subset(trainable)) should work (a rough sketch follows). We don't use fmap now, but eventually I think we should get it to the place where we can.
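
A rough sketch of that idea; walk_subset is hypothetical, not an existing function, and it assumes the Functors 0.2-era fmap(f, x; walk) keyword where walk is called as walk(recurse, x), plus the Symbol-returning trainable sketched above:

walk_subset(trainable) = (recurse, x) -> begin
  func, re = functor(x)                 # NamedTuple of children for @functor structs
  names = trainable(x)                  # e.g. (:β, :γ)
  mapped = map(k -> k in names ? recurse(func[k]) : func[k], keys(func))
  re(NamedTuple{keys(func)}(mapped))
end

# usage, roughly:
# state = fmap(x -> init(rule, x), model; walk = walk_subset(trainable))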

@mcabbott (Member, Author)

> Why not just have trainable return a NamedTuple instead?

We could do this. Flux's rules can be updated to do that without breaking anything there.

I am picturing that, in a not-too-distant version of Flux, it will depend on Optimisers and provide methods which work here, while still being usable in its old way.

If user code returns a Tuple, we can convert it (using objectid) and print a warning.
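
A hypothetical sketch of that (the helper name _trainables and the warning text are not from the thread): a NamedTuple-returning trainable, plus an objectid-based conversion of old-style Tuples with a warning:

trainable(bn::BatchNorm) = (β = bn.β, γ = bn.γ)   # new-style: keep the field names

function _trainables(x)                 # normalise whatever trainable returns
  tr = trainable(x)
  tr isa NamedTuple && return tr        # new-style: already carries the names
  @warn "trainable(::$(typeof(x))) returned a Tuple; please return a NamedTuple" maxlog=1
  func, _ = functor(x)
  ids = map(objectid, tr)               # old-style: match children by objectid
  NamedTuple(k => c for (k, c) in pairs(func) if objectid(c) in ids)   # Julia ≥ 1.6
end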

@mcabbott (Member, Author)

> don't introduce state we can get for free

You can write setup in one line with fmap and some trainable-walk. But whether this is better than trusting the tree of optimiser states, I don't know. That tree is an object we already make and pass along, and it must be right.

I guess that they start to differ more once we handle tied parameters. I picture the tree of states also having (at its root) some lenses or something telling us about what transformations to perform before starting; these things are figured out once during setup. They could also be done every time, but I think that would need another pass over the model before the update one.

@ToucheSir (Member)

But setup is the function that constructs that tree of optimizer states? So it dictates the traversal behaviour for all subsequent operations. The root issue here is that Functors.functor(::Type{T}, x) currently has a monopoly over reconstruction of T from its functored representation. We can work around this downstream well enough by subsetting as discussed here and merging the changes back in afterwards before reconstruction, but longer term this (the inflexibility of the re closure and the fact that you have to carry around a closure at all) ought to be addressed.
