SPSA improvements [RFC] #535
Comments
From my experience with SPSA, the main problem is the high level of noise in the results. If any proposal reduces this noise, I agree with it :-)
Can we choose the number N, and in particular increase it? I think that below 100 games, the result can be completely wrong and lead to bad convergence. |
@MJZ1977 the companion code of the seminal paper asks for the number of averaged SP gradients to be used per iteration. List updated, thank you :) |
@ppigazzini : what are the effects of these options? Do they change the number N of games before updating parameters? |
@MJZ1977 "careful clipping" 7eebda7 and randomized rounding 5f63500 are theoretical improvements with little/no effect on SPSA convergence wrt the other parameters. People stuck to the defaults, so the GUI was simplified by dropping the possibility to choose them. I will do some other tests and then simplify the code, dropping the options that are not useful. |
From what I'm finding online, alpha is usually 0.602, gamma at 0.101 is OK, and A is ~10% of the number of iterations. Would these be good defaults for the SPSA fields? Sources: |
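For illustration, here is a minimal sketch of the standard Spall-style gain schedules that use those alpha, gamma and A values; the function name and the values of a and c are placeholders, not fishtest's actual code:

```python
# Sketch of the usual SPSA gain schedules (alpha = 0.602, gamma = 0.101);
# a, c and the names are illustrative, not taken from fishtest.
def gain_schedules(num_iterations, A, a=1.0, c=1.0, alpha=0.602, gamma=0.101):
    for k in range(num_iterations):
        a_k = a / (A + k + 1) ** alpha   # step size for the parameter update
        c_k = c / (k + 1) ** gamma       # size of the +/- perturbation
        yield a_k, c_k
```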
@linrock it definitely makes sense to have defaults for the fields (actually, I was thinking they had defaults...). Also @ppigazzini suggests having A depend on the number of games. Shouldn't we call the field 'A [in %]' and give it a default of 10%, so that the field doesn't need to be adjusted when the number of games is changed? |
Ah yeah, I removed the SPSA defaults in the "create new test" redesign PR when all that should've been removed was the list of hard-coded params in the SPSA parameter list. A as a percentage of # games makes sense. From what I'm reading, A is typically less than or equal to 10% of the expected # of iterations (2 games per iteration). So maybe it could be either:
|
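A hypothetical helper for the "A as a percentage of # games" idea could look like the sketch below (assuming, as above, 2 games per iteration; the function name and the 10% default are illustrative):

```python
# Hypothetical: derive the stability constant A from the requested number of
# games, assuming one SPSA iteration consumes 2 games (+delta and -delta).
def stability_constant(num_games, pct=0.10):
    num_iterations = num_games // 2
    return pct * num_iterations      # e.g. 40000 games -> A = 2000
```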
Haha, in all this time I never realised that A was (/ should be) related to the number of games! :) Regarding SPSA at very low tc, does that stress the server a lot because workers are continually returning small batches of data? |
@xoto10 the SPSA at very low tc can be also done locally :) |
@linrock either percentage seems fine to me. Probably games, since we specify #games for SPSA and not number of iterations. In the future, I could imagine that an iteration contains more than 2 games (i.e. batching for SPSA, @vdbergh?), to reduce server load, and because it presumably makes sense (but I don't know the SPSA details). |
@vondele I am working on a small PR to allow the server to set a batch_size. It is mainly for SPRT, but it will also work for SPSA and fixed games, although for those one may consider leaving it to the worker. We can see. |
@ppigazzini : I am trying to understand how the SPSA code works, though my knowledge is very weak. Never mind, I am trying.
My questions are:
And sorry for these "technical questions" ... Update : latest version of code |
@MJZ1977 I think it is great somebody is looking at the implementation of SPSA. I'm still puzzled why our tuning attempts have such a low success rate (@linrock's recent experience). I do think we need a very large number of games, as the Elo differences we're looking for are so small, and the parameters of SPSA are not obvious or automatic, but I also think we need to critically audit the actual implementation, just in case. |
@MJZ1977 You should also look at the worker code to get the complete picture, at and below this line. See https://github.com/zamar/spsa for the original implementation. |
@MJZ1977 A worker plays batches of 2*N-CPU games (white/black alternating) and requests a parameter update from the server after every batch. |
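A rough sketch of that worker/server loop, with hypothetical names (request_spsa_task, play_game_pair and update_spsa are placeholders, not the real fishtest API):

```python
# Sketch of the batching described above: each batch plays one game pair per
# CPU (theta+delta vs theta-delta, colors alternating) and the server applies
# one SPSA update per batch.
def run_spsa_worker(server, engine, n_cpus, play_game_pair):
    while True:
        task = server.request_spsa_task()        # perturbed parameter sets for this batch
        if task is None:
            break
        result = {"wins": 0, "losses": 0, "draws": 0}
        for _ in range(n_cpus):                  # 2 * n_cpus games per batch
            pair = play_game_pair(engine, task)  # one game at +delta, one at -delta
            for key in result:
                result[key] += pair[key]
        server.update_spsa(task, result)         # server updates parameters after every batch
```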
@vondele SPSA claims to minimize the number of function evaluations. Classic SPSA evaluates the function only at "variables_values_k+delta; variables_values_k-delta" for the gradient estimation, so SPSA obviously diverges with a wrong delta. This is why I suggest testing the SPSA parameters locally with USTC before submitting to fishtest. One-sided SPSA computes the gradient with "variables_values_k+delta; variables_values_k", so with a CPU-cost-free function evaluation at variables_values_k it is possible to implement:
Neither policy can guarantee convergence with a bad delta, though. SPSA (like all gradient descent algorithms) works only to refine the starting values within the starting basin; to find better local maxima we should switch to global optimization algorithms based on function evaluations (Nelder-Mead, genetic, etc.) that explore the variable space. |
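For concreteness, here is a minimal sketch of the two gradient estimators being compared (classic two-sided vs one-sided SPSA); f stands for the noisy match score and all names are illustrative:

```python
# Classic SPSA gradient estimate: two evaluations, at theta + c_k*flip and
# theta - c_k*flip (flips is the +1/-1 perturbation vector).
def spsa_gradient_two_sided(f, theta, c_k, flips):
    plus  = [t + c_k * s for t, s in zip(theta, flips)]
    minus = [t - c_k * s for t, s in zip(theta, flips)]
    diff = f(plus) - f(minus)
    return [diff / (2.0 * c_k * s) for s in flips]

# One-sided variant: reuses the (already available) evaluation at theta, so it
# needs only one new evaluation per iteration.
def spsa_gradient_one_sided(f, theta, f_theta, c_k, flips):
    plus = [t + c_k * s for t, s in zip(theta, flips)]
    diff = f(plus) - f_theta
    return [diff / (c_k * s) for s in flips]
```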
@tomtor : thank you for the links ! |
@ppigazzini concerning Nelder-Mead, I did work on interfacing cutechess games to the nevergrad suite of optimizers: https://github.com/vondele/nevergrad4sf and picked TBPSA, which seems to be the recommended optimizer for noisy functions. I found it robust if given enough games (literally millions). Unfortunately, the optimized parameters seem very good at the TC they have been optimized at (VSTC), but not transferable. Since I can't optimize at STC or LTC, it would need to be integrated in fishtest... but I'm not able to do that (time and experience with the framework lacking atm)... if somebody wants to pick it up, I would be happy to help. |
After making some tests, I think that one of the principal problems is that the random parameter "flip" only takes the values +1 or -1 (please correct me if I am wrong). So basically, fishtest always tries to change all variables at the same time. |
If we want to tune 1 constant, it would be nice if tuning could simply test the start value and N values either side (3? 5?) and then display a bar chart of the resulting performance. That might give us an easy-to-read clue as to whether there's a trend in which values are better. |
SPSA = Simultaneous perturbation stochastic approximation
Random [+1, -1] is the Rademacher distribution; you can use other distributions, but IMO the result will not change: we can get good fishtest gains from SPSA only for badly tuned parameters or when SPSA finds a new local maximum by serendipity. SPSA, like other gradient algorithms, is a local optimization, useful to refine the starting values "without hopping from the starting basin". @xoto10 you are talking about a global optimization algorithm; take a look at @vondele's work. |
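A minimal sketch of the perturbation step being described (illustrative only): each parameter draws its own +1/-1, so all parameters move at once, but each in an independently chosen direction:

```python
import random

# Rademacher perturbation: every parameter gets an independent +1/-1 draw.
def rademacher_flips(num_params):
    return [random.choice((-1, 1)) for _ in range(num_params)]

# Example: perturb 5 parameters by c_k in the drawn directions.
theta = [10.0, 20.0, 30.0, 40.0, 50.0]
c_k = 2.0
flips = rademacher_flips(len(theta))          # e.g. [1, -1, -1, 1, -1]
theta_plus  = [t + c_k * s for t, s in zip(theta, flips)]
theta_minus = [t - c_k * s for t, s in zip(theta, flips)]
```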
While TBPSA might also work for global optimization (that's always hard), I don't think we're typically stuck in local minima. At least, I have never seen evidence of that. TBPSA just seems to be rather good at doing the right thing in the presence of noise, also in (a relatively small) number of dimensions. @xoto10 the bar chart will tell almost nothing in most cases, unless we do on the order of 240000 games per point (that's roughly 1 Elo error, i.e. the typical gain from a tune). I once did a scan of one of the search parameters, and the graph is somewhere in a thread on GitHub, which I can't find right now, and it looks like this: |
In that case a (properly implemented) SPSA should be able to find a better value, but in my first post I collected all my doubts about our SPSA implementation. A simple check is to set a blatantly wrong value for a parameter (e.g. Queen = 0.1 pawn, sorry, I'm not an SF developer :) and see if our SPSA is able to recover a good value. |
I have made some tests since yesterday and come to the conclusion that SPSA is currently not working well because of too much noise in the individual results. The only solution to this is to make iterations of at least 200 games instead of 2N games. An improvement could be to add an SPSA parameter = minimum number of games per iteration, instead of the default 2N. |
@vondele Some parameters naturally have only a discrete set of values, e.g. depth. I have been reading the Fishtest rounding and clipping code and it does more or less the right thing, in the sense that it will not put rounded values in its database. So a parameter will not get stuck at a particular value due to rounding. It might be good, though, to re-enable stochastic rounding for the integer parameters that are actually sent to the worker. I don't see any drawback to that. |
Stochastic rounding should have a practical effect only with wrong parameter scaling or bad hyperparameters. |
@ppigazzini Why do you say this? I am mainly thinking of parameters which are naturally integers, like depth. There is no benefit in replacing … The SPSA algorithm fundamentally deals with real numbers, and stochastic rounding is an elegant 1-line trick for combining real numbers and integers. |
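A minimal sketch of that 1-line trick (the generic randomized-rounding idea, not necessarily the exact fishtest code): the real-valued parameter stays server-side, and the integer sent to the worker is rounded up with probability equal to the fractional part, so the rounding is unbiased on average:

```python
import math
import random

# Stochastic rounding: round x up with probability equal to its fractional
# part, so E[stochastic_round(x)] == x (illustrative sketch).
def stochastic_round(x):
    lower = math.floor(x)
    return lower + (random.random() < (x - lower))

# Example: 2.3 is sent as 3 roughly 30% of the time and as 2 otherwise.
print(sum(stochastic_round(2.3) for _ in range(10000)) / 10000)  # ~2.3
```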
Oops, errata corrige: "wrong parameter scaling" should be "wrong parameter range/scale". If the depth range is [80,81] I think that the right way is to run 2 SPSAs (with depth=80 and depth=81) to optimize the other parameters. If the depth range is [0,160] I don't think that stochastic rounding has an effect wrt deterministic rounding on the SPSA convergence. I don't see drawbacks in stochastic rounding, but the most important thing is to have a good SPSA guideline to avoid wasting CPU resources. |
@ppigazzini Those are extreme examples. I was thinking more of intervals like … I am actually not sure if stochastic rounding provides any benefit at all. But it kind of avoids one having to think about the issue. |
Stochastic rounding is fine, but I fear a dev who avoids thinking before submitting an SPSA :) |
I started writing some JavaScript. Currently I am a bit stressed out about the draw ratio. The draw ratio can be measured dynamically, but this requires changing things on the server side. Probably it is best to start with only client-side changes. On the client side we can make a reasonable guess about the draw ratio starting from the time control (i.e. interpolating). This would depend on the book, but the book does not change very often, so that should be OK. But then there is nodestime :( :( I don't really want to think about nodestime. AFAIK its benefit has never been demonstrated, so I am tempted to just disable it when using the alternative algorithm. |
I'd take any reasonable draw ratio (typical STC or LTC values)... at this point nothing too advanced. |
The effect of the draw ratio on the required number of games is too big to ignore unfortunately. But the following simple function predicts the draw ratio within 1% accuracy from VSTC to LTC.
Disclaimer: the fit was based on only 4 data points: 1+0.01, 10+0.1, 20+0.2 and 60+0.6. |
Just checked a 120+1.2 test and the function still gave the right answer (0.79). |
Can we please increase the batch size from 2N to 4N or 8N ?
|
bench or batch? |
Batch, sorry! I'll correct it. |
Anything you guys think is particularly worth implementing based on this massive discussion so far? I'm happy to help either improve SPSA or introduce another optimizer (e.g. nevergrad). |
I don't have enough knowledge to suggest one option or another; I'm fine with what the experts suggest as the default. Note that this thread has led to optional new defaults for SPSA. Also, with the merge of NNUE, I expect there will be fewer SPSA tunes... even though there is potential there as well. |
@linrock : if you can increase the batch size from 2xN to 4xN, or make it an option, that would be great! |
A question about SPSA; is it possible to incorporate the draw rate into the parameter optimisation, i.e. obviously have wins-losses as the main target, but also prefer a lower draw rate ? |
SPSA could optimize another objective function. Right now it optimizes the score of the match (i.e. Elo), but you could have it optimize something else, in principle. Imagine you have (w,l,d): what formula f(w,l,d) would you optimize? |
I was thinking that the objective function was (something like) W-L, but reading your comment and a quick look at the Wikipedia Elo page suggests that the objective function is perhaps W + 0.5 * D? (Which I should have guessed at / known to start with.) If so, can we reduce the draw weighting, with the idea that this would encourage a more aggressive style of play? It would be interesting if SPSA runs could be parameterized to allow this so that we could try some tests. (Well, it seems like an interesting possibility to me - happy to be corrected :-) |
Yes, probably one could use e.g. 'w + 0.4 * d' and one would effectively optimize something like contempt. One could similarly do SPSA with the current objective function, but use e.g. time odds (so optimize scores against a weaker player). I played with that locally, but without much success. It would presumably become even more difficult to have patches pass after such an SPSA tune, since our SPRT tests measure strength, and it seems contempt goes against that.... |
Hmmm. I tend to think in terms of w-l being the ultimate aim, but because the 3 variables are linked (with a fixed number of games n, w + 0.5d = 0.5n + 0.5(w-l)) I think this is equivalent to w+0.5d (not surprising I guess; it would be strange if years spent optimizing for w+0.5d were not actually aiming at the best target!)
So optimizing for w+0.5d makes +20-10 equivalent to +15-5. Changing to w+0.4d would make +19-11 equivalent to +15-5 ! So it would be as happy with +8 instead of +10 because of the 10 extra decisive games. Hmmm. Maybe 0.49 :) I would want w-l to be the most important, but lower draw rates to be preferred if w-l values are the same. So perhaps include the draw rate in the fractions, e.g. w+0.5d-f where f = d/(w+l+d)/2 so that f cannot be more important than the main part of the formula.
This would give a slight preference to the +20-10 result compared to the +15-5 one. |
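A small worked check of the formulas above, over the same 100 games (numbers purely illustrative): w+0.5d treats +20-10 and +15-5 identically, while the extra f = d/(w+l+d)/2 term slightly prefers the sharper result.

```python
# Objective with an optional draw-rate penalty, as proposed above.
def score(w, l, d, draw_weight=0.5, penalize_draws=False):
    f = d / (w + l + d) / 2 if penalize_draws else 0.0
    return w + draw_weight * d - f

# Plain w + 0.5*d: both results score the same (55.0).
print(score(20, 10, 70), score(15, 5, 80))
# With the f term: 54.65 vs 54.60, a slight preference for +20-10.
print(score(20, 10, 70, penalize_draws=True), score(15, 5, 80, penalize_draws=True))
```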
Thinking some more, I think there are 2 slightly different angles here: aggressive play against strong players and aggressive play against weaker players. The difference is mainly one of degree: against weaker players we can play definitely sub-par moves if the gain in winning chances is good enough, while against strong players we still need to play good moves even if we're trying to choose more aggressive ones. Currently I'm more interested in the aggression against strong players, e.g. in self-play, the aim being a more aggressive play style and a slightly lower draw rate. This has the advantage that we can just test against master; we don't need all the complication of testing against weaker players. I'm assuming this would work out well against weaker players anyway :-) Looking at rundb.py, line 983 does:
where X and Y are parameters to the test run - would that work? |
Issue opened to collect info about possible future SPSA improvements.
SPSA references
SPSA is a fairly simple algorithm to be used for local optimization (not global optimization).
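To make the issue self-contained, here is a minimal, hedged sketch of a plain SPSA iteration (gain schedules, Rademacher flips, two-sided gradient, ascent step); constants and names are illustrative and this is not the fishtest implementation:

```python
import random

# Minimal SPSA loop maximizing a noisy objective f over a parameter vector theta.
def spsa_maximize(f, theta, num_iter, a=0.1, c=0.1, alpha=0.602, gamma=0.101):
    A = 0.1 * num_iter                                   # stability constant
    for k in range(num_iter):
        a_k = a / (A + k + 1) ** alpha                   # step size
        c_k = c / (k + 1) ** gamma                       # perturbation size
        flips = [random.choice((-1, 1)) for _ in theta]  # Rademacher draws
        plus  = [t + c_k * s for t, s in zip(theta, flips)]
        minus = [t - c_k * s for t, s in zip(theta, flips)]
        g = (f(plus) - f(minus)) / (2.0 * c_k)           # common factor of the gradient estimate
        theta = [t + a_k * g * s for t, s in zip(theta, flips)]  # ascent step (1/flip == flip)
    return theta
```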
The wiki now has simple documentation explaining the SPSA implementation in fishtest.
Here is some other documentation:
SPSA implementation problems/improvements
SPSA testing process (aka Time Control)
EDIT_000
This paragraph is outdated; I kept it to avoid disrupting the chain of posts:
I suggest this process to optimize developer time and framework CPU.
I took an SPSA from fishtest and ran it locally changing only the TC; the results are similar:
https://tests.stockfishchess.org/tests/view/5e2dade6ab2d69d58394fb5e