Fix least squares #8859
Conversation
I updated the pull request message with the correctly formatted test scripts that replicate the problem. Basically, both Matlab and Octave give results different from Julia's when ill-conditioned matrices are involved in a least squares solve. As currently implemented, Julia's least squares solver loses half of machine accuracy and is equivalent to solving the least squares problem via the normal equations. It is not a big issue in most computations, but it is well known that a correctly implemented least squares solver should give full machine precision. The attached test scripts are constructed to be sensitive to this.

The SVD-based least squares solver has several bugs and should be fixed anyway. The order in which the U, S, and V matrices are applied is incorrect (it should be V, then S, and then U), which is equivalent to forming a kind of pseudoinverse.

References for methods applicable to testing least squares? Unfortunately, this issue is neglected in Golub and Van Loan (and other standard textbooks). The paper by Åke Björck, "Solving linear least squares problems by Gram-Schmidt orthogonalization", mentions that the discrepancy in the right hand side is better conditioned than the solution, and "Optimal Sensitivity Analysis [...]

As for the pseudoinverse fix, no good theory exists, but one should not use the standard Matlab implementation either. Too many digits are lost if all singular values are used, and the standard relations for the pseudoinverse are not satisfied. Restricting the singular value cutoff to the square root, or simply using the normal equations, is much better, although nobody seems to notice/care.
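For readers following along, here is a minimal sketch (mine, not the code in this PR) of an SVD-based least squares solve that applies the factors in the order described above, with a relative cutoff on the singular values. The function name and the 0.3-era `svdfact` indexing are assumptions for illustration only:

```julia
# Hypothetical sketch of an SVD-based least squares solve (not the PR's code).
function svd_lsq(A, b, rtol)
    F = svdfact(A)                    # thin SVD (Julia 0.3-era API)
    S = F[:S]
    k = sum(S .> rtol * S[1])         # numerical rank from a relative cutoff
    c = F[:U][:, 1:k]' * b            # apply Uᵀ to the right hand side
    c ./= S[1:k]                      # divide by the retained singular values
    return F[:V][:, 1:k] * c          # apply V to obtain the minimum-norm solution
end
```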
To summarize: [...]
Thanks for the comment and for paying attention to this. I'll have to defer to the local linear algebra experts as I have no idea whether any of that is right or not :-)
@zgimbutas Thank you for looking into this and for all the details. I'll have to take a closer look.
Okay, I have looked a bit more into this. I have no comments on the changes that are unrelated to the threshold values for rank determination; those changes are just right.

Regarding the threshold values, my concern is that you potentially blow up the error just to get a slightly smaller residual. In real least squares problems, the explained variable is measured with errors that are much larger than the error due to floating point arithmetic. Consider this example:

```julia
julia> A = [ones(3) [1e-15,0,0]]
3x2 Array{Float64,2}:
 1.0  1.0e-15
 1.0  0.0
 1.0  0.0

julia> b = randn(3,1)
3x1 Array{Float64,2}:
  1.82643
 -0.656788
  0.445374

julia> A_ldiv_B!(qrfact(A, pivot = true), copy(b), eps())
(
2x1 Array{Float64,2}:
 -0.105707
  1.93214e15,

2)

julia> A_ldiv_B!(qrfact(A, pivot = true), copy(b), sqrt(eps()))
(
2x1 Array{Float64,2}:
 0.538339
 1.79446e-16,

1)
```

so if the [...]

Regarding the choice of regularization, I believe that you are still doing that, just with a different regularization parameter. Do you disagree? I believe that MATLAB does some regularization, e.g.

```
>> [ones(3,1) [1e-16;0;0]]\randn(3,1)
Warning: Rank deficient, rank = 1, tol = 1.153778e-15.

ans =

    0.0376
    0
```

and [...]

To sum up, your proposed changes might be a good idea, but I'd like to hear your opinions on the issue mentioned above.
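To make the role of the threshold concrete, here is a rough sketch (my own, not Julia's actual implementation) of how a relative tolerance typically enters a pivoted-QR least squares solve: diagonal entries of R that fall below `rtol*|R[1,1]|` are treated as zero, and the resulting numerical rank decides how much of the factorization is used. The function name is hypothetical:

```julia
# Rough sketch of a rank-revealing pivoted-QR least squares solve (illustrative only).
function pivoted_qr_lsq(A, b, rtol)
    F = qrfact(A, pivot = true)                    # column-pivoted QR (Julia 0.3-era API)
    R = F[:R]
    k = sum(abs(diag(R)) .> rtol * abs(R[1, 1]))   # numerical rank from the threshold
    y = (F[:Q]' * b)[1:k]
    z = zeros(size(A, 2))
    z[1:k] = R[1:k, 1:k] \ y                       # solve the well-conditioned leading block
    return z[invperm(F[:p])], k                    # undo the column pivoting
end
```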
No, neither Matlab nor Octave performs regularization for least squares [...]. The choice of non-regularized least squares in both Matlab and Octave [...]. By the way, we noticed this issue in Julia while solving an [...].

Julia 0.3, before the fix:

```julia
julia> [ones(3,1) [1e-16;0;0]]\randn(3,1)
julia> [ones(3,1) [1e-15;0;0]]\randn(3,1)
julia> [ones(3,1) [1e-8;0;0]]\randn(3,1)
julia> [ones(3,1) [1e-7;0;0]]\randn(3,1)
```

Julia 0.3, after the fix:

```julia
julia> [ones(3,1) [1e-16;0;0]]\randn(3,1)
julia> [ones(3,1) [1e-15;0;0]]\randn(3,1)
```

Octave 3.4.3:

```
octave:1> [ones(3,1) [1e-16;0;0]]\randn(3,1)
octave:2> [ones(3,1) [1e-15;0;0]]\randn(3,1)
-2.5627e-03
```

Matlab R2012a:

```
matlab>> [ones(3,1) [1e-16;0;0]]\randn(3,1)
ans =
matlab>> [ones(3,1) [1e-15;0;0]]\randn(3,1)
ans =
```
Why is it not regularization when the threshold parameter is set to eps()?
The goal of least squares is to find a set of parameters of a linear model that fits the data best, i.e. we attempt to minimize the residual norm ||Ax-b||_2 (this is the standard definition of the least squares method). This quantity is bounded by the machine precision for a user-selected floating point arithmetic model, so setting the threshold parameter to eps() leads to non-regularized least squares for all practical purposes. The best match/smallest residual is achieved if the right hand side is in the range of A, which is to say that we should be able to match well-defined data accurately.

Regularized least squares attempts to minimize the residual norm AND the norm of the solution simultaneously, i.e. we minimize ||Ax-b||_2 + \lambda ||x||_2. Setting the threshold parameter to sqrt(eps()) limits the norm of the solution ||x||_2, but at the cost of ||Ax-b||_2 now being bounded from below by the square root of machine precision only.

One more note: in the test where A = [ones(3) [1e-16,0,0]], b = randn(3,1), the right hand side is definitely not in the range of A, and the norm of the residual ||Ax-b||_2 is of order O(1) no matter what kind of least squares, regularized or not, is used (.73 and .42). Both solutions are equally bad and unusable, so this test does not tell us much about the quality of the underlying solvers.
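A minimal sketch of the case being described (b exactly in the range of A), using the three-argument A_ldiv_B! shown earlier in the thread; the particular matrix, x0, and the helper variable names are mine:

```julia
# Illustrative only: compare residuals of the two threshold choices when b lies in the range of A.
A  = [ones(3) [1e-15, 0, 0]]
x0 = ones(2, 1)
b  = A * x0                                                       # b is in the range of A

xe = A_ldiv_B!(qrfact(A, pivot = true), copy(b), eps())[1]        # effectively no truncation
xs = A_ldiv_B!(qrfact(A, pivot = true), copy(b), sqrt(eps()))[1]  # rank-1 truncation

norm(A * xe - b)   # should sit at roundoff level
norm(A * xs - b)   # bounded below by whatever the discarded direction contributed to b
```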
I think that your use of least squares is non-standard. Usually, [...] It might be that setting the tolerance to [...]

Because of this issue, I have spent some time reading up on the choice of tolerance, mainly Golub and Van Loan, Åke Björck's "Numerical Methods for Least Squares Problems", and G. W. Stewart's "Rank degeneracy". My conclusion so far is that it is very hard to give a general answer to what the right value should be, as it depends critically on the assumptions you make about the errors in your data. Your example is also almost the definition of a problem where you cannot get a good answer, because the singular values decay very regularly. However, even in this case it might be a good idea to apply some regularization, as the following examples show. I have fixed [...]
In the [...]

There are a couple of open questions. JuliaLang/LinearAlgebra.jl#59, which @jiahao linked to, is one of them. It would be great to have an easier way to specify the tolerance for specific applications. I have just pushed a couple of changes to both [...]
Actually, it all depends on the accuracy of the right hand side b. In [...]
Agreed, some very mild regularization by setting the tolerance [...]
Unfortunately, the standard texts are not very helpful. The best [...]
I would like to disagree here, and, I think, this is the most [...] This forward stability property is currently violated in Julia [...] Matlab's implementation of \ does not have this issue (and neither [...]). It is interesting to note that the sqrt(eps()) truncation for ldiv [...]
The matrix division operator is so common in scripts that [...]
And sorry for my use of terminology for forward and backward stability; in the linear algebra community these terms are reversed.
I caught up with the discussion with @andreasnoack today.

@zgimbutas's claim that "it is well-known that a correctly implemented least squares solver should give full machine precision" does not agree with my reading of the references provided. The statement implies that the error in the correctly computed least squares solution in finite precision arithmetic can never exceed machine epsilon, and therefore does not depend on any information about the particular least squares problem. On the contrary, Björck, 1967 (doi:10.1007/BF01934122) gives in (8.19) error bounds on the computed residuals and on the computed errors, both of which clearly scale as [...]
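For context, the standard first-order sensitivity result for least squares (as found in e.g. Golub and Van Loan; this is my paraphrase from memory and may differ in constants and exact form from Björck's (8.19)) says roughly that

```latex
\frac{\|\hat{x} - x\|_2}{\|x\|_2} \;\lesssim\;
  \varepsilon \left( \kappa_2(A) + \kappa_2(A)^2 \,\frac{\|r\|_2}{\|A\|_2\,\|x\|_2} \right),
\qquad r = b - A x,
```

while the corresponding bound on the computed residual grows only like kappa_2(A)*eps rather than kappa_2(A)^2*eps; either way, the bounds depend on the conditioning of the particular problem.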
There is no need to change a single line of base Julia; it already wraps both gelsy and gelsd (here a, b, and x0 are the test matrix, right hand side, and reference solution, and residy, residd, errory, errord are empty Float64 vectors collecting the results):

```julia
julia> for i=20:60
           xy, ry = Base.LinAlg.LAPACK.gelsy!(copy(a), copy(b), 2.0^-i)
           xd, rd = Base.LinAlg.LAPACK.gelsd!(copy(a), copy(b), 2.0^-i)
           push!(residy, norm(a*xy-b))
           push!(residd, norm(a*xd-b))
           push!(errory, norm(xy-x0))
           push!(errord, norm(xd-x0))
       end
```

which can be plotted as:

```julia
using Gadfly
plot(layer(x=map(x->2.0^-x, 20:60), y=residy, Geom.line, Theme(default_color=color("red"))),
     layer(x=map(x->2.0^-x, 20:60), y=errory, Geom.line, Theme(default_color=color("red"))),
     layer(x=map(x->2.0^-x, 20:60), y=residd, Geom.line),
     layer(x=map(x->2.0^-x, 20:60), y=errord, Geom.line),
     layer(xintercept=[eps(), sqrt(eps())], Geom.vline(color=color("orange"))),
     Guide.xlabel("Tolerance"), Guide.ylabel("Error or residual"), Scale.x_log2, Scale.y_log2,
)
```

(red = DGELSD, blue = DGELSY, orange = eps() and sqrt(eps()); errors are the top lines, residuals are the bottom lines)

showing that, despite the appeal to higher authority, [...]

This graph actually shows why this discussion seems to be going at cross-purposes. If you set the threshold to [...]

Oddly enough [...]
Ok, it looks like there is more than numerology going on here. If we make the approximation that β is its upper bound (the unnumbered equation at the very end of Section 6 of Björck, 1967), then [...]

For the problem at hand (n=100), the rcond threshold is [...]

I could not see a similar trick to derive a heuristic condition number threshold that minimizes the error and does not involve the norms of A, b, x, and r.

tl;dr: there is a very simple approximation for the rcond threshold for least squares problems where you want to minimize the residual. For least squares problems where you want to minimize the error in the solution, I don't know of a simple heuristic.
Thank you very much for your systematic and thoughtful response. Although I agree that the least squares problem I posed, in which the right-hand side b is in the range of the matrix A, is somewhat atypical, the goal of a least squares problem min ||r||, where r = Ax-b, does not require a balance between this objective and the stability of the solution x. While in many applications it is desirable for x to be stable, that is achieved in practice, as necessary, by separate introduction of some variety of regularization. The form of regularization must depend on how these two objectives are to be balanced, which in turn depends on the particular problem. The pure least squares problem is simply to obtain an x that minimizes ||r||; none of the literature is ambiguous about this objective. Björck's error bounds are correct, of course, but are overly pessimistic when b is in the range of A. My earlier statement about obtaining full machine precision, which was intended to refer to this case, was insufficiently circumscribed.

I appreciate your evaluation of residuals and errors for the LAPACK routines gelsy and gelsd. I attempted to match them, with the definitions [...] and your code subsequently. (Did we do something differently?) For both your plot and mine, the residuals from gelsy and gelsd are similar, but to my eye gelsd (shown in blue) appears to have a modest edge in accuracy. My plot differs from yours in that the errors in the solution are nearly independent of the specified tolerance, and the residual begins to increase shortly after the tolerance increases beyond machine precision. In any case, setting the tolerance to machine epsilon is most effective at minimizing the residuals.
Yes, I had neglected to mention that my plot was for @andreasnoack's slightly modified problem with your [...]

The conclusion I draw from these numerical experiments and my readings is that we really should work more toward exposing [...]

Again to summarize: [...]
@andreasnoack has solicited advice from Per Christian Hansen on the problem of picking a sensible default [...]
The minimum-residual problem in the L2 norm is only one possible application of least squares. Other problems that can also be solved by least squares can be formulated, such as problems that minimize errors, or problems that minimize both errors and residuals.
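As one concrete, purely illustrative instance of the "minimize both errors and residuals" family, a ridge (Tikhonov) solve trades residual against solution size through a parameter lambda; neither the helper name nor the parameter is part of this PR:

```julia
# Illustrative ridge (Tikhonov) solve: minimizes ||A*x - b||^2 + lambda*||x||^2.
# (Formed via the normal equations for brevity; a QR/SVD formulation is better conditioned.)
ridge(A, b, lambda) = (A'A + lambda * eye(size(A, 2))) \ (A'b)
```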
Thank you, I have replicated the first set of plots by setting x0=ones(n) and b=A*x0. x0=ones(n) projects well onto the right singular vectors of A associated with the largest singular values, which would explain the faster convergence of the residuals with respect to the tolerance parameter. The test solution x0=randn(n) activates and tests all singular vectors (x0=cos(c*ones(n)), for c in [0,1], gives something in between).
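One way to see this distinction (a quick check of my own, assuming the A and n from the experiments above) is to look at the coefficients of a trial x0 in the basis of right singular vectors:

```julia
# Coefficients of x0 in the right-singular-vector basis; a vector that loads mostly
# on the leading components corresponds to the "easy" case described above.
U, S, V = svd(A)
coeffs_ones  = V' * ones(n)
coeffs_randn = V' * randn(n)
```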
A few more answers below. As I wrote earlier, [...]

Yes. In general, I really think it is, but this could of course be different in special applications.

The pure least squares problem is stated in exact arithmetic, but we have to deal with floating point arithmetic. As I said above, I think the goal is to get a good solution, i.e. a solution close to the infinite precision least squares solution. We want a [...]

By reverse engineering I have just found it to be m*eps(), which is [...]

Changing behaviour depending on prefactorization is pretty much our design choice. E.g. in the square case you can solve a system with [...]
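For readers unfamiliar with the prefactorization pattern being referred to, the general idea (a sketch, not the specific code under discussion; A, b1, b2 are placeholders) is that factoring once and reusing the factorization object can legitimately behave differently from the one-shot backslash:

```julia
# Factor once, reuse for several right-hand sides; the factorization object can
# carry its own solve policy (e.g. a tolerance) independently of plain `A \ b`.
F  = qrfact(A, pivot = true)
x1 = F \ b1
x2 = F \ b2
```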
I see, so it would be really nice to find a middle ground and keep accuracy in the residuals while providing as good and robust a solution as possible, without introducing unnecessary numerical errors due to truncation below the effective noise floor (effectively tracking the numerical rank). I completely missed the rank argument, and, of course, it should be respected by the scheme.

For this use case, I found the following interesting discussion on the numpy discussion board. They were facing the same problem of determining a reasonable threshold value for general use within the numpy package and had a long discussion on this subject, with references to Golub and Van Loan, Matlab, Numerical Recipes, etc.:

http://github.com/numpy/numpy/pull/357
http://www.mail-archive.com/[email protected]/msg37819.html

Their decision was to go with a conservative Matlab-type estimate eps*max(m,n). This type of estimate is currently used in Julia to determine the numerical rank, the null space, and for the SVD-based least squares, so it might be a good consistent choice to make (see also http://www.mathworks.com/help/matlab/ref/rank.html). One can also make a strong argument from a statistical point of view for the tolerance threshold eps/2*sqrt(m+n+1), or something like that; again see http://www.mail-archive.com/[email protected]/msg37819.html, and [...]

On average, this threshold should give the best results, but, if robustness is required, it might not be appropriate. And I dare not argue for an eps threshold in this use case. It would be interesting to investigate what the optimal threshold parameter might be for a fixed right hand side. Thanks for pointing out that this is a somewhat subtle issue.
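To make the two conventions concrete, here is a small sketch of my own (the random matrix and the scaling by the largest singular value are illustrative assumptions) of how the MATLAB-style and the statistical thresholds mentioned above could be computed and used for rank determination:

```julia
# Illustrative comparison of the two cutoff conventions discussed above.
A = randn(100, 50)
S = svdvals(A)

tol_matlab = maximum(size(A)) * eps() * maximum(S)          # conservative eps*max(m,n) style
tol_stat   = eps()/2 * sqrt(sum(size(A)) + 1) * maximum(S)  # the eps/2*sqrt(m+n+1) variant

rank_matlab = sum(S .> tol_matlab)
rank_stat   = sum(S .> tol_stat)
```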
When updating the pull request, could you please also squash the commits?
The present tolerance is larger than necessary in the implementation of the ldiv operator, which can lead to loss of accuracy. Set the default tolerance parameter to be compatible with the conservative MATLAB estimate for determining the numerical rank of A, i.e., tol = eps()*maximum(size(A)). In addition, fix the implementation of the SVD-based least squares, and increase the default tolerance parameter for the Moore–Penrose pseudoinverse, which cannot be determined to full machine precision for ill-conditioned matrices.
I agree, going with the MATLAB convention seems to be a very reasonable compromise. It is slightly suboptimal (I am losing a digit or so in my experiments compared with MATLAB/Octave, which is ok), but it is robust and consistent with the current numerical rank convention in Julia. I updated the pull request, incorporating the above convention for the ldiv operator as follows: maximum(size(A))*eps(real(float(one(eltype(B))))). I also squashed the extra commits and force-pushed [...]

And thanks for the interesting discussion; it appears that this issue needs more careful investigation in the long run...
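For reference, the quoted default evaluates as follows for a Float64 problem (the 100×50 size is just an example of mine):

```julia
# eps(one(Float64)) == 2.220446049250313e-16, so for a 100x50 Float64 problem
# the default tolerance below comes out to roughly 2.2e-14.
A = randn(100, 50)
B = randn(100)
tol = maximum(size(A)) * eps(real(float(one(eltype(B)))))
```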
Because of the changes I made to separate the handling of pivoted and non-pivoted QR in the least squares solver, your pull request had to be rebased on top of master and the conflicts resolved. I've just done that and pushed it to master, to save you the work (you are still the author of the commit, though). Hence I close this now. Could you try to use the new threshold and call [...]

I also learned a lot from the conversation, so please continue to report issues and send fixes.
Adds discussion and references useful for following #8859 [av skip]
The least squares routines lose accuracy in Julia 0.3. This branch applies a set of simple fixes to address this issue. We test the implementation of least squares by forming an ill-conditioned operator A, a right hand side b in the range of A, and finding a least squares solution x. The error norm(A*x-b) should then be within machine precision, even for ill-conditioned matrices.
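A minimal sketch of the kind of test described (the particular construction of the ill-conditioned matrix and the 100×50 size are illustrative choices of mine, not the PR's actual test script):

```julia
# Build an ill-conditioned A, a right hand side in its range, and check the residual.
m, n = 100, 50
U = qr(randn(m, n))[1]                     # m×n with orthonormal columns
V = qr(randn(n, n))[1]
A = U * diagm(logspace(0, -12, n)) * V'    # singular values from 1 down to 1e-12
x0 = randn(n)
b  = A * x0                                # b lies in the range of A
x  = A \ b                                 # least squares solve under test
norm(A * x - b)                            # should be near eps(), despite cond(A) ≈ 1e12
```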
Also, the SVD-based least squares code fails completely for non-square matrices in Julia 0.3.
Note that the pseudoinverse is not precise in Matlab/Octave either. The above test should be accurate to the square root of machine precision in this case.
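The corresponding pseudoinverse check, as a sketch (reusing the illustrative A and b constructed above):

```julia
# With pinv, the note above suggests expecting accuracy only at the sqrt(eps()) level.
x_pinv = pinv(A) * b
norm(A * x_pinv - b)
```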
The test scripts and results for Julia 0.3, Octave 3.4.3, and Matlab R2012a follow:
Julia 0.3
After fixes:
Julia 0.3
Before fixes:
Octave 3.4.3
Matlab R2012a
[jiahao: formatting code blocks]