
Fix least squares #8859

Closed
wants to merge 1 commit into from

Conversation

zgimbutas
Contributor

The least squares routines lose accuracy in Julia 0.3. This branch
applies a set of simple fixes to address the issue. We test the least
squares implementation by forming an ill-conditioned operator A, a right
hand side b in the range of A, and finding a least squares solution x.
The error norm(a*x - b) should then be within machine precision, even
for ill-conditioned matrices.

Also, the SVD-based least squares code fails completely for
non-square matrices in Julia 0.3.

Note that the pseudoinverse is not precise in Matlab/Octave either.
The above test should be accurate to the square root of machine precision
in this case.

The test scripts and results for Julia 0.3, Octave 3.4.3, and Matlab R2012a follow:


Julia 0.3

After fixes:

julia> include("test_rdiv_lsquares.jl")
6.602160352047045e-14

julia> include("test_svd_lsquares.jl")
6.533966983681165e-14

julia> include("test_pinv.jl")
9.439277699982046e-7

Julia 0.3

Before fixes:

julia> include("test_rdiv_lsquares.jl")
8.788416928121552e-7

julia> include("test_svd_lsquares.jl")
ERROR: DimensionMismatch("assigned 100 elements to length 13 destination")
 in assign_bool_vector_1d! at array.jl:382
 in \ at linalg/factorization.jl:573
 in include at ./boot.jl:242
 in include_from_node1 at ./loading.jl:128
while loading /local/tmp/zng2/github-contrib/julia/test_svd_lsquares.jl, in expression starting on line 25

julia> include("test_pinv.jl")
0.00035320238415246054

Octave 3.4.3

octave> test_rdiv_lsquares
ans =  3.8725e-14

octave> test_svd_lsquares
ans =  3.9309e-14

octave> test_pinv
ans =  4.3197e-04

Matlab R2012a

matlab> test_rdiv_lsquares
ans =

   1.0821e-13

matlab> test_svd_lsquares
ans =

   1.0564e-13

matlab> test_pinv
ans =

   2.0777e-05
## === test_rdiv_lsquares.jl ===

function hilb(n::Integer)
    a = Array(typeof(1.0), n, n)
    for i = 1:n
        for j = 1:n
            a[j,i] = 1.0/(i+j-1.0)
        end
    end
    return a
end

m = 1000
n = 100

# form an ill-conditioned matrix
a = randn(m,n) * hilb(n);

# create a right hand side in the range of a
x0 = randn(n);
b = a*x0;

# find the least squares solution
x = a\b;

# the error in rhs should be within machine precision
norm(a*x - b)

%% === test_rdiv_lsquares.m ===

m = 1000
n = 100

% form an ill-conditioned matrix
a = randn(m,n) * hilb(n);

% create a right hand side in the range of a
x0 = randn(n,1);
b = a*x0;

% find the least squares solution
x = a\b;

% the error in rhs should be within machine precision
norm(a*x - b)

## === test_svd_lsquares.jl ===

function hilb(n::Integer)
    a = Array(typeof(1.0), n, n)
    for i = 1:n
        for j = 1:n
            a[j,i] = 1.0/(i+j-1.0)
        end
    end
    return a
end

m = 1000
n = 100

# form an ill-conditioned matrix
a = randn(m,n) * hilb(n);

# create a right hand side in the range of a
x0 = randn(n);
b = a*x0;

# find the least squares solution
x = svdfact(a)\b;

# the error in rhs should be within machine precision
norm(a*x - b)

%% === test_svd_lsquares.m ===

m = 1000
n = 100

% form an ill-conditioned matrix
a = randn(m,n) * hilb(n);

% create a right hand side in the range of a
x0 = randn(n,1);
b = a*x0;

% find the least squares solution
[u,s,v] = svd(a,0);
s = diag(s);
k = sum(s > eps*s(1))
s = diag(s);

s = s(1:k,1:k);
u = u(:,1:k);
v = v(:,1:k);
x = v*(s^(-1)*(u'*b));

% the error in rhs should be within machine precision
norm(a*x - b)

## === test_pinv.jl ===

function hilb(n::Integer)
    a = Array(typeof(1.0), n, n)
    for i = 1:n
        for j = 1:n
            a[j,i] = 1.0/(i+j-1.0)
        end
    end
    return a
end

m = 1000
n = 100

# form an ill-conditioned matrix
a = randn(m,n) * hilb(n);

# create a right hand side in the range of a
x0 = randn(n);
b = a*x0;

# find the pseudo-inverse solution
x = pinv(a)*b;

# the error in rhs should be within the square root of machine precision
norm(a*x - b)

%% === test_pinv.m ===

m = 1000
n = 100

% form an ill-conditioned matrix
a = randn(m,n) * hilb(n);

% create a right hand side in the range of a
x0 = randn(n,1);
b = a*x0;

% find the pseudo-inverse solution
x = pinv(a)*b;

% the error in rhs should be within the square root of machine precision
norm(a*x - b)

[jiahao: formatting code blocks]

@StefanKarpinski
Member

cc @andreasnoack, @jiahao, @alanedelman

@jiahao
Member

jiahao commented Oct 30, 2014

Ref: JuliaLang/LinearAlgebra.jl#59

@zgimbutas
Contributor Author

I updated the pull request message with the correctly formatted test scripts that replicate the problem.

Basically, both Matlab and Octave give results different from Julia's when ill-conditioned matrices are involved in the least squares solve. As currently implemented, Julia's least squares solver loses half of machine accuracy and is equivalent to solving the least squares problem via the normal equations. It is not a big issue in most computations, but it is well-known that a correctly implemented least squares solver should give full machine precision. The attached test scripts are constructed to be sensitive to this.
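
To make the loss concrete, here is a small sketch (not part of this patch; written in the same Julia 0.3 style as the test scripts above, with a synthetic matrix whose ~1e6 condition number is chosen purely for illustration) comparing a plain thin-QR least squares solve with the normal equations:

# Sketch only: plain thin-QR solve vs. normal equations on a matrix
# with condition number ~1e6 and b in its range.
m, n = 200, 50
u, _ = qr(randn(m, n))                    # orthonormal basis for the range
v, _ = qr(randn(n, n))
a = u * diagm(logspace(0, -6, n)) * v'    # singular values 1 .. 1e-6

x0 = randn(n)
b  = a*x0                                 # right hand side in the range of a

q, r = qr(a)
x_qr = r \ (q'*b)                         # plain QR solve
x_ne = (a'*a) \ (a'*b)                    # normal equations

println(norm(a*x_qr - b))                 # close to machine precision
println(norm(a*x_ne - b))                 # typically ~cond(a)*eps(), orders of magnitude larger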

The SVD-based least squares solver has several bugs and should be fixed anyway. The order in which the U, S, and V matrices are applied is incorrect (it should be V, then S, then U), which is equivalent to forming a kind of pseudoinverse.
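
For reference, a minimal sketch of the intended order of application (this mirrors the test_svd_lsquares.m script above; the rank cutoff eps()*s[1] and the toy sizes are illustrative only):

# Sketch: SVD-based least squares, applying U', then the inverted
# singular values, then V, on a small self-contained toy problem.
a = randn(10, 4)
b = a*randn(4)

u, s, v = svd(a)                  # thin SVD: a == u*diagm(s)*v'
k = sum(s .> eps()*s[1])          # numerical rank (illustrative cutoff)
x = v[:, 1:k] * ((u[:, 1:k]'*b) ./ s[1:k])
println(norm(a*x - b))            # should be near machine precision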

References for methods applicable to testing least squares? Unfortunately, this issue is neglected in Golub and Van Loan (and other standard textbooks). The paper by Åke Björck, "Solving linear least squares problems by Gram-Schmidt orthogonalization", mentions that the discrepancy in the right hand side is better conditioned than the solution, and "Optimal Sensitivity Analysis of Linear Least Squares" by Joseph F. Grcar, Lawrence Berkeley National Laboratory Report LBNL-52434, investigates this method in more detail.

As for the pseudoinverse fix, no good theory exists, but one should not use the standard Matlab implementation either. Too many digits are lost if all singular values are used, and the standard relations for the pseudoinverse are not satisfied. Restricting the singular value cutoff to the square root of machine precision, or simply using the normal equations, is much better, although nobody seems to notice/care.

@zgimbutas
Contributor Author

To summarize:

  1. The ldiv and rdiv operators are regularized but they shouldn't be. In Matlab and Octave they are not regularized, and neither are the standard least squares solvers in LAPACK.

  2. The SVD-based rdiv operator is regularized and contains several implementation bugs. It should not be regularized and should be compatible with rdiv.

  3. pinv (the pseudoinverse) is NOT regularized but it should be. In Matlab and Octave, pinv is not regularized, so for now one can drop this part if compatibility with Matlab is desired. It would still be nice to give the pinv routine well-defined behavior, namely, to make it as good as the normal equations solver (a truncated-SVD sketch follows below).
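
A minimal sketch of the kind of truncated pseudoinverse meant in point 3 (the helper name trunc_pinv and the two example tolerances are illustrative; this is not what the pull request changes verbatim):

# Sketch: truncated-SVD pseudoinverse with an explicit relative tolerance.
function trunc_pinv(a::Matrix{Float64}, tol::Float64)
    u, s, v = svd(a)
    k = sum(s .> tol*s[1])        # drop singular values below tol*s[1]
    return v[:, 1:k] * diagm(1.0 ./ s[1:k]) * u[:, 1:k]'
end

# trunc_pinv(a, eps())            # Matlab/Octave-like, essentially no truncation
# trunc_pinv(a, sqrt(eps()))      # heavier truncation, as discussed for pinv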

@StefanKarpinski
Member

Thanks for the comment and for paying attention to this. I'll have to defer to the local linear algebra experts as I have no idea whether any of that is right or not :-)

@andreasnoack
Member

@zgimbutas Thank you for looking into this and for all the details. I'll have to take a closer look.

@andreasnoack
Member

Okay. I have looked a bit more into this. I don't have any comments on the changes that are not related to the threshold values for rank determination. Those changes are just right.

Regarding the threshold values, my concern is that you potentially blow up the error just to get a slightly smaller residual. In real-world least squares problems, the explained variable is measured with errors that are much larger than the error due to floating point arithmetic. Consider this example

julia> A = [ones(3) [1e-15,0,0]]
3x2 Array{Float64,2}:
 1.0  1.0e-15
 1.0  0.0    
 1.0  0.0    

julia> b = randn(3,1)
3x1 Array{Float64,2}:
  1.82643 
 -0.656788
  0.445374

julia> A_ldiv_B!(qrfact(A, pivot = true), copy(b), eps())
(
2x1 Array{Float64,2}:
 -0.105707  
  1.93214e15,

2)

julia> A_ldiv_B!(qrfact(A, pivot = true), copy(b), sqrt(eps()))
(
2x1 Array{Float64,2}:
 0.538339   
 1.79446e-16,

1)

so if the 1e-15 is just an error then the solution gets completely ruined, even though it has the smallest residual. The problem is to choose the right value. There is a trade-off, and I think the right choice very much depends on your assumptions about the data you are considering. I guess that you and LAPACK choose eps() because the error from the QR is of that order? Is that right?

Regarding the choice of regularization, I believe that you are still doing that, but just with a different regularization parameter. Do you disagree? I believe that MATLAB does some regularization, e.g.

>> [ones(3,1) [1e-16;0;0]]\randn(3,1)
Warning: Rank deficient, rank = 1, tol =  1.153778e-15. 

ans =

    0.0376
         0

and xGELSY, xGELSS and xGELSD all have an input parameter for the rank threshold, so I believe that LAPACK is not completely against that idea either.

To sum up, your proposed changes might be a good idea, but I'd like to hear your opinions on the issue mentioned above.

@zgimbutas
Contributor Author

No, neither Matlab nor Octave performs regularization for least
squares, i.e. the threshold parameter is set to eps(). And, since the
right hand side is effectively not in the range of the system matrix,
the solution must grow in order to match the right hand side. The
solution growth is also visible in the current implementation of Julia,
but at a different threshold value, sqrt(eps()). So the solution
also gets ruined, except not so badly, and the right hand side is
never matched to full double precision accuracy (we are effectively
discarding a part of the operator due to small singular values).

The choice of non-regularized least squares in both Matlab and Octave
allows the solution to grow a bit more (up to a factor of 1/eps(), as
predicted by the theory), while preserving accuracy in matching the
right hand side.

By the way, we noticed this issue in Julia while solving an
optimization problem: we were losing accuracy and couldn't converge to
full double precision accuracy. We did not have this behavior in our
Matlab codes. The problem was in the least squares part, namely the
backslash \ operator (commonly used in the scripts) which was ignoring
a part of the operator due to small but still significant singular values.


Julia 0.3, before fix:

julia> [ones(3,1) [1e-16;0;0]]\randn(3,1)
2x1 Array{Float64,2}:
0.538842
1.79614e-17

julia> [ones(3,1) [1e-15;0;0]]\randn(3,1)
2x1 Array{Float64,2}:
-1.01696
-3.38985e-16

julia> [ones(3,1) [1e-8;0;0]]\randn(3,1)
2x1 Array{Float64,2}:
-0.487846
-1.62615e-9

julia> [ones(3,1) [1e-7;0;0]]\randn(3,1)
2x1 Array{Float64,2}:
-0.737966
1.16392e7


Julia 0.3, after fix:

julia> [ones(3,1) [1e-16;0;0]]\randn(3,1)
2x1 Array{Float64,2}:
-0.0589392
-1.96464e-18

julia> [ones(3,1) [1e-15;0;0]]\randn(3,1)
2x1 Array{Float64,2}:
-0.0191322
-1.65799e15


Octave 3.4.3

octave:1> [ones(3,1) [1e-16;0;0]]\randn(3,1)
ans =

4.7043e-01
1.5681e-17

octave:2> [ones(3,1) [1e-15;0;0]]\randn(3,1)
ans =

-2.5627e-03
-9.1269e+14


Matlab R2012a

matlab>> [ones(3,1) [1e-16;0;0]]\randn(3,1)
Warning: Rank deficient, rank = 1, tol = 6.661338e-16.

ans =

 0.0376
      0

matlab>> [ones(3,1) [1e-15;0;0]]\randn(3,1)

ans =

1.0e+15 *

-0.0000
 1.3566

@andreasnoack
Member

Why is it not regularization when the threshold parameter is set to eps()? E.g.

julia> A = [ones(3) [1e-16,0,0]]
3x2 Array{Float64,2}:
 1.0  1.0e-16
 1.0  0.0    
 1.0  0.0    

julia> b = randn(3,1);

julia> norm(b - A*A_ldiv_B!(qrfact(A, pivot = true), copy(b), eps())[1])
0.7343268945942051

julia> norm(b - A*(qrfact(A) |> t -> t[:R]\(t[:Q]'b)[1:2]))
0.42650444779616326

@zgimbutas
Contributor Author

The goal of least squares is to find a set of parameters of a linear model which fits the data best, i.e. we attempt to minimize the residual norm ||Ax-b||_2 (this is the standard definition of the least squares method). This quantity is bounded by machine precision for a user-selected floating point arithmetic model, so setting the threshold parameter to eps() leads to non-regularized least squares for all practical purposes. The best match/smallest residual is achieved if the right hand side is in the range of A, which is to say that we should be able to match well-defined data accurately.
In this standard definition we make no assumptions on the norm of the solution ||x||_2; this quantity can be arbitrarily big or small, we just want to minimize the residual.

Regularized least squares attempts to minimize the residual norm AND the norm of the solution simultaneously, i.e. we minimize ||Ax-b||_2 + \lambda ||x||_2. Setting the threshold parameter to sqrt(eps()) limits the norm of the solution ||x||_2, but at the cost of ||Ax-b||_2 now being bounded from below only by the square root of machine precision.
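
A small sketch of that distinction (the helper tikhonov and the parameter lambda are illustrative; neither this pull request nor Julia's \ works this way):

# Sketch: Tikhonov-regularized least squares via SVD filter factors
# sigma/(sigma^2 + lambda).  lambda = 0 gives plain least squares for a
# full-rank matrix; truncation at a threshold instead zeroes the factors
# belonging to small singular values.
function tikhonov(a, b, lambda)
    u, s, v = svd(a)
    f = s ./ (s.^2 .+ lambda)
    return v * (f .* (u'*b))
end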

One more note: in the test where A = [ones(3) [1e-16,0,0]], b = randn(3,1), the right hand side is definitely not in the range of A, and the norm of the residual ||Ax-b||_2 is of order O(1), no matter what kind of least squares, regularized or not, is used (.73 and .42). Both solutions are equally bad and unusable, so this test does not tell us much about the quality of the underlying solvers.

@andreasnoack
Member

I think that your use of least squares is non-standard. Usually, b would not be in the range of A when solving a least squares problem. Hence, I don't think the examples I gave are irrelevant.

It might be that setting the tolerance to eps() is the same as no regularization when b is in the range of A, but generally there will be regularization whenever the tolerance is larger than zero. My example clearly showed the regularization taking place.

Because of this issue, I have spent some time reading up on the choice of tolerance, mainly Golub and Van Loan, Åke Björck's "Numerical Methods for Least Squares Problems" and G. W. Stewart's "Rank degeneracy". My conclusion so far is that it is very hard to give a general answer as to what the right value should be, as it depends critically on the assumptions you make about the errors in your data.

Your example is also almost the definition of a problem where you cannot get a good answer, because the singular values decay very regularly. However, even in this case it might be a good idea to apply some regularization, as the following examples show. I have fixed A, drawn in the same way as you did, then drawn different xs and calculated b in high precision to avoid errors before the solver. Then I have plotted the errors as a function of the tolerance.

x0 = ones(100, 1);
b = float64(A*big(x0));

[plot: error vs. tolerance for x0 = ones(100, 1)]

x0 = randn(100, 1);
b = float64(A*big(x0));

[plot: error vs. tolerance for x0 = randn(100, 1)]

x0 = randn(100, 1) .+ 5;
b = float64(A*big(x0));

[plot: error vs. tolerance for x0 = randn(100, 1) .+ 5]

x0 = randn(100, 1) .+ 50;
b = float64(A*big(x0));

[plot: error vs. tolerance for x0 = randn(100, 1) .+ 50]

In the N(0,1) case, where a tolerance of eps() seems optimal, it should be noted that the y axis is very compressed.

There are a couple of open questions. JuliaLang/LinearAlgebra.jl#59, which @jiahao linked to, is one of them. It would be great to have an easier way to specify the tolerance for specific applications. pinv shouldn't be that hard, but for an infix operator it is not obvious how you should provide the tolerance.

I have just pushed a couple of changes to both master and release-0.3 which give you the option of avoiding regularization by using the unpivoted QR, i.e. qrfact(A)\b instead of just A\b. Maybe that is sufficient for your use of least squares. We can also consider adjusting the default tolerance in \, but I don't think it is right to set it to eps().
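
In other words, a usage sketch of the workaround just described (a and b are assumed to be the matrix and right hand side from the test scripts above):

F = qrfact(a)      # unpivoted QR factorization
x = F \ b          # least squares solve without the rank-revealing truncation of a\b
norm(a*x - b)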

@zgimbutas
Contributor Author

I think that your use of least squares is
non-standard. Usually, b would not be in the range of A when
solving a least squares problem. Hence, I don't think the
examples I gave are irrelevant.

Actually, it all depends on the accuracy of the right hand side b.
In my case, b was accurate to 14 digits or so, and I simply could
not match it, while I knew I could do so using Matlab or
Fortran; hence the investigation was launched. And you are
absolutely right (I am taking back part of my extreme judgement):
one can use a b that is not in the range of A, and the answer is
perfectly well defined for well-conditioned matrices.

It might be that setting the tolerance to eps() is the same as
no regularization when b is in the range of A, but generally
there will be regularization whenever the tolerance is larger
than zero. My example clearly showed the regularization taking
place.

Agree, some very mild regularization by setting the tolerance
parameter to the machine precision eps() is likely taking place
in Matlab. It definitely does not hurt much, but I don't have
access to the Matlab source code. The tolerance parameter is set
to 0.0 in Octave, which uses the xGELSD (divide and conquer SVD)
routines from LAPACK; things are a bit more complicated
internally there, gaining an extra digit or two in accuracy over
the MATLAB choice.

Because of this issue, I have spent some time reading up on
the choice of tolerance, mainly Golub and Van Loan, Åke
Björck's "Numerical Methods for Least Squares Problems" and
G. W. Stewart's "Rank degeneracy". My conclusion so far is
that it is very hard to give a general answer as to what the
right value should be, as it depends critically on the
assumptions you make about the errors in your data.

Unfortunately, the standard texts are not very helpful. The best
discussion is given in "Numerical Methods for Least Squares
Problems" by Åke Björck.

Your example is also almost the definition of a problem where
you cannot get a good answer, because the singular values decay
very regularly. However, even in this case it might be a good
idea to apply some regularization, as the following examples show.

I would like to disagree here, and, I think, this is the most
important thing to clarify. Basically, the question can be
formulated as follows: is getting the solution x right more
important than matching the right hand side b, and how do these
quantities relate? The answer to this question is, unfortunately,
obscured in the textbooks (only Åke Björck gets it right), but
one can state that the error in the solution x is k(A) times
bigger than the error (achievable by a least squares solver) in
the right hand side b. This indicates that for well-conditioned
matrices both the solution and the right hand side can be
approximated equally well. For ill-conditioned matrices, the
situation is dramatically different. We cannot talk about
backward stability of least squares; the solution x is simply not
well-defined (see "Numerical Methods for Least Squares Problems"
by Åke Björck). We can, however, talk about approximation of the
right hand side b, and in a properly implemented least squares
solver the right hand side can be matched with the same
precision it was initially given (the so-called forward stability
property). As an example, for extremely ill-conditioned matrices
with k(A) = 1/eps(), if b is not in the range of A (i.e., it is
not aligned with the left singular vectors of A), then the
discrepancy in the right hand side is O(1), but the solution
blows up like k(A), which is reasonable. And if b is in the range
of A (i.e., it is aligned with the left singular vectors of A),
then we have the following interesting behavior: the discrepancy
in the right hand side is O(eps()), but the error in the solution
is now of O(1). The solution seems extremely noisy but it DOES
carry all information about the right hand side b to full
precision accuracy!

This forward stability property is currently violated in Julia
due to the overly harsh truncation at sqrt(eps()). Currently, the
solution carries information about b to single precision only.

The Matlab implementation of \ does not have this issue (and
neither do Octave or LAPACK), so if Julia wants to compete with
these systems in the long term, it should fix this behavior.
Truncating to sqrt(eps()) is just wrong (there, I said it;
usually I am very non-confrontational). Setting the truncation
parameter to eps() was a quick and reasonably good workaround,
which was easy to set up as a pull request. Switching to the
xGELSD (divide and conquer SVD) routines from LAPACK, gaining an
extra digit or two, would be ideal, but it would have required
extensive rewriting of the Julia base libraries. By the way, the
xGELSD routines were written by Ming Gu, who is the leading
expert in this area (and, I think, you should trust him).

It is interesting to note that the sqrt(eps()) truncation for
ldiv and rdiv was introduced around the summer of 2013 (if I am
not mistaken); before that it was compatible with Matlab and the
right division operator was perfectly usable. Was this decision
made due to some bug report or user request, or was it just an
attempt to control the unavoidable growth of the solution for
ill-conditioned cases?

The option of avoiding regularization by using the unpivoted
QR, i.e. qrfact(A)\b instead of just A\b

The matrix division operator is so common in scripts that it
is more natural to write just A\b and expect the system to do the
rest. In my opinion, this is not a good idea, since the accuracy
will change depending only on the fact that we prefactorize the
matrix A.

@zgimbutas
Contributor Author

And sorry for my use of the terminology for forward and backward stability; in the linear algebra community these terms are reversed.

@jiahao
Member

jiahao commented Nov 3, 2014

I caught up with the discussion with @andreasnoack today.

@zgimbutas's claim that "it is well-known that a correctly implemented least squares solver should give full machine precision" does not agree with my reading of the references provided. The statement implies that the error in the correctly computed least squares solution in finite precision arithmetic can never exceed machine epsilon, and therefore does not depend on any information on the particular least-squares problem. On the contrary, Björck, 1967 (doi:10.1007/BF01934122) gives in (8.19) the error bounds on the computed residuals and the computed errors as:

[screenshots: Björck (1967), eq. (8.19), error bounds for the computed residual and the computed error]

both of which clearly scale as f(n) times machine epsilon. The error bound simplifies for @zgimbutas's rather unusual least squares problem (because the true residual r=0), which weakens the dependence on the condition number of A, but nevertheless f(n) is demonstrably not constant over the number of columns n.

xGELSD (divide and conquer SVD) routines from LAPACK gaining an extra digit or two would be ideal, but it would have required extensive rewriting of Julia base libraries. By the way, xGELSD routines are written by Ming Gu, who is the leading expert in this area (and, I think, you should trust him).

There is no need to change a single line of base Julia; it already wraps both Base.LinAlg.LAPACK.gelsy! (our default) and Base.LinAlg.LAPACK.gelsd!, so you can very easily do

julia> residy, residd, errory, errord = Float64[], Float64[], Float64[], Float64[];  # accumulators (a, b, x0 as in the test problem above)

julia> for i=20:60
           xy, ry = Base.LinAlg.LAPACK.gelsy!(copy(a), copy(b), 2.0^-i)
           xd, rd = Base.LinAlg.LAPACK.gelsd!(copy(a), copy(b), 2.0^-i)
           push!(residy, norm(a*xy-b))
           push!(residd, norm(a*xd-b))
           push!(errory, norm(xy-x0))
           push!(errord, norm(xd-x0))
       end

which can be plotted as:

using Gadfly
plot(layer(x=map(x->2.0^-x, 20:60), y=residy, Geom.line, Theme(default_color=color("red"))),
     layer(x=map(x->2.0^-x, 20:60), y=errory, Geom.line, Theme(default_color=color("red"))),
     layer(x=map(x->2.0^-x, 20:60), y=residd, Geom.line),
     layer(x=map(x->2.0^-x, 20:60), y=errord, Geom.line),
     layer(xintercept=[eps(), sqrt(eps())], Geom.vline(color=color("orange"))),
     Guide.xlabel("Tolerance"), Guide.ylabel("Error or residual"), Scale.x_log2, Scale.y_log2,
)

[plot: errors and residuals vs. tolerance for gelsy! and gelsd!]

(red = DGELSD, blue = DGELSY, orange = eps() and sqrt(eps()); errors are the top lines, residuals are the bottom lines)

showing that despite appeal to higher authority, gelsd! can uniformly compute a worse solution than gelsy! on your matrix for regularization thresholds below 2^-42.

This graph actually shows why this discussion seems to be going at cross-purposes. If you set the threshold to eps(), there is essentially no change in the computed residuals but it can cause the error to blow up. If you set the threshold to sqrt(eps()) both the error and residual are contaminated by regularization. There is a happy medium for this problem where the tolerance can be set to ~ 2^-36 where both the error in the residual and the error in the solution are effectively minimized. But it is clear from my experiments and @andreasnoack's that setting the default threshold to eps() is also the wrong thing to do in general, since one cannot assume that the user wants to minimize the residual at the expense of the error or vice versa.

Oddly enough, 2^-36 is approximately the value of eps()*(1 + 1.8√n)*1.9*√n*(n+1), which is f(n)*eps() bounded above by assuming 2⁻ᵗ κ(R̄′) < 1. I am not sure yet whether there is something to this, or if this is simply numerology.

@jiahao
Member

jiahao commented Nov 4, 2014

Ok, it looks like there is more than numerology going on here.

If we make the approximation that β is its upper bound (unnumbered equation at the very end of Section 6 of Björck, 1967), then β ≈ c √n κ(A) where c is all the numerical constants in the definition of β. Differentiate the upper bound for the error in the residual with respect to κ(A), set the derivative to 0 and solve. The result is simply 1/κ(A) = 0.5 c √n = 1.71 n (n+1) 2⁻ᵗ - remarkably, all the data regarding the problem drop out except for the problem size!

For the problem at hand (n=100), the rcond threshold is 1/κ(A) = 3.83e-12 ~ 2^-37.9, which is not a bad guess for the optimal threshold.
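
As a quick sanity check, the heuristic evaluated at n = 100 (sketch only):

n = 100
rcond = 1.71*n*(n + 1)*eps()   # 2^-t taken as eps()
println(rcond)                 # ≈ 3.83e-12
println(log2(rcond))           # ≈ -37.9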

I could not see a similar trick to derive a heuristic condition number threshold that worked to minimize the error that did not involve the norms of A, b, x and r.

tl;dr: there is a very simple approximation for the rcond threshold for least-squares problems where you want to minimize the residual. for least-squares problems where you want to minimize the error in the solution, I don't know of a simple heuristic.

@zgimbutas
Contributor Author

Thank you very much for your systematic and thoughtful response. Although I agree that the least squares problem I posed, in which the right-hand-side b is in the range of the matrix A, is somewhat atypical, the goal of a least squares problem min || r ||, where r = Ax-b, does not require a balance between this objective and stability of the solution x. While in many applications it is desirable for x to be stable, that is achieved in practice, as necessary, by separate introduction of some variety of regularization. The form of regularization must depend on how these two objectives are to be balanced, which in turn depends on the particular problem. The pure least squares problem is simply to obtain an x that minimizes || r ||; none of the literature is ambiguous about this objective.

Björck’s error bounds are correct, of course, but are overly pessimistic when b is in the range of A. My earlier statement about obtaining full machine precision, which was intended to refer to this case, was insufficiently circumscribed.

I appreciate your evaluation of residuals and errors for LAPACK routines gelsy and gelsd. I attempted to match them, with the definitions

m = 1000
n = 100
a = randn(m,n) * hilb(n)
x0 = randn(n)
b = a * x0

and your code subsequently. (Did we do something differently?) For both your plot and mine, the residuals from gelsy and gelsd are similar, but to my eye gelsd (shown in blue) appears to have a modest edge in accuracy. My plot differs from yours in that the errors in the solution are nearly independent of the specified tolerance, and the residual begins to increase shortly after the tolerance increases beyond machine precision:

[plot: errors and residuals vs. tolerance for gelsy! and gelsd!, with x0 = randn(n)]

In any case, setting the tolerance to machine epsilon is most effective at minimizing the residuals.

@jiahao
Member

jiahao commented Nov 4, 2014

Yes, I had neglected to mention that my plot was for @andreasnoack's slightly modified problem with your a matrix but with b = ones(n) instead of randn(n). Björck's error bound shows quite clearly that the norm of b contributes strongly to the bound on the computed error, so b=ones(n) turns out to have quite strong sensitivity for the error on small cutoffs.

The conclusion I draw from these numerical experiments and my readings is that we really should work more toward exposing cond or rcond as a user-specifiable cutoff in all these least-squares routines, since the correct choice of cutoff depends on the problem at hand.

Again to summarize:

  • rcond=sqrt(eps()) (our current hard-coded default) gives quite a lot of regularization noise,
  • rcond=eps() happens to work for some problems but is not always a good general choice, especially for problems pertaining to minimizing the error,
  • for minres problems the optimal cutoff should be something like rcond=2n*(n+1)*eps(),
  • for minimizing-error problems there is no simple heuristic that can be derived from a careful error analysis like Björck's.

@andreasnoack has solicited advice from Per Christian Hansen on the problem of picking a sensible default cond/rcond cutoff, so we may yet get a definitive answer from the experts on least-squares problems.

@jiahao
Member

jiahao commented Nov 4, 2014

The pure least squares problem is simply to obtain an x that minimizes || r ||; none of the literature is ambiguous about this objective.

The minres problem in the L2-norm is only one possible application of least squares. Other problems which can also be solved by least squares can be formulated, such as problems to minimize errors, or problems to minimize both errors and residuals.

@zgimbutas
Contributor Author

Thank you, I have replicated the first set of plots by setting x0 = ones(n) and b = A*x0. x0 = ones(n) projects mostly onto the right singular vectors of A corresponding to the largest singular values, which would explain the faster convergence of the residuals with respect to the tolerance parameter. The test solution x0 = randn(n) activates and tests all singular vectors (x0 = cos(c*ones(n)), for c in [0,1], gives something in between).

@andreasnoack
Member

A few more answers below. As I wrote earlier, sqrt(eps()) might be too much, but I also believe that eps() is too little.

is getting the solution x right more important than matching the right hand side b

Yes. In general, I really think it is, but this could of course be different in special applications.

The pure least squares problem is simply to obtain an x that minimizes || r ||; none of the literature is ambiguous about this objective.

The pure least squares problem is stated in exact arithmetic, but we have to deal with floating point arithmetic. As I said above, I think the goal is to get a good solution, i.e. a solution close to the infinite precision least squares solution.

We want a \ that is able to handle rank deficient problems as well, which also requires some degree of regularization. Maybe it is too much right now, but it would help if you gave specific references when you state that "truncating to sqrt(eps()) is just wrong". Which discussion of rank deficient problems is this statement based on?

Agree, some very mild regularization by setting the tolerance parameter to the machine precision eps() is likely taking place in Matlab.

By reverse engineering I have just found it to be m*eps(), which is 1000*eps() ≈ 2^(-42) in your example.

In my opinion, this is not a good idea, since the accuracy will change depending only on the fact that we prefactorize the matrix A.

Changing behaviour depending on prefactorization is pretty much our design choice. E.g. in the square case you can solve a system with lufact(A)\b, qrfact(A)\b, bkfact(A)\b and cholfact(A)\b, and in most cases you can turn pivoting on and off, which will then give you different precision. This gives you flexibility with very simple syntax. Let's see where we end up, but in your case, where you don't care about the solution at all, I think it would be okay to express that through qrfact(A)\b. After all, I believe most users of \ are interested in a good solution.

@zgimbutas
Contributor Author

I see, so it would be really nice to find the middle ground and keep accuracy in the residuals while providing as good and robust a solution as possible, without introducing unnecessary numerical errors due to truncation below the effective noise floor (effectively tracking the numerical rank). I completely missed the rank argument, and, of course, it should be respected by the scheme.

For this use case, I found the following interesting discussion on the numpy discussion board. They were facing the same problem of determining a reasonable threshold value for general use within the numpy package, and had a long discussion on the subject with references to Golub and Van Loan, Matlab, Numerical Recipes, etc.:

http://github.com/numpy/numpy/pull/357

http://www.mail-archive.com/[email protected]/msg37819.html

Their decision was to go with a conservative Matlab-type estimate, eps*max(m,n). This type of estimate is currently used in Julia to determine the numerical rank and the null space, and for the SVD-based least squares, so it might be a good consistent choice to make (see also http://www.mathworks.com/help/matlab/ref/rank.html).
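
For the 1000x100 test problem above, that estimate works out to roughly the same value as the m*eps() figure reverse engineered from MATLAB earlier (sketch):

m, n = 1000, 100
tol = eps()*max(m, n)   # ≈ 2.2e-13, i.e. about 2^(-42)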

One can also make a strong argument from a statistical point of view for the tolerance threshold eps/2*sqrt(m+n+1), or something like that; again, see http://www.mail-archive.com/[email protected]/msg37819.html, and

@article{konstantinides1988statistical,
  title={Statistical analysis of effective singular values in matrix rank determination},
  author={Konstantinides, K. and Yao, K.},
  journal={IEEE Transactions on Acoustics, Speech, and Signal Processing},
  volume={36},
  number={5},
  pages={757--763},
  year={1988},
  publisher={IEEE}
}

On average, this threshold should give the best results, but, if robustness is required, it might not be appropriate.

And I dare not argue for the eps() threshold in this use case.

It would be interesting to investigate what the optimal threshold parameter might be for a fixed right hand side. Thanks for pointing out that this is a somewhat subtle issue.

@andreasnoack
Member

Sorry for the delay. I have looked more at the residual norm as a function of the threshold when A is a random matrix with different kinds of singular value decay and with different kinds of solution vectors, e.g. random and constant, with different means.

In most cases, the residual norm is almost insensitive to the threshold value. This is true even in moderately ill-posed problems (regularly decaying singular values). However, when the singular values hit the bottom of the machine precision, as in this example,
[plot: singular value decay reaching machine precision]

the residual norm will start growing immediately with the threshold tolerance.

For rank deficient problems, the residual norm will be smaller when the rank is wrongly detected to be full, and then completely stabilizes at a higher plateau when the threshold gets high enough to estimate the correct rank.

The conclusion is that we cannot have both at the same time, and as I read your last post you kind of agree with that. However, our present tolerance seems to be larger than necessary, so I'd suggest that we follow MATLAB here, as their choice seems to work well in all the examples I have tried. If a user happens to have an ill-posed problem she'd just have to be more careful.

If you are okay with this, please update the pull request accordingly and it should be ready to merge soon. We should still continue to think about how we can make it easier for the user to experiment with different choices for the tolerance, but there is a separate issue for that.

@andreasnoack
Member

When updating the pull request, could you please also squash the commits?

The present tolerance is larger than necessary in the implementation
of the ldiv operator, which can lead to loss of accuracy. Set the
default tolerance parameter to be compatible with the conservative
MATLAB estimate for determining the numerical rank of A, i.e.,
tol = eps()*maximum(size(A)).

In addition, fix the implementation of the SVD-based least squares,
and increase the default tolerance parameter for the Moore–Penrose
pseudoinverse, which cannot be determined to full machine precision
for ill-conditioned matrices.
@zgimbutas
Contributor Author

I agree, going with the MATLAB convention seems to be a very reasonable compromise. It is slightly suboptimal (I am losing a digit or so in my experiments compared with MATLAB/Octave, which is OK), but it is robust and consistent with the current numerical rank convention in Julia.

I updated the pull request, incorporating the above convention for the ldiv operator as follows: maximum(size(A))*eps(real(float(one(eltype(B))))). I also squashed the extra commits and force-pushed to the forked repository.
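
For Float64 data this expression reduces to the plain MATLAB-style estimate discussed above (a quick sketch; A and B stand in for the system matrix and right hand side):

A = randn(1000, 100); B = randn(1000)
tol = maximum(size(A))*eps(real(float(one(eltype(B)))))
# eltype(B) == Float64, so tol == 1000*eps() ≈ 2.2e-13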

And thanks for the interesting discussion, it appears that this issue needs more careful investigation in the long run...

@andreasnoack
Member

Because of the changes I made to separate the handling of pivoted and non-pivoted QR in the least squares solver, your pull request had to get rebased on top of master and the conflicts resolved. I've just done that and pushed it to master, to save you the work (you are still the author of the commit though). Hence I close this now.

Could you try to use the new threshold and call LAPACK.gelsy! directly to see if it is more precise? I have tried to follow LAPACK's xgelsy closely in https://github.com/JuliaLang/julia/blob/master/base/linalg/factorization.jl#L272, but it might be the reason you are losing precision. Alternatively, it might just be the extra precision from using the SVD instead of the pivoted QR.

I also learned a lot from the conversation, so please continue to report issues and send fixes.

@zgimbutas zgimbutas deleted the fix-least-squares branch November 23, 2014 02:01
jiahao added a commit that referenced this pull request Feb 1, 2015
Adds in discussion and references useful for following #8859

[av skip]
jiahao added a commit to jiahao/jin that referenced this pull request Feb 20, 2015