Fix memory fault during scaling of singular matrix #205

Merged
merged 4 commits into from
Jan 10, 2025

Conversation

chrhansk
Contributor

Should fix #200

@jfowkes jfowkes added the bug label Jun 15, 2024
@jfowkes
Contributor

jfowkes commented Jun 15, 2024

@chrhansk unfortunately this change seems to break the main SSIDS test on all platforms.

@mjacobse
Collaborator

Function hungarian_match is not only called from the scaling API, but also for the matching-based METIS ordering of SSIDS. Function mo_match in match_order.f90 expects unmatched entries to be signaled by negative values:

if (cperm(i) .lt. 0) then

So I would suggest not to change the behaviour of hungarian_match, but instead fix how the unsymmetric scaling code deals with what hungarian_match returns.

Alternatively one could tackle

! FIXME: At some stage replace call to mo_match() with call to
! a higher level routine from spral_scaling instead (NB: have to cope with
! fact we are currently expecting a full matrix, even if it means 2x more log
! operations)
but that might require quite a bit of refactoring?

@chrhansk
Contributor Author

I was not aware of the problem. I tried to zero out the problematic entries manually in the postprocessing function. There still seems to be a problem in ssmfe_ciface_test though. The problem is that I cannot reproduce the error on my system (tests pass without any memory issues). Do you have any ideas?

@jfowkes
Contributor

jfowkes commented Jun 15, 2024

Many thanks @chrhansk, the intermittent SSMFE C test failure is #204 (nothing to do with your changes) which annoyingly I also cannot reproduce on my system making it very difficult to debug and fix. @mjacobse could you review?

@jfowkes jfowkes requested a review from mjacobse June 15, 2024 15:20
Collaborator

@mjacobse mjacobse left a comment

Thanks, this is certainly an improvement over the current behavior that causes a segfault. However I am not sure the solution of doing the negative to zero conversion in the postprocessing function match_postproc is ideal:

  • It gives unnecessary responsibility to match_postproc and requires changing the match argument to intent(inout) which to my mind makes the calling contract less clear
  • When using the auction method, the conversion in match_postproc is done redundantly for a second time after

    spral/src/scaling.f90

    Lines 1487 to 1488 in 662c7ac

    ! We expect unmatched columns to have match(col) = 0
    where(match(:) .eq. -1) match(:) = 0
    already did it

Instead, I think it would be better to do the conversion in hungarian_wrapper before calling match_postproc. This is pretty minor and invisible to users though, so perhaps not relevant enough on its own.

More relevant though is that the way unmatched entries are returned from hungarian_scale_sym and hungarian_scale_unsym is now inconsistent. The unsymmetric version will now return 0 while the symmetric one continues to return negative entries, since the singular symmetric case does not call match_postproc as seen here:

spral/src/scaling.f90

Lines 679 to 686 in 662c7ac

if ((.not. sym) .or. (inform%matched .eq. n)) then ! Unsymmetric or symmetric and full rank
   ! Note that in this case m=n
   rscaling(1:m) = dualu(1:m)
   cscaling(1:n) = dualv(1:n) - cmax(1:n)
   call match_postproc(m, n, ptr, row, val, rscaling, cscaling, &
      inform%matched, match, inform%flag, inform%stat)
   return
end if
Curiously, the current documentation incorrectly claims to return zero for unmatched in both cases (hungarian_scale_sym and hungarian_scale_unsym), perhaps copy-pasted from the description of the auction method for which it is correct. Options I can think of:

  • Do the conversion from negative to zero on a temporary copy of match. That way, both hungarian_scale_sym and hungarian_scale_unsym continue to return negative entries. With this option, the incorrect documentation for both cases should be fixed (perhaps in a separate issue).
  • Accept this inconsistency. With this option, the incorrect documentation for the symmetric case should be fixed (perhaps in a separate issue)
  • Make the symmetric case work with and return zeros for unmatched entries too. The necessary changes should be limited to hungarian_wrapper and should work well with the changes for the unsymmetric case (when done in hungarian_wrapper, which would add a major reason for doing so to the above). Because of that it might make sense to change both at once instead of in a separate issue.

The latter two options would break potential users who are relying on the negative entries (despite the wrong documentation) or would like to do so in the future, but they would introduce consistency with how the auction method returns the matching and with the documented behavior. Not sure what's the best call here.
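To make the first option concrete, here is a minimal sketch of converting negative entries to zero on a copy, so that the array the caller passed in keeps its negative markers. The module and subroutine names (match_convention, unmatched_to_zero) are hypothetical and not part of SPRAL; this only illustrates the idea and is not the change that was merged.

```fortran
module match_convention
   implicit none
contains
   ! Hypothetical helper (not a SPRAL routine): copy a matching that marks
   ! unmatched entries with negative values into one that marks them with
   ! zero. The input array is left untouched.
   subroutine unmatched_to_zero(m, match_in, match_out, nunmatched)
      integer, intent(in) :: m
      integer, dimension(m), intent(in) :: match_in
      integer, dimension(m), intent(out) :: match_out
      integer, intent(out) :: nunmatched

      match_out(:) = match_in(:)
      ! Same pattern as the existing auction code path, but for any negative value
      where (match_out(:) .lt. 0) match_out(:) = 0
      nunmatched = count(match_out(:) .eq. 0)
   end subroutine unmatched_to_zero
end module match_convention
```

A postprocessing routine could then be handed the copy while keeping its match argument intent(in), which is the calling-contract concern raised in the first bullet of this review.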

tests/scaling.f90 Outdated Show resolved Hide resolved

allocate(a%ptr(n+1))
allocate(a%row(nz), a%val(nz))
allocate(rscaling(m), cscaling(n), match(n))
Collaborator

Should be match(m), not match(n)

Contributor Author

I am not sure about that. The docs state that it should be n (similarly for all matching algorithms). I am not exactly sure why to be honest though, but calling it with m in the unit test causes a segfault on my machine.

Collaborator

@mjacobse mjacobse Jun 16, 2024

Hm you are right, I do get invalid write with valgrind for match(m). But when doing match(n), the last two entries in the example of that test are left uninitialized which surely is not intended either? Unless the idea is to use the info struct to obtain until where the values are initialized. Though the signature for hungarian_scale_unsym does use m:

integer, dimension(m), optional, intent(out) :: match

There seems to be another (unrelated?) issue here... :(

But the existing random unsymmetric tests also use an overallocated match(maxn), i.e. not leading to invalid writes but only uninitialized return, so agree with doing the same here. We can deal with that in a separate issue.

Collaborator

See #207

Collaborator

Turns out that match(m) is correct but that this happened to reveal a secondary bug which should probably be fixed before applying the changes proposed in this PR, see #200 (comment).

src/scaling.f90 Outdated
@@ -1619,7 +1619,7 @@ subroutine match_postproc(m, n, ptr, row, val, rscaling, cscaling, nmatch, &
 real(wp), dimension(m), intent(inout) :: rscaling
 real(wp), dimension(n), intent(inout) :: cscaling
 integer, intent(in) :: nmatch
-integer, dimension(m), intent(in) :: match
+integer, dimension(m), intent(inout) :: match
Collaborator

Could be avoided by doing the conversion from negative to zero entries at the callsite instead of here, see detailed comment

a%n = n
a%m = m

a%ptr(1:n+1) = (/ 1, 3, 5, 5, 5, 7 /)
Collaborator

Should this be 1, 3, 5, 6, 6, 7 as in #200? As it stands, this matrix would have a duplicate entry.
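A small checker along the following lines can confirm whether a given column-pointer array produces duplicate entries. It is only a sketch: the module and function names (csc_check, has_duplicates) are hypothetical, and any row array used to try it out has to be supplied by the reader, since the test's row indices are not shown in this thread.

```fortran
module csc_check
   implicit none
contains
   ! Hypothetical helper (not a SPRAL routine): returns .true. if any column
   ! of the compressed sparse column pattern (ptr, row) lists the same row
   ! index more than once.
   logical function has_duplicates(n, ptr, row)
      integer, intent(in) :: n                      ! number of columns
      integer, dimension(n+1), intent(in) :: ptr    ! column pointers
      integer, dimension(:), intent(in) :: row      ! row indices
      integer :: col, i, j

      has_duplicates = .false.
      do col = 1, n
         ! Compare every pair of entries within the column
         do i = ptr(col), ptr(col+1) - 1
            do j = i + 1, ptr(col+1) - 1
               if (row(i) .eq. row(j)) then
                  has_duplicates = .true.
                  return
               end if
            end do
         end do
      end do
   end function has_duplicates
end module csc_check
```

As an illustration with a made-up row array (/ 1, 2, 1, 2, 3, 3 /): the pointers 1, 3, 5, 5, 5, 7 put the last two entries (both row 3) into column 5, so the function reports a duplicate, whereas 1, 3, 5, 6, 6, 7 splits them across columns 3 and 5 and no duplicate is reported.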

@chrhansk
Contributor Author

Looking back at this, and in particular at #205: what is the proposed solution that you are converging on? As I understand #200 (comment), the lines setting the unmatched rows to negative values should be commented out, which to me implies that the values of unmatched rows should be set to zero instead. Am I correct in this regard?

Such a change would necessitate corresponding changes in the implementation of SSIDS, as mentioned here: #205 (comment)

Should I make those changes to the scaling and within SSIDS?

@chrhansk chrhansk force-pushed the feature-singular-scaling branch from a5e5bb2 to 8b27c63 Compare June 21, 2024 13:54
@jfowkes
Contributor

jfowkes commented Jun 21, 2024

Yes we have come to the conclusion that whilst negative row indices (to signal which of the rows are unmatched) make sense for square matrices, this does not make sense for general rectangular matrices. As far as we can tell SSIDS does not make use of the negative row indices themselves, but merely checks if a row is unmatched, so in theory changing the values of unmatched rows to zero should work fine provided we update SSIDS to check for zero rather than negative values.
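One way to picture the consumer-side change is a test that does not depend on the sign convention at all. The sketch below assumes that valid matches are always row indices of at least 1, so any value less than or equal to zero can be treated as unmatched. The names (unmatched_check, is_unmatched, count_unmatched) are hypothetical, and this is not necessarily how the check in SSIDS was actually updated.

```fortran
module unmatched_check
   implicit none
contains
   ! Hypothetical predicate (not a SPRAL routine): valid matches are assumed
   ! to be row indices >= 1, so anything <= 0 counts as unmatched. This
   ! accepts both the negative and the zero convention.
   pure logical function is_unmatched(val)
      integer, intent(in) :: val
      is_unmatched = (val .le. 0)
   end function is_unmatched

   ! Count the unmatched entries of a matching of length n
   pure integer function count_unmatched(n, cperm)
      integer, intent(in) :: n
      integer, dimension(n), intent(in) :: cperm
      integer :: i

      count_unmatched = 0
      do i = 1, n
         if (is_unmatched(cperm(i))) count_unmatched = count_unmatched + 1
      end do
   end function count_unmatched
end module unmatched_check
```

A check of this form would keep working during a transition in which some code paths still return negative values, at the cost of no longer distinguishing the two conventions.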

@chrhansk
Contributor Author

I adjusted the scaling accordingly and added a test for the symmetric singular case. There is however still the problematic part here mentioned in #200. In this case it causes a segfault (if the match array has a size of m < n), so this needs to be addressed.

@jfowkes
Copy link
Contributor

jfowkes commented Jun 22, 2024

Many thanks @chrhansk, the suggestion has been to comment this problematic section out as it should no longer be required if we now return zero for unmatched entries. I guess it's a case of trying that and seeing if anything breaks?

@mjacobse
Copy link
Collaborator

Basically the suggestion was this mjacobse@b815fac and indeed, it seems to be all that's needed to fix all issues at once. Personally I would like to see randomly generated singular tests to confirm, since all the random tests right now are nonsingular.

@jfowkes
Copy link
Contributor

jfowkes commented Jun 24, 2024

Indeed many thanks! @chrhansk I suggest we apply mjacobse@b815fac to this PR and add a randomly generated singular matrix test to verify that things don't break. We think this should be all that is required now.

@chrhansk
Contributor Author

Great, I appreciate your effort.

@jfowkes jfowkes force-pushed the feature-singular-scaling branch from cd86051 to 4205d0c Compare January 9, 2025 09:54
@jfowkes
Copy link
Contributor

jfowkes commented Jan 9, 2025

@mjacobse would you be able to re-review this PR at some point? I think this is now more or less ready to go in.

Collaborator

@mjacobse mjacobse left a comment

Yeah sure, let's merge it. I had tried to allow singular cases within the functions that do the random tests for scaling, but it turned out to not be so easy so I kinda lost track of it, sorry about that. Perhaps at some point we can look into more extensive tests, but for now let's go with this, should be a useful fix.

tests/scaling.f90 Outdated Show resolved Hide resolved
tests/scaling.f90 Outdated Show resolved Hide resolved
@jfowkes
Contributor

jfowkes commented Jan 10, 2025

No worries, generating random singular matrices is not so easy. I'll add your suggestions and merge.

@jfowkes jfowkes merged commit 88541c7 into ralna:master Jan 10, 2025
16 checks passed

Successfully merging this pull request may close these issues.

Memory fault during unsymmetric scaling of singular matrix with Hungarian algorithm
3 participants