Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix undefined behavior in LDLT #222

Merged
merged 1 commit into from
Jan 7, 2025
Merged

Conversation

jwnimmer-tri
Copy link
Contributor

When m==n the failed_rect.data() is nullptr, but then we still subtract some small integer from it. Doing arithmetic on a null pointer is undefined behavior. Clang's undefined Behavior sanitizer says ldlt_app.cxx:2420:38: runtime error: applying non-zero offset 18446744073709551536 to null pointer.

The copy_failed_rect ends up being a no-op because m==rfrom, but it's still UB to do arithmetic on nullptr, even if never de-referenced.

@jfowkes
Copy link
Contributor

jfowkes commented Jan 7, 2025

Thank you very much for the fix, @mjacobse would you be able to review this one?

@jfowkes jfowkes requested a review from mjacobse January 7, 2025 09:49
@mjacobse
Copy link
Collaborator

mjacobse commented Jan 7, 2025

Thanks, this should definitely be fixed and as far as I can see the proposed change would do it. I was just a bit bothered by needing a somewhat magic looking conditional just for this when it feels like there should be a more natural way to handle this special case. So I thought about it a bit and I believe this extra call for copying the blocks that contain both diagonal and rectangular parts can be written more succinctly and intuitively:

copy_failed_rect(
      get_nrow(nblk-1, m, block_size) - get_nrow(nblk-1, n, block_size),
      get_ncol(jblk, n, block_size), 0, cdata[jblk],
      failed_rect.data() + jfail*(m-n), m-n,
      &a[jblk*block_size*lda+n], lda
      );

This should be exactly equivalent since (nblk-1)*block_size + get_nrow(nblk-1, n, block_size) == n as I convinced myself with an illustration (see below, the hatched part is what the extra call is for). And it has the additional advantage that for m==n, the nullptr will only be incremented by 0 which should be fine, so it also fixes this issue. Let me know if I overlooked anything, otherwise I think I'd prefer this solution for better readability.

PXL_20250107_144522493

When m==n the failed_rect.data() is nullptr, but then we still
subtract some small integer from it. Doing arithmetic on a null
pointer is undefined behavior. Clang's undefined Behavior sanitizer
says ldlt_app.cxx:2420:38: runtime error: applying non-zero offset
18446744073709551536 to null pointer

The copy_failed_rect ends up being a no-op because m==rfrom, but it's
still UB to do arithmetic on nullptr, even if never de-referenced.
@jwnimmer-tri
Copy link
Contributor Author

Yes, confirmed. I ran that new version through our UBSan test case that caught this, as well as our full Drake test suite, and everything passed. I've pushed that version now.

@mjacobse
Copy link
Collaborator

mjacobse commented Jan 7, 2025

Great, thanks! Seems good to me but would be happy if you also take a look into the new version @jfowkes

@jfowkes
Copy link
Contributor

jfowkes commented Jan 7, 2025

Very neat, many thanks @mjacobse!

@nimgould you may want to incorporate this fix into galahad.

@jfowkes jfowkes merged commit f720e16 into ralna:master Jan 7, 2025
16 checks passed
@jwnimmer-tri jwnimmer-tri deleted the clang-ubsan branch January 7, 2025 20:04
@nimgould
Copy link
Contributor

nimgould commented Jan 8, 2025

Thank you, all, mods have been incorporated in galahad's multiprecision cpu version, and tests succeed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants