Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fcase() segfault #6452

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Sep 2, 2024

Closes #6448.

@aitap I didn't add any test -- any ideas for what would make sense as a regression test here? Technically this could have been caught by existing tests, under a stricter operating environment.

The test from OP did reliably segfault, but it's harder to turn that into a test for the absence of a segfault -- experimenting, I was getting it to run 1500+ times before segfaulting.

Copy link
Member Author

MichaelChirico commented Sep 2, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @MichaelChirico and the rest of your teammates on Graphite Graphite

@MichaelChirico MichaelChirico marked this pull request as ready for review September 2, 2024 06:02
@MichaelChirico MichaelChirico force-pushed the 09-02-more_descriptive_variable_names_for_fcase_counting_variables branch from 84e0932 to 49bc983 Compare September 2, 2024 06:16
@MichaelChirico MichaelChirico force-pushed the 09-02-fix_fcase_segfault branch from 4e5a82d to 25896d2 Compare September 2, 2024 06:16
Copy link

github-actions bot commented Sep 2, 2024

Comparison Plot

Generated via commit a563ada

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 10 minutes and 59 seconds

Time taken to run atime::atime_pkg on the tests: 5 minutes and 41 seconds

@MichaelChirico
Copy link
Member Author

What @aitap suggested is a much simpler one-line fix here:

value0 = thens;

-      value0 = thens;
+      value0 = PROTECT(thens); nprotect++;

Looking again though, it does look wasteful to me that we repeatedly run getAttrib() on value0 -- and the only reason we keep value0 around to begin with is to make sure subsequent thens match the first one (re-named value0) on class & factor levels. So the approach taken here is to record those attributes instead, and not keep all of value0 protected.

It makes the code a bit lengthier, but should be more memory-efficient.

@MichaelChirico MichaelChirico force-pushed the 09-02-fix_fcase_segfault branch from 25896d2 to 3cb9108 Compare September 2, 2024 06:27
@HughParsonage
Copy link
Member

Are we sure this is the fix? I am surprised that a miscalculated UNPROTECT would not have triggered a stack imbalance warning message.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Sep 2, 2024

Are we sure this is the fix? I am surprised that a miscalculated UNPROTECT would not have triggered a stack imbalance warning message.

there's no missed unprotect -- because thens is REPROTECT'd.

I confirmed the examples from the issue which reliably segfault pretty quickly, no longer do so.

@aitap
Copy link
Contributor

aitap commented Sep 2, 2024

any ideas for what would make sense as a regression test here

Can we afford to enable gctorture for one test? With it enabled,With manual calls to gc(), it's possible to force value0 to point to memory allocated by one of the eval() calls and thus fail the class test:

library(data.table)
x <- 1:3
system.time(local({
 fcase(
  x<2, structure(list(1), class = "foo"),
  x<3, structure(list(2), class = "foo"),
  # after i=1, `value0` has been unprotected;
  # one of the temporary values from the following eval()s
  # has a chance to show up at its former address
  { gc(full = TRUE); replicate(10, FALSE); x<4 }, `attr<-`(list(3), "class", "foo")
 )
}))
# Error in fcase(x < 2, structure(list(1), class = "foo"), x < 3, structure(list(2),  : 
#   Argument #6 has different class than argument #2, Please make sure all output values have the same class.
# Timing stopped at: 0.01 0 0.02

Without an explicit gc, fcase returns the right answer (a list of 1...3 with the class set) because garbage collection and memory reuse doesn't naturally happen that frequently.

Edit: use replicate to produce an arbitrary number of temporaries and gc() to trigger the issue

@MichaelChirico
Copy link
Member Author

could you test making the input a factor with other attributes too? that will trigger the max # of branches in fcase with a getAttrib call

@MichaelChirico
Copy link
Member Author

Really slick idea to get gc() to happen in the middle of fcase() execution. Not clear what breadth of systems we can expect such a test to actually work as intended, but it's definitely better than nothing!

@MichaelChirico MichaelChirico force-pushed the 09-02-fix_fcase_segfault branch from 3cb9108 to a563ada Compare September 2, 2024 17:02
@aitap
Copy link
Contributor

aitap commented Sep 3, 2024

Here's a test with factors that errors for me on R-devel/data.table from Git on Linux and R-4.4.0/data.table-1.16.0 on Windows, all x86_64:

x <- 1:3
y <- I(as.factor(LETTERS[1:3]))
fcase(
  x<2, y[3],
  x<3, y[2],
  { gc(full = TRUE); for (i in 1:1000) list(factor(letters[3:6])); x<4 }, y[1]
)

It doesn't fail in a more interesting manner because value0 ends up not being a factor:

#2  0x00007ffff44cb764 in fcaseR (rho=0x555558b43390, args=0x555558884098) at fifelse.c:279
279                 error(_("Argument #2 and argument #%d are both factor but their levels are different."), i*2+2);
(gdb) p R_inspect(value0)
@555557edc868 16 STRSXP g0c1 [] (len=1, tl=0)
  @555555a6e668 09 CHARSXP g1c2 [MARK,REF(2109),gp=0x61] [ASCII] [cached] "unique.default"

Somehow, without the lists in the outputs, it takes more effort to get this particular SEXP reused, but it is never free'd, only given out by allocVector.

Copy link
Member

@tdhock tdhock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable to me, thanks!
would we have caught this if we were using rchk? #3618
edit: looks like we do have an rchk ci job, why didn't it catch this? isn't rchk supposed to check the PROTECT etc?

@MichaelChirico
Copy link
Member Author

rchk is a static analyzer, it can't catch everything. possibly it doesn't know how to handle PROTECT_WITH_INDEX.

@MichaelChirico MichaelChirico merged commit 9e845c4 into 09-02-more_descriptive_variable_names_for_fcase_counting_variables Sep 7, 2024
4 of 5 checks passed
@MichaelChirico MichaelChirico deleted the 09-02-fix_fcase_segfault branch September 7, 2024 15:11
MichaelChirico added a commit that referenced this pull request Sep 7, 2024
* More descriptive variable names for fcase counting variables

* Fix fcase() segfault (#6452)
MichaelChirico added a commit that referenced this pull request Sep 7, 2024
* More descriptive variable names for fcase counting variables

* Fix fcase() segfault (#6452)
@tdhock
Copy link
Member

tdhock commented Sep 8, 2024

@kalibera you are the main author of rchk right?
This example may be interesting for you to investigate, for improving rchk?

@kalibera
Copy link

kalibera commented Sep 9, 2024

@kalibera you are the main author of rchk right? This example may be interesting for you to investigate, for improving rchk?

Unfortunately it is not feasible examples like this would be covered. The main challenge for rchk here is, in rchk terminology, that something gets implicitly protected, but later implicitly unprotected. rchk gets the first part, but isn't able to collect enough information to track the second part (value0 is implicitly protected by getting a value from protected thens, but later thens loses its protection in an additional iteration of the loop). It is not realistic for cases like this to be handled and in principle one should not assume that a lack of report from rchk means that the code is correct (or even lacks some kind of bugs) - it's just a bug finding tool based on static analysis, not a formal verification tool.

A higher-level explanation could be that rchk is modelling which "variables" are protected, but it is an approximation (often also of how we think about the code, but still an approximation) - in fact R objects on the heap are protected or not. For this abstraction to work better, one should not copy pointers to SEXPs. When that is still needed, one needs to be careful and perhaps could add a comment to the code about why it does not break protection.

Otherwise I confirm the reason of the bug in the code discussed above: value0 holds to an object initially protected via thens, but then no longer protected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants