Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver, timeutil: fix some Timer user-after-Stops #61373

Merged
merged 1 commit into from
Mar 4, 2021

Conversation

andreimatei
Copy link
Contributor

@andreimatei andreimatei commented Mar 2, 2021

Two guys were continuing to use a Timer after Stop()ing it, which is
illegal.

Release note: None
Release justification: Bug fix.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the bug fix. Is there a way we would could have caught this easier? What if Stop operated on a **Timer and set the reference to nil? That would prevent the use-after-free.

I'm not sure about the StopExclusive() extension. I wouldn't want to complicate an already complicated API for limited benefit and I don't think we have any reason to believe that not adding to the pool during this small race is causing issues (unless you were seeing something in cpu/heap profiles). But can we do what you want unconditionally? Do we need to worry about concurrent writes to t.Read? Either way, it may be better to pull it into a separate PR so we can iterate on it without holding up the bug fix portion.

Reviewed 2 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

Two guys were continuing to use a Timer after Stop()ing it, which is
illegal.

Release note: None
Release justification: Bug fix.
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed everything but the fix.

Is there a way we would could have caught this easier? What if Stop operated on a **Timer and set the reference to nil? That would prevent the use-after-free.

Well but you have the same problem tho - a StopTimer(**Timer) could not be called concurrently with a receive (<-t.C). So, you wouldn't be able to simply generally replace Stop() with it (since Stop can be called concurrently with a receive[*]. So, you'd end up with an extra API.

[*] The comments on Stop say:

// Stop does not
// close the channel, to prevent a read from succeeding incorrectly.
// Note that a Timer must never be used again after calls to Stop as the timer
// object will be put into an object pool for reuse.

This comment is a bit ambiguous, because the two sentences seem to be at odds. The first sentence implies that it's possible for somebody to be receiving on t.C concurrently. The second sentence says that you can't use the channel after Stop, so depending on the definition of "after", one shouldn't be receiving on it. In any case, I believe we do have code Stopping and receiving at the same time.

I'm not sure about the StopExclusive() extension. I wouldn't want to complicate an already complicated API for limited benefit and I don't think we have any reason to believe that not adding to the pool during this small race is causing issues (unless you were seeing something in cpu/heap profiles). But can we do what you want unconditionally? Do we need to worry about concurrent writes to t.Read? Either way, it may be better to pull it into a separate PR so we can iterate on it without holding up the bug fix portion.

I think we do need to worry about concurrent t.Read = true, yeah. What could we do about that, if we wanted a Stop that always puts the timer back in the pool? We could make t.Read an atomic (set through a new method), and have Stop do:

res = t.timer.Stop()
if !res {
  select {
  case <-t.C:
    drained = true
  default:
  }
  if !drained {
    <spin until t.Read == true, which should be soon>
  }
}

WDYT? Doesn't make the heart joyous.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)

@ajwerner
Copy link
Contributor

ajwerner commented Mar 3, 2021

#38055

I have regrets

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gimme a stamp pls

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)

@craig
Copy link
Contributor

craig bot commented Mar 4, 2021

Build succeeded:

@craig craig bot merged commit 37cdf01 into cockroachdb:master Mar 4, 2021
@andreimatei andreimatei deleted the small.timer-stop branch January 21, 2022 17:59
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Mar 6, 2024
119901: timeutil: stack-allocate Timer, remove pooling r=nvanbenschoten a=nvanbenschoten

Informs #119593.
Closes #38055.

This PR removes the pooling of `timeutil.Timer` structs and, in doing so, permits the structs to be stack allocated so that no pooling is necessary. This superfluous (and in hindsight, harmful) memory pooling was introduced in f11ec1c, which also added very necessary pooling for the internal time.Timer structs.

The pooling was harmful because it mandated a contract where Timer structs could not be used after their Stop method was called. This was surprising (time.Timer has no such limitation) and led to subtle use-after-free bugs over time (#61373 and #119595). It was also unnecessary because the outer Timer structs can be stack allocated. Ironically, the only thing that causes them to escape to the heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

----

The PR then improves the memory pooling of the inner `time.Timer` so that it is always recycled. This was originally identified by `@andreimatei` in #13466 (review).

Doing so has a positive impact on the microbenchmark introduced in the first commit, demonstrating that timers can be stack-allocated and require zero heap allocations:
```
name      old time/op    new time/op    delta
Timer-10     152µs ± 1%     153µs ± 1%      ~     (p=0.133 n=9+10)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%        0B       -100.00%  (p=0.000 n=10+10)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
```

----

cc. `@andreimatei` `@ajwerner`

Epic: None
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants