Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Add background renewal for table descriptor leases #19005

Merged
merged 2 commits into from
Oct 24, 2017

Conversation

lgo
Copy link
Contributor

@lgo lgo commented Oct 3, 2017

This commit changes TableDescriptor lease acquisition to acquire a new
lease before the current lease expires. This prevents routines from
fully blocking when there are no valid leases.

If renewal continously fails, the error will become user-facing if no
lease is acquired yet by the time a lease is expired.

This patch also adds a benchmark for LeaseManager.AcquireByName
when cached. LeaseManager.AcquireByName is a critical path for
statements and generally when accessing the table descriptor. The
case when the lease is cached serves the majority of operations.

Closes #17227.

cc: @vivekmenezes @andreimatei

@lgo lgo requested a review from a team October 3, 2017 20:53
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 3, 2017

Ignore the first commit which is part of #18989.

Also, ignore the WIP log.Warnings thrown in several places.

I realized a large oversight in the past changes where the renewal check was being done, as it appears AcquireByName caches the results and only calls into tableState.acquire when it's expired. This happened because I was distinctly testing the renewal process of t.acquire, and I (somehow?!) didn't catch it when testing a running cockroach instance.

@lgo lgo force-pushed the joey/table-lease-singleflight-async branch 2 times, most recently from 3472cb5 to 29e73ce Compare October 3, 2017 21:25
@lgo
Copy link
Contributor Author

lgo commented Oct 3, 2017

While acquiring a new lease and re-acquiring (not getting a new lease) seem to be performing about the same as before (might still be worth profiling these), it appears that initializing an async acquisition is far slower than I would hope.

ACQUIRING NEW LEASE
Time: 3.967999ms

RE-ACQUIRING (no new lease)
Time: 940.374µs

BEGIN ASYNC ACQUISITION
Time: 2.135845ms

I haven't thought about this yet, but this may be the result of handing over the lock to the async routine somewhere and getting blocked on that.

pkg/sql/lease.go Outdated
}
return nil, err
})
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not happy about the code above because it is likely to be non performant. Instead I think what we can do is add a timer to tableVersionState and modify m.tablename.get() above to check if the timer is set. If it is not set, it can create a timer using func AfterFunc(d Duration, f func()) *Timer from the time package. The function can renew the lease when it fires and reset the timer to nil.

Additionally when the tableVersionState is released the timer should be stopped. If AfterFunc() is called during the stop then it should wait until Afterfunc sets the timer to nil rather than setting it to nil on its own.

Thoughts?

Copy link
Contributor Author

@lgo lgo Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the timer check and creation don't belong in m.tableNames.get because the method is currently is just focused on retrieving the lease from the table cache. Otherwise, a few additional parameters need to get passed in just to do the timer initialization to renew (LeaseManager, table ID).

Instead, exactly where the added code is now it could be replaced with

if ...timer == nil {
   timer = timer.AfterFunc(s.durationUntilRenewal(), func() { m.Acquire(ctx, hcl.Timestamp{}, tableVersion.ID) })
}

The uncertainty I have is where the timer should live. You suggest that we reset the timer when we release a tableVersionState, so in that case, it can live there. If we get a new lease, we will we have a new tableVersionState, so won't we end up initializing a second timer? (I'm not sure I completely follow how this timer will work out if we reset on tableVersionState release + acquisition)

(unrelated to your comment) I realized this code could instead be restructured to just call LeaseManager.Acquire and instead put the DoChan in tableState.acquire. The only addition to AcquireByName is to check for preemptive renewal and then call LeaseManager.Acquire. That's much cleaner because 1) it's not exposing the DoChan internals here, and 2) LeaseManager.Acquire already does the exact same table name lookup. I was also confused when tableState.acquire wasn't always called previously, so this would make more sense.

That's why I suggested the AfterFunc to be much simpler above.

@lgo lgo force-pushed the joey/table-lease-singleflight-async branch 3 times, most recently from e74a4d4 to aa1cb3e Compare October 18, 2017 03:35
@lgo lgo requested review from a team October 18, 2017 03:35
@lgo lgo force-pushed the joey/table-lease-singleflight-async branch 2 times, most recently from 51afe76 to b9ec6a8 Compare October 18, 2017 04:14
@lgo
Copy link
Contributor Author

lgo commented Oct 18, 2017

Rebased the patch on the singleflight branch as that's now merged.

Previously this patch was a bit more complicated, which resulted in +1ms just when the lease renewal was being kicked off. Instead, this patch is now simplified and on first lease access, it creates a timer to fire later at the desired refresh time. It also uses a lockless manner to enter the timer initialization and broadcast that it has been initialized.

Sidenote: A lock is still involved in findTableState, although the lock is held for a very short period. Because of this, I created a benchmark to assess if this impacts anything. It seems that there is not an impactful performance change, as I think this code path is run about once per statement/plan. (I believe the methodology is correct, it uses a parallel benchmark to attempt to cause lock contention.)

name                        old time/op  new time/op  delta
LeaseAcquireByNameCached-8   406ns ± 0%   507ns ± 0%   ~     (p=1.000 n=1+1)

Review status: 0 of 3 files reviewed at latest revision, 2 unresolved discussions.


pkg/sql/lease.go, line 1220 at r4 (raw file):

			if t := m.findTableState(tableVersion.ID, true); atomic.CompareAndSwapInt32(&t.mu.renewalTimerSet, 0, 1) {
				durationUntilRenewal := time.Duration(tableVersion.expiration.WallTime-hlc.UnixNano()) - m.LeaseStore.leaseRenewalTimeout
				// innerCtx, span := tracing.ForkCtxSpan(ctx, "[async] lease refresh")

@andreimatei any idea why ForkCtxSpan would not work? When this line is uncommented and innerCtx is used within the DoChan spawned it seems to always result in context canceled errors.

My best guess is that the current span is a noopSpan so a no new context is made and we just use the short-lived parent context. If this is true, then ForkCtxSpan is incorrect in stating it Returns the new context. Otherwise, I'm at a loss.

Any suggestions for a workaround? context.Background()?


Comments from Reviewable

@bdarnell
Copy link
Contributor

I haven't looked at the code, but should we take this opportunity to make table leases much shorter? We wanted a long duration because renewals are disruptive, but if that's no longer true it would be nice to have much shorter leases so that you don't have to wait 5 minutes for a schema change after killing a node (which is something that our users seem to run into with some regularity).


Review status: 0 of 3 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 18, 2017

Are there any other cons for shortening the table lease life?

By allowing table versions to increment faster (say 2 minutes), will this cause issues for long running queries or other processes? My assumption is no, as it seems schema changes only continue when those queries are finished -- from the table descriptor lease RFC

[schema changing will] wait for the table descriptor change to propagate to all nodes in the cluster and all uses of the previous version to finish.


Review status: 0 of 3 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 18, 2017

Another con is renewals can still disruptive but to a lesser extent. If the new lifespan is 2 minutes, and only one query happens every 2 minutes (or very close to 2 minutes), queries will still get blocked because of acquisition. (And not disruptive if you have any amount of queries prior to the expiry).


Review status: 0 of 3 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@andreimatei
Copy link
Contributor

If anybody has complained about not being able to do schema changes after a node was kill-9'ed as Ben claims, than a duration of a minute of so sounds good to me. However, if there's indications of leases not being released when a node was more gently killed, than that's likely a bug.

Long running queries are not supposed to be affected by schema changes except if they're running in SNAPSHOT isolation and they're pushed. However, that's theory. I'm not sure to what extent the code implements the desires of the latest version of that RFC.


Review status: 0 of 3 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/sql/lease.go, line 1220 at r4 (raw file):

Previously, lego (Joey Pereira) wrote…

@andreimatei any idea why ForkCtxSpan would not work? When this line is uncommented and innerCtx is used within the DoChan spawned it seems to always result in context canceled errors.

My best guess is that the current span is a noopSpan so a no new context is made and we just use the short-lived parent context. If this is true, then ForkCtxSpan is incorrect in stating it Returns the new context. Otherwise, I'm at a loss.

Any suggestions for a workaround? context.Background()?

ForkCtxSpan refers to creating a new span, different from the caller's span. It has nothing to do with the parent-child relationship between the two contexts - meaning that the returned ctx is derived from the caller's ctx and hence it inherits the cancellation.
If you don't want that (and you don't), then you have to pass in a context that's not derived from the caller's (like context.Background() or an annotation thereof).

The tricky thing in this case is that you have to integrate with the server's stopper (and also separately draining?) so that the acquisition doesn't happen if... something.
Not sure how to do that, exactly. For the stopper part, you probably have to start either a task or a worker that calls timer.Stop() when the stopper is quiescing. Also check out timer.WithCancel for stopping an acquisition if the acquisition actually started.


pkg/sql/lease.go, line 1224 at r4 (raw file):

				t.mu.preemptiveRenewalTimer = time.AfterFunc(durationUntilRenewal, func() {
					t.mu.group.DoChan(acquireGroupKey, func() (interface{}, error) {
						// defer tracing.FinishSpan(span)

who is supposed to finish span when the DoChan call joins an existing group?


pkg/sql/lease.go, line 1229 at r4 (raw file):

						token, err := t.acquireNodeLease(innerCtx, m, hlc.Timestamp{})
						if err != nil {
							log.Error(innerCtx, errors.Wrapf(err, "lease acquisition failed"))

nit: if you're logging an error, there's no reason to create a new error which will just be thrown away. Just use log.Errorf() to provide the extra message.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 19, 2017

Review status: 0 of 3 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/sql/lease.go, line 1220 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ForkCtxSpan refers to creating a new span, different from the caller's span. It has nothing to do with the parent-child relationship between the two contexts - meaning that the returned ctx is derived from the caller's ctx and hence it inherits the cancellation.
If you don't want that (and you don't), then you have to pass in a context that's not derived from the caller's (like context.Background() or an annotation thereof).

The tricky thing in this case is that you have to integrate with the server's stopper (and also separately draining?) so that the acquisition doesn't happen if... something.
Not sure how to do that, exactly. For the stopper part, you probably have to start either a task or a worker that calls timer.Stop() when the stopper is quiescing. Also check out timer.WithCancel for stopping an acquisition if the acquisition actually started.

acquireNodeLease` already handles draining, so I think it's fine on that account. (Or is more necessary?)

As for stopping the acquisition before the timer starts, we could potentially do one of two things. I think the first option is a better approach.

One option is retrofit RefreshLeases to do it:

case <-s.ShouldQuiesce():
  // Where m is the MeaseManager
  m.Lock()
  for _, t := m.mu.tables {
    if atomic.LoadInt32(&t.mu.renewalTimerSet) == 1 {
      t.mu.preemptiveRenewalTimer.Stop()
    }
  }
  m.Unlock()

This could also just be a separate goroutine we initialize alongside RefreshRoutines (or inside NewLeaseManager... not sure about the preferred style I'd imagine inside NewLeaseManager due to it being a "requirement" to stop correctly). Both strategies only need one goroutine to any amount of tables.

Alternatively, for each renewal we begin we can also start a cleanup routine which will stop the timer:

if  ... {
   // Use a channel to finish the cleanup worker if the timer fires before ShouldQuiesce is true.
   c := make(chan, struct{})
   time.AfterFunc(...,
      // Signal the cleanup worker that it's no longer needed.
      c <- struct{}{} 
      DoChan(..., func() {
         // ...
      })
   })
  // Start the clean up worker
  t.stopper.RunWorker(context.Background(), func(ctx context.Context) {
    select {
      <-t.stopper.ShouldQuiesce():
        t.mu.preemptiveRenewalTimer.Stop()
        return
      <-c:
        return
    }
  })
}

pkg/sql/lease.go, line 1224 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

who is supposed to finish span when the DoChan call joins an existing group?

Good catch. The context creation (and span cleanup) should happen entirely inside the DoChan closure, or in the AfterFunc closure.

I've also realized there is a problem: who sets renewalTimerSet back to 0 if this joins an existing group. To get around this we can instead set that value back after the DoChan results channel returns.

One last thing. Would it be preferable to replace the atomic values with enums? They will need to be wrapped with int32(...) though :/


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight-async branch from b9ec6a8 to ad3d5cb Compare October 20, 2017 21:03
@lgo
Copy link
Contributor Author

lgo commented Oct 20, 2017

Review status: 0 of 3 files reviewed at latest revision, 4 unresolved discussions.


pkg/sql/lease.go, line 1220 at r4 (raw file):

Previously, lego (Joey Pereira) wrote…

acquireNodeLease` already handles draining, so I think it's fine on that account. (Or is more necessary?)

As for stopping the acquisition before the timer starts, we could potentially do one of two things. I think the first option is a better approach.

One option is retrofit RefreshLeases to do it:

case <-s.ShouldQuiesce():
  // Where m is the MeaseManager
  m.Lock()
  for _, t := m.mu.tables {
    if atomic.LoadInt32(&t.mu.renewalTimerSet) == 1 {
      t.mu.preemptiveRenewalTimer.Stop()
    }
  }
  m.Unlock()

This could also just be a separate goroutine we initialize alongside RefreshRoutines (or inside NewLeaseManager... not sure about the preferred style I'd imagine inside NewLeaseManager due to it being a "requirement" to stop correctly). Both strategies only need one goroutine to any amount of tables.

Alternatively, for each renewal we begin we can also start a cleanup routine which will stop the timer:

if  ... {
   // Use a channel to finish the cleanup worker if the timer fires before ShouldQuiesce is true.
   c := make(chan, struct{})
   time.AfterFunc(...,
      // Signal the cleanup worker that it's no longer needed.
      c <- struct{}{} 
      DoChan(..., func() {
         // ...
      })
   })
  // Start the clean up worker
  t.stopper.RunWorker(context.Background(), func(ctx context.Context) {
    select {
      <-t.stopper.ShouldQuiesce():
        t.mu.preemptiveRenewalTimer.Stop()
        return
      <-c:
        return
    }
  })
}

For now, I implemented the first approach I've suggested.


pkg/sql/lease.go, line 1224 at r4 (raw file):

Previously, lego (Joey Pereira) wrote…

Good catch. The context creation (and span cleanup) should happen entirely inside the DoChan closure, or in the AfterFunc closure.

I've also realized there is a problem: who sets renewalTimerSet back to 0 if this joins an existing group. To get around this we can instead set that value back after the DoChan results channel returns.

One last thing. Would it be preferable to replace the atomic values with enums? They will need to be wrapped with int32(...) though :/

Fixed both the span issue and what I mentioned.


pkg/sql/lease.go, line 1229 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: if you're logging an error, there's no reason to create a new error which will just be thrown away. Just use log.Errorf() to provide the extra message.

Done.


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight-async branch from ad3d5cb to 7905053 Compare October 20, 2017 21:04
@andreimatei
Copy link
Contributor

as discussed, maybe we can get rid of the timer


Review status: 0 of 3 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/base/config.go, line 91 at r6 (raw file):

	// DefaultTableDescriptorLeaseRenewalTimeout is the default time
	// before a lease expires when acquisition to renew the lease begins.
	DefaultTableDescriptorLeaseRenewalTimeout = 1 * time.Minute

nit: 1 * is not necessary


pkg/sql/lease.go, line 1219 at r6 (raw file):

			// been set. The new timer is set to fire leaseRenewalTimeout
			// before the lease expires to execute a lease renewal function.
			if t := m.findTableState(tableVersion.ID, true); atomic.CompareAndSwapInt32(&t.mu.renewalTimerSet, 0, 1) {

please comment the true -> it's in the style guide now!


pkg/sql/lease_test.go, line 931 at r5 (raw file):

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_, _, err := leaseManager.AcquireByName(context.TODO(), t.server.Clock().Now(), dbID, tableName)

line over 100 chars. Please set a limit in your editor. Probably above too.


Comments from Reviewable

LeaseManager.AcquireByName is a critical path for statements and
generally when accessing the table descriptor. The case when the lease
is cached serves the majority of operations.
@lgo lgo force-pushed the joey/table-lease-singleflight-async branch from 7905053 to 165da70 Compare October 23, 2017 18:26
@lgo
Copy link
Contributor Author

lgo commented Oct 23, 2017

Done! :)


Review status: 0 of 3 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/base/config.go, line 91 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: 1 * is not necessary

Done.


pkg/sql/lease.go, line 1219 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment the true -> it's in the style guide now!

Done.


pkg/sql/lease_test.go, line 931 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

line over 100 chars. Please set a limit in your editor. Probably above too.

Done.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 23, 2017

Review status: 0 of 3 files reviewed at latest revision, 6 unresolved discussions.


pkg/sql/lease.go, line 1219 at r6 (raw file):

Previously, lego (Joey Pereira) wrote…

Done.

(Also adjusted the bool to false. If a tableVersion is found, we should not need to create a tableState)


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 590 at r8 (raw file):

		// This is atomically updated to prevent multiple routines from
		// entering renewal initialization.
		renewalInProgress int32

if this is an atomic, it should be commented as such. If it can be accessed without holding mu, why is it under mu?


pkg/sql/lease.go, line 1202 at r8 (raw file):

// A transaction using this descriptor must ensure that its
// commit-timestamp < expiration-time. Care must be taken to not modify
// the returned descriptor. A timer may be set to renew a lease in the

there's a couple of comments about "timer" left in the commit. They should all go, right?


pkg/sql/lease.go, line 1216 at r8 (raw file):

			durationUntilExpiration := time.Duration(tableVersion.expiration.WallTime - hlc.UnixNano())
			durationUntilRenewal := durationUntilExpiration - m.LeaseStore.leaseRenewalTimeout
			if t := m.findTableState(tableVersion.ID, false /* create */); durationUntilRenewal < 0 && atomic.CompareAndSwapInt32(&t.mu.renewalInProgress, 0, 1) {

I don't even need to measure that line to know...


pkg/sql/lease.go, line 1216 at r8 (raw file):

			durationUntilExpiration := time.Duration(tableVersion.expiration.WallTime - hlc.UnixNano())
			durationUntilRenewal := durationUntilExpiration - m.LeaseStore.leaseRenewalTimeout
			if t := m.findTableState(tableVersion.ID, false /* create */); durationUntilRenewal < 0 && atomic.CompareAndSwapInt32(&t.mu.renewalInProgress, 0, 1) {

t can be nil and crash the condition, no? Or if it can't be nil for some reason, let's assert that.


pkg/sql/lease.go, line 1221 at r8 (raw file):

					innerCtx = t.stopper.WithCancel(innerCtx)
					resultChan, _ := t.mu.group.DoChan(acquireGroupKey, func() (interface{}, error) {
						defer tracing.FinishSpan(span)

who closes the span if this call is not the "group leader"? Or are we relying on this always being the group leader because of the renewalInProgress check, let's assert that we indeed were the leader using the DoChan bool result. But what if someone is acquiring a lease through another code path?


pkg/sql/lease.go, line 1222 at r8 (raw file):

					resultChan, _ := t.mu.group.DoChan(acquireGroupKey, func() (interface{}, error) {
						defer tracing.FinishSpan(span)
						log.Infof(innerCtx, "lease acquisition beginning")

Infof is way to high. Also below. Also please put more info in these messages - the table name or smth


pkg/sql/lease_test.go, line 931 at r5 (raw file):

Previously, lego (Joey Pereira) wrote…

Done.

Thanks. As a nit for the future - it's great if you can fix things in the right commit. What I do is address comments based on the revision on which they were left (assuming the reviewer reviewed commit by commit). Reviewable shows you the review on which the comment was left in the top-right corner of the comment.


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight-async branch from 165da70 to 4652a1d Compare October 23, 2017 19:44
@lgo
Copy link
Contributor Author

lgo commented Oct 23, 2017

Review status: 0 of 3 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 590 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if this is an atomic, it should be commented as such. If it can be accessed without holding mu, why is it under mu?

Done. (Yep, no reason it needed to be under there as it didn't use mu.)


pkg/sql/lease.go, line 1202 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

there's a couple of comments about "timer" left in the commit. They should all go, right?

Done here and elsewhere.


pkg/sql/lease.go, line 1216 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't even need to measure that line to know...

Yea go-vet was recombining this in what seems to be sane line breaks :/. I guess breaking on the && is the most reasonable allowed line break.


pkg/sql/lease.go, line 1216 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

t can be nil and crash the condition, no? Or if it can't be nil for some reason, let's assert that.

I don't believe t can be nil here right now, but in if we begin deleting tableState (e.g. if a table is deleted) then this check will be required, so I've added it.


pkg/sql/lease.go, line 1221 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

who closes the span if this call is not the "group leader"? Or are we relying on this always being the group leader because of the renewalInProgress check, let's assert that we indeed were the leader using the DoChan bool result. But what if someone is acquiring a lease through another code path?

Oof yea I shuffled the innerCtx again after removing the timer. The span can be initialized inside the DoChan func fine so it's only created and cleaned up if this call runs.


pkg/sql/lease.go, line 1222 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Infof is way to high. Also below. Also please put more info in these messages - the table name or smth

Done.


pkg/sql/lease_test.go, line 931 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Thanks. As a nit for the future - it's great if you can fix things in the right commit. What I do is address comments based on the revision on which they were left (assuming the reviewer reviewed commit by commit). Reviewable shows you the review on which the comment was left in the top-right corner of the comment.

TIL!

(The revisions don't seem to explicitly state a corresponding commit, besides the fact that there are now two "active" revisions. Is that right?)


Comments from Reviewable

@andreimatei
Copy link
Contributor

LGTM but one of the tests seems to have failed on TC


Review status: 0 of 3 files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


pkg/sql/lease.go, line 1222 at r9 (raw file):

				t.stopper.RunWorker(context.Background(), func(ctx context.Context) {
					resultChan, _ := t.mu.group.DoChan(acquireGroupKey, func() (interface{}, error) {
						innerCtx, span := tracing.ForkCtxSpan(ctx, "[async] lease refresh")

nit: overwrite ctx to not have two contexts in scope.


pkg/sql/lease.go, line 1225 at r9 (raw file):

						innerCtx = t.stopper.WithCancel(innerCtx)
						defer tracing.FinishSpan(span)
						if log.V(2) {

VEventf() instead of this construct.

If you want if V(2) for whatever reason, there's a newer ExpensiveLogEnabled or smth that you should use (it's linked from V()).


pkg/sql/lease_test.go, line 931 at r5 (raw file):

Previously, lego (Joey Pereira) wrote…

TIL!

(The revisions don't seem to explicitly state a corresponding commit, besides the fact that there are now two "active" revisions. Is that right?)

each revision is a commit. If you look at the beginning of the review comments, you see the commit message for each revision.
Another pro tip: always rebase before pushing so that the current generation of commits form a suffix of the revisions (i.e. always force all commits to be rewritten by doing a rebase).


pkg/sql/lease_test.go, line 126 at r8 (raw file):

	nodeID uint32, dbID sqlbase.ID, name string,
) (*sqlbase.TableDescriptor, hlc.Timestamp, error) {
	return t.node(nodeID).AcquireByName(context.TODO(), t.server.Clock().Now(), dbID, name)

is this wrapper worth it? I had to look a bunch for it, and I didn't know what the literal 1 in all the calls means.


pkg/sql/lease_test.go, line 953 at r9 (raw file):

// This test makes sure the lease gets renewed automatically in the background
// if the lease is about to expire, without blocking.
func TestLeaseRefreshedAutomatically(testingT *testing.T) {

"refresh" is called "renewal" (async renewal?) elsewhere, isn't it?


pkg/sql/lease_test.go, line 1043 at r9 (raw file):

// This test makes sure that async lease refreshing doesn't block any routines
func TestLeaseRefreshDoesntBlock(testingT *testing.T) {

This is an unusual test. If I imaging correctly (and I have to imagine because the comment you've added doesn't really say anything - what's "any routines"?), it's testing that an acquisition is not blocked by a renewal. Is there a risk of that happening / is the risk worth this complicated test? What kind of programming mistake would make the async renewal sync?
Your choice.


pkg/sql/lease_test.go, line 1043 at r9 (raw file):

// This test makes sure that async lease refreshing doesn't block any routines
func TestLeaseRefreshDoesntBlock(testingT *testing.T) {

"refresh" is called "renewal" in other code, isn't it?


Comments from Reviewable

@lgo lgo changed the title sql: Table lease refresh done asynchronously sql: Add background renewal for table descriptor leases Oct 23, 2017
@lgo lgo force-pushed the joey/table-lease-singleflight-async branch 2 times, most recently from 598b09c to 5c8b722 Compare October 24, 2017 00:33
@lgo
Copy link
Contributor Author

lgo commented Oct 24, 2017

Review status: 0 of 3 files reviewed at latest revision, 15 unresolved discussions.


pkg/sql/lease.go, line 1222 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: overwrite ctx to not have two contexts in scope.

So having ctx blows a linter rule for shadowing the scope, so that was the breaking point for moving this code to a separate function.


pkg/sql/lease.go, line 1225 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

VEventf() instead of this construct.

If you want if V(2) for whatever reason, there's a newer ExpensiveLogEnabled or smth that you should use (it's linked from V()).

Done.


pkg/sql/lease_test.go, line 126 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is this wrapper worth it? I had to look a bunch for it, and I didn't know what the literal 1 in all the calls means.

Yea, that's a fair point. It's clearer with the direct call to node, and I also added a comment to node.


pkg/sql/lease_test.go, line 953 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

"refresh" is called "renewal" (async renewal?) elsewhere, isn't it?

Fixed here and elsewhere.


pkg/sql/lease_test.go, line 1043 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

"refresh" is called "renewal" in other code, isn't it?

Done. (a instances of refresh in the tests were updated)


pkg/sql/lease_test.go, line 1043 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This is an unusual test. If I imaging correctly (and I have to imagine because the comment you've added doesn't really say anything - what's "any routines"?), it's testing that an acquisition is not blocked by a renewal. Is there a risk of that happening / is the risk worth this complicated test? What kind of programming mistake would make the async renewal sync?
Your choice.

Uhh yea it is peculiar. There is no risk of it happening and is an attempted test to make sure it doesn't block. Something that the previous test also asserts.


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight-async branch from 5c8b722 to c6a0511 Compare October 24, 2017 00:35
Copy link
Contributor

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!

@@ -79,12 +79,16 @@ const (
// the jitter fraction. Jittering is done to prevent multiple leases
// from being renewed simultaneously if they were all acquired
// simultaneously.
DefaultTableDescriptorLeaseDuration = 5 * time.Minute
DefaultTableDescriptorLeaseDuration = 2 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong reason to reduce this other than the case where a node dies in the middle of a schema change. That's a pretty rare case, and the tradeoff here is that we renew leases more frequently.

pkg/sql/lease.go Outdated
durationUntilRenewal < 0 && atomic.CompareAndSwapInt32(&t.renewalInProgress, 0, 1) {
// Start the renewal. When it finishes, it will reset t.renewalInProgress.
t.stopper.RunWorker(context.Background(), t.renewalFn(m, tableVersion))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RunWorker() is used to start goroutines that run forever. How about modifying this code as:

if time.Duration(tableVersion.expiration.WallTime - timestamp.WallTime) < m.LeaseStore.leaseRenewalTimeout {
   if t := m.findTableState(tableVersion.ID, false /* create */); t != nil && atomic.CompareAndSwapInt32(&t.renewalInProgress, 0, 1) {
 				// Start the renewal. When it finishes, it will reset t.renewalInProgress.
 				if err := t.stopper.RunTaskAsync(context.Background(), t.renewalFn(m, tableVersion)); err != {
return &tableVersion.TableDescriptor, tableVersion.expiration, err
}
 	}
}

pkg/sql/lease.go Outdated
resultChan, _ := t.mu.group.DoChan(acquireGroupKey, func() (interface{}, error) {
var span opentracing.Span
ctx, span = tracing.ForkCtxSpan(ctx, "[async] lease renewal")
ctx = t.stopper.WithCancel(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't need all this opentracing stuff

@andreimatei
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 1222 at r9 (raw file):

Previously, lego (Joey Pereira) wrote…

So having ctx blows a linter rule for shadowing the scope, so that was the breaking point for moving this code to a separate function.

I see. As a matter of fact I wasn't thinking of shadowing ctx; I was thinking of overwriting it.

var span Span
ctx, span = tracing.Fork()

But I guess that wouldn't be so nice, since ctx is captured from the outer scope and it might be weird to change it (even though ctx is not currently used elsewhere). The fact that you're forced to capture is the real problem here; I think the DoChan() interface should take a ctx and pass it to the closure (like the stopper interface). @nvanbenschoten, right? I think we should do that now before the interface gets more used.

Short of that, moving the closure to a different method sounds good to me, but the function that returns a function is weird and I don't see why it's necessary.

t.stopper.RunWorker(context.Background(), func(ctx) {
  t.renewalFn(ctx, m, tableVersion))
})

pkg/sql/lease.go, line 989 at r10 (raw file):

		resultChan, _ := t.mu.group.DoChan(acquireGroupKey, func() (interface{}, error) {
			var span opentracing.Span
			ctx, span = tracing.ForkCtxSpan(ctx, "[async] lease renewal")

I think ForkCtxSpan is not gonna do what you want - it's not gonna create a span since there's no spans in contexts passed by RunWorker and RunAsyncTask. If you want to have a span here (which I think is a good thing, I'm not sure why Vivek doesn't like it), you should manually create a root span (see EnsureCtx and other functions around).


pkg/sql/lease.go, line 990 at r10 (raw file):

			var span opentracing.Span
			ctx, span = tracing.ForkCtxSpan(ctx, "[async] lease renewal")
			ctx = t.stopper.WithCancel(ctx)

since you've told me that acquireNodeLease will short-circuit if the server is draining, I'm not sure the stopper.WithCancel() is necessary / gives us anything any more.


Comments from Reviewable

@vivekmenezes
Copy link
Contributor

vivekmenezes commented Oct 24, 2017 via email

@lgo lgo force-pushed the joey/table-lease-singleflight-async branch from c6a0511 to e0bfa70 Compare October 24, 2017 15:05
@lgo
Copy link
Contributor Author

lgo commented Oct 24, 2017

Review status: 0 of 3 files reviewed at latest revision, 15 unresolved discussions, some commit checks pending.


pkg/base/config.go, line 82 at r10 (raw file):

Previously, vivekmenezes wrote…

No strong reason to reduce this other than the case where a node dies in the middle of a schema change. That's a pretty rare case, and the tradeoff here is that we renew leases more frequently.

Okay, I've changed it back to 5 minutes for now. It's easy to change in the future!


pkg/sql/lease.go, line 1222 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I see. As a matter of fact I wasn't thinking of shadowing ctx; I was thinking of overwriting it.

var span Span
ctx, span = tracing.Fork()

But I guess that wouldn't be so nice, since ctx is captured from the outer scope and it might be weird to change it (even though ctx is not currently used elsewhere). The fact that you're forced to capture is the real problem here; I think the DoChan() interface should take a ctx and pass it to the closure (like the stopper interface). @nvanbenschoten, right? I think we should do that now before the interface gets more used.

Short of that, moving the closure to a different method sounds good to me, but the function that returns a function is weird and I don't see why it's necessary.

t.stopper.RunWorker(context.Background(), func(ctx) {
  t.renewalFn(ctx, m, tableVersion))
})

Ah yea, I had done a very roundabout away to pass in the context variables. Thanks!


pkg/sql/lease.go, line 989 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think ForkCtxSpan is not gonna do what you want - it's not gonna create a span since there's no spans in contexts passed by RunWorker and RunAsyncTask. If you want to have a span here (which I think is a good thing, I'm not sure why Vivek doesn't like it), you should manually create a root span (see EnsureCtx and other functions around).

👍. What I added uses tracing.NewTracer(). Should this instead use the server's AmbientCtx to get the tracer? (I'm not too certain what the tracer will do, in particular, if we aren't crossing boundaries for this span.)


pkg/sql/lease.go, line 990 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

since you've told me that acquireNodeLease will short-circuit if the server is draining, I'm not sure the stopper.WithCancel() is necessary / gives us anything any more.

If both the stopper only stops when the server is stopping and draining happens when the server is gracefully closing, then I agree there's no purpose.

(removed it).


pkg/sql/lease.go, line 1255 at r10 (raw file):

Previously, vivekmenezes wrote…

RunWorker() is used to start goroutines that run forever. How about modifying this code as:

if time.Duration(tableVersion.expiration.WallTime - timestamp.WallTime) < m.LeaseStore.leaseRenewalTimeout {
   if t := m.findTableState(tableVersion.ID, false /* create */); t != nil && atomic.CompareAndSwapInt32(&t.renewalInProgress, 0, 1) {
 				// Start the renewal. When it finishes, it will reset t.renewalInProgress.
 				if err := t.stopper.RunTaskAsync(context.Background(), t.renewalFn(m, tableVersion)); err != {
return &tableVersion.TableDescriptor, tableVersion.expiration, err
}
 	}
}

Done.


Comments from Reviewable

pkg/sql/lease.go Outdated
if time.Duration(tableVersion.expiration.WallTime-timestamp.WallTime) < m.LeaseStore.leaseRenewalTimeout {
if t := m.findTableState(tableVersion.ID, false /* create */); t != nil && atomic.CompareAndSwapInt32(&t.renewalInProgress, 0, 1) {
// Start the renewal. When it finishes, it will reset t.renewalInProgress.
if err := t.stopper.RunAsyncTask(context.Background(), "lease renewal", func(ctx context.Context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you can pass in the current context here

@lgo
Copy link
Contributor Author

lgo commented Oct 24, 2017

Review status: 0 of 3 files reviewed at latest revision, 16 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 1247 at r11 (raw file):

Previously, vivekmenezes wrote…

I believe you can pass in the current context here

Unfortunately not. The task routine may not begin until the session is closed and by then the existing context is canceled. (Which occurs every time in practice.)


Comments from Reviewable

@vivekmenezes
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 16 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 1247 at r11 (raw file):

Previously, lego (Joey Pereira) wrote…

Unfortunately not. The task routine may not begin until the session is closed and by then the existing context is canceled. (Which occurs every time in practice.)

Looking at the implementation of RunAsyncTask() it looks like the context is replaced with a new context before running the goroutine

ctx, span := tracing.ForkCtxSpan(ctx, taskName)


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 989 at r10 (raw file):

Previously, lego (Joey Pereira) wrote…

👍. What I added uses tracing.NewTracer(). Should this instead use the server's AmbientCtx to get the tracer? (I'm not too certain what the tracer will do, in particular, if we aren't crossing boundaries for this span.)

We Don't Like It When People Call NewTracer ™. If you create a new tracer, you gotta call Configure() on it; and in any case the tracer is supposed to be a singleton.
Indeed, you should get a tracer from an AmbientContext. Either use the server's AmbientContext or give a new one to the LeaseManager or to whomever.


pkg/sql/lease.go, line 1244 at r11 (raw file):

			// been set.

			if time.Duration(tableVersion.expiration.WallTime-timestamp.WallTime) < m.LeaseStore.leaseRenewalTimeout {

looooong. style guide bro.


pkg/sql/lease.go, line 1245 at r11 (raw file):

			if time.Duration(tableVersion.expiration.WallTime-timestamp.WallTime) < m.LeaseStore.leaseRenewalTimeout {
				if t := m.findTableState(tableVersion.ID, false /* create */); t != nil && atomic.CompareAndSwapInt32(&t.renewalInProgress, 0, 1) {

looong


pkg/sql/lease.go, line 1247 at r11 (raw file):

Previously, vivekmenezes wrote…

Looking at the implementation of RunAsyncTask() it looks like the context is replaced with a new context before running the goroutine

ctx, span := tracing.ForkCtxSpan(ctx, taskName)

True, but the ctx passed to the closure is still a child of the passed-in context (so it inherits the cancellation).
(I don't know if that behavior makes sense, but it is the current code).


Comments from Reviewable

This commit changes TableDescriptor lease acquisition to acquire a new
lease if the current lease is about to expire. This prevents routines
from blocking when there are no valid leases on a frequently accessed
table.

If renewal continously fails, the error will become user-facing if no
lease is acquired yet by the time a lease is expired.

Closes cockroachdb#17227.
@lgo lgo force-pushed the joey/table-lease-singleflight-async branch from e0bfa70 to d7b7942 Compare October 24, 2017 19:19
@lgo
Copy link
Contributor Author

lgo commented Oct 24, 2017

Going to merge later today if there are no more comments. Thanks a ton folks! 🎉


Review status: 0 of 5 files reviewed at latest revision, 9 unresolved discussions.


pkg/sql/lease.go, line 989 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

We Don't Like It When People Call NewTracer ™. If you create a new tracer, you gotta call Configure() on it; and in any case the tracer is supposed to be a singleton.
Indeed, you should get a tracer from an AmbientContext. Either use the server's AmbientContext or give a new one to the LeaseManager or to whomever.

👍 sgtm. Done. (pulled in the context into lease manager).


pkg/sql/lease.go, line 1244 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

looooong. style guide bro.

Erf... dealt with :'(


pkg/sql/lease.go, line 1245 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

looong

Done.


Comments from Reviewable

@andreimatei
Copy link
Contributor

:lgtm:


Review status: 0 of 5 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 24, 2017

Review status: 0 of 5 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 1244 at r11 (raw file):

Previously, lego (Joey Pereira) wrote…

Erf... dealt with :'(

TBH distributing a basic set of git hooks in the style guide would maybe be a good idea. e.g. I've since made a make lint pre-push and now a pre-commit line length check.


Comments from Reviewable

@jordanlewis
Copy link
Member

Review status: 0 of 5 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 1244 at r11 (raw file):

Previously, lego (Joey Pereira) wrote…

TBH distributing a basic set of git hooks in the style guide would maybe be a good idea. e.g. I've since added a make lint pre-push and now a pre-commit line length check.

@LEGO it'd probably be useful if you shared those hooks - git hooks are hard to write!

Also, we should consider adding long line fixes to crlfmt which exists just for this purpose.


Comments from Reviewable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants