Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Refactor TableDescriptor lease acquisition to use singleflight #18989

Merged
merged 2 commits into from
Oct 18, 2017

Conversation

lgo
Copy link
Contributor

@lgo lgo commented Oct 3, 2017

Previously, the global LeaseDuration value was modified during a test.
If this test is run in parallel, this would cause problems. The global
has now been modified to live on the LeaseStore where it can be modified
in tests safely for the life of the test.

This changes the acquisition of leases to use singleflight to reduce the
complexity of handling concurrent acquisitions. The behaviour is kept
the same as we block on the results of singleflight.DoChan. The change
is in preparation for adding a new codepath which acquires leases
asynchronously.

This also refactors out the jitter multiplier into a constant.

cc: @vivekmenezes @andreimatei

@lgo lgo requested a review from a team October 3, 2017 17:35
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight branch 2 times, most recently from be8a6d4 to 54a3bbc Compare October 3, 2017 18:53
@lgo
Copy link
Contributor Author

lgo commented Oct 3, 2017

@andreimatei PTAL. Took out the acquiring channel and acquireFreshestFromStoreLocked now uses the knowledge of function being run or not to busy-wait. No more acquireWait 🎉

NB: I didn't use the shared return value because singleflight.Do will return true to shared even if the function provided is called (c.dups > 0). Interestingly the behavior of singleflight.DoChan differs where it will always return false if the callers function is run.


Review status: 0 of 2 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

Copy link
Contributor

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice change!

pkg/sql/lease.go Outdated
// range of the actual lease duration will be
// [(1-LeaseJitterFraction) * LeaseDuration, (1+LeaseJitterFraction) * LeaseDuration]
// Exported for testing purposes only.
LeaseJitterFraction = 0.25
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add this in the same PR that uses it. It makes sense to not make this change here.

pkg/sql/lease.go Outdated
table, err := m.LeaseStore.acquire(ctx, t.id, minExpirationTime)
t.mu.Lock()
defer t.mu.Unlock()
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd lock and defer unlock() after the err is checked.

pkg/sql/lease.go Outdated
// underway.
var err error
hasAcquiredFreshest := false
for !hasAcquiredFreshest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for hasAcquiredFreshest := false; !hasAcquiredFreshest; {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not. See my comment below. Just break out of this loop once shared == false

pkg/sql/lease.go Outdated
minExpirationTime = newestTable.expiration.Add(int64(time.Millisecond), 0)
t.mu.Unlock()
_, _, err = t.mu.group.Do("acquire", func() (interface{}, error) {
hasAcquiredFreshest = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do without this hasAcquiredFreshest boolean. Let's just used the shared boolean returned from Do(). It's not quite the same semantics but when set to false we are guaranteed we were the only caller of the method which means that the lease is the freshest.

@lgo lgo force-pushed the joey/table-lease-singleflight branch from 54a3bbc to f59dacd Compare October 3, 2017 21:12
@lgo
Copy link
Contributor Author

lgo commented Oct 3, 2017

Review status: 0 of 2 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 59 at r2 (raw file):

Previously, vivekmenezes wrote…

I'd add this in the same PR that uses it. It makes sense to not make this change here.

Done.


pkg/sql/lease.go, line 714 at r2 (raw file):

Previously, vivekmenezes wrote…

maybe not. See my comment below. Just break out of this loop once shared == false

Done.


pkg/sql/lease.go, line 727 at r2 (raw file):

Previously, vivekmenezes wrote…

I think we can do without this hasAcquiredFreshest boolean. Let's just used the shared boolean returned from Do(). It's not quite the same semantics but when set to false we are guaranteed we were the only caller of the method which means that the lease is the freshest.

Done.


pkg/sql/lease.go, line 797 at r2 (raw file):

Previously, vivekmenezes wrote…

I'd lock and defer unlock() after the err is checked.

Done.


Comments from Reviewable

pkg/sql/lease.go Outdated
t.acquireWait()

// Move forward to acquire a fresh table lease.
// Continue to attempt to call to acquireNodeLease until our call was the one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only one made by singleflight

pkg/sql/lease.go Outdated
if err := t.acquireNodeLease(ctx, m, hlc.Timestamp{}); err != nil {
// Acquire a lease if no lease exists or if the latest lease is about to
// expire. Repeatedly check to ensure either a lease is acquired or we receive
// an error. Checking we havea valid lease after acquisition is is required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/havea/have a/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/is is/is/

pkg/sql/lease.go Outdated

// Move forward to acquire a fresh table lease.
// Continue to attempt to call to acquireNodeLease until our call was the one
// made by singleflight. This busy-waits while another acquisitions are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/another/other/

pkg/sql/lease.go Outdated

// Set the min expiration time to guarantee that the lease acquired is the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TODO(vivek): the expiration time is no longer needed to sort the tableVersionState. Get rid of this.

pkg/sql/lease.go Outdated
t.mu.Lock()
if !shared {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err != nil || !shared {
  return err
}

@vivekmenezes
Copy link
Contributor

Looks like you need to fix TestAcquireFreshestFromStoreRaces

@lgo
Copy link
Contributor Author

lgo commented Oct 4, 2017

Aha, it appears we can't just use shared and need to know if it was our function call that succeeded or not. I was afraid of this :P

What ends up happening is an almost guaranteed livelock if we get two simultaneous calls to acquireFreshestFromStoreLocked.

[1] is routine 1, [2] is routine 2.

[1] acquireFreshestFromStoreLocked calls Do
[2] acquireFreshestFromStoreLocked calls Do
[1] finishes Do, with shared = true. It will retry the Do call.
[2] finishes Do, with shared = true. It will also retry the Do call
    likely before [1] finishes as the lock is released during the lengthy part of the process.
... and so on, repeating forever.

The only thing that helps here is knowing whether it was our function that was called.

@lgo lgo force-pushed the joey/table-lease-singleflight branch 2 times, most recently from db66684 to 2cb59c2 Compare October 4, 2017 01:49
pkg/sql/lease.go Outdated
hasAcquiredFreshest = true
return nil, t.acquireNodeLease(ctx, m, minExpirationTime)
})
t.mu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before t.mu.Lock() is acquired t.mu.active.findNewest() can become nil again so we need to call
acquireFreshestFromStoreLocked() in a loop in acquireFreshestFromStore()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

(Hmm, this is a common problem with all uses of acquireNodeLease now that it's async and we're handing off the lock. I don't know if there's a better approach to this, but I don't know of one off the top of my head.)

Copy link
Contributor Author

@lgo lgo Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, I've added two tests which catch this problem for both acquireFreshestFromStore and AcquireByName/acquire so I'm a lot more confident in that. The race happened ~50% of the time during testrace locally but is now fixed.

Aside: How does teamcity compare to tests locally, when it comes to encountering race conditions? Is it more frequent because of some property of the execution speed?

pkg/sql/lease.go Outdated
if err := t.acquireNodeLease(ctx, m, hlc.Timestamp{}); err != nil {
// Acquire a lease if no lease exists or if the latest lease is about to
// expire. Repeatedly check to ensure either a lease is acquired or we receive
// an error. Checking we have a valid lease after acquisition is is required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/is is /is/

@lgo lgo force-pushed the joey/table-lease-singleflight branch 2 times, most recently from ce5a44e to 9a7473e Compare October 4, 2017 15:37
@andreimatei
Copy link
Contributor

The only thing that helps here is knowing whether it was our function that was called.

Isn't this what the bool return value of DoChan is for?

I don't know how to parse your last message. Has the bug you're talking about been fixed and so this is ready for a review (or do you want a review even if it hasn't been fixed)?


Review status: 0 of 3 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 4, 2017

Do is different than DoChan. Do's bool is shared, denoting if the function shared it's result (even if it was the function in question that was executed). DoChan's bool is whether or not the function in question was executed, regardless of it was shared.

The bug is fixed only by explicitly setting a bool, so we can't use the returned shared value (as previously suggested)


Review status: 0 of 3 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 577 at r3 (raw file):

Previously, vivekmenezes wrote…

s/is is/is/

Done.


pkg/sql/lease.go, line 701 at r3 (raw file):

Previously, vivekmenezes wrote…

only one made by singleflight

Done.


pkg/sql/lease.go, line 702 at r3 (raw file):

Previously, vivekmenezes wrote…

s/another/other/

Done.


pkg/sql/lease.go, line 577 at r4 (raw file):

Previously, vivekmenezes wrote…

s/is is /is/

Done.


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight branch 2 times, most recently from 3a2bf08 to e3b0875 Compare October 4, 2017 16:31
@lgo
Copy link
Contributor Author

lgo commented Oct 4, 2017

Following up from an in-person convo, I changed the Do to DoChan to remove the closure variable as we can rely on the bool returned from that one.


Review status: 0 of 3 files reviewed at latest revision, 10 unresolved discussions.


pkg/sql/lease.go, line 722 at r3 (raw file):

Previously, vivekmenezes wrote…
if err != nil || !shared {
  return err
}

Done.


Comments from Reviewable

@andreimatei
Copy link
Contributor

A few comments; I'll check further when they're addressed.


Review status: 0 of 3 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 701 at r3 (raw file):

Previously, lego (Joey Pereira) wrote…

Done.

I don't know what "our call was the only one made by singleflight" means. What you want to say is that we wait until we didn't join an in-progress call.
But actually that's too pessimistic; we had discussed this but I haven't made myself understood: this code never needs to loop more than twice. The 2nd time Do() returns, regardless of whether we were the "leader" for that request or not, the result is good enough -> the Do() call that we joined must have been initiated after this method was called.

I suggest this comment:

//  We need to acquire a lease on a "fresh" descriptor, meaning that joining a potential
// in-progress lease acquisition is generally not good enough. If we are to join an
// in-progress acquisition, it needs to be an acquisition initiated after this point.
// So, we handle two cases:
// 1. The first DoChan() call tells us that we didn't join an in-progress acquisition.
//     Great, the lease that's being acquired is good.
// 2. The first DoChan() call tells us that we did join an in-progress acq. We have to 
//     wait this acquisition out; it's not good for us. But any future acquisition is good,
//     so the next time around the loop it doesn't matter if we initiate a request or join 
//     an in-progress one. 

pkg/sql/lease.go, line 548 at r6 (raw file):

	mu struct {
		syncutil.Mutex
		group singleflight.Group

needs a comment


pkg/sql/lease.go, line 577 at r6 (raw file):

Checking we have a valid lease after acquisition is required

s//Looping is necessary because lease acquisition is done without holding the tableState lock, so anything can happen in between lease acquisition and us getting control again.


pkg/sql/lease.go, line 725 at r6 (raw file):

		hasAcquiredFreshest := false
		_, _, err := t.mu.group.Do("acquire", func() (interface{}, error) {
			hasAcquiredFreshest = true

as discussed offline, get rid of this capture and use DoChan.


Comments from Reviewable

pkg/sql/lease.go Outdated
// Continue to attempt to call to acquireNodeLease until our call was the only
// one made by singleflight. This busy-waits while other acquisitions are
// underway.
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for wasCalled, s := false, t.mu.active.findNewest(); wasCalled && s != nil; s = t.mu.active.findNewest() {

and
s/newestTable/s/

@vivekmenezes
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 14 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 701 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't know what "our call was the only one made by singleflight" means. What you want to say is that we wait until we didn't join an in-progress call.
But actually that's too pessimistic; we had discussed this but I haven't made myself understood: this code never needs to loop more than twice. The 2nd time Do() returns, regardless of whether we were the "leader" for that request or not, the result is good enough -> the Do() call that we joined must have been initiated after this method was called.

I suggest this comment:

//  We need to acquire a lease on a "fresh" descriptor, meaning that joining a potential
// in-progress lease acquisition is generally not good enough. If we are to join an
// in-progress acquisition, it needs to be an acquisition initiated after this point.
// So, we handle two cases:
// 1. The first DoChan() call tells us that we didn't join an in-progress acquisition.
//     Great, the lease that's being acquired is good.
// 2. The first DoChan() call tells us that we did join an in-progress acq. We have to 
//     wait this acquisition out; it's not good for us. But any future acquisition is good,
//     so the next time around the loop it doesn't matter if we initiate a request or join 
//     an in-progress one. 

The expectation is that t.mu.active.findNewest() is non nil after the call so we potentially need to loop until that's true or we hit an error


Comments from Reviewable

@vivekmenezes
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 14 unresolved discussions, some commit checks pending.


pkg/sql/lease.go, line 728 at r4 (raw file):

Previously, lego (Joey Pereira) wrote…

Regardless, I've added two tests which catch this problem for both acquireFreshestFromStore and AcquireByName/acquire so I'm a lot more confident in that. The race happened ~50% of the time during testrace locally but is now fixed.

Aside: How does teamcity compare to tests locally, when it comes to encountering race conditions? Is it more frequent because of some property of the execution speed?

teamcity tends to be slower than your own machine


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 701 at r3 (raw file):

Previously, vivekmenezes wrote…

The expectation is that t.mu.active.findNewest() is non nil after the call so we potentially need to loop until that's true or we hit an error

OK. But actually it seems to me t.mu.active.findNewest() != nil is not a sufficient condition, so looping on that is not good enough.
We also need t.mu.active.findNewest() to be a specific lease (the one acquired after this call). If a new lease is acquired and then released and t.mu.active.findNewest() is not nil because it has some older leases, that's no good.

So, I think we should change the Results of theDo() calls to actually return the lease that was acquired (or, better yet, some sort of opaque token representing it that can only be used for one purpose), and then validate that t.mu.active contains this particular lease.

The main point of my comment remains: what we care is not specifically that we were the "leader" of a acquisition, but that the acquisition was started after a particular time.

And since we're shaking things up, let me suggest something I see as related: I believe we're currently inserting leases in tableState with refCount == 0. And then everybody who was waiting on that lease is in charge of incrementing the refcount.
This is unfortunate; a lease with refcount 0 can, in principle, be released at any time. The reason why the code is such is, I guess, because until now it was hard to know what the initialize the refcount to. But now, since through the wonders of DoChan() we can discriminate between a "leader" and other plebs, I'd suggest we move code to a new protocol: we always insert leases with refCount = 1 reflecting the leader's reference. The leader would not increment the refCount again, but everybody else would.
This has the nice property that, if you're the leader, you no longer have to check that t.active is not empty - you know that it can't be empty because you've continuously held a counted reference to the lease you acquired.
Does that make sense? Interested?


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 4, 2017

Review status: 0 of 3 files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 701 at r3 (raw file):

So, I think we should change the Results of theDo() calls to actually return the lease that was acquired (or, better yet, some sort of opaque token representing it that can only be used for one purpose), and then validate that t.mu.active contains this particular lease.

I don't think we can get away with a token since we will inevitably need to binary search tableSet to find the lease. On the other hand, if we acquire a lease, we can safely return it from acquireNodeLease without having to do extra work (aside from passing it around, that may be bad if we're piping it into all channels for callers of Do/DoChan). I might be misunderstanding what you imagine as a token?

If we return the least, we should be able to use s.leased to determine if the lease is in t.mu.active, as it's always set to false before removing it from active. (one piece of code violates the ordering can easily be adjusted)

This has the nice property that, if you're the leader, you no longer have to check that t.active is not empty - you know that it can't be empty because you've continuously held a counted reference to the lease you acquired.

So, even after this change we still have to loop around if we aren't the leader to check if the lease is in t.mu.active, which we may be able to do so using s.leased. We also end up conditionally incrementing the refcount if we aren't the leader.

That same logic you mentioned applies to acquire, but also if the lease we were the leader of is different than we receive from findForTimestamp we don't increment the refcount.

Sidenote: I don't thinkfindForTimestamp should be checking the LeaseStore if we can't find a valid lease in t.mu.active. If this happens, we would have looped around to do another acquisition to ensure there's a valid lease, so this is unnecessarily complex. I don't even think we need findForTimestamp at all.

Because methods are now conditionally incrementing the refcount, whether or not they were the leader I'm not sure how I feel about starting the refcount = 0.

More so, if we add async refreshing we do want to insert the lease with refcount == 0. This is easy by passing a parameter hasLeader bool to acquireNodeLease to conditionally start the refcount at 0, but it's yet another of complexity, and we no longer abide by the suggested refcount == 1 protocol.


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 701 at r3 (raw file):

I don't think we can get away with a token since we will inevitably need to binary search tableSet to find the lease. On the other hand, if we acquire a lease, we can safely return it from acquireNodeLease without having to do extra work (aside from passing it around, that may be bad if we're piping it into all channels for callers of Do/DoChan). I might be misunderstanding what you imagine as a token?

What I had in mind is the desire for a restricted interface: ideally, an API should make it impossible to break its assumptions. In this case, the API contract would be that, once you've received a result from Do(), you have to go to tableState.active and see if the lease is indeed there. If you return the lease directly, a client may get its hand on the lease and run with it, forgetting to check anything.
So, returning a token (which can be, for example, a *tableVersionState but with another name) would achieve this purpose.

So, even after this change we still have to loop around if we aren't the leader to check if the lease is in t.mu.active, which we may be able to do so using s.leased. We also end up conditionally incrementing the refcount if we aren't the leader.

Correct.
I don't think I'd encourage you to use s.Leased to track about whether something was released or not. I have mixed thoughts on what we should do when we want to release a lease. Technically, we don't have to remove anything from the tableState.active, although doing so (as we currently do) may be convenient. If we didn't, then s.Leased == false would not tell you, by itself, that whatever descriptor was in the lease cannot be used at the timestamp requested.

That same logic you mentioned applies to acquire, but also if the lease we were the leader of is different than we receive from findForTimestamp we don't increment the refcount.

Sure. I'm hoping we use the same protocol across acquire/acquireFreshest...

Sidenote: I don't thinkfindForTimestamp should be checking the LeaseStore if we can't find a valid lease in t.mu.active. If this happens, we would have looped around to do another acquisition to ensure there's a valid lease, so this is unnecessarily complex. I don't even think we need findForTimestamp at all.

I don't fully follow this, but note that findForTS() doesn't only deals with leases. It reads historical versions of the descriptor and puts them in the tableState with leased=false.

Because methods are now conditionally incrementing the refcount, whether or not they were the leader I'm not sure how I feel about starting the refcount = 0.

I guess I don't have a strong opinion on this refcount increment.


pkg/sql/lease.go, line 717 at r7 (raw file):

		}

		t.mu.Unlock()

there's a race here. You want to release the lock after you've gotten the channel from DoChan() (but before reading from the channel).
Otherwise, you may initiate more "flights"" than you want.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 5, 2017

PTAL. They've been addressed.


Review status: 0 of 3 files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 701 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think we can get away with a token since we will inevitably need to binary search tableSet to find the lease. On the other hand, if we acquire a lease, we can safely return it from acquireNodeLease without having to do extra work (aside from passing it around, that may be bad if we're piping it into all channels for callers of Do/DoChan). I might be misunderstanding what you imagine as a token?

What I had in mind is the desire for a restricted interface: ideally, an API should make it impossible to break its assumptions. In this case, the API contract would be that, once you've received a result from Do(), you have to go to tableState.active and see if the lease is indeed there. If you return the lease directly, a client may get its hand on the lease and run with it, forgetting to check anything.
So, returning a token (which can be, for example, a *tableVersionState but with another name) would achieve this purpose.

So, even after this change we still have to loop around if we aren't the leader to check if the lease is in t.mu.active, which we may be able to do so using s.leased. We also end up conditionally incrementing the refcount if we aren't the leader.

Correct.
I don't think I'd encourage you to use s.Leased to track about whether something was released or not. I have mixed thoughts on what we should do when we want to release a lease. Technically, we don't have to remove anything from the tableState.active, although doing so (as we currently do) may be convenient. If we didn't, then s.Leased == false would not tell you, by itself, that whatever descriptor was in the lease cannot be used at the timestamp requested.

That same logic you mentioned applies to acquire, but also if the lease we were the leader of is different than we receive from findForTimestamp we don't increment the refcount.

Sure. I'm hoping we use the same protocol across acquire/acquireFreshest...

Sidenote: I don't thinkfindForTimestamp should be checking the LeaseStore if we can't find a valid lease in t.mu.active. If this happens, we would have looped around to do another acquisition to ensure there's a valid lease, so this is unnecessarily complex. I don't even think we need findForTimestamp at all.

I don't fully follow this, but note that findForTS() doesn't only deals with leases. It reads historical versions of the descriptor and puts them in the tableState with leased=false.

Because methods are now conditionally incrementing the refcount, whether or not they were the leader I'm not sure how I feel about starting the refcount = 0.

I guess I don't have a strong opinion on this refcount increment.

The patch now covers the two cases you've mentioned (join an in-flight call, and when we don't). The condition also uses a token passed through to ensure the lease we expect is in active.

I refrained from making the changes to how we handle refcount because I don't yet know if this can be simplified by making that change, and more so once leases are refreshed but not used immediately.


pkg/sql/lease.go, line 577 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Checking we have a valid lease after acquisition is required

s//Looping is necessary because lease acquisition is done without holding the tableState lock, so anything can happen in between lease acquisition and us getting control again.

Done.


pkg/sql/lease.go, line 725 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

as discussed offline, get rid of this capture and use DoChan.

Done.


pkg/sql/lease.go, line 704 at r7 (raw file):

Previously, vivekmenezes wrote…

for wasCalled, s := false, t.mu.active.findNewest(); wasCalled && s != nil; s = t.mu.active.findNewest() {

and
s/newestTable/s/

Done.


pkg/sql/lease.go, line 717 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

there's a race here. You want to release the lock after you've gotten the channel from DoChan() (but before reading from the channel).
Otherwise, you may initiate more "flights"" than you want.

Done.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 10, 2017

Addressed the comments.

I'm still confused why this is failing on teamcity, and I'm almost positive it's a problem with the test. (https://teamcity.cockroachdb.com/viewLog.html?buildId=374164&buildTypeId=Cockroach_UnitTests_Test&tab=buildLog search sql/lease to find the routines related to the test).

The test is blocking on pkg/sql/lease_internal_test.go:706 preblock.Wait() which should continue after the two routines have reached pkg/sql/lease_internal_test.go:677. The logs show one routine is waiting there but the other routine isn't in the stack trace. I'm going to add logging into the tests (and I'll have to push) since this is not happening at all locally, which is really weird.


Review status: 0 of 3 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


pkg/sql/lease_internal_test.go, line 595 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what's the flake about?

I've added more details (it's now below, just before TestSetLeaseJitterFraction). In short, if a second lease's expiration is randomly jittered to the same expiration as the first, then we have no way to tell if they are equal or not, and this would fail the test.

The same problem "can" happen in another test (TestLeaseManagerReacquire) but the lease duration is set to 5 nanoseconds, so if just 2 nanoseconds pass between acquisitions this problem doesn't happen.


pkg/sql/lease_internal_test.go, line 602 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

No bueno. You can't change globals like this. It'll race with any test running in parallel. Every time someone starts doing stuff like this, there unavoidably comes a later point where it's a pain to cleanup.
If you need to change it, make it a member of LeaseManager (or whomever) that gets initialized from the global and make a test-only method on the LeaseManager to change it.

And take the opportunity to make the global const.

Done :). This may have been what was causing the test failures earlier.


pkg/sql/lease_internal_test.go, line 605 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

line is over 100

Done.


pkg/sql/lease_internal_test.go, line 640 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you mean "thread 1" (also below)? Or is the test comment wrong?

Done.


pkg/sql/lease_internal_test.go, line 645 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Why make this configurable if you're always setting it to the same thing?

Done.


pkg/sql/lease_internal_test.go, line 769 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Are these closes necessary?
If not, get rid of them.

Done.


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight branch from 8ac91b0 to 5c24f5c Compare October 10, 2017 18:12
@andreimatei
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


pkg/sql/lease_internal_test.go, line 595 at r11 (raw file):

Previously, lego (Joey Pereira) wrote…

I've added more details (it's now below, just before TestSetLeaseJitterFraction). In short, if a second lease's expiration is randomly jittered to the same expiration as the first, then we have no way to tell if they are equal or not, and this would fail the test.

The same problem "can" happen in another test (TestLeaseManagerReacquire) but the lease duration is set to 5 nanoseconds, so if just 2 nanoseconds pass between acquisitions this problem doesn't happen.

Is checking the expiration really the only way to test that leases are "equal"? Can't we check the descriptor pointer?


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight branch 6 times, most recently from 559c14b to 9484291 Compare October 11, 2017 20:36
@lgo
Copy link
Contributor Author

lgo commented Oct 11, 2017

Huzza, tests finally work \o/.

Turns out there were a few issues with the tests.

I was underestimating concurrency here and didn't realize that during the test if the acquireNodeLease routine finished before thread 2 began acquisition, thread 2 would short circuit and return that lease immediately. That ended up breaking assumptions for the order of events, so another checkpoint was added by having a channel block in the async routine (via LeaseAcquiredEvent.

It's a bit unfortunate and that's made this test a little more brittle. Better than before but I'm a little less happy about the test.

The other issue was failing tests on master. Right now #19180 is causing a failure.


Review status: 0 of 3 files reviewed at latest revision, 13 unresolved discussions.


pkg/sql/lease_internal_test.go, line 595 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Is checking the expiration really the only way to test that leases are "equal"? Can't we check the descriptor pointer?

Ah, thanks! I thought we had a value.


Comments from Reviewable

@lgo
Copy link
Contributor Author

lgo commented Oct 13, 2017

@andreimatei bump -- or were you 👍 earlier? just want to double check on that.

Thanks again for the help on this one :)

@lgo lgo force-pushed the joey/table-lease-singleflight branch from 9484291 to f620218 Compare October 16, 2017 14:29
@andreimatei
Copy link
Contributor

Review status: 0 of 3 files reviewed at latest revision, 33 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 50 at r14 (raw file):

	// LeaseDuration is the mean duration a lease will be acquired for. The
	// actual duration is jittered in the range
	// [0.75,1.25]*LeaseDuration. Exported for testing purposes only.

does it need to be exported any more? See comment.


pkg/sql/lease.go, line 51 at r14 (raw file):

	// actual duration is jittered in the range
	// [0.75,1.25]*LeaseDuration. Exported for testing purposes only.
	LeaseDuration = 5 * time.Minute

this should now be called defaultLeaseDuration, as per convention used in many other places for similarly overridable values.


pkg/sql/lease.go, line 121 at r14 (raw file):

	nodeID *base.NodeIDContainer

	// leaseDuration is a constant initialized by NewLeaseManager. It is modified

This comment is less than ideal, in my opinion. First of all, "constant" doesn't mean much if you say next that it can be modified. Second of all, all the members here (at least the pointers) are "constant", in that they're not changed after construction. Also, there's no particular need to spell out that a constructor sets this. Finally, this comment does not say the one thing that it should say - what is this member?

leaseDuration is the mean duration a lease will be acquired for. The
	// actual duration is jittered in the range
	// [0.75,1.25]*LeaseDuration. 

pkg/sql/lease.go, line 131 at r14 (raw file):

// jitteredLeaseDuration returns a randomly jittered duration from the interval
// [0.75 * leaseDuration, 1.25 * leaseDuration].
func jitteredLeaseDuration(leaseDuration time.Duration) time.Duration {

If possible, I'd make this a method on LeaseStore and access the leaseDuration member instead of taking it as an argument.


pkg/sql/lease.go, line 1109 at r14 (raw file):

			clock:         clock,
			nodeID:        nodeID,
			leaseDuration: LeaseDuration,

If you're going through the pain of doing this (and I thank you that you are), please go all the way - make leaseDuration an arg to this function and plumb it through a ServerConfig. TestSetLeaseDuration() can then go away; tests will control the duration directly through the configs they create.
The default value will probably have to move to a different package. If that's the case, please name it DefaultTableDescriptorLeaseDuration.


pkg/sql/lease.go, line 990 at r15 (raw file):

with any operation error

I don't see an error being passed. Is the comment correct? Same below.


pkg/sql/lease.go, line 994 at r15 (raw file):

	// Called before waiting on a results from a DoChan call of acquireNodeLease
	// in tableState.acquireFreshestFromStoreLocked(), with any operation error.
	LeaseAcquireFreshestResultBlockEvent func()

If possible, I'd keep just LeaseAcquireResultBlockEvent and call it from both places. If the distinction is necessary for the test, add an enum param to the call to discriminate between the callers.
For most people that want to consume this event, the difference should not matter I think.


pkg/sql/lease_internal_test.go, line 596 at r15 (raw file):

	descID := sqlbase.ID(keys.LeaseTableID)

	// acquireBlock calls Acquire and then releases the lease.

I don't see the release part... Is this comment correct?


pkg/sql/lease_internal_test.go, line 608 at r15 (raw file):

	// acquireFreshestBlock calls acquireFreshestFromStore and then releases the
	// lease.
	acquireFreshestBlock := func(

I think you can inline this one in the test case.


pkg/sql/lease_internal_test.go, line 618 at r15 (raw file):

	testCases := []struct {
		// The name of the test

unnecessary. Also, missing final period.


pkg/sql/lease_internal_test.go, line 622 at r15 (raw file):

		// The routine being called during the test as the "thread 1" mentioned in
		// the comment preceding this test.
		routineFunc func(ctx context.Context, m *LeaseManager, acquireChan chan Result)

routineFunc is... a bad name :)
acquireFn.


pkg/sql/lease_internal_test.go, line 625 at r15 (raw file):

		// Whether the second routine is a call to LeaseManager.acquireFreshest or
		// not. This determines which channel we unblock.
		isSecondCallAcquireFreshest bool

I don't think you need both this and the previous member. Choose one.


pkg/sql/lease_internal_test.go, line 627 at r15 (raw file):

checks if the race condition occurs

This phrasing is unfortunate. It would likely be read as "checks whether the race occurs", but you want "checks what happens when the race occurs".
I'd say Race between acquire() and lease release. and Race between acquireFreshestFromStore() and lease release.


pkg/sql/lease_internal_test.go, line 644 at r15 (raw file):

	for _, test := range testCases {
		ctx := context.Background()

if this ctx stays, consider adding a log tag (and then my next comment about message prefixes goes away). See log.WithLogTag or such.
But actually, if you need this ctx just for logging from callbacks, I think what you want is to plumb a context to those callbacks (the LeaseManager's ctx used for the respective operation).


pkg/sql/lease_internal_test.go, line 672 at r15 (raw file):

						LeaseReleasedEvent:     removalTracker.LeaseRemovedNotification,
						LeaseAcquireFreshestResultBlockEvent: func() {
							log.Warningf(ctx, "AcquireFreshestPrelockEvent")

Warningf is wrong. Infof if you keep the messages. I'd also prefix all of them with "test: " or smth, as it's unusual for our tests to log (but I personally would like them to log more).

Also, I think using ctx is wrong. I think you want the LeaseStore to give a context to the callback.


pkg/sql/lease_internal_test.go, line 672 at r15 (raw file):

						LeaseReleasedEvent:     removalTracker.LeaseRemovedNotification,
						LeaseAcquireFreshestResultBlockEvent: func() {
							log.Warningf(ctx, "AcquireFreshestPrelockEvent")

"prelock" is stale?


pkg/sql/lease_internal_test.go, line 705 at r15 (raw file):

			// expiration. This prevents a false test failure if two leases happen to
			// have the same expiration due to randomness as expiration is used to
			// check if two leases are equal.

is this comment still true about expiration being used to test equality?


pkg/sql/lease_internal_test.go, line 726 at r15 (raw file):

			// continues, order does not matter.
			blockChan <- struct{}{}
			// Wait for the first routines results

routine's results.
Also please add trailing periods to all similar comments below.


pkg/sql/lease_internal_test.go, line 757 at r15 (raw file):

				t.Fatal(result2.err)
			}
			// Release and remove the lease

nit: consider hinting to the RemoveOnceDereferenced knob if that's at play here.


pkg/sql/lease_internal_test.go, line 759 at r15 (raw file):

			// Release and remove the lease
			tracker = removalTracker.TrackRemoval(result2.table)
			if err := leaseManager.Release(result2.table); err != nil {

why is this necessary?


pkg/sql/lease_internal_test.go, line 762 at r15 (raw file):

				t.Fatal(err)
			}
			// Wait until the lease is fully removed.

I don't know what "fully removed" means.


pkg/sql/lease_internal_test.go, line 769 at r15 (raw file):

			if result1.table == result2.table && result1.exp == result2.exp {
				t.Fatalf("Expected the leases to be different. TableDescriptor pointers are equal and both the same expiration")
			} else if count := atomic.LoadInt32(&leasesAcquiredCount); count != 2 {

else not needed


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight branch from f620218 to f52f0b2 Compare October 16, 2017 21:04
@lgo lgo requested review from a team October 16, 2017 21:04
@lgo
Copy link
Contributor Author

lgo commented Oct 16, 2017

Review status: 0 of 10 files reviewed at latest revision, 33 unresolved discussions.


pkg/sql/lease.go, line 50 at r14 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

does it need to be exported any more? See comment.

Done. 🎉


pkg/sql/lease.go, line 51 at r14 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this should now be called defaultLeaseDuration, as per convention used in many other places for similarly overridable values.

Done.


pkg/sql/lease.go, line 121 at r14 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This comment is less than ideal, in my opinion. First of all, "constant" doesn't mean much if you say next that it can be modified. Second of all, all the members here (at least the pointers) are "constant", in that they're not changed after construction. Also, there's no particular need to spell out that a constructor sets this. Finally, this comment does not say the one thing that it should say - what is this member?

leaseDuration is the mean duration a lease will be acquired for. The
	// actual duration is jittered in the range
	// [0.75,1.25]*LeaseDuration. 

Done.


pkg/sql/lease.go, line 131 at r14 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If possible, I'd make this a method on LeaseStore and access the leaseDuration member instead of taking it as an argument.

Done.


pkg/sql/lease.go, line 1109 at r14 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If you're going through the pain of doing this (and I thank you that you are), please go all the way - make leaseDuration an arg to this function and plumb it through a ServerConfig. TestSetLeaseDuration() can then go away; tests will control the duration directly through the configs they create.
The default value will probably have to move to a different package. If that's the case, please name it DefaultTableDescriptorLeaseDuration.

Done. This plumbing has been amended to the first commit.


pkg/sql/lease.go, line 990 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

with any operation error

I don't see an error being passed. Is the comment correct? Same below.

Done. (the comment was mistakenly added)


pkg/sql/lease.go, line 994 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If possible, I'd keep just LeaseAcquireResultBlockEvent and call it from both places. If the distinction is necessary for the test, add an enum param to the call to discriminate between the callers.
For most people that want to consume this event, the difference should not matter I think.

Done.


pkg/sql/lease_internal_test.go, line 596 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't see the release part... Is this comment correct?

Done, the comment wasn't correct as the "release" part was shuffled around quite a bit in trial and error.


pkg/sql/lease_internal_test.go, line 608 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think you can inline this one in the test case.

Done.


pkg/sql/lease_internal_test.go, line 618 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

unnecessary. Also, missing final period.

Done.


pkg/sql/lease_internal_test.go, line 622 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

routineFunc is... a bad name :)
acquireFn.

Done.


pkg/sql/lease_internal_test.go, line 625 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think you need both this and the previous member. Choose one.

Done.


pkg/sql/lease_internal_test.go, line 627 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

checks if the race condition occurs

This phrasing is unfortunate. It would likely be read as "checks whether the race occurs", but you want "checks what happens when the race occurs".
I'd say Race between acquire() and lease release. and Race between acquireFreshestFromStore() and lease release.

Done.


pkg/sql/lease_internal_test.go, line 644 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if this ctx stays, consider adding a log tag (and then my next comment about message prefixes goes away). See log.WithLogTag or such.
But actually, if you need this ctx just for logging from callbacks, I think what you want is to plumb a context to those callbacks (the LeaseManager's ctx used for the respective operation).

I'm not sure what you mean by the LeaseManager's ctx. The need of the context is just for providing LeaseManager functions with a context, which is otherwise associated with the current request handling the lease. So then we won't have a LeaseManager context to plumb.


pkg/sql/lease_internal_test.go, line 672 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Warningf is wrong. Infof if you keep the messages. I'd also prefix all of them with "test: " or smth, as it's unusual for our tests to log (but I personally would like them to log more).

Also, I think using ctx is wrong. I think you want the LeaseStore to give a context to the callback.

Uh yea a lot of these logs were for walking through the tests while there were problems. Removing.


pkg/sql/lease_internal_test.go, line 672 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

"prelock" is stale?

Are you referring to the log or the waitgroup? (If it's the log, the logs are removed.)


pkg/sql/lease_internal_test.go, line 705 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is this comment still true about expiration being used to test equality?

Expiration is still checked but in addition to the pointers. I'm more confident if it is, but if you insist it be removed I can.


pkg/sql/lease_internal_test.go, line 726 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

routine's results.
Also please add trailing periods to all similar comments below.

Done.


pkg/sql/lease_internal_test.go, line 757 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: consider hinting to the RemoveOnceDereferenced knob if that's at play here.

Done.


pkg/sql/lease_internal_test.go, line 759 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why is this necessary?

Releasing the second lease? It doesn't assist with what we're testing here. It was more about having the test clean up things (and if this errors, that's bad).


pkg/sql/lease_internal_test.go, line 762 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't know what "fully removed" means.

Fixed.


pkg/sql/lease_internal_test.go, line 769 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

else not needed

Done.


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight branch from f52f0b2 to 082cf0c Compare October 17, 2017 17:25
@lgo
Copy link
Contributor Author

lgo commented Oct 17, 2017

Oof, didn't check back and notice the test failure. Fixed the problem where defaults weren't initialized in non-test contexts.

@andreimatei
Copy link
Contributor

:lgtm:

Merge at will.


Review status: 0 of 10 files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


pkg/base/config.go, line 491 at r17 (raw file):

	// randomly jitter the lease duration when acquiring a new lease and
	// the lease renewal timeout.
	// A pointer is used because the zero-value is a commonly cconfigured

cconfigured


pkg/base/config.go, line 491 at r17 (raw file):

	// randomly jitter the lease duration when acquiring a new lease and
	// the lease renewal timeout.
	// A pointer is used because the zero-value is a commonly cconfigured

I don't know what this sentence means. It suggests that nil is valid, but doesn't tell me what it means.

Take it or leave it:

I think it's unfortunate that this is a pointer, particularly since it looks like this pointer can be nil for some of the lifetime of the LMC, but not for the whole lifetime.
There's two things at tension here - the general desire (and go philosophy) of making zero-values useful, and the cruft that this can introduce. I know that you've been following precedent here with the SetDefaults() method, but there's also precedent the other way around -> not allowing uninitialized LeaseManagerConfigs. In this case, unless there's particular reason for the former, I'd prefer the latter: transform SetDefaults() into a ctor taking arguments for every field and let the caller pass in the defaults if it wants.
If you want to keep the zero value of LeaseManagerConfig valid then the struct could be written differently - for example, TableDescriptorLeaseJitterFraction could be a struct itself with an initialized bool, or the whole LMC could be a set of "options".

What I'd make nil-able, however, is the TestServerArgs.LeaseManagerConfig. It's more OK for a user of TestServerArgs to not care about its fields, and also there's no "can be nil up to some point, but not after it" issue with that struct.


pkg/base/test_server_args.go, line 39 at r17 (raw file):

	*cluster.Settings
	RaftConfig
	LeaseManagerConfig

please don't embed; I think that will only hurt readability. I see that other are, but I think that's unfortunate (also, others aren't!). More than one embedded thing is a bad idea, I think.


pkg/server/config.go, line 124 at r17 (raw file):

	base.RaftConfig
	base.LeaseManagerConfig

please don't embed.


pkg/sql/lease.go, line 123 at r18 (raw file):

	// range of the actual lease duration will be
	// [(1-leaseJitterFraction) * leaseDuration, (1+leaseJitterFraction) * leaseDuration]
	// Exported for testing purposes only.

exported comment is stale.


pkg/sql/lease_internal_test.go, line 672 at r15 (raw file):

Previously, lego (Joey Pereira) wrote…

Are you referring to the log or the waitgroup? (If it's the log, the logs are removed.)

the log


pkg/sql/lease_internal_test.go, line 759 at r15 (raw file):

Previously, lego (Joey Pereira) wrote…

Releasing the second lease? It doesn't assist with what we're testing here. It was more about having the test clean up things (and if this errors, that's bad).

but this cleanup clutters what's already a complicated test. Cleaning up this lease seems to me to not be any more needed than in every other test that directly and indirectly uses leases - failing to release in any of them would be bad. And yet we don't clutter all the tests with lease cleanup. Perhaps we should, but then the testing should be implicit -> the testserver cleanup (that does happen in every test, including in this one) should assert that the draining process managed to release all the leases.
Anyway, this is a nit.


pkg/sql/lease_test.go, line 312 at r18 (raw file):

	// Set the lease duration such that the next lease acquisition will
	// require the lease to be reacquired.
	params.LeaseManagerConfig.TableDescriptorLeaseDuration = 5 * time.Nanosecond

why 5?


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: 0 of 10 files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


pkg/sql/lease_test.go, line 312 at r18 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why 5?

What I wanted to say here is that with a good config strategy, I think even 0 would work - which would perfectly express what you want.


Comments from Reviewable

@lgo lgo force-pushed the joey/table-lease-singleflight branch 2 times, most recently from b6e95ad to cbc7901 Compare October 18, 2017 00:45
lgo added 2 commits October 17, 2017 21:44
Previously, the global LeaseDuration value was modified during a test.
If tests doing this are run in parallel, this would cause race problems.
The global has now been modified to live on the LeaseStore where it is
referenced from. It is also now configured during server initialization.

A LeaseJitterFraction constant is also refactored out. This is used in
later path.
This changes the acquisition of leases to use singleflight to reduce the
complexity of handling concurrent acquisitions. The behaviour is kept
the same as we block on the results of singleflight.DoChan. The change
is in preparation for adding a new codepath which acquires leases
asynchronously.

This also refactors out the jitter multiplier into a constant.
@lgo lgo force-pushed the joey/table-lease-singleflight branch from cbc7901 to 899d08f Compare October 18, 2017 01:54
@lgo
Copy link
Contributor Author

lgo commented Oct 18, 2017

TFTR @andreimatei!

I took the final tidbits of advice to clean things up :).


Review status: 0 of 10 files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


pkg/base/config.go, line 491 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

cconfigured

Done.


pkg/base/config.go, line 491 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't know what this sentence means. It suggests that nil is valid, but doesn't tell me what it means.

Take it or leave it:

I think it's unfortunate that this is a pointer, particularly since it looks like this pointer can be nil for some of the lifetime of the LMC, but not for the whole lifetime.
There's two things at tension here - the general desire (and go philosophy) of making zero-values useful, and the cruft that this can introduce. I know that you've been following precedent here with the SetDefaults() method, but there's also precedent the other way around -> not allowing uninitialized LeaseManagerConfigs. In this case, unless there's particular reason for the former, I'd prefer the latter: transform SetDefaults() into a ctor taking arguments for every field and let the caller pass in the defaults if it wants.
If you want to keep the zero value of LeaseManagerConfig valid then the struct could be written differently - for example, TableDescriptorLeaseJitterFraction could be a struct itself with an initialized bool, or the whole LMC could be a set of "options".

What I'd make nil-able, however, is the TestServerArgs.LeaseManagerConfig. It's more OK for a user of TestServerArgs to not care about its fields, and also there's no "can be nil up to some point, but not after it" issue with that struct.

Thanks for the suggestion! :)

The code comment was a byproduct of exactly what you mentioned -- a tension between two desires and not knowing the right pattern, which was very dissatisfying. (In hindsight, I should have called this out to get advice.)

I made the change to use a constructor instead of SetDefaults(), and it has worked out quite elegantly.


pkg/base/test_server_args.go, line 39 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please don't embed; I think that will only hurt readability. I see that other are, but I think that's unfortunate (also, others aren't!). More than one embedded thing is a bad idea, I think.

Done.


pkg/server/config.go, line 124 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please don't embed.

Done.


pkg/sql/lease.go, line 123 at r18 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

exported comment is stale.

Done.


pkg/sql/lease_internal_test.go, line 759 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

but this cleanup clutters what's already a complicated test. Cleaning up this lease seems to me to not be any more needed than in every other test that directly and indirectly uses leases - failing to release in any of them would be bad. And yet we don't clutter all the tests with lease cleanup. Perhaps we should, but then the testing should be implicit -> the testserver cleanup (that does happen in every test, including in this one) should assert that the draining process managed to release all the leases.
Anyway, this is a nit.

Yea, I can agree with that.

Related: it doesn't appear that there is any test server cleanup for leases. The lease cleanup doesn't appear to happen in lease.go and I would guess it wouldn't further upstream. In order to do so, we would need to add this to the lease manager initialization correct?

server.Stopper().AddCloser(
  stop.CloserFn(func() {
    // ... clean up leases
  }))

pkg/sql/lease_test.go, line 312 at r18 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

What I wanted to say here is that with a good config strategy, I think even 0 would work - which would perfectly express what you want.

Huh, yea 0 does work fine for this. This was a pre-existing test but 👍 thanks for pointing this point.


Comments from Reviewable

@lgo lgo merged commit f3c7781 into cockroachdb:master Oct 18, 2017
@lgo lgo deleted the joey/table-lease-singleflight branch November 20, 2017 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants