Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestRequestsOnLaggingReplica failed #57932

Closed
cockroach-teamcity opened this issue Dec 15, 2020 · 7 comments
Closed

kv/kvserver: TestRequestsOnLaggingReplica failed #57932

cockroach-teamcity opened this issue Dec 15, 2020 · 7 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Dec 15, 2020

(kv/kvserver).TestRequestsOnLaggingReplica failed on master@93615a6b16071749c8d412b29978f99e57944081:

=== RUN   TestRequestsOnLaggingReplica
    test_log_scope.go:72: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestRequestsOnLaggingReplica002097259
    test_log_scope.go:73: use -show-logs to present logs inline
E201215 03:13:11.103509 212369 kv/kvserver/replica_range_lease.go:340  [n1,s1,r36/1:‹/{Table/Max-Max}›] failed to heartbeat own liveness record: heartbeat failed on epoch increment
    client_raft_test.go:1279: 
        	Error Trace:	client_raft_test.go:1279
        	Error:      	Not equal: 
        	            	expected: 3
        	            	actual  : 1
        	Test:       	TestRequestsOnLaggingReplica
    panic.go:617: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestRequestsOnLaggingReplica002097259
--- FAIL: TestRequestsOnLaggingReplica (4.23s)

More

Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestRequestsOnLaggingReplica PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

Jira issue: CRDB-3459

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 15, 2020
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestRequestsOnLaggingReplica failed with artifacts on master @ e179c8efcb1cf78ed940c3a0290d84fd1a5590e9:

=== RUN   TestRequestsOnLaggingReplica
    test_log_scope.go:73: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestRequestsOnLaggingReplica277650776
    test_log_scope.go:74: use -show-logs to present logs inline
    client_raft_test.go:1214: 
        	Error Trace:	client_raft_test.go:1214
        	Error:      	Expected nil, but got: &roachpb.Error{deprecatedMessage:"[NotLeaseHolderError] lease state couldn't be determined; r38: replica (n3,s3):3 not lease holder; lease holder unknown", deprecatedTransactionRestart:0, UnexposedTxn:(*roachpb.Transaction)(nil), OriginNode:0, deprecatedDetail:roachpb.ErrorDetail{Value:(*roachpb.ErrorDetail_NotLeaseHolder)(0xc000a6fc50)}, EncodedError:errorspb.EncodedError{Error:(*errorspb.EncodedError_Leaf)(0xc000a6fc40)}, Index:(*roachpb.ErrPosition)(nil), Now:hlc.ClockTimestamp{WallTime:1619033519774824319, Logical:0, Synthetic:false}}
        	Test:       	TestRequestsOnLaggingReplica
E210421 19:31:59.778081 386151 kv/kvserver/consistency_queue.go:191  [n1,consistencyChecker,s1,r30/1:/Table/3{4-5}] 1  failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease acquisition canceled because node is stopping; r30: replica (n1,s1):1 not lease holder; lease holder unknown
E210421 19:31:59.778203 386151 kv/kvserver/queue.go:1093  [n1,consistencyChecker,s1,r30/1:/Table/3{4-5}] 2  failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease acquisition canceled because node is stopping; r30: replica (n1,s1):1 not lease holder; lease holder unknown
E210421 19:31:59.778282 386157 kv/kvserver/replica_range_lease.go:360  [n1,s1,r30/1:/Table/3{4-5}] 3  failed to heartbeat own liveness record: context canceled
    panic.go:617: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestRequestsOnLaggingReplica277650776
--- FAIL: TestRequestsOnLaggingReplica (3.42s)
Reproduce

To reproduce, try:

make stressrace TESTS=TestRequestsOnLaggingReplica PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-json

/cc @andreimatei

This test on roachdash | Improve this report!

@sumeerbhola
Copy link
Collaborator

@andreimatei any thoughts on this recent test failure?

@andreimatei
Copy link
Contributor

I'll try to take a look.

@andreimatei
Copy link
Contributor

I produced the following in 30min or stressrace

E210422 02:41:36.191109 3751 kv/kvserver/replica_range_lease.go:402  [n2,s2,r38/2:/{Table/Max-Max}] 1  failed to increment leaseholder's epoch: mismatch incrementing epoch for liveness(nid:1 epo:1 exp:1619059294.388294206,0); actual is liveness(nid:1 epo:1 exp:1619059295.876687580,0)
    client_raft_test.go:1214: 
                Error Trace:    client_raft_test.go:1214
                Error:          Expected nil, but got: &roachpb.Error{deprecatedMessage:"[NotLeaseHolderError] failed to manipulate liveness record: mismatch incrementing epoch for liveness(nid:1 epo:1 exp:1619059294.388294206,0); actual is liveness(nid:1 epo:1 exp:1619059295.876687580,0); r38: replica (n2,s2):2 not lease holder; current lease is repl=(n1,s1):1 seq=1 start=0,0 epo=1 pro=1619059291.932609191,0", deprecatedTransactionRestart:0, UnexposedTxn:(*roachpb.Transaction)(nil), OriginNode:0, deprecatedDetail:roachpb.ErrorDetail{Value:(*roachpb.ErrorDetail_NotLeaseHolder)(0xc00354c710)}, EncodedError:errorspb.EncodedError{Error:(*errorspb.EncodedError_Leaf)(0xc00354c6f8)}, Index:(*roachpb.ErrPosition)(nil), Now:hlc.ClockTimestamp{WallTime:1619059296196450331, Logical:0, Synthetic:false}}
                Test:           TestRequestsOnLaggingReplica
    panic.go:636: -- test log scope end --
test logs left over in: /tmp/logTestRequestsOnLaggingReplica581669057

andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 23, 2021
In some rare cases, the status of a lease in relation to a
request/timestamp can't be determined. For the request's client this
results in a NotLeaseholderError. This patch improves the message of
that error.

In particular, this test failure[1] seems to show that a node couldn't
verify that an existing lease is expired because its liveness gossiped
info was stale. This sounds interesting and if the test fails again this
improved message should help.

[1] cockroachdb#57932 (comment)

Release note: None
craig bot pushed a commit that referenced this issue Apr 30, 2021
64080: kvserver: improve error message for rare lease errors r=andreimatei a=andreimatei

In some rare cases, the status of a lease in relation to a
request/timestamp can't be determined. For the request's client this
results in a NotLeaseholderError. This patch improves the message of
that error.

In particular, this test failure[1] seems to show that a node couldn't
verify that an existing lease is expired because its liveness gossiped
info was stale. This sounds interesting and if the test fails again this
improved message should help.

[1] #57932 (comment)

Release note: None

64239: kv: bump kv.range_merge.queue_interval to 5s r=nvanbenschoten a=nvanbenschoten

Informs #62700.

This commit bumps the default value for the `kv.range_merge.queue_interval` cluster setting from 1s to 5s. This setting serves as a per-store rate limit on the frequency at which range merges will be initiated. We've seen in a few issues, including #62700, that excessive range merge traffic can cause instability in a cluster. There's very little reason to be aggressive about range merging, as range merges are rarely needed with any urgency. However, there are good reasons to be conservative about them.

This change can also be justified as a (late) reaction to the increased max range size from 64MB to 512MB. A range merge may need to rebalance replicas in a range, so its cost can be a function of the sizes of ranges. This means that if range merges are now more expensive, we should be running them less frequently.

64260: roachpb: remove EndTxn's DeprecatedCanCommitAtHigherTimestamp field r=nvanbenschoten a=nvanbenschoten

The field was replaced with the more general BatchRequest.CanForwardReadTimestamp
in 972915d. This commit completes the migration to remove the old flag.

We attempted this before, in 93d5eb9, but had to back that out in 4189938 because
we still needed compatibility with v20.1 nodes at the time. That is no longer the
case.

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestRequestsOnLaggingReplica failed with artifacts on master @ cfc525fe19e848f324baaf9937ce12a4d029fec2:

=== RUN   TestRequestsOnLaggingReplica
    test_log_scope.go:73: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestRequestsOnLaggingReplica794565957
    test_log_scope.go:74: use -show-logs to present logs inline
E210730 22:11:53.488826 428254 kv/kvserver/replica_range_lease.go:365  [n1,s1,r35/1:/Table/{39-40}] 1  failed to heartbeat own liveness record: heartbeat failed on epoch increment
E210730 22:11:53.489214 428177 kv/kvserver/replica_range_lease.go:365  [n1,s1,r41/1:/{Table/Max-Max}] 2  failed to heartbeat own liveness record: heartbeat failed on epoch increment
    client_raft_test.go:1281: 
        	Error Trace:	client_raft_test.go:1281
        	Error:      	Not equal: 
        	            	expected: 3
        	            	actual  : 1
        	Test:       	TestRequestsOnLaggingReplica
    panic.go:613: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestRequestsOnLaggingReplica794565957
--- FAIL: TestRequestsOnLaggingReplica (6.18s)
Reproduce

To reproduce, try:

make stressrace TESTS=TestRequestsOnLaggingReplica PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-json

/cc @cockroachdb/kv andreimatei

This test on roachdash | Improve this report!

@erikgrinaker
Copy link
Contributor

Did 3000 stressrace runs of this over 50 minutes, no failures yet. Worth digging into this a bit. From a quick glance, it looks like the partitioned node (replica 1) took the lease, but running a request against that node returned a NotLeaseholderError pointing to itself?

@erikgrinaker
Copy link
Contributor

Closing as stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

5 participants