Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvcoord: flake in TestMultiRangeScanReverseScanInconsistent #91856

Closed
knz opened this issue Nov 14, 2022 · 7 comments · Fixed by #98775
Closed

kvcoord: flake in TestMultiRangeScanReverseScanInconsistent #91856

knz opened this issue Nov 14, 2022 · 7 comments · Fixed by #98775
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-transactions Relating to MVCC and the transactional model. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). GA-blocker skipped-test T-kv KV Team

Comments

@knz
Copy link
Contributor

knz commented Nov 14, 2022

$ ./dev test --stress //pkg/kv/kvclient/kvcoord --filter=TestMultiRangeScanReverseScanInconsistent
...
--- FAIL: TestMultiRangeScanReverseScanInconsistent (1.52s)
    test_log_scope.go:161: test logs captured to: /tmp/_tmp/e1d742d7ffb97ae3adc81944eac18765/logTestMultiRangeScanReverseScanInconsistent3519145941
    test_log_scope.go:79: use -show-logs to present logs inline
    --- FAIL: TestMultiRangeScanReverseScanInconsistent/INCONSISTENT (0.70s)
        dist_sender_server_test.go:1160: 0: expected 1 row; got 0
            []
    dist_sender_server_test.go:1168: -- test log scope end --
FAIL

Also found here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/7504394?buildTab=overview&showRootCauses=false&expandBuildProblemsSection=true&expandBuildTestsSection=true&expandBuildChangesSection=true&expandBuildDeploymentsSection=true#%2Ftmp

cc @nvanbenschoten for triage

Jira issue: CRDB-21459

@knz knz added C-test-failure Broken test (automatically or manually discovered). A-kv-transactions Relating to MVCC and the transactional model. A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team labels Nov 14, 2022
@knz knz mentioned this issue Nov 14, 2022
craig bot pushed a commit that referenced this issue Nov 14, 2022
91840: server: remove unused apiv2 server field r=knz a=dhartunian

Epic: None
Resolves: #91829
Release note: None

91857: skip flaky tests r=andrewbaptist a=knz

Informs #91856
Informs #91858

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@blathers-crl
Copy link

blathers-crl bot commented Mar 1, 2023

Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 2, 2023
@nvanbenschoten
Copy link
Member

I can still hit this on master (1a58583a0f911b6dd2a6b0bfca17ec510727b0d9) after about 1m of stress:

# unskip the test
dev test pkg/kv/kvclient/kvcoord -f=TestMultiRangeScanReverseScanInconsistent --stress

@andrewbaptist andrewbaptist added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Mar 15, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 15, 2023
@andrewbaptist
Copy link
Contributor

I'm going to need some help on this from the storage team. I added a ton of debugging statements through the stack to try and figure out why my writes are getting "lost" and created a more straightforward test to reproduce this TestMinimal in this branch.

https://github.com/andrewbaptist/cockroach/tree/23.03.15-pebble_invalid

From what I can tell we are putting the key correctly and see this log:

 I230315 23:11:18.082595 13 storage/mvcc.go:2196 ⋮ [T1,n1,s1,r5/1:‹{/Systemtse-b}›] 66042  Calling put ‹"a"/1678921878.081010000,0›, ‹/BYTES/value›

However, it fails with

 I230315 23:11:18.085546 1701 storage/pebble_mvcc_scanner.go:641 ⋮ [T1,n1,s1,r5/1:‹{/Systemtse-b}›] 66157  Starting scan, ‹"a"›, ‹"b"›
 I230315 23:11:18.085552 1701 storage/pebble_iterator.go:364 ⋮ [-] 66158  pebbleIterator
 I230315 23:11:18.085554 1701 storage/pebble_iterator.go:373 ⋮ [-] 66159  not valid? <nil>
 I230315 23:11:18.085557 1701 storage/intent_interleaving_iter.go:600 ⋮ [-] 66160  computePos invalid
 I230315 23:11:18.085559 1701 storage/pebble_mvcc_scanner.go:1648 ⋮ [-] 66161  iter not valid
 I230315 23:11:18.085561 1701 storage/pebble_mvcc_scanner.go:646 ⋮ [T1,n1,s1,r5/1:‹{/Systemtse-b}›] 66162  seek failed <nil>
 I230315 23:11:18.085569 1701 kv/kvserver/replica_evaluate.go:550 ⋮ [T1,n1,s1,r5/1:‹{/Systemtse-b}›] 66163  evaluated Scan command ‹header:<key:"a" end_key:"b" >›, txn=<nil> : resp=‹header:<> col_batches:<>›, err=<nil>    I230315 23:11:18.085584 13 kv/kvclient/kvcoord_test/dist_sender_server_test.go:1109 ⋮ [-] 66164  Scan complete at time &{‹0x14003422e70› 0 0 1678921878082825001 0 0 {{{0 0}} {1678921878085579000 0 false} false 0}}

Note that the scan time is greater than the put time, however, it never even gets the item back from pebble.
1678921878082825001 > 1678921878.081010000

This feels like a pebble bug, but I haven't created a non-KV test to reproduce this. Fortunately, the test reproduces the issue quickly.

I haven't bisected to figure out when this problem started.

@nvanbenschoten
Copy link
Member

@andrewbaptist nice job getting to a minimal reproduction. It's interesting that we see this flake when issuing an INCONSISTENT scan. INCONSISTENT scans don't acquire latches, so they bypass replica-level synchronization. One thing to check is that you can still reproduce even if you disable raft proposal acknowledgment ahead of application. You can do that using the StoreTestingKnobs.DisableCanAckBeforeApplication flag. We might have missed this in 175a9f8.

@andrewbaptist
Copy link
Contributor

I verified changing the know does fix the issue.

I don't fully understand what that knob does or why we would need it in tests. This seems like a real linearizability violation from the perspective that a write is missed in a future scan (at a later time with a later timestamp). Maybe we can chat more on this tomorrow. Also, note that the Put is called before the later scan, and we are not ever going directly through the engine, these are all BatchRequests from a client.

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Mar 16, 2023
A read request at READ_UNCOMMITTED or INCONSISTENT levels are not
guaranteed to return the latest data. This is especially true after
aysnc raft where a put operation on the leaseholder returns after the
data is appended to raft but not yet applied to the state machine.

This change adds retries to the scan to ensure that the data does get
there eventually and the scan stil returns the correct results.

Epic: none
Fixes: cockroachdb#91856
@jbowens
Copy link
Collaborator

jbowens commented Mar 16, 2023

Noticed the storage tag—Is there something for storage to look into here, or do we know what's going on?

@andrewbaptist
Copy link
Contributor

No - sorry - I will remove it - I thought it was a storage bug first when I didn't realize the difference between raft append and log application.

@andrewbaptist andrewbaptist removed A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Mar 16, 2023
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Mar 16, 2023
A read request at READ_UNCOMMITTED or INCONSISTENT levels are not
guaranteed to return the latest data. This is especially true after
aysnc raft where a put operation on the leaseholder returns after the
data is appended to raft but not yet applied to the state machine.

This change adds retries to the scan to ensure that the data does get
there eventually and the scan stil returns the correct results.

Epic: none
Fixes: cockroachdb#91856
craig bot pushed a commit that referenced this issue Mar 19, 2023
97845: sql: add RevertSpans and RevertSpansFanout r=msbutler a=stevendanna

This PR adds RevertSpans and RevertSpansFanout.

RevertSpans de-duplicates some code between import rollback and streaming cutover.

Along the way, it fixes a small bug that existed in RevertTables: Only a single RevertRangeRequest is permitted in a batch since RevertRangeRequest has the isAlone flag set. As a result, a future caller of RevertTables would have encountered a fatal error from KV.

RevertSpansFanout uses DistSQL's PartitionSpans to manually fan out multiple RevertRange requests. Since all users of revert range request currently set a limit on the number of keys touched, dist sender doesn't fanout such request.

Epic: none

Release note: None

98775: kvserver: add test retries for INCONSISTENT scan r=andrewbaptist a=andrewbaptist

A read request at READ_UNCOMMITTED or INCONSISTENT levels are not guaranteed to return the latest data. This is especially true after aysnc raft where a put operation on the leaseholder returns after the data is appended to raft but not yet applied to the state machine.

This change adds retries to the scan to ensure that the data does get there eventually and the scan stil returns the correct results.

Epic: none
Fixes: #91856

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
@craig craig bot closed this as completed in adf0045 Mar 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-transactions Relating to MVCC and the transactional model. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). GA-blocker skipped-test T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants