Flaky test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection #1154

Closed
shahzadlone opened this issue Feb 28, 2023 · 11 comments · Fixed by #1273
@shahzadlone (Member) commented Feb 28, 2023

Failure in ci run: https://github.com/sourcenetwork/defradb/actions/runs/4295741739/jobs/7486567708

Waiting for pushlog timed out:

2023-02-28T17:38:50.243Z, INFO, defra.test.net, Node 0 synced
    utils.go:532: 
        	Error Trace:	/home/runner/work/defradb/defradb/tests/integration/net/state/simple/peer/utils.go:532
        	            				/home/runner/work/defradb/defradb/tests/integration/net/state/simple/peer/asm_amd64.s:1594
        	Error:      	Received unexpected error:
        	            	waiting for pushlog timed out. Stack: /home/runner/work/defradb/defradb/node/node.go:330 (0x1bafed9)
        	            		(*Node).WaitForPushLogByPeerEvent: return errors.New("waiting for pushlog timed out")
        	            	/home/runner/work/defradb/defradb/tests/integration/net/state/utils.go:529 (0x1d6fc36)
        	            		waitForNodesToSync: err := nodes[targetIndex].WaitForPushLogByPeerEvent(nodes[sourceIndex].PeerID())
        	            	/opt/hostedtoolcache/go/1.19.6/x64/src/runtime/asm_amd64.s:1594 (0x4a90a1)
        	            		goexit: BYTE	$0x90	// NOP
        	Test:       	TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection
2023-02-28T17:38:59.954Z, INFO, defra.test.net, Node 1 synced
2023-02-28T17:38:59.959Z, INFO, defra.db, Closing DefraDB process...
2023-02-28T17:39:00.033Z, INFO, defra.db, Successfully closed running process
2023-02-28T17:39:00.047Z, INFO, defra.db, Closing DefraDB process...
2023-02-28T17:39:00.057Z, INFO, defra.net, Received new pubsub event, {"SenderId": "12D3KooWMfLakpXaC4zph1uZufNjySQuQPN6Ygmc717iEYLpXU4Y", "Topic": "bae-52b9170d-b77a-5887-b877-cbdbb99b009f", "Message": "LEFT"}
2023-02-28T17:39:00.057Z, INFO, defra.net, Received new pubsub event, {"SenderId": "12D3KooWMfLakpXaC4zph1uZufNjySQuQPN6Ygmc717iEYLpXU4Y", "Topic": "bae-53c0e2be-d1a2-5315-b892-c5be109c1310", "Message": "LEFT"}
2023-02-28T17:39:00.091Z, INFO, defra.db, Successfully closed running process

DONE 993 tests, 1 failure in 278.630s
Error: Process completed with exit code 2.
@shahzadlone shahzadlone added bug Something isn't working area/testing Related to any test or testing suite labels Feb 28, 2023
@shahzadlone shahzadlone added this to the DefraDB v0.5 milestone Feb 28, 2023
@AndrewSisley (Contributor) commented:

I think this might be caused by SeedDocuments. Whilst reworking the framework I saw that, when subscribing to a collection, sync events are sent/received for documents that existed prior to the subscription, regardless of whether they exist on all nodes. These events are not accounted for when waiting on sync events and will trigger an early progression of the test framework.

I'll close this ticket if/when my rework gets merged, as the issue appears to be solved there.

@AndrewSisley (Contributor) commented:

The issue still persists in #1160: I got a single failure out of 50 runs (locally), in the same test (TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection).

@AndrewSisley AndrewSisley removed their assignment Mar 10, 2023
@AndrewSisley (Contributor) commented Mar 28, 2023

Just noting for future reference that it feels like this is getting worse and failures are becoming more frequent, although I do not recall any changes being merged within the last few weeks that should notably affect P2P or its tests.

@AndrewSisley (Contributor) commented:

Possibly the same issue as #1252

@AndrewSisley (Contributor) commented:

Possibly relevant: libp2p seems to have a significant test-flakiness problem of its own, so we may not be able to fix all of our test issues within our codebase: https://github.com/libp2p/go-libp2p/issues?q=is%3Aissue+is%3Aopen+flaky

@fredcarle (Collaborator) commented:

I'm not surprised given the nature of p2p systems.

@AndrewSisley (Contributor) commented Mar 29, 2023

Our go-libp2p version is currently 0.26.0, which does not include libp2p/go-libp2p#2173, a PR that seems to have notably reduced the flakiness of their tests and which involves a sub-package that appears quite heavily in the logs of at least a few of our flaky test failures (e.g. https://github.com/sourcenetwork/defradb/actions/runs/4556339760/jobs/8036565210).

EDIT: PR to bump the version: #1257

@AndrewSisley (Contributor) commented Mar 29, 2023

Current theory: it is deadlocking on the closeLock in datastore.Close, with us trying to close the store before P2P has stopped trying to write to it.

@AndrewSisley (Contributor) commented:

The vast majority of processes are stuck on that one lock: most are ds.Get calls, there is one txn.Put, and one ds.Close. ds.Close is the only non-read lock that I can spot; txn and ds share the same lock.

Interestingly, purgeOldVersions also seems to be stuck, although it should have exited via line 292 on (d.closing).

@AndrewSisley (Contributor) commented Mar 29, 2023

This might be caused by multiple RLocks in the same routine prior to unlocking. ds Put and Delete call RLock, then call txn.Commit, which tries to re-acquire the same mutex via RLock (before either is unlocked via defer). The documentation for RWMutex explicitly warns against doing this:

"If a goroutine holds a RWMutex for reading and another goroutine might call Lock, no goroutine should expect to be able to acquire a read lock until the initial read lock is released. In particular, this prohibits recursive read locking. This is to ensure that the lock eventually becomes available; a blocked Lock call excludes new readers from acquiring the lock."

@AndrewSisley AndrewSisley self-assigned this Mar 30, 2023
@shahzadlone shahzadlone changed the title flakey test - waiting for pushlog timed out Flakey test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection Apr 2, 2023
@jsimnz jsimnz changed the title Flakey test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection Flaky test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection Apr 6, 2023