Flaky test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection #1154

Closed
shahzadlone opened this issue Feb 28, 2023 · 11 comments · Fixed by #1273
@shahzadlone (Member) commented Feb 28, 2023

Failure in ci run: https://github.com/sourcenetwork/defradb/actions/runs/4295741739/jobs/7486567708

Waiting for pushlog timed out:

2023-02-28T17:38:50.243Z, INFO, defra.test.net, Node 0 synced
    utils.go:532: 
        	Error Trace:	/home/runner/work/defradb/defradb/tests/integration/net/state/simple/peer/utils.go:532
        	            				/home/runner/work/defradb/defradb/tests/integration/net/state/simple/peer/asm_amd64.s:1594
        	Error:      	Received unexpected error:
        	            	waiting for pushlog timed out. Stack: /home/runner/work/defradb/defradb/node/node.go:330 (0x1bafed9)
        	            		(*Node).WaitForPushLogByPeerEvent: return errors.New("waiting for pushlog timed out")
        	            	/home/runner/work/defradb/defradb/tests/integration/net/state/utils.go:529 (0x1d6fc36)
        	            		waitForNodesToSync: err := nodes[targetIndex].WaitForPushLogByPeerEvent(nodes[sourceIndex].PeerID())
        	            	/opt/hostedtoolcache/go/1.19.6/x64/src/runtime/asm_amd64.s:1594 (0x4a90a1)
        	            		goexit: BYTE	$0x90	// NOP
        	Test:       	TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection
2023-02-28T17:38:59.954Z, INFO, defra.test.net, Node 1 synced
2023-02-28T17:38:59.959Z, INFO, defra.db, Closing DefraDB process...
2023-02-28T17:39:00.033Z, INFO, defra.db, Successfully closed running process
2023-02-28T17:39:00.047Z, INFO, defra.db, Closing DefraDB process...
2023-02-28T17:39:00.057Z, INFO, defra.net, Received new pubsub event, {"SenderId": "12D3KooWMfLakpXaC4zph1uZufNjySQuQPN6Ygmc717iEYLpXU4Y", "Topic": "bae-52b9170d-b77a-5887-b877-cbdbb99b009f", "Message": "LEFT"}
2023-02-28T17:39:00.057Z, INFO, defra.net, Received new pubsub event, {"SenderId": "12D3KooWMfLakpXaC4zph1uZufNjySQuQPN6Ygmc717iEYLpXU4Y", "Topic": "bae-53c0e2be-d1a2-5315-b892-c5be109c1310", "Message": "LEFT"}
2023-02-28T17:39:00.091Z, INFO, defra.db, Successfully closed running process

DONE 993 tests, 1 failure in 278.630s
Error: Process completed with exit code 2.
@shahzadlone shahzadlone added bug Something isn't working area/testing Related to any test or testing suite labels Feb 28, 2023
@shahzadlone shahzadlone added this to the DefraDB v0.5 milestone Feb 28, 2023
@AndrewSisley (Contributor) commented:

I think this might be caused by SeedDocuments. Whilst reworking the framework I saw that, when subscribing to a collection, sync events are sent/received for documents that existed prior to the subscription, regardless of whether they exist on all nodes. These events are not accounted for when waiting on sync events and will trigger an early progression of the test framework.

I'll close this ticket if/when my rework gets merged, as the issue appears to be solved there.

@AndrewSisley (Contributor) commented:

The issue still persists in #1160: I got a single failure out of 50 runs (locally), in the same test (TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection).

@AndrewSisley AndrewSisley removed their assignment Mar 10, 2023
@AndrewSisley (Contributor) commented Mar 28, 2023

Just noting for future reference that it feels like this is getting worse and failures are becoming more frequent, although I do not recall any changes being merged within the last few weeks that should notably affect P2P or its tests.

@AndrewSisley (Contributor) commented:

Possibly the same issue as #1252

@AndrewSisley (Contributor) commented:

Possibly relevant: libp2p seems to have a significant test-flakiness problem of its own, so we may not be able to fix all of our test issues within our codebase: https://github.com/libp2p/go-libp2p/issues?q=is%3Aissue+is%3Aopen+flaky

@fredcarle (Collaborator) commented:

I'm not surprised given the nature of p2p systems.

@AndrewSisley (Contributor) commented Mar 29, 2023

Our go-libp2p version is currently 0.26.0, which does not include libp2p/go-libp2p#2173, a PR that seems to have notably reduced the flakiness of their tests and which involves a sub-package that appears quite heavily in the logs of at least a few of our flaky test failures (e.g. https://github.com/sourcenetwork/defradb/actions/runs/4556339760/jobs/8036565210).

EDIT: PR to bump the version: #1257

@AndrewSisley (Contributor) commented Mar 29, 2023

Current theory: it is deadlocking on the closeLock in datastore.Close, with us trying to close the store before P2P has stopped trying to write to it.

@AndrewSisley (Contributor) commented:

The vast majority of processes are stuck on that one lock: most are ds.Get calls, there is one txn.Put, and one ds.Close. ds.Close is the only non-read lock that I can spot; txn and ds share the same lock.

Interestingly, purgeOldVersions also seems to be stuck, although it should have exited via line 292 on (d.closing).

@AndrewSisley (Contributor) commented Mar 29, 2023

This might be caused by multiple RLocks in the same routine prior to unlocking. ds Put and Delete call RLock, then call txn.Commit, which tries to re-acquire the same mutex via RLock (before either is unlocked via defer). The documentation for RWMutex explicitly warns against doing this:

"If a goroutine holds a RWMutex for reading and another goroutine might call Lock, no goroutine should expect to be able to acquire a read lock until the initial read lock is released. In particular, this prohibits recursive read locking. This is to ensure that the lock eventually becomes available; a blocked Lock call excludes new readers from acquiring the lock."

@AndrewSisley AndrewSisley self-assigned this Mar 30, 2023
@shahzadlone shahzadlone changed the title flakey test - waiting for pushlog timed out Flakey test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection Apr 2, 2023
@jsimnz jsimnz changed the title Flakey test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection Flaky test - TestP2PWithMultipleDocumentUpdatesPerNodeWithP2PCollection Apr 6, 2023