
pkg/sql/schemachanger/scbuild/scbuild_test: TestBuildIsMemoryMonitored failed [node became unavailable during DROP DATABASE] #119941

Closed
cockroach-teamcity opened this issue Mar 5, 2024 · 2 comments
Labels
  • branch-master: Failures and bugs on the master branch.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-robot: Originated from a bot.
  • T-kv: KV Team
  • X-infra-flake: the automatically generated issue was closed due to an infrastructure problem, not a product issue
Milestone
24.1

Comments


cockroach-teamcity commented Mar 5, 2024

pkg/sql/schemachanger/scbuild/scbuild_test.TestBuildIsMemoryMonitored failed with artifacts on master @ 1516e36763dbb1ed8413a12bf01082fef2fb03a1:

=== RUN   TestBuildIsMemoryMonitored
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/60c6172f28888ef3481178539d3aaf2a/logTestBuildIsMemoryMonitored321994250
    test_log_scope.go:81: use -show-logs to present logs inline
    conditional_wrap.go:190: 
        pkg/sql/schemachanger/scbuild/scbuild_test_test/pkg/sql/schemachanger/scbuild/builder_test.go:321: (TestBuildIsMemoryMonitored)
        	NOTICE: .ClusterSettings() called via implicit interface ApplicationLayerInterface;
        HINT: consider using .ApplicationLayer().ClusterSettings() instead.
    conditional_wrap.go:194: TIP: consider replacing the test server initialization from:
            ts, ... := serverutils.StartServer(t, ...)
            defer ts.Stopper().Stop(...)
        to:
            srv, ... := serverutils.StartServer(t, ...)
            defer srv.Stopper().Stop(...)
            ts := srv.ApplicationLayer()
        
        See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
    conditional_wrap.go:190: 
        pkg/sql/schemachanger/scbuild/scbuild_test_test/pkg/sql/schemachanger/scbuild/builder_test.go:325: (TestBuildIsMemoryMonitored)
        	NOTICE: .NodeID() called via implicit interface StorageLayerInterface;
        HINT: consider using .StorageLayer().NodeID() instead.
    builder_test.go:329: 
        	Error Trace:	pkg/sql/schemachanger/scbuild/scbuild_test_test/pkg/sql/schemachanger/scbuild/builder_test.go:329
        	            				github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scdeps/sctestutils/sctestutils.go:88
        	            				pkg/sql/schemachanger/scbuild/scbuild_test_test/pkg/sql/schemachanger/scbuild/builder_test.go:325
        	Error:      	Error "context canceled" does not contain "test-sc-build-mon: memory budget exceeded:"
        	Test:       	TestBuildIsMemoryMonitored
        	Messages:   	got a memory usage of: 0
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/60c6172f28888ef3481178539d3aaf2a/logTestBuildIsMemoryMonitored321994250
--- FAIL: TestBuildIsMemoryMonitored (24.72s)
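
The NOTICE/HINT lines above recommend obtaining an explicit ApplicationLayer (and StorageLayer) handle rather than calling ClusterSettings()/NodeID() through the implicit interfaces. A minimal sketch of the suggested pattern, assuming the serverutils and base packages as used elsewhere in the repository (TestMain registration and other test wiring omitted; this is not the actual builder_test.go code):

    package scbuild_test

    import (
        "context"
        "testing"

        "github.com/cockroachdb/cockroach/pkg/base"
        "github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
    )

    func TestExampleApplicationLayer(t *testing.T) {
        ctx := context.Background()
        // Keep the top-level server handle only for lifecycle management.
        srv, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{})
        defer srv.Stopper().Stop(ctx)

        // Use the explicit application layer for SQL-level accessors...
        ts := srv.ApplicationLayer()
        _ = ts.ClusterSettings()

        // ...and the explicit storage layer for KV-level accessors.
        _ = srv.StorageLayer().NodeID()

        _ = sqlDB
    }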

Parameters:

  • TAGS=bazel,gss
  • stress=true
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations


Jira issue: CRDB-36404

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Mar 5, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 5, 2024

rafiss commented Mar 5, 2024

It looks like a node became unavailable, and that caused a context to be canceled:

E240305 17:05:14.195459 175595 kv/kvserver/rangefeed/task.go:256 ⋮ [T1,Vsystem,n1,s1,r7/1:‹/Table/{3-4}›,rangefeed] 154  pushing old intents failed: context canceled
W240305 17:05:14.224089 175500 kv/kvserver/intentresolver/intent_resolver.go:646 ⋮ [-] 155  failed to cleanup transaction intents: failed to resolve intents: ‹kv/kvserver/intentresolver/intent_resolver.go›:1088: stopping
W240305 17:05:14.224421 173002 sql/sqlliveness/slinstance/slinstance.go:326 ⋮ [T1,Vsystem,n1] 156  exiting heartbeat loop
W240305 17:05:14.224611 173002 sql/sqlliveness/slinstance/slinstance.go:313 ⋮ [T1,Vsystem,n1] 157  exiting heartbeat loop with error: node unavailable; try another peer
E240305 17:05:14.224774 173002 server/server_sql.go:531 ⋮ [T1,Vsystem,n1] 158  failed to run update of instance with new session ID: node unavailable; try another peer
I240305 17:05:14.224975 174134 sql/stats/automatic_stats.go:608 ⋮ [T1,Vsystem,n1] 159  quiescing stats garbage collector
I240305 17:05:14.226482 172742 server/start_listen.go:110 ⋮ [T1,Vsystem,n1] 160  server shutting down: instructing cmux to stop accepting
W240305 17:05:14.727855 174597 kv/kvclient/kvcoord/dist_sender.go:2482 ⋮ [T1,Vsystem,n1,job=‹AUTO SPAN CONFIG RECONCILIATION id=948921600683343873›] 161  slow replica RPC: have been waiting 17.96s (0 attempts) for RPC Scan [/Table/3/1,/Table/3/2), [txn: 2e8eab8f] to replica (n1,s1):1; resp: ‹(err: context canceled: "unnamed" meta={id=2e8eab8f key=/Min iso=Serializable pri=0.02320353 epo=0 ts=1709658296.703981678,0 min=1709658296.703981678,0 seq=0} lock=false stat=PENDING rts=1709658296.703981678,0 wto=false gul=1709658297.203981678,0)›
E240305 17:05:14.734705 173666 jobs/registry.go:900 ⋮ [T1,Vsystem,n1] 162  error getting live session: node unavailable; try another peer
W240305 17:05:14.734480 172903 kv/kvserver/spanlatch/manager.go:655 ⋮ [T1,Vsystem,n1,s1,r7/1:‹/Table/{3-4}›,raft] 163  ResolveIntent [/Table/3/1/‹604›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹605›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹606›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹607›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹608›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹609›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹610›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹611›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹612›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹613›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹614›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹615›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹616›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹617›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹618›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹619›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹620›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹621›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹622›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹623›/‹2›/‹1›],... 75 skipped ..., ResolveIntent [/Table/3/1/‹699›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹700›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹701›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹702›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹703›/‹2›/‹1›], [max_span_request_keys: 0], [target_bytes: 4194304] has held latch for 10346819717 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.
I240305 17:05:14.753998 172903 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n1,s1,r7/1:‹/Table/{3-4}›,raft] 164  raft ready handling: 8.56s [append=0.00s, apply=8.55s, , other=0.01s], wrote [append-batch=49 B, apply=2.0 MiB (44)] pebble stats: [commit-wait 322ns sem 2.144µs]; node might be overloaded
W240305 17:05:14.758448 175575 kv/kvclient/kvcoord/dist_sender.go:2482 ⋮ [T1,Vsystem,n1,s1] 165  slow replica RPC: have been waiting 10.24s (0 attempts) for RPC ResolveIntent [/Table/3/1/‹3904›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3905›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3906›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3907›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3908›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3909›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3910›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3911›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3912›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3913›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3914›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3915›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3916›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3917›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3918›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3919›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3920›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3921›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3922›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹3923›/‹2›/‹1›],... 75 skipped ..., ResolveIntent [/Table/3/1/‹3999›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹4000›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹4001›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹4002›/‹2›/‹1›], ResolveIntent [/Table/3/1/‹4003›/‹2›/‹1›], [max_span_request_keys: 0], [target_bytes: 4194304] to replica (n1,s1):1; resp: ‹(err: result is ambiguous: after 10.24s of attempting command: context canceled)›

Moving this to KV to see if they can debug the cause of unavailability further.
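
For reference on why the cancellation surfaces as an assertion failure: the check at builder_test.go:329 has the shape of a substring assertion on the returned error, roughly like the sketch below (the helper name and variables are hypothetical stand-ins, not taken from the actual test source):

    // Hypothetical shape of the failing check: the build runs under a small
    // memory monitor and the returned error is expected to mention the
    // monitor's budget being exceeded.
    err := runBuildUnderSmallMonitor(ctx) // hypothetical helper
    require.ErrorContains(t, err, "test-sc-build-mon: memory budget exceeded:",
        "got a memory usage of: %d", memUsed)
    // In this run the node became unavailable first, so err was
    // "context canceled" and the substring check failed.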

@rafiss rafiss changed the title pkg/sql/schemachanger/scbuild/scbuild_test: TestBuildIsMemoryMonitored failed pkg/sql/schemachanger/scbuild/scbuild_test: TestBuildIsMemoryMonitored failed [node became unavailable during DROP DATABASE] Mar 5, 2024
@rafiss rafiss added T-kv KV Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Mar 5, 2024
nvanbenschoten commented

We saw a few other instances of context timeouts on this same CI run (#119945 and #119920).

In the logs leading up to the unavailability, we see indications that the node is overloaded:
I240305 17:05:04.347537 172903 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n1,s1,r7/1:‹/Table/{3-4}›,raft] 149 raft ready handling: 5.00s [append=0.00s, apply=5.00s, , other=0.00s], wrote [append-batch=49 B, apply=456 KiB (1)] pebble stats: [commit-wait 329ns sem 1.859µs]; node might be overloaded

@nvanbenschoten nvanbenschoten added X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 6, 2024
@nvanbenschoten nvanbenschoten closed this as not planned Mar 11, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024