
roachtest: weekly/tpcc/headroom failed #120978

Closed
cockroach-teamcity opened this issue Mar 24, 2024 · 5 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability
Milestone

Comments


cockroach-teamcity commented Mar 24, 2024

roachtest.weekly/tpcc/headroom failed with artifacts on master @ 4a9385cacb82e7a8d6d37e5d9a26a6b7c845aab6:

(monitor.go:154).Wait: monitor failure: full command output in run_111114.835886915_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-36983

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Mar 24, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 24, 2024

srosenberg commented Mar 25, 2024

More log spamming/pollution in SQL stats; the following lines are repeated ad nauseam,

W240324 09:10:11.496638 631 sql/sqlstats/persistedsqlstats/provider.go:217 ⋮ [T1,Vsystem,n2] 6833383  sql-stats-worker: unable to signal flush completion
I240324 09:10:11.497893 6867 sql/sql_activity_update_job.go:245 ⋮ [T1,Vsystem,n2,job=‹AUTO UPDATE SQL ACTIVITY id=103›] 6833384  sql stats activity found no rows at 2024-03-24 08:00:00 +0000 UTC

Similar to [1], this is making triage rather challenging. The workload eventually fails due to the unavailability of a leaseholder on n2,

Error: error in newOrder: ERROR: replica unavailable: (n2,s2):3 unable to serve request to r358:/Table/114/1/2{85/5/-1446/10-96/1/-1256/10} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=28, sticky=1711192518.082281645,0]: closed timestamp: 1711271312.802185231,0 (2024-03-24 09:08:32); raft status: {"id":"3","term":18,"vote":"0","commit":396523,"lead":"1","raftState":"StateFollower","applied":396523,"progress":{},"leadtransferee":"0"}: encountered poisoned latch /Table/114/1/291/6/-4633/1/[email protected],0 (SQLSTATE XXUUU)

Prior to the fatal failure, the circuit breaker trips several times, e.g.,

E240324 09:09:41.471646 1160978975 kv/kvserver/replica_circuit_breaker.go:167 ⋮ [T1,Vsystem,n2,s2,r114/3:‹/Table/108/1/6{71/8/…-86/4/…}›] 6823987  breaker: tripped with error: replica unavailable: (n2,s2):3 unable to serve request to r114:‹/Table/108/1/6{71/8/945-86/4/411}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=16, sticky=1711195332.119851231,0]: closed timestamp: 1711271312.896702123,0 (2024-03-24 09:08:32); raft status: {"id":"3","term":14,"vote":"0","commit":541640,"lead":"1","raftState":"StateFollower","applied":541640,"progress":{},"leadtransferee":"0"}: have been waiting 61.00s for slow proposal Put [/Table/108/1/‹673›/‹1›/‹579›/‹0›], [txn: 164a6761]

It appears that n2 is flapping for > 1h, eventually becoming unavailable.

[1] #120179

@srosenberg

The logs on n2 were truncated owing to the above. CPU profiles confirm the saw-tooth pattern observed in the CPU utilization; ~23% of time is spent GCing,

[image: cpu]

Heap profiles confirm that SQL stats are taking up a significant chunk of the inuse heap,

[image: heap_inuse]

Given the frequency of `sql-stats-worker: unable to signal flush completion`, and the absence of any other error context, it's unclear why SQL stats cannot be flushed to disk.

Reassigning to @cockroachdb/cluster-observability for further investigation.


xinhaoz commented Mar 26, 2024

It seems like this PR (#116444) is the culprit: it causes writes to SQL stats to be halted after processing 100,000 statement and 100,000 transaction fingerprints. Specifically, [this section](https://github.com/cockroachdb/cockroach/pull/116444/files#diff-eb9d7df12ea768da5a41031449150b2473270b047efd8a191dca9a38516dd580R608-R620).

What's happening in the test
According to system.sql_stats_cardinality.txt, at some point in the hour after 2024-03-24 01:00:00+00 we stop writing to the SQL stats in-memory containers. However, the logs at this time show that:

  1. The SQL stats flush worker is unable to deliver its 'done' signal to the SQL activity updater job, indicating that the job is still running when the flush completes. The frequency of this log line suggests that the flush is looping without respecting the 10m flush interval, and that the flush operation itself is finishing in a short amount of time (see the sketch after this list).
  2. The SQL activity update job reports no rows in the persisted tables for the current and previous hours (which matches the cardinality output above).
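
For illustration, here is a minimal sketch of the kind of non-blocking signalling that would produce the "unable to signal flush completion" warning when the update job hasn't consumed the previous signal; the channel and function names are hypothetical and not the actual provider.go code:

```go
package main

import "fmt"

// flushDoneSig stands in for a buffered channel the flush worker uses to tell
// the SQL activity update job that a flush has completed.
var flushDoneSig = make(chan struct{}, 1)

// notifyFlushDone performs a non-blocking send: if the previous signal has not
// been consumed yet (the update job is still running), the new signal is
// dropped and a warning is logged, mirroring
// "sql-stats-worker: unable to signal flush completion".
func notifyFlushDone() {
	select {
	case flushDoneSig <- struct{}{}:
	default:
		fmt.Println("sql-stats-worker: unable to signal flush completion")
	}
}

func main() {
	// With a flush firing on nearly every statement, most sends hit the
	// default branch and the warning is logged repeatedly.
	for i := 0; i < 3; i++ {
		notifyFlushDone()
	}
}
```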

Explanation
Each node is allowed to track up to 100,000 fingerprint stats entries across all per-application SQL stats containers. If that limit is reached before the scheduled flush to disk, we send a memory-pressure signal to flush early. After flushing, we're supposed to reset the in-memory state by decrementing each application container's size from the total tracked fingerprint count, and then 'prep' each app container for reuse by clearing it and shrinking its capacity by half.
The PR above switched the ordering of the counter-decrementing and container-clearing steps, so we never actually reset the counter after a flush. Eventually we hit the max-fingerprints threshold and there is no way to come back down from it, so we start sending the memory-pressure signal to flush on every statement. Each of these flushes operates on empty containers, so it completes immediately, and then tries to trigger the update job. Because of the flush frequency, most of these signals are dropped, since the previous update job is still running, having been triggered only moments before.

To fix the reset issue, it seems that we can replace the chunk in the PR with Container.Clear, which performs these operations in the correct order (see the sketch below).
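
To make the ordering issue concrete, here is a minimal, self-contained sketch (not the actual persistedsqlstats code; `registry`, `appContainer`, and the method names are invented for illustration) showing why clearing the container before decrementing the global counter means the counter never comes back down:

```go
package main

import "fmt"

// appContainer holds in-memory fingerprint stats for one application.
type appContainer struct {
	fingerprints map[string]struct{}
}

// registry tracks the total fingerprint count across all app containers;
// crossing the per-node limit triggers the memory-pressure flush.
type registry struct {
	total      int
	containers map[string]*appContainer
}

// buggyReset mirrors the reordered steps: the container is cleared first, so
// the decrement reads a size of zero and total never comes back down.
func (r *registry) buggyReset(app string) {
	c := r.containers[app]
	c.fingerprints = make(map[string]struct{}) // clear for reuse...
	r.total -= len(c.fingerprints)             // ...then subtract 0
}

// correctReset decrements by the container's size before clearing it, which is
// the ordering Container.Clear is expected to preserve.
func (r *registry) correctReset(app string) {
	c := r.containers[app]
	r.total -= len(c.fingerprints)
	c.fingerprints = make(map[string]struct{})
}

func main() {
	newRegistry := func() *registry {
		return &registry{
			total: 3,
			containers: map[string]*appContainer{
				"tpcc": {fingerprints: map[string]struct{}{"a": {}, "b": {}, "c": {}}},
			},
		}
	}

	r1 := newRegistry()
	r1.buggyReset("tpcc")
	fmt.Println("total after buggy reset:", r1.total) // 3: the counter only grows

	r2 := newRegistry()
	r2.correctReset("tpcc")
	fmt.Println("total after correct reset:", r2.total) // 0
}
```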

In practice we should never see this log line spammed at such a frequency: if we flush due to memory pressure, the flush should take a while to complete, and otherwise flushes follow the regular flush interval.
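
For reference, a rough sketch of the flush-loop shape described here, assuming early flushes are driven by a memory-pressure channel alongside the regular 10m ticker (illustrative only, not the actual CockroachDB implementation):

```go
package main

import (
	"fmt"
	"time"
)

// flushLoop sketches the intended behaviour: flush on a fixed interval, or
// early when a memory-pressure signal arrives. With the reset bug, the
// memory-pressure case fires on nearly every statement and each flush of the
// already-empty containers returns immediately, producing the log spam above.
func flushLoop(stop <-chan struct{}, memPressure <-chan struct{}, flush func()) {
	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			flush() // regular, interval-driven flush
		case <-memPressure:
			flush() // early flush; should be rare in a healthy cluster
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	memPressure := make(chan struct{}, 1)
	go flushLoop(stop, memPressure, func() { fmt.Println("flushing sql stats") })

	memPressure <- struct{}{} // simulate hitting the fingerprint limit
	time.Sleep(50 * time.Millisecond)
	close(stop)
}
```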

As an aside/followup, the obs team is actively working on getting some roachprod tests set up to exercise the SQL stats system at maximum capacity, to catch issues like this. For this specific case we can also add a unit test to verify that the 'reset' behaviour completes as expected after a flush.


xinhaoz commented Mar 26, 2024

#121134
Will fix today.

@dhartunian

The fix in #121134 has been merged. I'm closing this issue; we'll circle back with the headroom test during next week's run.
