
roachtest: weekly/tpcc/headroom failed #120978

Closed
cockroach-teamcity opened this issue Mar 24, 2024 · 5 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability
Milestone

Comments


cockroach-teamcity commented Mar 24, 2024

roachtest.weekly/tpcc/headroom failed with artifacts on master @ 4a9385cacb82e7a8d6d37e5d9a26a6b7c845aab6:

(monitor.go:154).Wait: monitor failure: full command output in run_111114.835886915_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-36983

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Mar 24, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 24, 2024

srosenberg commented Mar 25, 2024

More log spamming/pollution in SQL stats; the following lines are repeated ad nauseam,

W240324 09:10:11.496638 631 sql/sqlstats/persistedsqlstats/provider.go:217 ⋮ [T1,Vsystem,n2] 6833383  sql-stats-worker: unable to signal flush completion
I240324 09:10:11.497893 6867 sql/sql_activity_update_job.go:245 ⋮ [T1,Vsystem,n2,job=‹AUTO UPDATE SQL ACTIVITY id=103›] 6833384  sql stats activity found no rows at 2024-03-24 08:00:00 +0000 UTC

Similar to [1], this is making triage rather challenging. The workload eventually fails due to the unavailability of a leaseholder on n2,

Error: error in newOrder: ERROR: replica unavailable: (n2,s2):3 unable to serve request to r358:/Table/114/1/2{85/5/-1446/10-96/1/-1256/10} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=28, sticky=1711192518.082281645,0]: closed timestamp: 1711271312.802185231,0 (2024-03-24 09:08:32); raft status: {"id":"3","term":18,"vote":"0","commit":396523,"lead":"1","raftState":"StateFollower","applied":396523,"progress":{},"leadtransferee":"0"}: encountered poisoned latch /Table/114/1/291/6/-4633/1/[email protected],0 (SQLSTATE XXUUU)

Prior to the fatal failure, the circuit breaker trips several times, e.g.,

E240324 09:09:41.471646 1160978975 kv/kvserver/replica_circuit_breaker.go:167 ⋮ [T1,Vsystem,n2,s2,r114/3:‹/Table/108/1/6{71/8/…-86/4/…}›] 6823987  breaker: tripped with error: replica unavailable: (n2,s2):3 unable to serve request to r114:‹/Table/108/1/6{71/8/945-86/4/411}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=16, sticky=1711195332.119851231,0]: closed timestamp: 1711271312.896702123,0 (2024-03-24 09:08:32); raft status: {"id":"3","term":14,"vote":"0","commit":541640,"lead":"1","raftState":"StateFollower","applied":541640,"progress":{},"leadtransferee":"0"}: have been waiting 61.00s for slow proposal Put [/Table/108/1/‹673›/‹1›/‹579›/‹0›], [txn: 164a6761]

It appears that n2 is flapping for > 1h, eventually becoming unavailable.

[1] #120179

@srosenberg

The logs on n2 were truncated owing to the above. CPU profiles confirm the saw-tooth pattern observed in the CPU utilization; ~23% of time is spent GCing,

[image: cpu]

Heap profiles confirm that SQL stats are taking up a significant chunk of the inuse heap,

[image: heap_inuse]

Given the frequency of `sql-stats-worker: unable to signal flush completion`, and the absence of any other error context, it's unclear why SQL stats cannot be flushed to disk.

Reassigning to @cockroachdb/cluster-observability for further investigation.


xinhaoz commented Mar 26, 2024

It seems like this PR (#116444) is the culprit: it causes writes to SQL stats to be halted after processing 100,000 statement and 100,000 transaction fingerprints. Specifically, [this section](https://github.com/cockroachdb/cockroach/pull/116444/files#diff-eb9d7df12ea768da5a41031449150b2473270b047efd8a191dca9a38516dd580R608-R620).

What's happening in the test
According to system.sql_stats_cardinality.txt, at some point in the hour after 2024-03-24 01:00:00+00 we stop writing to the SQL stats in-memory containers. However, the logs at this time show that:

  1. The SQL stats flush worker is unable to deliver its 'done' signal to the SQL activity updater job, indicating that the job is still running when the flush completes. The frequency of this log line suggests that the flush is looping without respecting the 10m flush interval, and that the flush operation itself is finishing in a short amount of time (see the sketch after this list).
  2. The SQL activity update job reports no rows in the persisted tables for the current and previous hours (which matches the cardinality output above).
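
For illustration, here is a minimal sketch of the kind of non-blocking signalling that would produce the "unable to signal flush completion" warning when the update job hasn't consumed the previous signal; the channel and function names are hypothetical and not the actual provider.go code:

```go
package main

import "fmt"

// flushDoneSig stands in for a buffered channel the flush worker uses to tell
// the SQL activity update job that a flush has completed.
var flushDoneSig = make(chan struct{}, 1)

// notifyFlushDone performs a non-blocking send: if the previous signal has not
// been consumed yet (the update job is still running), the new signal is
// dropped and a warning is logged, mirroring
// "sql-stats-worker: unable to signal flush completion".
func notifyFlushDone() {
	select {
	case flushDoneSig <- struct{}{}:
	default:
		fmt.Println("sql-stats-worker: unable to signal flush completion")
	}
}

func main() {
	// With a flush firing on nearly every statement, most sends hit the
	// default branch and the warning is logged repeatedly.
	for i := 0; i < 3; i++ {
		notifyFlushDone()
	}
}
```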

Explanation
Each node is allowed to track up to 100,000 fingerprint stats entries across all per-application SQL stats containers. If that limit is reached before the scheduled flush to disk, we send a memory-pressure signal to flush early. After flushing, we're supposed to reset the in-memory state by decrementing each application container's size from the total tracked fingerprint count, and then 'prep' each app container for reuse by clearing it and shrinking its capacity by half.
The PR above switched the ordering of the counter-decrementing and container-clearing steps, so we never actually reset the counter after a flush. Eventually we hit the max-fingerprints threshold and there is no way to come back down from it, so we start sending the memory-pressure signal to flush on every statement. Each of these flushes operates on empty containers, so it completes immediately, and then tries to trigger the update job. Because of the flush frequency, most of these signals are dropped, since the previous update job is still running, having been triggered only moments before.

To fix the reset issue, it seems that we can replace the chunk in the PR with Container.Clear, which performs these operations in the correct order (see the sketch below).
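
To make the ordering issue concrete, here is a minimal, self-contained sketch (not the actual persistedsqlstats code; `registry`, `appContainer`, and the method names are invented for illustration) showing why clearing the container before decrementing the global counter means the counter never comes back down:

```go
package main

import "fmt"

// appContainer holds in-memory fingerprint stats for one application.
type appContainer struct {
	fingerprints map[string]struct{}
}

// registry tracks the total fingerprint count across all app containers;
// crossing the per-node limit triggers the memory-pressure flush.
type registry struct {
	total      int
	containers map[string]*appContainer
}

// buggyReset mirrors the reordered steps: the container is cleared first, so
// the decrement reads a size of zero and total never comes back down.
func (r *registry) buggyReset(app string) {
	c := r.containers[app]
	c.fingerprints = make(map[string]struct{}) // clear for reuse...
	r.total -= len(c.fingerprints)             // ...then subtract 0
}

// correctReset decrements by the container's size before clearing it, which is
// the ordering Container.Clear is expected to preserve.
func (r *registry) correctReset(app string) {
	c := r.containers[app]
	r.total -= len(c.fingerprints)
	c.fingerprints = make(map[string]struct{})
}

func main() {
	newRegistry := func() *registry {
		return &registry{
			total: 3,
			containers: map[string]*appContainer{
				"tpcc": {fingerprints: map[string]struct{}{"a": {}, "b": {}, "c": {}}},
			},
		}
	}

	r1 := newRegistry()
	r1.buggyReset("tpcc")
	fmt.Println("total after buggy reset:", r1.total) // 3: the counter only grows

	r2 := newRegistry()
	r2.correctReset("tpcc")
	fmt.Println("total after correct reset:", r2.total) // 0
}
```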

In practice we should never see this log line spammed at such a frequency: if we flush due to memory pressure, the flush should take a while to complete, and otherwise flushes follow the regular flush interval.
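
For reference, a rough sketch of the flush-loop shape described here, assuming early flushes are driven by a memory-pressure channel alongside the regular 10m ticker (illustrative only, not the actual CockroachDB implementation):

```go
package main

import (
	"fmt"
	"time"
)

// flushLoop sketches the intended behaviour: flush on a fixed interval, or
// early when a memory-pressure signal arrives. With the reset bug, the
// memory-pressure case fires on nearly every statement and each flush of the
// already-empty containers returns immediately, producing the log spam above.
func flushLoop(stop <-chan struct{}, memPressure <-chan struct{}, flush func()) {
	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			flush() // regular, interval-driven flush
		case <-memPressure:
			flush() // early flush; should be rare in a healthy cluster
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	memPressure := make(chan struct{}, 1)
	go flushLoop(stop, memPressure, func() { fmt.Println("flushing sql stats") })

	memPressure <- struct{}{} // simulate hitting the fingerprint limit
	time.Sleep(50 * time.Millisecond)
	close(stop)
}
```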

As an aside/followup, the obs team is actively working on getting some roachprod tests set up to exercise the SQL stats system at maximum capacity, to catch issues like this. For this specific case we can also add a unit test to verify that the 'reset' behaviour completes as expected after a flush.


xinhaoz commented Mar 26, 2024

#121134
Will fix today.

@dhartunian

The fix in #121134 has been merged. I'm closing this issue; we'll circle back with the headroom test during next week's run.
