roachtest: weekly/tpcc/headroom failed #120978
Comments
More log spamming/pollution in SQL stats; the following lines are repeated ad nauseam,
Similar to [1], this is making triage rather challenging. The workload eventually fails due to the unavailability of a leaseholder on
Prior to the fatal failure, circuit breaker is triggered several times, e.g.,
It appears that
[1] #120179
The logs on
Heap profiles confirm that SQL stats are taking up a significant chunk of the inuse heap.
Given the frequency of
Reassigning to @cockroachdb/cluster-observability for further investigation.
It seems like this PR is the culprit: it causes writes to SQL stats to be halted after processing 100,000 statement and 100,000 transaction fingerprints. Specifically, [this section](https://github.com/cockroachdb/cockroach/pull/116444/files#diff-eb9d7df12ea768da5a41031449150b2473270b047efd8a191dca9a38516dd580R608-R620).
What's happening in the test
Explanation
To fix the reset issue, it seems that we can replace the chunk in the PR with
In practice we should never see the log line being spammed here at such a frequency, because if we flush due to memory pressure, the flush should take a while to complete and would otherwise follow the flush interval.
Just as an aside/followup, the obs team is actively working on getting some roachprod tests set up to test the SQL stats system operating at maximum capacity, to catch things like this. I think for this specific case we can also add a unit test to verify that the 'reset' behaviour completes as expected after a flush.
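For illustration only, here is a rough Go sketch of the kind of reset-after-flush behaviour a unit test could check. It is not the code from the PR or from CockroachDB: `statsContainer`, `record`, `flush`, and the explicit `unique` counter are all hypothetical names, and the assumption that the limit is tracked by a counter that must be zeroed on flush is mine, not something confirmed by this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// statsContainer is a hypothetical in-memory stats store with a cap on the
// number of distinct fingerprints it will track between flushes.
type statsContainer struct {
	mu     sync.Mutex
	limit  int              // max distinct fingerprints held in memory
	unique int              // distinct fingerprints currently tracked
	counts map[string]int64 // fingerprint -> execution count
}

func newStatsContainer(limit int) *statsContainer {
	return &statsContainer{limit: limit, counts: make(map[string]int64)}
}

// record counts one execution of fp. It returns false when fp is new and the
// fingerprint limit has already been reached, i.e. the write is dropped.
func (c *statsContainer) record(fp string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.counts[fp]; !ok {
		if c.unique >= c.limit {
			return false
		}
		c.unique++
	}
	c.counts[fp]++
	return true
}

// flush hands the accumulated stats to persist and resets the in-memory
// state. Forgetting to zero c.unique here would leave the container
// rejecting new fingerprints forever, which is the failure mode assumed
// in this sketch.
func (c *statsContainer) flush(persist func(map[string]int64)) {
	c.mu.Lock()
	flushed := c.counts
	c.counts = make(map[string]int64)
	c.unique = 0
	c.mu.Unlock()
	persist(flushed)
}

func main() {
	c := newStatsContainer(2)
	fmt.Println(c.record("a"), c.record("b"), c.record("c"))
	c.flush(func(m map[string]int64) { fmt.Println("flushed", len(m)) })
	fmt.Println(c.record("c"))
}
```

A test along these lines would fill the container to its limit, confirm that a new fingerprint is rejected, flush, and then confirm that the same fingerprint is accepted again.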
#121134
Fix to #121134 has been merged. I'm closing this issue and we'll circle back with the headroom test during next week's run.
roachtest.weekly/tpcc/headroom failed with artifacts on master @ 4a9385cacb82e7a8d6d37e5d9a26a6b7c845aab6:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-36983