Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: perturbation/metamorphic/backfill failed #137392

Closed
cockroach-teamcity opened this issue Dec 13, 2024 · 19 comments · Fixed by #139219
Closed

roachtest: perturbation/metamorphic/backfill failed #137392

cockroach-teamcity opened this issue Dec 13, 2024 · 19 comments · Fixed by #139219
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Dec 13, 2024

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 603ff88e54d3e3f6e49b0f673abd5ec564bf418b. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/468.

(test_runner.go:1371).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullNormalElasticRepl
  • arch=amd64
  • blockSize=1024
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=350MiB
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=expiration
  • localSSD=true
  • mem=standard
  • numNodes=30
  • numWorkloadNodes=2
  • perturbationDuration=30m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=-7744531226375341113
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-45558

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Dec 13, 2024
@miraradeva
Copy link
Contributor

@kvoli @pav-kv or @sumeerbhola can you also take a look at this one? It seems like many goroutines from n3's stack trace are stuck in google.golang.org/grpc/internal/transport.(*recvBufferReader) for over 100 minutes.

E.g. from the top of this log:

I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +goroutine 4858 [select, 179 minutes]:
I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +google.golang.org/grpc/internal/transport.(*recvBufferReader).read(0xc017e09130, {0xc0177634b0, 0x5, 0x5})
I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +	external/org_golang_google_grpc/internal/transport/transport.go:183 +0x8a
I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +google.golang.org/grpc/internal/transport.(*recvBufferReader).Read(0xc017e09130, {0xc0177634b0?, 0xc014767230?, 0xc010c02f78?})
I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +	external/org_golang_google_grpc/internal/transport/transport.go:177 +0x15b
I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +google.golang.org/grpc/internal/transport.(*transportReader).Read(0xc017ba57d0, {0xc0177634b0?, 0xc010c02ff0?, 0xbf7e05?})
I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +	external/org_golang_google_grpc/internal/transport/transport.go:514 +0x2c
I241213 10:11:33.553619 16 util/log/clog.go:441 â‹® [-] 5018 +io.ReadAtLeast({0x870b920, 0xc017ba57d0}, {0xc0177634b0, 0x5, 0x5}, 0x5)

@kvoli kvoli added A-replication-admission-control-v2 Related to introduction of replication AC v2 P-1 Issues/test failures with a fix SLA of 1 month labels Dec 13, 2024
@sumeerbhola
Copy link
Collaborator

recvBufferReader wasn't touched by any RACv2 changes, AFAIK, including #136969

@miraradeva
Copy link
Contributor

Yup, sorry for the noise. I misread the grpc transport as Raft transport. Been staring at too many failed tests ...

@miraradeva miraradeva removed P-1 Issues/test failures with a fix SLA of 1 month A-replication-admission-control-v2 Related to introduction of replication AC v2 labels Dec 13, 2024
@andrewbaptist
Copy link
Contributor

Looking at this briefly it looks like the backfill never completed. It ran for 2+ hours:

2024/12/13 08:03:16 framework.go:640: test status: T3: inducing perturbation
2024/12/13 10:11:08 test_impl.go:470: test failure #1: full stack retained in failure_1.log: (test_runner.go:1371).runTest: test timed out (3h0m0s)

This looks likely that the backfill is not making progress and is likely a @cockroachdb/sql-foundations issue.

It's definitely possible that the diskBandwidthLimit=350MiB is preventing it from completing and might be related to #137017

@andrewbaptist
Copy link
Contributor

Looking a bit more, the elastic tokens outstanding on n9 are maxed out the entire interval from 8:00 until the end of the test on n9:

https://grafana.testeng.crdb.io/d/J-yAVzkVddd/detailed?from=1734077162757&to=1734084975057&var-cluster=teamcity-18162949-1734073120-05-n32cpu16sm&orgId=1

@aadityasondhi is there some way to tell if this is an accounting bug or if the tokens are really stuck? We do see the elastic tokens being deducted and returned on n9 during this entire interval so its a little hard to tell.

@sumeerbhola
Copy link
Collaborator

@aadityasondhi: We discussed in our weekly RAC weekly. This looks entirely expected given elastic work is throttled earlier.

@andrewbaptist: Please look at log entries printed by kvflowcontroller_metrics at n9 to identify the receiver stores that were overloaded and then pull the io_load_listener logs for those stores. I can take a quick look at those log entries to see if they look sane over the interval that is concerning to you.

@andrewbaptist
Copy link
Contributor

andrewbaptist commented Dec 17, 2024

Thanks @sumeerbhola

The slow node is n9 and from the logs we see:

I241213 08:30:28.810229 417 kv/kvserver/kvflowcontrol/rac2/store_stream.go:308 ⋮ [-] 3778  eval stream t1/s60 was blocked: durations: elastic 30.00377879s tokens delta: regular 0 B (0 B - 0 B) elastic 0 B (0 B - 0 B)
W241213 08:30:28.810256 417 kv/kvserver/kvflowcontrol/rac2/store_stream.go:245 ⋮ [-] 3779  1 blocked eval elastic replication stream(s): t1/s60

Looking at the logs on s60 after the backfill started - they look like this:

I241213 08:16:52.054013 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47976  IO overload: compaction score 0.050 (6 ssts, 1 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46871 (29902 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.0 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.67x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 13.25x+1 B (smoothed 14.78x+1 B) + at-admission-tokens 1.7 KiB, compacted 68 MiB [≈71 MiB], flushed 3.4 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.99, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 938 MiB (prev 1.2 GiB)), writeBW 69 MiB/s, readBW 218 MiB/s, provisioned 350 MiB/s)
I241213 08:17:07.054218 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47978  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 39 MiB (write 39 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46854 (29825 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 39 MiB adjusted-LSM-writes + 953 MiB adjusted-disk-writes + write-model 0.86x+1 B (smoothed 1.26x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 24.27x+1 B (smoothed 19.53x+1 B) + at-admission-tokens 1.3 KiB, compacted 50 MiB [≈60 MiB], flushed 3.3 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.19, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.1 GiB) tokens (write 1.4 GiB (prev 938 MiB)), writeBW 64 MiB/s, readBW 148 MiB/s, provisioned 350 MiB/s)
I241213 08:17:37.054774 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47983  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 47048 (29909 bypassed) with 46 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.71x+1 B (smoothed 1.60x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.53x+1 B (smoothed 16.08x+1 B) + at-admission-tokens 1.6 KiB, compacted 78 MiB [≈74 MiB], flushed 3.8 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.14, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.6 GiB (prev 1.0 GiB)), writeBW 75 MiB/s, readBW 128 MiB/s, provisioned 350 MiB/s)
I241213 08:18:07.055438 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47987  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 39 MiB (write 39 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46932 (29900 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 39 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 0.86x+1 B (smoothed 1.26x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 29.45x+1 B (smoothed 21.79x+1 B) + at-admission-tokens 1.3 KiB, compacted 48 MiB [≈60 MiB], flushed 4.1 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.89, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.0 GiB) tokens (write 644 MiB (prev 1.2 GiB)), writeBW 77 MiB/s, readBW 237 MiB/s, provisioned 350 MiB/s)
I241213 08:18:22.055866 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47988  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46731 (29745 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.49x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.20x+1 B (smoothed 18.00x+1 B) + at-admission-tokens 1.5 KiB, compacted 78 MiB [≈69 MiB], flushed 3.6 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.93, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.5 GiB (prev 644 MiB)), writeBW 74 MiB/s, readBW 122 MiB/s, provisioned 350 MiB/s)
I241213 08:18:37.056331 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47991  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46804 (29990 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.74x+1 B (smoothed 1.62x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.89x+1 B (smoothed 16.45x+1 B) + at-admission-tokens 1.7 KiB, compacted 79 MiB [≈74 MiB], flushed 3.4 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.81, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.2 GiB (prev 1.5 GiB)), writeBW 78 MiB/s, readBW 197 MiB/s, provisioned 350 MiB/s)
I241213 08:18:52.056520 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47992  IO overload: compaction score 0.050 (6 ssts, 1 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46704 (29808 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 952 MiB adjusted-disk-writes + write-model 1.74x+1 B (smoothed 1.68x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 12.09x+1 B (smoothed 14.27x+1 B) + at-admission-tokens 1.7 KiB, compacted 69 MiB [≈72 MiB], flushed 3.7 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.97, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.4 GiB (prev 1.2 GiB)), writeBW 64 MiB/s, readBW 178 MiB/s, provisioned 350 MiB/s)
I241213 08:19:22.062241 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47994  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46854 (29882 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.74x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.98x+1 B (smoothed 17.34x+1 B) + at-admission-tokens 1.5 KiB, compacted 79 MiB [≈69 MiB], flushed 3.4 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 4.04, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.1 GiB) tokens (write 1.2 GiB (prev 280 MiB)), writeBW 79 MiB/s, readBW 129 MiB/s, provisioned 350 MiB/s)
I241213 08:19:37.062511 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47996  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46872 (29840 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.71x+1 B (smoothed 1.61x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 15.18x+1 B (smoothed 16.26x+1 B) + at-admission-tokens 1.7 KiB, compacted 78 MiB [≈74 MiB], flushed 3.4 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.93, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 0 B (prev 1.2 GiB)), writeBW 79 MiB/s, readBW 290 MiB/s, provisioned 350 MiB/s)
I241213 08:19:52.062927 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47997  IO overload: compaction score 0.050 (7 ssts, 1 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46731 (29777 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 982 MiB adjusted-disk-writes + write-model 1.73x+1 B (smoothed 1.67x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 12.49x+1 B (smoothed 14.37x+1 B) + at-admission-tokens 1.7 KiB, compacted 69 MiB [≈71 MiB], flushed 3.6 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization +Inf, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 77 MiB (prev 0 B)), writeBW 65 MiB/s, readBW 260 MiB/s, provisioned 350 MiB/s)
I241213 08:20:07.063275 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 47999  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 39 MiB (write 39 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46936 (29948 bypassed) with 46 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 39 MiB adjusted-LSM-writes + 986 MiB adjusted-disk-writes + write-model 0.87x+1 B (smoothed 1.27x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 25.00x+1 B (smoothed 19.69x+1 B) + at-admission-tokens 1.3 KiB, compacted 49 MiB [≈60 MiB], flushed 3.7 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 14.15, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.1 GiB) tokens (write 1.4 GiB (prev 77 MiB)), writeBW 66 MiB/s, readBW 114 MiB/s, provisioned 350 MiB/s)
I241213 08:20:22.063758 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48001  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46707 (29782 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.74x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.52x+1 B (smoothed 17.10x+1 B) + at-admission-tokens 1.5 KiB, compacted 79 MiB [≈69 MiB], flushed 3.8 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.81, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.1 GiB) tokens (write 2.7 GiB (prev 1.4 GiB)), writeBW 76 MiB/s, readBW 79 MiB/s, provisioned 350 MiB/s)
I241213 08:21:52.066025 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48010  IO overload: compaction score 0.050 (11 ssts, 1 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46624 (29699 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 857 MiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.67x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 11.06x+1 B (smoothed 13.56x+1 B) + at-admission-tokens 1.7 KiB, compacted 56 MiB [≈65 MiB], flushed 3.6 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.39, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.5 GiB (prev 845 MiB)), writeBW 57 MiB/s, readBW 128 MiB/s, provisioned 350 MiB/s)
I241213 08:22:22.069781 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48012  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46763 (29769 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.75x+1 B (smoothed 1.51x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 15.14x+1 B (smoothed 18.18x+1 B) + at-admission-tokens 1.6 KiB, compacted 79 MiB [≈71 MiB], flushed 3.7 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.05, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 2.0 GiB (prev 1.1 GiB)), writeBW 80 MiB/s, readBW 80 MiB/s, provisioned 350 MiB/s)
I241213 08:24:37.072590 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48027  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46848 (29816 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.61x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.54x+1 B (smoothed 16.84x+1 B) + at-admission-tokens 1.6 KiB, compacted 78 MiB [≈74 MiB], flushed 3.6 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.24, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.3 GiB) tokens (write 909 MiB (prev 1.0 GiB)), writeBW 76 MiB/s, readBW 219 MiB/s, provisioned 350 MiB/s)
I241213 08:24:52.073044 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48028  IO overload: compaction score 0.050 (15 ssts, 1 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46933 (29962 bypassed) with 46 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 787 MiB adjusted-disk-writes + write-model 1.73x+1 B (smoothed 1.67x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 9.99x+1 B (smoothed 13.42x+1 B) + at-admission-tokens 1.7 KiB, compacted 54 MiB [≈64 MiB], flushed 3.6 GiB [≈1.1 GiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.35, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.3 GiB (prev 909 MiB)), writeBW 52 MiB/s, readBW 160 MiB/s, provisioned 350 MiB/s)
I241213 08:25:22.073683 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48030  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 80 MiB (write 80 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46548 (29492 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 80 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.77x+1 B (smoothed 1.52x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.61x+1 B (smoothed 17.65x+1 B) + at-admission-tokens 1.6 KiB, compacted 80 MiB [≈72 MiB], flushed 3.6 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.81, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.4 GiB (prev 658 MiB)), writeBW 78 MiB/s, readBW 127 MiB/s, provisioned 350 MiB/s)
I241213 08:25:37.074017 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48032  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46436 (29647 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 1.5 GiB adjusted-disk-writes + write-model 1.76x+1 B (smoothed 1.64x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 20.00x+1 B (smoothed 18.82x+1 B) + at-admission-tokens 1.7 KiB, compacted 79 MiB [≈76 MiB], flushed 3.5 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.82, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.3 GiB (prev 1.4 GiB)), writeBW 105 MiB/s, readBW 192 MiB/s, provisioned 350 MiB/s)
I241213 08:25:52.074355 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48033  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 39 MiB (write 39 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 47116 (30261 bypassed) with 46 MiB acc-write (29 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 39 MiB adjusted-LSM-writes + 724 MiB adjusted-disk-writes + write-model 0.86x+1 B (smoothed 1.25x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 18.51x+1 B (smoothed 18.67x+1 B) + at-admission-tokens 1.3 KiB, compacted 39 MiB [≈57 MiB], flushed 3.7 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.07, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.4 GiB) tokens (write 64 MiB (prev 1.3 GiB)), writeBW 48 MiB/s, readBW 276 MiB/s, provisioned 350 MiB/s)
I241213 08:26:07.074609 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48035  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 79 MiB (write 79 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 47149 (30009 bypassed) with 46 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 79 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.73x+1 B (smoothed 1.49x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.53x+1 B (smoothed 16.60x+1 B) + at-admission-tokens 1.5 KiB, compacted 79 MiB [≈68 MiB], flushed 3.6 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 16.68, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.0 GiB) tokens (write 1.3 GiB (prev 64 MiB)), writeBW 77 MiB/s, readBW 108 MiB/s, provisioned 350 MiB/s)
I241213 08:26:22.075104 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48036  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46834 (29878 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.3 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.60x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 16.76x+1 B (smoothed 16.68x+1 B) + at-admission-tokens 1.6 KiB, compacted 78 MiB [≈73 MiB], flushed 3.7 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.85, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.1 GiB) tokens (write 1.0 GiB (prev 1.3 GiB)), writeBW 87 MiB/s, readBW 210 MiB/s, provisioned 350 MiB/s)
I241213 08:26:37.075564 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48039  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 47094 (30044 bypassed) with 46 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.4 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.66x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 18.49x+1 B (smoothed 17.58x+1 B) + at-admission-tokens 1.7 KiB, compacted 78 MiB [≈76 MiB], flushed 3.5 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.17, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.2 GiB (prev 1.0 GiB)), writeBW 97 MiB/s, readBW 180 MiB/s, provisioned 350 MiB/s)
I241213 08:26:52.075936 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48040  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 39 MiB (write 39 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46396 (29573 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 39 MiB adjusted-LSM-writes + 939 MiB adjusted-disk-writes + write-model 0.88x+1 B (smoothed 1.27x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 23.81x+1 B (smoothed 20.70x+1 B) + at-admission-tokens 1.3 KiB, compacted 39 MiB [≈58 MiB], flushed 3.7 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.03, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.3 GiB) tokens (write 304 MiB (prev 1.2 GiB)), writeBW 63 MiB/s, readBW 260 MiB/s, provisioned 350 MiB/s)
I241213 08:27:07.076204 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48042  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 77 MiB (write 77 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 47241 (30238 bypassed) with 46 MiB acc-write (29 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 77 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.69x+1 B (smoothed 1.48x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 15.35x+1 B (smoothed 18.02x+1 B) + at-admission-tokens 1.5 KiB, compacted 77 MiB [≈68 MiB], flushed 2.9 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 3.96, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.3 GiB (prev 304 MiB)), writeBW 79 MiB/s, readBW 130 MiB/s, provisioned 350 MiB/s)
I241213 08:27:22.076790 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48044  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46680 (29692 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.4 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.60x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 18.01x+1 B (smoothed 18.02x+1 B) + at-admission-tokens 1.6 KiB, compacted 78 MiB [≈73 MiB], flushed 3.4 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.94, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 0 B (prev 1.3 GiB)), writeBW 94 MiB/s, readBW 286 MiB/s, provisioned 350 MiB/s)
I241213 08:27:37.077058 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48049  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 47182 (30077 bypassed) with 46 MiB acc-write (29 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.70x+1 B (smoothed 1.65x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 15.20x+1 B (smoothed 16.61x+1 B) + at-admission-tokens 1.7 KiB, compacted 78 MiB [≈75 MiB], flushed 3.6 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization +Inf, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.3 GiB) tokens (write 0 B (prev 0 B)), writeBW 79 MiB/s, readBW 287 MiB/s, provisioned 350 MiB/s)
I241213 08:27:52.077290 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48052  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 39 MiB (write 39 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46581 (29637 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 39 MiB adjusted-LSM-writes + 738 MiB adjusted-disk-writes + write-model 0.87x+1 B (smoothed 1.26x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 18.87x+1 B (smoothed 17.74x+1 B) + at-admission-tokens 1.3 KiB, compacted 39 MiB [≈57 MiB], flushed 3.8 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization +Inf, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.2 GiB (prev 0 B)), writeBW 49 MiB/s, readBW 114 MiB/s, provisioned 350 MiB/s)
I241213 08:28:07.077652 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48055  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 47080 (29992 bypassed) with 46 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.49x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 15.87x+1 B (smoothed 16.81x+1 B) + at-admission-tokens 1.5 KiB, compacted 78 MiB [≈68 MiB], flushed 3.9 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.85, tokensUsed (elastic 0 B, snapshot 0 B, regular 1020 MiB) tokens (write 2.6 GiB (prev 1.2 GiB)), writeBW 83 MiB/s, readBW 88 MiB/s, provisioned 350 MiB/s)
I241213 08:29:37.080277 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48064  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46506 (29647 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.72x+1 B (smoothed 1.62x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.69x+1 B (smoothed 16.38x+1 B) + at-admission-tokens 1.7 KiB, compacted 78 MiB [≈74 MiB], flushed 3.6 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.43, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1.4 GiB (prev 863 MiB)), writeBW 76 MiB/s, readBW 146 MiB/s, provisioned 350 MiB/s)
I241213 08:29:52.080700 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48066  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 39 MiB (write 39 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46997 (30047 bypassed) with 46 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 39 MiB adjusted-LSM-writes + 811 MiB adjusted-disk-writes + write-model 0.86x+1 B (smoothed 1.24x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 20.69x+1 B (smoothed 18.54x+1 B) + at-admission-tokens 1.3 KiB, compacted 39 MiB [≈57 MiB], flushed 3.6 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.84, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.2 GiB) tokens (write 1002 MiB (prev 1.4 GiB)), writeBW 54 MiB/s, readBW 213 MiB/s, provisioned 350 MiB/s)
I241213 08:30:07.080933 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48068  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46631 (29782 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.73x+1 B (smoothed 1.49x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 14.27x+1 B (smoothed 16.40x+1 B) + at-admission-tokens 1.5 KiB, compacted 78 MiB [≈68 MiB], flushed 3.5 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.04, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.0 GiB) tokens (write 2.0 GiB (prev 1002 MiB)), writeBW 74 MiB/s, readBW 77 MiB/s, provisioned 350 MiB/s)
I241213 08:31:22.082710 755 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n30,s60] 48075  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 78 MiB (write 78 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 46905 (30040 bypassed) with 45 MiB acc-write (28 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 78 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.71x+1 B (smoothed 1.49x+1 B) + ingested-model 0.00x+0 B (smoothed 0.56x+1 B) + write-amp-model 15.19x+1 B (smoothed 19.46x+1 B) + at-admission-tokens 1.5 KiB, compacted 78 MiB [≈71 MiB], flushed 3.6 GiB [≈808 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 0.83, tokensUsed (elastic 0 B, snapshot 0 B, regular 1.3 GiB) tokens (write 2.0 GiB (prev 1.6 GiB)), writeBW 79 MiB/s, readBW 120 MiB/s, provisioned 350 MiB/s)

Note that the logs stop at 08:31, although the system stays like this until 10:00 when the test times out.

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 20, 2024
@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 270ba1e8fe656fab0a182643fb77a6a5be64c1b0. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/491.

(test_runner.go:1363).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullBoth
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=350MiB
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=leader
  • localSSD=true
  • mem=standard
  • numNodes=5
  • numWorkloadNodes=1
  • perturbationDuration=30m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=9187854380536044437
  • splits=100
  • ssd=1
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ e5ef95eead313b215d8acec85a9ed124d6a1a193:

(framework.go:818).waitForIOOverloadToEnd: dial tcp 34.139.50.14:26257: connect: connection refused
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=elasticOnlyBoth
  • arch=amd64
  • blockSize=1024
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=leader
  • localSSD=true
  • mem=standard
  • numNodes=30
  • numWorkloadNodes=2
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=-7837896781131178139
  • splits=1
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/497.

(test_runner.go:1363).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=elasticOnlyBoth
  • arch=amd64
  • blockSize=1024
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=350MiB
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=30
  • numWorkloadNodes=2
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=2784578325356580550
  • splits=1
  • ssd=1
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 5207cd59f4cfd8444cb3d7739f53063bed4ae1a6:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/index_backfill.go:91
	            				pkg/cmd/roachtest/tests/perturbation/framework.go:643
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: polling for queued jobs to complete: poll-show-jobs: root: memory budget exceeded: 10240 bytes requested, 3979964995 currently allocated, 3979966464 bytes in budget
	Test:       	perturbation/metamorphic/backfill
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullNormalElasticRepl
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • diskBandwidthLimit=350MiB
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=expiration
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=-6837933855694557550
  • splits=1
  • ssd=1
  • validationDuration=5m0s
  • vcpu=4
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ bc6d6e05a7c0f9ffd8103740239fdbc83fa78e3f. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/508.

(test_runner.go:1363).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullBoth
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • diskBandwidthLimit=350MiB
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=expiration
  • localSSD=true
  • mem=high
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=30m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=-6052411448360532762
  • splits=100
  • ssd=1
  • validationDuration=5m0s
  • vcpu=4
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:

(framework.go:818).waitForIOOverloadToEnd: dial tcp 34.139.130.103:26257: connect: connection refused
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=elasticOnlyBoth
  • arch=amd64
  • blockSize=1024
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=350MiB
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=leader
  • localSSD=true
  • mem=standard
  • numNodes=30
  • numWorkloadNodes=2
  • perturbationDuration=30m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=7197251714929239604
  • splits=100
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1315
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/backfill
	Messages:   	FAILURE: follower-read  : Increase 511122.0876 > 40.0000 BASE: 7.043327ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 517738.2890 > 40.0000 BASE: 6.95332ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 289125.6471 > 40.0000 BASE: 12.451334ms SCORE: 1h0m0s
	            	
	            	FAILURE: follower-read  : Increase 511122.0876 > 40.0000 BASE: 7.043327ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 517738.2890 > 40.0000 BASE: 6.95332ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 289125.6471 > 40.0000 BASE: 12.451334ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullBoth
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=350MiB
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=-6167743221055224206
  • splits=10000
  • ssd=1
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/512.

(test_runner.go:1364).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullNormalElasticRepl
  • arch=amd64
  • blockSize=1024
  • cloud=gce
  • coverageBuild=false
  • cpu=32
  • diskBandwidthLimit=350MiB
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=expiration
  • localSSD=true
  • mem=standard
  • numNodes=5
  • numWorkloadNodes=1
  • perturbationDuration=30m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=7337049727513209251
  • splits=10000
  • ssd=1
  • validationDuration=5m0s
  • vcpu=32
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 31e84cb3a57c52a779ff0982c95fb26646b54926:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1315
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/backfill
	Messages:   	FAILURE: follower-read  : Increase 53.8826 > 40.0000 BASE: 10.789188ms SCORE: 581.348992ms
	            	
	            	FAILURE: read           : Increase 73.8200 > 40.0000 BASE: 7.557572ms SCORE: 557.899904ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=elasticOnlyBoth
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • diskBandwidthLimit=0
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=expiration
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=30m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=-2981753407619752889
  • splits=1
  • ssd=1
  • validationDuration=5m0s
  • vcpu=8
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 097438ac38e411b0fde101ebcae5cf97a798d1db. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/513.

(test_runner.go:1364).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullBoth
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=350MiB
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=leader
  • localSSD=true
  • mem=standard
  • numNodes=30
  • numWorkloadNodes=2
  • perturbationDuration=10s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=-4564004692011702394
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 0b4d620740733ec61cf50ca26d19814299d91f8e. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/517.

(test_runner.go:1364).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • acMode=fullBoth
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • diskBandwidthLimit=350MiB
  • disks=1
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=leader
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=30m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=6735881403210988339
  • splits=10000
  • ssd=1
  • validationDuration=5m0s
  • vcpu=4
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@stevendanna stevendanna self-assigned this Jan 15, 2025
@andrewbaptist
Copy link
Contributor

It might make sense to disable the disk bandwidth limiter for backfill tests. It seems to prevent the backfills from completing.

There are three different failure modes:

  1. Node crashes from OOM's (these look like "connection refused") errors
  2. Backfill never completes - I think these all have the disk bandwidth limiter enabled.
  3. Slowdown of foreground traffic during the backfill - these mostly have the disk bandwidth limiter enabled as well.

The OOM's are the most interesting to look at first.

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Jan 16, 2025
The perturbation/*/backfill tests are flaky and are failing at least
once a week with the default configuration. This change temporarily
disables the check to allow easier investigation of the other failure
modes such as backfill failing to complete and node OOMs. Once those are
closed, and the test is running more stably, this threshold can be
dropped.

Fixes: cockroachdb#137093
Fixes: cockroachdb#137392

Informs: cockroachdb#133114

Release note: None
craig bot pushed a commit that referenced this issue Jan 16, 2025
138688: server: fix admin server Settings RPC redaction logic r=kyle-a-wong a=kyle-a-wong

Previously admin.Settings only allowed admins to view all cluster settings without redaction. If the
requester was not an admin, would use the isReportable field on settings to determine if the setting should be redacted or not. This API also had outdated logic, as users with the MODIFYCLUSTERSETTINGS should also be able to view all cluster settings (See #115356 for more discussions on this).

This patch respects this new role, and no longer uses the `isReportable` setting flag to determine if a setting should be redacted. This is implemented by query `crdb_internal.cluster_settings` directly, allowing the sql layer to permission check.

This commit also removes the `unredacted_values` from the request entity as well, since it is no longer necessary.

Ultimately, this commit updates the Settings RPC to have the same redaction logic as querying `crdb_internal.cluster_settings` or using `SHOW CLUSTER SETTINGS`.

Epic: None
Fixes: #137698
Release note (general change): The /_admin/v1/settings API now returns cluster settings using the same redaction logic as querying `SHOW CLUSTER SETTINGS` and `crdb_internal.cluster_settings`. This means that only settings flagged as "sensitive" will be redacted, all other settings will be visible. The same authorization is required for this endpoint, meaning the user must be an admin or have MODIFYCLUSTERSETTINGS or VIEWCLUSTERSETTINGS roles to hit this API. The exception is that if the user has VIEWACTIVITY or VIEWACTIVITYREDACTED, they will see console only settings.

138967: crosscluster/physical: return job id in SHOW TENANT WITH REPLICATION STATUS r=dt a=msbutler

Fixes #138548

Release note (sql change): SHOW TENANT WITH REPLICATION STATUS will now display the `ingestion_job_id` column after the `name` column.

139043: crosscluster/logical: ensure offline scan procs shut down before next phase r=dt a=msbutler

This patch adds a check that attempts to wait for the offline scan processors to spin down before transitioning to steady state ingestion or OnFailOrCancel during an offline scan.

Epic: none

Release note: none

139219: roachtest: disable backfill success check r=stevendanna a=andrewbaptist

The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes such as backfill failing to complete and node OOMs. Once those are closed, and the test is running more stably, this threshold can be dropped.

Fixes: #137093
Fixes: #137392

Informs: #133114

Release note: None

139259: sql: deflake TestIndexBackfillFractionTracking r=rafiss a=rafiss

Recent changes added some concurrency to index backfills, so the testing hook needs a mutex to prevent concurrent access.

fixes #139213
Release note: None

Co-authored-by: Kyle Wong <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in 1a6bbf3 Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants