roachtest: perturbation/metamorphic/backfill failed #137392
Comments
@kvoli @pav-kv or @sumeerbhola can you also take a look at this one? It seems like many goroutines from n3's stack trace are stuck in E.g. from the top of this log:
Yup, sorry for the noise. I misread the gRPC transport as the Raft transport. Been staring at too many failed tests ...
Looking at this briefly it looks like the backfill never completed. It ran for 2+ hours:
It looks likely that the backfill is not making progress, which would make this a @cockroachdb/sql-foundations issue. It's definitely possible that the
Looking a bit more, the elastic tokens outstanding on n9 are maxed out for the entire interval from 8:00 until the end of the test. @aadityasondhi is there some way to tell if this is an accounting bug or if the tokens are really stuck? We do see the elastic tokens being deducted and returned on n9 during this entire interval, so it's a little hard to tell.
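One rough way to frame the question above (saturated but flowing versus genuinely stuck) is to compare cumulative deducted and returned counters across two samples. The following is a minimal sketch under stated assumptions: `tokenSample`, its fields, and `classify` are hypothetical names for illustration, not CockroachDB's admission control API or its metric names, and the classification is a heuristic rather than proof of an accounting bug.

```go
package main

import "fmt"

// tokenSample is a hypothetical snapshot of cumulative elastic token
// counters for one store. The field names are illustrative only.
type tokenSample struct {
	Deducted int64 // total tokens deducted so far
	Returned int64 // total tokens returned so far
}

// outstanding is the number of tokens deducted but not yet returned.
func (s tokenSample) outstanding() int64 { return s.Deducted - s.Returned }

// classify compares two samples taken some interval apart against an
// assumed outstanding-token limit.
func classify(prev, cur tokenSample, limit int64) string {
	deducted := cur.Deducted - prev.Deducted
	returned := cur.Returned - prev.Returned
	switch {
	case cur.outstanding() < limit:
		return "below limit"
	case deducted == 0 && returned == 0:
		return "stuck: at limit with no token flow"
	case returned > 0:
		return "saturated: at limit but tokens are still cycling"
	default:
		return "suspect: deductions without returns"
	}
}

func main() {
	// Made-up numbers: outstanding stays at the limit while both
	// counters keep advancing.
	prev := tokenSample{Deducted: 1000, Returned: 200}
	cur := tokenSample{Deducted: 1500, Returned: 700}
	fmt.Println(classify(prev, cur, 800))
}
```

With these made-up samples the outstanding count sits at the limit while both counters advance, which matches the "deducted and returned during this entire interval" observation and points toward throttling rather than leaked tokens.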
@aadityasondhi: We discussed this in our RAC weekly. This looks entirely expected given elastic work is throttled earlier. @andrewbaptist: Please look at log entries printed by
Thanks @sumeerbhola. The slow node is n9, and from the logs we see:
Looking at the logs on s60 after the backfill started - they look like this:
Note that the logs stop at 08:31, although the system stays like this until 10:00 when the test times out.
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 270ba1e8fe656fab0a182643fb77a6a5be64c1b0. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/491.
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ e5ef95eead313b215d8acec85a9ed124d6a1a193:
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/497.
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 5207cd59f4cfd8444cb3d7739f53063bed4ae1a6:
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ bc6d6e05a7c0f9ffd8103740239fdbc83fa78e3f. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/508.
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/512.
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 31e84cb3a57c52a779ff0982c95fb26646b54926:
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 097438ac38e411b0fde101ebcae5cf97a798d1db. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/513.
Parameters:
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 0b4d620740733ec61cf50ca26d19814299d91f8e. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/517.
Parameters:
It might make sense to disable the disk bandwidth limiter for backfill tests. It seems to prevent the backfills from completing. There are three different failure modes:
The OOMs are the most interesting to look at first.
The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes such as backfill failing to complete and node OOMs. Once those are closed, and the test is running more stably, this threshold can be dropped.
Fixes: cockroachdb#137093
Fixes: cockroachdb#137392
Informs: cockroachdb#133114
Release note: None
138688: server: fix admin server Settings RPC redaction logic r=kyle-a-wong a=kyle-a-wong
Previously admin.Settings only allowed admins to view all cluster settings without redaction. If the requester was not an admin, would use the isReportable field on settings to determine if the setting should be redacted or not. This API also had outdated logic, as users with the MODIFYCLUSTERSETTINGS should also be able to view all cluster settings (See #115356 for more discussions on this). This patch respects this new role, and no longer uses the `isReportable` setting flag to determine if a setting should be redacted. This is implemented by query `crdb_internal.cluster_settings` directly, allowing the sql layer to permission check. This commit also removes the `unredacted_values` from the request entity as well, since it is no longer necessary. Ultimately, this commit updates the Settings RPC to have the same redaction logic as querying `crdb_internal.cluster_settings` or using `SHOW CLUSTER SETTINGS`.
Epic: None
Fixes: #137698
Release note (general change): The /_admin/v1/settings API now returns cluster settings using the same redaction logic as querying `SHOW CLUSTER SETTINGS` and `crdb_internal.cluster_settings`. This means that only settings flagged as "sensitive" will be redacted, all other settings will be visible. The same authorization is required for this endpoint, meaning the user must be an admin or have MODIFYCLUSTERSETTINGS or VIEWCLUSTERSETTINGS roles to hit this API. The exception is that if the user has VIEWACTIVITY or VIEWACTIVITYREDACTED, they will see console only settings.
138967: crosscluster/physical: return job id in SHOW TENANT WITH REPLICATION STATUS r=dt a=msbutler
Fixes #138548
Release note (sql change): SHOW TENANT WITH REPLICATION STATUS will now display the `ingestion_job_id` column after the `name` column.
139043: crosscluster/logical: ensure offline scan procs shut down before next phase r=dt a=msbutler
This patch adds a check that attempts to wait for the offline scan processors to spin down before transitioning to steady state ingestion or OnFailOrCancel during an offline scan.
Epic: none
Release note: none
139219: roachtest: disable backfill success check r=stevendanna a=andrewbaptist
The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes such as backfill failing to complete and node OOMs. Once those are closed, and the test is running more stably, this threshold can be dropped.
Fixes: #137093
Fixes: #137392
Informs: #133114
Release note: None
139259: sql: deflake TestIndexBackfillFractionTracking r=rafiss a=rafiss
Recent changes added some concurrency to index backfills, so the testing hook needs a mutex to prevent concurrent access.
fixes #139213
Release note: None
Co-authored-by: Kyle Wong <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 603ff88e54d3e3f6e49b0f673abd5ec564bf418b. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/468.
Parameters:
acMode=fullNormalElasticRepl
arch=amd64
blockSize=1024
cloud=gce
coverageBuild=false
cpu=16
diskBandwidthLimit=350MiB
disks=2
encrypted=false
fillDuration=10m0s
fs=ext4
leaseType=expiration
localSSD=true
mem=standard
numNodes=30
numWorkloadNodes=2
perturbationDuration=30m0s
ratioOfMax=0.5
runtimeAssertionsBuild=false
seed=-7744531226375341113
splits=10000
ssd=2
validationDuration=5m0s
vcpu=16
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-45558