Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdc: declarative schema changes break core changefeeds #85008

Closed
amruss opened this issue Jul 25, 2022 · 6 comments · Fixed by #85458
Closed

cdc: declarative schema changes break core changefeeds #85008

amruss opened this issue Jul 25, 2022 · 6 comments · Fixed by #85458
Assignees
Labels
A-cdc Change Data Capture branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cdc

Comments

@amruss
Copy link
Contributor

amruss commented Jul 25, 2022

In 22.2 , we're planning on switching to the declarative schema changer. Some schema changes cause restarts - and will thus cause an error for core changefeeds.

Jira issue: CRDB-17999

Epic CRDB-8719

@amruss amruss added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc labels Jul 25, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 25, 2022

cc @cockroachdb/cdc

@ajwerner
Copy link
Contributor

Will the resolution to #84511 fix this? Should we mark this as a release blocker?

@amruss
Copy link
Contributor Author

amruss commented Jul 25, 2022

Seems like a 22.2 release blocker to me

@ajwerner ajwerner added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Jul 25, 2022
@amruss
Copy link
Contributor Author

amruss commented Jul 27, 2022

More context: primary index changes cause this break. The declarative schema changer makes most schema changes into this type of schema change. This means most schema changes won't work with core changefeeds post 22.2.

@amruss
Copy link
Contributor Author

amruss commented Jul 27, 2022

Possible fix: allow core changefeeds to be restarted w/ a retry loop

@ajwerner have any thoughts on how we should address this?

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Aug 3, 2022
Previously, core changefeeds would stop entirely
due to transient errors or certain schema changes.
This change adds a retry loop to the core changefeed
distributed SQL workflow.

This change updates related tests which
omitted sinkless feeds since they could not handle
schema changes.

Fixes cockroachdb#85008
Release note (general change): Changefeeds without
a specified sink will not longer terminate when
schema changes occur.
craig bot pushed a commit that referenced this issue Aug 9, 2022
84514: sql/schemachanger: implemented ALTER PRIMARY KEY for vanilla case  r=postamar a=Xiang-Gu

The PR implements `ALTER PRIMARY KEY` under the declarative schema
changer framework that handles the simplest, "vanilla" case like
```
CREATE TABLE t (i INT PRIMARY KEY, j INT NOT NULL)
ALTER TABLE t ALTER PRIMARY KEY USING COLUMNS (j)
```

This is the first of a series PRs where followup PRs will expand its
capabilities to be able to handle more complex cases, including
  1. Allow the requested new primary key to be hash-sharded;
  2. Consider the case where altering primary key requires us to
     modify existing secondary indexes(see the legacy schema change
     about in what cases we should rewrite)
  3. Consider the case where the old primary index is on the implicitly
     created `rowid` column, in which case we also need to drop that
     column;
  5. Consider partitioning and locality (I'm not sure what they are,
     and why they play a role when `ALTER PRIMARY KEY` but I've seen
     them in the old schema changer, so I assume we ought to do
     something about them too here).
  6. Support `ALTER PRIMARY KEY` with concurrent schema change
      statements. E.g.
 ```ALTER TABLE t ADD COLUMN k INT NOT NULL DEFAULT 30, ALTER PRIMARY KEY USING COLUMNS (j);```

related: #83932
Release note: None

84718: sql: populate query-level stats earlier & add contention to telemetry log r=THardy98 a=THardy98

Addresses: #71328

This change adds contention time (measured in nanoseconds) to the
`SampledQuery` telemetry log.

To accomodate this change, we needed to collect query-level statistics
earlier. Previously, query-level statistics were fetched when we called
`Finish` under the `instrumentationHelper`, however this occurred after
we had already emitted our query execution logs. Now, we collect
query-level stats in `dispatchToExecutionEngine` after we've executed
the query.

As a tradeoff to collecting query-level stats earlier, we need to fetch
the trace twice:
- once when populating query-level stats (trace is required to compute
  query-level stats) at `populateQueryLevelStats` in
`dispatchToExecutionEngine` after query execution
- once during the instrumentation helper's `Finish` (as we do currently)

This allows us to collect query-level stats earlier without omitting any
tracing events we record currently. This approach is safer, with the
additional overhead of fetching the trace twice only occuring at the
tracing sampling rate of 1-2%, which is fairly conservative. The concern
with only fetching the trace at query-level stats population was the
ommission of a number of events that occur in
`commitSQLTransactionInternal` (or any execution paths that don't lead
to `dispatchToExecutionEngine`).

Release note (sql change): Add `ContentionTime` field to the
`SampledQuery` telemetry log. Query-level statistics are collected
earlier to facilitate the adding of contention time to query execution
logs. The earlier collection of query-level statistics requires the
additional overhead of fetching the query's trace twice (instead of
previously once).

85280: sql, server: add new system privileges for observability r=Santamaura a=Santamaura

This patch introduces 2 new system privileges
VIEWDEBUG and VIEWCLUSTERMETADATA. VIEWDEBUG
will now be used to gate taking traces and viewing
debug endpoints. VIEWCLUSTERMETADATA will now be
used to gate the node and range reports.

Resolves #83844, #83856, #83857, #83858, #83861

Release note (sql change): add VIEWDEBUG and
VIEWCLUSTERMETADATA system privileges.

85458: changefeedccl: add retries to sinkless changefeeds r=jayshrivastava a=jayshrivastava

This change updates core/sinkless changefeeds to run in a retry loop, allowing for changefeed restarts in case of transient errors or declarative schema changes. 

See commit notes for more details.

Fixes #85008

85819: kv: use max timestamp during below-Raft scan to gossip liveness r=nvanbenschoten a=nvanbenschoten

Related to https://github.com/cockroachlabs/support/issues/1573.

This commit switches `MaybeGossipNodeLivenessRaftMuLocked` to evaluate its scan
at the maximum timestamp instead of at the local node's HLC time. This ensures
that we gossip the most recent liveness record, regardless of what timestamp it
is written at.

85822: colbuilder: fall back to row-by-row processor wrapping for many renders r=yuzefovich a=yuzefovich

**colbuilder: add a microbenchmark for running many render expressions**

This commit adds a microbenchmark of queries with many render
expressions. It'll be used in the following commit to tune when we fall
back to wrapping a row-by-row processor to handle those renders.

Release note: None

**colbuilder: fall back to row-by-row processor wrapping for many renders**

This commit introduces a mechanism to handle render expressions by
wrapping a row-by-row processor into the vectorized flow when
1. the estimated number of rows to go through the renders is relatively
small
2. the number of renders is relatively high.

The idea is that the vectorized projection operators have higher
overhead when many of them are planned AND there is not enough data to
amortize the overhead, so to improve the performance in those cases
we'll use the row-by-row noop processor. Both of the thresholds are
controlled by cluster settings and the defaults were chosen based on
a representative microbenchmark.

It's worth pointing out that we only have the estimated row count for
the scan operators, so the change has limited applicability.

```
RenderPlanning/rows=1/renders=1-24           407µs ± 2%     408µs ± 2%     ~     (p=0.684 n=10+10)
RenderPlanning/rows=1/renders=8-24           516µs ± 1%     537µs ± 1%   +4.05%  (p=0.000 n=10+10)
RenderPlanning/rows=1/renders=32-24          832µs ± 1%     811µs ± 1%   -2.59%  (p=0.000 n=10+10)
RenderPlanning/rows=1/renders=64-24         1.22ms ± 0%    1.14ms ± 1%   -6.62%  (p=0.000 n=9+10)
RenderPlanning/rows=1/renders=128-24        2.02ms ± 0%    1.80ms ± 1%  -11.18%  (p=0.000 n=8+9)
RenderPlanning/rows=1/renders=512-24        7.75ms ± 1%    5.75ms ± 1%  -25.77%  (p=0.000 n=10+9)
RenderPlanning/rows=1/renders=4096-24        160ms ± 1%      62ms ± 1%  -61.51%  (p=0.000 n=10+9)
RenderPlanning/rows=4/renders=1-24           438µs ± 2%     438µs ± 1%     ~     (p=0.853 n=10+10)
RenderPlanning/rows=4/renders=8-24           603µs ± 1%     633µs ± 2%   +5.06%  (p=0.000 n=10+10)
RenderPlanning/rows=4/renders=32-24         1.08ms ± 1%    1.08ms ± 1%     ~     (p=0.105 n=10+10)
RenderPlanning/rows=4/renders=64-24         1.72ms ± 0%    1.62ms ± 0%   -5.83%  (p=0.000 n=9+9)
RenderPlanning/rows=4/renders=128-24        3.01ms ± 1%    2.75ms ± 1%   -8.78%  (p=0.000 n=10+10)
RenderPlanning/rows=4/renders=512-24        11.6ms ± 1%     9.6ms ± 2%  -17.58%  (p=0.000 n=10+10)
RenderPlanning/rows=4/renders=4096-24        192ms ± 2%      91ms ± 2%  -52.58%  (p=0.000 n=10+10)
RenderPlanning/rows=16/renders=1-24          494µs ± 1%     499µs ± 1%   +1.03%  (p=0.006 n=10+8)
RenderPlanning/rows=16/renders=8-24          855µs ± 1%     901µs ± 1%   +5.37%  (p=0.000 n=10+10)
RenderPlanning/rows=16/renders=32-24        2.03ms ± 1%    2.04ms ± 1%     ~     (p=0.190 n=10+10)
RenderPlanning/rows=16/renders=64-24        3.58ms ± 1%    3.42ms ± 1%   -4.56%  (p=0.000 n=10+9)
RenderPlanning/rows=16/renders=128-24       6.74ms ± 1%    6.31ms ± 1%   -6.37%  (p=0.000 n=10+10)
RenderPlanning/rows=16/renders=512-24       26.9ms ± 1%    24.7ms ± 1%   -8.24%  (p=0.000 n=9+10)
RenderPlanning/rows=16/renders=4096-24       329ms ± 2%     218ms ± 2%  -33.66%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=1-24          666µs ± 1%     659µs ± 2%   -1.07%  (p=0.007 n=10+10)
RenderPlanning/rows=64/renders=8-24         1.79ms ± 1%    1.84ms ± 1%   +3.01%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=32-24        5.53ms ± 1%    5.79ms ± 2%   +4.74%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=64-24        10.8ms ± 1%    11.0ms ± 1%   +1.87%  (p=0.000 n=10+9)
RenderPlanning/rows=64/renders=128-24       21.2ms ± 1%    21.7ms ± 1%   +2.71%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=512-24       83.6ms ± 0%    84.9ms ± 0%   +1.47%  (p=0.000 n=10+7)
RenderPlanning/rows=64/renders=4096-24       824ms ± 1%     751ms ± 2%   -8.88%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=1-24         853µs ± 1%     851µs ± 1%     ~     (p=0.481 n=10+10)
RenderPlanning/rows=128/renders=8-24        2.98ms ± 1%    3.11ms ± 1%   +4.32%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=32-24       10.4ms ± 1%    10.9ms ± 1%   +5.44%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=64-24       20.1ms ± 1%    21.3ms ± 1%   +5.99%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=128-24      39.7ms ± 1%    42.1ms ± 2%   +5.98%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=512-24       160ms ± 1%     168ms ± 2%   +5.13%  (p=0.000 n=9+10)
RenderPlanning/rows=128/renders=4096-24      1.44s ± 1%     1.48s ± 2%   +3.15%  (p=0.000 n=9+10)
RenderPlanning/rows=256/renders=1-24        1.22ms ± 1%    1.21ms ± 1%   -1.01%  (p=0.000 n=10+10)
RenderPlanning/rows=256/renders=8-24        5.22ms ± 0%    5.19ms ± 1%   -0.54%  (p=0.011 n=8+9)
RenderPlanning/rows=256/renders=32-24       19.9ms ± 1%    20.0ms ± 1%     ~     (p=0.182 n=9+10)
RenderPlanning/rows=256/renders=64-24       39.0ms ± 0%    38.9ms ± 0%   -0.33%  (p=0.023 n=10+10)
RenderPlanning/rows=256/renders=128-24      76.8ms ± 1%    76.7ms ± 1%     ~     (p=0.739 n=10+10)
RenderPlanning/rows=256/renders=512-24       316ms ± 1%     319ms ± 1%   +1.15%  (p=0.001 n=9+10)
RenderPlanning/rows=256/renders=4096-24      2.75s ± 1%     2.73s ± 1%   -0.64%  (p=0.002 n=8+9)
```

Fixes: #85632.

Release note: None

Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 3b40478 Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cdc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants