migrations: Support AWS DMS as a source #84505
Investigation from the logs. Full log here; it involves creating and dropping replication slots multiple times (anything calling SELECT * was me). From logging PG statements, we have to support the following queries:

- for the initial load: looks like we're basically just missing WITH HOLD.
- for CDC, PG uses replication slots (examples in the log); we would then need to implement this stream: https://www.postgresql.org/docs/current/protocol-replication.html

DMS uses the replication protocol mentioned above to listen for updates (instead of the normal connection protocol), which is super ew. Can't even debug it using a normal SQL connection.
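The streaming replication protocol linked above frames changes as CopyData messages; the XLogData ('w') message, for example, carries two WAL positions and a server timestamp ahead of the payload. A minimal sketch of parsing that frame in Go (field layout taken from the linked PostgreSQL protocol docs; the type and function names are mine, not from the prototype):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// xlogData models the XLogData ('w') message of the PostgreSQL
// streaming replication protocol: a 1-byte tag, three big-endian
// 64-bit integers (WAL start, current WAL end, server clock), then
// the plugin payload.
type xlogData struct {
	WALStart    uint64
	WALEnd      uint64
	ServerClock uint64
	Payload     []byte
}

func parseXLogData(msg []byte) (xlogData, error) {
	if len(msg) < 25 || msg[0] != 'w' {
		return xlogData{}, fmt.Errorf("not an XLogData message")
	}
	return xlogData{
		WALStart:    binary.BigEndian.Uint64(msg[1:9]),
		WALEnd:      binary.BigEndian.Uint64(msg[9:17]),
		ServerClock: binary.BigEndian.Uint64(msg[17:25]),
		Payload:     msg[25:],
	}, nil
}

func main() {
	// Hand-built example frame: tag 'w', WAL start 0x16, WAL end 0x2C,
	// clock 0, payload "BEGIN 1234".
	msg := append([]byte{'w',
		0, 0, 0, 0, 0, 0, 0, 0x16,
		0, 0, 0, 0, 0, 0, 0, 0x2C,
		0, 0, 0, 0, 0, 0, 0, 0},
		[]byte("BEGIN 1234")...)
	d, err := parseXLogData(msg)
	fmt.Println(d.WALStart, d.WALEnd, string(d.Payload), err)
}
```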
Awesome start! After wrapping my head around what you've done, I'm hoping to get the first cut of INSERT/DELETE working today.
Some discussion from an internal slack thread regarding the replication slot stuff:
The replication now gets to START_REPLICATION on DMS. Instructions: after you set it up, re-run the setup steps whenever restarting cockroachdb for everything to work. A lot more hacks exist here, imitating certain triggers / functions. If you want a faster iteration cycle, you can use https://github.com/otan-cockroach/repltest (follow the readme) to inspect the replication stream.
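When inspecting the stream, note that the test_decoding plugin (the plugin the prototype targets, per the issue body below) emits human-readable change lines, roughly of the form `table <schema>.<name>: <OP>: <col>[<type>]:<value> ...`. A small Go sketch of pulling the schema-qualified table name and the operation out of such a line (the format is assumed from PostgreSQL's test_decoding documentation, not taken from the prototype):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTestDecoding extracts the schema-qualified table name and the
// operation (INSERT/UPDATE/DELETE) from a test_decoding change line,
// e.g. "table public.foo: INSERT: id[integer]:1 v[text]:'bar'".
// Non-change lines (BEGIN/COMMIT) return ok=false.
func parseTestDecoding(line string) (table, op string, ok bool) {
	if !strings.HasPrefix(line, "table ") {
		return "", "", false
	}
	rest := strings.TrimPrefix(line, "table ")
	parts := strings.SplitN(rest, ": ", 3)
	if len(parts) < 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	table, op, ok := parseTestDecoding("table public.foo: INSERT: id[integer]:1 v[text]:'bar'")
	fmt.Println(table, op, ok)
}
```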
It works now after I changed the table to include the schema name in the replication log. Branch is now up to date. Modified the issue with a write-up.
Added a few changes to avoid some of the hard-coding, handle multiple tables, and fix the types. Branch has been updated.
94110: roachprod: include storage workload metadata on snapshot r=jbowens a=coolcom200

Currently, when a snapshot is taken of a volume that has been used for storage workload collection, the snapshot only contains the user-provided information: the name of the snapshot and a description. This could lead to data not being included about which cluster it ran on, machine type, crdb version, etc. As a result, we encode this metadata in the labels / tags when we create a snapshot, allowing the user to provide both a name and a description while also capturing metadata that can be used for searching and further reference. There are some limitations on the maximum length of the labels (aws key: 128 chars, value: 256 chars; gcp: both 63 chars) and on which characters are allowed (gcp: lowercase, digits, _, -; aws: letters, digits, spaces, ., :, +, =, @, _, /, -). Alternatively, the metadata could be encoded into the description field, which would allow more data to be saved at the cost of being harder to search / filter. Fixes: #94075. Release note: None

94123: sql: implement the `log_timezone` session variable r=rafiss a=otan

Informs #84505. Release note (sql change): Add the `log_timezone` session variable, which is read-only and always UTC.

94154: cloud: set orchestration version updated to 22.2.1 r=absterr08 a=absterr08

Links Epic: https://cockroachlabs.atlassian.net/browse/REL-228. Release note: none

94178: descs: remove GetAllTableDescriptorsInDatabase r=postamar a=postamar

Recent changes in #93543 modified the contract of this method (it no longer returns dropped tables) and made it unsuitable for its main use case, the SQLTranslator. This commit fixes the regression by removing this deprecated method entirely and using correct alternatives instead. Fixes #93614. Release note: None

Co-authored-by: Leon Fattakhov <[email protected]>, Oliver Tan <[email protected]>, Abby Hersh <[email protected]>, Marius Posta <[email protected]>
93757: trigram: support multi-byte string trigrams; perf improvements r=jordanlewis a=jordanlewis

Fixes #93744. Related to #93830.

- Add multi-byte character support
- Improve performance

```
name           old time/op    new time/op    delta
Similarity-32  1.72µs ± 0%    0.60µs ± 3%    -64.98%  (p=0.000 n=9+10)

name           old alloc/op   new alloc/op   delta
Similarity-32  1.32kB ± 0%    0.37kB ± 0%    -72.10%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Similarity-32  15.0 ± 0%      6.0 ± 0%       -60.00%  (p=0.000 n=10+10)
```

Release note (sql change): previously, trigrams ignored multi-byte characters from input strings. This is now corrected.

94122: sql: implement the pg_timezone_names table r=rafiss a=otan

Informs #84505. Release note (sql change): Implement the `pg_timezone_names` pg_catalog table, which lists all supported timezones.

Co-authored-by: Jordan Lewis <[email protected]>, Oliver Tan <[email protected]>
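To make the multi-byte fix above concrete: extracting trigrams byte-by-byte splits non-ASCII characters apart, while iterating by rune keeps them intact. A rough sketch of rune-based extraction in the general pg_trgm style (two leading spaces and one trailing space of padding); this is an illustration, not CockroachDB's actual implementation:

```go
package main

import "fmt"

// trigrams returns the deduplicated rune-based trigrams of s, padded
// with two leading spaces and one trailing space as pg_trgm-style
// extractors do. Iterating over runes (not bytes) keeps multi-byte
// characters such as 'é' intact.
func trigrams(s string) []string {
	padded := []rune("  " + s + " ")
	out := make([]string, 0, len(padded))
	seen := map[string]bool{}
	for i := 0; i+3 <= len(padded); i++ {
		t := string(padded[i : i+3])
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", trigrams("cat"))   // ["  c" " ca" "cat" "at "]
	fmt.Printf("%q\n", trigrams("héllo")) // rune-safe: "é" stays one character
}
```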
106242: pg_class: populate pg_class.relreplident r=rafiss a=otan

Release note (sql change): pg_class's relreplident field was previously unpopulated. It is now populated with `d` for all tables (as each table has a primary key) and `n` otherwise. Informs: #84505

106546: flowinfra: clean up flow stats propagation in row-based flows r=yuzefovich a=yuzefovich

Previously, we would attach `FlowStats` (like max memory usage) to the "stream component" stats object. I don't really understand why that was the case; probably it was due to a misunderstanding of how tracing works (in particular, the TODOs that are now removed mentioned a "flow level span", but we don't need to attach the metadata to a particular tracing span). This simplifies the code a bit but also simplifies the work of adding region information to the `ComponentID` object. Epic: None. Release note: None

Co-authored-by: Oliver Tan <[email protected]>, Yahor Yuzefovich <[email protected]>
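For context on the `d` / `n` codes mentioned above: PostgreSQL documents four single-character values for pg_class.relreplident. A small lookup sketch (codes per the PostgreSQL pg_class documentation; the helper function itself is hypothetical):

```go
package main

import "fmt"

// replicaIdentity maps pg_class.relreplident codes to their meaning,
// per PostgreSQL's pg_class documentation:
//   d = default (primary key, if any), n = nothing,
//   f = all columns, i = a specific index.
func replicaIdentity(code byte) string {
	switch code {
	case 'd':
		return "default (primary key)"
	case 'n':
		return "nothing"
	case 'f':
		return "full (all columns)"
	case 'i':
		return "index"
	default:
		return "unknown"
	}
}

func main() {
	// CockroachDB now reports 'd' for tables (every table has a primary key).
	fmt.Println(replicaIdentity('d'))
	fmt.Println(replicaIdentity('n'))
}
```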
Hi guys
Hi @cucxabong. We currently don't have an update as to when this feature will be completed. We do have a working prototype, but it'll take some effort to get it over the line. Is there a particular reason why you're interested in this feature? To where would you be hoping to migrate the data?
#34766 provided support for CRDB to impersonate PG and act as a target for migrations. There is still work remaining if we ever want to support DMS with CRDB as a source. A working prototype is available in #93404.
Initial Load

To get the initial load to work, we need the following:

- `log_timezone` session variable. I believe this should just be UTC. (sql: implement the `log_timezone` session variable #94123)
- `pg_timezone_names` pg_catalog table (sql: implement the pg_timezone_names table #94122)
- `DECLARE CURSOR ... WITH HOLD`.

Replication

- The replication slot protocol, including the pg_catalog tables and associated builtins to create a replication slot. In the prototype, we used a global buffer which is populated by CDC using a newly added `replication://` source URI to do so using the "normal connection protocol", which is wrong, as the replication slot protocol has its own parser.
- The `test_decoding` plugin, but others may work with DMS too. Note this means setting `PluginName` as an additional parameter on the source endpoint in DMS.
- `relreplident` in `pg_catalog.pg_class`.
- Requisite plpgsql and event trigger support.
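As a concrete anchor for the replication-slot bullet: once a slot exists, the client opens a replication connection and issues START_REPLICATION, which the normal SQL parser does not handle (hence the note about the protocol having its own parser). A hedged sketch of assembling that command in Go (syntax taken from the PostgreSQL streaming replication protocol docs; the helper and its signature are illustrative, not from the prototype):

```go
package main

import (
	"fmt"
	"strings"
)

// buildStartReplication assembles the START_REPLICATION command a
// client such as DMS sends on a replication connection, per the
// PostgreSQL streaming replication protocol:
//   START_REPLICATION SLOT name LOGICAL lsn (opt 'val', ...)
// The options are plugin-specific (e.g. test_decoding's
// include-timestamp); a slice of pairs keeps their order stable.
func buildStartReplication(slot, lsn string, opts [][2]string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "START_REPLICATION SLOT %s LOGICAL %s", slot, lsn)
	if len(opts) > 0 {
		pairs := make([]string, len(opts))
		for i, o := range opts {
			pairs[i] = fmt.Sprintf("%q '%s'", o[0], o[1])
		}
		fmt.Fprintf(&b, " (%s)", strings.Join(pairs, ", "))
	}
	return b.String()
}

func main() {
	fmt.Println(buildStartReplication("dms_slot", "0/0",
		[][2]string{{"include-timestamp", "on"}}))
}
```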
Jira issue: CRDB-17695
Epic: CC-8911