migrations: Support AWS DMS as a source #84505
Investigation from the logs. Full log here; it involves creating and dropping replication slots multiple times (anything calling SELECT * was me). From logging PG statements, we have to support the following queries:

- for the initial load: looks like we're basically just missing WITH HOLD.
- for CDC, PG uses replication slots (examples in the log); we would then need to implement this stream: https://www.postgresql.org/docs/current/protocol-replication.html

DMS uses the replication protocol mentioned above to listen for updates (instead of the normal connection protocol), which is super ew. Can't even debug it using a normal SQL connection.
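The streaming replication protocol linked above frames changes as CopyData messages; the XLogData ('w') message, for example, carries two WAL positions and a server timestamp ahead of the payload. A minimal sketch of parsing that frame in Go (field layout taken from the linked PostgreSQL protocol docs; the type and function names are mine, not from the prototype):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// xlogData models the XLogData ('w') message of the PostgreSQL
// streaming replication protocol: a 1-byte tag, three big-endian
// 64-bit integers (WAL start, current WAL end, server clock), then
// the plugin payload.
type xlogData struct {
	WALStart    uint64
	WALEnd      uint64
	ServerClock uint64
	Payload     []byte
}

func parseXLogData(msg []byte) (xlogData, error) {
	if len(msg) < 25 || msg[0] != 'w' {
		return xlogData{}, fmt.Errorf("not an XLogData message")
	}
	return xlogData{
		WALStart:    binary.BigEndian.Uint64(msg[1:9]),
		WALEnd:      binary.BigEndian.Uint64(msg[9:17]),
		ServerClock: binary.BigEndian.Uint64(msg[17:25]),
		Payload:     msg[25:],
	}, nil
}

func main() {
	// Hand-built example frame: tag 'w', WAL start 0x16, WAL end 0x2C,
	// clock 0, payload "BEGIN 1234".
	msg := append([]byte{'w',
		0, 0, 0, 0, 0, 0, 0, 0x16,
		0, 0, 0, 0, 0, 0, 0, 0x2C,
		0, 0, 0, 0, 0, 0, 0, 0},
		[]byte("BEGIN 1234")...)
	d, err := parseXLogData(msg)
	fmt.Println(d.WALStart, d.WALEnd, string(d.Payload), err)
}
```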
Awesome start! After wrapping my head around what you've done, I'm hoping to get the first cut of INSERT/DELETE working today.
Some discussion from an internal slack thread regarding the replication slot stuff:
The replication now gets to START_REPLICATION on DMS. Instructions: after you set it up, re-run the setup steps whenever restarting cockroachdb for everything to work. A lot more hacks exist here, imitating certain triggers / functions. If you want a faster iteration cycle, you can use https://github.com/otan-cockroach/repltest (follow the readme) to inspect the replication stream.
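When inspecting the stream, note that the test_decoding plugin (the plugin the prototype targets, per the issue body below) emits human-readable change lines, roughly of the form `table <schema>.<name>: <OP>: <col>[<type>]:<value> ...`. A small Go sketch of pulling the schema-qualified table name and the operation out of such a line (the format is assumed from PostgreSQL's test_decoding documentation, not taken from the prototype):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTestDecoding extracts the schema-qualified table name and the
// operation (INSERT/UPDATE/DELETE) from a test_decoding change line,
// e.g. "table public.foo: INSERT: id[integer]:1 v[text]:'bar'".
// Non-change lines (BEGIN/COMMIT) return ok=false.
func parseTestDecoding(line string) (table, op string, ok bool) {
	if !strings.HasPrefix(line, "table ") {
		return "", "", false
	}
	rest := strings.TrimPrefix(line, "table ")
	parts := strings.SplitN(rest, ": ", 3)
	if len(parts) < 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	table, op, ok := parseTestDecoding("table public.foo: INSERT: id[integer]:1 v[text]:'bar'")
	fmt.Println(table, op, ok)
}
```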
It works now after I changed the table to include the schema name in the replication log. Branch is now up to date. Modified the issue with a write-up.
Added a few changes to avoid some of the hard-coding, handle multiple tables, and fix the types. Branch has been updated.
94110: roachprod: include storage workload metadata on snapshot r=jbowens a=coolcom200

Currently, when a snapshot is taken of a volume that has been used for storage workload collection, the snapshot only contains the user-provided information: the name of the snapshot and a description. This could lead to data not being included about which cluster it ran on, machine type, crdb version, etc. As a result, we encode this metadata in the labels / tags when we create a snapshot, allowing the user to provide both a name and a description while also capturing metadata that can be used for searching and further reference. There are some limitations on the maximum length of the labels (aws key: 128 chars, value: 256 chars; gcp: both 63 chars) and on which characters are allowed (gcp: lowercase, digits, _, -; aws: letters, digits, spaces, ., :, +, =, @, _, /, -). Alternatively, the metadata could be encoded into the description field, which would allow more data to be saved at the cost of being harder to search / filter. Fixes: #94075. Release note: None

94123: sql: implement the `log_timezone` session variable r=rafiss a=otan

Informs #84505. Release note (sql change): Add the `log_timezone` session variable, which is read-only and always UTC.

94154: cloud: set orchestration version updated to 22.2.1 r=absterr08 a=absterr08

Links Epic: https://cockroachlabs.atlassian.net/browse/REL-228. Release note: none

94178: descs: remove GetAllTableDescriptorsInDatabase r=postamar a=postamar

Recent changes in #93543 modified the contract of this method (it no longer returns dropped tables) and made it unsuitable for its main use case, the SQLTranslator. This commit fixes the regression by removing this deprecated method entirely and using correct alternatives instead. Fixes #93614. Release note: None

Co-authored-by: Leon Fattakhov <[email protected]>, Oliver Tan <[email protected]>, Abby Hersh <[email protected]>, Marius Posta <[email protected]>
93757: trigram: support multi-byte string trigrams; perf improvements r=jordanlewis a=jordanlewis

Fixes #93744. Related to #93830.

- Add multi-byte character support
- Improve performance

```
name           old time/op    new time/op    delta
Similarity-32  1.72µs ± 0%    0.60µs ± 3%    -64.98%  (p=0.000 n=9+10)

name           old alloc/op   new alloc/op   delta
Similarity-32  1.32kB ± 0%    0.37kB ± 0%    -72.10%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Similarity-32  15.0 ± 0%      6.0 ± 0%       -60.00%  (p=0.000 n=10+10)
```

Release note (sql change): previously, trigrams ignored multi-byte characters from input strings. This is now corrected.

94122: sql: implement the pg_timezone_names table r=rafiss a=otan

Informs #84505. Release note (sql change): Implement the `pg_timezone_names` pg_catalog table, which lists all supported timezones.

Co-authored-by: Jordan Lewis <[email protected]>, Oliver Tan <[email protected]>
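To make the multi-byte fix above concrete: extracting trigrams byte-by-byte splits non-ASCII characters apart, while iterating by rune keeps them intact. A rough sketch of rune-based extraction in the general pg_trgm style (two leading spaces and one trailing space of padding); this is an illustration, not CockroachDB's actual implementation:

```go
package main

import "fmt"

// trigrams returns the deduplicated rune-based trigrams of s, padded
// with two leading spaces and one trailing space as pg_trgm-style
// extractors do. Iterating over runes (not bytes) keeps multi-byte
// characters such as 'é' intact.
func trigrams(s string) []string {
	padded := []rune("  " + s + " ")
	out := make([]string, 0, len(padded))
	seen := map[string]bool{}
	for i := 0; i+3 <= len(padded); i++ {
		t := string(padded[i : i+3])
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", trigrams("cat"))   // ["  c" " ca" "cat" "at "]
	fmt.Printf("%q\n", trigrams("héllo")) // rune-safe: "é" stays one character
}
```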
106242: pg_class: populate pg_class.relreplident r=rafiss a=otan

Release note (sql change): pg_class's relreplident field was previously unpopulated. It is now populated with `d` for all tables (as each table has a primary key) and `n` otherwise. Informs: #84505

106546: flowinfra: clean up flow stats propagation in row-based flows r=yuzefovich a=yuzefovich

Previously, we would attach `FlowStats` (like max memory usage) to the "stream component" stats object. I don't really understand why that was the case; probably it was due to a misunderstanding of how tracing works (in particular, the TODOs that are now removed mentioned a "flow level span", but we don't need to attach the metadata to a particular tracing span). This simplifies the code a bit but also simplifies the work of adding region information to the `ComponentID` object. Epic: None. Release note: None

Co-authored-by: Oliver Tan <[email protected]>, Yahor Yuzefovich <[email protected]>
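For context on the `d` / `n` codes mentioned above: PostgreSQL documents four single-character values for pg_class.relreplident. A small lookup sketch (codes per the PostgreSQL pg_class documentation; the helper function itself is hypothetical):

```go
package main

import "fmt"

// replicaIdentity maps pg_class.relreplident codes to their meaning,
// per PostgreSQL's pg_class documentation:
//   d = default (primary key, if any), n = nothing,
//   f = all columns, i = a specific index.
func replicaIdentity(code byte) string {
	switch code {
	case 'd':
		return "default (primary key)"
	case 'n':
		return "nothing"
	case 'f':
		return "full (all columns)"
	case 'i':
		return "index"
	default:
		return "unknown"
	}
}

func main() {
	// CockroachDB now reports 'd' for tables (every table has a primary key).
	fmt.Println(replicaIdentity('d'))
	fmt.Println(replicaIdentity('n'))
}
```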
Hi guys
Hi @cucxabong. We currently don't have an update as to when this feature will be completed. We do have a working prototype, but it'll take some effort to get it over the line. Is there a particular reason why you're interested in this feature? To where would you be hoping to migrate the data?
#34766 provided support for CRDB to impersonate PG and act as a target for migrations. There is still work remaining if we ever want to support DMS with CRDB as a source. A working prototype is available in #93404.
Initial Load

To get the initial load to work, we need the following:

- `log_timezone` session variable. I believe this should just be UTC. (sql: implement the `log_timezone` session variable #94123)
- `pg_timezone_names` pg_catalog table (sql: implement the pg_timezone_names table #94122)
- `DECLARE CURSOR ... WITH HOLD`.

Replication

- The replication slot protocol, including the pg_catalog tables and associated builtins to create a replication slot. In the prototype, we used a global buffer which is populated by CDC using a newly added `replication://` source URI to do so using the "normal connection protocol", which is wrong, as the replication slot protocol has its own parser.
- The `test_decoding` plugin, but others may work with DMS too. Note this means setting `PluginName` as an additional parameter on the source endpoint in DMS.
- `relreplident` in `pg_catalog.pg_class`.
- Requisite plpgsql and event trigger support.
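As a concrete anchor for the replication-slot bullet: once a slot exists, the client opens a replication connection and issues START_REPLICATION, which the normal SQL parser does not handle (hence the note about the protocol having its own parser). A hedged sketch of assembling that command in Go (syntax taken from the PostgreSQL streaming replication protocol docs; the helper and its signature are illustrative, not from the prototype):

```go
package main

import (
	"fmt"
	"strings"
)

// buildStartReplication assembles the START_REPLICATION command a
// client such as DMS sends on a replication connection, per the
// PostgreSQL streaming replication protocol:
//   START_REPLICATION SLOT name LOGICAL lsn (opt 'val', ...)
// The options are plugin-specific (e.g. test_decoding's
// include-timestamp); a slice of pairs keeps their order stable.
func buildStartReplication(slot, lsn string, opts [][2]string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "START_REPLICATION SLOT %s LOGICAL %s", slot, lsn)
	if len(opts) > 0 {
		pairs := make([]string, len(opts))
		for i, o := range opts {
			pairs[i] = fmt.Sprintf("%q '%s'", o[0], o[1])
		}
		fmt.Fprintf(&b, " (%s)", strings.Join(pairs, ", "))
	}
	return b.String()
}

func main() {
	fmt.Println(buildStartReplication("dms_slot", "0/0",
		[][2]string{{"include-timestamp", "on"}}))
}
```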
Jira issue: CRDB-17695
Epic: CC-8911