[yugabyte/yugabyte-db#26107] Parellel streaming changes #172

Merged
merged 10 commits into from
Feb 25, 2025

vaibhav-yb
Collaborator

@vaibhav-yb vaibhav-yb commented Feb 17, 2025

This PR adds support for streaming changes from a single table in parallel across multiple tasks, provided the user supplies the hash_code ranges each task should stream. It introduces the following changes:

  1. New configurations:
    a. streaming.mode: Accepts default or parallel and determines whether parallel streaming mode is used.
    b. slot.names: A comma-separated list of slot names, one per task.
    c. publication.names: A comma-separated list of publication names, one per task.
    d. slot.ranges: A list of semi-colon separated values for slot ranges in the format a,b;b,c;c,d.
  2. Validations in the class YBValidate have been introduced:
    a. To validate that the complete hash range is provided by the user and nothing is missing.
    b. To validate that the number of slot names equals the number of publication names and the number of slot ranges.
    c. To ensure that there's only one table provided in the table.include.list as parallel streaming will not work with multiple tables.
  3. Support for snapshot with streaming.mode parallel.
    a. This requires setting the configuration parameter primary.key.hash.columns to the hash part of the primary key columns.
  4. The PostgresPartition object will now also use the slot name to uniquely identify the source partition.
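
The validations in item 2 can be sketched roughly as below. This is an illustrative Python sketch, not the connector's YBValidate code: the function names are hypothetical, and the bounds 0 and 65536 with an exclusive upper end are assumptions taken from the usage example in this description.

```python
# Illustrative sketch only; the real checks live in the connector's YBValidate class.
HASH_MIN, HASH_MAX = 0, 65536  # assumed bounds, taken from the usage example


def parse_ranges(slot_ranges: str) -> list[tuple[int, int]]:
    """Parse 'a,b;b,c' into [(a, b), (b, c)]."""
    ranges = []
    for part in slot_ranges.split(";"):
        start, end = part.split(",")
        ranges.append((int(start), int(end)))
    return ranges


def validate_parallel_config(config: dict) -> None:
    """Raise ValueError if the parallel streaming settings are inconsistent."""
    slots = config["slot.names"].split(",")
    pubs = config["publication.names"].split(",")
    ranges = parse_ranges(config["slot.ranges"])

    # 2b: one slot, one publication and one range per task.
    if not (len(slots) == len(pubs) == len(ranges)):
        raise ValueError("slot.names, publication.names and slot.ranges differ in length")

    # 2c: parallel streaming works with exactly one table.
    tables = [t for t in config.get("table.include.list", "").split(",") if t.strip()]
    if len(tables) != 1:
        raise ValueError("table.include.list must contain exactly one table")

    # 2a: the ranges must cover the complete hash range without gaps or overlaps.
    ordered = sorted(ranges)
    if any(start >= end for start, end in ordered):
        raise ValueError("each range must satisfy start < end")
    if ordered[0][0] != HASH_MIN or ordered[-1][1] != HASH_MAX:
        raise ValueError("ranges must start at 0 and end at 65536")
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if prev_end != next_start:
            raise ValueError("ranges leave a gap or overlap")
```

Sorting the ranges before the coverage check is a choice made for this sketch; the connector may instead expect them in ascending order already.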

Usage example

If the connector configuration contains the following properties:

{
  ...
  "streaming.mode":"parallel",
  "slot.names":"rs1,rs2",
  "publication.names":"pb1,pb2",
  "slot.ranges":"0,32768;32768,65536"
  ...
}

then we will have 2 tasks created:

  1. task 0: slot=rs1 publication=pb1 hash_range=0,32768
  2. task 1: slot=rs2 publication=pb2 hash_range=32768,65536
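
The positional mapping above can be illustrated with a short Python sketch. The function name and the returned dictionary shape are hypothetical and only mirror the task list shown here; the connector's actual task configuration code is not reproduced.

```python
# Illustrative sketch: pair the i-th slot, publication and range into task i.
def build_task_assignments(config: dict) -> list[dict]:
    slots = [s.strip() for s in config["slot.names"].split(",")]
    pubs = [p.strip() for p in config["publication.names"].split(",")]
    ranges = [r.strip() for r in config["slot.ranges"].split(";")]
    return [
        {"task": i, "slot": slot, "publication": pub, "hash_range": hash_range}
        for i, (slot, pub, hash_range) in enumerate(zip(slots, pubs, ranges))
    ]


config = {
    "streaming.mode": "parallel",
    "slot.names": "rs1,rs2",
    "publication.names": "pb1,pb2",
    "slot.ranges": "0,32768;32768,65536",
}
for task in build_task_assignments(config):
    print(task)
```

Because the pairing is purely by position, swapping two entries in only one of the three lists would hand a slot a different hash range.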

Note:

It is currently the user's responsibility to provide the full hash range and to keep the order of values consistent across slot.names, publication.names and slot.ranges, because the values are picked sequentially and assigned to tasks by position. To ensure that the task using a given slot gets the same hash_range every time, the user must keep this order unchanged.
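
Since the ranges have to be written by hand, a small helper can produce a contiguous slot.ranges value for a given number of tasks. This Python sketch is not part of the connector; the function name is hypothetical and the upper bound of 65536 is assumed from the usage example.

```python
# Illustrative helper: split [0, hash_max) into num_tasks contiguous ranges.
def even_slot_ranges(num_tasks: int, hash_max: int = 65536) -> str:
    if not 1 <= num_tasks <= hash_max:
        raise ValueError("num_tasks must be between 1 and hash_max")
    # Integer boundaries; the last one always equals hash_max.
    bounds = [i * hash_max // num_tasks for i in range(num_tasks + 1)]
    return ";".join(f"{lo},{hi}" for lo, hi in zip(bounds, bounds[1:]))


print(even_slot_ranges(2))  # 0,32768;32768,65536
```

The output always contains one range per task, so it lines up with slot.names and publication.names lists of the same length.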

This closes yugabyte/yugabyte-db#26107.

@suranjan
Collaborator

d. slot.ranges: A list of semi-colon separated values for slot ranges in the format a,b;b,c;c,d.
Let's just call it ranges. With hash sharding it will hold hash values, and with range sharding it can hold range column values.

@vaibhav-yb vaibhav-yb changed the title [wip] parellel streaming changes [yugabyte/yugabyte-db#26107] parellel streaming changes Feb 19, 2025
@vaibhav-yb vaibhav-yb changed the title [yugabyte/yugabyte-db#26107] parellel streaming changes [yugabyte/yugabyte-db#26107] Parellel streaming changes Feb 25, 2025
@vaibhav-yb vaibhav-yb merged commit 13c3b13 into yugabyte:ybdb-debezium-2.5.2 Feb 25, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

[DBZ-PGYB] Implement parallel streaming support
2 participants