extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory #44458

postamar · 2024-08-20T15:30:50Z

Replaces #44398

Fixes airbytehq/airbyte-internal-issues#9093

This PR makes the Bulk CDK's JDBC toolkit useful for JDBC source connectors which aren't Oracle.

The current JDBC toolkit's greatest shortcoming (and blocker for porting source-mysql) is how tightly it's coupled to the stream state value model defined by

data class DefaultJdbcStreamStateValue(
    @JsonProperty("primary_key") val primaryKey: Map<String, JsonNode> = mapOf(),
    @JsonProperty("cursors") val cursors: Map<String, JsonNode> = mapOf(),
)

This model is going to make implementing CTID or XMIN states really annoying at best! Also forget about re-using the existing state values serialized by the existing source-mysql connector...

Therefore, it's necessary to make the connector fully own the state value model and translating stream state values into SQL queries. For this purpose this PR introduces the JdbcPartition and JdbcPartitionFactory interfaces:

A JdbcPartition object represents a specific subset of data within a table and provides a query to access it as well as the state value to use as a checkpoint for having emitted said data.
A JdbcPartitionFactory object will map the input state value into a JdbcPartition object representing the remaining data to read in the table; the factory can optionally subdivide that into smaller partitions.
This PR also provides default implementations of these two interfaces which makes use of DefaultJdbcStreamStateValue which can be used by naive JDBC source connectors.

A pleasant consequence of this design is that the PartitionsCreatorFactory, PartitionsCreator and PartitionReader implementations for JDBC connectors are much more database-agnostic now:

JdbcPartitionReader is the PartitionReader implementation for JDBC source connectors and it focuses on running an arbitrary query and packaging its results; it comes in a resumable- and in a non-resumable flavour.
JdbcPartitionsCreator is the PartitionsCreator implementation for JDBC source connectors and it focuses on computing cursor column upper bounds and fetchSize values, it comes in a sequential- and in a concurrent (i.e. splitting) flavour.
JdbcPartitionsCreatorFactory is the PartitionsCreatorFactory implementation for JDBC source connectors and merely glues JdbcPartitionFactory to the read operation machinery.

What this means for porting legacy JDBC source connectors like source-mysql to the bulk CDK is this:

The existing stream state value object definition from source-mysql/src/resources/internal_models.yaml needs to be kept if we are to maintain compatibility (and we really should).
We need to implement class MySqlPartitionFactory : JdbcPartitionFactory<MySqlSharedState,MySqlStreamState,MySqlPartition> to deserialize the above state values into MySqlPartition objects
A MySqlPartition subclass may or may not be thin wrapper around a DefaultJdbcPartition subclass, with mysql-specific fairy-dust sprinkled around it, especially to serialize internal_models.yaml objects.
MySqlSharedState and MySqlStreamState may be even more bare-bones, who knows.

Review guide

I considered breaking down this PR into more commits but the changes mainly consists of deleting and adding whole files so I'm not sure how useful that would be. Here's a suggested file reading order:

JdbcPartitionsCreatorFactory.kt which has the PartitionsCreatorFactory implementations for JDBC source connectors.
considering that (1) heavily depends on JdbcPartitionFactory, I would look at JdbcPartitionFactory.kt and JdbcPartition.kt next, which should give an idea of what a JdbcPartition is.
Either supplement that with Default*.kt files for the implementations, or with JdbcPartitionsCreator.kt and JdbcPartitionReader.kt which have the PartitionsCreator and PartitionReader implementations for source connectors.
Look at the tests.

vercel · 2024-08-20T15:30:52Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Comments	Updated (UTC)
airbyte-docs	⬜️ Ignored (Inspect)	Visit Preview		Aug 20, 2024 8:57pm

postamar · 2024-08-20T15:31:00Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @postamar and the rest of your teammates on Graphite

postamar · 2024-08-21T00:39:39Z

Closing because I'm merging the top of the stack #44482 into master using slash-approve-and-merge.

This was referenced Aug 20, 2024

extract-jdbc: rename PartitionsCreator and PartitionReader impls #44457

Closed

bulk-cdk: refactor tests #44459

Closed

octavia-squidington-iii added the CDK Connector Development Kit label Aug 20, 2024

postamar marked this pull request as ready for review August 20, 2024 15:40

postamar requested a review from a team as a August 20, 2024 15:40

postamar force-pushed the postamar_gt_08-20-extract-jdbc_rename_partitionscreator_and_partitionreader_impls branch from 456b7b0 to e82f2fe Compare August 20, 2024 15:44

postamar force-pushed the postamar_gt_08-20-extract-jdbc_add_and_adopt_jdbcpartition_and_jdbcpartitionfactory branch from b377b33 to 6a4767b Compare August 20, 2024 15:44

postamar force-pushed the postamar_gt_08-20-extract-jdbc_rename_partitionscreator_and_partitionreader_impls branch from e82f2fe to f3f5fb0 Compare August 20, 2024 18:48

postamar force-pushed the postamar_gt_08-20-extract-jdbc_add_and_adopt_jdbcpartition_and_jdbcpartitionfactory branch from 6a4767b to 76136ec Compare August 20, 2024 18:48

extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory

a37b4b3

postamar force-pushed the postamar_gt_08-20-extract-jdbc_rename_partitionscreator_and_partitionreader_impls branch from f3f5fb0 to 4152603 Compare August 20, 2024 20:56

postamar force-pushed the postamar_gt_08-20-extract-jdbc_add_and_adopt_jdbcpartition_and_jdbcpartitionfactory branch from 76136ec to a37b4b3 Compare August 20, 2024 20:56

postamar mentioned this pull request Aug 21, 2024

extract-jdbc: refactor constants #44482

Merged

postamar closed this Aug 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory #44458

extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory #44458

postamar commented Aug 20, 2024 •

edited

Loading

vercel bot commented Aug 20, 2024 •

edited

Loading

postamar commented Aug 20, 2024 •

edited

Loading

postamar commented Aug 21, 2024

extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory #44458

extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory #44458

Conversation

postamar commented Aug 20, 2024 • edited Loading

Review guide

vercel bot commented Aug 20, 2024 • edited Loading

postamar commented Aug 20, 2024 • edited Loading

postamar commented Aug 21, 2024

postamar commented Aug 20, 2024 •

edited

Loading

vercel bot commented Aug 20, 2024 •

edited

Loading

postamar commented Aug 20, 2024 •

edited

Loading