Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory #44458

Conversation

postamar
Copy link
Contributor

@postamar postamar commented Aug 20, 2024

Replaces #44398

Fixes airbytehq/airbyte-internal-issues#9093

This PR makes the Bulk CDK's JDBC toolkit useful for JDBC source connectors which aren't Oracle.

The current JDBC toolkit's greatest shortcoming (and blocker for porting source-mysql) is how tightly it's coupled to the stream state value model defined by

data class DefaultJdbcStreamStateValue(
    @JsonProperty("primary_key") val primaryKey: Map<String, JsonNode> = mapOf(),
    @JsonProperty("cursors") val cursors: Map<String, JsonNode> = mapOf(),
)

This model is going to make implementing CTID or XMIN states really annoying at best! Also forget about re-using the existing state values serialized by the existing source-mysql connector...

Therefore, it's necessary to make the connector fully own the state value model and translating stream state values into SQL queries. For this purpose this PR introduces the JdbcPartition and JdbcPartitionFactory interfaces:

  • A JdbcPartition object represents a specific subset of data within a table and provides a query to access it as well as the state value to use as a checkpoint for having emitted said data.
  • A JdbcPartitionFactory object will map the input state value into a JdbcPartition object representing the remaining data to read in the table; the factory can optionally subdivide that into smaller partitions.
  • This PR also provides default implementations of these two interfaces which makes use of DefaultJdbcStreamStateValue which can be used by naive JDBC source connectors.

A pleasant consequence of this design is that the PartitionsCreatorFactory, PartitionsCreator and PartitionReader implementations for JDBC connectors are much more database-agnostic now:

  • JdbcPartitionReader is the PartitionReader implementation for JDBC source connectors and it focuses on running an arbitrary query and packaging its results; it comes in a resumable- and in a non-resumable flavour.
  • JdbcPartitionsCreator is the PartitionsCreator implementation for JDBC source connectors and it focuses on computing cursor column upper bounds and fetchSize values, it comes in a sequential- and in a concurrent (i.e. splitting) flavour.
  • JdbcPartitionsCreatorFactory is the PartitionsCreatorFactory implementation for JDBC source connectors and merely glues JdbcPartitionFactory to the read operation machinery.

What this means for porting legacy JDBC source connectors like source-mysql to the bulk CDK is this:

  • The existing stream state value object definition from source-mysql/src/resources/internal_models.yaml needs to be kept if we are to maintain compatibility (and we really should).
  • We need to implement class MySqlPartitionFactory : JdbcPartitionFactory<MySqlSharedState,MySqlStreamState,MySqlPartition> to deserialize the above state values into MySqlPartition objects
  • A MySqlPartition subclass may or may not be thin wrapper around a DefaultJdbcPartition subclass, with mysql-specific fairy-dust sprinkled around it, especially to serialize internal_models.yaml objects.
  • MySqlSharedState and MySqlStreamState may be even more bare-bones, who knows.

Review guide

I considered breaking down this PR into more commits but the changes mainly consists of deleting and adding whole files so I'm not sure how useful that would be. Here's a suggested file reading order:

  1. JdbcPartitionsCreatorFactory.kt which has the PartitionsCreatorFactory implementations for JDBC source connectors.
  2. considering that (1) heavily depends on JdbcPartitionFactory, I would look at JdbcPartitionFactory.kt and JdbcPartition.kt next, which should give an idea of what a JdbcPartition is.
  3. Either supplement that with Default*.kt files for the implementations, or with JdbcPartitionsCreator.kt and JdbcPartitionReader.kt which have the PartitionsCreator and PartitionReader implementations for source connectors.
  4. Look at the tests.

Copy link

vercel bot commented Aug 20, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Aug 20, 2024 8:57pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Aug 20, 2024
@postamar postamar marked this pull request as ready for review August 20, 2024 15:40
@postamar postamar requested a review from a team as a August 20, 2024 15:40
@postamar postamar force-pushed the postamar_gt_08-20-extract-jdbc_rename_partitionscreator_and_partitionreader_impls branch from 456b7b0 to e82f2fe Compare August 20, 2024 15:44
@postamar postamar force-pushed the postamar_gt_08-20-extract-jdbc_add_and_adopt_jdbcpartition_and_jdbcpartitionfactory branch from b377b33 to 6a4767b Compare August 20, 2024 15:44
@postamar postamar force-pushed the postamar_gt_08-20-extract-jdbc_rename_partitionscreator_and_partitionreader_impls branch from e82f2fe to f3f5fb0 Compare August 20, 2024 18:48
@postamar postamar force-pushed the postamar_gt_08-20-extract-jdbc_add_and_adopt_jdbcpartition_and_jdbcpartitionfactory branch from 6a4767b to 76136ec Compare August 20, 2024 18:48
@postamar postamar force-pushed the postamar_gt_08-20-extract-jdbc_rename_partitionscreator_and_partitionreader_impls branch from f3f5fb0 to 4152603 Compare August 20, 2024 20:56
@postamar postamar force-pushed the postamar_gt_08-20-extract-jdbc_add_and_adopt_jdbcpartition_and_jdbcpartitionfactory branch from 76136ec to a37b4b3 Compare August 20, 2024 20:56
@postamar
Copy link
Contributor Author

Closing because I'm merging the top of the stack #44482 into master using slash-approve-and-merge.

@postamar postamar closed this Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants