
Refresh connector configuration between attempts / activities #20912

Closed
pedroslopez opened this issue Dec 28, 2022 · 8 comments · Fixed by airbytehq/airbyte-platform#91

@pedroslopez (Contributor) commented Dec 28, 2022

Tell us about the problem you're trying to solve

#19811 introduces the ability for connectors to update configuration throughout the sync in order to support single-use refresh tokens. However, configuration is currently loaded at the start of the sync and passed around between attempts via Temporal. This means that even if the configuration in the database has been updated, subsequent attempts will not see it.

I also believe the same problem exists between CHECK and SYNC: if a check that emits a control message runs before the sync, the sync that runs afterwards will not have the updated configuration.

This is particularly bad because:

  • if there's a failure,
  • the next attempt will start with a CHECK and get new tokens,
  • but those new tokens won't be available to the SYNC part of the attempt, so we keep failing and looping.

Describe the solution you’d like

Subsequent sync attempts should use the latest configuration.

  • don't pass configs around via Temporal
  • don't store job config statically in the jobs table
  • have the orchestrator load the config from the database for each attempt/job. This may require a new internal API, since we need to hydrate the actor info with secrets (a minimal sketch follows below this list).
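
Concretely, here is one shape "load the config for each attempt from the database" could take. The ConfigStore, SecretsHydrator, and SyncInput types below are stand-ins for whatever internal API ends up doing this, not actual Airbyte classes:

```java
import java.util.UUID;

// Stand-in for whatever reads actor configuration from the configs database.
interface ConfigStore {
  String getSourceConfiguration(UUID sourceId);            // config with secret references
  String getDestinationConfiguration(UUID destinationId);
}

// Stand-in for the component that replaces secret references with real values.
interface SecretsHydrator {
  String hydrate(String configWithSecretReferences);
}

// The inputs an attempt actually runs with.
record SyncInput(String sourceConfig, String destinationConfig) {}

final class AttemptInputHydrator {
  private final ConfigStore configStore;
  private final SecretsHydrator secretsHydrator;

  AttemptInputHydrator(final ConfigStore configStore, final SecretsHydrator secretsHydrator) {
    this.configStore = configStore;
    this.secretsHydrator = secretsHydrator;
  }

  // Build the attempt's inputs from the database instead of from the job payload
  // that Temporal has been carrying around since the job was created.
  SyncInput latestInputs(final UUID sourceId, final UUID destinationId) {
    final String source = secretsHydrator.hydrate(configStore.getSourceConfiguration(sourceId));
    final String destination = secretsHydrator.hydrate(configStore.getDestinationConfiguration(destinationId));
    return new SyncInput(source, destination);
  }
}
```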
@evantahler (Contributor) commented Jan 2, 2023

Adding the Platform Workflow team, as this also falls into their wheelhouse. This problem also exists today if a user updates their config mid-attempt.

This story is blocked until we can talk to the platform team to learn the best way to do this. It also likely means that we cannot certify to GA any connector that uses short-lived credentials.

@evantahler (Contributor) commented Jan 4, 2023

I spoke with @davinchia, @colesnodgrass, and @jdpgrailsdev about this issue. We all agree that there's a platform-level problem we need to solve around the mutability of inputs between Temporal activities & job attempts. It is likely that as the platform transitions away from "sync-first" to "stream-first" behavior, this will become less of an issue... but that doesn't help us solve this problem in the near term. Today, the 'sync' is the atomic operation (inclusive of all of its activities), rather than the 'attempt', which is why inputs changing between attempts is hard for the platform to handle.

I think there are 2 potential paths forward:

  1. Re-hydrate config (and state) at the beginning of the replication activity
  2. Throw a new "restart-the-sync-now" exception

Re-hydrate config (and state) at the beginning of the replication activity

This pathway involves /always/ refreshing the contents of the config and state objects, as an atomic operation, as the first step of the replication activity. If we can get this working, we'll be sure that the sync is using the latest inputs (STATE and CONFIG) from the database.

We need to investigate whether Temporal will get mad at us if we change the values of objects it expects to be controlling in memory. One way around this might be the introduction of a new small sub-activity(?) to refresh these inputs, which runs before each "normal" phase of the replication workflow (check, replicate, normalize).
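
For what it's worth, Temporal's main constraint here is that workflow code must stay deterministic, so the database read itself would have to live inside an activity; the workflow then hands the refreshed result to the replication activity. A rough sketch using the Temporal Java SDK (the interface and type names are illustrative, not Airbyte's actual workflow code):

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;

// Result of the refresh step; a stand-in for the real input types.
record RefreshedInputs(String sourceConfig, String destinationConfig, String state) {}

// Small sub-activity that reads the latest config + state from the database.
@ActivityInterface
interface RefreshInputsActivity {
  @ActivityMethod
  RefreshedInputs refresh(String connectionId);
}

// Inside the (hypothetical) sync workflow implementation it would be invoked
// before each "normal" phase, e.g.:
//
//   final RefreshInputsActivity refresher =
//       Workflow.newActivityStub(RefreshInputsActivity.class, activityOptions);
//   final RefreshedInputs inputs = refresher.refresh(connectionId);
//   replicationActivity.replicate(inputs);   // replication always sees fresh inputs
```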

Throw a new "restart-the-sync-now" exception

@davinchia proposed an interesting idea which might be the stop-gap we need today. In the case in which the replication activity has received at least one AirbyteControlMessage, if there's a failure in the replication job, we throw a new custom exception that:

  • doesn't retry any remaining attempts
  • immediately re-schedules the replication for 1 second from now
  • cancels the current sync

This specially retried sync will re-load the config (and state) that it needs from the database and be on its way. There's a possible loop in which the CHECK the next sync does will change the config again... but that seems unlikely because of the short time period between syncs (a sketch of such an exception follows).
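
Assuming the workflow catches this exception, stops burning attempts, and schedules the follow-up sync itself, it could look something like the following (the class name and the helper calls in the comment are made up for illustration):

```java
import java.time.Duration;

// Thrown from the replication activity when a failure occurs after at least one
// AirbyteControlMessage has been observed: "don't retry remaining attempts,
// start a fresh sync that re-loads config and state from the database".
class RestartSyncNowException extends RuntimeException {

  // "1 second from now" per the proposal above.
  static final Duration RESCHEDULE_DELAY = Duration.ofSeconds(1);

  RestartSyncNowException(final String message, final Throwable cause) {
    super(message, cause);
  }
}

// The workflow side would then do roughly (hypothetical helpers):
//   catch (final RestartSyncNowException e) {
//     failCurrentSyncWithoutRetrying(e);
//     scheduleNextSync(RestartSyncNowException.RESCHEDULE_DELAY);
//   }
```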

This will look strange in the UI


The path forward is to time-box an investigation into these 2 options and see which one will be faster.

@cgardens (Contributor) commented Jan 4, 2023

If we emit a reasonable error message in option 2, it won't be that weird for the user: "Failure Origin: Facebook has reset the OAuth token, which requires the sync to restart from where it left off. It may need to retry up to 3 times."

Doing option 1 on its own feels a little spooky to me, because it breaks one of the core invariants of the current system: "every attempt for a job runs on the same inputs". If we want to move away from this paradigm entirely that's no problem, but poking a hole in it for just this use case will be hard to reason about. I guess unless we entirely change the paradigm to "every attempt always fetches the new config". That'll make the job / attempt distinction superfluous, but at least it's easy to understand.

pedroslopez self-assigned this Jan 7, 2023
@evantahler (Contributor)

Grooming:

  • Initial investigation suggests that both paths are roughly the same in difficulty. Time to write a tech spec to compare both ideas. @pedroslopez is leaning toward the "update the configs" approach (not the throw-an-error approach). Goal is to review with the platform team.

@evantahler (Contributor)

Quoting @cgardens:

Doing option 1 on its own feels a little spooky to me, because it breaks one of the core invariants of the current system: "every attempt for a job runs on the same inputs". If we want to move away from this paradigm entirely that's no problem, but poking a hole in it for just this use case will be hard to reason about. I guess unless we entirely change the paradigm to "every attempt always fetches the new config". That'll make the job / attempt distinction superfluous, but at least it's easy to understand.

I also agree that we should choose option 1, because "every attempt always fetches the new config" is what we want, and we also want "every attempt always fetches the latest state". Right now, I think we have a bug in which Attempt 2 re-uses the original state from before Attempt 1, which means checkpointing doesn't work. Solution 1 will solve this!
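
A minimal sketch of the state side of that, assuming the platform can read the most recently committed state for a connection at attempt start (StatePersistence here is a stand-in, not an actual Airbyte class):

```java
import java.util.Optional;
import java.util.UUID;

// Stand-in for wherever committed per-connection state lives.
interface StatePersistence {
  Optional<String> getCurrentState(UUID connectionId);   // latest state committed by any attempt
}

final class AttemptStateLoader {
  private final StatePersistence statePersistence;

  AttemptStateLoader(final StatePersistence statePersistence) {
    this.statePersistence = statePersistence;
  }

  // Attempt N+1 resumes from whatever Attempt N managed to commit, falling back
  // to the state captured at job creation only if nothing newer was persisted.
  String stateForNextAttempt(final UUID connectionId, final String stateAtJobCreation) {
    return statePersistence.getCurrentState(connectionId).orElse(stateAtJobCreation);
  }
}
```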

@evantahler (Contributor)

From @cgardens:

We do have a test for platform checkpointing:

    void testCheckpointing() throws Exception {
      final SourceDefinitionRead sourceDefinition = testHarness.createE2eSourceDefinition(workspaceId);
      final DestinationDefinitionRead destinationDefinition = testHarness.createE2eDestinationDefinition(workspaceId);
      // ... (excerpt)

but it only tests that checkpointing works after a canceled attempt, not a failed attempt. The bug identified above is real, and will be fixed by proposal 1 (Re-hydrate config (and state) at the beginning of the replication activity).

@pedroslopez (Contributor, Author)

@marcelopio (Contributor) commented Jan 13, 2023

The re-hydrate solution will also solve #17774
