
Refresh connector configuration between attempts / activities #20912

Closed
pedroslopez opened this issue Dec 28, 2022 · 8 comments · Fixed by airbytehq/airbyte-platform#91

@pedroslopez (Contributor) commented Dec 28, 2022

Tell us about the problem you're trying to solve

#19811 introduces the ability for connectors to update configuration throughout the sync in order to support single-use refresh tokens. However, configuration is currently loaded at the start of the sync and passed around between attempts via Temporal. This means that even if the configuration in the database has been updated, subsequent attempts will not see it.

I also believe the same problem exists between CHECK and SYNC: if a check that emits a control message runs before the sync, the sync that runs afterwards will not have the updated configuration.

This is particularly bad because:

  • if there's a failure,
  • the next attempt will start with a CHECK and get new tokens,
  • but those new tokens won't be available to the SYNC part of the attempt, so we keep failing and looping.

Describe the solution you’d like

Subsequent sync attempts should use the latest configuration.

  • don't pass configs around via Temporal
  • don't store job config statically in the jobs table
  • have the orchestrator load the config from the database for each attempt/job. This may require a new internal API, since we need to hydrate the actor info with secrets (a minimal sketch follows below this list).
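
Concretely, here is one shape "load the config for each attempt from the database" could take. The ConfigStore, SecretsHydrator, and SyncInput types below are stand-ins for whatever internal API ends up doing this, not actual Airbyte classes:

```java
import java.util.UUID;

// Stand-in for whatever reads actor configuration from the configs database.
interface ConfigStore {
  String getSourceConfiguration(UUID sourceId);            // config with secret references
  String getDestinationConfiguration(UUID destinationId);
}

// Stand-in for the component that replaces secret references with real values.
interface SecretsHydrator {
  String hydrate(String configWithSecretReferences);
}

// The inputs an attempt actually runs with.
record SyncInput(String sourceConfig, String destinationConfig) {}

final class AttemptInputHydrator {
  private final ConfigStore configStore;
  private final SecretsHydrator secretsHydrator;

  AttemptInputHydrator(final ConfigStore configStore, final SecretsHydrator secretsHydrator) {
    this.configStore = configStore;
    this.secretsHydrator = secretsHydrator;
  }

  // Build the attempt's inputs from the database instead of from the job payload
  // that Temporal has been carrying around since the job was created.
  SyncInput latestInputs(final UUID sourceId, final UUID destinationId) {
    final String source = secretsHydrator.hydrate(configStore.getSourceConfiguration(sourceId));
    final String destination = secretsHydrator.hydrate(configStore.getDestinationConfiguration(destinationId));
    return new SyncInput(source, destination);
  }
}
```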
@evantahler (Contributor) commented Jan 2, 2023

Adding the Platform Workflow team, as this also falls into their wheelhouse. This problem also exists today if a user updates their config mid-attempt.

This story is blocked until we can talk to the platform team to learn the best way to do this. It also likely means that we cannot certify to GA any connector that uses short-lived credentials.

@evantahler (Contributor) commented Jan 4, 2023

I spoke with @davinchia, @colesnodgrass, and @jdpgrailsdev about this issue. We all agree that there's a platform-level problem we need to solve around the mutability of inputs between Temporal activities & job attempts. It is likely that as the platform transitions away from "sync-first" to "stream-first" behavior, this will become less of an issue... but that doesn't help us solve this problem in the near term. Today, the 'sync' is the atomic operation (inclusive of all of its activities), rather than the 'attempt', which is why inputs changing between attempts is hard for the platform to handle.

I think there are 2 potential paths forward:

  1. Re-hydrate config (and state) at the beginning of the replication activity
  2. Throw a new "restart-the-sync-now" exception

Re-hydrate config (and state) at the beginning of the replication activity

This pathway involves /always/ refreshing the contents of the config and state objects, as an atomic operation, as the first step of the replication activity. If we can get this working, we'll be sure that the sync is using the latest inputs (STATE and CONFIG) from the database.

We need to investigate whether Temporal will get mad at us if we change the values of objects it expects to be controlling in memory. One way around this might be the introduction of a new small sub-activity(?) to refresh these inputs, which runs before each "normal" phase of the replication workflow (check, replicate, normalize).
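
For what it's worth, Temporal's main constraint here is that workflow code must stay deterministic, so the database read itself would have to live inside an activity; the workflow then hands the refreshed result to the replication activity. A rough sketch using the Temporal Java SDK (the interface and type names are illustrative, not Airbyte's actual workflow code):

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;

// Result of the refresh step; a stand-in for the real input types.
record RefreshedInputs(String sourceConfig, String destinationConfig, String state) {}

// Small sub-activity that reads the latest config + state from the database.
@ActivityInterface
interface RefreshInputsActivity {
  @ActivityMethod
  RefreshedInputs refresh(String connectionId);
}

// Inside the (hypothetical) sync workflow implementation it would be invoked
// before each "normal" phase, e.g.:
//
//   final RefreshInputsActivity refresher =
//       Workflow.newActivityStub(RefreshInputsActivity.class, activityOptions);
//   final RefreshedInputs inputs = refresher.refresh(connectionId);
//   replicationActivity.replicate(inputs);   // replication always sees fresh inputs
```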

Throw a new "restart-the-sync-now" exception

@davinchia proposed an interesting idea which might be the stop-gap we need today. In the case in which the replication activity has received at least one AirbyteControlMessage, if there's a failure in the replication job, we throw a new custom exception that:

  • doesn't retry any remaining attempts
  • immediately re-schedules the replication for 1 second from now
  • cancels the current sync

This specially retried sync will re-load the config (and state) that it needs from the database and be on its way. There's a possible loop in which the CHECK the next sync does will change the config again... but that seems unlikely because of the short time period between syncs (a sketch of such an exception follows).
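
Assuming the workflow catches this exception, stops burning attempts, and schedules the follow-up sync itself, it could look something like the following (the class name and the helper calls in the comment are made up for illustration):

```java
import java.time.Duration;

// Thrown from the replication activity when a failure occurs after at least one
// AirbyteControlMessage has been observed: "don't retry remaining attempts,
// start a fresh sync that re-loads config and state from the database".
class RestartSyncNowException extends RuntimeException {

  // "1 second from now" per the proposal above.
  static final Duration RESCHEDULE_DELAY = Duration.ofSeconds(1);

  RestartSyncNowException(final String message, final Throwable cause) {
    super(message, cause);
  }
}

// The workflow side would then do roughly (hypothetical helpers):
//   catch (final RestartSyncNowException e) {
//     failCurrentSyncWithoutRetrying(e);
//     scheduleNextSync(RestartSyncNowException.RESCHEDULE_DELAY);
//   }
```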

This will look strange in the UI


The path forward is to time-box an investigation into these 2 options and see which one will be faster.

@cgardens (Contributor) commented Jan 4, 2023

If we emit a reasonable error message in option 2, it won't be that weird for the user: "Failure Origin: Facebook has reset the OAuth token, which requires the sync to restart from where it left off. It may need to retry up to 3 times."

Doing option 1 on its own feels a little spooky to me, because it breaks one of the core invariants of the current system: "every attempt for a job runs on the same inputs". If we want to move away from this paradigm entirely that's no problem, but poking a hole in it for just this use case will be hard to reason about. I guess unless we entirely change the paradigm to "every attempt always fetches the new config". That'll make the job / attempt distinction superfluous, but at least it's easy to understand.

pedroslopez self-assigned this Jan 7, 2023
@evantahler (Contributor)

Grooming:

  • Initial investigation suggests that both paths are roughly the same in difficulty. Time to write a tech spec to compare both ideas. @pedroslopez is leaning toward the "update the configs" approach (not the throw-an-error approach). Goal is to review with the platform team.

@evantahler (Contributor)

Quoting @cgardens:

Doing option 1 on its own feels a little spooky to me, because it breaks one of the core invariants of the current system: "every attempt for a job runs on the same inputs". If we want to move away from this paradigm entirely that's no problem, but poking a hole in it for just this use case will be hard to reason about. I guess unless we entirely change the paradigm to "every attempt always fetches the new config". That'll make the job / attempt distinction superfluous, but at least it's easy to understand.

I also agree that we should choose option 1, because "every attempt always fetches the new config" is what we want, and we also want "every attempt always fetches the latest state". Right now, I think we have a bug in which Attempt 2 re-uses the original state from before Attempt 1, which means checkpointing doesn't work. Solution 1 will solve this!
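
A minimal sketch of the state side of that, assuming the platform can read the most recently committed state for a connection at attempt start (StatePersistence here is a stand-in, not an actual Airbyte class):

```java
import java.util.Optional;
import java.util.UUID;

// Stand-in for wherever committed per-connection state lives.
interface StatePersistence {
  Optional<String> getCurrentState(UUID connectionId);   // latest state committed by any attempt
}

final class AttemptStateLoader {
  private final StatePersistence statePersistence;

  AttemptStateLoader(final StatePersistence statePersistence) {
    this.statePersistence = statePersistence;
  }

  // Attempt N+1 resumes from whatever Attempt N managed to commit, falling back
  // to the state captured at job creation only if nothing newer was persisted.
  String stateForNextAttempt(final UUID connectionId, final String stateAtJobCreation) {
    return statePersistence.getCurrentState(connectionId).orElse(stateAtJobCreation);
  }
}
```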

@evantahler (Contributor)

From @cgardens:

We do have a test for platform checkpointing:

    void testCheckpointing() throws Exception {
      final SourceDefinitionRead sourceDefinition = testHarness.createE2eSourceDefinition(workspaceId);
      final DestinationDefinitionRead destinationDefinition = testHarness.createE2eDestinationDefinition(workspaceId);
      // ... (excerpt)

but it only tests that checkpointing works after a canceled attempt, not a failed attempt. The bug identified above is real, and will be fixed by proposal 1 (Re-hydrate config (and state) at the beginning of the replication activity).

@pedroslopez (Contributor, Author)

@marcelopio (Contributor) commented Jan 13, 2023

The re-hydrate solution will also solve #17774
