Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove start of run flag from ingest job tracker #4256

Closed
patchwork01 opened this issue Feb 13, 2025 · 0 comments · Fixed by #4284
Closed

Remove start of run flag from ingest job tracker #4256

patchwork01 opened this issue Feb 13, 2025 · 0 comments · Fixed by #4284
Assignees
Milestone

Comments

@patchwork01
Copy link
Collaborator

patchwork01 commented Feb 13, 2025

Background

There's a difference between how the ingest job tracker handles bulk import and standard ingest.

For bulk import, a run of a job starts when it's validated in the bulk import starter lambda. There's then a separate started status update when the Spark driver picks up the job. These are correlated into a single run of the job with a job run ID that's generated in the starter lambda and passed into the Spark driver. The validation and job run ID generation happens in BulkImportExecutor, and the further status updates are done in BulkImportJobDriver.

For standard ingest, a run of a job starts when an ingest task receives the job from the queue and starts running it. The updates were previously correlated by just the ID of the task, but there's now a job run ID, as there is in bulk import. The status updates and job run ID generation happen in IngestTask.

When the updates were correlated by task ID for standard ingest, we were detecting the start of a new run in JobRunsBuilder by checking for whether the status update should start a new run or not. Because the started status update is only the start of a run for standard ingest, there's a flag on that update for whether it starts a run or not. This is no longer used outside of tests, since everything uses the job run ID instead.

Description

We'd like to make the job run ID mandatory in updates to the ingest job tracker, and remove the flag that sets whether an ingest started status update should start a run or not.

This should avoid confusion in tests when no job runs are detected because the flag was not set. We had some trouble with this when we adjusted the logic for how the tracker updates are applied in #4253.

Analysis

There are a number of tests that still correlate the status updates based on the task ID, and don't set a job run ID. We'll need to adjust those tests to set a job run ID.

We can then make the job run ID mandatory in the ingest job events that have it by adding a requireNonNull call in the constructor (e.g. in IngestJobStartedEvent).

We can then remove the code to correlate on task ID in JobRunsBuilder.getBuilderIfCorrelatable, and remove the startOfRun field from IngestJobStartedEvent and IngestJobStartedStatus.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant