Enable Consistent Data Push on Ingestion Jobs for REFRESH use case (standalone only) #9268

Closed
yuanbenson opened this issue Aug 22, 2022 · 2 comments

Comments

@yuanbenson
Contributor

The consistent data push protocol is available through controller REST APIs such as startReplaceSegments, endReplaceSegments, and revertReplaceSegments. However, ingestion jobs are not yet wired to use this feature.

Introduce a new boolean, consistentDataPush, in TableConfig->ingestionConfig->batchIngestionConfig. When enabled, batch ingestion in REFRESH mode runs in consistent data push mode.
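For illustration, the flag's placement in a table config might look like the fragment below, built as a Python dict and serialized to JSON. The table name `myTable` and the neighbouring fields (`segmentIngestionType`, `segmentIngestionFrequency`) are assumptions added for context; only the `consistentDataPush` key under `ingestionConfig.batchIngestionConfig` comes from this proposal.

```python
import json

# Hypothetical OFFLINE table config fragment. Only the fields relevant to
# this proposal are shown; a real table config needs more sections.
table_config = {
    "tableName": "myTable",  # placeholder name, not from the issue
    "tableType": "OFFLINE",
    "ingestionConfig": {
        "batchIngestionConfig": {
            "segmentIngestionType": "REFRESH",     # assumed surrounding field
            "segmentIngestionFrequency": "DAILY",  # assumed surrounding field
            "consistentDataPush": True,            # the new flag proposed here
        }
    },
}

print(json.dumps(table_config, indent=2))
```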

Consistent push goal: support atomic switching (at the broker level) between data snapshots, eliminating the window during which a query is computed from an inconsistent mix of existing and new data. We also aim to provide an easy way to roll back to the previous data after a bad data push.
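The intended behaviour can be sketched with a small in-memory model. This is a toy under stated assumptions, not Pinot's implementation: the class `ToyTable`, its method names, and the three states are hypothetical stand-ins that loosely mirror the startReplaceSegments, endReplaceSegments, and revertReplaceSegments calls mentioned above. The point it shows is that queries see only the old snapshot until the replacement is marked complete, and that a revert brings the old snapshot back.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class State(Enum):
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    REVERTED = "REVERTED"


@dataclass
class LineageEntry:
    segments_from: list  # segments being replaced (old snapshot)
    segments_to: list    # segments replacing them (new snapshot)
    state: State = State.IN_PROGRESS


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table's segments plus replacement lineage."""
    segments: set = field(default_factory=set)
    lineage: dict = field(default_factory=dict)

    def start_replace(self, segments_from, segments_to):
        # Register the intent to replace before any new segment is uploaded.
        entry_id = str(uuid.uuid4())
        self.lineage[entry_id] = LineageEntry(list(segments_from), list(segments_to))
        return entry_id

    def upload(self, segment):
        self.segments.add(segment)

    def end_replace(self, entry_id):
        self.lineage[entry_id].state = State.COMPLETED

    def revert_replace(self, entry_id):
        self.lineage[entry_id].state = State.REVERTED

    def queryable_segments(self):
        # Completed entries hide the old snapshot; in-progress or reverted
        # entries hide the new one. Nothing is ever served half-and-half.
        hidden = set()
        for entry in self.lineage.values():
            if entry.state is State.COMPLETED:
                hidden.update(entry.segments_from)
            else:
                hidden.update(entry.segments_to)
        return self.segments - hidden


table = ToyTable(segments={"seg_v1"})
entry = table.start_replace(["seg_v1"], ["seg_v2"])
table.upload("seg_v2")
print(sorted(table.queryable_segments()))  # old snapshot while the push is in flight
table.end_replace(entry)
print(sorted(table.queryable_segments()))  # new snapshot after the switch
table.revert_replace(entry)
print(sorted(table.queryable_segments()))  # old snapshot again after rollback
```

The model keeps replaced segments around after completion so that a revert has something to return to; when and how old segments are eventually cleaned up is outside this sketch.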

See #7813 for more details.

Some tasks breakdown associated with this issue:

  1. Improve test coverage for pinot-batch-ingestion-standalone jobs to cover SegmentMetadataPushJobRunner,
    SegmentTarPushJobRunner and SegmentUriPushJobRunner.
  2. Refactor the logic common to all push job runners into a new abstract class, BaseSegmentPushJobRunner.
  3. Make the main change that enables consistent data push in ingestion jobs.
@Jackie-Jiang
Contributor

One challenge to solve here is the segment name conflict. Batch replace currently cannot be enabled for REFRESH mode because the new segments have the same names as the existing ones, so each upload directly replaces the existing segment. cc @snleee
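One hypothetical way around the conflict is to give each push distinct segment names, for example by appending a shared per-push suffix, so the new snapshot can coexist with the old one until the switch. The helper name `with_push_suffix` and the underscore-plus-timestamp format are illustrative assumptions, not a description of what Pinot does.

```python
import time


def with_push_suffix(segment_names, push_timestamp_ms=None):
    """Return new names with one shared per-push suffix appended.

    Hypothetical helper: the suffix format is an assumption for illustration.
    """
    if push_timestamp_ms is None:
        push_timestamp_ms = int(time.time() * 1000)
    return [f"{name}_{push_timestamp_ms}" for name in segment_names]


existing = ["myTable_0", "myTable_1"]
incoming = with_push_suffix(["myTable_0", "myTable_1"], push_timestamp_ms=1661126400000)

print(incoming)
# An empty intersection means uploading `incoming` cannot overwrite `existing`.
print(set(existing) & set(incoming))
```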

@yuanbenson
Contributor Author

Support for consistent push in ingestion jobs on the standalone executor was merged in #9295.
Opening a new issue for Hadoop and Spark support.

@yuanbenson changed the title from "Enable Consistent Data Push on Ingestion Jobs for REFRESH use case" to "Enable Consistent Data Push on Ingestion Jobs for REFRESH use case (standalone only)" on Sep 7, 2022