Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When upserting new record, index the record before updating the upsert metadata #7860

Merged
merged 1 commit into from
Dec 6, 2021

Conversation

Jackie-Jiang
Copy link
Contributor

We should index the record before updating the upsert metadata (updating the validDocIds) so that when the new record is validated, it immediately becomes queryable. This can address the off-by-one issue described in #7849

NOTE: it is hard to add a test for such race conditions, so only added the comments to describe the behavior

@Jackie-Jiang Jackie-Jiang requested a review from yupeng9 December 3, 2021 00:53
@codecov-commenter
Copy link

codecov-commenter commented Dec 3, 2021

Codecov Report

Merging #7860 (804189d) into master (47e49ec) will increase coverage by 0.08%.
The diff coverage is 27.27%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7860      +/-   ##
============================================
+ Coverage     71.63%   71.71%   +0.08%     
+ Complexity     4088     4081       -7     
============================================
  Files          1580     1581       +1     
  Lines         81100    81358     +258     
  Branches      12068    12128      +60     
============================================
+ Hits          58093    58348     +255     
+ Misses        19092    19071      -21     
- Partials       3915     3939      +24     
Flag Coverage Δ
integration1 29.38% <3.03%> (-0.01%) ⬇️
integration2 27.87% <3.03%> (-0.02%) ⬇️
unittests1 68.80% <29.62%> (+0.04%) ⬆️
unittests2 14.50% <3.03%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...t/local/upsert/PartitionUpsertMetadataManager.java 76.23% <9.52%> (-3.98%) ⬇️
...not/controller/validation/StorageQuotaChecker.java 71.60% <16.66%> (-4.72%) ⬇️
...local/indexsegment/mutable/MutableSegmentImpl.java 59.08% <100.00%> (+0.24%) ⬆️
...a/manager/realtime/RealtimeSegmentDataManager.java 50.00% <0.00%> (-25.00%) ⬇️
...ller/helix/core/minion/TaskTypeMetricsUpdater.java 80.00% <0.00%> (-20.00%) ⬇️
.../common/request/context/predicate/EqPredicate.java 66.66% <0.00%> (-6.67%) ⬇️
...verter/stats/MutableNoDictionaryColStatistics.java 45.00% <0.00%> (-5.00%) ⬇️
...lugin/stream/kafka20/KafkaStreamLevelConsumer.java 88.70% <0.00%> (-4.84%) ⬇️
.../loader/defaultcolumn/DefaultColumnStatistics.java 76.00% <0.00%> (-4.00%) ⬇️
...t/segment/spi/creator/ColumnIndexCreationInfo.java 96.29% <0.00%> (-3.71%) ⬇️
... and 48 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1baad80...804189d. Read the comment docs.

canTakeMore = _numDocsIndexed++ < _capacity;
_partitionUpsertMetadataManager.addRecord(this, recordInfo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the consistency issue is always there, regardless of the order. Either we see duplicates, or we see missing records.

If we want to truly solve this problem, then we have to introduce a lock mechanism on the segment update/read per PK, though it would lead to performance implications.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inconsistency across segments is hard to solve without locking, but this PR can solve the inconsistency within the segment.
More importantly, with the current code the new record is always missing before it is indexed, instead of immediately queryable when the upsert metadata is updated, so I would count this as a bug which need to be fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, consistency within the segment is fair. Yes, I agree with the issue that we switched the record location early before it's added to index

@@ -193,12 +175,6 @@ public GenericRow updateRecord(IndexSegment segment, RecordInfo recordInfo, Gene
if (recordInfo._comparisonValue.compareTo(currentRecordLocation.getComparisonValue()) >= 0) {
IndexSegment currentSegment = currentRecordLocation.getSegment();
int currentDocId = currentRecordLocation.getDocId();
if (_partialUpsertHandler != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shall keep the partial upsert merge logic, as the groovy-based partial upsert logic can trigger some default logic even for newly ingested record

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The partial upsert merge logic is separated as a separate method updateRecord(). This PR will keep the current functionality of partial upsert, but fix the inconsistency issue mentioned above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So you extracted the update logic and perform it before adding the record.

@Jackie-Jiang Jackie-Jiang merged commit 22b8969 into apache:master Dec 6, 2021
@Jackie-Jiang Jackie-Jiang deleted the upsert_update branch December 6, 2021 23:06
kriti-sc pushed a commit to kriti-sc/incubator-pinot that referenced this pull request Dec 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants