When upserting new record, index the record before updating the upsert metadata #7860

Jackie-Jiang · 2021-12-03T00:53:26Z

We should index the record before updating the upsert metadata (updating the validDocIds) so that when the new record is validated, it immediately becomes queryable. This can address the off-by-one issue described in #7849

NOTE: it is hard to add a test for such race conditions, so only added the comments to describe the behavior

…t metadata

codecov-commenter · 2021-12-03T02:48:00Z

Codecov Report

Merging #7860 (804189d) into master (47e49ec) will increase coverage by 0.08%.
The diff coverage is 27.27%.

@@             Coverage Diff              @@
##             master    #7860      +/-   ##
============================================
+ Coverage     71.63%   71.71%   +0.08%     
+ Complexity     4088     4081       -7     
============================================
  Files          1580     1581       +1     
  Lines         81100    81358     +258     
  Branches      12068    12128      +60     
============================================
+ Hits          58093    58348     +255     
+ Misses        19092    19071      -21     
- Partials       3915     3939      +24

Flag	Coverage Δ
integration1	`29.38% <3.03%> (-0.01%)`	⬇️
integration2	`27.87% <3.03%> (-0.02%)`	⬇️
unittests1	`68.80% <29.62%> (+0.04%)`	⬆️
unittests2	`14.50% <3.03%> (-0.06%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...t/local/upsert/PartitionUpsertMetadataManager.java	`76.23% <9.52%> (-3.98%)`	⬇️
...not/controller/validation/StorageQuotaChecker.java	`71.60% <16.66%> (-4.72%)`	⬇️
...local/indexsegment/mutable/MutableSegmentImpl.java	`59.08% <100.00%> (+0.24%)`	⬆️
...a/manager/realtime/RealtimeSegmentDataManager.java	`50.00% <0.00%> (-25.00%)`	⬇️
...ller/helix/core/minion/TaskTypeMetricsUpdater.java	`80.00% <0.00%> (-20.00%)`	⬇️
.../common/request/context/predicate/EqPredicate.java	`66.66% <0.00%> (-6.67%)`	⬇️
...verter/stats/MutableNoDictionaryColStatistics.java	`45.00% <0.00%> (-5.00%)`	⬇️
...lugin/stream/kafka20/KafkaStreamLevelConsumer.java	`88.70% <0.00%> (-4.84%)`	⬇️
.../loader/defaultcolumn/DefaultColumnStatistics.java	`76.00% <0.00%> (-4.00%)`	⬇️
...t/segment/spi/creator/ColumnIndexCreationInfo.java	`96.29% <0.00%> (-3.71%)`	⬇️
... and 48 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1baad80...804189d. Read the comment docs.

yupeng9 · 2021-12-03T16:43:06Z

...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java

      canTakeMore = _numDocsIndexed++ < _capacity;
+      _partitionUpsertMetadataManager.addRecord(this, recordInfo);


I think the consistency issue is always there, regardless of the order. Either we see duplicates, or we see missing records.

If we want to truly solve this problem, then we have to introduce a lock mechanism on the segment update/read per PK, though it would lead to performance implications.

The inconsistency across segments is hard to solve without locking, but this PR can solve the inconsistency within the segment.
More importantly, with the current code the new record is always missing before it is indexed, instead of immediately queryable when the upsert metadata is updated, so I would count this as a bug which need to be fixed

Okay, consistency within the segment is fair. Yes, I agree with the issue that we switched the record location early before it's added to index

yupeng9 · 2021-12-03T23:16:16Z

...ocal/src/main/java/org/apache/pinot/segment/local/upsert/PartitionUpsertMetadataManager.java

@@ -193,12 +175,6 @@ public GenericRow updateRecord(IndexSegment segment, RecordInfo recordInfo, Gene
            if (recordInfo._comparisonValue.compareTo(currentRecordLocation.getComparisonValue()) >= 0) {
              IndexSegment currentSegment = currentRecordLocation.getSegment();
              int currentDocId = currentRecordLocation.getDocId();
-              if (_partialUpsertHandler != null) {


I think we shall keep the partial upsert merge logic, as the groovy-based partial upsert logic can trigger some default logic even for newly ingested record

The partial upsert merge logic is separated as a separate method updateRecord(). This PR will keep the current functionality of partial upsert, but fix the inconsistency issue mentioned above

I see. So you extracted the update logic and perform it before adding the record.

…t metadata (apache#7860)

Jackie-Jiang requested a review from yupeng9 December 3, 2021 00:53

When upserting new record, index the record before updating the upser…

804189d

…t metadata

Jackie-Jiang force-pushed the upsert_update branch from 4d24b09 to 804189d Compare December 3, 2021 02:19

mapshen mentioned this pull request Dec 3, 2021

Inconsistent Row Counts from Upsert Tables #7849

Closed

yupeng9 reviewed Dec 3, 2021

View reviewed changes

yupeng9 approved these changes Dec 6, 2021

View reviewed changes

Jackie-Jiang merged commit 22b8969 into apache:master Dec 6, 2021

Jackie-Jiang deleted the upsert_update branch December 6, 2021 23:06

kriti-sc pushed a commit to kriti-sc/incubator-pinot that referenced this pull request Dec 12, 2021

When upserting new record, index the record before updating the upser…

2f03e96

…t metadata (apache#7860)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When upserting new record, index the record before updating the upsert metadata #7860

When upserting new record, index the record before updating the upsert metadata #7860

Jackie-Jiang commented Dec 3, 2021

codecov-commenter commented Dec 3, 2021 •

edited

Loading

yupeng9 Dec 3, 2021

Jackie-Jiang Dec 3, 2021

yupeng9 Dec 3, 2021

yupeng9 Dec 3, 2021

Jackie-Jiang Dec 6, 2021

yupeng9 Dec 6, 2021

		canTakeMore = _numDocsIndexed++ < _capacity;
		_partitionUpsertMetadataManager.addRecord(this, recordInfo);

When upserting new record, index the record before updating the upsert metadata #7860

When upserting new record, index the record before updating the upsert metadata #7860

Conversation

Jackie-Jiang commented Dec 3, 2021

codecov-commenter commented Dec 3, 2021 • edited Loading

Codecov Report

yupeng9 Dec 3, 2021

Choose a reason for hiding this comment

Jackie-Jiang Dec 3, 2021

Choose a reason for hiding this comment

yupeng9 Dec 3, 2021

Choose a reason for hiding this comment

yupeng9 Dec 3, 2021

Choose a reason for hiding this comment

Jackie-Jiang Dec 6, 2021

Choose a reason for hiding this comment

yupeng9 Dec 6, 2021

Choose a reason for hiding this comment

codecov-commenter commented Dec 3, 2021 •

edited

Loading