When upserting new record, index the record before updating the upsert metadata #7860
Codecov Report
@@             Coverage Diff              @@
##             master    #7860      +/-   ##
============================================
+ Coverage     71.63%   71.71%   +0.08%
+ Complexity     4088     4081       -7
============================================
  Files          1580     1581       +1
  Lines         81100    81358     +258
  Branches      12068    12128      +60
============================================
+ Hits          58093    58348     +255
+ Misses        19092    19071      -21
- Partials       3915     3939      +24
canTakeMore = _numDocsIndexed++ < _capacity;
_partitionUpsertMetadataManager.addRecord(this, recordInfo);
I think the consistency issue is always there, regardless of the order. Either we see duplicates, or we see missing records.
If we want to truly solve this problem, then we have to introduce a lock mechanism on the segment update/read per PK, though it would lead to performance implications.
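As a rough illustration of the per-PK locking idea above, here is a hypothetical sketch. It is not Pinot code: `PerKeyLockSketch`, `upsert`, and `validCount` are made-up names, and lock striping is only one assumed way to get a lock per primary key. The point is that the "invalidate old doc, validate new doc" step becomes atomic with respect to readers of the same key.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these names come from Pinot.
class PerKeyLockSketch {
  private static final int NUM_STRIPES = 16;

  // One lock per stripe; a primary key always maps to the same stripe.
  private final ReentrantLock[] _stripes = new ReentrantLock[NUM_STRIPES];
  private final Map<String, Integer> _pkToDocId = new ConcurrentHashMap<>();
  private final Set<Integer> _validDocIds = ConcurrentHashMap.newKeySet();

  PerKeyLockSketch() {
    for (int i = 0; i < NUM_STRIPES; i++) {
      _stripes[i] = new ReentrantLock();
    }
  }

  private ReentrantLock lockFor(String pk) {
    return _stripes[Math.floorMod(pk.hashCode(), NUM_STRIPES)];
  }

  // Replaces the valid doc for this key while holding the key's lock.
  void upsert(String pk, int newDocId) {
    ReentrantLock lock = lockFor(pk);
    lock.lock();
    try {
      Integer oldDocId = _pkToDocId.put(pk, newDocId);
      if (oldDocId != null) {
        // Without the lock, a reader here could see zero valid docs for the key.
        _validDocIds.remove(oldDocId);
      }
      _validDocIds.add(newDocId);
    } finally {
      lock.unlock();
    }
  }

  // Counts how many of the candidate docs are valid, under the same lock as upsert.
  int validCount(String pk, int... candidateDocIds) {
    ReentrantLock lock = lockFor(pk);
    lock.lock();
    try {
      int count = 0;
      for (int docId : candidateDocIds) {
        if (_validDocIds.contains(docId)) {
          count++;
        }
      }
      return count;
    } finally {
      lock.unlock();
    }
  }
}
```

Every read of a key now takes a lock, which is where the performance cost mentioned above would come from.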
The inconsistency across segments is hard to solve without locking, but this PR can solve the inconsistency within the segment.
More importantly, with the current code the new record is always missing until it is indexed, rather than becoming queryable as soon as the upsert metadata is updated, so I would count this as a bug that needs to be fixed.
Okay, consistency within the segment is fair. Yes, I agree there is an issue: we switched the record location too early, before the record was added to the index.
@@ -193,12 +175,6 @@ public GenericRow updateRecord(IndexSegment segment, RecordInfo recordInfo, Gene
if (recordInfo._comparisonValue.compareTo(currentRecordLocation.getComparisonValue()) >= 0) {
IndexSegment currentSegment = currentRecordLocation.getSegment();
int currentDocId = currentRecordLocation.getDocId();
if (_partialUpsertHandler != null) {
I think we should keep the partial upsert merge logic, as the Groovy-based partial upsert logic can trigger some default logic even for a newly ingested record.
The partial upsert merge logic is extracted into a separate method, updateRecord(). This PR keeps the current partial upsert functionality but fixes the inconsistency issue mentioned above.
I see. So you extracted the update logic and perform it before adding the record.
We should index the record before updating the upsert metadata (updating the validDocIds) so that when the new record is validated, it immediately becomes queryable. This can address the off-by-one issue described in #7849.
NOTE: It is hard to add a test for such race conditions, so this PR only adds comments describing the behavior.
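To make the ordering argument concrete, here is a minimal single-threaded sketch of the two orderings. It is a toy model, not Pinot's implementation: `UpsertOrderSketch` and its methods are hypothetical names, and the assumption that a query only sees valid doc IDs that are already indexed is part of the model.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a mutable segment; not Pinot's actual classes.
class UpsertOrderSketch {
  private final List<String> _indexedDocs = new ArrayList<>();      // docId -> value
  private final Map<String, Integer> _pkToDocId = new HashMap<>();  // upsert metadata
  private final BitSet _validDocIds = new BitSet();

  // The doc ID the next indexed record will receive.
  int nextDocId() {
    return _indexedDocs.size();
  }

  // Step A: make the record part of the index.
  void indexRecord(String value) {
    _indexedDocs.add(value);
  }

  // Step B: invalidate the previous doc for this key and validate the new one.
  void updateUpsertMetadata(String pk, int docId) {
    Integer previous = _pkToDocId.put(pk, docId);
    if (previous != null) {
      _validDocIds.clear(previous);
    }
    _validDocIds.set(docId);
  }

  // Assumed query model: returns values of docs that are both valid and indexed.
  List<String> query() {
    List<String> result = new ArrayList<>();
    for (int docId = _validDocIds.nextSetBit(0);
        docId >= 0 && docId < _indexedDocs.size();
        docId = _validDocIds.nextSetBit(docId + 1)) {
      result.add(_indexedDocs.get(docId));
    }
    return result;
  }
}
```

In this model, calling `indexRecord` and then `updateUpsertMetadata` means a query between the two steps still returns the old record, and the new one appears as soon as it is validated. Reversing the order means a query between the two steps returns nothing for that key, because the old doc is already invalid and the new doc is not yet indexed.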