Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support creating dictionary at runtime for an existing column #9678

Merged
merged 5 commits into from
Nov 2, 2022

Conversation

vvivekiyer
Copy link
Contributor

@vvivekiyer vvivekiyer commented Oct 28, 2022

OSS issue: #9348
Label: Feature
Document

With this PR, we add support to enable dictionary on a raw column for immutable segments during segment Reload. This support is added for both SV and MV columns.

Added Unit Tests. As the PR is being reviewed, I'm planning to add an integration test where we query on a column where dictionary is enabled.

@vvivekiyer vvivekiyer changed the title Enable dictionary code changes Support for enabling dictionary on a raw column Oct 28, 2022
@vvivekiyer vvivekiyer changed the title Support for enabling dictionary on a raw column SegmentReload: Support for enabling dictionary on a raw column Oct 28, 2022
@siddharthteotia siddharthteotia changed the title SegmentReload: Support for enabling dictionary on a raw column Support creating dictionary at runtime for an existing column Oct 28, 2022
@codecov-commenter
Copy link

codecov-commenter commented Oct 31, 2022

Codecov Report

Merging #9678 (07e5e7a) into master (63c6438) will decrease coverage by 44.06%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #9678       +/-   ##
=============================================
- Coverage     68.71%   24.65%   -44.07%     
+ Complexity     4971       53     -4918     
=============================================
  Files          1939     1938        -1     
  Lines        103671   104096      +425     
  Branches      15727    15778       +51     
=============================================
- Hits          71242    25664    -45578     
- Misses        27350    75822    +48472     
+ Partials       5079     2610     -2469     
Flag Coverage Δ
integration1 ?
integration2 24.65% <0.00%> (?)
unittests1 ?
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...t/creator/impl/SegmentIndexCreationDriverImpl.java 0.00% <0.00%> (-81.73%) ⬇️
...ocal/segment/index/loader/ForwardIndexHandler.java 0.00% <0.00%> (-71.01%) ⬇️
...ocal/segment/index/loader/IndexHandlerFactory.java 0.00% <0.00%> (-100.00%) ⬇️
...ocal/segment/index/loader/SegmentPreProcessor.java 0.00% <0.00%> (-82.83%) ⬇️
...in/java/org/apache/pinot/spi/utils/BytesUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/BaseRecording.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/NoOpRecording.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/config/table/FSTType.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/config/user/RoleType.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/data/MetricFieldSpec.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1498 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@vvivekiyer vvivekiyer marked this pull request as ready for review October 31, 2022 20:28
Copy link
Contributor

@somandal somandal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm, some minor nits

@vvivekiyer vvivekiyer requested a review from somandal November 2, 2022 00:04
@siddharthteotia
Copy link
Contributor

Let's add e2e query execution tests as well (need not be in integration but one of the subclasses of BaseQueriesTest)

String fwdIndexFileExtension;
if (isSingleValue) {
fwdIndexFileExtension =
existingColMetadata.isSorted() ? V1Constants.Indexes.SORTED_SV_FORWARD_INDEX_FILE_EXTENSION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case will never be hit. If the column is sorted then dictionary should have already existed. Sorted forward index is always dictionary encoded / dictionary based.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to understand, if someone had added a sorted column under the noDict list in the TableConfig what happens? For dict disable case, will this operation be allowed for sorted columns? Should this fail or be a no-op? Perhaps need to check this code path and add relevant Precondition checks / returns as no-op code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to write a unit test (please see ForwardIndexHandlerTest) where I created a sorted column without dictionary.
I understand the comment. But I believe it still makes sense to have this support just incase someone decides to change the behavior? We're just covering all cases by having this support. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behavior for sorted columns on the segment creation code path? How does it interact with the noDict columns list? The behavior should be consistent IMO. So if that code path allows creation of raw forward index for sorted columns then this code path should do. If not, I'd recommend tacking adding support for raw forward index for sorted columns as a separate change and tackle both segment creation and reload code paths. Wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. @vvivekiyer you may want to update this thread

@@ -223,6 +223,42 @@ public void testSelectionOverRangeFilter(String query, int min, int max, boolean
}
}

@Test(dataProvider = "selectionTestCases")
public void testSelectionOverRangeFilterAfterReload(String query, int min, int max, boolean inclusive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query execution tests seem to be focused on range index only. Any reason ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add other tests like

  • Run query on a raw column using that in filter
  • Enable dict + reload
  • Run same query again and it should use dict based predicate evaluator and return same result

Same goes for using the column in SELECT clause. It should correctly return the same result after doing dual lookup in rewriten fwd index and dict as it did before with raw fwd index. Things like that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of types for both SV and MV should be considered

@siddharthteotia
Copy link
Contributor

Decided to merge this in the interest of unblocking @somandal 's reload support work on #6473

Main work is done here. @vvivekiyer is going to follow-up immediately with

  • More tests (Called out above)
  • Sorted column handling check

@siddharthteotia siddharthteotia merged commit 351cb3e into apache:master Nov 2, 2022
operationMap = fwdIndexHandler.computeOperation(writer);
assertEquals(operationMap, Collections.EMPTY_MAP);

// TEST5: Change compression
// TEST8: Add text index and disable forward index.
Copy link
Contributor

@siddharthteotia siddharthteotia Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is misleading. You mean to say "Add text index and enable dictionary" ?

segmentLocalFSDirectory.close();

validateIndexMap(column, true);
validateForwardIndex(column, null);
Copy link
Contributor

@siddharthteotia siddharthteotia Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check if ColumnIndexEntry exists / dictionary index buffer exists by calling hasIndex() ?

validateIndex(ColumnIndexType.FORWARD_INDEX, EXISTING_INT_COL_RAW_MV, 18499, 15, _schema, false, false, false, 0,
false, 13, ChunkCompressionType.LZ4, false, DataType.INT, 106688);

// Enable dictionary and inverted index.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) should be enable dict, inv index and range index

_indexLoadingConfig.getRangeIndexColumns().add(EXISTING_INT_COL_RAW);
checkForwardIndexCreation(EXISTING_INT_COL_RAW, 42242, 16, _schema, false, true, false, 0, null, true, 0,
DataType.INT, 100000);
validateIndex(ColumnIndexType.RANGE_INDEX, EXISTING_INT_COL_RAW, 42242, 16, _schema, false, true, false, 0, true, 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually validate that range index got correctly rewritten to be dict based ? I guess that's where query exec tests will be useful

checkForwardIndexCreation(EXISTING_STRING_COL_DICT, 9, 4, _schema, false, true, false, 26, null, true, 0,
DataType.STRING, 100000);

// TEST 3: EXISTING_STRING_COL_RAW. Enable dictionary. Also add inverted index and text index. Reload code path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) you can potentially reorg tests to make them more readable

  • simple enable dict on SV, no other change
  • simple enable dict on MV, no other change
  • enable dict + on SV
  • enable dict + on MV

@@ -232,4 +268,38 @@ public void testCountOverRangeFilter(String query, int expectedCount) {
assertEquals(aggregationResult.size(), 1);
assertEquals(((Number) aggregationResult.get(0)).intValue(), expectedCount, query);
}

@Test(dataProvider = "countTestCases")
public void testCountOverRangeFilterAfterReload(String query, int expectedCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best for us to control the query. Not really sure if the query being sent here by the provider is indeed using the concerned column in parts where we want

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • SELECT FROM foo LIMIT -- this will test out that dict is correct and fwd index got rewritten
  • Try the same in GROUP BY as GROUP BY uses dict.
  • Try MIN and MAX aggregations -- they can be answered from dict.
  • WHERE clause like mentioned in another commend

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants