Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default column handling of noForwardIndex and regeneration of forward index on reload path #9810

Conversation

somandal
Copy link
Contributor

Related OSS Issue: #6473
Related OSS PRs:
#9333
#9740

Document outlining the breakup of PRs and the approach to solve various sub-problems of this issue: https://docs.google.com/document/d/1MNLLhYCg5e-UFBQ6wTBODd41sDsbjevwRfwoGuNowWw/edit?usp=sharing

This PR adds support for the following operations related to forward index disabled columns:

  • Adds default column handling for MV forward index disabled columns. Creates an inverted index for default MV columns with forward index disabled instead of a forward index.
  • Adds a utility class to generate the forward index from the inverted index and dictionary.
  • Adds support to enable the forward index on existing columns with the forward index disabled.
  • Adds support to create a temporary forward index if other indexes are added for columns where the forward index is disabled.

This PR does not add support for the following (and there will be follow up PRs to handle these):

  • Dealing with duplicated entries within a multi-value row.
  • Derived column handling.

Tests includes:

  • Segment reload path for default columns and existing columns
  • Query path to ensure queries work when the forward index is re-enabled

cc @siddharthteotia @Jackie-Jiang @vvivekiyer

@siddharthteotia siddharthteotia changed the title Handle forward index disabled default columns and add support to enable the forward index on a forward index disabled column Default column handling of noForwardIndex and support regeneration of forward index on reload path Nov 16, 2022
@siddharthteotia siddharthteotia changed the title Default column handling of noForwardIndex and support regeneration of forward index on reload path Default column handling of noForwardIndex and regeneration of forward index on reload path Nov 16, 2022
@codecov-commenter
Copy link

codecov-commenter commented Nov 16, 2022

Codecov Report

Merging #9810 (d537783) into master (d3ea8dc) will increase coverage by 54.60%.
The diff coverage is 78.29%.

@@              Coverage Diff              @@
##             master    #9810       +/-   ##
=============================================
+ Coverage     15.86%   70.47%   +54.60%     
- Complexity      176     5074     +4898     
=============================================
  Files          1931     1987       +56     
  Lines        104306   107133     +2827     
  Branches      15901    16272      +371     
=============================================
+ Hits          16551    75497    +58946     
+ Misses        86531    26366    -60165     
- Partials       1224     5270     +4046     
Flag Coverage Δ
integration1 25.16% <0.00%> (?)
integration2 24.48% <0.00%> (?)
unittests1 67.94% <78.29%> (?)
unittests2 15.84% <1.56%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../apache/pinot/common/exception/QueryException.java 94.44% <ø> (+94.44%) ⬆️
...va/org/apache/pinot/query/runtime/QueryRunner.java 84.09% <0.00%> (ø)
...t/index/loader/bloomfilter/BloomFilterHandler.java 40.54% <50.00%> (+40.54%) ⬆️
...ocal/segment/index/loader/ForwardIndexHandler.java 81.87% <69.04%> (+81.87%) ⬆️
...inot/segment/local/customobject/VarianceTuple.java 64.28% <75.00%> (+64.28%) ⬆️
...tedIndexAndDictionaryBasedForwardIndexCreator.java 76.77% <76.77%> (ø)
...ent/index/loader/invertedindex/H3IndexHandler.java 74.07% <80.00%> (+74.07%) ⬆️
...t/index/loader/invertedindex/JsonIndexHandler.java 72.83% <80.00%> (+72.83%) ⬆️
...t/local/segment/index/loader/BaseIndexHandler.java 82.14% <82.14%> (ø)
.../apache/pinot/common/datablock/DataBlockUtils.java 90.43% <100.00%> (+90.43%) ⬆️
... and 1502 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@somandal somandal force-pushed the enable-forward-index-and-default-column-handling branch from 03d37ce to cf4e87d Compare November 17, 2022 20:09
}

private void updateMetadataProperties(File indexDir, Map<String, String> metadataProperties)
static void updateMetadataProperties(File indexDir, Map<String, String> metadataProperties)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm wondering if this will actually work in stateless / static fashion. Caller need not do anything after properties.save() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow. properties.save() will save the properties into the metadata file. After this the next step is to cleanup indexes and stuff which is not related to this.



/**
* Base class for all of the {@link IndexHandler} classes which need to build their indexes from the forward index.
Copy link
Contributor

@siddharthteotia siddharthteotia Dec 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused by this.

IIUC, a particular IndexHandler is supposed to create that index or remove / cleanup / update that index.

We recently added ForwardIndexHandler as well which can rewrite forward index (for work done in #9348) and also remove forward index for noForwardIndex feature.

Similarly this PR uses ForwardIndexHandler to rebuild forward index using some other structures.

So I am not sure if above statement "all of the IndexHandler classes which need to build their indexes from the forward index. is exactly true or intuitive ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a list of indexes that require the forward index to build their indexes, such as inverted index, JSON index, etc. For most scenarios within the ForwardIndexHandler we too need the forward index to modify the forward index (except when we enable forward index for columns where it's disabled). Does that clarify your question? Yes each IndexHandler needs to deal with remove / cleanup / update of its own index, but they often rely on forward index to do so.

I'm trying to abstract away the common logic needed by all such IndexHandlers as they all may require the need to create a temp forward index and clean it up later on. I can avoid adding this abstract class but then we'll have to duplicate this logic in all such IndexHandlers. Let me know if this is preferred over what I'm doing now.

@somandal somandal force-pushed the enable-forward-index-and-default-column-handling branch from fc074ef to 30246e3 Compare December 5, 2022 19:19
@somandal somandal force-pushed the enable-forward-index-and-default-column-handling branch from 3704d19 to af94ed7 Compare December 6, 2022 06:04
@@ -845,6 +845,69 @@ public void testSelectQueriesWithReload()
assertEquals(resultRow[0], 240528);
assertEquals(resultRow[1], "gFuH");
}

// Re-enable forward index for column9, column11, and column6
Copy link
Contributor

@siddharthteotia siddharthteotia Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for a simple SELECT * or SELECT reenabledColumn FROM Foo LIMIT <MAX> ORDER BY <something> .

Basically no WHERE clause.

This will have to read the entire forward index and can be a good correctness test that entire forward index was generated correctly.

Something like

  • Generate proper regular segment that has forward index.
  • Gather resultSet1
  • Disable forward index on a particular column via the reload path
  • Query should fail
  • Rebuild forward index via the reload path
  • Gather resultSet2

resultSet1 should be same as resultSet2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest doing this for both SV and MV columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. one issue here, the MV column test is taking 7-8 minutes to run :( it passes but maybe this is too long. Any suggestions to improve this?

Also couldn't run SELECT * because some other columns still have forward index disabled in these tests. So aded a test for selecting the specific column instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect 7-8 mins ? That's quite long for reload to take on a single segment single column.

I think it will be useful to double check the implementation once and see if there isn't anything obvious sub-optimal stuff going on.

Copy link
Contributor Author

@somandal somandal Dec 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reload isn't taking 7-8 minutes. The verification is taking that long. Need to walk through 400K rows, construct the sets for before and after and validate that they match. I did validate that the reload part is quick, so there shouldn't be a major performance issue with the reload part itself.

private void trackLengthOfLongestEntry(Dictionary dictionary, int[] lengthOfLongestEntry, int dictId) {
switch (_columnMetadata.getFieldSpec().getDataType().getStoredType()) {
case STRING:
lengthOfLongestEntry[0] = Math.max(dictionary.getStringValue(dictId).length(),
Copy link
Contributor

@siddharthteotia siddharthteotia Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to track the length as physical number of bytes (UTF-8 encoded) as opposed to String.length() which reports number of characters ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, good catch! i had a note in my TODO list for this and forgot to update it. I've fixed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I am worried how were the tests passing then ? ideally any test that runs a SELECT query and tries to read forward index should have failed because of the incorrect value of metadata ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried printing the value and walking through the debugger too and found that both values come out to be the same in my tests:

Example log that I added:

        LOGGER.warn("*********** old len: {}, new len: {}", dictionary.getStringValue(dictId).length(), dictionary.getStringValue(dictId).getBytes(UTF_8).length);

test logs:

19:17:05.932 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 18, new len: 18
19:17:05.939 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 7, new len: 7
19:17:05.939 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 15, new len: 15
19:17:05.940 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 6, new len: 6
19:17:05.940 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 6, new len: 6
19:17:05.940 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 4, new len: 4
19:17:05.941 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 26, new len: 26
19:17:05.941 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 8, new len: 8
19:17:05.941 WARN [InvertedIndexAndDictionaryBasedForwardIndexCreator] [main] *********** old len: 19, new len: 19

So essentially they calculate the same thing, each UTF8 character is stored as a single byte. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thanks for confirming and fixing.

@somandal somandal force-pushed the enable-forward-index-and-default-column-handling branch from 2646460 to d537783 Compare December 9, 2022 00:37
@siddharthteotia siddharthteotia merged commit 86fe730 into apache:master Dec 12, 2022
@siddharthteotia
Copy link
Contributor

@somandal - looks good overall. Since this is quite big already and we have iterated upon it few times, I am merging it. I think the runtime of MV column should be looked into once to see there isn't any performance issue.

Secondly, it will be good to take another pass at the new file added to see if some methods can be refactored into smaller ones.

Thanks for working on this.

@somandal somandal deleted the enable-forward-index-and-default-column-handling branch December 12, 2022 16:27
@somandal
Copy link
Contributor Author

@somandal - looks good overall. Since this is quite big already and we have iterated upon it few times, I am merging it. I think the runtime of MV column should be looked into once to see there isn't any performance issue.

Secondly, it will be good to take another pass at the new file added to see if some methods can be refactored into smaller ones.

Thanks for working on this.

Thanks so much, appreciate your patience with this long PR!

I'll look into:

Secondly, it will be good to take another pass at the new file added to see if some methods can be refactored into smaller ones.

As for this:

I think the runtime of MV column should be looked into once to see there isn't any performance issue.

The MV test taking 7-8 minutes is spending most of that time on verification and not on reload. So I don't anticipate a major issue in the reload path for performance purposes, but I can definitely look into whether there are any optimization opportunities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants