Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support segment storage format without forward index #9333

Merged
merged 27 commits into from
Oct 12, 2022

Conversation

somandal
Copy link
Contributor

@somandal somandal commented Sep 3, 2022

This PR adds an option to disable the forward index for a given column via the FieldConfig properties list. This is a PR to solve issue #6473.

If the forward index is disabled, only a subset of queries can work. Basically queries that don't need to select, transform, group by, order by but do need to use the column in WHERE clauses can be supported.

To disable the forward index, the following mandates have been added at the moment (depending on the usage of this feature we may decide to relax some of them over time):

  • The inverted index for the column must be enabled
    • We require this to add functionality on the reload path to regenerate the forward index in case users decide later that they want a forward index
  • The column must be a dictionary based column
    • We require this to add functionality on the reload path to regenerate the forward index in case users decide later that they want a forward index
  • The column cannot have a V1 type range index present or be a multi-value columns with a range index present
    • V1 type range index resorts to full scans and needs the forward index to work properly
    • Multi-value columns always use the V1 type range index

The above checks have been added in the code in the following places:

  • TableConfig validator
  • Segment reload path (check when creating the default column)
  • Segment creation path

Sorted Column Handling:

For sorted columns we decided to allow someone to disable the forward index, but internally it's a no-op operation (we already create only a single index for sorted columns and use it for both forward and inverted index, we'll continue doing the same irrespective of the forwardIndexDisabled flag)

The ability to disable the forward index is currently implemented for the following cases:

  • New immutable segments generated for a column (all refresh use cases and reload use cases for new segments)

This PR does not add support for disabling the forward index for:

  • Mutable segments
  • Handling default and derived columns with the forward index disabled. We have decided to address default columns in the following way (derived still needs thought):
    • SV default columns should continue to create a sorted forward index (similar to the segment creation code path). We are okay with query inconsistency behavior for end users in this scenario.
    • MV default columns should not have a forward index but should have the inverted index + dictionary and any other indexes enabled during reload.
  • Enabling / disabling the disable forward index flag for an existing column -> we will add support for this reload case in a follow up PR shortly after this PR

Validations with other indices can be broadly classified into the following paths:

  • Query path - ensure that none of the other indices rely on the forward index
    • We found the range index v1 resorts to a scan on the forward index, thus disabled this feature if a range index of version 1 is present
    • Sorted columns rely on the forward index as the inverted index as well. We treat sorted column as a no-op when this feature is enabled (do not throw an error)
  • Segment creation path does not rely on the forward index to create other indices. The indexRow function creates all the indices needed at the same time for a given row.
  • Segment reload path relies on the forward index to add new indices such as range index, text index, etc. This path cannot be supported until we address enabling/disabling this feature on the reload path and will be done as part of a future PR along with any additional validations needed.

No-op creators have been added for the forward index disabled columns. This is useful for:

  • On the segment creation side, we can keep code clean and reuse the forward index creator map to walk through all the columns and create other indices

All code paths that need to read the forward index now have null checks added and appropriate exceptions thrown if the forward index doesn't exist.

cc @siddharthteotia @Jackie-Jiang

@codecov-commenter
Copy link

codecov-commenter commented Sep 3, 2022

Codecov Report

Merging #9333 (8bda499) into master (5ca72a3) will decrease coverage by 0.03%.
The diff coverage is 76.39%.

@@             Coverage Diff              @@
##             master    #9333      +/-   ##
============================================
- Coverage     69.88%   69.85%   -0.04%     
- Complexity     4871     5235     +364     
============================================
  Files          1927     1928       +1     
  Lines        102694   102842     +148     
  Branches      15586    15629      +43     
============================================
+ Hits          71771    71841      +70     
- Misses        25858    25921      +63     
- Partials       5065     5080      +15     
Flag Coverage Δ
integration1 25.96% <5.59%> (-0.01%) ⬇️
integration2 24.48% <2.48%> (-0.06%) ⬇️
unittests1 67.29% <73.29%> (+0.05%) ⬆️
unittests2 15.56% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...local/segment/index/datasource/BaseDataSource.java 100.00% <ø> (ø)
.../readers/forward/FixedBitMVForwardIndexReader.java 73.49% <ø> (ø)
...ocal/segment/readers/PinotSegmentColumnReader.java 75.00% <0.00%> (-1.60%) ⬇️
...t/local/startree/v2/store/StarTreeLoaderUtils.java 91.11% <ø> (ø)
...ltime/converter/stats/MutableColumnStatistics.java 68.83% <50.00%> (-0.51%) ⬇️
...verter/stats/MutableNoDictionaryColStatistics.java 43.47% <50.00%> (+0.62%) ⬆️
...rg/apache/pinot/core/minion/RawIndexConverter.java 57.27% <55.55%> (-0.53%) ⬇️
...he/pinot/segment/local/utils/TableConfigUtils.java 67.72% <55.55%> (+0.15%) ⬆️
...loader/defaultcolumn/BaseDefaultColumnHandler.java 58.20% <57.69%> (-0.04%) ⬇️
...ment/creator/impl/fwd/NoOpForwardIndexCreator.java 66.66% <66.66%> (ø)
... and 60 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@somandal somandal requested a review from jackjlli September 8, 2022 01:17
@Jackie-Jiang
Copy link
Contributor

We should allow disabling forward index for sorted column, which is a no-op because the inverted index for sorted column can be used as forward index. I believe it is already allowed in the implementation because the default column is always sorted. No need to add this in the validation.

@somandal
Copy link
Contributor Author

somandal commented Sep 8, 2022

We should allow disabling forward index for sorted column, which is a no-op because the inverted index for sorted column can be used as forward index. I believe it is already allowed in the implementation because the default column is always sorted. No need to add this in the validation.

Thanks for the suggestion @Jackie-Jiang. From going through the code it looks like for sorted columns the forward index and inverted index are essentially the same. Code snippet from the PhysicalColumnIndexCreator:

        // Single-value
        if (metadata.isSorted()) {
          // Sorted
          SortedIndexReader<?> sortedIndexReader = indexReaderProvider.newSortedIndexReader(fwdIndexBuffer, metadata);
          _forwardIndex = sortedIndexReader;
          _invertedIndex = sortedIndexReader;
          _fstIndex = null;
          return;
        }

I also see code on the segment creation path where we skip creating the inverted index if the column is sorted even if the inverted index is enabled for the column. So keeping these in mind, we didn't think it made much sense to disable forward index for such columns.

Let me know your thoughts based on the above. cc @siddharthteotia

@Jackie-Jiang
Copy link
Contributor

@somandal We want to disable the forward index to save the storage. For sorted index, we store the index in inverted format (map from dictId to doc range), and also use it as forward index. We should allow disabling the forward index for it in the table config, and simply ignore it when creating the index.

@somandal somandal force-pushed the no-forward-index branch 2 times, most recently from b7524bf to 2c0c090 Compare September 15, 2022 00:25
@somandal
Copy link
Contributor Author

@somandal We want to disable the forward index to save the storage. For sorted index, we store the index in inverted format (map from dictId to doc range), and also use it as forward index. We should allow disabling the forward index for it in the table config, and simply ignore it when creating the index.

Thanks for discussing this, done!

@somandal somandal requested review from siddharthteotia and jackjlli and removed request for jackjlli and siddharthteotia September 15, 2022 00:27
@siddharthteotia siddharthteotia merged commit e813867 into apache:master Oct 12, 2022
@siddharthteotia
Copy link
Contributor

Thanks for the contribution @somandal . Let's please also share the doc on follow-up changes like discussed in last meeting with OSS.

@siddharthteotia siddharthteotia changed the title Add an option to disable the creation of the forward index for a column Support skipping forward index for column(s) in offline segments Oct 12, 2022
@siddharthteotia siddharthteotia changed the title Support skipping forward index for column(s) in offline segments Support segment storage format without forward index Oct 12, 2022
@somandal somandal deleted the no-forward-index branch October 12, 2022 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants