Allocation free `DataBlockCache` lookups #8140

richardstartin · 2022-02-05T20:57:16Z

ColumnTypePair objects show up just behind int[]/double[] in allocation profiles of query execution.

This change avoids allocating these by using the DataType as a first-level lookup into an EnumMap (a small array behind the scenes) whose values are a Set<String>, the set of columns for that data type. This reduces the allocation rate and improves performance (or at least does not regress it) for a range of queries:

master

Benchmark                                                (_intBaseValue)  (_numRows)                                                                                                                                                                                                                                                         (_query)  Mode  Cnt        Score         Error   Units
BenchmarkQueries.query                                                 0     1500000                                                                                                                                                                                                                             SELECT SUM(RAW_INT_COL) FROM MyTable  avgt    5    17463.734 ±     355.827   us/op
BenchmarkQueries.query:·gc.alloc.rate.norm                             0     1500000                                                                                                                                                                                                                             SELECT SUM(RAW_INT_COL) FROM MyTable  avgt    5   608548.162 ± 1016513.369    B/op
BenchmarkQueries.query                                                 0     1500000                                                        SELECT SUM(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999),MAX(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999) FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5    23784.153 ±    1163.643   us/op
BenchmarkQueries.query:·gc.alloc.rate.norm                             0     1500000                                                        SELECT SUM(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999),MAX(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999) FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5   426997.395 ±  420169.858    B/op
BenchmarkQueries.query                                                 0     1500000  SELECT SUM(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_sum,MAX(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_avg FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5    52328.177 ±    2807.010   us/op
BenchmarkQueries.query:·gc.alloc.rate.norm                             0     1500000  SELECT SUM(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_sum,MAX(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_avg FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5   998184.088 ± 1572072.289    B/op

branch

Benchmark                                                (_intBaseValue)  (_numRows)                                                                                                                                                                                                                                                         (_query)  Mode  Cnt       Score         Error   Units
BenchmarkQueries.query                                                 0     1500000                                                                                                                                                                                                                             SELECT SUM(RAW_INT_COL) FROM MyTable  avgt    5   17374.187 ±     334.140   us/op
BenchmarkQueries.query:·gc.alloc.rate.norm                             0     1500000                                                                                                                                                                                                                             SELECT SUM(RAW_INT_COL) FROM MyTable  avgt    5  603245.214 ± 1004990.185    B/op
BenchmarkQueries.query                                                 0     1500000                                                        SELECT SUM(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999),MAX(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999) FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5   23116.649 ±     753.629   us/op
BenchmarkQueries.query:·gc.alloc.rate.norm                             0     1500000                                                        SELECT SUM(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999),MAX(INT_COL) FILTER(WHERE INT_COL > 123 AND INT_COL < 599999) FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5  422940.704 ±  411258.808    B/op
BenchmarkQueries.query                                                 0     1500000  SELECT SUM(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_sum,MAX(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_avg FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5   52473.093 ±    3630.290   us/op
BenchmarkQueries.query:·gc.alloc.rate.norm                             0     1500000  SELECT SUM(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_sum,MAX(CASE WHEN (INT_COL > 123 AND INT_COL < 599999) THEN INT_COL ELSE 0 END) AS total_avg FROM MyTable WHERE NO_INDEX_INT_COL > 5 AND NO_INDEX_INT_COL < 1499999  avgt    5  973148.139 ± 1518881.578    B/op

codecov-commenter · 2022-02-05T21:34:39Z

Codecov Report

Merging #8140 (1ae6e8a) into master (a47af49) will increase coverage by 39.49%.
The diff coverage is 95.12%.

@@              Coverage Diff              @@
##             master    #8140       +/-   ##
=============================================
+ Coverage     30.69%   70.19%   +39.49%     
- Complexity        0     4302     +4302     
=============================================
  Files          1613     1624       +11     
  Lines         83952    84292      +340     
  Branches      12597    12635       +38     
=============================================
+ Hits          25768    59165    +33397     
+ Misses        55889    21041    -34848     
- Partials       2295     4086     +1791

Flag	Coverage Δ
integration1	`?`
integration2	`27.66% <82.92%> (-0.05%)`	⬇️
unittests1	`67.90% <95.12%> (?)`
unittests2	`14.21% <0.00%> (?)`

Impacted Files	Coverage Δ
...a/org/apache/pinot/core/common/DataBlockCache.java	`91.42% <95.12%> (+9.19%)`	⬆️
...pinot/minion/exception/TaskCancelledException.java	`0.00% <0.00%> (-100.00%)`	⬇️
...nverttorawindex/ConvertToRawIndexTaskExecutor.java	`0.00% <0.00%> (-100.00%)`	⬇️
...e/pinot/common/minion/MergeRollupTaskMetadata.java	`0.00% <0.00%> (-94.74%)`	⬇️
...plugin/segmentuploader/SegmentUploaderDefault.java	`0.00% <0.00%> (-87.10%)`	⬇️
.../transform/function/MapValueTransformFunction.java	`0.00% <0.00%> (-85.30%)`	⬇️
...ot/common/messages/RoutingTableRebuildMessage.java	`0.00% <0.00%> (-81.82%)`	⬇️
...verttorawindex/ConvertToRawIndexTaskGenerator.java	`5.45% <0.00%> (-80.00%)`	⬇️
...ache/pinot/common/lineage/SegmentLineageUtils.java	`22.22% <0.00%> (-77.78%)`	⬇️
...ore/startree/executor/StarTreeGroupByExecutor.java	`0.00% <0.00%> (-77.78%)`	⬇️
... and 1164 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

siddharthteotia · 2022-02-05T22:05:33Z

pinot-core/src/main/java/org/apache/pinot/core/common/DataBlockCache.java

@@ -109,12 +111,11 @@ public int getNumDocs() {
   * @return Array of int values
   */
  public int[] getIntValuesForSVColumn(String column) {
-    ColumnTypePair key = new ColumnTypePair(column, FieldSpec.DataType.INT);


So the potential problem being fixed here is the per-call creation of a ColumnTypePair object, which leads to some heap/perf overhead?

Exactly, about 8MB of these were allocated in 1s in a benchmark which only allocated ~25MB/s. It's one of the main sources of allocation.
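For readers without the diff open, the pattern under discussion can be sketched roughly as follows. This is an illustrative reconstruction, not the Pinot source: the class name `CompositeKeyLookup`, the trimmed-down `DataType` enum, the fields of `ColumnTypePair`, and the placeholder buffer size are all assumptions. Only the `new ColumnTypePair(column, ...)` followed by `_valuesMap.get(key)` shape comes from the removed lines shown in the diff.

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeKeyLookup {
  // Stand-in for FieldSpec.DataType (assumption: the real enum has more constants).
  public enum DataType { INT, LONG, DOUBLE, STRING }

  // Hypothetical (column, type) key; field names are assumptions.
  static final class ColumnTypePair {
    private final String _column;
    private final DataType _dataType;

    ColumnTypePair(String column, DataType dataType) {
      _column = column;
      _dataType = dataType;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof ColumnTypePair)) {
        return false;
      }
      ColumnTypePair that = (ColumnTypePair) o;
      return _column.equals(that._column) && _dataType == that._dataType;
    }

    @Override
    public int hashCode() {
      return 31 * _column.hashCode() + _dataType.hashCode();
    }
  }

  private final Map<ColumnTypePair, Object> _valuesMap = new HashMap<>();

  public int[] getIntValuesForSVColumn(String column) {
    // A fresh key object is created on every call, including cache hits.
    ColumnTypePair key = new ColumnTypePair(column, DataType.INT);
    int[] values = (int[]) _valuesMap.get(key);
    if (values == null) {
      values = new int[16]; // placeholder size, not the real block size
      _valuesMap.put(key, values);
    }
    return values;
  }
}
```

The key is only needed for the duration of the `get`, so on a cache hit it becomes garbage immediately unless the JIT manages to eliminate the allocation.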

siddharthteotia · 2022-02-05T22:12:26Z

pinot-core/src/main/java/org/apache/pinot/core/common/DataBlockCache.java

@@ -109,12 +111,11 @@ public int getNumDocs() {
   * @return Array of int values
   */
  public int[] getIntValuesForSVColumn(String column) {
-    ColumnTypePair key = new ColumnTypePair(column, FieldSpec.DataType.INT);
-    int[] intValues = (int[]) _valuesMap.get(key);


Previously there was a single level of indirection to get the corresponding values[] for a given ColumnTypePair. Now it's a Map<Type,Map<String,Object>> plus another function call, getValues().

So while we avoid the creation of the ColumnTypePair object, is it possible that the new code will add some perf overhead that negates any benefit of this PR?

This has all been measured; let me put together benchmark results (you can see the combined effect of the set of changes in #8134).

Note that an EnumMap lookup is an array access by ordinal, and the array is very small, so the cost of indirection here is very low.
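A minimal sketch of the two-level shape described in this thread, assuming the `Map<Type,Map<String,Object>>` layout mentioned in the review comment. The class name `TwoLevelLookup`, the trimmed-down `DataType` enum, the field name, and the placeholder buffer size are hypothetical, and the merged code may be organised differently.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

public class TwoLevelLookup {
  // Stand-in for FieldSpec.DataType (assumption: the real enum has more constants).
  public enum DataType { INT, LONG, DOUBLE, STRING }

  // First level: EnumMap stores its values in an array indexed by the key's ordinal().
  // Second level: an ordinary map keyed by the column name the caller already holds.
  private final Map<DataType, Map<String, Object>> _valuesByType = new EnumMap<>(DataType.class);

  public TwoLevelLookup() {
    // Pre-populate one inner map per type so lookups never have to create one.
    for (DataType dataType : DataType.values()) {
      _valuesByType.put(dataType, new HashMap<>());
    }
  }

  public int[] getIntValuesForSVColumn(String column) {
    // No key object is constructed here: both the enum constant and the String already exist.
    Map<String, Object> valuesForType = _valuesByType.get(DataType.INT);
    int[] values = (int[]) valuesForType.get(column);
    if (values == null) {
      values = new int[16]; // placeholder size, not the real block size
      valuesForType.put(column, values);
    }
    return values;
  }
}
```

On a cache hit this path performs one array read for the `EnumMap` and one `HashMap.get` on the column name, with no temporary key to allocate.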

Jackie-Jiang

LGTM

pinot-core/src/main/java/org/apache/pinot/core/common/DataBlockCache.java

richardstartin · 2022-02-07T20:04:01Z

I will apply Jackie's suggestions before merging; please wait until I've done that (tomorrow).

siddharthteotia reviewed Feb 5, 2022

Jackie-Jiang approved these changes Feb 7, 2022

richardstartin force-pushed the allocation-free-datablock-cache branch from 6321642 to 8cfedf9 Compare February 7, 2022 23:21

richardstartin added 2 commits February 8, 2022 08:02

intern DataBlockCache lookup keys

1cc9ea4

comments

1ae6e8a

richardstartin force-pushed the allocation-free-datablock-cache branch from 8cfedf9 to 1ae6e8a Compare February 8, 2022 08:02

richardstartin merged commit 65dcfe7 into apache:master Feb 8, 2022

richardstartin added a commit that referenced this pull request Feb 9, 2022

Revert "Allocation free DataBlockCache lookups (#8140)"

c15bf3e

This reverts commit 65dcfe7.

richardstartin mentioned this pull request Feb 9, 2022

[do not merge] Revert "Allocation free DataBlockCache lookups" #8178

Closed

richardstartin added the performance label Apr 1, 2022
