Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add global strategy for partial upsert #7906

Merged
merged 15 commits into from
Jan 20, 2022

Conversation

deemoliu
Copy link
Contributor

@deemoliu deemoliu commented Dec 15, 2021

Description

Recently we got interesting use cases from industry about partial upsert.

Users have two event as follows, t is the timestamp column and t1<t2

{t1, a1, b1, c1, d1}
{t2, a2, nil, nil, nil}

user specified field "a" as Overwrite field, and "b", "c", "d" field are empty in the second event.
she expected merge result to be {a2, b1, c1, d1}
However the merge result was {a2, nil, nil, nil} which is the same as full upsert.

The reason of this issue is because she didn't specify the mergers for "b", "c", "d" fields. Thus these fields will use the default behavior, "Overwrite regardless null".

{
  "upsertConfig": {
    "mode": "PARTIAL",
    "partialUpsertStrategies":{
      "a": "OVERWRITE"
    }
  }
}

Her issue can be fixed with the following config, since the "overwrite" merger behavior is "Overwrite unless null".

{
  "upsertConfig": {
    "mode": "PARTIAL",
    "partialUpsertStrategies":{
      "a": "OVERWRITE",
      "b": "OVERWRITE",
      "c": "OVERWRITE",
      "d": "OVERWRITE"
    }
  }
}

In the PR, I added a global strategy. so that user can use "defaultPartialUpsertStrategy", user will not need to set partialUpsertStategy for fields "b", "c", "d" fields.

{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies":{
      "a": "OVERWRITE"
    }
  }
}

NOTE: if we don't specify the overwrite. the overwrite behavior is "Overwrite regardless null".

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

  • Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Documentation

@codecov-commenter
Copy link

codecov-commenter commented Dec 15, 2021

Codecov Report

Merging #7906 (d863947) into master (1d1a7d3) will decrease coverage by 6.45%.
The diff coverage is 60.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7906      +/-   ##
============================================
- Coverage     71.40%   64.95%   -6.46%     
- Complexity     4223     4231       +8     
============================================
  Files          1597     1554      -43     
  Lines         82903    81100    -1803     
  Branches      12369    12178     -191     
============================================
- Hits          59201    52677    -6524     
- Misses        19689    24665    +4976     
+ Partials       4013     3758     -255     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 68.16% <60.00%> (+0.02%) ⬆️
unittests2 14.40% <0.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ata/manager/realtime/RealtimeTableDataManager.java 11.64% <0.00%> (-57.46%) ⬇️
...rg/apache/pinot/spi/config/table/UpsertConfig.java 86.95% <75.00%> (-2.52%) ⬇️
...not/segment/local/upsert/PartialUpsertHandler.java 38.70% <100.00%> (+5.37%) ⬆️
...a/org/apache/pinot/common/metrics/MinionMeter.java 0.00% <0.00%> (-100.00%) ⬇️
...g/apache/pinot/common/metrics/ControllerMeter.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/BrokerQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/MinionQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
...he/pinot/common/messages/SegmentReloadMessage.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/core/data/manager/realtime/TimerService.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/minion/exception/TaskCancelledException.java 0.00% <0.00%> (-100.00%) ⬇️
... and 381 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d1a7d3...d863947. Read the comment docs.

@@ -87,6 +93,10 @@ public HashFunction getHashFunction() {
return _partialUpsertStrategies;
}

public Strategy getGlobalUpsertStrategy() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this strategy applies to partial upsert only, right?

if (schema != null) {
for (String dimensionName : schema.getDimensionNames()) {
if (!schema.getPrimaryKeyColumns().contains(dimensionName) && !_column2Mergers.containsKey(dimensionName)) {
_column2Mergers.put(dimensionName, PartialUpsertMergerFactory.getMerger(globalUpsertStrategy));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think globalUpsertStrategy can be null? Shall we always assign a default value?

@@ -53,21 +53,27 @@
@JsonPropertyDescription("Partial update strategies.")
private final Map<String, Strategy> _partialUpsertStrategies;

@JsonPropertyDescription("global upsert strategy")
private final Strategy _globalUpsertStrategy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more clear if rename to _defaultPartialUpsertStrategy?

} else {
_partialUpsertStrategies = null;
_globalUpsertStrategy = Strategy.OVERWRITE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably set to null for full upsert?

@JsonProperty("comparisonColumn") @Nullable String comparisonColumn,
@JsonProperty("hashFunction") @Nullable HashFunction hashFunction) {
Preconditions.checkArgument(mode != null, "Upsert mode must be configured");
_mode = mode;

if (mode == Mode.PARTIAL) {
_partialUpsertStrategies = partialUpsertStrategies != null ? partialUpsertStrategies : new HashMap<>();
_globalUpsertStrategy = globalUpsertStrategy != null ? globalUpsertStrategy : Strategy.OVERWRITE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing the existing behavior. I do agree OVERWRITE makes more sense to partial upsert, but not sure if we want to introduce backward incompatibility here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Jackie-Jiang for review. If we use Ignore (#7907) as default behavior for global mergers, do you think it will solve the backward compatible issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much adoption of partial uspert today, so I feel it's ok to get to best possible default option, whenever we can

@@ -251,7 +249,8 @@ public void testSerDe()
{
// with upsert config
UpsertConfig upsertConfig =
new UpsertConfig(UpsertConfig.Mode.FULL, null, "comparison", UpsertConfig.HashFunction.NONE);
new UpsertConfig(UpsertConfig.Mode.FULL, null, UpsertConfig.Strategy.OVERWRITE, "comparison",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For FULL upsert, pass in null as the default strategy as that does not apply? Same for other places

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, strategy applies to partial only

@icefury71
Copy link
Contributor

Quick comment: the existing code does not define any default strategy for any column (eg: OVERWRITE). So instead of introducing a new property, can we simply define a default strategy (eg: within UpsertConfig). (of course null handling is still needed). CC @yupeng9 @Jackie-Jiang

@deemoliu
Copy link
Contributor Author

deemoliu commented Jan 14, 2022

Quick comment: the existing code does not define any default strategy for any column (eg: OVERWRITE). So instead of introducing a new property, can we simply define a default strategy (eg: within UpsertConfig). (of course null handling is still needed). CC @yupeng9 @Jackie-Jiang

Hi @icefury71 thanks for the comment. Yes, there is no default strategy for columns that not specified in upsertConfig. The current behavior for columns not specified is "OVERWRITE even if the fieldValue of the new record is null". In this PR, i updated the default behavior (which is represented by "globalUpsertStrategy") to use "OVERWRITE unless the fieldValue of new record is null", which is the same behavior of the OVERWRITE merger.
I think @Jackie-Jiang mentioned updating this will introduce backward incompatibility and @yupeng9 mentioned we should get the best possible default option.

@icefury71 @yupeng9 @Jackie-Jiang what do you think?

@deemoliu deemoliu force-pushed the qiaochu/global-strategy branch from 413fcca to d0d18b5 Compare January 14, 2022 21:49
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update the PR description to reflect the new config key defaultPartialUpsertStrategy

_helixManager = helixManager;
_tableNameWithType = tableNameWithType;
for (Map.Entry<String, UpsertConfig.Strategy> entry : partialUpsertStrategies.entrySet()) {
_column2Mergers.put(entry.getKey(), PartialUpsertMergerFactory.getMerger(entry.getValue()));
}

if (schema != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema should never be null here

_helixManager = helixManager;
_tableNameWithType = tableNameWithType;
for (Map.Entry<String, UpsertConfig.Strategy> entry : partialUpsertStrategies.entrySet()) {
_column2Mergers.put(entry.getKey(), PartialUpsertMergerFactory.getMerger(entry.getValue()));
}

if (schema != null) {
for (String dimensionName : schema.getDimensionNames()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably include all physical columns (including date time columns) except for primary key columns and comparison column (main time column if no comparison column is configured)

_helixManager = helixManager;
_tableNameWithType = tableNameWithType;
for (Map.Entry<String, UpsertConfig.Strategy> entry : partialUpsertStrategies.entrySet()) {
_column2Mergers.put(entry.getKey(), PartialUpsertMergerFactory.getMerger(entry.getValue()));
}
// For all physical columns (including date time columns) except for primary key columns and comparison column.
// If no comparison column is configured, use main time column as the comparison time.
for (String columnName : schema.getColumnNames()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (String columnName : schema.getColumnNames()) {
for (String columnName : schema.getPhysicalColumnNames()) {

Map<String, UpsertConfig.Strategy> partialUpsertStrategies) {
public PartialUpsertHandler(HelixManager helixManager, String tableNameWithType, Schema schema,
Map<String, UpsertConfig.Strategy> partialUpsertStrategies, UpsertConfig.Strategy defaultPartialUpsertStrategy,
String comparisonColumn) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotate it as nullable

_column2Mergers.put(columnName, PartialUpsertMergerFactory.getMerger(defaultPartialUpsertStrategy));
}
} else {
if (!schema.getDateTimeNames().contains(columnName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main time column is configured within the table config validationConfig

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

Comment on lines 167 to 169
partialUpsertHandler = new PartialUpsertHandler(_helixManager, _tableNameWithType, schema,
upsertConfig.getPartialUpsertStrategies(), upsertConfig.getDefaultPartialUpsertStrategy(),
tableConfig.getValidationConfig().getTimeColumnName(), upsertConfig.getComparisonColumn());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to avoid extra handling in the method (comparisonColumn is non-null this way)

Suggested change
partialUpsertHandler = new PartialUpsertHandler(_helixManager, _tableNameWithType, schema,
upsertConfig.getPartialUpsertStrategies(), upsertConfig.getDefaultPartialUpsertStrategy(),
tableConfig.getValidationConfig().getTimeColumnName(), upsertConfig.getComparisonColumn());
String comparisonColumn = upsertConfig.getComparisonColumn();
if (comparisonColumn == null) {
comparisonColumn = tableConfig.getValidationConfig().getTimeColumnName();
}
partialUpsertHandler = new PartialUpsertHandler(_helixManager, _tableNameWithType, schema,
upsertConfig.getPartialUpsertStrategies(), upsertConfig.getDefaultPartialUpsertStrategy(), comparisonColumn);

@yupeng9
Copy link
Contributor

yupeng9 commented Jan 20, 2022

the compatibility regression seems broken from previous PRs. and the error is not related to this one.

@yupeng9 yupeng9 merged commit d317c59 into apache:master Jan 20, 2022
KKcorps pushed a commit to KKcorps/incubator-pinot that referenced this pull request Jan 21, 2022
* Add global strategy for partial upsert

* fix UT setup

* try fix lint

* fix tests

* handle empty globalUpsertStrategy

* update defaultValue for full upsert to be null

* update _globalUpsertStrategy to _defaultPartialUpsertStrategy

* try fix lint

* fix checkstyle

* add taskConfig test setup code

* include all physical columns (including date time columns) except for primary key columns and comparison column

* fix partial upsert handler merge tests

* Annotate comparison column as nullable, use main time column

* simplified partialUpsertHandler (comparison column is non-null)

* fix checkstyle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants