Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper NULL value support in Pinot #8697

Closed
nizarhejazi opened this issue May 13, 2022 · 2 comments
Closed

Proper NULL value support in Pinot #8697

nizarhejazi opened this issue May 13, 2022 · 2 comments
Assignees
Labels

Comments

@nizarhejazi
Copy link
Contributor

When nullHandlingEnabled config is set to true, Pinot still returns (in SELECT) the default value of columns vs. null when the value is null. Null value is supported only in filtering phase.

Reserving a special value of primitives to indicate Null does not work for our use case for the following reasons:

  • We have to mix two different styles of working with nulls (IS/IS NOT NULL predicate vs. checking special value returned by SELECT).
  • We don’t query Pinot directly but through Presto. Reserving special value for Null requires filter out these special values everywhere in the Presto SQL statement. This makes generated Presto SQL on our end more complex.
    • We generate Presto SQL statements from user input (UI, or query editor).
  • Some of Presto functions does not work with special primitive values (e.g. COALESCE). In general, this pattern deviates from how Null is handled by Presto and other big data systems.

For performance reasons, pinot stores and transmit values using primitive types. Supporting Null value throughout the engine is a big effort. Can we start by supporting it in SelectionOnlyOperator/SelectionOrderByOperator?

The idea is to transfer back a bitmap per column (presence vector) from servers to broker if a special config is set (nullHandlingInSelect). This change is fully backward compatible.

@Jackie-Jiang
Copy link
Contributor

There are 2 ways to support this, with tradeoff between performance and backward compatibility:

  1. Send back a bitmap per column from server to broker to mark the null values (extra overhead but fully backward compatible)
  2. Reserve a special value per data type to represent the null value (no extra overhead, but all reserved value will be treated as null)

We can start with option 1, and if necessary, we may build option 2 and use a flag to switch between them

@nizarhejazi
Copy link
Contributor Author

Here is a draft PR (#8742) that demonstrates NULL value support in:

  1. SELECT, ORDER BY, DISTINCT, HAVING, GROUP BY.
  2. Aggregation functions: COUNT, MAX.
  3. Non-dictionary columns.
  4. Basic transforms.
  5. Few data types: Boolean, Int, and BigDecimal.

@Jackie-Jiang Please review this draft PR and let me know if you agree with my general approach. If you agree, I will send a PR soon and follow up with few more PRs to get NULL support done for:

  • All SELECT statements
  • All data types
  • Dictionary/non-dictionary based columns
  • All transforms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants