-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PRIMARY KEY Aggregate support to dataframe API #8356
Conversation
# Conflicts: # datafusion/common/src/functional_dependencies.rs
I plan to review this PR tomorrow if no one beats me to it |
# Conflicts: # datafusion/optimizer/src/optimize_projections.rs # datafusion/optimizer/src/optimizer.rs
# Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
# Conflicts: # datafusion/optimizer/src/optimize_projections.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this very carefully (over 6 sittings as you can see in the commit history!) and it is almost ready to merge. It pays a longstanding tech debt where functional dependencies worked in SQL but not in dataframe API.
I asked @mustafasrepo to make one little optional fix and then it will be ready to go.
# Conflicts: # datafusion/optimizer/src/optimize_projections.rs
Since this has been around for a week, it doesn't make any potentially disruptive change, pays an old technical debt, and that I have already reviewed it very carefully; I will go ahead and merge this. In case anything breaks, let us know and we will promptly fix it. |
* Aggregate rewrite for dataframe API. * Simplifications * Minor changes * Minor changes * Add new test * Add new tests * Minor changes * Add rule, for aggregate simplification * Simplifications * Simplifications * Simplifications * Minor changes * Simplifications * Add new test condition * Tmp * Push requirement below aggregate * Add join and subqeury alias * Add cross join support * Minor changes * Add logical plan repartition support * Add union support * Add table scan * Add limit * Minor changes, buggy * Add new tests, fix existing bugs * change concat type array_concat * Resolve some of the bugs * Comment out a rule * All tests pass, when single distinct is closed * Fix aggregate bug * Change analyze and explain implementations * All tests pass * Resolve linter errors * Simplifications, remove unnecessary codes * Comment out tests * Remove pushdown projection * Pushdown empty projections * Fix failing tests * Simplifications * Update comments, simplifications * Remove eliminate projection rule, Add method for group expr len aggregate * Simplifications, subquery support * Update comments, add unnest support, simplifications * Remove eliminate projection pass * Change name * Minor changes * Minor changes * Add comments * Fix failing test * Minor simplifications * update * Minor * Remove ordering * Minor changes * add merge projections * Add comments, resolve linter errors * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Review Part 1 * Review Part 2 * Fix quadratic search, Change trim_expr impl * Review Part 3 * Address reviews * Minor changes * Review Part 4 * Add case expr support * Review Part 5 * Review Part 6 * Finishing touch: Improve comments --------- Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Which issue does this PR close?
Closes #.
Rationale for this change
Currently we can rewrite group by expressions according to primary key.
Consider query below
normally, after aggregation only
sn
andSUM(amount) as sum1
would be available for output. However, givensn
isPRIMARY_KEY
, we know that eachsn
value maps to a fixedts
value (e.g They are functionally depandant). Hence, by treating query above asunder the hood, we can output
ts
also at the end. However, this feature was only available withSQL
API. See discussion.This PR brings this support to the dataframe API also. With this PR following dataframe query can be executed given
col_id
isPRIMARY_KEY
.What changes are included in this PR?
Are these changes tested?
Yes, most of the changes comes from either plan tests, or
.slt
testAre there any user-facing changes?