Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

groupby and having getfield index fixes #2281

Merged
merged 27 commits into from
Jan 24, 2024
Merged

groupby and having getfield index fixes #2281

merged 27 commits into from
Jan 24, 2024

Conversation

jycor
Copy link
Contributor

@jycor jycor commented Jan 22, 2024

Implementing more groupby and having strangeness from MySQL, and fixing some getfield indexing errors.

Aliases in the projection scope were always added as new columns in the having scope; this led to getfields with indexes that were invalid in aggregation functions. Now, we don't add the alias to the scope if the alias overlaps with a column that is used in an aggregation.

There is also a small fix to rewriting planbuilder tests to have proper tabbing, write skipped queries, and removed some random new lines.

There are still some issues, left as skipped tests:

  • We incorrectly disallow aliases from the projection scope in aggregations
  • Columns that are aliased and used in group by should not be visible to having clauses; so, we should throw an error.

Correctness: dolthub/dolt#7382

@jycor jycor requested a review from max-hoffman January 22, 2024 09:59
},
{
// TODO: this should fail
Query: "select x + 1 as xx from xy join uv on x = u group by xx having x = 123;",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poking around this, if there is a HAVING but not GROUP BY does it still fail? Same issue with a WINDOW clause? Same issue with cartesian join that is just select ... from xy,uv .... This just seems really weird, incremental improvements are good but more info would be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HAVING with no GROUP BY should fail as well, added test for this

I think we're running into issues detailed here: dolthub/dolt#6676

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an alias that is x it's definition is still the projection one? Subquery aliases also. For select x + 1 as xx, (select 1 from uv having x = 123) from xy, uv group by xx), I'm assuming the x is still visible inside the subquery? Lots other patterns that might be weird, HAVING subquery is another one select x + 1 as xx from xy join uv on x = u group by xx having (select 1 from u where x = 123);

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a new havingScope because it has special visibility rules, but the way this is structured still makes it difficult for me to understand what the rules are. Are some table columns excluded? It seems like everything in fromScope is a fallback. So then we only block certain projection aliases? So then the question is what projection scope columns take precedence, and the list seems like 1) a column from fromScope is resolved before a projection alias of the same name, 2) there is a blacklist of aliases depending on whether we are in a grouping. The alias->getField check seems maybe unnecessary? I can't think why it would be a problem for HAVING to resolve an alias rather than its underlying column.

Generating invalid ids also seems like it is still a problem? Why can't we reference the projection scope alias expression id in a HAVING clause?

It also seemed like a prior version was concerned with orphaned HAVING column identifiers post-grouping (ex: select sum(x) from xy group by y HAVING x > 0), which I'm not sure is reasonable to validate here, but is not this PRs concern now?

@@ -160,11 +160,6 @@ Project
├─ colSet: (4-6)
└─ tableId: 2
`,
},
{
Query: "analyze table xy update histogram on (x, y) using data '{\"row_count\": 40, \"distinct_count\": 40, \"null_count\": 1, \"columns\": [\"x\", \"y\"], \"histogram\": [{\"row_count\": 20, \"upper_bound\": [50.0]}, {\"row_count\": 20, \"upper_bound\": [80.0]}]}'",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep?

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, more to do in the future but the incremental fixes w/ thorough testing and doc comments are all wins

}

// Aliased expression is allowed if there is no group by (it is implicitly a single group)
if len(fromScope.groupBy.inCols) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is inCols == 0 equivalent to hasAggs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think hasAggs() returns whether or not there are aggregations used in the groupBy, while inCols are just the columns in the groupBy (can be outside of aggregation function)

@jycor jycor merged commit 5ef7723 into main Jan 24, 2024
7 checks passed
@jycor jycor deleted the james/getfield branch January 24, 2024 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants