-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support subquery execution in the query language #7646
Conversation
Functioning subqueries in the open source version. Still needs to have a lot of tests added (I only ran some manual tests) and I also need to adapt the refactor in the query engine to also work in the closed source version. So while this isn't ready to merge, this is a substantial step in the direction of having this implemented and merged. |
11f10cb
to
b0a29fc
Compare
2f9a3c2
to
25b9512
Compare
@jwilder found a problem with some queries. Listing them here so I make sure to fix them before we merge this.
|
1224503
to
8deb045
Compare
I've resolved the previous two issues. There is one more issue that came up. If you have something like The reason for this is because we return all rows from each series in the inner query before continuing to the next series. The problem is, the outer query needs to then merge these different series and can't because they aren't ordered correctly. This is a more difficult problem. I'll include more notes on this later. |
A larger explanation of the above problem. So the problem arises because now we have more than two levels of grouping that we can use. The problem didn't show up when there was only one level of grouping because we could figure out the grouping at the beginning of the query and didn't have to worry about it being grouped in a different way later. The problem arises when you have intervals and group by at least one tag in the inner query. When we output the points for the final time, we arrange for all of the points in one series to be output before the points in another series. But, when the points are grouped into the same bucket, we need to output all of the points within the same interval for the grouped series before continuing to the next interval. This fundamental difference in how the points are output causes the problem. After processing the first aggregate, we have merged all of the series that are being output into the same stream so we can't read concurrently from different streams to group them together again. We have to read the full stream. My current idea is to stop merging different series into the same iterator. That would resolve the issue, but I have concerns about how it would affect the closed source version since it might mean we need a socket per grouped series. If you do something like My other idea is to have the aggregates start outputting points in a way that the next aggregate iterator can process, but I haven't worked out how to do that yet and I don't know if it will hold when you have an inner query within an inner query. |
After taking a few days to clear my head by working on something else, this is my current favored idea and I actually think it will work. So the important part is determining where the top level aggregate is located and determining how to structure the iterators based on that top level aggregate. I still need to think about I figure that if we have a query like this:
And the I think I'll start playing with this and see if it helps... Likely a lot of edge cases I haven't thought of. On that list is also what to do when |
b01f384
to
03c48d7
Compare
I believe this to now be ready for further testing. I believe the second idea that I had worked. Basically, the only dimensions we have to order by are the last ones. If those dimensions get passed separate from the ones we need to group by, we can prepare the iterators at the lowest level to order data in an appropriate way. I have not done exhaustive testing of this. I'm not even 100% certain that the output is correct yet, but I wanted to give a heads up for anybody who wants to try it and can give feedback. There's no more repeating timestamps so I think it's working correctly. |
16d3f2a
to
82f2576
Compare
I've now included a test for a different Looking much better now. This is ready for further testing. |
82f2576
to
0c94a81
Compare
@jsternberg Just started playing with this branch and I came across this panic
for this query select bottom(bottom,3),some from (select bottom(n, 10) from ctr where time > now() - 3m group by *) To repro run Edit: Just confirmed the following queries also cause the panic select bottom(n,3),some from (select * from ctr where time > now() - 3m group by *) select bottom(n,3),some from (select * from ctr where time > now() - 3m) |
I'm a little confused as to why select * from (select * from ctr where time > now() - 10s group by *) limit 10 yields the error
but select n,some from (select n,some from ctr where time > now() - 10s group by *) limit 10 yields the expected results
|
Just discovered that The query
yields all of the series available to the group by *
|
Another thing I came across was that is consistent with how CQs work, but that is technically kind of wrong. If you select explicitly for a tag in the select statement of the sub query, then that tag is transformed into a field in the super query. For example select n,some from (select n,some from ctr where time > now() - 10s) group by * limit 10 slimit 2 yields
whereas I would have expected
|
fbd3794
to
4a08229
Compare
func IsSelector(expr Expr) bool { | ||
if call, ok := expr.(*Call); ok { | ||
switch call.Name { | ||
case "first", "last", "min", "max", "percentile", "top", "bottom": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sample
is also a selector.
397d0d1
to
503ebd6
Compare
@@ -2057,7 +2058,7 @@ func (p *Parser) peekRune() rune { | |||
return r | |||
} | |||
|
|||
func (p *Parser) parseSource() (Source, error) { | |||
func (p *Parser) parseSource(subqueries bool) (Source, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like there should be two methods here parseSource
and parseSourceWithSubqueries
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I did this is because otherwise it would have required copying code. Splitting them into two different functions was impossible to compose because the subqueries logic had to go in the middle of the function. It wasn't possible to have it at the beginning or end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. Makes sense.
@@ -2028,11 +2029,11 @@ func (p *Parser) parseAlias() (string, error) { | |||
} | |||
|
|||
// parseSources parses a comma delimited list of sources. | |||
func (p *Parser) parseSources() (Sources, error) { | |||
func (p *Parser) parseSources(subqueries bool) (Sources, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with these parseSources
and parseSourcesWithSubqueries
.
Just finished a first pass read through. From what I can tell LGTM. |
503ebd6
to
386dd9e
Compare
We've confirmed this branch is working nicely, renders ~24h of data in under 30s. Caveat was that we needed to add With commits before
With commit
|
I've pushed an amended commit that will automatically switch the inner query to turn |
386dd9e
to
1f9bb5c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is looking good. Just a couple questions and a nit.
} | ||
evalTime(stmt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you extract this inline recursive function assignment out to a standalone function? It's confusing to follow within the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I made a Reduce
function on SelectStatement
that would call Reduce
on the appropriate fields. I couldn't add it as part of the case statement for Reduce
itself though because that only accepts an Expr
and I didn't want to change the function signature.
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be cleaner to break this block out to an evalSubqueryType()
function. The indentation is pretty far over and it would allow you to remove the label on the for
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this to use the FieldExprByName
utility function. I think another wrapper function for this might be good, but tell me what you think about the new code first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// used for executing queries. | ||
type ShardMapper interface { | ||
MapShards(sources influxql.Sources, opt *influxql.SelectOptions) (IteratorCreator, error) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is ShardMapper
necessary? Does the query engine's interface need to know about shards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think. The original intention of this is when I took it from lazy iterator evaluation. It was to allow the ordering of the shards. Here, it's used to organize the shards based on their sources and create a special IteratorCreator
that can use FieldDimensions
rather than requiring FieldDimensions
and the sources to need to be propagated everywhere.
So it maps the shards needed to the sources requested. Eventually, it will probably also implement lazy iterators if we can ever make that performant.
1f9bb5c
to
9ffbe50
Compare
This adds query syntax support for subqueries and adds support to the query engine to execute queries on subqueries. Subqueries act as a source for another query. It is the equivalent of writing the results of a query to a temporary database, executing a query on that temporary database, and then deleting the database (except this is all performed in-memory). The syntax is like this: SELECT sum(derivative) FROM (SELECT derivative(mean(value)) FROM cpu GROUP BY *) This will execute derivative and then sum the result of those derivatives. Another example: SELECT max(min) FROM (SELECT min(value) FROM cpu GROUP BY host) This would let you find the maximum minimum value of each host. There is complete freedom to mix subqueries with auxiliary fields. The only caveat is that the following two queries: SELECT mean(value) FROM cpu SELECT mean(value) FROM (SELECT value FROM cpu) Have different performance characteristics. The first will calculate `mean(value)` at the shard level and will be faster, especially when it comes to clustered setups. The second will process the mean at the top level and will not include that optimization.
9ffbe50
to
d7c8c7c
Compare
@jsternberg Did you push your latest changes up? I tried searching the d7c8c7c diff and couldn't find |
Hi I'm trying to filter a result of a subquery, but the where condition doesn't work on the result of the subquery. The last result should list sub_gourp=1 and sub_group=2 only, but it includes also sub_group=3.
Am I missing something? |
If For issues like this in the future, please open a new ticket or send a message to the mailing list. |
This adds query syntax support for subqueries and adds support to the
query engine to execute queries on subqueries.
Subqueries act as a source for another query. It is the equivalent of
writing the results of a query to a temporary database, executing
a query on that temporary database, and then deleting the database
(except this is all performed in-memory).
The syntax is like this:
This will execute derivative and then sum the result of those derivatives.
Another example:
This would let you find the maximum minimum value of each host.
There is complete freedom to mix subqueries with auxiliary fields. The only
caveat is that the following two queries:
Have different performance characteristics. The first will calculate
mean(value)
at the shard level and will be faster, especially when it comes toclustered setups. The second will process the mean at the top level and will not
include that optimization.
Fixes #4619.