Enhance AggregationFuzzer to verify results against Presto #6595
Comments
Love this. One nuance is that we might end up with slight parsing discrepancies between the Duck and Presto parsers. Hopefully it won't affect most simple cases. @majetideepak the discussion about moving our parser to follow Presto semantics would be super handy to get around this.
@pedroerp I am still contemplating between the Presto parser and the Postgres parser that Duck uses. Using the Postgres parser makes the porting straightforward. My guess is that the PG parser should be compatible with the Aggregation Fuzzer's needs.
It seems that using the PG parser will be easier. That parser is working well; the other one hasn't been used much and would require quite a bit of hardening.
Agreed. Btw, do we need to use it through DuckDB, or could we just use the PostgreSQL parser directly? I guess DuckDB should add only a thin layer on top of that?
@rui-mo Rui, it would be nice to also add support for verifying Spark functions against Spark. Do you think it would be possible to create a SparkQueryRunner similar to the PrestoQueryRunner above?
Summary: Extend PrestoSerializer to support deserializing dictionary-encoded data. The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of #6595
Pull Request resolved: #6686
Reviewed By: pedroerp
Differential Revision: D49532000
Pulled By: mbasmanova
fbshipit-source-id: 8f3adb3c9e61d842b1bc00ecced9f972892952a6
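The column layout described in the commit message above can be sketched as a toy encoder/decoder in Python. This is only an illustration of the stated byte layout, not the real PrestoSerializer (which is C++); little-endian 4-byte integers and a pre-serialized dictionary payload are assumptions here.

```python
import struct

def encode_dictionary_column(dictionary_bytes, indices, instance_id=b"\x00" * 24):
    """Toy encoder for the layout above. Assumptions: little-endian
    4-byte ints; 'dictionary_bytes' is the already-serialized dictionary
    column. The real PrestoSerializer differs in detail."""
    assert len(instance_id) == 24
    out = struct.pack("<i", len(indices))               # 4 bytes: number of rows
    out += dictionary_bytes                             # N bytes: dictionary column
    out += struct.pack(f"<{len(indices)}i", *indices)   # 4*numRows bytes: indices
    out += instance_id                                  # 24 bytes: 'instance id'
    return out

def decode_dictionary_column(buf, dict_size):
    """Inverse of the encoder. In the real format, 'dict_size' (N) comes
    from the dictionary column's own serialized header."""
    (num_rows,) = struct.unpack_from("<i", buf, 0)
    offset = 4
    dictionary = buf[offset:offset + dict_size]
    offset += dict_size
    indices = list(struct.unpack_from(f"<{num_rows}i", buf, offset))
    offset += 4 * num_rows
    instance_id = buf[offset:offset + 24]
    return num_rows, dictionary, indices, instance_id
```

A round trip (encode, then decode with the known dictionary size) recovers the row count, dictionary bytes, and indices exactly, which is how the Velox side can skip over the Presto-only instance id.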
@mbasmanova : This is very useful and would really improve our confidence in Presto functions. Thanks!
@mbasmanova We found Spark can also be launched through a REST API. Looks like it is possible to create a SparkQueryRunner similar to the PrestoQueryRunner. Link: https://sparkbyexamples.com/spark/submit-spark-job-via-rest-api/.
That sounds great. I don't see a way to fetch results via REST API. Do you know if that's possible? |
@mbasmanova @rui-mo maybe it's more convenient to test with Spark Connect. -yuan
Summary: Extract ReferenceQueryRunner interface to allow using different reference databases for results verification in AggregationFuzzer. For example, we would want to verify Presto functions against Presto and Spark functions against Spark.

Part of #6595
Pull Request resolved: #6701
Reviewed By: xiaoxmeng
Differential Revision: D49553797
Pulled By: mbasmanova
fbshipit-source-id: 489cd41ce4276c2d1d5230780f76439a39e70456
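The interface extraction described above can be illustrated with a short Python sketch. The actual interface is C++ inside Velox; the class and method names below mirror the description in the commit message, not the real signatures.

```python
from abc import ABC, abstractmethod

class ReferenceQueryRunner(ABC):
    """Abstract reference database used to verify fuzzer results
    (a sketch; the real Velox interface is C++)."""

    @abstractmethod
    def execute(self, sql):
        """Run 'sql' against the reference database and return rows."""

class FakeQueryRunner(ReferenceQueryRunner):
    """Stand-in implementation for illustration; a real one would be
    e.g. a PrestoQueryRunner or SparkQueryRunner."""

    def __init__(self, canned_results):
        self.canned_results = canned_results

    def execute(self, sql):
        return self.canned_results[sql]
```

The fuzzer then depends only on the abstract type, so swapping DuckDB for Presto (or Spark) is a matter of passing a different runner.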
That looks promising. Do you have an idea about how to "load data" into Spark? The way the Fuzzer works is that it generates Velox vectors, then runs queries over these. For Presto, I'm jumping through some hoops to create Hive tables with the content of these Velox vectors. Can we do something similar for Spark? See PrestoQueryRunner::execute in prestodb/presto@master...mbasmanova:presto:native-query-runner
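One simple way to picture the "load data" step is generating SQL that materializes the vectors' content in the reference database. The helper below is purely illustrative (the actual PrestoQueryRunner creates Hive tables rather than emitting INSERT statements; the function name and value handling are made up for the sketch):

```python
def make_insert_sql(table, column_names, rows):
    """Build an INSERT statement from in-memory rows (illustrative only;
    the real runner materializes Hive table files instead). Handles
    NULLs and escapes single quotes in strings."""
    def literal(v):
        if v is None:
            return "NULL"
        if isinstance(v, str):
            return "'" + v.replace("'", "''") + "'"
        return str(v)

    values = ", ".join(
        "(" + ", ".join(literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({', '.join(column_names)}) VALUES {values}"
```

For fuzzer-sized inputs this is workable against any SQL engine, though a file-based load scales better and avoids literal-formatting edge cases for types like timestamps and doubles.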
I have been running the AggregationFuzzer locally with Presto as the source of truth and discovered a few things.
@mbasmanova : There are use cases where Presto users write their own UDFs to register with Prestissimo. It would be great to use the fuzzer along with them as well. In fact, I feel that we should run the fuzzer in Presto builds as well. wdyt?
Aditi, would you share more details about these use cases? Is there an example? It should be possible to run the Fuzzer on any kind of UDF.
I agree that the Fuzzer can run any kind of UDF. At IBM we have a bunch of scalar functions related to IBM-specific security/governance tech that we add to the Presto registry. These are implemented at the Prestissimo level and not in Velox. To test these functions with the fuzzer you propose here, we need to be able to invoke it (could simply be on the command line) in Presto builds. In fact, I feel that we should run this fuzzer test with each Presto build as well. Based on prestodb/presto#21044, it seems simpler to put the rest of the fuzzer in the Presto build as well. Velox needs a bunch more dependencies to make this run in Velox builds. Of course, the disadvantage is that for any Presto function added to Velox, the testing will be delayed until the function makes it to Presto. Would be great to hear your thoughts on this.
@aditi-pandit It would be best to test functions in the same repo where they are added / modified. Hence, Presto functions added to the Velox repo would be tested in Velox CI on each Velox PR. IBM-specific functions added to the xxx repo would be tested by CI in that repo. You are correct that adding PrestoQueryRunner to the Velox repo would require adding some dependencies to Velox (something to allow for creating an HTTP client).
@mbasmanova : It is reasonable that the functions should be tested in the repo they are added in. There were issues raised internally about a circular dependency between Presto and Velox if we wanted to run the Velox fuzzer in Presto builds. With your PR that doesn't seem to be the case: it relies on an HTTP client to send Presto requests to a running Presto server, and there aren't dependencies on Presto code as such. To test custom Presto functions, we could obtain the fuzzer program from the Velox submodule and run it independently in a test at build time. I am glossing over some details here, since we do need a way to register the custom Presto functions with the fuzzer. In fact, I feel that we should do a fuzzer run with the existing Presto functions during the build anyway.
That's right. See example in main() in https://github.com/prestodb/presto/pull/21028/files |
Summary: Part of #6595

Pull Request resolved: facebookincubator#6808
Reviewed By: xiaoxmeng
Differential Revision: D49758968
Pulled By: mbasmanova
fbshipit-source-id: 889f4d0d8c24ee370d0cc5a0b5a65046b06a965d
It looks like it's possible to fetch results via REST API: https://github.com/jamesshocking/Spark-REST-API-UDF
Maybe, similar to the Presto runner, we can generate files from vectors and read them back to create tables in Spark.
@rui-mo (cc @mbasmanova) Probably Spark doesn't provide an embedded REST server to serve SQL queries like Presto does. If it's complicated to use the job-submission API (the one in https://sparkbyexamples.com/spark/submit-spark-job-via-rest-api/) to implement this sort of IPC work, then we may have to adopt the Spark SQL Thrift server, which supports ODBC/JDBC by default. ODBC is naturally designed for cross-language SQL query execution. JDBC could also be an option, but we would need a JDBC REST server to make it able to talk with C++. I still feel, though, that managing the JVM process (Spark, Presto) would be a non-trivial job from the C++ side. I'll look at Presto's solution to see how it works.
Description
Currently, the Aggregation Fuzzer verifies results against DuckDB. However, not all functions are available in DuckDB, and sometimes the semantics don't match. It would be better to verify against Presto Java.
We could launch a Presto Java process and talk to it via the REST API: https://prestodb.io/docs/current/develop/client-protocol.html
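At its core the client protocol is a polling loop: POST the SQL to /v1/statement, then keep following nextUri from each JSON response, accumulating any data, until nextUri disappears. A transport-agnostic sketch (HTTP callables are injected so the loop itself is self-contained; real responses carry more fields such as columns, stats, and error, which a production client must check):

```python
def run_presto_query(post, get, sql):
    """Minimal Presto client-protocol loop.
    'post(path, body)' and 'get(uri)' must return decoded JSON dicts.
    This sketch ignores 'error', 'stats', and retry handling."""
    rows = []
    response = post("/v1/statement", sql)
    while True:
        rows.extend(response.get("data", []))   # data may be absent on some pages
        next_uri = response.get("nextUri")
        if next_uri is None:
            break                               # query finished (or failed)
        response = get(next_uri)
    return rows
```

In a real runner the injected callables would wrap an HTTP client (Proxygen on the C++ side), and the loop would also surface query failures instead of silently stopping.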
I put together a prototype of a PrestoQueryRunner that can execute Presto queries:
https://github.com/prestodb/presto/compare/master...mbasmanova:presto:native-query-runner?expand=1
There are still a few things to figure out.
(1) By default, Presto returns results in JSON format. This is slow and hard to parse. Hence, I hacked Presto to return results in PrestoPage format (base64-encoded). Perhaps we could introduce a new HTTP header that a client would specify to request the PrestoPage format instead of the default JSON format.
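On the client side, opting into the binary format could look like the sketch below: send a format-selecting header with the request and base64-decode the payload. Both the header name and the idea of a dedicated response field are made up for illustration; Presto's actual protocol would need the change described above.

```python
import base64

# Hypothetical header name -- NOT part of Presto's actual protocol.
PAGE_FORMAT_HEADER = {"X-Presto-Result-Format": "PRESTO_PAGE"}

def decode_page_payload(payload_b64):
    """Decode a base64-encoded PrestoPage payload into raw bytes.
    Parsing those bytes would then be PrestoSerializer's job."""
    return base64.b64decode(payload_b64)
```

The appeal is that the decoded bytes can be handed straight to the existing PrestoSerializer deserialization path, skipping JSON parsing entirely.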
(2) The PrestoQueryRunner needs an HTTP client. I'm using Proxygen, as it is already available in Prestissimo. However, we need this code in Velox so we can run the fuzzer on each PR. Hence, we need to figure out how to add a Proxygen dependency to Velox.
(3) The PrestoQueryRunner code needs to be hooked into the Aggregation Fuzzer.
CC: @amitkdutta @aditi-pandit @kgpai @laithsakka @pedroerp @spershin