Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance AggregationFuzzer to verify results against Presto #6595

Open
mbasmanova opened this issue Sep 15, 2023 · 20 comments
Open

Enhance AggregationFuzzer to verify results against Presto #6595

mbasmanova opened this issue Sep 15, 2023 · 20 comments
Labels
enhancement New feature or request

Comments

@mbasmanova
Copy link
Contributor

Description

Currently, Aggregation Fuzzer verifies results against DuckDB. However, not all functions are available in DuckDB and sometimes semantics don't match. It would be better to verify against Presto Java.

We could launch PrestoJava process and talk to it via REST API: https://prestodb.io/docs/current/develop/client-protocol.html

I put together a prototype of a PrestoQueryRunner that can execute Presto queries:

https://github.com/prestodb/presto/compare/master...mbasmanova:presto:native-query-runner?expand=1

There are a few things to figure out still.

(1) By default, Presto returns results in JSON format. This is slow and hard to parse. Hence, I hacked Presto to return results in PrestoPage format (base64-encoded). Perhaps, we could introduce a new HTTP header that a client would specify to request PrestoPage format instead of default JSON format.

(2) The PrestoQueryRunner needs HTTP client. I'm using Proxygen as it is already available in Prestissimo. However, we need this code in Velox, so we can run fuzzer on each PR. Hence, we need to figure out how to add Proxygen dependency to Velox.

(3) PrestoQueryRunner code needs to be hooked into the Aggregation Fuzzer.

CC: @amitkdutta @aditi-pandit @kgpai @laithsakka @pedroerp @spershin

@duanmeng
Copy link
Collaborator

duanmeng commented Sep 20, 2023

add Proxygen dependency to Velox.

Perhaps proxygen is too heavy with lots of dependencies in terms of a simple HTTP client. Can libcurl or its c++ wrapper curlpp or libcpr (supports FetchContent like fmt, xsimd in velox/CMAKE)be an option too?

@pedroerp
Copy link
Contributor

Love this. One nuance is that we might end up with slight parsing discrepancies between the Duck and Presto parsers. Hopefully it won't affect most simple cases.

@majetideepak discussion about moving our parser to follow Presto semantic would be super handy to go around this.

@majetideepak
Copy link
Collaborator

@pedroerp I am still contemplating between the Presto parser vs. Postgres parser that Duck uses. Using Postgres parser makes the porting straightforward. My guess is that the PG parser should be compatible for Aggregation Fuzzer's needs.

@mbasmanova
Copy link
Contributor Author

It seems that using PG parser will be easier. That parser is working well, the other one hasn't been used much and will require quite a bit of hardening.

@pedroerp
Copy link
Contributor

Agreed. Btw, do we need to use it through DuckDB, or could we just use the postgreSQL parser directly? I guess DuckDB should add a only a thin layer on top of that?

@mbasmanova
Copy link
Contributor Author

@rui-mo Rui, it would be nice to also add support for verifying Spark functions against Spark. Do you think it would be possible to create SparkQueryRunner similar to PrestoQueryRunner above?

mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Sep 22, 2023
…bator#6686)

Summary:
Extend PrestoSerializer to support deserializing dictionary-encoded data.

The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of facebookincubator#6595


Reviewed By: pedroerp

Differential Revision: D49532000

Pulled By: mbasmanova
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Sep 22, 2023
…bator#6686)

Summary:
Extend PrestoSerializer to support deserializing dictionary-encoded data.

The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of facebookincubator#6595


Reviewed By: pedroerp

Differential Revision: D49532000

Pulled By: mbasmanova
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Sep 23, 2023
…bator#6686)

Summary:
Extend PrestoSerializer to support deserializing dictionary-encoded data.

The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of facebookincubator#6595


Reviewed By: pedroerp

Differential Revision: D49532000

Pulled By: mbasmanova
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Sep 23, 2023
…bator#6686)

Summary:
Extend PrestoSerializer to support deserializing dictionary-encoded data.

The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of facebookincubator#6595


Reviewed By: pedroerp

Differential Revision: D49532000

Pulled By: mbasmanova
facebook-github-bot pushed a commit that referenced this issue Sep 23, 2023
Summary:
Extend PrestoSerializer to support deserializing dictionary-encoded data.

The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of #6595

Pull Request resolved: #6686

Reviewed By: pedroerp

Differential Revision: D49532000

Pulled By: mbasmanova

fbshipit-source-id: 8f3adb3c9e61d842b1bc00ecced9f972892952a6
@aditi-pandit
Copy link
Collaborator

@mbasmanova : This is very useful and would really improve our confidence for Presto functions. Thanks !

@rui-mo
Copy link
Collaborator

rui-mo commented Sep 25, 2023

@mbasmanova We found Spark can also be launched through REST API. Looks it is possible to create a SparkQueryRunner similar with PrestoQueryRunner. Link: https://sparkbyexamples.com/spark/submit-spark-job-via-rest-api/.
cc @PHILO-HE @zhouyuan

@mbasmanova
Copy link
Contributor Author

@rui-mo

We found Spark can also be launched through REST API

That sounds great. I don't see a way to fetch results via REST API. Do you know if that's possible?

@zhouyuan
Copy link
Contributor

That sounds great. I don't see a way to fetch results via REST API. Do you know if that's possible?

@mbasmanova @rui-mo maybe it's more convenient to test with Spark connect:
https://spark.apache.org/docs/latest/spark-connect-overview.html

-yuan

mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Sep 27, 2023
Summary:
Extract ReferenceQueryRunner interface to allow using different reference
databases for results verification in AggregationFuzzer. For example, we would
want to verify Presto functions against Presto and Spark functions against
Spark.

Part of facebookincubator#6595


Reviewed By: xiaoxmeng

Differential Revision: D49553797

Pulled By: mbasmanova
facebook-github-bot pushed a commit that referenced this issue Sep 27, 2023
Summary:
Extract ReferenceQueryRunner interface to allow using different reference
databases for results verification in AggregationFuzzer. For example, we would
want to verify Presto functions against Presto and Spark functions against
Spark.

Part of #6595

Pull Request resolved: #6701

Reviewed By: xiaoxmeng

Differential Revision: D49553797

Pulled By: mbasmanova

fbshipit-source-id: 489cd41ce4276c2d1d5230780f76439a39e70456
@mbasmanova
Copy link
Contributor Author

That sounds great. I don't see a way to fetch results via REST API. Do you know if that's possible?

@mbasmanova @rui-mo maybe it's more convenient to test with Spark connect: https://spark.apache.org/docs/latest/spark-connect-overview.html

-yuan

That looks promising. Do you have an idea about how to "load data" into Spark? The way Fuzzer works is it generates Velox vectors, then runs queries over these. For Presto, I'm jumping through some hoops to create Hive tables with the content of these Velox vectors. Can we do something similar for Spark?

See PrestoQueryRunner::execute in prestodb/presto@master...mbasmanova:presto:native-query-runner

facebook-github-bot pushed a commit that referenced this issue Sep 29, 2023
Summary:
Part of #6595

Pull Request resolved: #6808

Reviewed By: xiaoxmeng

Differential Revision: D49758968

Pulled By: mbasmanova

fbshipit-source-id: 889f4d0d8c24ee370d0cc5a0b5a65046b06a965d
@mbasmanova
Copy link
Contributor Author

I have been running AggregationFuzzer with Presto as a source of truth locally and discovered a few things.

  • bitwise_xxx aggregate functions in Presto take only BIGINT inputs, but Velox allows TINYINT, SMALLINT, INTEGER and BIGINT. This causes failures when fuzzer generates plans with non-BIGINT inputs as Presto query succeeds and returns BIGINT result, but Velox plan returns result of the same type as input.
  • min/max/min_by/max_by functions in Presto do not allow MAP inputs, but Velox allows these. MAPs are considered non-orderable in Presto. Velox needs to be fixed.
  • array_sort function doesn't allow MAP inputs and doesn't allow inputs with nested nulls. This prevents using array_sort to make results of various functions deterministic, i.e. array_sort(array_agg(x)) doesn't work for all inputs. We would need to invent something else. Temp workaround could be to avoid generating maps and complex types with nested nulls.

@aditi-pandit
Copy link
Collaborator

@mbasmanova : There are use-cases when Presto users are writing their own UDFs to register with Prestissimo. It would be great to use fuzzer along with them as well. Infact, I feel that we should run fuzzer in Presto builds as well. wdyt ?

@mbasmanova
Copy link
Contributor Author

@aditi-pandit

There are use-cases when Presto users are writing their own UDFs to register with Prestissimo.

Aditi, would you share more details about these use cases? Is there an example? It should be possible to run the Fuzzer of any kind of UDF.

@aditi-pandit
Copy link
Collaborator

@mbasmanova :

I agree that Fuzzer can run any kind of UDF.

At IBM we have a bunch of scalar functions related to IBM specific security/governance tech that we add to Presto registry. These are implemented at Prestissimo level and not in Velox.

To test these functions with the fuzzer you propose here we need to be able to invoke it (could simply be on command line) in Presto builds. Infact I feel that we should run this fuzzer test with each Presto build as well.

Based on prestodb/presto#21044, it seems simpler to put rest of fuzzer in Presto build as well. Velox needs bunch more dependencies to make this to run in Velox builds.

Ofcourse the disadvantage with this is that for any Presto function added to Velox, the testing will be delayed until the function makes it to Presto.

Would be great to hear your thoughts on this.

@mbasmanova
Copy link
Contributor Author

@aditi-pandit It would be best to test functions in the same repo as where they are added / modified. Hence, Presto functions added to Velox repo would be tested in Velox CI on each Velox PR. IBM-specific functions added to xxx repo would be tested by CI in that repo.

You are correct that adding PrestoQueryRunner to Velox repo would require adding some dependencies to Velox (something to allow for creating an HTTP client).

@aditi-pandit
Copy link
Collaborator

@aditi-pandit It would be best to test functions in the same repo as where they are added / modified. Hence, Presto functions added to Velox repo would be tested in Velox CI on each Velox PR. IBM-specific functions added to xxx repo would be tested by CI in that repo.

You are correct that adding PrestoQueryRunner to Velox repo would require adding some dependencies to Velox (something to allow for creating an HTTP client).

@mbasmanova : It is reasonable that the functions should be tested in the repo they are added in.

There were issues raised internally if there is a circular dependency between Presto and Velox if we wanted to run the Velox fuzzer in Presto builds. With your PR that doesn't seem to be the case. It relies on an HTTP client to send Presto requests to a running Presto server. There aren't dependencies on Presto code as such.

To test custom Presto functions, we could obtain the fuzzer program from the Velox submodule and run it independently in a test at build time. I am glossing over some details here since we do need a way to register the custom Presto functions with the fuzzer.

Infact, I feel that we should do a fuzzer run with existing Presto functions during build anyways.
wdyt ?

@mbasmanova
Copy link
Contributor Author

To test custom Presto functions, we could obtain the fuzzer program from the Velox submodule and run it independently in a test at build time.

That's right. See example in main() in https://github.com/prestodb/presto/pull/21028/files

ericyuliu pushed a commit to ericyuliu/velox that referenced this issue Oct 12, 2023
…bator#6686)

Summary:
Extend PrestoSerializer to support deserializing dictionary-encoded data.

The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of facebookincubator#6595

Pull Request resolved: facebookincubator#6686

Reviewed By: pedroerp

Differential Revision: D49532000

Pulled By: mbasmanova

fbshipit-source-id: 8f3adb3c9e61d842b1bc00ecced9f972892952a6
ericyuliu pushed a commit to ericyuliu/velox that referenced this issue Oct 12, 2023
Summary:
Extract ReferenceQueryRunner interface to allow using different reference
databases for results verification in AggregationFuzzer. For example, we would
want to verify Presto functions against Presto and Spark functions against
Spark.

Part of facebookincubator#6595

Pull Request resolved: facebookincubator#6701

Reviewed By: xiaoxmeng

Differential Revision: D49553797

Pulled By: mbasmanova

fbshipit-source-id: 489cd41ce4276c2d1d5230780f76439a39e70456
ericyuliu pushed a commit to ericyuliu/velox that referenced this issue Oct 12, 2023
…or#6808)

Summary:
Part of facebookincubator#6595

Pull Request resolved: facebookincubator#6808

Reviewed By: xiaoxmeng

Differential Revision: D49758968

Pulled By: mbasmanova

fbshipit-source-id: 889f4d0d8c24ee370d0cc5a0b5a65046b06a965d
@rui-mo
Copy link
Collaborator

rui-mo commented Oct 13, 2023

@mbasmanova

I don't see a way to fetch results via REST API. Do you know if that's possible?

It looks possible to fetch results via REST API. https://github.com/jamesshocking/Spark-REST-API-UDF

Do you have an idea about how to "load data" into Spark? The way Fuzzer works is it generates Velox vectors, then runs queries over these. For Presto, I'm jumping through some hoops to create Hive tables with the content of these Velox vectors. Can we do something similar for Spark?

Maybe similar as Presto runner, we can generate files through vectors and read them back to create table in Spark.

@zhztheplayer
Copy link
Contributor

@rui-mo (cc @mbasmanova) Probably Spark doesn't provide an embedded rest server to serve SQL queries like Presto. If it's complicated to use the job-submission API (the one in https://sparkbyexamples.com/spark/submit-spark-job-via-rest-api/) to implement this sort of IPC work then we may have to adopt Spark SQL thrift server which could support ODBC / JDBC by default.

https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server

ODBC is naturally designed for cross-language SQL query execution. JDBC could also become an option but we need an JDBC rest server to make it be able to talk with C++.

But I am still feeling that managing the JVM process (Spark, Presto) would be a non-trivial job from C++ side. I'll look at Presto's solution to see how it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

8 participants