Enhance AggregationFuzzer to verify results against Presto #6595

mbasmanova · 2023-09-15T19:14:47Z

Description

Currently, Aggregation Fuzzer verifies results against DuckDB. However, not all functions are available in DuckDB and sometimes semantics don't match. It would be better to verify against Presto Java.

We could launch a Presto Java process and talk to it via its REST API: https://prestodb.io/docs/current/develop/client-protocol.html
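For readers unfamiliar with that protocol, here is a minimal sketch in Python (chosen only for brevity; the prototype itself is C++). It assumes a server at `http://localhost:8080` and a user name `fuzzer`, both placeholders. The `fetch` parameter is a hypothetical hook that lets the HTTP layer be swapped out, for example in tests. The client POSTs the SQL text to `/v1/statement` and then follows `nextUri` links until none is returned.

```python
import json
import urllib.request


def http_json(method, url, body=None, headers=None):
    """Send one HTTP request and parse the JSON response body."""
    request = urllib.request.Request(
        url, data=body, headers=headers or {}, method=method
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(sql, server="http://localhost:8080", user="fuzzer", fetch=http_json):
    """Run `sql` and return (column names, rows) by following `nextUri` links."""
    # Submit the query text; the reply usually carries no data yet.
    reply = fetch(
        "POST",
        server + "/v1/statement",
        body=sql.encode("utf-8"),
        headers={"X-Presto-User": user},
    )
    columns, rows = None, []
    while True:
        if "error" in reply:
            raise RuntimeError(reply["error"].get("message", "query failed"))
        if columns is None and "columns" in reply:
            columns = [c["name"] for c in reply["columns"]]
        # Each reply may carry a batch of rows encoded as JSON arrays.
        rows.extend(reply.get("data", []))
        next_uri = reply.get("nextUri")
        if next_uri is None:
            return columns, rows
        reply = fetch("GET", next_uri)
```

Every row arrives as a JSON array, which is the parsing overhead that item (1) below is concerned with.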

I put together a prototype of a PrestoQueryRunner that can execute Presto queries:

https://github.com/prestodb/presto/compare/master...mbasmanova:presto:native-query-runner?expand=1

There are a few things to figure out still.

(1) By default, Presto returns results in JSON format. This is slow and hard to parse. Hence, I hacked Presto to return results in PrestoPage format (base64-encoded). Perhaps we could introduce a new HTTP header that a client would specify to request PrestoPage format instead of the default JSON format.

(2) The PrestoQueryRunner needs an HTTP client. I'm using Proxygen as it is already available in Prestissimo. However, we need this code in Velox, so that we can run the fuzzer on each PR. Hence, we need to figure out how to add a Proxygen dependency to Velox.

(3) The PrestoQueryRunner code needs to be hooked into the Aggregation Fuzzer.

CC: @amitkdutta @aditi-pandit @kgpai @laithsakka @pedroerp @spershin

duanmeng · 2023-09-20T16:47:02Z

> add Proxygen dependency to Velox.

Proxygen may be too heavy, with lots of dependencies, for a simple HTTP client. Could libcurl, its C++ wrapper curlpp, or libcpr (which supports FetchContent, like fmt and xsimd in velox/CMAKE) be an option too?

pedroerp · 2023-09-21T00:16:03Z

Love this. One nuance is that we might end up with slight parsing discrepancies between the Duck and Presto parsers. Hopefully it won't affect most simple cases.

@majetideepak The discussion about moving our parser to follow Presto semantics would be very handy for getting around this.

majetideepak · 2023-09-21T16:55:20Z

@pedroerp I am still deciding between the Presto parser and the Postgres parser that Duck uses. Using the Postgres parser makes the porting straightforward. My guess is that the PG parser should be compatible with Aggregation Fuzzer's needs.

mbasmanova · 2023-09-21T17:14:51Z

It seems that using the PG parser will be easier. That parser is working well; the other one hasn't been used much and will require quite a bit of hardening.

pedroerp · 2023-09-22T01:16:53Z

Agreed. Btw, do we need to use it through DuckDB, or could we just use the PostgreSQL parser directly? I guess DuckDB adds only a thin layer on top of it?

mbasmanova · 2023-09-22T08:34:27Z

@rui-mo Rui, it would be nice to also add support for verifying Spark functions against Spark. Do you think it would be possible to create a SparkQueryRunner similar to the PrestoQueryRunner above?


Summary: Extend PrestoSerializer to support deserializing dictionary-encoded data. The format of dictionary-encoded columns is:
- 4 bytes: number of rows
- N bytes: dictionary column
- 4*numRows bytes: indices
- 24 bytes: 'instance id' (used by Presto, but not present in Velox)

Part of #6595
Pull Request resolved: #6686
Reviewed By: pedroerp
Differential Revision: D49532000
Pulled By: mbasmanova
fbshipit-source-id: 8f3adb3c9e61d842b1bc00ecced9f972892952a6
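The column layout listed in this commit message can be walked through with a short Python sketch. It is an illustration of the byte layout only, not the PrestoSerializer code. Little-endian integers are an assumption here, and `read_toy_longs` is a hypothetical stand-in for the real nested-column reader, whose actual encoding is not described in this thread.

```python
import struct


def read_toy_longs(buf, offset):
    """Hypothetical nested-column reader: int32 count, then `count` int64 values."""
    (count,) = struct.unpack_from("<i", buf, offset)
    values = list(struct.unpack_from("<%dq" % count, buf, offset + 4))
    return values, offset + 4 + 8 * count


def read_dictionary_column(buf, offset, read_nested=read_toy_longs):
    """Decode one dictionary-encoded column; return (flattened values, new offset)."""
    # 4 bytes: number of rows
    (num_rows,) = struct.unpack_from("<i", buf, offset)
    # N bytes: dictionary column (length is known only after parsing it)
    dictionary, offset = read_nested(buf, offset + 4)
    # 4 * numRows bytes: indices into the dictionary
    indices = struct.unpack_from("<%di" % num_rows, buf, offset)
    offset += 4 * num_rows
    # 24 bytes: 'instance id', skipped because only Presto uses it
    offset += 24
    return [dictionary[i] for i in indices], offset
```

The indices can only be located after the nested dictionary column has been parsed, because its length N is not stored up front.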

aditi-pandit · 2023-09-23T18:32:16Z

@mbasmanova: This is very useful and would really improve our confidence in Presto functions. Thanks!

rui-mo · 2023-09-25T06:16:26Z

@mbasmanova We found that Spark can also be launched through a REST API. It looks possible to create a SparkQueryRunner similar to PrestoQueryRunner. Link: https://sparkbyexamples.com/spark/submit-spark-job-via-rest-api/
cc @PHILO-HE @zhouyuan

mbasmanova · 2023-09-26T05:50:24Z

@rui-mo

> We found Spark can also be launched through REST API

That sounds great. I don't see a way to fetch results via REST API. Do you know if that's possible?

zhouyuan · 2023-09-27T01:10:47Z

> That sounds great. I don't see a way to fetch results via REST API. Do you know if that's possible?

@mbasmanova @rui-mo maybe it's more convenient to test with Spark connect:
https://spark.apache.org/docs/latest/spark-connect-overview.html

-yuan


Summary: Extract ReferenceQueryRunner interface to allow using different reference databases for results verification in AggregationFuzzer. For example, we would want to verify Presto functions against Presto and Spark functions against Spark. Part of #6595 Pull Request resolved: #6701 Reviewed By: xiaoxmeng Differential Revision: D49553797 Pulled By: mbasmanova fbshipit-source-id: 489cd41ce4276c2d1d5230780f76439a39e70456
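To illustrate the idea behind this refactoring, here is a rough Python sketch of a pluggable reference runner. The actual interface lives in Velox's C++ code and its signatures are not shown in this thread, so the method name `execute`, its parameters, and the toy `InMemorySumRunner` are all hypothetical.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ReferenceQueryRunner(ABC):
    """A reference database the fuzzer can ask for expected results."""

    @abstractmethod
    def execute(self, sql, input_rows):
        """Run `sql` over `input_rows` (a list of tuples) and return result rows."""


class InMemorySumRunner(ReferenceQueryRunner):
    """Toy stand-in for a real engine: ignores `sql`, sums column 1 by column 0."""

    def execute(self, sql, input_rows):
        totals = {}
        for key, value in input_rows:
            totals[key] = totals.get(key, 0) + value
        return list(totals.items())


def verify(actual_rows, runner, sql, input_rows):
    """Compare as multisets, since row order is undefined without ORDER BY."""
    expected = runner.execute(sql, input_rows)
    return Counter(map(tuple, actual_rows)) == Counter(map(tuple, expected))
```

With this shape, a Presto-backed or Spark-backed runner would only need to implement `execute`, and the verification step stays the same.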

mbasmanova · 2023-09-27T22:34:52Z

> > That sounds great. I don't see a way to fetch results via REST API. Do you know if that's possible?
>
> @mbasmanova @rui-mo maybe it's more convenient to test with Spark connect: https://spark.apache.org/docs/latest/spark-connect-overview.html
>
> -yuan

That looks promising. Do you have an idea about how to "load data" into Spark? The way Fuzzer works is it generates Velox vectors, then runs queries over these. For Presto, I'm jumping through some hoops to create Hive tables with the content of these Velox vectors. Can we do something similar for Spark?

See PrestoQueryRunner::execute in prestodb/presto@master...mbasmanova:presto:native-query-runner

Summary: Part of #6595 Pull Request resolved: #6808 Reviewed By: xiaoxmeng Differential Revision: D49758968 Pulled By: mbasmanova fbshipit-source-id: 889f4d0d8c24ee370d0cc5a0b5a65046b06a965d

mbasmanova · 2023-10-03T21:49:27Z

I have been running AggregationFuzzer with Presto as a source of truth locally and discovered a few things.

(1) The bitwise_xxx aggregate functions in Presto take only BIGINT inputs, but Velox allows TINYINT, SMALLINT, INTEGER and BIGINT. This causes failures when the fuzzer generates plans with non-BIGINT inputs: the Presto query succeeds and returns a BIGINT result, but the Velox plan returns a result of the same type as the input.
(2) The min/max/min_by/max_by functions in Presto do not allow MAP inputs, but Velox allows these. MAPs are considered non-orderable in Presto. Velox needs to be fixed.
(3) The array_sort function doesn't allow MAP inputs and doesn't allow inputs with nested nulls. This prevents using array_sort to make the results of various functions deterministic, e.g. array_sort(array_agg(x)) doesn't work for all inputs. We would need to invent something else. A temporary workaround could be to avoid generating maps and complex types with nested nulls.
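As an illustration of the ordering problem only, and not something proposed in this thread, the sketch below canonicalizes array_agg-style results on the verifier side instead of calling array_sort in SQL. It assumes results have already been converted to plain Python values (None for null, dict for MAP, list for ARRAY). The helper names are hypothetical.

```python
from collections import Counter


def sort_key(value):
    """Total order over nested values: nulls first, then by type name, then value."""
    if value is None:
        return (0, "", ())
    if isinstance(value, dict):  # a MAP: order its entries by key, then value
        entries = sorted((sort_key(k), sort_key(v)) for k, v in value.items())
        return (1, "map", tuple(entries))
    if isinstance(value, (list, tuple)):  # a nested ARRAY: keep element order
        return (1, "array", tuple(sort_key(v) for v in value))
    return (1, type(value).__name__, value)


def canonical_array(values):
    """Sort an array_agg-style result so that element order no longer matters."""
    return tuple(sorted(sort_key(v) for v in values))


def same_array_agg_results(actual_rows, expected_rows):
    """Each row is (group key, array); compare the rows as a multiset."""
    def canon(rows):
        return Counter((sort_key(k), canonical_array(a)) for k, a in rows)
    return canon(actual_rows) == canon(expected_rows)
```

The point of the key function is that it defines an order for maps and nested nulls, which are exactly the inputs array_sort rejects.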

aditi-pandit · 2023-10-09T16:13:34Z

@mbasmanova: There are use cases where Presto users write their own UDFs to register with Prestissimo. It would be great to use the fuzzer with them as well. In fact, I feel that we should run the fuzzer in Presto builds as well. WDYT?

mbasmanova · 2023-10-11T13:09:19Z

@aditi-pandit

> There are use-cases when Presto users are writing their own UDFs to register with Prestissimo.

Aditi, would you share more details about these use cases? Is there an example? It should be possible to run the Fuzzer on any kind of UDF.

aditi-pandit · 2023-10-11T17:33:00Z

@mbasmanova :

I agree that Fuzzer can run any kind of UDF.

At IBM we have a bunch of scalar functions related to IBM-specific security/governance tech that we add to the Presto registry. These are implemented at the Prestissimo level and not in Velox.

To test these functions with the fuzzer you propose here, we need to be able to invoke it (this could simply be on the command line) in Presto builds. In fact, I feel that we should run this fuzzer test with each Presto build as well.

Based on prestodb/presto#21044, it seems simpler to put the rest of the fuzzer in the Presto build as well. Velox needs a bunch more dependencies to make this run in Velox builds.

Of course, the disadvantage is that for any Presto function added to Velox, the testing will be delayed until the function makes it to Presto.

Would be great to hear your thoughts on this.

mbasmanova · 2023-10-11T17:36:56Z

@aditi-pandit It would be best to test functions in the same repo as where they are added / modified. Hence, Presto functions added to Velox repo would be tested in Velox CI on each Velox PR. IBM-specific functions added to xxx repo would be tested by CI in that repo.

You are correct that adding PrestoQueryRunner to Velox repo would require adding some dependencies to Velox (something to allow for creating an HTTP client).

aditi-pandit · 2023-10-11T22:04:00Z

> @aditi-pandit It would be best to test functions in the same repo as where they are added / modified. Hence, Presto functions added to Velox repo would be tested in Velox CI on each Velox PR. IBM-specific functions added to xxx repo would be tested by CI in that repo.
>
> You are correct that adding PrestoQueryRunner to Velox repo would require adding some dependencies to Velox (something to allow for creating an HTTP client).

@mbasmanova : It is reasonable that the functions should be tested in the repo they are added in.

Concerns were raised internally about whether running the Velox fuzzer in Presto builds would create a circular dependency between Presto and Velox. With your PR that doesn't seem to be the case. It relies on an HTTP client to send Presto requests to a running Presto server. There aren't dependencies on Presto code as such.

To test custom Presto functions, we could obtain the fuzzer program from the Velox submodule and run it independently in a test at build time. I am glossing over some details here since we do need a way to register the custom Presto functions with the fuzzer.

In fact, I feel that we should do a fuzzer run with the existing Presto functions during the build anyway.
WDYT?

mbasmanova · 2023-10-11T22:06:52Z

> To test custom Presto functions, we could obtain the fuzzer program from the Velox submodule and run it independently in a test at build time.

That's right. See example in main() in https://github.com/prestodb/presto/pull/21028/files


rui-mo · 2023-10-13T00:20:12Z

@mbasmanova

> I don't see a way to fetch results via REST API. Do you know if that's possible?

It looks possible to fetch results via REST API. https://github.com/jamesshocking/Spark-REST-API-UDF

> Do you have an idea about how to "load data" into Spark? The way Fuzzer works is it generates Velox vectors, then runs queries over these. For Presto, I'm jumping through some hoops to create Hive tables with the content of these Velox vectors. Can we do something similar for Spark?

Maybe, similar to the Presto runner, we can generate files from the vectors and read them back to create a table in Spark.

zhztheplayer · 2023-11-28T05:42:29Z

@rui-mo (cc @mbasmanova) Spark probably doesn't provide an embedded REST server for serving SQL queries the way Presto does. If it's complicated to use the job-submission API (the one in https://sparkbyexamples.com/spark/submit-spark-job-via-rest-api/) to implement this sort of IPC, then we may have to adopt the Spark SQL Thrift server, which supports ODBC / JDBC by default.

https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server

ODBC is naturally designed for cross-language SQL query execution. JDBC could also be an option, but we would need a JDBC REST server for it to be able to talk to C++.

But I still feel that managing the JVM process (Spark, Presto) from the C++ side would be non-trivial. I'll look at Presto's solution to see how it works.

mbasmanova added the "enhancement" (New feature or request) label · Sep 15, 2023

mbasmanova mentioned this issue · Sep 15, 2023: Allow clients to receive query results in PrestoPage format prestodb/presto#20886 (Closed)

mbasmanova mentioned this issue · Sep 22, 2023: Add support for DICTIONARY encoding to PrestoSerializer #6686 (Closed)

mbasmanova mentioned this issue · Sep 23, 2023: Add geometric_mean Presto aggregate function #6678 (Closed)

mbasmanova mentioned this issue · Sep 26, 2023: Decouple AggregationFuzzer from DuckDB #6701 (Closed)

mbasmanova mentioned this issue · Sep 29, 2023: Expose test APIs for converting vectors to variant's #6808 (Closed)

mbasmanova mentioned this issue · Oct 3, 2023: [native] Add AggregationFuzzer that uses Presto as a source of truth prestodb/presto#21028 (Draft)

mbasmanova mentioned this issue · Oct 5, 2023: [native] Add PrestoQueryRunner prestodb/presto#21044 (Merged)

This was referenced · Mar 26, 2024: Add support for kurtosis Spark aggregate function #9233 (Closed); Enhance AggregationFuzzer to verify results against Spark #9270 (Open)
