
overhead from json serde #5122

Closed
lmatz opened this issue Sep 6, 2022 · 5 comments

Comments

@lmatz
Contributor

lmatz commented Sep 6, 2022

Stateless streaming queries account for a large share of real-world workloads.

These queries take the input, apply some simple transformations, filter out some records, and then output the result to a sink.

In essence, the query itself is not that CPU-intensive, and it is also not IO-bound in terms of state access.
An example is Nexmark q10:

CREATE SOURCE bid (
    "auction" BIGINT,
    "bidder" BIGINT,
    "price" BIGINT,
    "channel" VARCHAR,
    "url" VARCHAR,
    "date_time" TIMESTAMP,
    "extra" VARCHAR
) with (
    connector = 'nexmark',
    nexmark.table.type = 'Bid',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '1000000'
) ROW FORMAT JSON;

CREATE MATERIALIZED VIEW mv_10 AS
SELECT auction, bidder, price, date_time, extra,
       DATE_FORMAT(date_time, 'yyyy-MM-dd'), DATE_FORMAT(date_time, 'HH:mm')
FROM bid;

Therefore, JSON serde is likely to be the bottleneck of such a query.
#4555 has discussed this issue.
For the purpose of benchmarking RisingWave itself, #4555 and #4961 proposed generating in-memory data chunks to bypass JSON serde. This is the right way to go.

In production, however, the query reads its input from other systems, and JSON deserialization is unavoidable.
Therefore, we need to benchmark some trustworthy JSON libraries, as JSON deserialization is probably the most CPU-intensive part and takes most of the processing time in such stateless queries.
And again, there are a lot of such queries.

@lmatz
Contributor Author

lmatz commented Sep 6, 2022

Different libraries make different tradeoffs and perform differently on different workloads.
Check out https://github.com/serde-rs/json-benchmark

https://github.com/pikkr/pikkr

Flink uses Jackson: https://github.com/FasterXML/jackson

@lmatz lmatz removed this from the release-0.1.13 milestone Sep 6, 2022
@neverchanje
Contributor

We first need to trace the JSON parsing time and how much of the end-to-end latency it occupies. I think it is not yet confirmed that JSON is the bottleneck.
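To put a rough number on that share, the measurement could be sketched as below. This is a stdlib-Python sketch only (not RisingWave's actual pipeline); `RAW` is a hypothetical Bid record mirroring the schema above, and `transform` is a stand-in for the stateless projection done by `mv_10`:

```python
import json
import time

# Hypothetical Bid record mirroring the source schema in this issue.
RAW = ('{"auction": 1, "bidder": 2, "price": 3, "channel": "ch", '
       '"url": "https://example.com", "date_time": "2022-09-06 00:00:00", '
       '"extra": "x"}')

def transform(bid: dict) -> tuple:
    # Stand-in for the stateless projection done by mv_10.
    return (bid["auction"], bid["bidder"], bid["price"], bid["extra"])

def parse_share(n: int = 50_000) -> float:
    """Return the fraction of per-record time spent in JSON deserialization."""
    start = time.perf_counter()
    for _ in range(n):
        bid = json.loads(RAW)              # deserialization only
    parse = time.perf_counter() - start

    bid = json.loads(RAW)
    start = time.perf_counter()
    for _ in range(n):
        transform(bid)                     # downstream work only
    rest = time.perf_counter() - start
    return parse / (parse + rest)

if __name__ == "__main__":
    print(f"parse share of total: {parse_share():.1%}")
```

For a real measurement this timing would of course have to happen inside the source executor rather than in a toy loop, but the ratio it reports is the quantity in question.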

@TennyZhuang
Contributor

Prefer #4961

@jon-chuang
Contributor

jon-chuang commented Sep 6, 2022

I think it is important to note the difference between static parsing, e.g. with a serde library where the structure is known at compile time, and dynamic JSON parsing where the structure is unknown. The latter is what many of these benchmarks (incl. https://github.com/serde-rs/json-benchmark) are based upon, but it is completely irrelevant to our actual requirement.

In our case, we can probably see some performance gains by leveraging a parser where the structure is known (I assume our current usage of serde_json is with dynamic parsing).

Prior to that, we can benchmark the expected performance difference for a known structure: deserialize into a Rust data type (a struct) with serde_json, deserialize dynamically with serde_json, deserialize with pikkr, and compare the results.
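The Rust crates named above (serde_json, pikkr) can't be exercised in a self-contained snippet here, but the shape of the typed-vs-dynamic comparison can be sketched with stdlib Python as an analogy. The `object_pairs_hook` path stands in for a schema-aware parser that skips building a generic tree; the field-order assumption is purely for illustration:

```python
import json
import timeit

# Hypothetical Bid record matching the schema in this issue.
RAW = ('{"auction": 1, "bidder": 2, "price": 3, "channel": "ch", '
       '"url": "https://example.com", "date_time": "2022-09-06 00:00:00", '
       '"extra": "x"}')

def parse_dynamic(raw: str) -> dict:
    # Generic tree, structure discovered at runtime
    # (analogous to serde_json::Value).
    return json.loads(raw)

# "Known structure" path: skip the generic dict and emit a fixed-shape tuple
# directly (loosely analogous to deserializing into a derived Rust struct).
# Assumes the fields arrive in the declared schema order.
_typed = json.JSONDecoder(
    object_pairs_hook=lambda pairs: tuple(v for _, v in pairs))

def parse_typed(raw: str) -> tuple:
    return _typed.decode(raw)

if __name__ == "__main__":
    for name, fn in (("dynamic", parse_dynamic), ("typed", parse_typed)):
        secs = timeit.timeit(lambda: fn(RAW), number=20_000)
        print(f"{name:8s} {secs:.3f}s")
```

The Rust version of this harness would bench the same record through `serde_json::from_str::<Bid>`, `serde_json::from_str::<serde_json::Value>`, and pikkr, which is exactly the comparison proposed above.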

@lmatz
Contributor Author

lmatz commented Sep 6, 2022

I don't think we need to redo what people did years ago to verify again that JSON parsing is the bottleneck:
Databricks, Data Artisans, and some independent blogs have all pointed out that JSON parsing (deserialization only) can decrease throughput by a huge factor, up to 35X.

And in the context of simple stateless queries, there is honestly not much else going on in RisingWave except JSON parsing.
