
overhead from json serde #5122

Closed
lmatz opened this issue Sep 6, 2022 · 5 comments

Comments

@lmatz
Contributor

lmatz commented Sep 6, 2022

Stateless streaming queries account for a large share of real-world workloads.

These queries take the input, apply some simple transformations, filter out some records, and then output the result to a sink.

In essence, the query itself is not that CPU-intensive, and it is also not IO-bound in terms of state access.
An example is Nexmark q10:

CREATE SOURCE bid (
    "auction" BIGINT,
    "bidder" BIGINT,
    "price" BIGINT,
    "channel" VARCHAR,
    "url" VARCHAR,
    "date_time" TIMESTAMP,
    "extra" VARCHAR
) with (
    connector = 'nexmark',
    nexmark.table.type = 'Bid',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '1000000'
) ROW FORMAT JSON;

CREATE MATERIALIZED VIEW mv_10 AS
SELECT auction, bidder, price, date_time, extra,
       DATE_FORMAT(date_time, 'yyyy-MM-dd'), DATE_FORMAT(date_time, 'HH:mm')
FROM bid;

Therefore, JSON serde is likely to be the bottleneck of such a query.
#4555 has discussed this issue.
For the purpose of benchmarking RisingWave itself, #4555 and #4961 proposed generating in-memory data chunks to bypass JSON serde. This is the right way to go.

In production, however, the query reads its input from other systems, and JSON deserialization is unavoidable.
Therefore, we need to benchmark some trustworthy JSON libraries, as JSON deserialization is probably the most CPU-intensive part and takes most of the processing time in such stateless queries.
And again, there are a lot of such queries.

@lmatz
Contributor Author

lmatz commented Sep 6, 2022

Different libraries make different tradeoffs and perform differently on different workloads.
Check out https://github.com/serde-rs/json-benchmark

https://github.com/pikkr/pikkr

Flink uses Jackson: https://github.com/FasterXML/jackson

@lmatz lmatz removed this from the release-0.1.13 milestone Sep 6, 2022
@neverchanje
Contributor

We first need to trace the JSON parsing time and how much of the end-to-end latency it occupies. I think it is not yet confirmed that JSON is the bottleneck.
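To put a rough number on that share, the measurement could be sketched as below. This is a stdlib-Python sketch only (not RisingWave's actual pipeline); `RAW` is a hypothetical Bid record mirroring the schema above, and `transform` is a stand-in for the stateless projection done by `mv_10`:

```python
import json
import time

# Hypothetical Bid record mirroring the source schema in this issue.
RAW = ('{"auction": 1, "bidder": 2, "price": 3, "channel": "ch", '
       '"url": "https://example.com", "date_time": "2022-09-06 00:00:00", '
       '"extra": "x"}')

def transform(bid: dict) -> tuple:
    # Stand-in for the stateless projection done by mv_10.
    return (bid["auction"], bid["bidder"], bid["price"], bid["extra"])

def parse_share(n: int = 50_000) -> float:
    """Return the fraction of per-record time spent in JSON deserialization."""
    start = time.perf_counter()
    for _ in range(n):
        bid = json.loads(RAW)              # deserialization only
    parse = time.perf_counter() - start

    bid = json.loads(RAW)
    start = time.perf_counter()
    for _ in range(n):
        transform(bid)                     # downstream work only
    rest = time.perf_counter() - start
    return parse / (parse + rest)

if __name__ == "__main__":
    print(f"parse share of total: {parse_share():.1%}")
```

For a real measurement this timing would of course have to happen inside the source executor rather than in a toy loop, but the ratio it reports is the quantity in question.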

@TennyZhuang
Contributor

Prefer #4961

@jon-chuang
Contributor

jon-chuang commented Sep 6, 2022

I think it is important to note the difference between static parsing, e.g. with a serde library where the structure is known at compile time, and dynamic JSON parsing where the structure is unknown. The latter is what many of these benchmarks (incl. https://github.com/serde-rs/json-benchmark) are based upon, but it is completely irrelevant to our actual requirement.

In our case, we can probably see some performance gains by leveraging a parser where the structure is known (I assume our current usage of serde_json is with dynamic parsing).

Prior to that, we can benchmark the expected performance difference for a known structure: deserialize into a Rust data type (a struct) with serde_json, deserialize dynamically with serde_json, deserialize with pikkr, and compare the results.
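The Rust crates named above (serde_json, pikkr) can't be exercised in a self-contained snippet here, but the shape of the typed-vs-dynamic comparison can be sketched with stdlib Python as an analogy. The `object_pairs_hook` path stands in for a schema-aware parser that skips building a generic tree; the field-order assumption is purely for illustration:

```python
import json
import timeit

# Hypothetical Bid record matching the schema in this issue.
RAW = ('{"auction": 1, "bidder": 2, "price": 3, "channel": "ch", '
       '"url": "https://example.com", "date_time": "2022-09-06 00:00:00", '
       '"extra": "x"}')

def parse_dynamic(raw: str) -> dict:
    # Generic tree, structure discovered at runtime
    # (analogous to serde_json::Value).
    return json.loads(raw)

# "Known structure" path: skip the generic dict and emit a fixed-shape tuple
# directly (loosely analogous to deserializing into a derived Rust struct).
# Assumes the fields arrive in the declared schema order.
_typed = json.JSONDecoder(
    object_pairs_hook=lambda pairs: tuple(v for _, v in pairs))

def parse_typed(raw: str) -> tuple:
    return _typed.decode(raw)

if __name__ == "__main__":
    for name, fn in (("dynamic", parse_dynamic), ("typed", parse_typed)):
        secs = timeit.timeit(lambda: fn(RAW), number=20_000)
        print(f"{name:8s} {secs:.3f}s")
```

The Rust version of this harness would bench the same record through `serde_json::from_str::<Bid>`, `serde_json::from_str::<serde_json::Value>`, and pikkr, which is exactly the comparison proposed above.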

@lmatz
Contributor Author

lmatz commented Sep 6, 2022

I don't think we need to redo what people did years ago to verify again that JSON parsing is the bottleneck:
Databricks, Data Artisans, and some independent blogs have all pointed out that JSON parsing (deserialization only) can decrease throughput by a huge factor, up to 35X.

And in the context of simple stateless queries, there is honestly not much else going on in RisingWave except JSON parsing.
