overhead from json serde #5122
Comments
Different libraries make different tradeoffs and perform differently on different workloads. See https://github.com/pikkr/pikkr. Flink uses Jackson; check out https://github.com/FasterXML/jackson
We first need to trace the JSON parsing time and how much of the end-to-end latency it accounts for. I don't think it has been confirmed yet that JSON is the bottleneck.
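One way to get a first number for this is to time the parse step separately from the rest of a stateless pipeline. The sketch below is only an illustration of the measurement, not RisingWave code: it uses Python's standard `json` module, and the event schema, `make_events`, and `run_stateless_query` are all hypothetical names made up for the example. The absolute numbers will not carry over to a Rust parser; only the method does.

```python
import json
import time


def make_events(n):
    """Generate n synthetic JSON-encoded events (hypothetical bid-like schema)."""
    return [
        json.dumps({
            "auction": i % 1000,
            "bidder": i,
            "price": 100 + (i * 7) % 900,
            "date_time": 1_600_000_000_000 + i,
        })
        for i in range(n)
    ]


def run_stateless_query(lines, min_price):
    """Parse each line, filter on price, project two fields.

    Returns the output rows plus the time spent inside json.loads and the
    total wall-clock time of the loop, both in nanoseconds.
    """
    parse_ns = 0
    out = []
    start = time.perf_counter_ns()
    for line in lines:
        t0 = time.perf_counter_ns()
        event = json.loads(line)  # the deserialization step being traced
        parse_ns += time.perf_counter_ns() - t0
        if event["price"] >= min_price:  # filter
            out.append((event["auction"], event["price"]))  # projection
    total_ns = time.perf_counter_ns() - start
    return out, parse_ns, total_ns


if __name__ == "__main__":
    lines = make_events(100_000)
    rows, parse_ns, total_ns = run_stateless_query(lines, min_price=500)
    print(f"rows out: {len(rows)}")
    print(f"JSON parse share of total: {parse_ns / total_ns:.1%}")
```

Per-record timer calls add overhead of their own, so the reported share is a rough estimate; a sampling profiler or coarser-grained batching would give a cleaner picture.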
Prefer #4961
I think it is important to note the difference between static parsing, where the structure is known up front (e.g. deserializing into a struct with the serde library), and dynamic JSON parsing, where the structure is unknown. The latter is what many of these benchmarks (incl. https://github.com/serde-rs/json-benchmark) are based on, but it is completely irrelevant to our actual requirement. In our case, we can probably see some performance gains by using a parser that knows the structure (I assume our current usage of serde-json is dynamic parsing). Before that, we can benchmark the expected difference on a known structure across four cases: data already held as a Rust data type, deserializing into a struct with serde-json, deserializing dynamically with serde-json, and deserializing with pikkr.
I don't think we need to repeat what people did years ago to verify once more that JSON parsing is the bottleneck. And the context here is simple stateless queries, where there is honestly not much going on in RisingWave except JSON parsing.
Stateless streaming queries make up a large percentage of real-world workloads.
These queries take the input, apply some simple transformations, filter out some records, and then write the output to a sink.
In essence, the query itself is not very CPU-intensive, and it is not I/O-bound in terms of state access either.
An example is Nexmark q10:
Therefore, JSON serde is likely to be the bottleneck of the query.
#4555 has discussed this issue.
For the purpose of benchmarking RisingWave itself, #4555 and #4961 proposed to generate in-memory data chunks to bypass JSON serde. This is the right way to go.
In reality, however, the query reads its input from other systems, and JSON deserialization is unavoidable.
Therefore, we need to benchmark some trustworthy JSON libraries, as JSON deserialization is probably the most CPU-intensive part and takes most of the processing time in such stateless queries.
And again, there are a lot of such queries.