Evaluate whether pysimdjson could be used in Rally #1046
Comments
I'm watching this issue - if you find any missing functionality or issues in pysimdjson that would block this, let me know and they'll be resolved.
The three main contenders for parsing JSON are:
Ease of use
Nothing beats the standard library here, but orjson and pysimdjson both provide wheels, so no compilation is needed in practice. orjson is more popular (3.4k stars vs. 0.5k for pysimdjson). orjson is also more actively maintained (which makes sense as pysimdjson is only a wrapper). But orjson had Python 3.10 wheels before pysimdjson. Neither currently has Python 3.11 wheels. Small note: orjson only serializes to/deserializes from bytes, which makes sense but is more restrictive than the standard library.
Speed
A good test bed for pysimdjson support for extracting specific keys is this |
Keep in mind 3.11 is not out yet, and you should never push beta-tagged wheels to PyPI as the ABI is not yet stable. When 3.11 is released and cibuildwheel is updated, pysimdjson (and orjson) will push 3.11 wheels.
While I don't have anything super useful to add here in terms of replacements, I would just like to throw my anecdotal hat into this ring with respect to the It wasn't until I ran multiple copies of Elastic Rally with identical settings concurrently from the same host that I was able to start approaching any of the hardware limits in the cluster. In the end, I had to run 12x Elastic Rally instances on the My suspicion was that, similar to the Golang stdlib for
@berglh Thanks for the report! It's true that you should always check that the client is not the bottleneck. Until we fix #1399, would you mind running https://github.com/benfred/py-spy on one of the Rally processes? It will tell us what exactly is being slow.
@pquentin I'm not sure if you were after the flame graph specifically or a different format. I can run it again with the other output if required. I went ahead and cleared out our cluster password from the SVG. I didn't see anything specifically JSON-related in the hotspots, but there's a lot going on as I captured the parent and subprocesses of the elastic/logs track. esrally_profile
I opened #1566 so that this issue stays focused on pysimdjson.
There are largely two areas where handling large chunks of JSON impacts performance in Rally:
The simdjson project takes advantage of modern SIMD vector instructions to achieve much higher performance than other libraries.
The pysimdjson project brings those benefits to Python via bindings with prebuilt binary wheels for a lot of platforms. Additionally, it provides JSON pointers via at(), or proxies for objects and lists to reduce the creation of Python objects. We've been hitting these issues at various points, e.g. in #941 and #935 (i.e. especially after using an async-io based load generator).
Given the benchmark results this could be a very useful library to use. Both projects use the Apache 2.0 license.