47% of the time is spent in `BoolReader::read_bool` when decoding lossy images (#71)
Comments
Equivalent function in libwebp. Unfortunately, unlike with Huffman compression, I don't think it is possible to avoid bit-by-bit decoding for arithmetic coding.
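For readers landing here: the function in question implements VP8's boolean arithmetic decoder, specified in RFC 6386 (section 7.3). Below is a minimal Rust sketch following the RFC's pseudocode; the struct and method names are illustrative, not image-webp's actual API. The bit-at-a-time renormalization loop at the end is the part that resists the usual bit-reader tricks.

```rust
// Sketch of a VP8-style boolean arithmetic decoder per RFC 6386 §7.3.
// Names are illustrative, not image-webp's actual code.
struct BoolReader<'a> {
    data: &'a [u8],
    pos: usize,
    value: u32,    // current decoding window (two bytes preloaded)
    range: u32,    // kept in [128, 255] after renormalization
    bit_count: u8, // bits shifted out since the last byte was loaded
}

impl<'a> BoolReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        let mut r = BoolReader { data, pos: 0, value: 0, range: 255, bit_count: 0 };
        // Preload two bytes into the window.
        r.value = (r.next_byte() as u32) << 8 | r.next_byte() as u32;
        r
    }

    fn next_byte(&mut self) -> u8 {
        // Reading past the end yields zeros.
        let b = self.data.get(self.pos).copied().unwrap_or(0);
        self.pos += 1;
        b
    }

    /// Decode one bool; `prob` (0..=255) is the scaled probability of a
    /// zero bit, so 128 means fifty-fifty.
    fn read_bool(&mut self, prob: u8) -> bool {
        let split = 1 + (((self.range - 1) * prob as u32) >> 8);
        let big_split = split << 8;
        let bit = self.value >= big_split;
        if bit {
            self.range -= split;
            self.value -= big_split;
        } else {
            self.range = split;
        }
        // Renormalize one bit at a time: this loop is why the function
        // is hard to speed up like a plain Huffman bit reader.
        while self.range < 128 {
            self.value <<= 1;
            self.range <<= 1;
            self.bit_count += 1;
            if self.bit_count == 8 {
                self.bit_count = 0;
                self.value |= self.next_byte() as u32;
            }
        }
        bit
    }
}

fn main() {
    let mut r = BoolReader::new(&[0xFF; 8]);
    println!("first bool: {}", r.read_bool(128)); // prints "first bool: true"
}
```

The key contrast with Huffman decoding: here the next decoded bit depends on `range` and `prob`, which themselves depend on every previously decoded bit, so bits cannot be batched ahead of time.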
Unfortunately #72 didn't seem to make a difference in benchmarks on my machine. Decoding the linked image is still 1.8x slower than `dwebp`. I've grepped the libwebp source code for … There is also a …
This might be useful as a reference for optimizing this: https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far-too-many-ways-part-2/
That whole series of articles is phenomenal and I've referred to it many times. Unfortunately it isn't directly applicable to this function. The problem is that while …
Merry Christmas folks! I've been tinkering with this after Shnatsel nerd-sniped me with a comment on Reddit. My fork (main...SLiV9:image-webp:main) is still a bit of a mess, but at least I can report that I'm making progress: decoding with my fork is "only" 1.3x slower than with dwebp, according to my unscientific benchmark.
It unfortunately required a lot more changes than just …
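One direction such a rewrite can take (and one reason helpers like `from_be_bytes` later show up in profiles) is refilling the decoder's window several bytes at a time instead of shifting in one bit per renormalization step. A hedged sketch of the idea, not the fork's actual code:

```rust
/// Load up to four bytes starting at `pos` as one big-endian u32,
/// zero-padding past the end of the buffer. Refilling a bit buffer with
/// one such load replaces up to 32 single-bit shift-in steps.
fn refill_word(data: &[u8], pos: usize) -> u32 {
    let mut buf = [0u8; 4];
    let end = data.len().min(pos.saturating_add(4));
    if pos < end {
        buf[..end - pos].copy_from_slice(&data[pos..end]);
    }
    u32::from_be_bytes(buf)
}

fn main() {
    let data = [0x12, 0x34, 0x56, 0x78, 0x9A];
    assert_eq!(refill_word(&data, 0), 0x1234_5678);
    assert_eq!(refill_word(&data, 2), 0x5678_9A00); // zero-padded tail
    assert_eq!(refill_word(&data, 9), 0);           // fully past the end
}
```

The zero-padding matches what a bit-at-a-time reader effectively does when it runs off the end of the partition, so the bulk refill doesn't need a separate slow path for the last few bytes.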
That's amazing! I can run a regression test on the 7500 WebP images I scraped from the web once you feel it's ready.
Here's my PR: #124. I couldn't get any sampling profiler to work anymore (last time I tried was a year ago; I think my laptop is now just much older than my compiler), so I used hyperfine for benchmarking, callgrind to visualize call graphs and jump instructions, and cargo asm for obsessing over zero-cost abstractions. So next to that regression test, I'd love it if you could run the same benchmark and samply profile as at the top of the ticket. I'm curious how similar it is to my results. And perhaps I should have benchmarked it with some smaller files; I don't expect worse performance for any image bigger than 4x4 pixels, but you never know.
**Profiling**

I can confirm an exact 1.3x improvement on the original image from this issue! Here's the updated profile of your PR on that image: https://share.firefox.dev/41TQNjp
According to the profile, we can shave off another 2.6% by removing this `Frame` clone (lines 642 to 645 in 34223fa).
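For what it's worth, one common way to drop a clone like that is to move the finished frame out with `std::mem::take` instead of copying it. A hedged sketch; `Frame`, `ybuf`, and `take_frame` here are hypothetical stand-ins, not image-webp's actual types:

```rust
// Hypothetical stand-ins for the decoder's output frame.
#[derive(Default)]
struct Frame {
    ybuf: Vec<u8>, // some large pixel buffer
}

struct Decoder {
    frame: Frame,
}

impl Decoder {
    // Instead of returning self.frame.clone(), hand ownership to the
    // caller and leave a cheap empty Frame behind: no buffer is copied.
    // Requires Frame: Default.
    fn take_frame(&mut self) -> Frame {
        std::mem::take(&mut self.frame)
    }
}

fn main() {
    let mut d = Decoder { frame: Frame { ybuf: vec![0u8; 1024] } };
    let f = d.take_frame();
    assert_eq!(f.ybuf.len(), 1024);
    assert!(d.frame.ybuf.is_empty()); // decoder keeps only an empty shell
}
```

Whether this fits depends on whether the decoder needs the frame again afterwards (e.g. for animation or re-decoding); if it does, restructuring ownership at the call site is the alternative.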
**Smaller images**

On a 1024x1024 image I'm seeing an 8% improvement on a lossy image but a 7% regression on a lossless encoding of that same image. Tested on this AI-generated pic I pulled out of my downloads: 00016-3392158409

Profile for the current git for the 1024x1024 lossless image: https://share.firefox.dev/3DxL6gW
Awesome.
Does samply reconstruct inlined functions somehow? I'm confused how functions like `saturating_sub`, `from_be_bytes` and `Option::cloned` get sampled at all. But your profile of …
Hmm, how reproducible is that? Looking at the call graphs, I didn't touch those modules at all (only vp8.rs), and in the source code I also don't see any use of functions or constants from …

Edit: I also cannot reproduce it locally.
Yes, it does. You need this in your …
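The truncated snippet above is not recoverable from this page, but the setting usually recommended for symbolizing inlined frames with samply is keeping debug info in optimized builds. This is an assumption about what was shown, not the original content:

```toml
# In Cargo.toml: keep debug info in release builds so sampling profilers
# can map samples back to inlined functions.
# (Assumed content of the truncated snippet above.)
[profile.release]
debug = true
```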
On my Zen 4 machine I'm getting it very reliably with … Here are the exact images derived from the PNG I posted, so you can reproduce the measurements exactly: …

It's not unheard of for code layout to change performance by that much. Or there could be small changes in inlining decisions greatly affecting a hot loop, but that I'd expect to be reproducible. If you still cannot reproduce the regression on your machine, we can just chalk it up to code-layout changes. Lossy images are far more common than lossless ones anyway, so if we picked one to prioritize, it would be lossy.
I've re-run the benchmarks on the lossless image just to be sure, and I think the regression I measured is an artifact of something else. It seems to disappear into noise at higher iteration counts and with more warmup iterations before the actual measurements. So please disregard anything I said about the lossless decoding regression. I am sorry for the confusion. |
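The warmup effect described here is exactly what benchmark harnesses guard against. A minimal sketch of the idea (hyperfine and criterion do this properly, with statistics and outlier detection; this is only an illustration):

```rust
use std::time::Instant;

/// Minimal timing harness illustrating why warmup matters: run `f`
/// untimed first so caches, branch predictors, and CPU frequency settle,
/// then report the mean wall-clock time per iteration in nanoseconds.
fn bench<F: FnMut()>(mut f: F, warmup: u32, iters: u32) -> f64 {
    for _ in 0..warmup {
        f(); // untimed warmup iterations
    }
    let start = Instant::now();
    for _ in 0..iters {
        f(); // timed iterations
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();
    // black_box keeps the optimizer from deleting the measured work.
    let ns = bench(|| { std::hint::black_box(data.iter().sum::<u64>()); }, 50, 200);
    println!("mean: {ns:.1} ns/iter");
}
```

With too few warmup iterations or too few samples, the first (cold) runs dominate the mean, which can easily fabricate a single-digit-percent "regression" like the one retracted above.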
I just installed Linux Mint 22 on my desktop, which is much beefier and quieter, and I reran the benchmarks. Interestingly, the performance improvement compared to main is even higher: on this machine my fork is 1.4x faster than main.
I now also see no significant improvement from the read_flag commit (but then it doesn't hurt either, I suppose). And the main reason I switched: I can now run samply as well, so I see, for example, the 1.9% of runtime spent on the `Frame` clone that you mentioned. Lastly, I just ran …
I'm going to go ahead and close this now that #124 is merged.
Decoding this file with `image-webp` is 1.8x slower than `dwebp -noasm -nofancy`: Puente_de_Don_Luis_I,_Oporto,_Portugal,_2019-06-02,_DD_29-31_HDR.webp.gz (source: Wikipedia, converted to lossy WebP using imagemagick)

Profiling with `samply` shows `image_webp::vp8::BoolReader::read_bool` being responsible for 47% of the total runtime when decoding this image: https://share.firefox.dev/49ZwlOo

This seems to be a similar issue to #55