
47% of the time is spent in BoolReader::read_bool when decoding lossy images #71

Closed · Shnatsel opened this issue Apr 30, 2024 · 13 comments

@Shnatsel (Contributor) commented Apr 30, 2024

Decoding this file with image-webp is 1.8x slower than dwebp -noasm -nofancy:
Puente_de_Don_Luis_I,_Oporto,_Portugal,_2019-06-02,_DD_29-31_HDR.webp.gz

(source: Wikipedia, converted to lossy WebP using ImageMagick)

Profiling with samply shows image_webp::vp8::BoolReader::read_bool being responsible for 47% of the total runtime when decoding this image: https://share.firefox.dev/49ZwlOo

This seems to be a similar issue to #55.

@fintelia (Contributor) commented May 1, 2024

Equivalent function in libwebp

Unfortunately, unlike with Huffman coding, I don't think it is possible to avoid bit-by-bit decoding for arithmetic coding.
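
For intuition, here is a minimal sketch of one such boolean (range) decoder step, following the bool_decoder described in RFC 6386; the names and state layout are illustrative, not image-webp's actual implementation:

    // Sketch of a VP8-style boolean (range) decoder step, after RFC 6386.
    // State layout and names are illustrative, not image-webp's code.
    struct BoolReader<'a> {
        data: &'a [u8], // remaining compressed bytes
        pos: usize,     // index of the next byte to shift in
        value: u32,     // current window into the bitstream
        range: u32,     // current range, renormalized into [128, 255]
        bit_count: u32, // bits consumed since the last byte refill
    }

    impl<'a> BoolReader<'a> {
        // Decode one symbol whose probability of being 0 is prob/256.
        fn read_bool(&mut self, prob: u8) -> bool {
            // Split the range in proportion to the probability.
            let split = 1 + (((self.range - 1) * u32::from(prob)) >> 8);
            let big_split = split << 8;
            let bit = self.value >= big_split;
            if bit {
                self.range -= split;
                self.value -= big_split;
            } else {
                self.range = split;
            }
            // Renormalize bit by bit, refilling a byte every 8 shifts.
            while self.range < 128 {
                self.value <<= 1;
                self.range <<= 1;
                self.bit_count += 1;
                if self.bit_count == 8 {
                    self.bit_count = 0;
                    if self.pos < self.data.len() {
                        self.value |= u32::from(self.data[self.pos]);
                        self.pos += 1;
                    }
                }
            }
            bit
        }
    }

Every call updates range and value based on the decoded outcome, so consecutive symbols form a serial dependency chain; that is what resists the table-driven batching tricks that work for Huffman codes.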

@Shnatsel (Contributor, Author) commented Oct 16, 2024

Unfortunately #72 didn't seem to make a difference in benchmarks on my machine. Decoding the linked image is still 1.8x slower than dwebp -noasm -nofancy.

I've grepped the libwebp source code for VP8GetBit and couldn't find any specialized assembly routines for it, but they do have a dedicated variant for prob=0x80 (128 in decimal), so that's something we could probably crib for easy gains. prob=128 is used here and here.
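
With the usual split computation split = 1 + (((range - 1) * prob) >> 8), fixing prob at 128 turns the multiply into a shift. A hypothetical specialization in that spirit (the body below is illustrative, not libwebp's actual code) could look like:

    // Hypothetical fast path for prob = 0x80: (range - 1) * 128 >> 8 is
    // just (range - 1) >> 1, so the split needs no multiply.
    fn read_flag(&mut self) -> bool {
        let split = 1 + ((self.range - 1) >> 1);
        let big_split = split << 8;
        let bit = self.value >= big_split;
        if bit {
            self.range -= split;
            self.value -= big_split;
        } else {
            self.range = split;
        }
        // Since range starts in [128, 255], both halves leave it >= 64,
        // so this renormalization loop runs at most one iteration.
        while self.range < 128 {
            self.value <<= 1;
            self.range <<= 1;
            self.bit_count += 1;
            if self.bit_count == 8 {
                self.bit_count = 0;
                if self.pos < self.data.len() {
                    self.value |= u32::from(self.data[self.pos]);
                    self.pos += 1;
                }
            }
        }
        bit
    }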

There is also a VP8GetBitAlt function whose purpose isn't clear to me at a glance.

@Shnatsel (Contributor, Author) commented

This might be useful as a reference for optimizing this: https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far-too-many-ways-part-2/

@fintelia (Contributor) commented

That whole series of articles is phenomenal and I've referred to it many times. Unfortunately it isn't directly applicable to this function. The problem is that while read_bool sounds like it just reads a bit from the input stream, it actually performs range coding: it extracts a compressed 1-bit symbol based on the current state of the range coder and the given probability.

@SLiV9 (Contributor) commented Dec 27, 2024

Merry Christmas, folks! I've been tinkering with this after Shnatsel nerd-sniped me with a comment on Reddit. My fork (main...SLiV9:image-webp:main) is still a bit of a mess, but at least I can report that I'm making progress: decoding with my fork is "only" 1.3x slower than with dwebp, according to my unscientific benchmark.

Summary
  'dwebp -noasm -nofancy Puente.webp' ran
    1.00 ± 0.00 times faster than 'dwebp -noasm -nofancy Puente.webp'
    1.31 ± 0.00 times faster than 'target/release/image-webp-runner Puente.webp'
    1.71 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp --use-reference'
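
(That summary is hyperfine output; the invocation was presumably along these lines, with the warmup count assumed:)

    hyperfine --warmup 3 \
        'dwebp -noasm -nofancy Puente.webp' \
        'dwebp -noasm -nofancy Puente.webp' \
        'target/release/image-webp-runner Puente.webp' \
        'target/release/image-webp-runner Puente.webp --use-reference'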

It unfortunately required a lot more changes than just read_bool, so once I'm done (i.e. once I give up), I'll clean it up, simplify it as much as possible, and squash the commits before making any PRs. The tests still pass, but I haven't actually verified the decoded output, so that's a final check that still needs to be done.

@Shnatsel (Contributor, Author) commented

That's amazing!

I can run a regression test on the 7500 webp images I scraped from the web once you feel it's ready.

@SLiV9 (Contributor) commented Dec 28, 2024

Here's my PR: #124

I couldn't get any sampling profiler to work anymore (last time I tried was a year ago; I think my laptop is now just much older than my compiler), so I used hyperfine for benchmarking, callgrind to visualize call graphs and jump instructions, and cargo asm for obsessing over zero-cost abstractions.

So besides that regression test, I'd love it if you could run the same benchmark and samply profile as at the top of this ticket. I'm curious how similar it is to my results.

And perhaps I should have benchmarked with some smaller files; I don't expect worse performance for any image bigger than 4x4 pixels, but you never know.

@Shnatsel (Contributor, Author) commented

Profiling

I can confirm an exact 1.3x improvement on the original image from this issue!

Here's the updated profile on your PR on that image: https://share.firefox.dev/41TQNjp

read_flag is responsible for 13% of the time, so a specialized fast path for it might be worth a shot (assuming it isn't inlined; if it is inlined, constant propagation should take care of it anyway)

According to the profile, we can shave off another 2.6% by removing this .clone():

image-webp/src/decoder.rs, lines 642 to 645 (at 34223fa):

    // TODO: avoid cloning frame
    let frame = Vp8Decoder::new(range_reader(&mut self.r, range.start..range.end)?)
        .decode_frame()?
        .clone();
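
A hypothetical way to avoid it, assuming a consuming accessor could be added to Vp8Decoder (into_frame below is invented for illustration; the real API may want a different shape):

    // Sketch: let the decoder hand over ownership of the frame instead of
    // lending a reference that must be cloned. `into_frame` does not exist
    // in image-webp today; it stands in for any owning accessor.
    let mut decoder = Vp8Decoder::new(range_reader(&mut self.r, range.start..range.end)?);
    decoder.decode_frame()?;
    let frame = decoder.into_frame(); // moves the Frame out; no clone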

Smaller images

On a 1024x1024 image I'm seeing an 8% improvement on a lossy image but a 7% regression on a lossless encoding of that same image. Tested on this AI-generated pic I pulled out of my downloads: 00016-3392158409

Profile for the current git for the 1024x1024 lossless image: https://share.firefox.dev/3DxL6gW
Same image on your PR: https://share.firefox.dev/4fAQ8qi

@SLiV9 (Contributor) commented Dec 29, 2024

> I can confirm an exact 1.3x improvement on the original image from this issue!

Awesome.

> Here's the updated profile on your PR on that image: https://share.firefox.dev/41TQNjp
>
> read_flag is responsible for 13% of the time, so a specialized fast path for it might be worth a shot (assuming it isn't inlined; if it is inlined, constant propagation should take care of it anyway)

Does samply reconstruct inlined functions somehow? I'm confused how functions like "saturating_sub", "from_be_bytes" and "Option::cloned" get sampled at all.

But your profile of read_coefficients is quite different (13% read_flag, 39% read_with_tree, 5% idct4x4) from what I saw (8% read_flag, 45% read_with_tree, 9% idct4x4), and no doubt more accurate than callgrind's VM. So I agree that read_flag might gain something from a fast path.

> Smaller images
>
> On a 1024x1024 image I'm seeing an 8% improvement on a lossy image but a 7% regression on a lossless encoding of that same image. Tested on this AI-generated pic I pulled out of my downloads: 00016-3392158409
>
> Profile for the current git for the 1024x1024 lossless image: https://share.firefox.dev/3DxL6gW Same image on your PR: https://share.firefox.dev/4fAQ8qi

Hmm, how reproducible is that? Looking at the call graphs, I didn't touch those modules at all (only vp8.rs), and in the source code I don't see any use of functions or constants from vp8 either. My changes might affect binary size, but I wouldn't expect that to result in an 8% slowdown.

Edit: I also cannot reproduce it locally.

@Shnatsel (Contributor, Author) commented Dec 29, 2024

> Does samply reconstruct inlined functions somehow?

Yes, it does. You need this in your Cargo.toml for it to work:

[profile.release]
debug = true

perf record --call-graph=dwarf can reconstruct inlined functions as well, though I'm not sure how precise that is. On the other hand, it can capture kernel traces, which samply cannot.
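
With that in place, capturing a profile like the ones linked above is just a matter of wrapping the command, something like:

    samply record target/release/image-webp-runner Puente.webp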

> Hmm, how reproducible is that?

On my Zen 4 machine I'm getting it very reliably with hyperfine. I checked against the latest git rather than the crates.io release just to be sure; on this particular image the results are identical between crates.io and git, but then regress by 8% on your branch.

Here are the exact images derived from the PNG I posted, so you could reproduce the measurements exactly:
00016-webp-converted.zip

It's not unheard of for code layout changes to affect performance by that much. Or there could be small changes in inlining decisions that greatly affect a hot loop, but I'd expect that to be reproducible. If you still cannot reproduce the regression on your machine, we can just chalk it up to code layout changes. Lossy images are far more common than lossless ones anyway, so if we had to pick one to prioritize, it would be lossy.

@Shnatsel (Contributor, Author) commented

I've re-run the benchmarks on the lossless image just to be sure, and I think the regression I measured is an artifact of something else: it disappears into noise at higher iteration counts and with more warmup iterations before the actual measurements. So please disregard what I said about a lossless decoding regression. I'm sorry for the confusion.

@SLiV9 (Contributor) commented Jan 1, 2025

I just installed Linux Mint 22 on my desktop, which is much beefier and quieter, and reran the benchmarks. Interestingly, the performance improvement over main is even higher on this machine: my fork is 1.4x faster than main.

Summary
  dwebp -noasm -nofancy Puente.webp ran
    1.01 ± 0.01 times faster than dwebp -noasm -nofancy Puente.webp
    1.19 ± 0.01 times faster than target/release/image-webp-runner Puente.webp
    1.19 ± 0.01 times faster than target/release/image-webp-runner-without-readflag-20250101-1828 Puente.webp
    1.68 ± 0.02 times faster than target/release/image-webp-runner Puente.webp --use-reference

I now also see no significant improvement from the read_flag commit (but I suppose it doesn't hurt either).

And the main reason I switched: I can now run samply as well, so I can see, for example, the 1.9% of runtime spent on the Frame clone that you mentioned.

Lastly, I just ran perf stat, and it shows something very interesting on this machine. The branch prediction is better (which makes sense, because that's part of what I've been optimizing), but the cache misses are worse than dwebp's at every level, both relative and absolute. Specifically, the roughly 50% "cache-misses" rate (which I guess is L3?) seems weird; it's only 30% on my laptop. I'm not confident enough to say what it means (if anything).
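
(The stats below presumably came from an invocation along these lines; the event list is read off the output:)

    perf stat \
        -e task-clock,cycles,instructions \
        -e branch-instructions,branch-misses \
        -e cache-references,cache-misses \
        -e L1-dcache-loads,L1-dcache-load-misses,L1-icache-load-misses \
        -e icache_16b.ifdata_stall,icache_64b.iftag_stall,icache_tag.stalls \
        -e l2_rqsts.references,l2_rqsts.miss \
        -- target/release/image-webp-runner Puente.webp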

 Performance counter stats for 'target/release/image-webp-runner Puente.webp':

            742,03 msec task-clock                       #    1,000 CPUs utilized             
     3.317.363.954      cycles                           #    4,471 GHz                         (57,15%)
       667.741.184      branch-instructions              #  899,884 M/sec                       (57,15%)
        26.111.124      branch-misses                    #    3,91% of all branches             (57,15%)
     6.616.377.942      instructions                     #    1,99  insn per cycle              (64,29%)
        13.384.718      cache-references                 #   18,038 M/sec                       (64,29%)
         6.968.569      cache-misses                     #   52,06% of all cache refs           (64,29%)
         9.156.959      L1-dcache-load-misses            #    0,76% of all L1-dcache accesses   (64,29%)
     1.208.244.411      L1-dcache-loads                  #    1,628 G/sec                       (64,29%)
         4.740.647      L1-icache-load-misses                                                   (64,29%)
         4.660.379      icache_16b.ifdata_stall          #    6,281 M/sec                       (64,28%)
         8.450.530      icache_64b.iftag_stall           #   11,388 M/sec                       (64,28%)
         8.501.014      icache_tag.stalls                #   11,456 M/sec                       (64,28%)
        18.960.860      l2_rqsts.references              #   25,553 M/sec                       (57,14%)
         8.792.648      l2_rqsts.miss                    #   11,849 M/sec                       (57,14%)


 Performance counter stats for 'target/release/image-webp-runner --use-reference Puente.webp':

          1.074,78 msec task-clock                       #    1,000 CPUs utilized             
     4.776.462.813      cycles                           #    4,444 GHz                         (57,02%)
     1.137.468.776      branch-instructions              #    1,058 G/sec                       (57,02%)
        66.575.053      branch-misses                    #    5,85% of all branches             (57,02%)
     7.876.089.193      instructions                     #    1,65  insn per cycle              (64,18%)
        14.495.232      cache-references                 #   13,487 M/sec                       (64,18%)
         7.255.675      cache-misses                     #   50,06% of all cache refs           (64,18%)
         8.439.234      L1-dcache-load-misses            #    0,60% of all L1-dcache accesses   (64,26%)
     1.417.066.645      L1-dcache-loads                  #    1,318 G/sec                       (64,35%)
         6.837.416      L1-icache-load-misses                                                   (64,44%)
         8.202.516      icache_16b.ifdata_stall          #    7,632 M/sec                       (64,47%)
         2.411.492      icache_64b.iftag_stall           #    2,244 M/sec                       (64,47%)
         2.408.612      icache_tag.stalls                #    2,241 M/sec                       (64,40%)
        19.094.842      l2_rqsts.references              #   17,766 M/sec                       (57,14%)
         8.712.437      l2_rqsts.miss                    #    8,106 M/sec                       (57,05%)


 Performance counter stats for 'dwebp -noasm -nofancy Puente.webp':

            632,14 msec task-clock                       #    1,000 CPUs utilized             
     2.817.121.918      cycles                           #    4,456 GHz                         (57,05%)
       416.740.732      branch-instructions              #  659,253 M/sec                       (57,20%)
        29.424.762      branch-misses                    #    7,06% of all branches             (57,29%)
     5.397.791.096      instructions                     #    1,92  insn per cycle              (64,41%)
        15.434.086      cache-references                 #   24,416 M/sec                       (64,42%)
         2.976.191      cache-misses                     #   19,28% of all cache refs           (64,41%)
         8.531.085      L1-dcache-load-misses            #    1,30% of all L1-dcache accesses   (64,40%)
       654.354.805      L1-dcache-loads                  #    1,035 G/sec                       (64,41%)
         1.460.015      L1-icache-load-misses                                                   (64,41%)
         2.900.975      icache_16b.ifdata_stall          #    4,589 M/sec                       (64,31%)
         6.145.207      icache_64b.iftag_stall           #    9,721 M/sec                       (64,15%)
         6.067.751      icache_tag.stalls                #    9,599 M/sec                       (64,08%)
        16.161.307      l2_rqsts.references              #   25,566 M/sec                       (56,93%)
         9.387.819      l2_rqsts.miss                    #   14,851 M/sec                       (56,94%)

@Shnatsel (Contributor, Author) commented

I'm going to go ahead and close this now that #124 has been merged.
