Converting from `f32` to `f16` incurs a 2x slowdown on reading images #177

Shnatsel · 2022-12-30T11:41:48Z

Profiling shows that on the `read_single_image_from_buffer_rgba_channels` benchmark, 50% of the time is spent in the `half::binary16::f16::from_f32` function.

Interactive profile so you can explore it yourself: https://share.firefox.dev/3Vt2pD6

Shnatsel · 2022-12-30T11:52:20Z

Specifically, the profiler blames these lines:

exrs/src/image/read/specific_channels.rs, lines 296 to 298 in 49fece0:

    SampleType::F32 => for pixel in pixels.iter_mut() {
        *get_pixel(pixel) = Sample::from_f32(f32::read(&mut own_bytes_reader).expect(error_msg));
    },

Is the benchmark actually about loading an f32 image as f16? I expected it to measure a straightforward load.
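For context, here is a rough sketch of the work a software `f32` to `f16` conversion has to do for every single sample. It is a simplified illustration written for this thread, not the code from the `half` crate; the function name `f32_to_f16_bits` is hypothetical, and it returns the raw 16-bit pattern instead of an `f16` value.

```rust
/// Converts one `f32` to IEEE 754 binary16 bits, rounding to nearest even.
/// Hypothetical helper for illustration only; not the `half` crate's code.
fn f32_to_f16_bits(value: f32) -> u16 {
    let bits = value.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xFF) as i32;
    let man = bits & 0x007F_FFFF;

    // Infinity or NaN: keep the class, force a quiet NaN if any mantissa bit is set.
    if exp == 0xFF {
        let nan_bit: u16 = if man == 0 { 0 } else { 0x0200 };
        return sign | 0x7C00 | nan_bit | (man >> 13) as u16;
    }

    // Re-bias the exponent from f32 (bias 127) to f16 (bias 15).
    let half_exp = exp - 127 + 15;

    // Too large for f16: overflow to infinity.
    if half_exp >= 0x1F {
        return sign | 0x7C00;
    }

    // Too small for a normal f16: produce a subnormal or a signed zero.
    if half_exp <= 0 {
        if 14 - half_exp > 24 {
            return sign;
        }
        let man = man | 0x0080_0000; // restore the implicit leading 1
        let shift = (14 - half_exp) as u32;
        let mut half_man = man >> shift;
        let round_bit = 1u32 << (shift - 1);
        if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
            half_man += 1;
        }
        return sign | half_man as u16;
    }

    // Normal case: drop 13 mantissa bits and round to nearest even.
    let mut out = ((half_exp as u32) << 10) | (man >> 13);
    let round_bit = 0x1000u32;
    if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
        out += 1; // a mantissa carry bumps the exponent, which is the desired result
    }
    sign | out as u16
}
```

Several data-dependent branches per sample, executed once per channel per pixel, make it plausible that this conversion shows up prominently in a profile of the read path.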

Shnatsel · 2022-12-30T12:16:19Z

I've sent a PR to half to allow inlining the conversion functions - it seems the #[inline] attribute was missed on them: VoidStarKat/half-rs#61

~~However, the benchmarks of exr are built with LTO, so the inline attribute shouldn't matter. In fact, LTO was needed to paper over just this one missing #[inline]!~~

~~With this change to half, the benchmarks with and without LTO are the same, and you can enjoy good compile times AND good performance!~~ Wait, no, that's something else, or at least that PR didn't help.

johannesvollmer · 2022-12-31T02:13:54Z

> Is the benchmark actually about loading an f32 image as f16? I expected it to measure a straightforward load.

Maybe we should rename this benchmark, as it seems to also convert the numbers :)

Are you interested in profiling f32 or in profiling f16?

There are files of both types in the repository, but most have a small number of pixels and therefore do not represent the common real world data. I don't know if there is a large f16 file right now.

You are welcome to add the benchmarks that you need, for example a benchmark that loads f32 values from an f32 file :)

johannesvollmer · 2022-12-31T02:17:04Z

The conversion is something that will happen in the real world, so we want it to be fast. I think I remember there are intrinsics for converting between f32 and f16, maybe they need to be activated with a flag in the half dependency?

Edit:

> `use-intrinsics` - Use hardware intrinsics for f16 and bf16 conversions if available on the compiler host target. By default, without this feature, conversions are done only in software, which will be the fallback if the host target does not have hardware support. Available only on Rust nightly channel.

> `std` - Enable features that depend on the Rust std library, including everything in the alloc feature. Enabling the std feature enables runtime CPU feature detection when the `use-intrinsics` feature is also enabled. Without this feature detection, intrinsics are only used when compiler host target supports them.

Shouldn't it be possible for users to provide their own half dependency with that feature enabled?

Shnatsel · 2022-12-31T07:18:58Z

Yes, I was surprised that a conversion is performed. It would be nice to rename the benchmark.

> I think I remember there are intrinsics for converting between f32 and f16, maybe they need to be activated with a flag in the half dependency

There are two problems with this.

First, on x86 only very very recent CPUs have a native f16 type - ~~according to Wikipedia, CPUs supporting that haven't even launched yet, but are coming in 11 days from now.~~ that's for arithmetic, conversions are available since 2009. There are suitable intrinsics on ARM, but they're not supported by half or even the Rust standard library.

Second, those conversion intrinsics operate on a chunk of values, e.g. `&[f32; 4]`, while the exr crate calls the conversion on individual f32 values. So even if the intrinsics were available, exr wouldn't be able to make effective use of them.
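To make the "chunk of values" point concrete, here is a minimal sketch of a four-wide conversion using the x86 F16C instruction through `std::arch`. It assumes a Rust toolchain recent enough to expose the F16C intrinsics on stable; the function names `f32x4_to_f16_bits_hw` and `convert_f16c` are hypothetical, and this is not how `half` or exr are implemented. When the CPU or target architecture lacks F16C, the function returns `None` so a caller can fall back to software.

```rust
/// Converts four `f32` values to f16 bit patterns with the x86 F16C instruction.
/// Returns `None` when the CPU (or the target architecture) does not support it.
/// Hypothetical helper for illustration only.
fn f32x4_to_f16_bits_hw(input: [f32; 4]) -> Option<[u16; 4]> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("f16c") {
            // SAFETY: the runtime check above guarantees F16C is available.
            return Some(unsafe { convert_f16c(input) });
        }
    }
    None
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "f16c")]
unsafe fn convert_f16c(input: [f32; 4]) -> [u16; 4] {
    use std::arch::x86_64::{
        __m128i, _mm_cvtps_ph, _mm_loadu_ps, _mm_storel_epi64, _MM_FROUND_TO_NEAREST_INT,
    };
    let mut out = [0u16; 4];
    // SAFETY: `input` holds 4 readable f32 values and `out` holds 8 writable bytes;
    // both the load and the 64-bit store tolerate unaligned pointers.
    unsafe {
        let floats = _mm_loadu_ps(input.as_ptr());
        // One instruction converts all four lanes, rounding to nearest even.
        let halves = _mm_cvtps_ph::<_MM_FROUND_TO_NEAREST_INT>(floats);
        _mm_storel_epi64(out.as_mut_ptr() as *mut __m128i, halves);
    }
    out
}
```

The instruction only pays off when four samples are handed over at once, which is why a per-sample call site cannot benefit from it directly.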

Shnatsel · 2022-12-31T07:47:57Z

On the exr side the actionable part is switching from converting individual f32 values to converting `&[f32; 4]` chunks. ~~But I don't think this will result in any immediate gains, at least without further modifications in half, ideas for which I'll take to the half issue tracker.~~
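A minimal sketch of what such a restructured loop could look like, assuming the samples are already available as an `f32` slice. The function `convert_in_chunks` and its two converter parameters are hypothetical stand-ins; exr's actual reader works on a byte stream and per-pixel accessors, so this only illustrates the chunk-plus-remainder shape.

```rust
use std::convert::TryInto;

/// Converts `src` into `dst` four samples at a time and handles the tail one by one.
/// `convert4` and `convert1` are stand-ins for a batched and a scalar converter.
fn convert_in_chunks(
    src: &[f32],
    dst: &mut [u16],
    convert4: impl Fn([f32; 4]) -> [u16; 4],
    convert1: impl Fn(f32) -> u16,
) {
    assert_eq!(src.len(), dst.len());
    let mut src_chunks = src.chunks_exact(4);
    let mut dst_chunks = dst.chunks_exact_mut(4);

    // Full blocks of four go through the batched converter.
    for (s, d) in (&mut src_chunks).zip(&mut dst_chunks) {
        let block: [f32; 4] = s.try_into().expect("chunks_exact yields 4 items");
        d.copy_from_slice(&convert4(block));
    }

    // The remaining 0 to 3 samples use the scalar converter.
    for (s, d) in src_chunks.remainder().iter().zip(dst_chunks.into_remainder()) {
        *d = convert1(*s);
    }
}
```

Keeping the converters as parameters lets the same loop use a hardware path where one exists and a software path elsewhere.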

However, operating on images in the f16 pixel format is a terrible idea. CPUs do not implement f16 natively; even the ones with f16 intrinsics only ever operate on [f16; 8] to [f16; 32], never individual values. half explicitly calls out in documentation that CPUs do not implement f16 and any math on these values is done by converting them to f32 and back, which is very slow (even on CPUs with intrinsics, because those operate only on chunks).

The only use case is shipping them to a GPU and displaying them there, since GPUs generally do support f16 natively. So the relevant benchmark should be explicitly marked as an exotic use case.

Shnatsel · 2022-12-31T10:06:51Z

I was wrong about intrinsics not being available; they have been available on x86_64 CPUs since 2009.

This issue is getting hard to follow, I'll close this and open another one with the actionable takeaways.

johannesvollmer · 2022-12-31T10:08:42Z

haha alright, don't worry :) no big deal

johannesvollmer · 2022-12-31T10:10:12Z

key take away here is: please modify or add benchmarks as you see fit :)

Shnatsel closed this as completed Dec 31, 2022

Shnatsel mentioned this issue Dec 31, 2022

Pixel format conversions are slow #178

Closed
