Converting from f32 to f16 incurs a 2x slowdown on reading images
#177
Comments
Specifically, the profiler blames these lines: exrs/src/image/read/specific_channels.rs Lines 296 to 298 in 49fece0
Is the benchmark actually about loading an |
I've sent a PR to
|
Maybe we should rename this benchmark, as it seems to also convert the numbers :) Are you interested in profiling f32 or in profiling f16? There are files of both types in the repository, but most have a small number of pixels and therefore do not represent the common real world data. I don't know if there is a large f16 file right now. You are welcome to add the benchmarks that you need, for example a benchmark that loads f32 values from an f32 file :) |
The conversion is something that will happen in the real world, so we want it to be fast. I think I remember there are intrinsics for converting between f32 and f16, maybe they need to be activated with a flag in the Edit:
shouldn't it be possible for users to provide their own |
Yes, I was surprised that a conversion is performed. It would be nice to rename the benchmark.
There are two problems with this. First, on x86 only very very recent CPUs have a native Second, those conversion intrinsics operate on a chunk of values - e.g. |
On the However, operating on images in the f16 pixel format is a terrible idea. CPUs do not implement The only use case is shipping them to a GPU and displaying them there, since GPUs generally do support f16 natively. So the relevant benchmark should be explicitly marked as an exotic use case. |
I was wrong about intrinsics not being available, they have been available on x86_64 CPUs since 2009. This issue is getting hard to follow, I'll close this and open another one with the actionable takeaways. |
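To make the chunk-wise conversion discussed above concrete, here is a minimal sketch, not the code used by exrs or the half crate. It assumes the output is stored as raw binary16 bit patterns in u16, and the function names (f32_to_f16_bits, convert_f16c, convert_f32_to_f16_bits) are hypothetical. On x86_64 it uses the F16C intrinsic from std::arch when the CPU reports support at runtime, processing 8 values per step; everywhere else, and for the leftover tail, it falls back to a scalar round-to-nearest-even conversion.

```rust
/// Scalar fallback: convert one f32 to IEEE 754 binary16 bits,
/// rounding to nearest with ties to even.
fn f32_to_f16_bits(value: f32) -> u16 {
    let bits = value.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xFF) as i32;
    let man = bits & 0x007F_FFFF;

    if exp == 0xFF {
        // Infinity or NaN. Set the quiet bit for NaN so the mantissa stays non-zero.
        let nan_bit = if man != 0 { 0x0200 } else { 0 };
        return sign | 0x7C00 | nan_bit | (man >> 13) as u16;
    }

    let half_exp = exp - 127 + 15; // re-bias the exponent
    if half_exp >= 0x1F {
        return sign | 0x7C00; // too large: overflow to infinity
    }
    if half_exp <= 0 {
        if half_exp < -10 {
            return sign; // too small even for a subnormal: signed zero
        }
        // Subnormal result: restore the implicit leading 1 and shift it down.
        let man = man | 0x0080_0000;
        let shift = (14 - half_exp) as u32;
        let mut half_man = man >> shift;
        let round_bit = 1u32 << (shift - 1);
        if (man & round_bit) != 0 && ((man & (round_bit - 1)) != 0 || (half_man & 1) != 0) {
            half_man += 1;
        }
        return sign | half_man as u16;
    }

    let mut half = ((half_exp as u32) << 10) | (man >> 13);
    let round_bit = 0x1000u32;
    if (man & round_bit) != 0 && ((man & (round_bit - 1)) != 0 || (half & 1) != 0) {
        half += 1; // a carry into the exponent is the correct rounding result
    }
    sign | half as u16
}

/// F16C path: converts 8 floats per iteration, scalar code handles the tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn convert_f16c(src: &[f32], dst: &mut [u16]) {
    use std::arch::x86_64::*;
    let mut src_chunks = src.chunks_exact(8);
    let mut dst_chunks = dst.chunks_exact_mut(8);
    for (s, d) in (&mut src_chunks).zip(&mut dst_chunks) {
        unsafe {
            let floats = _mm256_loadu_ps(s.as_ptr());
            let halves = _mm256_cvtps_ph::<_MM_FROUND_TO_NEAREST_INT>(floats);
            _mm_storeu_si128(d.as_mut_ptr() as *mut __m128i, halves);
        }
    }
    for (s, d) in src_chunks.remainder().iter().zip(dst_chunks.into_remainder()) {
        *d = f32_to_f16_bits(*s);
    }
}

/// Convert a whole slice at once, picking the fastest available path at runtime.
fn convert_f32_to_f16_bits(src: &[f32], dst: &mut [u16]) {
    assert_eq!(src.len(), dst.len());
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx")
            && std::arch::is_x86_feature_detected!("f16c")
        {
            // Safe to call: the required CPU features were just detected.
            unsafe { convert_f16c(src, dst) };
            return;
        }
    }
    for (s, d) in src.iter().zip(dst.iter_mut()) {
        *d = f32_to_f16_bits(*s);
    }
}
```

Converting a whole row or block of samples per call, rather than one value at a time, is what lets the hardware path pay off; whether it actually removes the slowdown in this benchmark would need to be measured.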
haha alright, don't worry :) no big deal |
The key takeaway here is: please modify or add benchmarks as you see fit :) |
Profiling shows that on the read_single_image_from_buffer_rgba_channels benchmark, 50% of the time is spent in the half::binary16::f16::from_f32 function.

Interactive profile so you can explore it yourself: https://share.firefox.dev/3Vt2pD6