Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sample rounding and clean up noise shaping leftovers #771

Merged
merged 3 commits into from
May 29, 2021

Conversation

roderickvd
Copy link
Member

Two things:

  1. Remove unused parameter (leftover from scrapped noise shaping)
  2. Round half away from zero instead of truncating to lower quantisation error

I've been fighting with Rust over something as stupid as point 2 but am sure now:

assert_eq!( 0,  0.1_f32.round() as i16);
assert_eq!( 1,  0.5_f32.round() as i16);
assert_eq!( 1,  0.9_f32.round() as i16);
assert_eq!( 0, -0.1_f32.round() as i16);
assert_eq!(-1, -0.5_f32.round() as i16);
assert_eq!(-1, -0.9_f32.round() as i16);

Without round() this would all return 0.

@roderickvd roderickvd self-assigned this May 27, 2021
@roderickvd
Copy link
Member Author

roderickvd commented May 27, 2021

b724603: Absolute best rounding method is "round to even". Although this is the default and recommended rounding mode in IEEE 754 ("round to nearest, ties to even"), there are no LLVM intrinsics for it, and it introduces a new dependency. Compiling with --release, this increases the binary by only 728 bytes (yes, bytes) compared to std::f32::round().

@roderickvd roderickvd force-pushed the round-and-cleanup-dithering branch from fd4915a to b724603 Compare May 27, 2021 19:18
@roderickvd
Copy link
Member Author

Going to rebase and merge this, if there is any late opposition against the added 728 bytes of b724603 we can always revert later 😆

@JasonLG1979
Copy link
Contributor

I don't think 728 bytes is going to make or break it for anyone,lol!!! A monolithic librespot binary compiled for ARMv6 with the ALSA backend is about 13.6 MB.

@roderickvd roderickvd merged commit 8062bd2 into librespot-org:dev May 29, 2021
@roderickvd roderickvd deleted the round-and-cleanup-dithering branch May 29, 2021 20:53
@roderickvd roderickvd changed the title Round half away from zero and clean up noise shaping leftovers Improve sample rounding and clean up noise shaping leftovers May 29, 2021
@JasonLG1979
Copy link
Contributor

JasonLG1979 commented May 30, 2021

@roderickvd I'm not sure what the deal is but this addition causes librespot to use about twice the CPU as before during normal use (just playing not downloading or caching). Going directly to hardware with the --device arg (so no resampling overhead) it went from 10 - 15% to 20 to 30%. Not a huge deal as there's still plenty of headroom and it doesn't cause any audible issues but a couple more bumps like that and librespot won't run a Pi Zero anymore.

Edit: That's with --format S32 so no dithering either. My guess is that the new rounding method is a little costly.

@JasonLG1979
Copy link
Contributor

JasonLG1979 commented May 30, 2021

For a frame of reference on my Pi 4 running 64bit Raspberry Pi OS it uses less than 10% CPU and on my Fedora x86_64 desktop it's about 0.5% CPU. So clearly not the end of the world just something to keep in the back of your mind.

Edit: Diving a little deeper it's got to be the rounding. If I switch to F32 (bypassing the rounding in librespot, letting pipewire handle it) on my desktop it cuts the CPU usage in about half.

@JasonLG1979
Copy link
Contributor

@roderickvd I'm sorry if I'm talking your ear off but another thing I noticed is the use of math::round::half_to_even I get that it replicates the IEEE 754 standard but I'm not so sure if that's what you'd want in this case. The point of that standard is to make rounding errors more uniform and predictable. Uniform and predictable to me sounds like signal correlated. Wouldn't math::round::stochastic be a better choice to randomize ties?

@roderickvd
Copy link
Member Author

@roderickvd I'm sorry if I'm talking your ear off but another thing I noticed is the use of math::round::half_to_even I get that it replicates the IEEE 754 standard but I'm not so sure if that's what you'd want in this case. The point of that standard is to make rounding errors more uniform and predictable. Uniform and predictable to me sounds like signal correlated. Wouldn't math::round::stochastic be a better choice to randomize ties?

That's not the same.

Randomisation is what dithering is for (and does better). A stochastic rounding function was in the original dithering PR, but yanked because it behaves as a rectangular probability density function. That would be uniform, and it's too theoretical to explain but that causes intermodulation distortion. (If you're interested, read the publicly available works by Lipschitz and Wannamaker)

The Gaussian or triangular ditherers take care of non-uniform, uncorrelated randomisation. Next step, you need to round instead of truncate. The best way to round is with the least error, and that is with an unbiased function (i.e. round half to even/uneven, also called convergent rounding). "Normal" rounding is biased in that it rounds away from zero.

We can watch the CPU usage and revert to standard round() if necessary. We'll also have to see how the 64-bit PR impacts this. The libmath crate introduces another conversion from f32 to f64 and back again, which is no longer necessary with the 64-bit PR.

At this point I'm not too concerned and only expect CPU usage to drop with the 64-bit PR.

@JasonLG1979
Copy link
Contributor

That's not the same.

Randomisation is what dithering is for (and does better). A stochastic rounding function was in the original dithering PR, but yanked because it behaves as a rectangular probability density function. That would be uniform, and it's too theoretical to explain but that causes intermodulation distortion. (If you're interested, read the publicly available works by Lipschitz and Wannamaker)

The Gaussian or triangular ditherers take care of non-uniform, uncorrelated randomisation. Next step, you need to round instead of truncate. The best way to round is with the least error, and that is with an unbiased function (i.e. round half to even/uneven, also called convergent rounding). "Normal" rounding is biased in that it rounds away from zero.

Thanks for the explanation. I will take your word for it and skip the technical papers,lol!!! I was just curious.

We can watch the CPU usage and revert to standard round() if necessary. We'll also have to see how the 64-bit PR impacts this. The libmath crate introduces another conversion from f32 to f64 and back again, which is no longer necessary with the 64-bit PR.

At this point I'm not too concerned and only expect CPU usage to drop with the 64-bit PR.

Good deal. Rebase it and I'll give it a test.

@roderickvd
Copy link
Member Author

Try it, just merged it into #773.

@roderickvd
Copy link
Member Author

@roderickvd I'm not sure what the deal is but this addition causes librespot to use about twice the CPU as before during normal use (just playing not downloading or caching). Going directly to hardware with the --device arg (so no resampling overhead) it went from 10 - 15% to 20 to 30%. Not a huge deal as there's still plenty of headroom and it doesn't cause any audible issues but a couple more bumps like that and librespot won't run a Pi Zero anymore.

On my RPi 3B it seems about +5%

@JasonLG1979
Copy link
Contributor

On my RPi 3B it seems about +5%

That sounds about right considering it probably only used about 5% to begin with before. the Pi Zero is orders of magnitudes slower then even a Pi 3. As a frame of reference, it takes about 13 min to compile librespot on a Pi 4 and over 4 hours on a Pi Zero,lol!!!

@JasonLG1979
Copy link
Contributor

20 - 30% isn't a huge deal under normal playback conditions. My only concern is that CPU usage can spike momentarily to as high as in the 70's while caching a track. Another doubling of CPU usage and we hit 100% and audio starts to drop out. Is there a way to give the playback thread a higher priority or is that done already?

@roderickvd
Copy link
Member Author

I'm not sure but inclined to say no. But let's cross that bridge if and when we get there.

  1. I'm about done with sound quality improvements, don't expect any more CPU hogs
  2. it's easy to swap for std::f64::round() if necessary

@JasonLG1979
Copy link
Contributor

I'm not sure but inclined to say no. But let's cross that bridge if and when we get there.

Fair enough. No need for premature optimization.

  1. I'm about done with sound quality improvements, don't expect any more CPU hogs
  2. it's easy to swap for std::f64::round() if necessary

Well on a selfish note, I'm currently softening the wife up to let me buy another Pi 4 and a Topping D10s DAC for audio streaming so I may not care that much for much longer,lol!!!

But generally I don't think it's a huge deal. It's not like a Pi Zero is a multi-tasking beast. Realistically you can only expect it to do one thing at a time well. My test set up is a stock install of Raspotify OS lite with nothing else running except the default services and librespot. I don't think that it's unreasonable to say that if you plan on running librespot on a Pi Zero that's all you're going to run on it.

@roderickvd
Copy link
Member Author

That's a great deal. You should come on Gitter if you want to talk gear and other stuff.

@JasonLG1979
Copy link
Contributor

So the verdict is that the 64 bit PR didn't help. Comparing 5d43b7d (the last build I had laying around) it still basically doubles. There was a cargo update in between so I can build right after that to make sure it's not a dependency update that's actually causing the problem?

@roderickvd
Copy link
Member Author

Nah I’m pretty sure this is it, I’m seeing it too. I’d like to arrive at a different conclusion though, namely that we can have 64-bit sample handling basically for free.

@JasonLG1979
Copy link
Contributor

JasonLG1979 commented May 30, 2021

Nah I’m pretty sure this is it, I’m seeing it too. I’d like to arrive at a different conclusion though, namely that we can have 64-bit sample handling basically for free.

It's for sure it. I replaced it with int_value.round() and CPU is back where it was. The builtin round is clearly just more optimized. It is what it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants