CubicSDR uses lots of CPU #150

bobobo1618 · 2015-09-27T21:02:25Z

Following up on the profiling discussion from #64.

I ran some of the profiling tools on OS X and came up with this (times are a bit off since everything was built in debug mode):

TL;DR:

Demodulator takes ~28% CPU
FFT takes ~26% CPU
SDRPostThread takes ~24% CPU
IO from the HackRF takes ~10% CPU

Working on digging into it more now.

cjcliffe · 2015-09-27T22:38:43Z

Thanks; I'm guessing I can get rid of most of the CPU usage in SDRPostThread since it's processing the DC offset correction there -- which I believe the HackRF already handles.

FFT seems about right; the rest I'll have to take a closer look and see; demodulator seems a lot more expensive than it should be.

I may be able to reduce the FFT CPU as well by just letting FFTW do more of the work (run a larger FFT and just crunch it down visually) and easing up on the liquid-dsp decimation stages.

bobobo1618 · 2015-09-27T22:58:28Z

Judging by the output from osmocom_fft, the HackRF doesn't do DC offset correction itself:

And yeah, it looked like liquid was doing most of the FFT work itself. When I followed the call stack down I couldn't find fftw.

cjcliffe · 2015-09-27T23:05:41Z

So the "Auto" DC Offset correction doesn't do the trick?

Yeah, FFTW is a powerhouse; it can handle the full bandwidth of the HackRF without much trouble so I'd be surprised to see it high on the list -- if I do some more work on just using simple half-band Liquid-DSP decimators and put way more load onto FFTW and just average the bins to fit the screen I think a few things will be achieved:

FFTW will generate more bins than we can see, and I run a cheap average over them to fit the visual bins
Averaging will reveal more small signals that might just not fit into the available bins at high decimation
Liquid-DSP will get relief from simpler and less aggressive decimation
Half-band decimators won't have the aliasing problem where signals creep back across the edges the wrong way as you zoom and move
FFT zoom animation peaks the CPU because decimation filters are adjusted on-the-fly; the additional FFT resolution with half-band steps per stage will provide transition relief when making trivial zoom adjustments.
Per-stage zoom increments could easily be implemented in a texture and/or shader; the CPU will not have to perform any averaging.
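The "cheap average over the bins" idea above can be sketched roughly as follows. This is only an illustration, not CubicSDR's code: the function name `averageBins` and the choice of a plain arithmetic mean are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from CubicSDR): reduce an FFT magnitude array to
// `width` display columns by averaging each group of neighboring bins.
std::vector<float> averageBins(const std::vector<float>& bins, std::size_t width) {
    assert(width > 0 && bins.size() >= width);
    std::vector<float> out(width, 0.0f);
    for (std::size_t col = 0; col < width; ++col) {
        // Integer math spreads any remainder bins evenly across the columns.
        std::size_t begin = col * bins.size() / width;
        std::size_t end = (col + 1) * bins.size() / width;
        float sum = 0.0f;
        for (std::size_t i = begin; i < end; ++i) sum += bins[i];
        out[col] = sum / static_cast<float>(end - begin);
    }
    return out;
}
```

For example, `averageBins({1, 3, 5, 7}, 2)` folds four bins into two columns. The same reduction could also be done on the GPU, as the last point above suggests.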

cjcliffe · 2015-09-27T23:37:55Z

Ok, I noticed from https://github.com/jocover/SoapyHackRF/blob/master/HackRF_Settings.cpp#L214 that there's no built-in DC offset correction. So that will likely stay for now but I can give the option to turn it off to save some cycles.
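For readers unfamiliar with the operation being discussed, a textbook one-pole DC blocker looks roughly like the sketch below. It is not the filter CubicSDR or liquid-dsp actually uses; the class name `DcBlocker` and the coefficient `0.995` are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Textbook one-pole DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1].
// `r` slightly below 1 sets how narrow the notch at 0 Hz is (assumed value).
class DcBlocker {
public:
    explicit DcBlocker(float r = 0.995f) : r_(r) {}

    // Filter a single complex sample and update the internal state.
    std::complex<float> step(std::complex<float> x) {
        std::complex<float> y = x - prevIn_ + r_ * prevOut_;
        prevIn_ = x;
        prevOut_ = y;
        return y;
    }

    // Filter a whole IQ buffer in place.
    void process(std::vector<std::complex<float>>& iq) {
        for (auto& s : iq) s = step(s);
    }

private:
    float r_;
    std::complex<float> prevIn_{0.0f, 0.0f};
    std::complex<float> prevOut_{0.0f, 0.0f};
};
```

Because a filter like this runs once per input sample, its cost scales directly with the sample rate, which is why being able to switch it off can save cycles.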

cjcliffe · 2015-10-14T05:00:09Z

@bobobo1618 let me know if https://github.com/cjcliffe/CubicSDR/tree/soapysdr-pfbch branch helps at all. It should primarily improve running multiple demodulators.

bobobo1618 · 2015-10-15T02:20:42Z

Very much doesn't help on the CPU front. Eats the CPU alive and is very jittery at 16MHz bandwidth and doesn't even let me set above that. It gives me this (when I enter 24000000):

Set sample rate: 16000000
calculated optimal element count of 266752

I haven't updated my build in a long time so I'm not sure how much of this is unique to this branch.

cjcliffe · 2015-10-15T02:34:23Z

That sounds strange, I'm getting several streams at 12Msps here on my 2010 MacBook without trouble; does it perform well up until that bandwidth?

Did you build with release configuration? Also can you install fftw3 development files and build/install liquid-dsp with FFTW support to see if that helps -- this feature will heavily depend on FFTW being integrated I believe.

bobobo1618 · 2015-10-15T03:20:05Z

Not much luck I'm afraid.

Still jittery and still most of the CPU time is being spent in libliquid.

bobobo1618 · 2015-10-15T03:40:36Z

Ah, I just noticed that CubicSDR is (why?) using its own packaged version of libliquid. When I replaced it with my own, things changed slightly, with libliquid's norm function taking over from block as top CPU consumer.

cjcliffe · 2015-10-15T04:19:45Z

@bobobo1618 looks like you're using a bundled app version? I've only released the soapy-pfbch in source so far and if you bundled the app locally it should be your copy of libliquid (most people don't have it, it's bundled automatically by the app configuration)

Edit: on another note, the 16MHz limit was removed many commits ago, are you seeing the new SoapySDR device selection dialog pop-up on startup? -- you may be building an old project..

bobobo1618 · 2015-10-15T04:28:52Z

Yup, locally built app bundle. Maybe it was, but before I swapped it out it showed one thing and after I swapped it out it showed another, so I'm dubious.

As for performance, I think I identified where all my CPU is going. It seems it's all in the 'channelizer' and DC filter in SDRPostThread. The DC filter seems to be the most intensive of them but the channelizer isn't able to run in real time on its own either.

I don't understand much of what I'm looking at here so I might be wrong but it looks like SDRPostThread is taking in the raw samples (SDRThreadIQData), applying the DC filter, passing things through to the FFT visualiser (visualDataBuffers), running the channelizer and passing through the data to the demodulator.

These might be dumb questions but can you somehow localize the DC filter to the narrow chunk of spectrum that's affected by the DC offset, so that the entire band isn't being filtered? Also can you apply the demodulator resampling and whatnot (filtering the used chunk of the spectrum) before running the channelizer? It seems these are expensive operations and reducing the amount of data they have to process could help.

cjcliffe · 2015-10-15T04:34:23Z

I'm checking with @jgaeddert to see if it's possible to disable unused channelizer channels; this is just a first pass but it's performing very well here; but I do have DC filter disabled for SDRPlay.

The channelizer is what allows it to split up the bandwidth into amounts that the resampler can handle; without it the demodulator gets the full stream -- right now it's doing all the channels (and as a result I have to increase the channel size) but I've put in a request to see if I can just enable the channels I need -- hopefully then I can make a simple binary tree and cascade a set of channelizers to handle active areas more efficiently.
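The "just enable the channels I need" idea can be pictured with the sketch below. It assumes a simplified model in which channel k is centered k channel-spacings away from the tuned center frequency; the function name `activeChannels` is hypothetical and nothing here reflects the real liquid-dsp API.

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <vector>

// Hypothetical model: channel k is centered k * spacingHz away from the tuned
// center frequency. Returns the channels that contain at least one demodulator.
std::set<long> activeChannels(const std::vector<double>& demodOffsetsHz, double spacingHz) {
    assert(spacingHz > 0.0);
    std::set<long> active;
    for (double offset : demodOffsetsHz) {
        // Nearest channel center to this demodulator's frequency offset.
        active.insert(std::lround(offset / spacingHz));
    }
    return active;
}
```

Any channel not in the returned set carries no demodulator, so in principle its output would not need to be computed.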

Edit: on another note, there's probably enough room in the SDRThread now to do DC filtering there which may free up the post thread

cjcliffe · 2015-10-15T05:43:15Z

@bobobo1618 I tried 12Msps with the DC blocker on and it was jittery and unusable here -- Moving the DC Blocker to the SDRThread frees up enough resources for it to work with a few WBFM streams at 12Msps but the CPU usage is up about 15-20%. I've committed the change to the soapy-pfbch branch so let me know how it goes.

Not sure if I can just apply the DC blocking to a portion of the band, but I could just hide the spike visually and apply DC correction when demodulating at the baseband..

Ultimately I'll probably create a few islands of channelizers for groups of streams with some re-samplers at the front end of each but I'd like to see how far I can push the channelizer implementation so I can optimize/generalize it for re-use later.

bobobo1618 · 2015-10-15T06:04:14Z

The changes make it better but it's still pretty jittery and still eats a ton of CPU. The DC blocker appears to run better in this configuration (running time is only slightly higher than realtime) but the channelizer still hits the CPU limit. This is at 16MHz.

I had a look at one of the SDR applications I know run well and it looks like it uses an FFT rotator rather than a FIR filter, which seems to perform rather better. Any chance that would make sense?

cjcliffe · 2015-10-15T06:18:16Z

From what I know it's using FFT internally to perform the channelization; the FIR is just the prototype filter for post-channelization stop-band suppression. That's why it's important to have liquid-dsp built with FFTW3 or else it will fall back to internal FFT which was rough here..

This is far from the final implementation so please keep building and posting results as I report tweaks; thanks for your help!

bobobo1618 · 2015-10-15T16:32:22Z

Are there any special tricks you're using to build FFTW or LibLiquid? Liquid seems like it should just be automatically using FFTW if it's available but there don't seem to be calls to it coming out of the .dylib.

And do you build FFTW with MPI or OpenMP or anything like that?

cjcliffe · 2015-10-15T22:33:45Z

Nothing too special, only points I would make with regards to fftw3 integration are:

Make sure to compile FFTW3 with ./configure --enable-single (disables double precision, liquid needs fftw3f single version)
When you configure liquid-dsp, make sure you see the following:

checking fftw3.h usability... yes
checking fftw3.h presence... yes
checking for fftw3.h... yes
checking for fftwf_plan_dft_1d in -lfftw3f... yes

For CubicSDR ensure that your CMake FFTW/Liquid paths (check 'advanced' in the CMake GUI if you're using it) are pointing at the /usr/local/ versions and not /opt/local/ for ports or something.

If you have time to try tinkering with the library configuration flags and checking to find improvements that would be a good exercise to try at some point.

Edit: going to have a go at adding the channelizer toggling support to liquid-dsp myself, wish me luck :)

cjcliffe · 2015-10-15T23:00:51Z

Already well into implementing the channel toggling.. :) This may be a lot easier than I expected.

cjcliffe · 2015-10-16T00:49:34Z

Ok.. so assuming I didn't just hallucinate; my liquid-dsp channelizer toggle patch I have here just let me run 12 evenly spaced 200kHz FM streams at 12Msps on the SDRPlay on my MacBook Pro before starting to jitter (~80% CPU)... that's a fair improvement from ~80% at 3 streams.

I was able to push it to almost 20 when grouping them a little closer together.. I'll update my liquid-dsp fork soon 😃

cjcliffe · 2015-10-16T01:07:26Z

@bobobo1618 Update your CubicSDR soapysdr-pfbch checkout and try building my fork/branch of liquid-dsp at https://github.com/cjcliffe/liquid-dsp/tree/firpfbch_toggle_channels and let me know if that causes any magic for you

bobobo1618 · 2015-10-16T02:47:23Z

Well it kinda did but it was really... Odd... Every single FM band I tuned to sounded exactly the same. Even if I clicked and dragged the display. It was like the demodulator was looping over a tiny band of the total spectrum. Everything displayed properly though. Really odd.

Performance was great though! A few lost samples at 20MHz but that was it.

cjcliffe · 2015-10-16T02:48:44Z

@bobobo1618 there may be a practical channel limit which I've thoroughly uncapped, can you step up the bandwidth from 10MHz->20MHz and tell me where it goes wrong?

cjcliffe · 2015-10-16T02:59:58Z

@bobobo1618 also might be worth comparing https://github.com/cjcliffe/CubicSDR/releases/tag/0.1.9-alpha-pfbch-issue150 which will be running on the exact build I'm testing here.

cjcliffe · 2015-10-16T03:19:31Z

Hmm, yeah I'm getting some strange bleeds all over the place, likely some filtering issues -- I'll investigate :)

bobobo1618 · 2015-10-16T03:24:22Z

Yeah, things are pretty unusable at the moment. The visual feedback is entirely disconnected from the audio from what I can tell.

For example, in both of these screenshots, I can hear a 100% clear, pure demodulated FM signal, while the FFT and whatnot is showing me noise.

Likewise, when I tune into a 'strong' signal, it's noise.

cjcliffe · 2015-10-16T03:25:58Z

Yeah I messed with the filter prototype to try and reduce CPU usage earlier, I'm going to take a look at that again.

cjcliffe · 2015-10-16T03:43:55Z

Hmm, so not my filter settings, I think what's going on is some extreme aliasing of my 400kHz channels with 200kHz FM stations, I'm going to tweak the channel bandwidth and see if I can make it fit the best non-aliased divisions it can.

jgaeddert · 2015-10-16T10:49:45Z

I haven't looked at the code, but looking at your callstack I think I see what you're doing, so you might consider using the firpfbch_crcf object rather than the firpfbch2_crcf. The firpfbch2 effectively runs the output at twice the rate of the firpfbch. So if you have an input sample rate of 16 MHz and are trying to break it into 16 channels, the firpfbch_crcf object will result in 16 channels each at 1 MHz while the firpfbch2 object will result in 16 channels each at 2 MHz. I can go into the details as to why later.
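The rate arithmetic in this comment can be written out as a small sketch. The function names are hypothetical; only the fs/M versus 2*fs/M relationship comes from the comment itself.

```cpp
#include <cassert>

// Per-channel output rate of an M-channel polyphase channelizer, per the
// description above: fs/M when critically sampled (firpfbch_crcf style),
// 2*fs/M when oversampled by two (firpfbch2_crcf style).
double channelOutputRate(double inputRateHz, unsigned numChannels, bool oversampled2x) {
    assert(numChannels > 0);
    double spacingHz = inputRateHz / numChannels;
    return oversampled2x ? 2.0 * spacingHz : spacingHz;
}

// Samples per second summed over all channels, i.e. the load handed downstream.
double totalOutputRate(double inputRateHz, unsigned numChannels, bool oversampled2x) {
    return numChannels * channelOutputRate(inputRateHz, numChannels, oversampled2x);
}
```

With a 16 MHz input and 16 channels, the oversampled variant hands downstream stages twice as many samples per second in total as the critically sampled one.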

cjcliffe · 2015-10-16T13:01:13Z

@jgaeddert I'm primarily using the firpfbch2 since I'm assuming I wouldn't be able to demodulate signals that cross the channel boundaries -- right now I have it doing 400kHz channels and I didn't see another way to have multiple 400kHz demodulators with irregular placements and not cross any channel bounds than to have the bands overlap like that..
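The boundary concern can be illustrated with a simplified check. The model is an assumption: channel k is centered at k times the channel spacing, each channel spans its full output rate, and filter roll-off is ignored. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical model: channel k is centered at k * spacingHz (offset from the
// tuned center) and spans outputRateHz. Filter roll-off is ignored.
long nearestChannel(double offsetHz, double spacingHz) {
    return std::lround(offsetHz / spacingHz);
}

// True if a signal of the given bandwidth lies entirely inside its nearest channel.
bool fitsInChannel(double offsetHz, double bandwidthHz, double spacingHz, double outputRateHz) {
    double centerHz = nearestChannel(offsetHz, spacingHz) * spacingHz;
    return std::fabs(offsetHz - centerHz) + bandwidthHz / 2.0 <= outputRateHz / 2.0;
}
```

Under this model, a 200 kHz signal sitting 150 kHz off a channel center spills out of a critically sampled 400 kHz channel, but fits when the output rate is twice the spacing. More generally, with the 2x overlap any signal no wider than the channel spacing fits in its nearest channel wherever it is placed.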

It's actually working perfectly except for my patch to try and reduce CPU usage which seems to just mash everything together.. I haven't done any deeper looking into what I've broken but it definitely gets the performance I was looking for :)

cjcliffe · 2015-10-16T21:34:08Z

@bobobo1618 @jgaeddert Going to try a slightly more complicated method of creating half-band resampler "islands" (grouping frequencies with bandwidth limit rules to decide splitting / merging / placement of resampling groups) essentially making my own "arbitrary channelizer".

If anything I'd just like to see how it compares -- I feel like I'm trying to find a one-fits-all solution when I should just be making several paths and optimizations for various scenarios. Channelizer still seems like it's the easiest way if I can just figure out how to skip the work done in unused channels..

Edit: tried running a test set of resamplers and the CPU usage was higher for just a few channels than using firpfbch alone.

cjcliffe · 2015-10-16T22:45:29Z

To cover all the bases I've implemented firpfbch_crcf in place of firpfbch2_crcf as @jgaeddert recommended at https://github.com/cjcliffe/CubicSDR/tree/soapysdr-pfbch-single and it seems to perform pretty well; I notice some spots where the signals seem to alias and degrade a bit but overall it's a lot better than my firpfbch2_crcf hacks :)

cjcliffe · 2015-10-16T23:53:29Z

So this is my system currently at 12MHz input, with 10 x 200kHz FM mono streams:

Specs:

bobobo1618 · 2015-10-17T17:16:42Z

Okay, runs fine with two channels (stereo FM) at 16MHz (175% CPU) and even 20MHz (~200% CPU) now but only while the window is out of focus... When I bring the window into focus, CPU jumps to 240% and 290% respectively. I would've said this is due to rendering the FFT but CPU stays low even while the entire window is visible. Want a CPU profile?

bobobo1618 · 2015-10-17T17:25:44Z

Struggles a lot at 24MHz though. Drops samples with a single mono stream while the window is in focus, produces extremely frequent and audible artifacts while it's not.

It's fun to have the broadcast FM spectrum in one window though :)

cjcliffe · 2015-10-17T17:27:17Z

@bobobo1618 CPU profile would be great; I've found some spots where I can eliminate another very heavy decimation used for the upper left miniature visualizer -- When the app isn't focused it reduces the frame rate by adding a delay in the main thread so that's probably giving you enough CPU for it to squeak by.
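The focus-dependent frame pacing described here might look something like the sketch below. The function name and both frame rates are made up for illustration and are not CubicSDR's actual values.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical frame-pacing helper; the frame rates are illustrative only.
// Returns how long the main thread should wait between redraws.
std::chrono::milliseconds frameDelay(bool focused, int focusedFps = 60, int unfocusedFps = 15) {
    int fps = focused ? focusedFps : unfocusedFps;
    assert(fps > 0);
    return std::chrono::milliseconds(1000 / fps);
}
```

A longer delay while unfocused means fewer redraws per second, which leaves more CPU time for the DSP threads.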

I'll have another update this afternoon that should cut down the CPU usage for the UI significantly; and then some more advanced updates for the waterfall soon which should help yet again.

Edit: wow, yeah that's a nice view; and I'm guessing 50% of your CPU is going to the mini-vis in the upper left ;)

bobobo1618 · 2015-10-17T17:40:57Z

Have a pile of profiles:

No demodulator, unfocused:

No demodulator, focused:

Demodulator, unfocused:

Demodulator, focused:

bobobo1618 · 2015-10-17T17:41:20Z

Sorry, all of the above were at 16MHz. I can go higher if you want.

cjcliffe · 2015-10-17T17:46:12Z

I'm interested to see what it does with no demodulator at 20-24MHz; I think that's where we'll find some issues. It looks like the DC filter is still chewing up a fair bit -- I need to see what I can do about that; DC filter only really needs to be applied to channel 0..

bobobo1618 · 2015-10-17T17:50:09Z

Sure. No demod, 30MHz:

cjcliffe · 2015-10-17T17:51:41Z

Yeah that looks like mostly DC filter.. I'm going to tackle reducing DC to channel 0 only since it's fairly in-line with the fix for the CPU usage in the mini-vis

jgaeddert · 2015-10-17T19:33:24Z

I take it the DC-blocking filter is the iirfilt_crcf call that's eating up the CPU? I haven't looked at the code in detail, but are you performing DC-blocking on each channel? or just the full stream before channelization?

cjcliffe · 2015-10-17T20:22:20Z

@jgaeddert I've been applying it to the entire stream up until now; but I just pushed a commit that no longer does that.

@bobobo1618 soapysdr-pfbch-single has been updated and removes DC blocking from the SDR thread and only applies it to channel 0 when needed -- I'm working on using that same data to augment the waterfall/spectrum FFT to remove the visual spike.

It's also now using the channel data to supply the demodulator mini-waterfall in the upper left which frees up some resources as well.

bobobo1618 · 2015-10-17T22:07:01Z

Seemed to help at 30MHz, runs a lot more smoothly now.

cjcliffe · 2015-10-17T22:37:53Z

@bobobo1618 that looks a fair bit better; are you able to demodulate any streams at 30MHz?

bobobo1618 · 2015-10-17T23:33:30Z

Nope. And I just noticed the frequency is being capped at 25MHz. The highest rate at which I can get a (single) usable demodulated signal is 21MHz.

An idea that may be helpful: could you use (or add an option to use) a polyphase decimator block when only a single demodulator is being run, rather than a channeliser? I played around with GNU Radio Companion a bit and found the former a lot easier on the CPU.

bobobo1618 · 2015-10-17T23:42:01Z

Actually I just noticed that the HackRF is only specified to reach 20MSPS... I think the issues I'm seeing can actually be safely attributed to hardware....

bobobo1618 · 2015-10-17T23:45:28Z

I can actually do things like this at 20MHz now that I try:

cjcliffe · 2015-10-17T23:59:08Z

@bobobo1618 wow, I think I count 21 x 200kHz mono streams there; that's awesome progress 😃

Just doing some testing here and I also ran into the over-bandwidth issue -- if you put in a higher bandwidth than the device limit it doesn't correct the main sample rate value and just gets weird -- I accidentally set my RTL dongle to 12MHz when I thought I was on SDRPlay and it took me a minute to figure out what was going on.. Will fix that up soon.
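One way to guard against the over-bandwidth case is to clamp the requested rate to what the device reports. This is a sketch under assumptions: the `RateRange` struct and `clampSampleRate` function are hypothetical, and a real application would query the limits from the driver rather than hard-code them.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical device limits; a real application would query these from the driver.
struct RateRange {
    double minHz;
    double maxHz;
};

// Returns the requested rate, pulled back into the supported range if needed.
double clampSampleRate(double requestedHz, const RateRange& range) {
    assert(range.minHz <= range.maxHz);
    return std::min(std::max(requestedHz, range.minHz), range.maxHz);
}
```

The clamped value, rather than the raw user input, would then be the one stored as the main sample rate.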

bobobo1618 · 2015-10-18T02:27:47Z

It works but it starts to struggle at ~16 streams. It's still understandable at 21 but it's not really pleasant to listen to. It's fine at ~12 streams though, particularly when it's in the background.

cjcliffe · 2015-10-18T16:33:34Z

@bobobo1618 can you pull the latest soapysdr-pfbch-single branch and try tinkering with CubicSDRDefs.h on the following line:

#define CHANNELIZER_RATE_MAX 400000

And try various rates other than 400kHz? I'd be interested to see at 20MHz input if there's an ideal channel/cpu performance ratio. At 400kHz you're getting 50 channels and I'm thinking something like 500kHz-800kHz might be better but I'm unsure as I cap out at 12MHz (30 channels) here.
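The channel counts quoted here follow from dividing the input rate by the per-channel cap. The sketch below assumes exactly that relationship, rounding up so no channel exceeds the cap; the function name is hypothetical and the actual CubicSDR logic may round or constrain the count differently.

```cpp
#include <cassert>

// Assumes the channel count is simply the input rate divided by the
// per-channel rate cap, rounded up so that no channel exceeds the cap.
// The actual CubicSDR logic may differ.
unsigned channelCount(unsigned inputRateHz, unsigned channelRateMaxHz) {
    assert(channelRateMaxHz > 0);
    return (inputRateHz + channelRateMaxHz - 1) / channelRateMaxHz;
}
```

Under this assumption, 20 MHz at 400 kHz gives 50 channels and 12 MHz gives 30, matching the figures above, while 500 kHz and 800 kHz at 20 MHz would give 40 and 25.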

If you could do up a quick list of CPU results for each channel rate that would be great.

Thanks!

bobobo1618 · 2015-10-18T17:31:32Z

I tried changing it around but didn't see much difference. In fact, at least with the demodulator on, 400k used less CPU than 800k. The difference was tiny (5%?) though.

cjcliffe · 2015-10-18T17:47:48Z

@bobobo1618 ok that's good to hear; that means I can provide some adjustment for allowing higher demodulation rates without affecting immediate performance too much.

cjcliffe · 2015-10-18T18:22:24Z

@bobobo1618 I've merged everything down to https://github.com/cjcliffe/CubicSDR/tree/soapysdr-support branch so you can pull the latest updates from there now.

cjcliffe · 2015-10-30T04:03:25Z

Going to close this one for now; will open some more specific optimization issues soon; thanks!

cjcliffe added bug enhancement labels Sep 27, 2015

cjcliffe added this to the 0.2.0 milestone Sep 27, 2015

cjcliffe self-assigned this Sep 27, 2015

cjcliffe closed this as completed Oct 30, 2015
