-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loop unrolling optimization #15
Comments
Thanks for sharing this idea. A define sounds good, would allow to set it from commandline too. |
No worries, I am busy as well and probably will only get around to testing it on hardware in a month. Certainly though, unrolling should remove the jump instructions needed for the loop and give some performance benefit. -- Unrelated, but I was wondering what the purpose is of the status register manipulation and // inner for loop of FastShiftOut::writeLSBFIRST
for (uint8_t m = 1; m > 0; m <<= 1)
{
uint8_t oldSREG = SREG; // <-
noInterrupts(); // <-
if ((value & m) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
SREG = oldSREG; // <-
} |
The noInterrupts() prevents that interrupts are run. The interrupt flag is part of the SREG (Status Register). Imagine that an IRQ uses the data and/or clock register in the loop.
That could corrupt the content of the dataRegister if the IRQ would happen between step 2 and 3 in the above code. Note: |
Reference timing of the current 0.3.3 version:
|
Not correct test, see belowQuick test: Moving the noInterrupts() out of the loop improves from 21.75 us => 19.11 us (~10%)
IDE: 1.8.19
|
Not correct test, see belowQuick test Unrolling the loop and replacing m by constants improves from 19.11 us ==> 11.94 (~38%)
uint8_t oldSREG = SREG;
noInterrupts();
if ((value & 0x80) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
if ((value & 0x40) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
if ((value & 0x20) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
if ((value & 0x10) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
if ((value & 0x08) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
if ((value & 0x04) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
if ((value & 0x02) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
if ((value & 0x01) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
SREG = oldSREG; IDE: 1.8.19
uses 250 bytes more PROGMEM, |
Not correct test, see belowQuick test replacing &= by = and |= by = for all masking, improves from 11.94 us ==> 4.65 (~60%)
IDE: 1.8.19
roughly 150 bytes more PROGMEM |
Warning: tests above need to be verified, they are just quick tests and may have bugs. Test program was incorrect, code part looks OK
to be continued |
Quick test of the print interface improvement (which has far more overhead)
where the 0.3.3 scored
roughly about 9 us gained per character send? As said before, code need to be verified. Todo test gain for write16, write24 and write32 ==> add to example |
Quick test for write16/24/32 Reference 0.3.3
optimized
Still a ~37% improvement. To be contiued. |
@nt314p Size indication and code snippets seems OK. |
New reference 0.3.3
|
New optimized writeMSBFIRST: FSO object configured as MSBFIRST
Gain writeMSBFIRST is from 22.26 us ==> 13.70 us (~38%) Still interesting. |
Will create a develop branch |
Created develop branch, please verify. |
Woah, you've done a lot of work. One thing to quickly point out is that you can't optimize away the &= and |=. If the clock and data pins reside on the same PORT register, you need to read the updated value first before clearing or setting the bit. Like for a hypothetical 2 bit register that's initialized as 00 (DATA:1, CLOCK:0) before running the shift out function: uint8_t cbmask1 = *_clockRegister | _clockBit; // 01
uint8_t cbmask2 = *_clockRegister & ~_clockBit; // 00
uint8_t outmask1 = *_dataOutRegister | _dataOutBit; // 10
uint8_t outmask2 = *_dataOutRegister & ~_dataOutBit; // 00
*_clockRegister = cbmask1; // this always clocks out DATA = 0
*_clockRegister = cbmask2; |
100% right, good catch, will fix it asap |
@nt314p new figures,
Timing writeMSBFIRST is from 22.26 us ==> 17.36 us (~22%), still worth an update. Need to think hard if more optimizations are possible. |
removed SREG construction and use interrupts() saved one instruction (0.06 us). tried "do not change dataregister if bit is the same as before" and it took more time to track the change :) |
Nice. For sure more optimizations are possible. If the register(s) are known at compile time it is possible to get the timing under 10 us per byte. But compile time pin to port is a whole other issue. Keep in mind that by using Another option could be letting the user manage interrupt enable/disable. I can think of instances where interrupts might be wanted or acceptable during the shift out, and other instances where interrupts would not be desired. |
Good points, The atomic block is an option, dont know the details so I have to do some reading Pins could be known only at runtime, that is similar to the standard shiftout() function.Do not want to break that user expectation. To be continued later this week. |
After a night sleep, interrupts suppression is left to the user. |
I'm still confused as to where the rest of the clock cycles are going. Because 17.36 us is ~278 clock cycles, or ~34.75 clock cycles per bit. The register load-and/or-store should take 5 clock cycles based on |
Found a possible optimization in the clock pulse that shaves of a micro. *_clockRegister |= cbmask1; // set one bit
*_clockRegister &= cbmask2; // reset it
==>
uint8_t r = *_clockRegister;
*_clockRegister = r | cbmask1; // set one bit
*_clockRegister = r; // reset it Results
If there is no interrupt that modifies the clock register it could work. However: oops.... ==> need to restore the interrupts disabling with the SREG in the code again. Will try to fix that today. |
With noInterrupts() added, there is 1 micros gain by the above trick.
Opinion? |
Applying the same trick to the "for loop not unrolled" version
I need time for other repos, so to be continued. |
I agree. If you want the shift out to be robust, you'll need to apply the It's also a matter of how aggressively we use atomic blocks. We could do // ...
ATOMIC_BLOCK
{
*_clockRegister |= cbmask1;
}
ATOMIC_BLOCK
{
*_clockRegister &= cbmask2;
}
// ... or // ...
ATOMIC_BLOCK
{
*_clockRegister |= cbmask1;
*_clockRegister &= cbmask2;
}
// ... or just disable interrupts for each bit or even during the entire shift out. Applying atomic blocks more finely prevents more optimization. For example, your clock pulse optimization cannot work if interrupts are allowed between the clock HIGH and clock LOW. Looking at the assembly, it seems that the program is repeatedly loading the data and clock register pointers into CPU registers. Perhaps try caching these two values at the beginning of the function. Avoiding those loads (the I've been taking a look at the assembly on my end, but I probably should just clone the repository to get more accurate results instead of just copying snippets. |
ATOMIC_BLOCK per bit is preferred as it is also a logic block. ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
{
if ((value & 0x80) == 0) *_dataOutRegister &= outmask2;
else *_dataOutRegister |= outmask1;
r = *_clockRegister;
*_clockRegister = r | cbmask1; // set one bit
*_clockRegister = r; // reset it
} https://blog.oddbit.com/post/2019-02-01-atomicblock-magic-in-avrlibc/ it just does disable and set interrupts, so I keep my own noInterrupts() construct. |
Updated the develop branch and created a PR for the 0.4.0 release. |
@nt314p Performance gains are looking good. Snippet from Readme.md
I will only merge after FastShiftInOut has been updated too as these three libs are quite related. |
@nt314p Performance gain is about 25% so 👍 Will start merging the three PR's tomorrow |
Great stuff. Do you intentionally disable interrupts during the entire shift out for the unrolled Also, as I mentioned before, I believe you should cache the volatile registers locally to avoid extra indirect loads to fetch the register pointers for every bit. For example: size_t FastShiftOut::writeLSBFIRST(uint8_t data)
{
// other initialization...
volatile uint8_t* localDataOutRegister = _dataOutRegister;
volatile uint8_t* localClockRegister = _clockRegister;
// ... This should give another performance boost, although I do not have the hardware with me to test it. |
You can test this with only an UNO, nothing connected as there are no return signals. |
Yes, it is faster quite a bit. Only patched the writeLSBFIRST() and there is another 4.4 usec off, see the diff with writeMSBFIRST()
The bit rate is approaching 700.000 bit / second, TODO: update code and readme.md (maybe tomorrow) |
Patched the not unrolled loop too + move the interrupt block out the loop (per byte not per bit)
almost 8 us of 22 = ~35% faster. |
Did an update for FastShiftOut(), ==> roughly 45% improved versus 0.3.3
|
Nice! I actually don't even have an Arduino with me. Really no hardware at all. I'm wondering what hardware setup can be used to verify that the output is behaving as we expect. It's possible we've overlooked something that breaks the shift out that the software tests cannot detect. Somehow ~45% improved feels like an understatement. I think we shaved off 45% of the time, but that made it nearly 100% faster! |
One thing is sure, going twice as fast means that the signals are having more RC effects. to be continued. |
new timing for the FastShiftIn() with local registers ==> crossed the 10 us border,
|
And the numbers for FastShiftInOut() with local registers.
|
For the sake of cleaner code, you might consider using a macro to unroll the loop. For example, #define UNROLL_2(x) x x
#define UNROLL_4(x) UNROLL_2(x) UNROLL_2(x)
#define UNROLL_8(x) UNROLL_4(x) UNROLL_4(x) To make it so the code inside the UNROLL can be repeated, you can shift the byte UNROLL_8(
// ...
if ((value & 0x80) == 0) // ...
// ...
value <<= 1;
); The compiler should recognize this shifting pattern and optimize it to a compile time bit check (generating the same assembly as before). |
Gain is only in maintenance, First step (imho) is to verify current code with hardware. |
@nt314p Arduino UNO, LOOP UNROLLEDYELLOW = DATA
Zooming in on a single bit of the clock, one sees the signal looks good.
Think the most expensive part is now setting the output line. if ((value & 0x01) == 0) *localDataOutRegister &= outmask2;
else *localDataOutRegister |= outmask1; As this is done for every bit again it might be inefficient. |
So all looks good, I'm going to merge after uploaded the FastShiftOut_scope_test.ino. |
- fix #15, loop unroll option, improving performance, kudos to nt314p - fixed bug in test program (see #15) - added flag to select LOOP UNROLL (is optional as it gives larger code size) - optimized the not unrolled loop with ideas of the unrolling version. - corrected type lastValue to uint8_t - add FastShiftOut_scope_test.ino - update readme.md - minor edits
Neat how the changes in the lines affect one another. You could have a bitmask that checks the current and next bits to see if they're 00 or 11. Might be interesting to investigate. One other optimization you can do is eliminating some expensive jumps when setting the data register. You can even see on the scope that after the previous clock pulse, it takes longer to set the data line compared to resetting it. It does require some inline assembly, but only to prevent the compiler from optimizing it out. The assembly forces the compiler to generate an alternative optimization, which is actually faster. uint8_t d = *localDataOutRegister;
d |= outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & 0x01) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d; Should another issue be opened for more optimization discussions? |
Think that is the way forward. |
Please copy the ideas to the new issue |
Hi,
I've been messing around with methods to increase shift out speed (possibly to under 10 us per byte).
One suggestion I have to increase performance is to manually unroll loops within write functions, since the number of iterations (8, 16, 24, etc.) is known at compile time.
This will increase sketch size, so perhaps options to use unrolled function versions could be provided through alternative functions (like writeMSBFIRSTU where U indicates unrolled) or a #define #ifdef for USE_UNROLLED. 🤷
-NT
The text was updated successfully, but these errors were encountered: