Loop unrolling optimization #15

nt314p · 2024-09-01T23:06:37Z

Hi,

I've been messing around with methods to increase shift out speed (possibly to under 10 us per byte).

One suggestion I have to increase performance is to manually unroll loops within write functions, since the number of iterations (8, 16, 24, etc.) is known at compile time.

This will increase sketch size, so perhaps options to use unrolled function versions could be provided through alternative functions (like writeMSBFIRSTU where U indicates unrolled) or a #define #ifdef for USE_UNROLLED. 🤷

-NT

RobTillaart · 2024-09-02T04:38:59Z

Thanks for sharing this idea.
As I am rather busy, it will take time before I will dive into it..
Did you make measurements of the gain / size ?

A define sounds good, would allow to set it from commandline too.

nt314p · 2024-09-02T23:12:30Z

No worries, I am busy as well and probably will only get around to testing it on hardware in a month.

Certainly though, unrolling should remove the jump instructions needed for the loop and give some performance benefit.

--

Unrelated, but I was wondering what the purpose is of the status register manipulation and noInterrupts() call?

// inner for loop of FastShiftOut::writeLSBFIRST
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    uint8_t oldSREG = SREG; // <-
    noInterrupts(); // <-
    if ((value & m) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;
    SREG = oldSREG; // <-
  }

RobTillaart · 2024-09-03T06:46:39Z

The noInterrupts() prevents that interrupts are run. The interrupt flag is part of the SREG (Status Register).
SREG = oldSREG resets all status flags to point before noInterrupts() is called.

Imagine that an IRQ uses the data and/or clock register in the loop.
e.g *_dataOutRegister &= outmask2; is in pseudo machine code.

1 LOAD  _dataOutRegister INTO CPUREGISTER
2 AND CPUREGISTER outmask2
3 STORE CPUREGISTER INTO _dataOutRegister

That could corrupt the content of the dataRegister if the IRQ would happen between step 2 and 3 in the above code.

Note:
The nointerrupts should be out of the loop as it would make the loop more predictable with respect to the arrival time of the individual bits. Could be important, even critical, depending on the receiving end.

RobTillaart · 2024-09-03T06:50:58Z

Reference timing of the current 0.3.3 version:

IDE: 1.8.19
Board: UNO

FastShiftOut_test.ino
FASTSHIFTOUT_LIB_VERSION: 0.3.3

Performance - time in us
        write: 23.51
        write: 45.78
        Delta: 22.27

writeLSBFIRST: 22.51
writeLSBFIRST: 45.77
        Delta: 23.26

writeMSBFIRST: 23.02
writeMSBFIRST: 44.77
        Delta: 21.75

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.60


Test print interface
println("Hello world"): 	328.92
println(1357): 			311.68
println(3.14159265, 4): 	716.04

done ...

RobTillaart · 2024-09-03T06:59:11Z

Not correct test, see below

Quick test:

Moving the noInterrupts() out of the loop improves from 21.75 us => 19.11 us (~10%)

writeMSBFIRST: 23.02
writeMSBFIRST: 42.13
        Delta: 19.11

IDE: 1.8.19
Board UNO

Sketch uses 6544 bytes (20%) of program storage space. Maximum is 32256 bytes.
Global variables use 609 bytes (29%) of dynamic memory, leaving 1439 bytes for local variables. Maximum is 2048 bytes.

RobTillaart · 2024-09-03T07:09:00Z

Not correct test, see below

Quick test

Unrolling the loop and replacing m by constants improves from 19.11 us ==> 11.94 (~38%)

writeMSBFIRST: 23.02
writeMSBFIRST: 34.96
        Delta: 11.94

  uint8_t oldSREG = SREG;
  noInterrupts();

    if ((value & 0x80) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x40) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x20) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x10) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x08) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x04) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x02) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x01) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;


  SREG = oldSREG;

IDE: 1.8.19
Board UNO

Sketch uses 6792 bytes (21%) of program storage space. Maximum is 32256 bytes.
Global variables use 609 bytes (29%) of dynamic memory, leaving 1439 bytes for local variables. Maximum is 2048 bytes.

uses 250 bytes more PROGMEM,
NOTE only patched the MSB code to do the test, so it would add twice this amount of PROGMEM.

RobTillaart · 2024-09-03T07:19:10Z

Not correct test, see below

Quick test

replacing &= by = and |= by = for all masking, improves from 11.94 us ==> 4.65 (~60%)
Improvement from reference 21.75 ==> 4.65 is almost 80% (factor 4.6)

writeMSBFIRST: 23.02
writeMSBFIRST: 27.67
        Delta: 4.65


  uint8_t oldSREG = SREG;
  noInterrupts();
  
  uint8_t cbmask1  = *_clockRegister | _clockBit;
  uint8_t cbmask2  = *_clockRegister & ~_clockBit;
  uint8_t outmask1 = *_dataOutRegister | _dataOutBit;
  uint8_t outmask2 = *_dataOutRegister & ~_dataOutBit;

    if ((value & 0x80) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x40) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x20) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x10) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x08) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x04) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x02) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x01) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;


  SREG = oldSREG;

IDE: 1.8.19
Board UNO

Sketch uses 6708 bytes (20%) of program storage space. Maximum is 32256 bytes.
Global variables use 609 bytes (29%) of dynamic memory, leaving 1439 bytes for local variables. Maximum is 2048 bytes.

roughly 150 bytes more PROGMEM
NOTE only patched the MSB code to do the test, so it would add twice this amount of PROGMEM.

RobTillaart · 2024-09-03T07:30:07Z

@nt314p

Warning: tests above need to be verified, they are just quick tests and may have bugs.

Test program was incorrect, code part looks OK

~~OK, trading RAM for speed makes a call roughly 4.5 times faster which is more than expected.~~
~~4.65 us means the average bit is send in 4.65 / 8 in 0.6 us (is about 1.6 megabit/second)~~

~~Note that the duration of a pulse on the clock or data line is even less.~~
~~Needs to be checked with hardware (scope) to verify if the pulses are still "square" enough.~~

~~This code needs some serious warning in the documentation.~~

to be continued

RobTillaart · 2024-09-03T07:42:55Z

Quick test of the print interface improvement (which has far more overhead)

Test print interface
println("Hello world"): 	218.64
println(1357): 			260.68
println(3.14159265, 4): 	648.12

where the 0.3.3 scored

Test print interface
println("Hello world"): 	328.92
println(1357): 			311.68
println(3.14159265, 4): 	716.04

roughly about 9 us gained per character send?

As said before, code need to be verified.
Furthermore gain for LSB and MSB might differ a bit.

Todo test gain for write16, write24 and write32 ==> add to example

RobTillaart · 2024-09-03T08:05:21Z

Quick test for write16/24/32

Reference 0.3.3

      write16: 45.84
      write16: 91.42
        Delta: 45.58

      write24: 68.09
      write24: 135.93
        Delta: 67.84

      write32: 90.35
      write32: 180.45
        Delta: 90.10

optimized

      write16: 28.74
      write16: 57.22
        Delta: 28.48

      write24: 42.44
      write24: 84.63
        Delta: 42.19

      write32: 56.15
      write32: 112.04
        Delta: 55.90

Still a ~37% improvement.
Writing e.g. a dedicated write32() without function call overhead would be even be faster (would use quite some RAM)

To be contiued.

RobTillaart · 2024-09-03T10:34:50Z

@nt314p
Ooops!, Found bugs in the test program, so the above results are not correct and will be less significant.
Need to rerun tests.

Size indication and code snippets seems OK.

RobTillaart · 2024-09-03T10:43:41Z

New reference 0.3.3

FastShiftOut_test.ino
FASTSHIFTOUT_LIB_VERSION: 0.3.3

Performance - time in us
        write: 23.51
        write: 45.78
        Delta: 22.27

writeLSBFIRST: 22.51
writeLSBFIRST: 44.76
        Delta: 22.25

writeMSBFIRST: 22.51
writeMSBFIRST: 44.77
        Delta: 22.26

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.59

      write16: 45.65
      write16: 91.04
        Delta: 45.39

      write24: 67.90
      write24: 135.56
        Delta: 67.66

      write32: 90.16
      write32: 180.07
        Delta: 89.91

Test print interface
println("Hello world"): 	328.92
println(1357): 			311.60
println(3.14159265, 4): 	716.04

RobTillaart · 2024-09-03T10:59:24Z

New optimized writeMSBFIRST:

FSO object configured as MSBFIRST

Performance - time in us
        write: 15.02
        write: 28.80
        Delta: 13.78

writeLSBFIRST: 22.51
writeLSBFIRST: 44.77
        Delta: 22.26

writeMSBFIRST: 13.96
writeMSBFIRST: 27.67
        Delta: 13.70

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.59

      write16: 28.74
      write16: 57.21
        Delta: 28.48

      write24: 42.44
      write24: 84.62
        Delta: 42.18

      write32: 56.15
      write32: 112.04
        Delta: 55.89


Test print interface
println("Hello world"): 	218.56
println(1357): 			260.72
println(3.14159265, 4): 	648.16

Gain writeMSBFIRST is from 22.26 us ==> 13.70 us (~38%) Still interesting.

RobTillaart · 2024-09-03T10:59:40Z

Will create a develop branch

RobTillaart · 2024-09-03T11:41:46Z

Created develop branch, please verify.

nt314p · 2024-09-03T12:39:14Z

Woah, you've done a lot of work. One thing to quickly point out is that you can't optimize away the &= and |=. If the clock and data pins reside on the same PORT register, you need to read the updated value first before clearing or setting the bit.

Like for a hypothetical 2 bit register that's initialized as 00 (DATA:1, CLOCK:0) before running the shift out function:

uint8_t cbmask1  = *_clockRegister | _clockBit; // 01
uint8_t cbmask2  = *_clockRegister & ~_clockBit; // 00
uint8_t outmask1 = *_dataOutRegister | _dataOutBit; // 10
uint8_t outmask2 = *_dataOutRegister & ~_dataOutBit; // 00

*_clockRegister = cbmask1; // this always clocks out DATA = 0
*_clockRegister = cbmask2;

RobTillaart · 2024-09-03T13:33:39Z

One thing to quickly point out is that you can't optimize away the &= and |=. If the clock and data pins reside on the same PORT register.

100% right, good catch, will fix it asap

RobTillaart · 2024-09-03T14:10:44Z

@nt314p
Push an updated version of the develop branch.

new figures,

        write: 18.60
        write: 35.97
        Delta: 17.36

writeLSBFIRST: 17.61
writeLSBFIRST: 34.96
        Delta: 17.35

writeMSBFIRST: 17.60
writeMSBFIRST: 34.96
        Delta: 17.36

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.59

      write16: 35.84
      write16: 71.43
        Delta: 35.59

      write24: 53.19
      write24: 106.14
        Delta: 52.94

      write32: 70.54
      write32: 140.84
        Delta: 70.29

Test print interface
println("Hello world"): 	265.16
println(1357): 			282.24
println(3.14159265, 4): 	676.80

Timing writeMSBFIRST is from 22.26 us ==> 17.36 us (~22%), still worth an update.
Recall about 10% is gained by disabling the interrupts only once, so loop unroll gives about 12%.

Need to think hard if more optimizations are possible.

RobTillaart · 2024-09-03T15:40:56Z

removed SREG construction and use interrupts() saved one instruction (0.06 us).

tried "do not change dataregister if bit is the same as before" and it took more time to track the change :)

nt314p · 2024-09-03T17:39:40Z

Nice. For sure more optimizations are possible. If the register(s) are known at compile time it is possible to get the timing under 10 us per byte. But compile time pin to port is a whole other issue.

Keep in mind that by using interrupts(), you enable interrupts even if the sketch did not have them enabled before. An alternative could be using ATOMIC_BLOCK and ATOMIC_RESTORESTATE from the avr library (see https://www.nongnu.org/avr-libc/user-manual/group__util__atomic.html). This is a way to create a block of code that is executed without interrupts. I do not think it has any performance benefit though.

Another option could be letting the user manage interrupt enable/disable. I can think of instances where interrupts might be wanted or acceptable during the shift out, and other instances where interrupts would not be desired.

RobTillaart · 2024-09-03T19:21:17Z

Good points,
The SREG code restores the interrupt flag as it was before the call, so that would work.

The atomic block is an option, dont know the details so I have to do some reading

Pins could be known only at runtime, that is similar to the standard shiftout() function.Do not want to break that user expectation.
nued
That said, you propose to remove the interrupts and leave it to the user. Makes some sense, need to sleep over that thought.

To be continued later this week.

RobTillaart · 2024-09-04T07:02:47Z

After a night sleep, interrupts suppression is left to the user.
Will update the code and readme.md a.s.a.p

nt314p · 2024-09-04T12:08:54Z

I'm still confused as to where the rest of the clock cycles are going. Because 17.36 us is ~278 clock cycles, or ~34.75 clock cycles per bit. The register load-and/or-store should take 5 clock cycles based on LD,AND/OR,ST, so you have 5 + 5 for the clock pulse, another 5 for the data line, and even throw in 10 more (generously) for checking if the data bit is set and branching. That gives 25 cycles per bit, which leaves 278 - 25 x 8 = 78 clock cycles for function calling and return overhead? It seems excessive. Is there something I am missing?

RobTillaart · 2024-09-04T15:43:15Z

Found a possible optimization in the clock pulse that shaves of a micro.

  *_clockRegister |= cbmask1;    //  set one bit
  *_clockRegister &= cbmask2;    // reset it

==>

  uint8_t r = *_clockRegister;
  *_clockRegister = r | cbmask1;  //  set one bit
  *_clockRegister = r;            //  reset it

Results

writeLSBFIRST: 16.22
writeLSBFIRST: 32.20
        Delta: 15.98    <<<<  was 17.17

If there is no interrupt that modifies the clock register it could work.

However:
The original *_clockRegister |= cbmask1; statement could also be interrupted with a change in the clock register.
Imagine the clock is on the same port as Serial1 and data is coming in.

oops....

==> need to restore the interrupts disabling with the SREG in the code again.
It should be as robust as possible.

Will try to fix that today.

RobTillaart · 2024-09-04T15:52:53Z

With noInterrupts() added, there is 1 micros gain by the above trick.

  uint8_t r = *_clockRegister;
  *_clockRegister = r | cbmask1;  //  set one bit
  *_clockRegister = r;            //  reset it

writeLSBFIRST: 16.42
writeLSBFIRST: 32.57
        Delta: 16.16   <<<  was 17.17

Opinion?

RobTillaart · 2024-09-04T15:58:43Z

Applying the same trick to the "for loop not unrolled" version

writeLSBFIRST: 20.06
writeLSBFIRST: 39.87
        Delta: 19.80   <<<  was 20.94   => 0.3.3 took 22.20 us

I need time for other repos, so to be continued.

nt314p · 2024-09-04T19:47:21Z

I agree. If you want the shift out to be robust, you'll need to apply the noInterrupts()/SREG pattern to make an atomic section of code. You could also use the ATOMIC_BLOCK I suggested earlier. No difference though I believe.

It's also a matter of how aggressively we use atomic blocks. We could do

// ...
ATOMIC_BLOCK
{
  *_clockRegister |= cbmask1;
}
ATOMIC_BLOCK
{
  *_clockRegister &= cbmask2;
}
// ...

or

// ...
ATOMIC_BLOCK
{
  *_clockRegister |= cbmask1;
  *_clockRegister &= cbmask2;
}
// ...

or just disable interrupts for each bit or even during the entire shift out. Applying atomic blocks more finely prevents more optimization. For example, your clock pulse optimization cannot work if interrupts are allowed between the clock HIGH and clock LOW.

Looking at the assembly, it seems that the program is repeatedly loading the data and clock register pointers into CPU registers. Perhaps try caching these two values at the beginning of the function. Avoiding those loads (the ldds) should save a considerable amount. Note that the indirect load and store functions (think reading and writing to *registerPointer) take 2 clock cycles I believe.

I've been taking a look at the assembly on my end, but I probably should just clone the repository to get more accurate results instead of just copying snippets.

RobTillaart · 2024-09-05T06:54:29Z

ATOMIC_BLOCK per bit is preferred as it is also a logic block.
Would be 8 blocks of about 2 us with current timing.

ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
{
  if ((value & 0x80) == 0) *_dataOutRegister &= outmask2;
  else                     *_dataOutRegister |= outmask1;
  r = *_clockRegister;
  *_clockRegister = r | cbmask1;  //  set one bit
  *_clockRegister = r;            //  reset it
}

https://blog.oddbit.com/post/2019-02-01-atomicblock-magic-in-avrlibc/
https://www.nongnu.org/avr-libc/user-manual/group__util__atomic.html

it just does disable and set interrupts, so I keep my own noInterrupts() construct.

RobTillaart · 2024-09-05T08:05:14Z

Updated the develop branch and created a PR for the 0.4.0 release.
Think this version has to be tested more before adding more optimizations to it.
Please download the develop branch and let me know if you find problems.

RobTillaart · 2024-09-10T08:03:51Z

@nt314p
Created a PR for FastShiftIn class today. - RobTillaart/FastShiftIn#18

Performance gains are looking good. Snippet from Readme.md

|  function            |   0.2.3  |   0.3.2  |   0.4.0  |  0.4.0L  |
|:---------------------|---------:|---------:|---------:|---------:|
|  read()              |   19.30  |   20.49  |   20.57  |   12.40  |
|  read16()            |          |   41.04  |   41.11  |   25.64  |
|  read24()            |          |   62.91  |   62.68  |   39.47  |
|  read32()            |          |   83.95  |   82.86  |   51.92  |
|  readLSBFIRST()      |   19.04  |   19.92  |   19.82  |   12.08  |
|  readMSBFIRST()      |   19.04  |   19.92  |   19.81  |   12.07  |
|  reference shiftIn() |  107.82  |  108.20  |  108.20  |  108.05  |

- Note: 0.4.0L measured with loop unrolled flag enabled. 
- Note: 0.4.0 measured with loop unroll flag disabled.

I will only merge after FastShiftInOut has been updated too as these three libs are quite related.

RobTillaart · 2024-09-10T14:50:32Z

@nt314p
Created a PR for FastShiftInOut class today. - RobTillaart/FastShiftInOut#8

Performance gain is about 25% so 👍

Will start merging the three PR's tomorrow

nt314p · 2024-09-10T15:18:19Z

Great stuff. Do you intentionally disable interrupts during the entire shift out for the unrolled writeLSBFIRST function (and in other functions)? I thought it would just be per bit.

Also, as I mentioned before, I believe you should cache the volatile registers locally to avoid extra indirect loads to fetch the register pointers for every bit.

For example:

size_t FastShiftOut::writeLSBFIRST(uint8_t data)
{
  // other initialization...
  volatile uint8_t* localDataOutRegister = _dataOutRegister;
  volatile uint8_t* localClockRegister = _clockRegister;
  // ...

This should give another performance boost, although I do not have the hardware with me to test it.

RobTillaart · 2024-09-10T17:22:23Z

This should give another performance boost, although I do not have the hardware with me to test it.

You can test this with only an UNO, nothing connected as there are no return signals.
I will give it a try

RobTillaart · 2024-09-10T17:28:57Z

Yes, it is faster quite a bit.

Only patched the writeLSBFIRST() and there is another 4.4 usec off, see the diff with writeMSBFIRST()

writeLSBFIRST: 11.76
writeLSBFIRST: 23.26
        Delta: 11.50

writeMSBFIRST: 16.16
writeMSBFIRST: 32.07
        Delta: 15.91

The bit rate is approaching 700.000 bit / second,

TODO: update code and readme.md (maybe tomorrow)

RobTillaart · 2024-09-10T17:38:02Z

Patched the not unrolled loop too + move the interrupt block out the loop (per byte not per bit)

writeLSBFIRST: 14.34
writeLSBFIRST: 28.42
        Delta: 14.09

writeMSBFIRST: 22.26
writeMSBFIRST: 44.27
        Delta: 22.01

almost 8 us of 22 = ~35% faster.

RobTillaart · 2024-09-10T17:51:01Z

Did an update for FastShiftOut(), ==> roughly 45% improved versus 0.3.3


|  function                |  0.2.4  |   0.3.1  |   0.3.3  |   0.4.0  |  0.4.0L  |
|:-------------------------|--------:|---------:|---------:|---------:|---------:|
|  write()                 |  21.66  |   22.48  |   22.27  |   14.10  |   11.51  |
|  writeLSBFIRST()         |  22.94  |   23.37  |   22.25  |   14.09  |   11.50  |
|  writeMSBFIRST()         |  20.30  |   21.86  |   22.26  |   14.08  |   11.50  |
|  reference shiftOut()    |  89.74  |   89.74  |   89.59  |   89.60  |   89.60  |
|  write16()               |   na    |    na    |   45.39  |   29.06  |   23.89  |
|  write24()               |   na    |    na    |   67.66  |   43.12  |   35.40  |
|  write32()               |   na    |    na    |   89.91  |   57.22  |   46.90  |
|  println("Hello world")  |   na    |  328.92  |  328.92  |  222.68  |  189.20  |
|  println(1357)           |   na    |  313.56  |  311.60  |  262.60  |  247.12  |
|  println(3.14159265, 4)  |   na    |  717.36  |  716.04  |  650.68  |  629.96  |

nt314p · 2024-09-10T18:01:45Z

Nice! I actually don't even have an Arduino with me. Really no hardware at all. I'm wondering what hardware setup can be used to verify that the output is behaving as we expect. It's possible we've overlooked something that breaks the shift out that the software tests cannot detect.

Somehow ~45% improved feels like an understatement. I think we shaved off 45% of the time, but that made it nearly 100% faster!

RobTillaart · 2024-09-10T18:05:54Z

It's possible we've overlooked something that breaks the shift out that the software tests cannot detect.

One thing is sure, going twice as fast means that the signals are having more RC effects.
The pulses will be less "squarish", I expect that will be visible on a oscilloscope.

to be continued.

RobTillaart · 2024-09-11T09:26:08Z

@nt314p

new timing for the FastShiftIn() with local registers ==> crossed the 10 us border,
(verification with hardware still to be done.)

|  function            |   0.2.3  |   0.3.2  |   0.4.0  |  0.4.0L  |
|:---------------------|---------:|---------:|---------:|---------:|
|  read()              |   19.30  |   20.49  |   12.71  |    9.51  |
|  read16()            |          |   41.04  |   25.39  |   18.98  |
|  read24()            |          |   62.91  |   39.10  |   29.48  |
|  read32()            |          |   83.95  |   51.42  |   38.60  |
|  readLSBFIRST()      |   19.04  |   19.92  |   11.96  |    8.81  |
|  readMSBFIRST()      |   19.04  |   19.92  |   11.94  |    8.75  |
|  reference shiftIn() |  107.82  |  108.20  |  108.05  |  108.05  |

RobTillaart · 2024-09-11T11:18:29Z

@nt314p

And the numbers for FastShiftInOut() with local registers.

|  function                |   0.1.3  |   0.2.0  |   0.2.0L  |
|:-------------------------|---------:|---------:|----------:|
|  write() (reference)     | no data  |  158.24  |  no data  |
|  write()                 |   25.52  |   17.61  |    12.26  |
|  writeLSBFIRST()         |   25.52  |   17.61  |    12.26  |
|  writeMSBFIRST()         |   25.52  |   17.60  |    12.20  |


- Note: 0.1.3 added from old table.
- Note: reference run on AVR by commenting all optimizations.
- Note: 0.2.0 measured with loop unroll flag disabled.
- Note: 0.2.0L measured with loop unrolled flag enabled.

nt314p · 2024-09-14T13:45:08Z

For the sake of cleaner code, you might consider using a macro to unroll the loop. For example,

#define UNROLL_2(x) x x
#define UNROLL_4(x) UNROLL_2(x) UNROLL_2(x)
#define UNROLL_8(x) UNROLL_4(x) UNROLL_4(x)

To make it so the code inside the UNROLL can be repeated, you can shift the byte value instead of shifting the bit you check each time (currently you do if ((value & 0x80) == 0), if ((value & 0x40) == 0), ...). For MSB first you can do,

UNROLL_8(
  // ...
  if ((value & 0x80) == 0) // ...
  // ...
  value <<= 1;
);

The compiler should recognize this shifting pattern and optimize it to a compile time bit check (generating the same assembly as before).

RobTillaart · 2024-09-14T14:04:47Z

Gain is only in maintenance,

First step (imho) is to verify current code with hardware.
Then release it as "reference".
Then look for additional optimization's.

RobTillaart · 2024-09-19T11:27:33Z

@nt314p
Finally found time to check the code with a scope to see the signal quality.
(disclaimer photo made with phone as scope froze on USB)

Arduino UNO, LOOP UNROLLED

YELLOW = DATA
PURPLE = CLOCK

Note a byte takes between 8 and 9 microseconds.
(no loop unrolled is about 11 micros, no artifacts.).
data line shows nice 0x55 pattern alternating 0b01010101

Zooming in on a single bit of the clock, one sees the signal looks good.
(it has a pull up resistor)

rise time ~20 ns,
fall time ~40 ns
high time ~100 ns
clock pulse ==> 2 clock cycli UNO)
data pulse is rather long compared to clock pulse.
minor noise (ripple) on data line when clock rises / falls (and vice versa)

Think the most expensive part is now setting the output line.

  if ((value & 0x01) == 0) *localDataOutRegister &= outmask2;
  else                     *localDataOutRegister |= outmask1;

As this is done for every bit again it might be inefficient.
For the bit pattern 01010101 one needs to change the dataRegister 8 times.
For a bit pattern 11110000 one only needs to change the dataRegister 2 times
And a bit pattern 11111111 one needs to change only the dataRegister 1 time.
So remembering the status and not change if not needed might strip off one or more clockcycle.

RobTillaart · 2024-09-19T12:10:00Z

So all looks good, I'm going to merge after uploaded the FastShiftOut_scope_test.ino.

- fix #15, loop unroll option, improving performance, kudos to nt314p - fixed bug in test program (see #15) - added flag to select LOOP UNROLL (is optional as it gives larger code size) - optimized the not unrolled loop with ideas of the unrolling version. - corrected type lastValue to uint8_t - add FastShiftOut_scope_test.ino - update readme.md - minor edits

nt314p · 2024-09-19T13:25:33Z

Neat how the changes in the lines affect one another. You could have a bitmask that checks the current and next bits to see if they're 00 or 11. Might be interesting to investigate.

One other optimization you can do is eliminating some expensive jumps when setting the data register. You can even see on the scope that after the previous clock pulse, it takes longer to set the data line compared to resetting it. It does require some inline assembly, but only to prevent the compiler from optimizing it out. The assembly forces the compiler to generate an alternative optimization, which is actually faster.

uint8_t d = *localDataOutRegister;
d |= outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & 0x01) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d;

Should another issue be opened for more optimization discussions?

RobTillaart · 2024-09-19T13:44:44Z

Should another issue be opened for more optimization discussions?

Think that is the way forward.
There might be some feedback / questions on the new 040 release too.
(Including comments on the related libs I released today).

RobTillaart · 2024-09-19T13:46:08Z

Please copy the ideas to the new issue

RobTillaart self-assigned this Sep 2, 2024

RobTillaart added the enhancement New feature or request label Sep 2, 2024

RobTillaart added a commit that referenced this issue Sep 3, 2024

Fix #15, loop unroll support

ff767ec

RobTillaart added a commit that referenced this issue Sep 3, 2024

Fix #15, loop unroll support

a79da44

RobTillaart added a commit that referenced this issue Sep 3, 2024

Fix #15, loop unroll support

6dce389

RobTillaart added a commit that referenced this issue Sep 3, 2024

Fix #15, loop unroll support

28ebb93

RobTillaart added a commit that referenced this issue Sep 4, 2024

Fix #15, loop unroll support

4ce85c3

This was referenced Sep 4, 2024

Sync optimizations of FastShiftOUt RobTillaart/FastShiftInOut#7

Closed

Sync optimizations of FastShiftOUt RobTillaart/FastShiftIn#17

Closed

RobTillaart linked a pull request Sep 10, 2024 that will close this issue

Fix #15, optional loop unroll optimization #16

Merged

RobTillaart closed this as completed in #16 Sep 19, 2024

RobTillaart mentioned this issue Sep 20, 2024

Replace jump branch with skip branch optimization #17

Closed

Loop unrolling optimization #15

Loop unrolling optimization #15

Comments

nt314p commented Sep 1, 2024

RobTillaart commented Sep 2, 2024

nt314p commented Sep 2, 2024 • edited Loading

RobTillaart commented Sep 3, 2024 • edited Loading

RobTillaart commented Sep 3, 2024 • edited Loading

RobTillaart commented Sep 3, 2024 • edited Loading

Not correct test, see below

RobTillaart commented Sep 3, 2024 • edited Loading

Not correct test, see below

RobTillaart commented Sep 3, 2024 • edited Loading

Not correct test, see below

RobTillaart commented Sep 3, 2024 • edited Loading

Test program was incorrect, code part looks OK

RobTillaart commented Sep 3, 2024 • edited Loading

RobTillaart commented Sep 3, 2024

RobTillaart commented Sep 3, 2024 • edited Loading

RobTillaart commented Sep 3, 2024

RobTillaart commented Sep 3, 2024

RobTillaart commented Sep 3, 2024

RobTillaart commented Sep 3, 2024

nt314p commented Sep 3, 2024

RobTillaart commented Sep 3, 2024

RobTillaart commented Sep 3, 2024 • edited Loading

RobTillaart commented Sep 3, 2024

nt314p commented Sep 3, 2024

RobTillaart commented Sep 3, 2024

RobTillaart commented Sep 4, 2024 • edited Loading

nt314p commented Sep 4, 2024

RobTillaart commented Sep 4, 2024 • edited Loading

RobTillaart commented Sep 4, 2024

RobTillaart commented Sep 4, 2024

nt314p commented Sep 4, 2024

RobTillaart commented Sep 5, 2024 • edited Loading

RobTillaart commented Sep 5, 2024

RobTillaart commented Sep 10, 2024

RobTillaart commented Sep 10, 2024

nt314p commented Sep 10, 2024 • edited Loading

RobTillaart commented Sep 10, 2024

RobTillaart commented Sep 10, 2024 • edited Loading

RobTillaart commented Sep 10, 2024

RobTillaart commented Sep 10, 2024 • edited Loading

nt314p commented Sep 10, 2024

RobTillaart commented Sep 10, 2024

RobTillaart commented Sep 11, 2024

RobTillaart commented Sep 11, 2024

nt314p commented Sep 14, 2024

RobTillaart commented Sep 14, 2024

RobTillaart commented Sep 19, 2024

Arduino UNO, LOOP UNROLLED

RobTillaart commented Sep 19, 2024

nt314p commented Sep 19, 2024

RobTillaart commented Sep 19, 2024

RobTillaart commented Sep 19, 2024

nt314p commented Sep 2, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 3, 2024 •

edited

Loading

RobTillaart commented Sep 4, 2024 •

edited

Loading

RobTillaart commented Sep 4, 2024 •

edited

Loading

RobTillaart commented Sep 5, 2024 •

edited

Loading

nt314p commented Sep 10, 2024 •

edited

Loading

RobTillaart commented Sep 10, 2024 •

edited

Loading

RobTillaart commented Sep 10, 2024 •

edited

Loading