Introduction: My Response to the WS2811 With an AVR Thing

About: Just a guy really. Like building stuff and like to help other people build stuff too. If you really need to know, am male in my 30's and live close to Brisbane in Queensland, Australia.
First off - would like to say

Good on you Alan Burlison.

This is not meant to be bagging you in any way.  Your code did what it needed to do.  Great success.  My initial response in a forum comment was actually directed at the people who were offering non-working ideas of using a UART to get some hardware help.

My first suggestion of using a timer to help out is partly fleshed out below, but not fully functional.  The reason it is not complete is that when I started to fill out the code it became obvious that, with a bit more optimization, there are plenty of clocks to do the full monty as bit banging without having to unroll any loops.

The second bit of action shown here is my other suggestion.  One of the "use a UART" people said that you could use an inverter to fix up the START-BIT problem.  I thought, "Well - if you are going to throw a 74XX at it, why not use the SPI and have 140 clock cycles free?"  Again this is not a complete solution, but it is a "proof of concept" to show how the hardware can help.

Finally, the third piece is a version of bit banging out a WS2811 that I came up with.  Sans a WS2811, because I don't have any.  It does not do anything better than Alan's code.  It is just a bit more optimized (1/2 size) and easier to read due to no loop unrolling and path-lengthening.

It does not break any new ground, and there is no magic in it that no one has ever used.  It is just a little bit of me showing off and a little bit of practice for me.  I have been away from assembler for several years and am just trying to build up my confidence a little bit.

Anyways - On with the show

Step 1: Using TC0 to Generate the Waveform

Sorry guys, but I can't work out how to add <code></code>  to this thing.

So I have added a quite useless picture of the code instead.

It at least has the code/comments in glorious technicolor.  If anyone wants the ASM file then send me a mail on here with your real email address and I will FWD it to you.

But back to the point.

This method of generating the pulses actually is slower (by one clock) than just pure bit banging.  However it has one big advantage.  All your free clock cycles (14 of them) are in one contiguous block.  The bit banging version has a total of 15 free clocks, but they are broken up into two blocks AND the output-test must go at the start which limits some of the other tricks you could have used.

The astute out there will notice that the scope shows the waveform at 400 kHz.  My AVR on the desk here is clocked at 8 MHz, not 16, so it is apples for apples: either way there are 20 clocks per bit.

Step 2: Using TOC1A/B and SPI With a 74XX IC

OK - version two of making the serial stream for a WS2811 is a little bit silly.  There are not a lot of good reasons for adding extra hardware to this problem.  You can get by doing it all in software.  The one big advantage of this method is that you get a whopping 140 clock cycles to do whatever it is you need to between loading each new data BYTE into the SPI register.

This one uses some external 74XX logic.  In this case I used a Hex Open Collector Inverter and did some wired-OR logic.  There are many ways this could be done with a single chip.  The other obvious ones are a 7400 and a 74138.

Three different outputs need to be mixed together to make the final waveform, which is trace 2:

PD5/OC1A  - Output Compare 1 A       - Trace B
PD4/OC1B  - Output Compare 1 B       - Trace A
PB6/MISO  - SPI Master In Slave Out  - Trace 1

Also, Output Compare 1B must be fed back into PB7/SCK to give the master clock for the SPI peripheral in SLAVE mode.  This is the yellow wire in my photo above.

The reason we can get the SPI to work in this way is that in SLAVE mode the module can not insert a stop bit the way it does in MASTER mode.  It is marching to the beat of someone else's drum.  When the next clock comes in, it has to just comply and give out the next data bit (if it is ready) or fail otherwise.  Speaking of failing: you only have 9 clock cycles to load the data register once the last byte is clear.  This means you are cutting it a bit fine to use interrupts unless you use a "stupid AVR trick" to shave a few cycles off the interrupt response time.

Step 3: Bit Banging and Saving a Few More Clock Cycles

OK. Bit banging.

800 kHz on a 16 MHz AVR is 20 clock cycles per bit.
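
As a quick sanity check of that budget, here is a minimal sketch in plain C that can be run on a PC. The function name `clocks_per_bit` is made up for illustration and is not part of any AVR library; it simply divides the CPU clock by the bit rate.

```c
#include <stdint.h>

/* CPU clocks available per WS2811 data bit.
 * Both arguments are in Hz. The name is illustrative only. */
uint32_t clocks_per_bit(uint32_t f_cpu_hz, uint32_t bit_rate_hz)
{
    return f_cpu_hz / bit_rate_hz;
}
```

Calling `clocks_per_bit(16000000UL, 800000UL)` gives 20, and so does `clocks_per_bit(8000000UL, 400000UL)`, which is the 8 MHz / 400 kHz setup on my desk mentioned in Step 1.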

20 clock cycles on an AVR is a LOT.  We are not talking about PIC12/16C here with 4 ticks per instruction and only one real register.  The AVR does a lot per clock cycle.  If there was not the requirement to shuffle the RGB order then the AVR could do this serial job without breaking a sweat.

In fact the only thing the AVR does not shine at is changing bits in I/O registers.  This takes two clock cycles as shown in the details for the SBI instruction below.  The CPU has to read the register, modify it and write it back.  It is one of the few non-branching instructions in the AVR to take two clocks. (Note: the AVR XMega has fixed this issue and now is only 1 clock)

Using this instruction in time critical paths is not much fun, as Alan's code showed.  He had to jump and hop all over the place to equalise the path lengths.

     sbrc r19, 7            ; test hi bit clear
     rjmp 3f                ; true, skip pin hi -> lo
     cbi  %[port], %[pin]   ; false, pin hi -> lo
3:   sbrc r19, 7            ; equalise delay of both code paths
     rjmp 4f
4:   nop                    ; pulse timing delay


So if the actual CBI and SBI instructions are going to take 2 clock cycles anyway, and then you have to waste 2 more clock cycles to equalise the path lengths, why not just do the read-modify-write yourself?  This takes 3 cycles in total.

     IN      R16, PortX        ; Read the current state of the register
     ORI     R16, (1<<PinX)    ; Set the Xth bit high (ORI takes a bit mask, not a bit number)
     OUT     PortX, R16        ; Write the new value out to the register

The next thing you can do to save time is move everything you can outside the loop.  Because this code is using 100% of the CPU time, there is no risk that something else is going to change PortX.  Also, because no other code is running, we can use as many CPU registers as we like.

So do this IN-ing and AND/OR-ing way outside the loop.

      IN      PinLo, PortX   ; Make a copy of the byte in PortX
      ANDI    PinLo, 0xFE    ; Modify it to be the value to write to make pin lo
      IN      PinHi, PortX   ; Make a copy of the byte in PortX
      ORI     PinHi, 0x01    ; Modify it to be the value to write to make pin hi

Loop:
      blah
      blah
      out     PortX, PinLo   ; Set the output pin LOW
      blah
      blah
      rjmp    Loop
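
For anyone who reads C more easily than AVR-ASM, the same idea can be modelled on a PC. This is only a rough sketch: `toggle_model`, `make_pin_lo`, `make_pin_hi` and the `log` array are names I made up for illustration, the port is just a byte in memory rather than a real I/O register, and bit 0 is assumed as the output pin to match the 0xFE / 0x01 constants above.

```c
#include <stdint.h>

#define PIN_MASK 0x01u  /* assumption: output pin is bit 0, as in the ASM above */

/* Value to write to drive the pin low, leaving the other bits alone. */
uint8_t make_pin_lo(uint8_t port_value)
{
    return (uint8_t)(port_value & (uint8_t)~PIN_MASK);
}

/* Value to write to drive the pin high, leaving the other bits alone. */
uint8_t make_pin_hi(uint8_t port_value)
{
    return (uint8_t)(port_value | PIN_MASK);
}

/* Model of the loop: both port values are worked out once, before the
 * loop, so each edge inside the loop is a single whole-byte write
 * (one OUT on the AVR). Every write is recorded in log[], which must
 * hold at least 2 * n_bits entries. Returns the number of writes. */
unsigned toggle_model(uint8_t *port, unsigned n_bits, uint8_t *log)
{
    const uint8_t pin_lo = make_pin_lo(*port);
    const uint8_t pin_hi = make_pin_hi(*port);
    unsigned writes = 0;

    for (unsigned i = 0; i < n_bits; i++) {
        *port = pin_hi;            /* rising edge: one write */
        log[writes++] = *port;
        *port = pin_lo;            /* falling edge: one write */
        log[writes++] = *port;
    }
    return writes;
}
```

Starting from a port value of 0xA4, the model writes 0xA5, 0xA4, 0xA5, 0xA4 for two bits, so only bit 0 ever changes. This is only safe for the reason given above: nothing else touches the port while the loop is running.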

This has now made the whole bit toggling, serial shift, bit counting and looping take only 9 clocks.  This leaves 11 clocks free for loading data and shuffling.

Again this would be heaps of time if not for the out of order RGB thing.  Because of the out of order RGB thing we can not just treat each byte read as the next one going out.  We have to make a decision on where to save the newly read byte to a buffer and where from a buffer to get the next byte to send.

This is where the IJMP instruction comes to the rescue.  Its page from the AVR Instruction set is shown above.  We are using it like a case/switch statement in a software state-machine.  In each state we can set what the NEXT state should be without having to do any evaluations.

We can do this because we always know what colour the next byte is going to be:

If we are currently processing the RED byte the next byte WILL be GREEN
If we are currently processing the GREEN byte the next byte WILL be BLUE
If we are currently processing the BLUE byte the next byte WILL be RED

eg. In the red state we can simply say 

STATE = GREEN

We don't have to say 

if (SOMETHING) then STATE = GREEN else STATE = BLUE

This saves a few clocks by not having to evaluate anything.
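
The same state-machine trick can be sketched in C using a function pointer in place of the Z register that IJMP jumps through. This is a PC-side illustration only, not my ASM: the `machine` struct, the `state_*` handlers and `machine_feed` are all hypothetical names, and what each state does with its byte (here, just storing it by colour) is an assumption made to keep the example short.

```c
#include <stdint.h>

typedef struct machine machine;
typedef void (*state_fn)(machine *m, uint8_t byte);

struct machine {
    state_fn next;            /* plays the role of Z for IJMP */
    uint8_t red, green, blue; /* where each state parks its byte */
};

void state_red(machine *m, uint8_t byte);
void state_green(machine *m, uint8_t byte);
void state_blue(machine *m, uint8_t byte);

/* Each handler sets the NEXT state unconditionally: no compare, no branch. */
void state_red(machine *m, uint8_t byte)
{
    m->red = byte;
    m->next = state_green;
}

void state_green(machine *m, uint8_t byte)
{
    m->green = byte;
    m->next = state_blue;
}

void state_blue(machine *m, uint8_t byte)
{
    m->blue = byte;
    m->next = state_red;
}

/* Start in the red state with all colour slots cleared. */
void machine_init(machine *m)
{
    m->next = state_red;
    m->red = 0;
    m->green = 0;
    m->blue = 0;
}

/* One indirect call per byte, like one IJMP per byte in the ASM. */
void machine_feed(machine *m, uint8_t byte)
{
    m->next(m, byte);
}
```

After `machine_init`, feeding three bytes walks through red, green and blue and leaves `next` pointing back at `state_red`, without a single "which colour am I on?" test along the way.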


The whole code is shown as a picture here.  Again send me a PM if you want me to email it to you.

The comments in the code are hopefully enough to let even someone unfamiliar with AVR-ASM understand it.

Step 4: Using a UART Without an External Inverter

OK - finally I am going to mention here that you COULD use the UART with no external inverter.

Four hints

1, 5 UART bits per 1 WS2811 bit
2, UART in 8 bit mode
3, BREAK
4, The entry point to the serialiser is not bit 0

I am not going to write the code for that as it is a waste of time on the AVR as you still have to count clocks on entry and it does not gain that much free time anyway.  On the XMega with DMA it is a different proposition though.  It could free most of your XMega CPU.


(I didn't know what to do as a photo for this step so I just did the brass robot  from The Etchinator)

Step 5: Conclusion

Ummmm.

What to conclude.

1, Alan did a fine job and it worked.
2, I am a tosser that just wants to show off how you can do things in less clocks on an AVR
3, People leaving comments on HaD should put up or shut up