Not all WS2812Bs light up under certain configurations (not a power problem)

Basic Information:

Printer Model: Ender 3 heavily modified
MCU / Printerboard: BTT Manta E3EZ
Host / SBC: Raspberry CM4 / BTT CB1
klippy.log
klippy.log (93.6 KB)

Describe your issue:


The saga is quite long, spanning several days, but I want to spare you the time, so here’s the short version.
I wanted to use some long “Neopixel” strips on my printer (yes, I know, it’s a printer, not a Christmas tree, but bear with an old child) using Julian Schill’s library, etc. But I noticed that only part of the string was driven.
Built a similar test setup with different components (same board model, albeit with a CB1 instead of a CM4, similar PSUs, different brand LEDs, etc). Wiring is OK: a separate 5V PSU powers the LEDs at multiple points, with common ground to the 24V one.
On a bare-minimum configuration with a clean Klipper install, it works. Setting

initial_red = 0.5
initial_green = 0.0
initial_blue = 0.0

the whole strip lights red on boot. Changing the colour, e.g.


SET_LED LED=led_strip GREEN=0 RED=0 BLUE=0.502

changes the whole strip to blue. That’s the configuration from the first klippy.log

But adding either of the following two configuration sections breaks it. Issuing the same command as above makes only the first 40-50 LEDs change to blue; the others stay red.

  1. I had a self-made filament motion sensor, which I don’t use for now, but the configuration remained in printer.cfg. If this is present, the LEDs do not work properly and only part of the strip updates:
[filament_motion_sensor filament_sensor]
pause_on_runout = True
detection_length = 1.25
extruder = extruder
switch_pin = PC5
runout_gcode = 
	M117 Runout Detected
insert_gcode = 
	M117 Insert Detected

Here’s a log with this inserted in printer.cfg: klippy_filamentsensor.log (84.9 KB)

  2. The Ender 3 originally had a Z-endstop, which I was not using because I have a BL-Touch. I thought it would be nice to use that connector for an “emergency button”, so I had
[gcode_button ESTOP_BUTTON]
pin = ^PC6
press_gcode = 
	{action_emergency_stop("'Emergency button pressed!'")}
	RESPOND MSG="Button pressed"

in printer.cfg. If it’s there, again, not all LEDs are working.
Here’s a log with this activated in printer.cfg:
klippy_estop.log (71.7 KB)

Electrical connections to the board are the usual ones:

I'm wondering why all this happens. Does anyone have an idea?

Could be a timing issue due to the long chain. The second log contains a warning that the NeoPixel update did not succeed.

You can try modifying the klippy/extras/neopixel.py file and playing with the BIT_MAX_TIME parameter, for example, setting it to .00003.
After the modification, you will need to restart the entire Klipper host with
sudo service klipper restart
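
For a feel of what that change means on the MCU side, here is a minimal sketch that converts the two values into clock ticks. It assumes a 64 MHz MCU clock and a default of .000004 s; seconds_to_ticks is a hypothetical helper written for this post, not a Klipper function.

```python
def seconds_to_ticks(seconds: float, mcu_freq_hz: int) -> int:
    """Convert a duration in seconds to MCU clock ticks (rounded to nearest)."""
    return int(round(seconds * mcu_freq_hz))


MCU_FREQ_HZ = 64_000_000         # assumption: 64 MHz MCU clock
DEFAULT_BIT_MAX_TIME = 0.000004  # assumption: the stock value, 4 us
RELAXED_BIT_MAX_TIME = 0.00003   # the value suggested above, 30 us

default_ticks = seconds_to_ticks(DEFAULT_BIT_MAX_TIME, MCU_FREQ_HZ)
relaxed_ticks = seconds_to_ticks(RELAXED_BIT_MAX_TIME, MCU_FREQ_HZ)

print("default limit:", default_ticks, "ticks")
print("relaxed limit:", relaxed_ticks, "ticks")
```

Under those assumptions the relaxed value gives each bit 7.5 times more room before the update is abandoned.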


Mmm, well.
The neopixel code can be preempted by interrupts.
That can mess up the timings, and this is what the message is about.

Mmm, if it was a dedicated MCU, or if there is no other activity (homing, probing, moving), it should work.

Disabling the filament sensor just decreases the number of interrupts that have to be handled per unit time.

Hmmm…

114 neopixels.
1250ns per bit, ~800kHz
24 bits per chip.
114 * 24 / 800_000 = 0.00342s
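
The same arithmetic as a small sketch, so other chain lengths can be plugged in. frame_time_s is a hypothetical helper, and it ignores the reset pause and any time lost to interrupts.

```python
def frame_time_s(num_leds: int, bits_per_led: int = 24,
                 bit_time_s: float = 1.25e-6) -> float:
    """Minimum time to clock out one full update of the chain."""
    return num_leds * bits_per_led * bit_time_s


for leds in (30, 60, 114):
    # Print the uninterrupted transmission time in milliseconds.
    print(leds, "LEDs:", round(frame_time_s(leds) * 1000, 3), "ms")
```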

Hmm… well, it could be problematic.

I think it is possible to hack the MCU code, to rebalance the scheduler code, so it will switch back to the “tasks” sooner, where the neopixel handler is executed, and that could allow it to work under load.

Anyway, right now it should work if there is no other activity on the printer.

@Sineos : In my journey I’ve seen at some point also @koconnor suggesting that, so I’ve tried it. No change.
The “Neopixel update did not succeed” message probably appeared at the moment I issued the SET_LED command.

@nefelim4ag : As you can see, there is no activity on the printer, there is only the bare minimal defined in printer.cfg so that Klipper can boot. The only things that are active, in my opinion, are temperature readings (couldn’t start Klipper without an [extruder] section) and CAN-Bus communication.

Try disabling the gcode_button.
It runs timers much more often than the ADC.

Well, because there’s no explanation for the behavior, I don’t even know if it’s a bug or a feature. The workaround was to attach a Pi Zero and let it handle the LEDs alone.

[mcu RP2040]
serial: /dev/serial/by-id/usb-Klipper_rp2040_45533065778AE48A-if00

[neopixel led_strip]
pin: RP2040:gpio12

Well, I did my best above to explain what the problem is.
The code runs, and the protocol is time sensitive.
Interrupt happens, transaction is corrupted.

It is a consequence of the design choices. Not a bug or feature. It is just not intended to actually drive ultra-long neopixels, which require mcu to basically do nothing except drive the neopixel for several milliseconds.
To make it support such long transactions here, the code should be reworked, or the scheduler hacked.
But I am, personally, not sure it is easily possible to hack the scheduler in a way that would make it work on the low-end MCUs (STM32G0B1) in a way that would be unnoticeable to other parts of the code.

On RP2040, on the other hand, you probably would experience it less often, just because it is much faster, and a situation where it would not have enough time between interrupts is less likely.

Hope that clarifies something.

That I understood. But why would a single input change that so dramatically? That’s what puzzles me.

If I understood you correctly, you do not see why any additional thing on the board would mess with the neopixel data transmission.

Well, basically, it is a single-wire protocol. We have a max bit time and defined widths for the zero and one pulses.
As you may notice, one bit takes around 1.25us.
In the klippy code, in neopixel.py, you can see there is a 4us max bit time (.000004s).

From the Features/Benchmark page, you can see that one timer (step pulse) can be executed ~1.1 million times per second (1100k) on the STM32G0B1.

That basically means that in the best case, a single highly optimized timer takes about 0.9us.
An arbitrary timer depends on the environment; it would take longer, more than 1us.

So, it is a probabilistic thing, but it can happen that at the end of a pulse, or between pulses, several of them line up:
a button timer, for example, +>1us, then the CAN IRQ +>1us, then the ADC timer +>1us.

And then we are already pretty close to the default abort time of 4us.

Buttons are queried with a frequency of at least 500Hz (every 2 ms).
ADC is 3.33Hz and runs a 1000Hz timer 8 times (about 3 times a second, it queries the ADC 8 times consecutively, with 1 ms pauses).
CAN IRQ (I don’t really know).

In the case of your example, where you have 114 neopixels and therefore 24 * 114 = 2736 bits/pulses spread out over 0.00342s (actually more, since IRQs will increase the overall time), the probabilities are high enough.
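
A rough sketch of that budget, using the figures from this post. The 1 us cost per interruption is a lower-bound assumption, the names are hypothetical, and real interrupt costs vary.

```python
BIT_TIME_NS = 1250      # nominal time per bit
BIT_MAX_TIME_NS = 4000  # default abort limit per bit
IRQ_COST_NS = 1000      # assumption: at least ~1 us per timer/IRQ


def interruptions_to_abort(bit_max_ns: int, bit_ns: int, irq_cost_ns: int) -> int:
    """Smallest number of interruptions inside one bit that exceeds the limit."""
    slack_ns = bit_max_ns - bit_ns
    return slack_ns // irq_cost_ns + 1


def expected_events(rate_hz: float, window_s: float) -> float:
    """Average number of periodic events that fall inside a time window."""
    return rate_hz * window_s


frame_s = 114 * 24 * BIT_TIME_NS * 1e-9  # one full update of 114 LEDs
print("interruptions needed:",
      interruptions_to_abort(BIT_MAX_TIME_NS, BIT_TIME_NS, IRQ_COST_NS))
print("button polls per frame at 500 Hz:",
      round(expected_events(500, frame_s), 2))
```

With these numbers, three interruptions landing in the same bit are enough to abort, and a 500Hz button timer fires more than once during every full update, so a collision only needs a couple of other events to line up with it.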

Hope that clarifies things a little.


Wow! Thank you for your thorough and extensive explanation, and for your time! Wish that my teachers in school had such patience :slight_smile:
My test setup is still on the table at my office, I will try tomorrow to activate more inputs, to see how it behaves. The current strip caps at 51 neopixels, let’s see if with more timers it gets lower. Just for fun.

Did the just-for-fun tests. Activated two thermistors and four buttons in the config. No change.
Interestingly enough: I set the strip to be red initially. If I set it to blue, as I said, ~51 LEDs change color. The same when setting it to green. But if I set two of the R / G / B values at once

SET_LED LED=led_strip GREEN=0 RED=1 BLUE=1

(yellow, magenta, cyan) only about 40 change colours.
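
One measurable difference between these cases is how many 1 bits each LED's data contains. The sketch below only counts them; it assumes an 8-bit round-to-nearest conversion of the 0..1 values, uses hypothetical helper names, and does not establish that the bit count is what shortens the updated section.

```python
def to_byte(value: float) -> int:
    """Assumed conversion of a 0..1 color value to an 8-bit level."""
    return int(value * 255.0 + 0.5)


def one_bits_per_led(red: float, green: float, blue: float) -> int:
    """Count the 1 bits in the 24 bits sent for one LED."""
    return sum(bin(to_byte(c)).count("1") for c in (red, green, blue))


print("BLUE=0.502 only:", one_bits_per_led(0, 0, 0.502))
print("RED=1 BLUE=1:   ", one_bits_per_led(1, 0, 1))
```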


Your long chain still makes me puzzled sometimes.
From this GitHub message.
There is a nice explanation of the protocol: NeoPixels Revealed: How to (not need to) generate precisely timed signals | josh.com

Probably, the hack with max bit timings really should help.
It is strange that it does not.

As you do have a separate MCU for the chain, you can probably also hack the neopixel.c:

    neopixel_time_t bit_max_ticks = n->bit_max_ticks;
    while (data_len--) {
        uint_fast8_t byte = *data++;
        uint_fast8_t bits = 8;
        while (bits--) {
            if (byte & 0x80) {
                // Long pulse
                neopixel_delay(last_start, BIT_MIN_TICKS);
                irq_disable();
                neopixel_time_t start = neopixel_get_time();
                gpio_out_toggle_noirq(pin);
                // irq_enable(); <- disable like that

                // And this one is disabled,
                //if (neopixel_check_elapsed(last_start, start, bit_max_ticks))
                //    goto fail;
                last_start = start;
                byte <<= 1;

                neopixel_delay(start, PULSE_LONG_TICKS);
                // irq_disable(); <- disable like that
                gpio_out_toggle_noirq(pin);
                irq_enable();

                neopixel_delay(neopixel_get_time(), EDGE_MIN_TICKS);
            } else {

So, if you ensure the bit width this way, it would probably help, and the max bit time hack should then help as well.

Thanks.

As I read the code snippet above, the WS2812B data is actually bit-banged.

Couldn’t it be implemented the way the FastLED library does it?
For the RP2040, for example, this library AFAIK uses the PIO hardware; for the ESP32 you can use the RMT hardware to offload transmission from the MCU.

There are loads of MCUs (I do not know all of them) which have such support. Maybe for certain processors it could be ported to Klipper?

Well, there are 2 caveats that I can think of:

  • This is a generic code; it would be overkill to reimplement neopixel for every MCU type.
  • Klipper is 3d printer firmware, not the LED driver firmware :smiley:

So, ultimately, someone can hack their MCU to offload work to some HW controller. That would ultimately “fix” the problem.

But every time I think about it, I end up arguing with myself. The argument goes like:
“Let’s mess with architecture and add complexity, so on a 3d printer, the long neopixel string would work better!”
It sounds a bit off to me.

I can imagine how important or useful they could be. I did see the examples like:


Where it is used to indicate machines on the farm.

The generic fix that I could imagine without fancy HW accelerators:
Extend the irq_poll(), so we can disable IRQs in the time-sensitive section (I would imagine that we do have ~5us 1 bit width, and probably 40us pause between bits). Then we can call irq_poll() inside the busy wait loop, 1 timer per cycle. That would generally decrease the likelihood of timeouts and increase overall throughput, I think (as long as irq_poll() < 5us).

Similar idea, but in a different way, signal to the timer dispatch that we need control back for a short period of time before time t.

Or even more generic probably, there should be a way to tune the:
#define TIMER_MIN_TRY_TICKS timer_from_us(2), so it would not be a constant, but the sum of the time needed to exit the timer code + interrupt + reenter.
If the MCU could do that in less than 2us, the task code (neopixel in this case) would get to make progress more frequently, and timeouts would be less likely.
Like it is done for AVR.

I suspect it would be a bit of work to write the scheduler’s autotune or test each MCU and define a time for it.

To sum up, I’m not against changes here; they probably can be done. But I’m leery of blind use of external libraries or overcomplicating the code to simply drive LEDs.

Thanks.


Okay, I’m not sure how stupid it is, but I’ve rewritten it.

Alas, I do not have Neopixels to test it.
I think I have wrapped my head around it, and this is just a rebalancing of where the code spends its time.
I don’t know if it would help or not, actually.

It needs to be tested on a slow MCU with a long chain to basically verify my assumptions.

Thanks.


Well, I still have the spare Manta E3EZ with the STM32G0B1RE 64MHz MCU (I don’t know if it counts as “slow”) and a 100-150 chain of WS2812B. If you think I can help, just tell me what to do.

I’m no coder nor am I familiar with github. Are your changes merged in the “official” Klipper repository, or is it in your own fork? (Sorry, I’m a noob)


According to my small tests: Tune TIMER_MIN_TRY_TICKS
It is slow :smiley:

There are 2 patches actually. One is a small optimization that can decrease the load from chain updates such as progress bar updates (update pixels only up to and including the last changed one).
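
To illustrate the idea behind the first patch (this is not the patch itself, just a sketch with a hypothetical helper and 3 bytes per LED assumed): because the data is shifted through the chain from the first LED onward, LEDs after the last changed one can be left alone and only the leading part needs to be resent.

```python
def prefix_to_send(old: bytes, new: bytes, bytes_per_led: int = 3) -> bytes:
    """Return the leading part of `new` up to and including the last changed LED."""
    if len(old) != len(new):
        return new  # chain length changed: resend everything
    last_changed = -1
    # Scan from the end to find the last byte that differs.
    for i in range(len(new) - 1, -1, -1):
        if old[i] != new[i]:
            last_changed = i
            break
    if last_changed < 0:
        return b""  # nothing changed, nothing to send
    led_count = last_changed // bytes_per_led + 1
    return new[:led_count * bytes_per_led]


old = bytes(9)                              # three LEDs, all off
new = bytes([0, 0, 0, 255, 0, 0, 0, 0, 0])  # only the second LED changes
print(len(prefix_to_send(old, new)), "of", len(new), "bytes to send")
```

The shorter the transmission, the smaller the window in which an interrupt can break it.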

The second is the above hack.

Anyway, to test it out, there is: Testing Klipper Pull Requests
Or, probably shortcut git fetch --all, git checkout d2d8f0a or git cherry-pick d2d8f0a

Because this is an MCU side change, this requires reflashing the MCU afterwards.
The question is, would it allow us to update the pixels significantly further than the current master?

So, in a perfect scenario, there are 2 tests, one before and one after the patch.

Thanks!


PR stands for Pull Request; generally read it as “suggested changes”. If it is merged, they are accepted. If it is in any other state, they are in this “purgatory”.

As they are visible under the Klipper GitHub repository, they are “suggested changes” to this repository.


I would even say that, generally, it is best to think about PRs as “Hey, look what I made, maybe you can find this useful!” - so suggested/proposed changes for reason “X”.


Sorry for the delay, I had to find some time - and some space on the work desk, that was harder. Wanted to test your PR, but I saw that you closed it :slight_smile:
