MCU shutdown after a couple of hours

Basic Information:

Printer Model: Zero G Nebula
MCU / Printerboard: SKR EZ V3 with BTT PI V2
Host / SBC
klippy.log
klippy (8).zip (1.8 MB)


Describe your issue:

Hi all,

At a loss here and looking for assistance.
Between one and two hours into a print, I get the MCU communication error, every time.
I can't finish a print.

This usually happens when I am printing ABS in long runs of 4+ hours.
The failure usually happens around 2 hours in.

I ran Graph Klipper Stats, and it seems that my MCU is booting in and out for the entire print and at some point decides to shut down.
In the graph, the load seems to go above 250%.




If I look at the Klippy log, I can see that the MCU is awake and then resets.


Maybe I am looking at this wrong but this is how I am interpreting the data.

Running the following setup:

  • SKR mini 24V powered by meanwell PSU
  • BTT PI V2 powered by meanwell PSU
  • 2x BigTreeTech TMC 5160T Plus for the X and Y motors, powered by a separate 48V PSU
  • EBB36 running CANBUS
  • Beacon

The system is running on CANBUS.
The CANBUS cable to the EBB36 is shielded, and the shield is connected to the Meanwell -24V.

Your log only contains the MCU 'mcu' shutdown: Missed scheduling of next digital out event error, but not the one from your screenshot. In any case, both typically have quite similar reasons. Usually and unfortunately, they are tedious to diagnose, often due to subtle hardware instabilities or effects from third-party modifications.

See:

Edit:
It makes sense to remove the modifications and upgrade Klipper to the latest Git version, as it contains extended CAN diagnostics.

A new log would be required after any changes.


Hi Sineos,

You are correct.
I get both messages regarding the MCU. I must have mixed up the logs, but the behavior is the same.
I upgraded everything to the latest versions and disabled KAMP to see if that was the issue, but unfortunately the behavior is the same.

I am starting to think that the 5160T Pros are the issue here, as I also spotted two undervoltage alarms in the logs on X and Y.
Going to double check the wiring.

Thank you for thinking with me.

Rechecked all wiring. There was one suspect connection on the 5160T, so I replaced the ferrule of a motor wire to the X axis.
Did another print which failed again an hour in with an EBB CAN error.

Load utilization is still all over the place:


klippy (9).log (6.6 MB)

Klippy notes:
No undervoltage alarm! So that ferrule must have caused some issues.

So ABS runs with a hot bed and a hotter hotend. When you run a colder material, does it run ok past the 2 hour mark every time?
I am asking because I once had an issue with layer shifts at 1 to 2 hours in, and it was the heat building up through all hardware and doing its worst on a stepper driver. Maybe you are facing something similar now, but it is affecting another element? If you can categorically eliminate heat, you are one step closer, I think.

I have two separate stepper drivers for X and Y that are 48V fed.
All electronics except the EBB, hotend, Beacon and the actual motors are in the enclosure.

The rest is nice and cool:

I do see the issue more often and more consistently with ABS and a hot enclosure, though.

In that case I suspect the EBB, which is probably mounted right against the extruder stepper which also gets hot?
The coil cannot be the issue I think, as you are not using it during printing.

This is not indicative of an issue in the first place.
What is an issue is that, according to the log, you have bytes_retransmit as well as bytes_invalid in your communication.

This points to either hardware issues or potentially kernel issues on the host. More on this here.

I suspect the EBB and CANBUS system as well

I kept an eye on whether bytes_invalid showed anything other than zero (0).
But if bytes_retransmit shows a value other than zero, does that mean there is an issue?

I think the "overload" that you see on the screen is indeed not that indicative, as the CB2 that I am using has 4 cores, so it might show more than 100%.

Now running a script to see the memory and CPU values live from the host CB2.

Look into your klippy.log. The above is not relevant.

I indeed see it now.
bytes_retransmit slowly creeps up from 0 into the thousands during printing.
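For anyone who wants to watch the same counters without scrolling through the log by hand, here is a minimal sketch in Python. It assumes the periodic statistics lines in klippy.log start with `Stats <time>:` and contain one `name: key=value ...` section per MCU; the sample lines and the MCU name `EBBCan` are made up for illustration and are not taken from the attached logs.

```python
import re

# Assumed shape of a klippy.log statistics line (illustrative only):
#   Stats 100.0: gcodein=0  mcu: ... bytes_retransmit=0 bytes_invalid=0 ... EBBCan: ...
STATS_RE = re.compile(r"^Stats (\d+(?:\.\d+)?):")
SECTION_RE = re.compile(r"(\w+): ")
FIELD_RE = re.compile(r"(bytes_retransmit|bytes_invalid)=(\d+)")


def retransmit_history(lines):
    """Return {section_name: [(time, bytes_retransmit, bytes_invalid), ...]}."""
    history = {}
    for line in lines:
        match = STATS_RE.match(line)
        if not match:
            continue  # not a statistics line
        timestamp = float(match.group(1))
        # re.split with one capture group yields [prefix, name1, text1, name2, text2, ...]
        parts = SECTION_RE.split(line[match.end():])
        for name, text in zip(parts[1::2], parts[2::2]):
            fields = dict(FIELD_RE.findall(text))
            if "bytes_retransmit" not in fields:
                continue  # e.g. heater sections carry no communication counters
            history.setdefault(name, []).append(
                (timestamp, int(fields["bytes_retransmit"]), int(fields.get("bytes_invalid", 0)))
            )
    return history


def retransmit_growth(history):
    """Difference between the last and first bytes_retransmit value per section."""
    return {name: rows[-1][1] - rows[0][1] for name, rows in history.items()}


# Hypothetical sample lines, shortened to the fields that matter here.
sample = [
    "Stats 100.0: gcodein=0  mcu: mcu_awake=0.002 bytes_retransmit=0 bytes_invalid=0 "
    "EBBCan: mcu_awake=0.010 bytes_retransmit=9 bytes_invalid=0 heater_bed: target=100 temp=100.1",
    "Some unrelated log line",
    "Stats 200.0: gcodein=0  mcu: mcu_awake=0.002 bytes_retransmit=0 bytes_invalid=0 "
    "EBBCan: mcu_awake=0.010 bytes_retransmit=1509 bytes_invalid=0 heater_bed: target=100 temp=100.0",
]
print(retransmit_growth(retransmit_history(sample)))
```

To run it against a real log, pass an open file instead of `sample`, for example `retransmit_history(open("klippy.log", errors="replace"))`. A counter that keeps growing during a print is the pattern described above.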

So it is the EBB36 causing the issue here. Interesting.
What could that be?

Without wanting to sound rude, it would really help if you read the provided information and follow it. There is unfortunately no “press button A solution”.

I understand. I was thinking out loud in my last comment, not asking for a "press button A" solution, but I can understand how that came across. Anyway, I am reading up on your link on the Klipper site :+1:
Do appreciate the assistance!

Could you please post a picture of how you mounted your EBB?

sure!




Found a discrepancy:

Ran a query based on the documentation from the link Sineos sent.

The query was to check qlen, which should be no more than 128.
The result shows qlen 1024. Let's see if that can be fixed.
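The same value can also be read from a script. The sketch below assumes a Linux host where the interface is called `can0` and reads the kernel's `tx_queue_len` attribute under `/sys/class/net`. The 128 limit is the value cited in this thread from the Klipper documentation; the function names are my own.

```python
from pathlib import Path

# Value this thread cites from the Klipper CAN bus troubleshooting documentation.
RECOMMENDED_QLEN = 128


def read_tx_queue_len(iface="can0", sysfs_root="/sys/class/net"):
    """Read the transmit queue length the kernel currently uses for `iface`."""
    return int((Path(sysfs_root) / iface / "tx_queue_len").read_text().strip())


def check_qlen(qlen, recommended=RECOMMENDED_QLEN):
    """Return a short human-readable verdict for a queue length."""
    if qlen == recommended:
        return f"qlen {qlen}: OK"
    return f"qlen {qlen}: expected {recommended}"


if __name__ == "__main__":
    try:
        print(check_qlen(read_tx_queue_len()))
    except FileNotFoundError:
        print("can0 not found; is the CAN interface up on this host?")
```

With the configuration described here, this would report `qlen 1024: expected 128` before the fix.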


Thanks. I’ll get back on the weekend, cause I would like to go deeper on that (time).

Thanks! much appreciated!

I think my problem has been solved.
It was indeed the bytes_retransmit issue: the count kept increasing until the buffer was full, and the MCU eventually shut down.

I ran the same 4-hour print and it finished without issues.

I checked the log: bytes_retransmit no longer increases.

What did I do?
Solution:
The CAN0 file was set to the following:

allow-hotplug can0
iface can0 can static
bitrate 1000000
up ifconfig $IFACE txqueuelen 1024

According to the Klipper troubleshooting documentation, this should not be set to 1024 but to a maximum of 128.

I SSH'd into the host, looked up the file and changed the text to the following, according to the Klipper troubleshooting documentation:

allow-hotplug can0
iface can0 can static
bitrate 1000000
up ip link set $IFACE txqueuelen 128

The above change seems to have solved my issue.
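If you want to check your own interface file for the same setting, a small sketch like the following can help. It only scans the text for a `txqueuelen` value, and the 128 limit is the one quoted in this thread. The path in the usage note is an assumption about where the file usually lives and may differ on your image.

```python
import re

MAX_QLEN = 128  # limit quoted in this thread from the Klipper documentation

TXQ_RE = re.compile(r"\btxqueuelen\s+(\d+)")


def find_txqueuelen(config_text):
    """Return the first txqueuelen value found in the config text, or None."""
    match = TXQ_RE.search(config_text)
    return int(match.group(1)) if match else None


def config_is_ok(config_text, max_qlen=MAX_QLEN):
    """True when a txqueuelen is set and does not exceed max_qlen."""
    qlen = find_txqueuelen(config_text)
    return qlen is not None and qlen <= max_qlen


# The two variants of the file as posted in this thread.
old_config = """allow-hotplug can0
iface can0 can static
bitrate 1000000
up ifconfig $IFACE txqueuelen 1024
"""
new_config = """allow-hotplug can0
iface can0 can static
bitrate 1000000
up ip link set $IFACE txqueuelen 128
"""
print(find_txqueuelen(old_config), config_is_ok(old_config))
print(find_txqueuelen(new_config), config_is_ok(new_config))
```

On a real host you could feed it the file contents, for example `config_is_ok(open("/etc/network/interfaces.d/can0").read())`, assuming that is where your can0 file is stored. The interface has to be brought down and up again (or the host rebooted) before a changed value takes effect.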
Lucky me, because I was about to pull the CANBUS and run the hotend wired to the mainboard. I already had the wiring ready for it, haha! Guess I get to use that for a new build.

Thanks @Sineos for pointing me in the right direction!!!

I hope everyone that sees this thread and has the same issue can fix it now.
It was a very frustrating issue to find.