How do I figure out why the MCU was disconnected?

See the issue I created for full details.

My printer disconnected suddenly. Any tips on how to figure out why?

thermal-runaway-ender.log.zip (468.5 KB)

To address issues in the FAQ:

  • Data USB cable, power supply, and power cable are the same I’ve used for years. This is the first time I’ve had this happen.
  • I can’t think of any reason any of the printer wires would be messed up (especially such that it doesn’t show up for 3 hours), but I’ll take a closer look before printing again.

Edit: It looks like the MCU overheated

Hello @jdanders !

Those things usually do not cause a thermal runaway issue.

It’s not obvious whether the issue appeared with the bed or the hotend.
Check the following for both:

  1. Has the thermistor detached from its position?
  2. Is there a loose cable or a broken wire on the thermistor?
  3. Does the thermistor work properly?
  4. For the hotend thermistor: has the part cooling fan moved so that it cools the thermistor too much?
  5. Is the heater wiring OK (no loose screws, no wire breaks)?
  6. Are the MOSFETs doing their jobs?

Thanks @EddyMI3D.

The reason I think it’s related to the MCU disconnect is that the disconnect was the last useful message in the log:

Timeout with MCU 'mcu' (eventtime=9908.386273)
Transition to shutdown state: Lost communication with MCU 'mcu'
...
Stats 9908.4: gcodein=2318820 mcu: mcu_awake=0.086 mcu_task_avg=0.000248 mcu_task_stddev=0.000241 bytes_write=10671832 bytes_read=2228163 bytes_retransmit=703 bytes_invalid=4394 send_seq=229431 receive_seq=229426 retransmit_seq=229431 srtt=0.003 rttvar=0.000 rto=3.200 ready_bytes=614 stalled_bytes=5582 freq=16000046 heater_bed: target=100 temp=99.9 pwm=1.000 sysload=0.36 cputime=333.872 memavail=1749128 print_time=9901.681 buffer_time=2.395 print_stall=0 extruder: target=245 temp=244.7 pwm=0.709
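To make a Stats line like the one above easier to read, here is a minimal Python sketch that splits it into sections. It is not part of Klipper; `parse_stats` and `LINE` are hypothetical names, and `LINE` is an abbreviated copy of the log line above. The sketch assumes a token ending in `:` starts a new section, so host-wide fields that follow a heater section (such as `sysload` in the full line) would end up grouped under that heater.

```python
# Abbreviated copy of the Stats line from the log above.
LINE = ("Stats 9908.4: gcodein=2318820 mcu: mcu_awake=0.086 "
        "bytes_retransmit=703 bytes_invalid=4394 rto=3.200 "
        "heater_bed: target=100 temp=99.9 pwm=1.000 "
        "extruder: target=245 temp=244.7 pwm=0.709")


def parse_stats(line):
    """Split a 'Stats' line into {section: {key: value}} (illustrative only)."""
    head, _, rest = line.partition(": ")
    stats = {"eventtime": float(head.split()[1])}
    section = "host"  # fields before the first 'name:' token
    for token in rest.split():
        if token.endswith(":"):
            section = token[:-1]  # e.g. 'mcu:', 'heater_bed:', 'extruder:'
            continue
        key, sep, value = token.partition("=")
        if not sep:
            continue  # skip anything that is not key=value
        try:
            parsed = float(value)
        except ValueError:
            parsed = value  # keep non-numeric values as text
        stats.setdefault(section, {})[key] = parsed
    return stats


if __name__ == "__main__":
    stats = parse_stats(LINE)
    print("retransmitted bytes:", stats["mcu"]["bytes_retransmit"])
    print("invalid bytes:", stats["mcu"]["bytes_invalid"])
    for heater in ("heater_bed", "extruder"):
        print(heater, "pwm:", stats[heater]["pwm"])
```

Running the same function over every Stats line in klippy.log and watching `bytes_retransmit` and `bytes_invalid` over time is one way to see whether communication was degrading before the timeout.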

The final control state had the bed pwm at 1 and the extruder pwm at 0.709, which made things get toasty. So I think the hardware just kept the last state, with the heaters turned on. Once I restarted the Klipper firmware, temperature readings and control returned and things worked as expected.

A thermal runaway caused by a thermistor or heater fault would have started reporting a wrong temperature, and it could only have been cured by fixing that fault, not by restarting the firmware.

So I don’t have any evidence that anything malfunctioned with the thermistor or related wiring. See the issue that @Sineos referred to in my GitHub issue: Heaters Stay on after "Move exceeds maximum extrusion" with hotend melting the part. · Issue #3666 · Klipper3d/klipper · GitHub

As far as I know, the Klipper code works as intended. The heater checks described at Configuration checks - Klipper documentation are designed to exercise this code on every printer. I recommend you rerun these tests on your printer (in a controlled setting where you can safely remove power should a failure occur). Similarly, you may want to try to recreate the specific failure scenario - for example, by turning on a heater and unplugging the USB cable to simulate a disconnect.

There are a number of subtle hardware failures that can result in a heater mosfet being stuck on (in such a way that the software can not disable them). You may wish to investigate that before continuing to use the hardware.

-Kevin

Even though USB uses differential signaling, it is not truly differential in practice: it uses low voltages, and the data lines still depend on the absolute potential. So the signal can easily be disturbed by noise, which means that power spikes or a bad GND connection can corrupt it.
My favorite connection is a USB cable with the +5V line unconnected or cut, because you don’t want to transfer power from the printer mainboard to the host (or the other way). Possibly also add an additional GND cable (from printer mainboard to host).
Don’t let the USB cable run parallel to the heaters’ power cables. The large current changes there can induce a lot of noise.
Sadly, this is only one of many things that could go wrong, but noise problems appear really often and are hard to track down without the background knowledge.

And of course, always have a look at dmesg.

SSH to the host and run:

sudo dmesg -wH

It will show you what’s happening with your hardware :slight_smile:

Hope I could help.
Best regards and good luck!

There are a number of subtle hardware failures that can result in a heater mosfet being stuck on (in such a way that the software can not disable them). You may wish to investigate that before continuing to use the hardware.

But the logs show that the MCU software did not try to disable them. It stopped responding to the host. And apart from that, BOTH the bed and hotend mosfets got stuck at the same time? Right after or at the same time the MCU becomes unreachable from the host? How could this happen, even in the case of whatever odd or subtle hardware failure or user error? That seems very unlikely.

Let’s go through some scenarios, starting with the hardware issues EddyMI3D mentioned:

  1. Has the thermistor detached from its position?

Both thermistors, the one on the bed and the one on the hotend, at the same time? Very, very unlikely.
And even if they did, there is logic inside Klipper to detect that. If it drives the PWM and the thermistor temperature values don’t increase, Klipper will do a safety shutdown.

  2. Is there a loose cable or a broken wire on the thermistor?

See point 1.

  3. Does the thermistor work properly?

See point 1.

  4. For the hotend thermistor: has the part cooling fan moved so that it cools the thermistor too much?

Shouldn’t matter. I’ve actually had that issue, and Klipper correctly did a safety shutdown because it noticed it was heating the hotend but the hotend wouldn’t get hotter.

  5. Is the heater wiring OK (no loose screws, no wire breaks)?

See point 1.

  6. Are the MOSFETs doing their jobs?

See point 1. If they do not switch on, Klipper should detect it; the same applies if they get stuck in the on position.


Or maybe it was USB communication issues, as masterq wrote?

Not possible. Although indeed USB and the serial protocol do not guarantee error-free communication, the communication layer above (Klipper’s protocol) seems designed and implemented properly with checksumming, sequence numbers, retransmits etc. If something went wrong there, Klipper would detect it. Also, even if Klipper was unable to detect such issues, it’s unlikely that whatever random garbage bytes arrive at the MCU get interpreted as “run both mosfets with fixed PWM values forever”.
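To illustrate why random garbage on the wire is unlikely to be accepted as a valid command, here is a toy Python sketch of a frame protected by a length byte, a sequence number, and a 16-bit CRC. This is not Klipper’s actual wire format or code; the frame layout and the names `encode_frame` and `decode_frame` are assumptions made for the example. It uses `binascii.crc_hqx` from the Python standard library as the checksum.

```python
import binascii


def encode_frame(seq, payload):
    """Build a toy frame: [length][seq][payload...][crc_hi][crc_lo]."""
    body = bytes([len(payload) + 4, seq & 0x0F]) + payload
    crc = binascii.crc_hqx(body, 0xFFFF)  # 16-bit CRC over header + payload
    return body + crc.to_bytes(2, "big")


def decode_frame(frame, expected_seq):
    """Return the payload, or None if the frame must be rejected."""
    if len(frame) < 4 or frame[0] != len(frame):
        return None  # truncated frame or wrong length byte
    body, crc = frame[:-2], int.from_bytes(frame[-2:], "big")
    if binascii.crc_hqx(body, 0xFFFF) != crc:
        return None  # corrupted in transit
    if body[1] != (expected_seq & 0x0F):
        return None  # out of order; the sender would have to retransmit
    return body[2:]


if __name__ == "__main__":
    good = encode_frame(5, b"set_pwm")
    print(decode_frame(good, 5))

    # Flip one bit in the payload, as line noise might.
    noisy = bytearray(good)
    noisy[3] ^= 0x01
    print(decode_frame(bytes(noisy), 5))
```

In a scheme like this, a rejected frame is simply never acknowledged, so the sender retransmits it. That matches the `bytes_retransmit` and `bytes_invalid` counters in the log: corruption shows up as retries, not as wrong commands.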

So long story short, Klipper obviously has some well-designed and working safety features and is able to detect when something isn’t right and also monitors the connection between MCU and host computer.

Now what other things could possibly have led to both heaters staying on forever?

Quoting OP jdanders from the GitHub issue:

Since bed pwm was a 1 and extruder pwm was 0.709 things got toasty. I think the issue I want to raise is this:
If the mcu gets disconnected from the Pi, shouldn’t the mcu turn off the heaters?

That’s a valid question IMHO. There is a feature inside Klipper that shuts down the MCU and heaters in case the MCU cannot talk to the Pi anymore. Why did that not work as designed? What odd or subtle hardware failure scenarios could have possibly led to this?
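The general idea behind such a feature is a dead-man timer: a heater output only stays on while it keeps getting refreshed. The Python sketch below models that idea only; it is not Klipper’s implementation, and `HeaterOutput`, `max_duration`, `set_duty`, and `tick` are hypothetical names chosen for the example.

```python
class HeaterOutput:
    """Toy model of a heater output that turns itself off if not refreshed."""

    def __init__(self, max_duration):
        self.max_duration = max_duration  # seconds a setting stays valid
        self.duty = 0.0
        self.deadline = None  # no deadline while the heater is off

    def set_duty(self, now, duty):
        """Called whenever the host sends a new PWM value."""
        self.duty = duty
        self.deadline = now + self.max_duration if duty > 0 else None

    def tick(self, now):
        """Called periodically by the firmware; True means a safety cut-off."""
        if self.deadline is not None and now >= self.deadline:
            self.duty = 0.0
            self.deadline = None
            return True
        return False


if __name__ == "__main__":
    bed = HeaterOutput(max_duration=5.0)
    bed.set_duty(now=0.0, duty=1.0)
    bed.set_duty(now=4.0, duty=1.0)   # host refreshes in time
    print(bed.tick(now=8.9), bed.duty)  # still on
    print(bed.tick(now=9.0), bed.duty)  # host went silent: cut off
```

Note the limitation that matters for this thread: `tick` is firmware code. If the CPU itself stops executing while the PWM hardware keeps running, nothing calls it, and the output keeps its last value.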

From my experience with microcontrollers, the error jdanders reported could have happened like this:

  • The main CPU part of the microcontroller chip running the Klipper code crashed or got stuck for whatever reason: a brownout, a subtle hardware failure, bit flips caused by running high-current PWM cables near the electronics, software bugs, …
  • The brownout detection and/or watchdog did not reset the chip.
  • Thus, the PWM circuitry kept running with the last active PWM values.

I actually lost a model airplane some years ago to exactly this brownout scenario. A brownout occurred and the ATmega328 microcontroller responsible for the remote control crashed, but PWM was still active with the last steering values, which were “fly straight ahead”. And that is what it did until the battery ran out many kilometers away. I never found the plane …

However, looking at the logs jdanders posted on GitHub, this doesn’t seem to have been the case here, because the MCU became reachable again on its own, if I read them correctly. That is, jdanders did not power-cycle the MCU.

So, here, the MCU must either have:

  • Reset itself through brownout detection and/or the watchdog. However, this did not happen, as the heaters stayed on. A watchdog or brownout reset would have turned the PWM circuitry off as well.

  • Not reset itself and been in some kind of state that kept PWM on for whatever reason. But such a state should never occur, assuming there are no software bugs and brownout detection and the watchdog are working correctly.

So again, what was it that caused this? In theory, this should never happen. But it happened.

Thanks for the comments, all! After some experiments and observations, I think what happened was warmer weather. I started messing around with printing ABS again this winter and built an enclosure around the entire printer with a 100W heater and ds18b20 sensor for climate control. There is a gap for fresh air near the front (heater is in the back), which I was hoping would be enough to keep the control electronics cool. The printer lives in the garage, where ambient temperatures rarely got above 50F. This last week was warmer and the garage was around 55-65F.

The print that failed had the enclosure temp set to 38C, and failed after 2.75 hrs. I tried another print two days later (after lots of investigation of HW) without the enclosure actively heated (enclosure temp still reaches ~33-34C) and it failed as well in the exact same way after about an hour. I then took down the enclosure and printed in open air and the print continued successfully for 15.5 hrs until I stopped it because my ABS layers delaminated because of the cold.

So I’m assuming the air in the enclosure was too warm and overheated the MCU. I must have been running really close to the threshold during the winter, because the ambient air was only 10-20F warmer than during similar successful prints. I don’t know how an MCU like that fails when it overheats, but I assume it just stopped processing code, which would leave the HW controls stuck at the last instruction, meaning there’s nothing Klipper could do in this situation.
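Since the enclosure temperatures above are in Celsius and the garage temperatures are in Fahrenheit, here is a small Python sketch that puts the numbers from this post on one scale. The helper names are made up for the example. Note that a temperature difference converts without the 32-degree offset.

```python
def f_to_c(temp_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def delta_f_to_c(delta_f):
    """Convert a temperature difference from Fahrenheit to Celsius degrees."""
    return delta_f * 5 / 9


if __name__ == "__main__":
    for garage_f in (50, 55, 65):
        print(f"garage {garage_f}F = {f_to_c(garage_f):.1f}C")
    for delta_f in (10, 20):
        print(f"{delta_f}F warmer = {delta_f_to_c(delta_f):.1f}C warmer")
```

So the "10-20F warmer" margin is roughly 5.6 to 11.1 Celsius degrees, which supports the guess that the winter prints were already close to the limit.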

My next step to print with the enclosure is to print an air duct to be sure the MCU is sucking outside air only. And maybe with the threat of fire, I’ll actually put more effort into the enclosure than just using an old Amazon cardboard box :fire::fire::fire: :melting_face:.

This is an interesting observation. Yes, it came back on its own, so maybe my conclusion that Klipper couldn’t do anything about it is not accurate. What would the MCU be doing between loss of communication (while running warm) and the firmware reset command?

Nothing changed between the overheat condition and issuing the firmware reset (the enclosure was still maintaining temperature). So apparently the MCU was healthy enough to respond to the firmware reset command. If the MCU just missed a beat, or reset itself, then it seems like it should have been able to detect the runaway. Is there anything after a firmware reset that shuts off the PWM circuitry? (@enderbender’s comment seems to imply there is.) That seems like it would be a good idea. If not, it’s back to the idea that if the MCU is not seeing external commands, it should turn off the heaters!

If it was overheating, the watchdog should’ve triggered eventually which should have reset everything (the main CPU as well as the PWM circuitry) automatically.

Maybe check the MCU temperature. I’m not sure if your MCU supports it, but you could try something like this in printer.cfg:

[temperature_sensor mcu_temp]
sensor_type: temperature_mcu
min_temp: 0
max_temp: 100

Unfortunately not:
Error: MCU temperature not supported on atmega1284p

I just tried this. I set the extruder and bed targets to 100C and pulled the USB cable once the extruder had reached temperature and the bed was at 52C. After I pulled it, the display on the printer kept showing 100/52. I plugged the cable back in and nothing changed. After the firmware restart, the extruder registered 32C and the bed 41C, so apparently the heaters had turned off.

And to top it all off, I learned today that the motherboard fan on the Ender-3 is hardwired to the part fan. I leave the part fan off for ABS printing hoping to get better layer adhesion. So I’ve been doing all my prints without any MCU cooling. Nice.

In the meantime I’ve looked at the watchdog code. If I read it correctly, the watchdog is fed from a task that is run by sched.c. Unfortunately, that does not cover all the cases that could go wrong: for example, if the code in the other tasks is still running (and finishing in time), the watchdog will get tickled and no MCU reset will be issued.

It would be better if the watchdog tickling were tied to some internal variable or other event that always occurs, or to some other sanity check, so that cases like the one above would have the watchdog reset the MCU.
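Here is a small Python simulation of that suggestion. It is not Klipper’s watchdog code; `Watchdog`, `simulate`, and all the timing values are assumptions picked for illustration. It compares a watchdog that is fed unconditionally on every scheduler pass with one that is only fed while a sanity check (here: "a host message arrived recently") still passes. Times are integer milliseconds to keep the comparisons exact.

```python
class Watchdog:
    """Toy watchdog: expires if it is not fed within timeout_ms."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_fed = 0

    def feed(self, now):
        self.last_fed = now

    def expired(self, now):
        return now - self.last_fed > self.timeout_ms


def simulate(gated, host_alive_until, steps=100, dt=100,
             wd_timeout=500, host_timeout=1000):
    """Return the time (ms) of the watchdog reset, or None if it never fires."""
    wd = Watchdog(wd_timeout)
    last_host_msg = 0
    for i in range(1, steps + 1):
        now = i * dt
        if now <= host_alive_until:
            last_host_msg = now  # host traffic is still arriving
        healthy = now - last_host_msg <= host_timeout
        if not gated or healthy:
            wd.feed(now)  # ungated: fed as long as the task loop runs
        if wd.expired(now):
            return now
    return None


if __name__ == "__main__":
    # Host traffic stops at t = 2000 ms, but the task loop keeps running.
    print("ungated feed:", simulate(gated=False, host_alive_until=2000))
    print("gated feed:  ", simulate(gated=True, host_alive_until=2000))
```

With the ungated feed the watchdog never fires, because the loop that feeds it is still alive. With the gated feed, the sanity check starts failing once the host has been silent for longer than `host_timeout`, feeding stops, and the watchdog resets the chip shortly afterwards.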

As luck would have it, I had the exact same issue with the exact same chip yesterday (ATmega1284p, on an Anet A8 board in an Ender 3 clone that still works surprisingly well). The print stopped, and the hotend and bed maxed out at 280C and 110C respectively. I was also looking at the logs and figured out the same thing. Today I was here when it happened again, and I noticed something in the logs: right before it disconnected, it failed to receive bytes. This also happened during a longer print, at around the one-hour mark. I did several small prints afterwards and nothing happened; it only occurs on longer prints. I have now added a heatsink to the ATmega chip and upgraded the fan in the box. I’ve started a longer print and will see what happens.