There are a number of subtle hardware failures that can result in a heater mosfet being stuck on (in such a way that the software can not disable them). You may wish to investigate that before continuing to use the hardware.
But the logs show that the MCU software did not try to disable them. It stopped responding to the host. And apart from that, BOTH the bed and hotend mosfets got stuck at the same time? Right after or at the same time the MCU becomes unreachable from the host? How could this happen, even in the case of whatever odd or subtle hardware failure or user error? That seems very unlikely.
Let’s go through some scenarios, starting with the hardware issues EddyMI3D mentioned:
- Has the thermistor detached from its position.
Both thermistors, the one on the bed and the one on the hotend at the same time? Very, very unlikely.
And even if they did, there is logic inside Klipper to detect that. If it drives the PWM and the thermistor temp values don’t increase, Klipper will do a safety shutdown.
- Is there a loose cable or broken wire on the thermistor.
See point 1
- Does the thermistor work properly.
See point 1
- For the hotend thermistor: Has the part cooling fan moved and cools the thermistor too much.
Shouldn’t matter. I’ve actually had that issue, Klipper correctly did a safety shutdown because it noticed it was heating the hotend but the hotend wouldn’t get hotter.
- Is the wiring of the heater ok: No loose screws, no wire breaks.
See point 1
- Are the MOSFETs doing their jobs.
See point 1. If they do not switch on, Klipper should detect it; the same goes if they get stuck in the on position.
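To make "Klipper will notice when heating does not work" a bit more concrete, here is a minimal sketch of that kind of check. It is not Klipper's actual code (the real logic lives in its heater verification module); the class name `HeaterCheck`, the parameter names and all default values are made up for illustration. The idea: while the heater is supposed to be heating and the reading is neither near the target nor rising, accumulate an error, and request a shutdown once it gets too large.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeaterCheck:
    """Toy 'is the heater actually heating?' monitor (illustrative only)."""
    hysteresis: float = 5.0    # tolerated gap below target, in deg C (assumed)
    max_error: float = 500.0   # allowed accumulated error, deg C * s (assumed)
    min_gain: float = 0.5      # minimum rise that counts as heating, deg C/s
    error: float = 0.0
    last_temp: Optional[float] = None

    def update(self, temp: float, target: float, dt: float = 1.0) -> bool:
        """Feed one temperature sample; return True if a shutdown is due."""
        rising = (self.last_temp is not None
                  and temp - self.last_temp >= self.min_gain * dt)
        self.last_temp = temp
        if target <= 0.0 or temp >= target - self.hysteresis:
            # Heater off, or close enough to target: clear the error.
            self.error = 0.0
        elif not rising:
            # Should be heating, but the reading is flat or falling.
            self.error += (target - self.hysteresis - temp) * dt
        return self.error > self.max_error
```

With these made-up numbers, a detached thermistor stuck at 25 °C with a 200 °C target trips the check on the third one-second sample, while a hotend that keeps climbing never trips it.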
Or maybe it was USB communication issues, as masterq wrote?
Not possible. Although indeed USB and the serial protocol do not guarantee error-free communication, the communication layer above (Klipper’s protocol) seems designed and implemented properly with checksumming, sequence numbers, retransmits etc. If something went wrong there, Klipper would detect it. Also, even if Klipper was unable to detect such issues, it’s unlikely that whatever random garbage bytes arrive at the MCU get interpreted as “run both mosfets with fixed PWM values forever”.
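As a rough illustration of why random garbage is unlikely to be accepted as a valid command, here is a small sketch of a checksummed, sequence-numbered frame. The layout (length byte, sequence byte, payload, CRC-16, sync byte) and the names `encode`, `decode` and `crc16` are my own simplification and should not be taken as Klipper's exact wire format or CRC variant.

```python
SYNC = 0x7E  # assumed end-of-frame marker for this sketch


def crc16(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x1021 and initial value 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def encode(seq: int, payload: bytes) -> bytes:
    """Build a frame: length, sequence number, payload, CRC, sync."""
    body = bytes([len(payload) + 5, seq & 0x0F]) + payload
    return body + crc16(body).to_bytes(2, "big") + bytes([SYNC])


def decode(frame: bytes, expected_seq: int):
    """Return the payload, or None if the frame must be rejected."""
    if len(frame) < 5 or frame[0] != len(frame) or frame[-1] != SYNC:
        return None  # wrong length or missing sync byte
    body, crc = frame[:-3], int.from_bytes(frame[-3:-1], "big")
    if crc16(body) != crc:
        return None  # corrupted in transit
    if body[1] != expected_seq & 0x0F:
        return None  # lost or duplicated frame
    return body[2:]
```

A frame with a single flipped bit fails the CRC check and a frame with the wrong sequence number is dropped, so the receiver simply waits for a retransmit instead of acting on it.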
So long story short, Klipper obviously has some well-designed and working safety features and is able to detect when something isn’t right and also monitors the connection between MCU and host computer.
Now what other things could possibly have led to both heaters staying on forever?
Quoting OP jdanders from the Github issue:
Since bed pwm was a 1 and extruder pwm was 0.709 things got toasty. I think the issue I want to raise is this:
If the mcu gets disconnected from the Pi, shouldn’t the mcu turn off the heaters?
That’s a valid question IMHO. There is a feature inside Klipper that shuts down the MCU and heaters in case the MCU cannot talk to the Pi anymore. Why did that not work as designed? What odd or subtle hardware failure scenarios could have possibly led to this?
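For readers unfamiliar with this kind of safety feature, this is roughly the behaviour I would expect on the MCU side, written as a small simulation. It is only a sketch under my own assumptions: `HeaterPin`, `host_update`, `mcu_tick` and the 5 second limit are hypothetical names and values, not Klipper's firmware code.

```python
class HeaterPin:
    """Toy heater output that switches off unless the host keeps refreshing it."""

    def __init__(self, max_duration: float):
        self.max_duration = max_duration  # seconds a setting stays valid
        self.duty = 0.0                   # current PWM duty cycle, 0.0 to 1.0
        self.deadline = None              # time by which the host must refresh

    def host_update(self, now: float, duty: float) -> None:
        """Called whenever a PWM command from the host arrives."""
        self.duty = duty
        self.deadline = now + self.max_duration if duty > 0.0 else None

    def mcu_tick(self, now: float) -> float:
        """Called periodically by the firmware; returns the duty in effect."""
        if self.deadline is not None and now >= self.deadline:
            # The host went silent for too long: force the heater off.
            self.duty = 0.0
            self.deadline = None
        return self.duty


pin = HeaterPin(max_duration=5.0)
pin.host_update(now=0.0, duty=0.709)
print(pin.mcu_tick(now=4.0))  # still heating
print(pin.mcu_tick(now=5.0))  # host silent for 5 s, heater forced off
```

Note that this protection only works while the firmware is still running `mcu_tick`. If the code that enforces the deadline stops executing, nothing clears the output.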
From my experience with microcontrollers, the error jdanders reported could have happened like this:
- The main CPU part of the microcontroller chip running Klipper code crashed or got stuck for whatever reason: brownout, subtle hardware failure, bitflips caused by running high-current PWM cables near the electronics, software bugs, …
- The brownout detection and/or watchdog did not reset the chip
- Thus, the PWM circuitry kept running with the last active PWM values
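The three steps above can be played through in a toy model. It assumes, as on many microcontrollers, that the PWM output is generated by a hardware peripheral that keeps running from its last register value even when the CPU core is stuck, and that only a reset clears it. The class `Mcu` and its fields are hypothetical and do not model any specific chip.

```python
class Mcu:
    """Toy microcontroller: CPU, hardware PWM peripheral and watchdog."""

    def __init__(self, watchdog_enabled: bool, watchdog_timeout: int = 3):
        self.watchdog_enabled = watchdog_enabled
        self.watchdog_timeout = watchdog_timeout  # ticks without a kick
        self.pwm_duty = 0.0        # value held in the PWM hardware register
        self.cpu_running = True
        self.ticks_since_kick = 0

    def reset(self) -> None:
        """Hardware reset: PWM register cleared, firmware restarts."""
        self.pwm_duty = 0.0
        self.cpu_running = True
        self.ticks_since_kick = 0

    def tick(self) -> float:
        """Advance one time step; return the duty the PWM pin outputs."""
        if self.cpu_running:
            self.ticks_since_kick = 0  # main loop kicks the watchdog
        else:
            self.ticks_since_kick += 1
            if (self.watchdog_enabled
                    and self.ticks_since_kick >= self.watchdog_timeout):
                self.reset()
        # The PWM peripheral outputs its register value regardless of the CPU.
        return self.pwm_duty


for watchdog in (True, False):
    mcu = Mcu(watchdog_enabled=watchdog)
    mcu.pwm_duty = 1.0       # bed heater at full power
    mcu.cpu_running = False  # CPU crashes
    print(watchdog, [mcu.tick() for _ in range(5)])
```

With a working watchdog the output drops to zero after the timeout; without one, the last duty cycle stays on the pin indefinitely.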
I’ve actually lost a model airplane some years ago because of exactly this brownout scenario. A brownout occurred, the Atmega328 microcontroller responsible for the remote control crashed, but PWM was still active with the last steering values, being “fly straight ahead”. And this is what it did until the battery ran out many kilometers away, never found the plane …
However, looking at the logs jdanders posted on Github, this didn’t seem to be the case as the mcu became reachable again on its own if I see it correctly, i.e. jdanders did not power-cycle the MCU.
So, here, the MCU must either have:
- Reset itself through brownout detection and/or watchdog. However, this did not happen, as the heaters stayed on. A watchdog or brownout reset would also have turned the PWM circuitry off.
- Not reset itself and been in some kind of state that kept PWM on for whatever reason. But such a state should never occur, assuming there are no software bugs and brownout detection and watchdog are working correctly.
So again, what was it that caused this? In theory, this should never happen. But it happened.