I experienced this same situation a few nights ago. My Raspberry Pi locked up, and I could not access it via SSH or, more importantly, via the local console. I am running Klipper on a 3B, but my power supply was inadequate and it was complaining about under-voltage. I don't have logs or my config readily available at the time of this post. What I can tell you is that the Pi was running the latest Raspbian OS minimal image, and my MCU is an SKR Mini E3 V3.
The Pi locked up and my printer failed hot: a 215°C hotend and a 65°C bed. I am new to Klipper and I am not running anything fancy in terms of macros or advanced configuration.
Since all of Klipper's file IO happens on the reactor (at least while printing; mostly here: klipper/klippy/extras/virtual_sdcard.py at master · Klipper3d/klipper · GitHub), outside of a swap operation, the other threads probably continue running, including the timers that schedule the periodic sets required to keep the MCU from shutting down, as per the max_duration set in the config_*_out at setup time.
The periodic sends appear to involve the main thread calling send on a CommandWrapper, which interacts with the serialqueue. As long as neither of these threads does anything VFS-related, or tries to read an unmapped portion of a mapped file, they'd keep running.
This is probably where the reactor would get stuck.
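To make that failure mode concrete, here is a toy sketch (plain Python, not Klipper code, and all the names are made up) of the same shape of problem: the main thread blocks forever on IO while a daemon thread keeps doing its periodic work unbothered.

# Toy illustration (not Klipper code): a background thread keeps
# "refreshing" a heater while the main thread is stuck in a blocking read.
import os
import threading
import time

def keep_heater_alive():
    # Stand-in for the background temperature handling that keeps
    # re-sending PWM updates before the MCU-side deadline expires.
    while True:
        print("background thread: still refreshing heater PWM")
        time.sleep(1.0)

threading.Thread(target=keep_heater_alive, daemon=True).start()

# Stand-in for the reactor blocking on file IO that never completes
# (e.g. a failing SD card): reading from a pipe nobody writes to
# blocks forever, yet the background thread above keeps running.
r, w = os.pipe()
print("main thread: blocking on read() ...")
os.read(r, 1)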
Because this is potentially dangerous, it might be worth adding a watchdog (with SIGALRM or similar) to the reactor greenlet, so that if the reactor dispatches to a task and that task blocks, the main thread can enter an error state and shut down the rest of klippy in a safe-ish manner.
Note that there are some special concerns around signals and greenlets: handlers are always delivered to the main thread, not to the thread that received the signal, and there can be negative interactions with sleep (the glibc one) and with setitimer-based timers.
Those are more for interfacing with a physical watchdog timer, so that the Linux machine reboots if the kernel locks up or if other conditions fail.
This would need to be something internal to Klipper, since the state of its various threads is opaque (though a BPF probe may be able to detect a stall externally).
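To sketch what I mean (completely untested, and force_emergency_shutdown plus the reactor-timer hookup are placeholders I made up, not existing Klipper APIs), something along these lines:

# Hypothetical sketch of a SIGALRM-based reactor watchdog; the reactor
# integration and the shutdown hook are placeholders, not Klipper APIs.
import signal

WATCHDOG_TIMEOUT = 5.0   # seconds the reactor may go without checking in

def force_emergency_shutdown():
    # Placeholder: in klippy this would need to put the printer into an
    # error state and get the heaters turned off.
    raise SystemExit("reactor watchdog expired")

def _watchdog_expired(signum, frame):
    # Python delivers signal handlers on the main thread, so this only
    # helps if that thread can still reach a bytecode boundary.
    force_emergency_shutdown()

def arm_watchdog():
    signal.signal(signal.SIGALRM, _watchdog_expired)
    signal.setitimer(signal.ITIMER_REAL, WATCHDOG_TIMEOUT)

def pet_watchdog(eventtime):
    # Imagined as a recurring reactor timer callback: as long as the
    # reactor keeps dispatching, the alarm is pushed back and never fires.
    signal.setitimer(signal.ITIMER_REAL, WATCHDOG_TIMEOUT)
    return eventtime + 1.0

The obvious weakness is the one above: the handler only runs if the main thread can still execute Python, so a thread parked in uninterruptible disk sleep would not be saved by this.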
I've toyed around with simulating storage failure with ptrace, using an MCU with a fake heater, to see if the symptoms match.
I agree it is possible that the main thread could get stuck due to an OS error while the Klipper background thread continues to maintain the last requested temperature.
I don’t think I would categorize that as a serious issue though, as Klipper should in that situation maintain the last requested temperature. For what it is worth, I think there could be many real-world situations where one may not be able to issue new commands to Klipper (eg, loss of networking between rpi and user’s desktop).
That said, it should be possible to add additional checks to the code. One possibility is to add code to the heater logic to disable heating if the main thread appears unresponsive.
For example, totally untested:
--- a/klippy/extras/heaters.py
+++ b/klippy/extras/heaters.py
@@ -14,6 +14,7 @@ KELVIN_TO_CELSIUS = -273.15
 MAX_HEAT_TIME = 5.0
 AMBIENT_TEMP = 25.
 PID_PARAM_BASE = 255.
+MAX_MAINTHREAD_TIME = 5.0
 
 class Heater:
     def __init__(self, config, sensor):
@@ -41,6 +42,7 @@ class Heater:
         self.lock = threading.Lock()
         self.last_temp = self.smoothed_temp = self.target_temp = 0.
         self.last_temp_time = 0.
+        self.verify_mainthread_time = self.printer.get_reactor().NEVER
         # pwm caching
         self.next_pwm_time = 0.
         self.last_pwm_value = 0.
@@ -66,7 +68,8 @@ class Heater:
         self.printer.register_event_handler("klippy:shutdown",
                                             self._handle_shutdown)
     def set_pwm(self, read_time, value):
-        if self.target_temp <= 0. or self.is_shutdown:
+        if (self.target_temp <= 0. or self.is_shutdown
+            or read_time > self.verify_mainthread_time):
             value = 0.
         if ((read_time < self.next_pwm_time or not self.last_pwm_value)
             and abs(value - self.last_pwm_value) < 0.05):
@@ -129,6 +132,8 @@ class Heater:
             target_temp = max(self.min_temp, min(self.max_temp, target_temp))
         self.target_temp = target_temp
     def stats(self, eventtime):
+        est_mcu_time = self.mcu_pwm.estimated_print_time(eventtime)
+        self.verify_mainthread_time = est_mcu_time + MAX_MAINTHREAD_TIME
         with self.lock:
             target_temp = self.target_temp
             last_temp = self.last_temp
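The idea is that stats() is periodically invoked from the main thread, while set_pwm() is invoked from the background temperature handling code. So, if the main thread stops advancing verify_mainthread_time then the heater output would be forced off within a few seconds of the next temperature update.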
For what it is worth, I think there could be many real-world situations where one may not be able to issue new commands to Klipper (eg, loss of networking between rpi and user’s desktop).
Shouldn’t that also cause an idle timeout at some point, ideally? In general, if you stop at some arbitrary position and keep the heaters on for a long time, it doesn’t really feel like a good place to be.
I guess you could imagine some sort of timeout based on “no commands received in N seconds” instead of your proposed “have not heard from the main thread in N seconds”? Although N would have to be less aggressive for the former; I thought there was already some kind of (configurable) 15-minute timeout or something for this situation, but I haven't checked.
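(For reference, the timeout I had in mind is Klipper's [idle_timeout] config section; the value below is just an example, and I haven't verified what the actual default is.)

[idle_timeout]
timeout: 900   # seconds of inactivity before the idle-timeout action (heaters/motors off) runs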
Ah, that is true - if it is an external hardware fault (eg, networking between rpi and user’s desktop) then the idle timeout should eventually run. However, an internal hardware fault causing the host software to partially fail would not necessarily result in the idle timeout running.
I'm not sure there are any good solutions to that type of “internal hardware fault” though. To wit, the software can't shut down if it can't shut down.
Some simple checks to the heating code (as outlined above) may mitigate the biggest issues though. I suppose we could completely disable all background thread communication with the mcu if the main thread appears to fail - but I fear that may actually further delay getting the heaters disabled and I can’t think of anything other than heaters where continuing communication could be problematic.
Thinking about it further, I suppose the background mcu reading thread (the same thread that controls the heater temperatures) could send an explicit mcu shutdown message to the mcu if that thread detects the main thread has failed. However, the diagnostics for that type of situation would be painful - as we don’t typically send an explicit shutdown request without logging and we can’t reliably log anything if some of the threads have failed.
In theory, I guess everything could heartbeat everything, but there is a risk of going overboard here, too; solving the general problem of a distributed system with potential partial failures is… complex. And I guess there are enough issues with timeouts under load as it is.
I don’t know enough about how the threads in question are split up (and what lives on the reactor and what does not); I’m perhaps just a bit surprised that the virtual SD card stopping is different from a timeout coming from a remote host. And I’m also a bit surprised that this logic would be in the heater and not just a general failsafe (again, I don’t know how all of this is split up).
For my part, I set up cooling better so that this happens more rarely, but that doesn’t feel like a general solution either.