I experienced this same situation a few nights ago. My Raspberry Pi locked up, and I could not access it via SSH or, more importantly, via the local console. I am running Klipper on a 3B, but my power supply was inadequate and it was complaining about under-voltage. I don't have logs or my config readily available at the time of this post. What I can tell you is that the Pi was running the latest Raspbian OS minimal image, and my MCU is an SKR Mini E3 V3.
The Pi locked up and my printer failed hot: a 215°C hotend and a 65°C bed. I am new to Klipper and I am not running anything fancy in terms of macros or advanced configuration.
Since all of Klipper's file IO happens on the reactor (at least while printing; mostly here: klipper/klippy/extras/virtual_sdcard.py at master · Klipper3d/klipper · GitHub), outside of a swap operation, the other threads probably continue running, including the timers that schedule the periodic sets required to keep the MCU from shutting down, as per the max_duration set in the config_*_out at setup time.
The periodic sends appear to involve the main thread calling send on a CommandWrapper, which interacts with the serialqueue. As long as neither of these threads does anything VFS-related, or tries to read an unmapped portion of a mapped file, they'd keep running.
This is probably where the reactor would get stuck.
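To make that failure mode concrete, here is a toy sketch (plain Python, not Klipper code, and all the names are made up) of the same shape of problem: the main thread blocks forever on IO while a daemon thread keeps doing its periodic work unbothered.

# Toy illustration (not Klipper code): a background thread keeps
# "refreshing" a heater while the main thread is stuck in a blocking read.
import os
import threading
import time

def keep_heater_alive():
    # Stand-in for the background temperature handling that keeps
    # re-sending PWM updates before the MCU-side deadline expires.
    while True:
        print("background thread: still refreshing heater PWM")
        time.sleep(1.0)

threading.Thread(target=keep_heater_alive, daemon=True).start()

# Stand-in for the reactor blocking on file IO that never completes
# (e.g. a failing SD card): reading from a pipe nobody writes to
# blocks forever, yet the background thread above keeps running.
r, w = os.pipe()
print("main thread: blocking on read() ...")
os.read(r, 1)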
Because this is potentially dangerous, it might be worth adding a watchdog (with SIGALRM or similar) to the reactor greenlet, so that if the reactor dispatches to a task and that task blocks, the main thread can enter an error state and shut down the rest of klippy in a safe-ish manner.
Note that there are some special concerns around signals and greenlets: handlers are always delivered to the main thread, not to the thread that received the signal, and there can be negative interactions with sleep (the glibc one) and with setitimer-based timers.
Those are more for interfacing with a physical watchdog timer, so that the Linux machine reboots if the kernel locks up or if other conditions fail.
This would need to be something internal to Klipper, since the state of its various threads is opaque (though a BPF probe may be able to detect a stall externally).
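To sketch what I mean (completely untested, and force_emergency_shutdown plus the reactor-timer hookup are placeholders I made up, not existing Klipper APIs), something along these lines:

# Hypothetical sketch of a SIGALRM-based reactor watchdog; the reactor
# integration and the shutdown hook are placeholders, not Klipper APIs.
import signal

WATCHDOG_TIMEOUT = 5.0   # seconds the reactor may go without checking in

def force_emergency_shutdown():
    # Placeholder: in klippy this would need to put the printer into an
    # error state and get the heaters turned off.
    raise SystemExit("reactor watchdog expired")

def _watchdog_expired(signum, frame):
    # Python delivers signal handlers on the main thread, so this only
    # helps if that thread can still reach a bytecode boundary.
    force_emergency_shutdown()

def arm_watchdog():
    signal.signal(signal.SIGALRM, _watchdog_expired)
    signal.setitimer(signal.ITIMER_REAL, WATCHDOG_TIMEOUT)

def pet_watchdog(eventtime):
    # Imagined as a recurring reactor timer callback: as long as the
    # reactor keeps dispatching, the alarm is pushed back and never fires.
    signal.setitimer(signal.ITIMER_REAL, WATCHDOG_TIMEOUT)
    return eventtime + 1.0

The obvious weakness is the one above: the handler only runs if the main thread can still execute Python, so a thread parked in uninterruptible disk sleep would not be saved by this.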
I've toyed around with simulating storage failure with ptrace, using an MCU with a fake heater, to see if the symptoms match.
I agree it is possible that the main thread could get stuck due to an OS error while the Klipper background thread continues to maintain the last requested temperature.
I don’t think I would categorize that as a serious issue though, as Klipper should in that situation maintain the last requested temperature. For what it is worth, I think there could be many real-world situations where one may not be able to issue new commands to Klipper (eg, loss of networking between rpi and user’s desktop).
That said, it should be possible to add additional checks to the code. One possibility is to add code to the heater logic to disable heating if the main thread appears unresponsive.
For example, totally untested:
--- a/klippy/extras/heaters.py
+++ b/klippy/extras/heaters.py
@@ -14,6 +14,7 @@ KELVIN_TO_CELSIUS = -273.15
 MAX_HEAT_TIME = 5.0
 AMBIENT_TEMP = 25.
 PID_PARAM_BASE = 255.
+MAX_MAINTHREAD_TIME = 5.0
 
 class Heater:
     def __init__(self, config, sensor):
@@ -41,6 +42,7 @@ class Heater:
         self.lock = threading.Lock()
         self.last_temp = self.smoothed_temp = self.target_temp = 0.
         self.last_temp_time = 0.
+        self.verify_mainthread_time = self.printer.get_reactor().NEVER
         # pwm caching
         self.next_pwm_time = 0.
         self.last_pwm_value = 0.
@@ -66,7 +68,8 @@ class Heater:
         self.printer.register_event_handler("klippy:shutdown",
                                             self._handle_shutdown)
     def set_pwm(self, read_time, value):
-        if self.target_temp <= 0. or self.is_shutdown:
+        if (self.target_temp <= 0. or self.is_shutdown
+            or read_time > self.verify_mainthread_time):
             value = 0.
         if ((read_time < self.next_pwm_time or not self.last_pwm_value)
             and abs(value - self.last_pwm_value) < 0.05):
@@ -129,6 +132,8 @@ class Heater:
             target_temp = max(self.min_temp, min(self.max_temp, target_temp))
         self.target_temp = target_temp
     def stats(self, eventtime):
+        est_mcu_time = self.mcu_pwm.estimated_print_time(eventtime)
+        self.verify_mainthread_time = est_mcu_time + MAX_MAINTHREAD_TIME
         with self.lock:
             target_temp = self.target_temp
             last_temp = self.last_temp
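The idea is that stats() is periodically invoked from the main thread, while set_pwm() is invoked from the background temperature handling code. So, if the main thread stops advancing verify_mainthread_time then the heater output would be forced off within a few seconds of the next temperature update.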
For what it is worth, I think there could be many real-world situations where one may not be able to issue new commands to Klipper (eg, loss of networking between rpi and user’s desktop).
Shouldn’t that also cause an idle timeout at some point, ideally? In general, if you stop at some arbitrary position and keep the heaters on for a long time, it doesn’t really feel like a good place to be.
I guess you could imagine some sort of timeout based on “no commands received in N seconds” instead of your proposed “have not heard from the main thread in N seconds”? Although N would have to be less aggressive for the former; I thought there was already some kind of (configurable) 15-minute timeout or something for this situation, but I haven't checked.
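(For reference, the timeout I had in mind is Klipper's [idle_timeout] config section; the value below is just an example, and I haven't verified what the actual default is.)

[idle_timeout]
timeout: 900   # seconds of inactivity before the idle-timeout action (heaters/motors off) runs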
Ah, that is true - if it is an external hardware fault (eg, networking between rpi and user’s desktop) then the idle timeout should eventually run. However, an internal hardware fault causing the host software to partially fail would not necessarily result in the idle timeout running.
I'm not sure there are any good solutions to that type of “internal hardware fault” though. To wit, the software can't shut down if it can't shut down.
Some simple checks to the heating code (as outlined above) may mitigate the biggest issues though. I suppose we could completely disable all background thread communication with the mcu if the main thread appears to fail - but I fear that may actually further delay getting the heaters disabled and I can’t think of anything other than heaters where continuing communication could be problematic.
Thinking about it further, I suppose the background mcu reading thread (the same thread that controls the heater temperatures) could send an explicit mcu shutdown message to the mcu if that thread detects the main thread has failed. However, the diagnostics for that type of situation would be painful - as we don’t typically send an explicit shutdown request without logging and we can’t reliably log anything if some of the threads have failed.
In theory, I guess everything could heartbeat everything, but there is a risk of going overboard here, too; solving the general problem of a distributed system with potential partial failures is… complex. And I guess there are enough issues with timeouts under load as it is.
I don’t know enough about how the threads in question are split up (and what lives on the reactor and what does not); I’m perhaps just a bit surprised that the virtual SD card stopping is different from a timeout coming from a remote host. And I’m also a bit surprised that this logic would be in the heater and not just a general failsafe (again, I don’t know how all of this is split up).
For my part, I set up cooling better so that this happens more rarely, but that doesn’t feel like a general solution either.