[BUG] Klipper does not properly turn off heaters on host crash

Basic Information:

Printer Model: Sovol SV08 (Klipper from mainline)
MCU / Printerboard: BTT CB1
Host / SBC: SBC
klippy.log: Unavailable (see below)

Hi,

My printer has a habit of having its eMMC crash under certain conditions (this is of course not Klipper’s fault, and I will follow up on that in a different medium). When this happens, printing eventually hangs (presumably because it can’t read the G-code to figure out what to do), I can’t SSH to the printer, and Fluidd is unresponsive. Also, klippy.log is of course not being written to, nor any other logs (it took me a fair bit of time to track down that I/O was indeed the issue, and not some other crash).

However, the heaters do not turn off, and the MCUs stay active. This seems to be in direct disagreement with the FAQ, which claims that if the host is dead for five seconds, the MCUs will go into auto-shutdown.

This combination is super-bad. At some point I was away and checked on my print, the host died, and then I was an hour away from a printer with the nozzle at 270 °C and the bed at 100 °C, which I couldn’t turn off. When I came back, it had charred a fair bit of plastic, although thankfully everything else was fine.

My assumption (without having dug into the code in detail; I know Klipper on a superficial basis only) is that whatever thread is responsible for tickling the MCU’s watchdog can survive just fine without I/O, even if the actual printing thread is stuck. Is it possible to make the system more resilient, so that the watchdog also checks that the rest of the system is responsive before telling the MCU that everything is fine? I.e. the Moonraker threads or something also need to ping the watchdog for the system to be considered healthy.
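To sketch what I mean (purely illustrative Python, not Klipper code; all names here are made up): the keepalive to the MCU would only be sent while every registered subsystem has checked in recently, so a thread stuck on I/O would silently stop the keepalive and let the MCU’s existing timeout do its job:

import threading
import time

# Hypothetical sketch: the MCU keepalive only goes out while every
# registered subsystem has reported in recently.
class HostLiveness:
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.lock = threading.Lock()
        self.last_beat = {}                  # subsystem name -> last heartbeat time

    def beat(self, name):                    # called by each subsystem/thread
        with self.lock:
            self.last_beat[name] = time.monotonic()

    def all_alive(self):
        now = time.monotonic()
        with self.lock:
            return all(now - t < self.timeout for t in self.last_beat.values())

def watchdog_loop(liveness, send_mcu_keepalive):
    # Would run in whatever thread currently tickles the MCU watchdog.
    while True:
        if liveness.all_alive():
            send_mcu_keepalive()             # MCU stays up
        # else: stay silent and let the MCU's own timeout shut the heaters off
        time.sleep(1.0)

The G-code reading loop, the API server, and so on would each call beat() from their own loops; anything that hangs simply stops calling it.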

I’ve reproduced the issue a bunch of times by now, although thankfully, most of the time it happens in a situation where the heaters are off.

Please post the latest klippy.log. While you feel that it doesn’t have relevant information, there are things that can be found there that will be useful in diagnosing the issue that you’re having.

Sure, here it is. You can see it cuts off abruptly (there are AFAIK two crashes in this log, check the truncation at line 20522).

In this case, it happens after the print, and I am fairly sure it is because the creation of the final timelapse causes enough I/O to kill the eMMC. But again, the Klipper bug isn’t that the I/O goes away, it is how it’s handled.

klippy.log.zip (2.3 MB)

This is in your logs:

Git version: 'v0.12.0-348-geeb2678ec-dirty'

We don’t know what is scrambled up in your Klipper installation.


Nothing very interesting:

sesse@amalie:~/klipper$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	host.mcu
	host.mcu.old
	host_mcu_klipper.bin
	klippy/extras/probe_pressure.py
	klippy/extras/shaketune
	klippy/extras/z_offset_calibration.py
	toolhead.mcu
	toolhead.mcu.old
	toolhead_mcu_klipper.bin

nothing added to commit but untracked files present (use "git add" to track)
sesse@amalie:~/klipper$ git diff
sesse@amalie:~/klipper$

It is not clear what you are reporting, or what is happening on your side.
From your log it does not seem that anything unexpected has happened.

I can confirm that, e.g., pulling the USB connection to the host, resetting it mid-run, or killing the klippy host process will lead to the heaters turning off and the fans going full blast.

From your log it does not seem that anything unexpected has happened.

That is because the I/O dies, and thus of course nothing can be logged. You will not see these failures in the logs.

I can confirm that, e.g., pulling the USB connection to the host, resetting it mid-run, or killing the klippy host process will lead to the heaters turning off and the fans going full blast.

But that is an entirely different situation, where the MCU actually detects that the host goes away. That’s evidently not happening in this situation.

To rephrase:

  • The Klippy process still runs.
  • Klippy cannot read from or write to disk, causing some threads to hang but not others.
  • This is a “zombie” situation that should have led to a shutdown but does not.

This is pretty much the same as situations you can get in complex distributed systems, where it’s much better for a node to be conclusively dead than to appear healthy while it is not (a form of Byzantine failure).

A more relevant test: What happens if you pull out the SD card on your host while printing? That’s essentially what’s happening here.

This seems to be a strange edge case and I have never heard of similar reports before.
So even if there were a second watchdog process, following your information: who says that this second watchdog would not keep running as well, leading to exactly the same behavior as described?

My personal opinion:

  • You will never be able to catch each and every eventuality
  • If it seems sufficiently unlikely, then it makes no sense to go to great lengths trying to fix a “once in a blue moon” event

Well, a reasonable start would be that if you don’t receive any commands or heartbeats in a set idle time (five minutes or whatever is configured), you go for an orderly shutdown of the heaters? There’s already code for that in normal situations, right?

I’m not sure if bad SD cards are an obscure edge case; surely that must happen to others sometimes, too, even if they don’t report it? (For me, it happens pretty much every day now, but of course I’m working on fixing that. Still, it would be nice to have a layer of protection.)

Well, it is quite simple: As long as the klippy process on the host is running and answering requests from the MCUs, the MCUs continue to function as they should.
If the host process stops responding in a predictable manner then errors like Timer too close or Missed scheduling of next digital out event are raised.

So the case where the host is not correctly responding is covered, but this does not seem to fit your edge case. Apparently the host process continues to run and thus also keeps the MCUs alive, but other parts of the host system are no longer working correctly.
This seems quite unusual and, for me, the report and its validity are still not clear.

Well, I don’t know what else I can say. Could we at least then update the FAQ? It says: “Will the heaters be left on if the Raspberry Pi crashes? The software has been designed to prevent that.” but it seems it only applies to some kinds of crashes and not others (what you keep calling an “edge case”). I mean, rare crashes are crashes too, and the printer won’t set things less on fire just because they’re not of an approved kind :)

FWIW, I’ll probably be installing watchdog(8) with the --sync option, which will hard-reboot the system when this happens next (as far as I know, anyway). It really seems to me like there should be some heartbeat internal to Klipper to make this unnecessary, but if I/O issues or other process hangs are explicitly out of scope for Klipper’s fire safety, I’ll just have to deal with that.

While I could be wrong, to me it looks like this is due to your explicit configuration to not turn off the heaters when it goes idle.

[gcode_macro _IDLE_TIMEOUT]
gcode = 
        {% if printer.print_stats.state == "paused" %}
        RESPOND TYPE=echo MSG="No operations in 10min!"
        {% else %}
        M84
        TURN_OFF_HEATERS
        {% endif %}

printer.print_stats.state is not tracking the pause state from someone hitting “PAUSE”. It instead has to do with the virtual_sdcard state, so it seems plausible that an I/O hang would result in this being set to paused.
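Roughly speaking (a simplified illustration only, not the actual klippy/extras/print_stats.py source), the state your macro is keyed on is just whatever the last note_* call set it to:

# Simplified illustration of the state field the macro reads; the real
# implementation in klippy/extras/print_stats.py tracks more than this.
class PrintStats:
    def __init__(self):
        self.state = "standby"
    def note_pause(self):
        self.state = "paused"      # this is what _IDLE_TIMEOUT compares against
    def note_error(self, message):
        self.state = "error"
    def note_complete(self):
        self.state = "complete"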

Empirically, if I set the bed heater to 100 degrees (to do a heat soak) and then do nothing, Klipper eventually turns it off again. But I guess it’s a different story while printing, you’re right about that; it would be interesting to know exactly what this state signifies.

In any case, the safe response to an I/O hang would presumably be a hard shutdown of the MCUs, right, independent of any _IDLE_TIMEOUT macro being set?
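For concreteness, something like this is what I have in mind (a rough sketch only; the probe file, interval, and shutdown hook are hypothetical, not existing Klipper options): a helper thread keeps forcing a small write to disk, and if no probe has completed within the deadline, the host stops considering itself healthy and requests a shutdown.

import os
import threading
import time

# Hypothetical sketch: probe disk I/O from a helper thread and request a
# shutdown if no probe has completed within the deadline.
PROBE_PATH = "/tmp/klipper_io_probe"   # made-up; in practice it would need to live on the eMMC
DEADLINE = 10.0                        # seconds without a completed probe

last_probe_ok = time.monotonic()

def probe_loop():
    global last_probe_ok
    while True:
        try:
            with open(PROBE_PATH, "w") as f:
                f.write(str(time.monotonic()))
                f.flush()
                os.fsync(f.fileno())   # force real I/O; hangs if the disk hangs
            last_probe_ok = time.monotonic()
        except OSError:
            pass                       # an I/O *error* also means no update
        time.sleep(1.0)

def check_io_health(request_shutdown):
    # Would be called periodically from the main (reactor) thread.
    if time.monotonic() - last_probe_ok > DEADLINE:
        request_shutdown("Host filesystem unresponsive, shutting down heaters")

threading.Thread(target=probe_loop, daemon=True).start()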

From tracking the code:

The only way you can get state = “paused” is if something calls note_pause(). This is only ever called from one place, namely virtual_sdcard.py.

klippy/extras/virtual_sdcard.py is the code that deals with “printing from SD card”, i.e. when you give Klippy a full G-code file and then ask it to print that (as opposed to e.g. OctoPrint feeding it one line of G-code at a time over the simulated serial port).

It keeps going until self.must_pause_work is true, then logs “Exiting SD card print (position %d)”. must_pause_work is set in response to a few things, including pauses and cancels. I do not see any way it can be set to true from an I/O hang. On the contrary, if there’s a hang, the loop should be stuck forever in the line “self.current_file.read(8192)”. If there’s an I/O error, on the other hand, it seems it could exit the loop. But if there’s an error, the code never sets the paused status (there’s a test before that, so that it takes another path). And if the file is completed normally, then it sets self.current_file = None and the paused path is never hit.

So as far as I can see, the only way you really can get to the paused status is for do_pause() to be called. So I think this is a red herring.

I disagree with your assessment that this can’t happen when there are IO problems.

Inside the loop we have:

                try:
                    data = self.current_file.read(8192)
                except:
                    logging.exception("virtual_sdcard read")
                    break

which will break out of the loop if an exception occurs.

This then results in a pause because error_message is None and self.current_file is not None:

        if error_message is not None:
            self.print_stats.note_error(error_message)
        elif self.current_file is not None:
            self.print_stats.note_pause()
        else:
            self.print_stats.note_complete()

If there’s an I/O hang, there’s no exception. The syscall just hangs forever in the kernel.

You’re right that there’s a bug here, though; it should probably have set error_message on an I/O exception. But it doesn’t hit that path; the I/O hangs, it doesn’t start returning errors. (The same goes for all other processes on the system; e.g. sshd doesn’t give an error, you just stay there waiting, even after an hour.)
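For the record, the fix for the error (as opposed to hang) case would presumably just be something like this in that read loop (a sketch of the change, untested), so that the note_error() path is taken instead of note_pause():

                try:
                    data = self.current_file.read(8192)
                except:
                    logging.exception("virtual_sdcard read")
                    # Record the failure so the code further down takes the
                    # note_error() path instead of note_pause()
                    error_message = "Error on virtual sdcard read"
                    break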

I may be completely wrong here, but the hardware temperature sensors have code on the MCU that detects out-of-bounds temperatures and shuts the MCU down. I don’t see that code for ADC thermistors. If I understand the initial problem, klippy is running enough that the MCU doesn’t shut down, but klippy is not running enough to regulate temperatures. There is another complication in that the thermistor is running on a separate MCU (although it looks like it is the same MCU as the heater pin).

If I understand the initial problem, klippy is running enough that the MCU doesn’t shut down, but klippy is not running enough to regulate temperatures.

No, that is not what is happening. The temperature is kept stable, and that is exactly the problem: it’s not good to have a printer standing completely still with the extruder at a high constant temperature (there were scorch marks in the plastic when I discovered this).