Potential bug in Thermal runaway protection. (keeps heating forever!)

Printer Model: CR-10 S5
MCU / Printerboard: Raspberry Pi 3b / SKR Mini E3 V2
klippy.log (I removed the first 29k lines because of the 8 MB upload limit)
TLDR:
When setting the hotend temp to 25°C with the thermistor removed from the block and reading 20.5°C room temperature, the thermal runaway protection does not kick in, and the heater stays at 100% forever!!!
Setting the temp to 30°C triggers the protection after about 20s, as it should.

What happened:

Today I had a print stopped by the thermal runaway protection. I guess there was an issue with the cable to the thermistor, since the temperature reading made weird jumps. That can be seen starting from line 155 in the log (temp jumps almost 10°C in one second). Luckily the protection worked and shut down the printer after the temperature had been way off target for a while. The temperature was also fluctuating a lot before that event, but that is due to the part cooling fan and the hotend fan blowing directly on the hotend (I bought the printer used from some incompetent person who mounted the fan cover totally crooked, and I have not bothered to fix it yet).
I was also printing at a high extrusion rate of about 20mm³/s.

Because of that event (which cost me almost half a kg of PETG), I took the hotend apart to replace the thermistor. Out of curiosity (to check how the thermal runaway protection works) I set the temp to 25° (log line 5904) while the thermistor was removed from the hotend, and to my surprise the heater just stayed at 100% until the hotend was waaaay too hot (probably way in excess of 300°C, it started to smoke). At that point I cut the power to the printer and inserted the probe back into the block to check the temperature. That triggered the thermal runaway protection right away (log line 6169).
After restarting Klipper I set the temperature to 30° (log line 7171) with the probe not inserted, which correctly triggered the protection after about 20s.
I repeated the test with 25°C (log line 8670) and it kept heating forever again.

This is also reproducible after restarting the Raspberry Pi.

Since I have no verify_heater section for the extruder in my config, the defaults apply: the protection should kick in after the default 20s during heat-up (which it did when setting the temp to 30), or after an accumulated error of 120, which would take about 27s in my case (120 / 4.5).

EDIT:
I did an experiment, putting my hand against the thermistor. While it read about 27° I set the temp to 31° and I had the same result as before, no shutdown. Only after I let it cool down to room temperature again did it shut down. So the problem seems to occur when setting a temperature only a little higher than the temperature readout.

klippy (1).log (5.2 MB)

Taking a quick look at the Klipper code (verify_heater.py), I see this:

        if temp >= target - self.hysteresis or target <= 0.:
            # Temperature near target - reset checks
            if self.approaching_target and target:
                logging.info("Heater %s within range of %.3f",
                             self.heater_name, target)
            self.approaching_target = self.starting_approach = False
            if temp <= target + self.hysteresis:
                self.error = 0.
            self.last_target = target
            return eventtime + 1.

The hysteresis parameter defaults to 5. To me that reads as if, when the difference between the current temp and the target is less than 5 degrees, this branch is taken and the function returns early on every check. Since the detached thermistor's reading never changes, the code never gets past this point to the part that would notice the temperature is not rising, so the heater just keeps heating.
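
To make the effect of that first condition concrete, here is a minimal sketch that evaluates only the quoted `if` test for the readings reported in this thread. The function name `resets_checks` is made up for illustration; only the condition itself and the default hysteresis of 5 come from the snippet and posts above. It is not the rest of Klipper's check, just the early-return test.

```python
def resets_checks(temp, target, hysteresis=5.0):
    """Hypothetical helper: True if the quoted branch would be taken.

    Mirrors only the condition from the snippet above:
        temp >= target - hysteresis or target <= 0.
    """
    return temp >= target - hysteresis or target <= 0.0


# Readings and targets as reported in this thread.
cases = [
    (20.5, 25.0),  # detached thermistor at room temp, target 25
    (20.5, 30.0),  # same reading, target 30
    (27.0, 31.0),  # hand-warmed thermistor, target 31
]
for temp, target in cases:
    print(temp, target, resets_checks(temp, target))
# 20.5 25.0 True   -> checks reset every second, no shutdown
# 20.5 30.0 False  -> falls through to the later checks
# 27.0 31.0 True   -> checks reset every second, no shutdown
```

With the default hysteresis, any target less than 5 degrees above a frozen reading lands in the "near target" branch, which matches the two no-shutdown cases described above.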

Do you get the thermal runaway protection if you set the hysteresis parameter of the extruder to less than 5?

I’m not sure how Klipper could detect this issue. I will say that I don’t think anyone would ever intentionally set the hotend temp to room temperature, and the thermistor also needs to be working and fully detached from the hotend to create this set of circumstances. I’m not sure I view this as a problem that would ever happen in real world use.

I can't check that right now (2:30 AM, other people in the household are sleeping), but that seems to be correct, although I don't really know Python. I had misunderstood the description. I thought the hysteresis was the max acceptable deviation from the target, at which the protection would immediately kick in, and that within that range (±5°) the error would accumulate until it reaches 120 and be reset "if it is close to the target temperature". Now that I think about it, it makes sense that it does not immediately trigger at ±5° in the default config. That would falsely trigger quite often if the temp is lowered after the first layer and the temp undershoots a bit. But a hysteresis of 5 degrees seems a little excessive for a default setting. It means the extruder temp could be 7 degrees off for a whole minute before triggering the protection. I can not imagine a normal situation where such a margin would be necessary. I will change it to 2 or 3.
I still find it odd that there aren't more checks to make sure the temperature and the duty cycle of the heater element make sense together. I guess it would be hard to implement something like that in a way that works reliably with most 3D printers, but there could be a calibration sequence where the heater heats up to the set max temperature at 100% duty cycle and the rate of change is recorded for every 5 or 10 degrees. Then this data could be referenced, and if the heater would usually heat up by X degrees per second at Y temperature but now the duty cycle is at 100% for an amount of time without the element heating up…

I think the documentation should more clearly explain how this works. It is correctly explained if you read the whole thing and pay attention to details, but before today I just glanced over it (as most people probably do) and it could be made a little more obvious what “close to the target temperature” and “target range” means, although it is explained below in the hysteresis section.
For example:

# The maximum "cumulative temperature error" before raising an
# error. Smaller values result in stricter checking and larger
# values allow for more time before an error is reported.
# Specifically, the temperature is inspected once a second and if it
# is close to the target temperature (target temperature ± hysteresis)
# then an internal "error counter" is reset; otherwise, if the
# temperature is below the target range then the counter is increased
# by the amount the reported temperature differs from that range
# (target temperature ± hysteresis). Should the counter exceed this
# "max_error" then an error is raised. The default is 120.

I did not fully understand how the max_error is calculated and what the hysteresis does exactly. I think I do now. You are right. Those circumstances are pretty unlikely to happen in real world use. There should still be some kind of a check for such a situation.

While it's unlikely that this would happen in real life, the way to detect it is to rework the logic in the function so that it first checks that the heater temperature is changing. I haven't looked at the full details of the function, so I don't know how difficult that would be, but there is a way.

If the thermistor was connected and working, the temperature would have been moving up and down a few tenths of a degree.

Moving up and down a few tenths would not have caused this issue to be caught, since the reading would still have been within the 5°C difference.

However, in the real world with the thermistor connected, it would have moved by much more than tenths of a degree, since it would have picked up the heat from Klipper applying power to the heater.

So, I think that both of us are in agreement. The only reason this was a problem for the OP was that the thermistor was removed from the heater block AND the target was within 5°C of the ambient temp.

Yes, and I don’t think it’s possible to detect this situation without creating a lot of unwanted thermal shutdowns. Reducing the default hysteresis won’t completely eliminate it, and we would be flooded with complaints about failed prints.

Just to have a common understanding:

  • Only once the target temperature has been reached will Klipper check against the “corridor” of target_temp ± hysteresis
  • If the system leaves this corridor, then the deviation is summed up, e.g. if you leave it for 10 seconds by 12 °C then your error sum is 120
  • If this error sum reaches max_error Klipper errors out
  • If the temperature returns into the “corridor” then the error sum is reset
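
The bullets above can be sketched as a small accumulator. This is a simplified illustration of the described behaviour, not Klipper's implementation: the function name `error_sum`, the 200°C target, and the sample readings are assumptions chosen to reproduce the "10 seconds by 12 °C" example, and readings above the corridor are simply left out of the sum here.

```python
MAX_ERROR = 120.0  # default max_error mentioned in this thread


def error_sum(readings, target, hysteresis=5.0):
    """Hypothetical helper: sum the deviation below the corridor.

    `readings` holds one temperature sample per second.
    """
    low = target - hysteresis
    high = target + hysteresis
    error = 0.0
    for temp in readings:
        if low <= temp <= high:
            # Back inside the corridor: the error sum is reset.
            error = 0.0
        elif temp < low:
            # Below the corridor: add the distance to its lower edge.
            error += low - temp
    return error


# Target 200 with hysteresis 5 gives a corridor of 195..205.
# Ten seconds at 183 means ten seconds at 12 degrees below the corridor.
stuck_low = [183.0] * 10
print(error_sum(stuck_low, target=200.0))               # 120.0
print(error_sum(stuck_low, target=200.0) >= MAX_ERROR)  # True

# One reading back inside the corridor clears the sum again.
print(error_sum(stuck_low + [196.0], target=200.0))     # 0.0
```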

A similar thing is done with check_gain_time and heating_gain:

  • During heating the temperature must change by the value of heating_gain in the timeframe of check_gain_time seconds
  • For example, if the extruder sits at 60°C for over 20 seconds while heating up to 190°C, an error is thrown
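
As a rough sketch of the heat-up check described in these two bullets: the function below reports the first second at which the temperature has failed to rise by `heating_gain` within `check_gain_time` seconds. The function name, the values of 2 degrees and 20 seconds, and the sample readings are assumptions for illustration. Klipper's real logic has more states than this, so treat it as a model of the idea only.

```python
def first_stall_second(readings, heating_gain=2.0, check_gain_time=20):
    """Hypothetical helper: one reading per second during heat-up.

    Returns the second at which heating is judged stalled, or None.
    """
    goal_temp = readings[0] + heating_gain
    deadline = check_gain_time
    for second, temp in enumerate(readings):
        if temp >= goal_temp:
            # Enough gain seen: set the next goal and a fresh deadline.
            goal_temp = temp + heating_gain
            deadline = second + check_gain_time
        elif second >= deadline:
            # No sufficient gain within the allowed time window.
            return second
    return None


# Extruder sits at 60 degrees for 30 seconds while heating.
print(first_stall_second([60.0] * 30))                       # 20

# Extruder gains 1 degree per second: never considered stalled.
print(first_stall_second([60.0 + s for s in range(30)]))     # None
```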

In summary it can be stated:

  • Your heating hardware and temperature measurement hardware have to be capable of reaching, reporting and maintaining the correct temperatures
  • Klipper checks that this happens in due time and without too large an error, and shuts down the system before a catastrophic failure occurs

I just ran into this (which also should not have been possible)

Bug caused thermal runaway and might have been catastropic - General Discussion - Klipper

Your report is entirely unrelated to what was reported here.
