Fatal Errors not so fatal...?

I recently had a few long prints fail due to various driver issues. The latest was due to the extruder driver reporting an OvertempWarning.

While this is something that should be handled manually, would it make sense to have the option to not kill the print and instead pause it, to allow for manual intervention to lower the chamber temp, reduce drive current, etc.? Not sure if this is something that has been discussed before.

I would be willing to investigate and create a PR if this is something that would make sense. The idea would be to allow the user to define a pause macro for certain error conditions, instead of hard failing.

Obviously, I can also see the flip side: not wanting to do this in order to avoid causing serious electrical malfunctions.

TMC 'extruder' reports DRV_STATUS: 001e0101 otpw=1(OvertempWarning!) t120=1 cs_actual=30
Stats 234583.7: gcodein=0  mcu: mcu_awake=0.005 mcu_task_avg=0.000011 mcu_task_stddev=0.000011 bytes_write=3042467 bytes_read=2139857 bytes_retransmit=9 bytes_invalid=0 send_seq=141309 receive_seq=141309 retransmit_seq=2 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=16 upcoming_bytes=0 freq=179999963 BTT_EBB42: mcu_awake=0.004 mcu_task_avg=0.000012 mcu_task_stddev=0.000019 bytes_write=3790895 bytes_read=1661914 bytes_retransmit=0 bytes_invalid=0 send_seq=127199 receive_seq=127199 retransmit_seq=0 srtt=0.001 rttvar=0.001 rto=0.025 ready_bytes=32 upcoming_bytes=0 freq=63998882 adj=63998810 sd_pos=236596 MellowSB2040v2: temp=23.2 heater_chamber: target=50 temp=47.8 pwm=1.000 heater_bed: target=110 temp=109.9 pwm=0.456 sysload=0.37 cputime=12859.061 memavail=3266472 print_time=2681.766 buffer_time=3.605 print_stall=0 extruder: target=260 temp=260.1 pwm=0.302
Stats 234584.7: gcodein=0  mcu: mcu_awake=0.005 mcu_task_avg=0.000011 mcu_task_stddev=0.000011 bytes_write=3042804 bytes_read=2140291 bytes_retransmit=9 bytes_invalid=0 send_seq=141334 receive_seq=141334 retransmit_seq=2 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=179999966 BTT_EBB42: mcu_awake=0.004 mcu_task_avg=0.000012 mcu_task_stddev=0.000019 bytes_write=3790999 bytes_read=1662066 bytes_retransmit=0 bytes_invalid=0 send_seq=127205 receive_seq=127205 retransmit_seq=0 srtt=0.001 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=63998881 adj=63998825 sd_pos=236596 MellowSB2040v2: temp=23.3 heater_chamber: target=50 temp=47.8 pwm=1.000 heater_bed: target=110 temp=109.9 pwm=0.456 sysload=0.37 cputime=12859.097 memavail=3239032 print_time=2681.766 buffer_time=2.604 print_stall=0 extruder: target=260 temp=260.1 pwm=0.287
TMC 'extruder' reports DRV_STATUS: 001e0000 cs_actual=30
Stats 234585.7: gcodein=0  mcu: mcu_awake=0.005 mcu_task_avg=0.000011 mcu_task_stddev=0.000011 bytes_write=3043690 bytes_read=2140779 bytes_retransmit=9 bytes_invalid=0 send_seq=141361 receive_seq=141361 retransmit_seq=2 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=179999963 BTT_EBB42: mcu_awake=0.004 mcu_task_avg=0.000012 mcu_task_stddev=0.000019 bytes_write=3791277 bytes_read=1662196 bytes_retransmit=0 bytes_invalid=0 send_seq=127212 receive_seq=127212 retransmit_seq=0 srtt=0.001 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=63998878 adj=63998815 sd_pos=236663 MellowSB2040v2: temp=23.3 heater_chamber: target=50 temp=47.8 pwm=1.000 heater_bed: target=110 temp=109.9 pwm=0.529 sysload=0.37 cputime=12859.124 memavail=3508396 print_time=2683.457 buffer_time=3.295 print_stall=0 extruder: target=260 temp=260.1 pwm=0.323
TMC 'extruder' reports DRV_STATUS: 001e0101 otpw=1(OvertempWarning!) t120=1 cs_actual=30

Personally, I’m not in favor of such ideas at all:

  • A hardware error is a hardware error
  • It should not have happened in the first place and points to an underlying problem
  • Potentially leaves the printer in an undefined state, e.g. power to the motors is cut, so rehoming is needed, etc.

WRT your example:

  • AFAIK otpw=1 is only reported but does not lead to a shutdown, since it is only a warning
  • otpw=1 plus ot=1 would be an Overtemp Error and lead to a shutdown. In that case, the points mentioned above do apply.
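To make the warning/error distinction concrete, here is a minimal Python sketch that decodes the DRV_STATUS hex values from the log. The bit positions are an assumption: they follow the TMC2208/TMC2209 register layout, which is consistent with the values logged in this thread, but other TMC drivers use different layouts. The function names are made up for illustration and are not part of Klipper.

```python
# Bit positions ASSUMED from the TMC2208/TMC2209 DRV_STATUS layout.
# Other TMC drivers lay this register out differently.
FLAG_BITS = {
    "otpw": 0, "ot": 1, "s2ga": 2, "s2gb": 3, "s2vsa": 4, "s2vsb": 5,
    "ola": 6, "olb": 7, "t120": 8, "t143": 9, "t150": 10, "t157": 11,
    "stealth": 30, "stst": 31,
}


def decode_drv_status(hex_value):
    """Return the non-zero DRV_STATUS fields of a hex string as a dict."""
    value = int(hex_value, 16)
    fields = {name: 1 for name, bit in FLAG_BITS.items() if (value >> bit) & 1}
    cs_actual = (value >> 16) & 0x1F  # 5-bit actual current scale
    if cs_actual:
        fields["cs_actual"] = cs_actual
    return fields


def severity(fields):
    """Classify decoded fields: 'error' if ot is set, 'warning' if only otpw."""
    if fields.get("ot"):
        return "error"
    if fields.get("otpw"):
        return "warning"
    return "ok"
```

Decoding 001e0101 from the log gives otpw, t120 and cs_actual=30, which matches what the log line itself reports.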

Meant to respond to this sooner.

That makes sense, though it is a little frustrating from a user perspective. Then again, I don’t have a lot of experience with this kind of hardware/real-time system.

The log does contain the actual error, which I initially overlooked.

klippy.log:TMC 'extruder' reports DRV_STATUS: 001e01c3 otpw=1(OvertempWarning!) ot=1(OvertempError!) ola=1(OpenLoad_A!) olb=1(OpenLoad_B!) t120=1 cs_actual=30

At least you could introduce some ‘isCritical = yes|no’ option to sensors like chamber temperature.

So that a failing temp sensor (like ambient/enclosure) would be ignored instead of killing the whole print.

Depending on what TMC driver you have, you might be able to use a delayed_gcode loop to poll its internal temperature and issue a PAUSE command if it gets to within (say) 10 degrees of the shutdown threshold.

Even if your driver doesn’t report the internal temperature, you might be able to poll drv_status to look for otpw=1(OvertempWarning!) and issue a PAUSE or some other custom gcode before it gets to the point of ot=1(OvertempError!).

But of course, that should only be used as a protection mechanism to try to avoid losing a failed print, not as a solution to whatever is causing the warning and error in the first place.
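The decision logic behind such a polling loop could be sketched like this in Python. This is only an illustration of the check itself: in Klipper it would live in a delayed_gcode macro, and how the values are obtained is left out. The function name, the parameters and the idea of passing in a shutdown threshold are all hypothetical; the actual threshold depends on the driver.

```python
def should_pause(drv_status, temperature=None, shutdown_temp=None, margin=10.0):
    """Decide whether to issue a PAUSE before the driver reaches overtemp error.

    drv_status    -- dict of non-zero status flags, e.g. {"otpw": 1}, or None
    temperature   -- driver temperature in degrees C, or None if not reported
    shutdown_temp -- HYPOTHETICAL shutdown threshold for the driver in use
    margin        -- how close to the threshold we allow the driver to get
    """
    flags = drv_status or {}
    # The pre-warning flag alone is enough to pause.
    if flags.get("otpw"):
        return True
    # Otherwise fall back to the temperature margin, if both values are known.
    if temperature is not None and shutdown_temp is not None:
        return temperature >= shutdown_temp - margin
    return False
```

Pausing on otpw alone is the more conservative choice; the temperature margin only applies to drivers that report an internal temperature.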

Yeah, I will try to look into this more and try to figure out what might be reasonable ways to reduce halts on failures.

A few things to consider:

  1. If the errors are raised by the board firmware and require rebooting the firmware, handling them differently may be out of reach or unreasonable.
  2. If there are other errors that can be safely resumed from, then it may be reasonable to offer a way to avoid requiring a firmware restart.
  3. This error in particular is a good example where I think theophile makes a reasonable point. It would have been good to know this type of warning was occurring prior to the actual error. I think something in the Mainsail or Fluidd UI that shows driver status or reports driver warnings would be a reasonable place to start making any changes.
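As a stopgap for point 3, the warnings can already be pulled out of klippy.log after the fact. Below is a rough Python sketch that counts overtemp warnings and errors per driver, assuming the log lines look like the "TMC '...' reports DRV_STATUS: ..." lines quoted earlier in this thread. The function name and the shape of the summary are made up for illustration.

```python
import re

# Matches lines like:
#   TMC 'extruder' reports DRV_STATUS: 001e0101 otpw=1(OvertempWarning!) t120=1
DRV_LINE = re.compile(
    r"TMC '(?P<name>[^']+)' reports DRV_STATUS: (?P<hex>[0-9a-fA-F]+)(?P<rest>.*)"
)


def summarize_driver_events(lines):
    """Count overtemp warnings and errors per TMC driver name."""
    summary = {}
    for line in lines:
        match = DRV_LINE.search(line)
        if match is None:
            continue  # Stats lines and everything else are ignored
        counts = summary.setdefault(match.group("name"), {"warnings": 0, "errors": 0})
        # Field names are the part before '=' in each reported token.
        names = {token.split("=", 1)[0] for token in match.group("rest").split()}
        if "ot" in names:
            counts["errors"] += 1
        elif "otpw" in names:
            counts["warnings"] += 1
    return summary
```

Passing it an open file, e.g. summarize_driver_events(open("klippy.log")), would show whether warnings were piling up before the error that killed the print.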