Timeout on wait for 'tmcuart_response' response

Basic Information:

Fork of a fork of a Voron 0
BTT SKR 1.3 / Fysetc SHT36 V2.0 / Raspberry Pi 3B+ / Waveshare CAN hat
klippy.log.txt (7.4 MB)

I get this error after nearly every print:

Timeout on wait for 'tmcuart_response' response
Once the underlying issue is corrected, use the "FIRMWARE_RESTART" command to reset the firmware, reload the config, and restart the host software.
Printer is shutdown

It happens nearly every time; I would guess only 1 out of 20 prints finishes without the error. Mostly it happens some minutes after the print is finished. I could live with that, but yesterday it happened mid-print. The attached log is from that failed print.

Somewhere I had read that in similar cases it might help to increase TRSYNC_TIMEOUT, so I did that together with TRSYNC_SINGLE_MCU_TIMEOUT. Then I ran the failed hours-long print multiple times (without filament). I am not totally sure, but I think four runs worked fine before the next one hit the error again.
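For anyone wanting to try the same: these are module-level constants in klippy/mcu.py (their exact location and defaults can vary between Klipper versions, and the increased values below are purely illustrative, not necessarily the numbers I used):

# klippy/mcu.py - stock defaults at the time of writing (version dependent):
#   TRSYNC_TIMEOUT = 0.025
#   TRSYNC_SINGLE_MCU_TIMEOUT = 0.250
# An illustrative increase - just the kind of edit meant, not a recommendation:
TRSYNC_TIMEOUT = 0.050
TRSYNC_SINGLE_MCU_TIMEOUT = 0.500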

The printer worked fine till I changed to the CAN toolhead board.

What could be the reason for this? What else can I test? It seems it might have to do with the steppers getting switched off after a print, because it mostly happens some minutes after the print ends.
But simply powering the steppers on and off doesn't trigger it, and the two times it happened mid-print don't fit this idea either.

That's certainly odd. Alas, the log doesn't show enough information to discern the root cause. If you can reproduce this, could you modify the code with the following change and then reproduce the error again? It might be possible to track this down with several more log files showing the issue.

--- a/klippy/mcu.py
+++ b/klippy/mcu.py
@@ -484,6 +484,9 @@ class RetryAsyncCommand:
             query_time = self.reactor.monotonic()
             if query_time > first_query_time + self.TIMEOUT_TIME:
                 self.serial.register_response(None, self.name, self.oid)
+                logging.error("n=%s o=%d qt=%.6f fqt=%.6f tt=%.6f c=%s",
+                              self.name, self.oid, query_time, first_query_time,
+                              self.TIMEOUT_TIME, cmds)
                 raise serialhdl.error("Timeout on wait for '%s' response"
                                       % (self.name,))
             self.serial.raw_send(cmd, minclock, minclock, cmd_queue)

-Kevin

Sorry, I didn't notice your answer - I am here too seldom. Now that I have gotten an answer, I promise to check in more often so I can work actively on the issue.

In the meantime I have replaced the SHT36 with an EBB36 and redone the wiring. I got the same error twice today, this time mid-print :(. I now suspect the Waveshare CAN hat. I will add the log output to the code and let it run for some hours without filament.

klippy (17).zip (1.5 MB)
Sorry, it is a long file. The error happened at the beginning of the second print.
In line 20934 you can find: n=tmcuart_response o=2 qt=224865.966924 fqt=224860.956439 tt=5.000000 c=[[68, 2, 5, 234, 3, 40, 32, 195, 10]]

That log indicates that the error is the result of a real problem - the host tried to send a message to the extruder’s tmc2209 chip (on the “EBBCan mcu”) and was unable to get a response. It sent the message 10 times over 5 seconds with no response.

It's not an issue with the canbus or the toolhead mcu, as communication proceeds normally for those 5 seconds. The code just doesn't get a response to the tmc driver queries.
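For context, the 10 sends come from the retry cadence in klippy/mcu.py's RetryAsyncCommand class: the query is resent every RETRY_TIME (0.5s) until TIMEOUT_TIME (5.0s) elapses. A small standalone simulation of that loop (with stand-ins for Klipper's internals; this isn't the actual code) shows the arithmetic:

import time

RETRY_TIME = 0.500    # resend interval used by RetryAsyncCommand
TIMEOUT_TIME = 5.0    # total wait before the timeout error is raised

def simulate_lost_response():
    # Count how many times the query would be (re)sent when the TMC driver
    # never answers.  time.sleep() stands in for the reactor's response
    # wait, and the counter stands in for serial.raw_send().
    sends = 1         # the initial send
    first_query_time = time.monotonic()
    while True:
        time.sleep(RETRY_TIME)
        query_time = time.monotonic()
        if query_time > first_query_time + TIMEOUT_TIME:
            return sends, query_time - first_query_time
        sends += 1    # resend the tmcuart_send query

sends, elapsed = simulate_lost_response()
print("sent %d times over %.2fs before timeout" % (sends, elapsed))
# Matches the logged values above: qt - fqt = 224865.966924 - 224860.956439,
# i.e. ~5.01s elapsed with no tmcuart_response at all.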

It's likely the first log had the same event - on closer inspection, the first log was missing the "sht36v2 mcu" debugging info. That missing information is what confused me in the original analysis. It's unclear why the original log was incomplete.

Admittedly, the error message is way too cryptic and could be improved in klipper.

FWIW, given that you've swapped toolhead boards and are still seeing the issue, you might want to look at temperatures, wiring, and voltage - in particular on the extruder stepper.

-Kevin

EDIT: On closer look, it does appear that the root cause could be a canbus issue.

On closer look, the canbus does look suspect. Both logs show incrementing bytes_invalid reports on the canbus. This is an indicator of reordered packets on the canbus. Reordered packets could result in a loss of messages that could eventually cause timeouts as reported here.

I've not heard of reordering reports with the waveshare can hat before. Some people have reported that certain Linux versions show reordered packets. I'm not sure what the resolution will be in your case, but you will definitely need to find a solution so that you no longer get incrementing bytes_invalid reports.
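If it helps with testing, the counters are easy to watch mechanically. Here is a minimal sketch (my own script, assuming the stock "Stats ..." line format that klippy.log uses) that reports every increment of a per-mcu bytes_invalid counter:

import re
import sys

# Matches "<mcu_name>: ... bytes_invalid=<count>" fragments of a Stats line.
FRAG_RE = re.compile(r'(\w+): [^:]*?bytes_invalid=(\d+)')

last = {}
with open(sys.argv[1]) as f:      # e.g. python3 scan_invalid.py klippy.log
    for line in f:
        if not line.startswith('Stats '):
            continue
        for name, count in FRAG_RE.findall(line):
            count = int(count)
            # Counters reset when klippy restarts; only report increases.
            if count > last.get(name, 0):
                uptime = line.split(':', 1)[0]   # "Stats <secs>"
                print("%s: %s bytes_invalid %d -> %d"
                      % (uptime, name, last.get(name, 0), count))
            last[name] = count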

-Kevin

I started with the simplest change first and went from Buster to Bullseye, then started a print, but after half an hour I already have >500 bytes_invalid.
I have ordered a BTT U2C module. That should remove the dependency on the Linux distribution in use, but it will take some weeks to arrive.
I have a Rock 3A that comes with native CAN, but I have not managed to set it up. It shows the CAN interface, but when I try to bring it up I get an error message that leads me nowhere.
My last option is a fan mount that I designed to blow on the EBB, but I would use this only to verify that it is a temperature issue.

Oh wait, I am really stupid! As I am testing without filament anyway, I could just print with the bed off and a very small run_current for the extruder stepper. The hotend shouldn't contribute much to the temperatures, but I could turn that down too.

Looking into the current log: I get the first 32 bytes_invalid at a chamber temperature of 25 °C. If this is temperature related, then CAN is not for me.

Stats 6658.3: gcodein=0  mcu: mcu_awake=0.004 mcu_task_avg=0.000010 mcu_task_stddev=0.000012 bytes_write=197077 bytes_read=141326 bytes_retransmit=9 bytes_invalid=0 send_seq=6062 receive_seq=6062 retransmit_seq=2 srtt=0.001 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=99995536 EBBCan: mcu_awake=0.003 mcu_task_avg=0.000012 mcu_task_stddev=0.000017 bytes_write=65710 bytes_read=63030 bytes_retransmit=0 bytes_invalid=193 send_seq=2268 receive_seq=2268 retransmit_seq=0 srtt=0.002 rttvar=0.001 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=63999710 adj=64001883 cpu: temp=47.8 chamber: temp=23.9 sd_pos=79734 heater_bed: target=0 temp=29.3 pwm=0.000 sysload=1.11 cputime=47.511 memavail=594436 print_time=509.951 buffer_time=3.411 print_stall=0 extruder: target=0 temp=30.5 pwm=0.000
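To put those numbers in perspective (figures taken from the Stats line above): the main mcu link is clean, while the CAN-attached EBBCan link is discarding a small but steady fraction of what it reads, independent of chamber temperature:

# Figures copied from the Stats line above.
mcu_read, mcu_invalid = 141326, 0     # main mcu over serial: clean
ebb_read, ebb_invalid = 63030, 193    # EBBCan over canbus: corrupted

print("mcu:    %.4f%% of read bytes invalid" % (100.0 * mcu_invalid / mcu_read))
print("EBBCan: %.4f%% of read bytes invalid" % (100.0 * ebb_invalid / ebb_read))
# -> roughly 0.31% of the toolhead traffic is being thrown away as invalid.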

It is not a temperature problem, sigh.

There is a chance that I have fixed it. At least I am now printing for real, and bytes_invalid stays at a constant 0.
In the last attempt I made the small mistake of changing two things at once: I swapped the Waveshare hat for one that I had used for tests, and I routed the CAN cables farther away from the PSU. So I am not sure which was the fix. These Waveshare hats are not that complicated; I don't believe they can work only partially.
Looking at the bytes_invalid was the information I needed, thanks for that. But of course a problem can always have more than one cause. The print will take some hours; I will be back if the problem persists.