Intermittent TMC UART weirdness

I am thrilled to report that I appear to have found a simple workaround for now and, indirectly, a confirmation of the root cause.

The issue is definitely related to timing on the TMC UART. While testing the work-tmc-20210715 branch I decided to tweak some TMC2209 settings that I read about in the data sheet. Since I suspected a timing issue, the SLAVECONF register was of particular interest. Ultimately, changing the SENDDELAY value from the default of 8 to 6 or lower completely eliminates all occurrences of TMC read faults. In many hours of testing with both my Duet Mini and the SKR Mini I was not able to trigger even a single retransmit! This was done with baud rates of 9,000, 40,000 and 57,600, and with the work-tmc-20210715 branch as well as the main (trunk) branch of Klipper. It looks like Klipper expects the response from the drivers a little sooner than it receives it with the default SENDDELAY setting of 8.

Hopefully this is easy to fix, but for now I am using the following macro as a workaround:

[gcode_macro TMC_SENDDELAY]
gcode:
  SET_TMC_FIELD STEPPER=stepper_x FIELD=SENDDELAY VALUE=2
  SET_TMC_FIELD STEPPER=stepper_y FIELD=SENDDELAY VALUE=2
  SET_TMC_FIELD STEPPER=stepper_z FIELD=SENDDELAY VALUE=2
  SET_TMC_FIELD STEPPER=stepper_z1 FIELD=SENDDELAY VALUE=2
  SET_TMC_FIELD STEPPER=extruder FIELD=SENDDELAY VALUE=2

Peter

EDIT: Upon another look at the (very confusing) TMC data sheet I realized that the default value for SENDDELAY is actually “8 bit times” which I believe is equivalent to SENDDELAY VALUE=0. I am therefore retesting with SENDDELAY VALUE=2 (one step increase above default). This, however, does not invalidate my previous results that show flawless operation with SENDDELAY VALUE=6. But it does make some of my statements above invalid, such as Klipper expecting the response sooner - it seems that the response comes too soon instead and must be delayed.
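
To make the mapping between SENDDELAY values and bit times concrete, here is a small Python sketch. It assumes the data sheet table as I read it: values come in pairs (0/1, 2/3, ...) that select 1x8, 3x8, ... bit times. The function names are made up for illustration.

```python
def senddelay_bit_times(value: int) -> int:
    """Reply delay in bit times for a 4-bit SENDDELAY setting (assumed mapping)."""
    if not 0 <= value <= 15:
        raise ValueError("SENDDELAY is a 4-bit field (0..15)")
    # Pairs (0,1), (2,3), ... share a delay of 1x8, 3x8, ... bit times,
    # so forcing the lowest bit to 1 gives the multiplier.
    return 8 * (value | 1)


def senddelay_us(value: int, baud: int) -> float:
    """The same delay expressed in microseconds at a given baud rate."""
    return senddelay_bit_times(value) * 1_000_000 / baud


if __name__ == "__main__":
    for value in (0, 2, 6):
        print(value, senddelay_bit_times(value), senddelay_us(value, 40000))
```

Under that reading, VALUE=0 is 8 bit times and VALUE=2 is 24 bit times, which is 200 µs and 600 µs at 40,000 baud.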

EDIT2: I can confirm that there are no TMC UART retransmits with SENDDELAY VALUE=2. I have revised the above macro accordingly. I would consider this part of the investigation closed.
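
For printers with a different set of steppers, the macro body can be generated rather than typed by hand. This is only a convenience sketch: `senddelay_commands` and `render_macro` are hypothetical helper names, and the stepper names must match the sections in your own printer.cfg.

```python
def senddelay_commands(steppers, value=2):
    """Return one SET_TMC_FIELD line per stepper name."""
    return [
        f"SET_TMC_FIELD STEPPER={name} FIELD=SENDDELAY VALUE={value}"
        for name in steppers
    ]


def render_macro(steppers, value=2, name="TMC_SENDDELAY"):
    """Render a gcode_macro section in the same layout as the macro above."""
    lines = [f"[gcode_macro {name}]", "gcode:"]
    lines += ["  " + cmd for cmd in senddelay_commands(steppers, value)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_macro(["stepper_x", "stepper_y", "stepper_z", "extruder"]))
```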

Interesting!

Changing SENDDELAY is not hard. I’m pretty sure this isn’t a Klipper issue, but it may be an erratum in the tmc2209 chips. (In particular, I wonder if the chips are being confused by each other’s responses when a single uart line is shared between multiple chips.)

Would you be able to confirm that explicitly setting SENDDELAY=0 still causes issues? (That is, can you confirm this isn’t a misunderstanding of the default tmc2209 setting for SENDDELAY.)

-Kevin

Indeed quite interesting!

I could swear that I read somewhere that SENDDELAY has to have a minimum value other than zero when using multiple slaves, but I am unable to find it in the data sheets, so I think I may be confusing it with something else. RRF also explicitly sets the value to 0, with the comment "// we don't need any delay between transmission and reception".

I can also confirm that explicitly setting SENDDELAY=0 causes the retransmissions to return.

Does this help? TMC-API/TMC2209_Fields.h at 558113493a8cde0eb68a3794b77752622e9ed39e · trinamic/TMC-API · GitHub

Thank you for the link, I was not aware that TMC had a Github repository.

I am not sure how Kevin would like to proceed from here. Setting SENDDELAY=2 as default in Klipper would be extremely simple, I imagine. But the question remains why the default value results in missed responses while it (maybe) works fine with RRF, etc. If there is nothing obvious in Klipper code then I would imagine a question to Trinamic engineering would perhaps be appropriate.

I think we should do two things: 1) set SENDDELAY=2 on tmc2209 drivers and 2) increase the UART speed to 40000 on non-AVR micro-controllers.

I hope to get a PR up with the above sometime this week.
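
To illustrate the two changes, here is a rough Python sketch. It is not the actual patch: `pick_tmc_uart_settings` and the AVR name prefixes are made up for illustration, and the 9000 baud fallback for AVR is an assumption based on the rates tested earlier in this thread.

```python
# Assumption: AVR micro-controllers can be recognised by these name prefixes.
AVR_PREFIXES = ("atmega", "at90usb")


def pick_tmc_uart_settings(mcu_type: str, driver: str) -> dict:
    """Choose UART baud rate and driver fields for one stepper driver."""
    # 1) Keep the slower rate on AVR, use 40000 everywhere else.
    baud = 9000 if mcu_type.startswith(AVR_PREFIXES) else 40000
    settings = {"baud": baud}
    # 2) Only the tmc2209 gets an explicit SENDDELAY.
    if driver == "tmc2209":
        settings["SENDDELAY"] = 2
    return settings


if __name__ == "__main__":
    print(pick_tmc_uart_settings("stm32f103", "tmc2209"))
    print(pick_tmc_uart_settings("atmega2560", "tmc2208"))
```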

-Kevin


That’s perfect for my installation :slightly_smiling_face:

Thank you very much,
Peter.

I am posting this for the sake of documenting some additional information.

I was chatting with Desuuuu who is using an SKR board with TMC2209 drivers. His board has independent UART interfaces for each driver. He did some testing with the additional debug logging patch and had 0 failed reads at both 9kbps and 40kbps. At 200kbps he gets a few failed reads per minute. The results were exactly the same with SENDDELAY set to 2.

This at least confirms that the issue is isolated to TMC drivers that share a single physical UART.
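
For a feel of how much tighter the timing gets at higher baud rates, here is a back-of-the-envelope Python sketch of one register read. It assumes a 4-byte read request and an 8-byte reply, each byte framed with one start and one stop bit, plus the SENDDELAY pause in between (pairs of values selecting 1x8, 3x8, ... bit times). Those sizes are my reading of the data sheet, and host-side processing time is ignored.

```python
BITS_PER_BYTE_ON_WIRE = 10  # 8 data bits + start bit + stop bit
READ_REQUEST_BYTES = 4      # assumed size of a read request datagram
READ_REPLY_BYTES = 8        # assumed size of the reply datagram


def read_round_trip_us(baud: int, senddelay: int = 0) -> float:
    """Wire time of one register read in microseconds (idealised)."""
    delay_bits = 8 * (senddelay | 1)  # assumed SENDDELAY mapping
    total_bits = (
        READ_REQUEST_BYTES * BITS_PER_BYTE_ON_WIRE
        + delay_bits
        + READ_REPLY_BYTES * BITS_PER_BYTE_ON_WIRE
    )
    return total_bits * 1_000_000 / baud


if __name__ == "__main__":
    for baud in (9000, 40000, 200000):
        print(baud, round(read_round_trip_us(baud), 1))
```

With these assumptions a whole read takes 3200 µs at 40,000 baud but only 640 µs at 200,000 baud, and a single bit lasts just 5 µs at the higher rate.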

Peter.

Is there a reason SENDDELAY=2 is only set for the 2209 and not for the 2208?
I just updated my klipper installation and started to see similar issues with TMC2208s on my Trigorilla board.

Our testing showed that even TMC2209 was not affected by this issue in configurations where each driver has a separate UART interface. Only installations with TMC2209 where multiple drivers (up to 4) are using a single physical UART interface with soft slave addressing were showing this issue. TMC2208 does not support multiple slave addressing and each driver has to use a separate dedicated UART and therefore is not impacted.

I am also not personally aware of any prior reports of UART retransmit errors on 2208.

Interesting … I just ran into the issues again (every time it fails for a different stepper and/or register). I restarted the Klipper host multiple times, and every time I tried homing it failed again. Then I added

self.fields.set_field("SENDDELAY", 2)

to tmc2208.py and restarted the host. Everything worked fine afterwards. So while I might have a different problem, the fix might still be the same.
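
For anyone wondering what such a field write amounts to at the register level, here is a standalone Python sketch. It assumes SENDDELAY occupies bits 11..8 of the SLAVECONF register, which is how I read the data sheet. The helper below is an illustration only and is not Klipper's own implementation.

```python
SENDDELAY_MASK = 0x0F << 8  # assumed position of SENDDELAY within SLAVECONF


def lowest_set_bit(mask: int) -> int:
    """Index of the least significant 1 bit in mask."""
    return (mask & -mask).bit_length() - 1


def set_field(reg_value: int, mask: int, field_value: int) -> int:
    """Return reg_value with the bits selected by mask replaced by field_value."""
    shift = lowest_set_bit(mask)
    return (reg_value & ~mask) | ((field_value << shift) & mask)


if __name__ == "__main__":
    print(hex(set_field(0, SENDDELAY_MASK, 2)))
```

With that layout, SENDDELAY=2 corresponds to a SLAVECONF value of 0x200.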

@Nitek May I ask if the change solved your problems permanently?

I also have a Trigorilla (1.0) board and installed the TMC2208 v3 a few days ago. I notice the same strange behaviour (different stepper fails, always repeatable with a G28/M84/G28 combo, etc.) and tried to solve it by setting a different SENDDELAY value, but that didn’t solve the problem. I then have to turn off the printer, wait some time, turn it back on again and only send a G28 at the beginning of a print, but not manually…

Generally, if I issue the G28, the printer starts to home, but then stops the steppers at a random point and the MCU shuts down. So sadly, every time the stepper communication fails, I am unable to execute a DUMP_TMC, so I cannot get further information.

What I find really strange is that I was able to successfully finish a 3 hour print, BUT my end gcode contains a G28, and that failed again.

EDIT: Oh, I see, it seems to be same as in: TMC2208: periodically gets errors while homing - #7 by massild

Reverted to my old stepper drivers for now.

Well, yes and no. It doesn’t happen after every print anymore, but I noticed it again after a series of prints, just before shutting down the printer. So for me the situation seems to have at least improved.

I also appear to be having these issues since updating Klipper to the latest release yesterday. I had previously been using TMC2209 over UART (each with its own UART pin) for months without issue. I suppose this is likely related to the new change to periodically check the status of the drivers.

The issue only appears to occur during homing: printing is fine, then the error is thrown during homing at the end.

Given it was previously working perfectly, is there any way to disable the new functionality or another workaround to prevent these crashes?

There is no option to disable the periodic driver checks.

If you’re having an issue with the latest code, best would be to open a new topic here on Discourse with the full Klipper log file and a description of the steps necessary to produce the error. Hopefully, someone will be able to help identify the problem.

Separately, a few people have reported that issuing explicit SET_STEPPER_ENABLE commands for the steppers prior to homing them has improved stability. It is unknown why this is the case.

-Kevin

I’m just catching up on this after someone mentioned this issue on Discord and it got me thinking… any reason why I can’t or shouldn’t just add this to my TMC2209 powered printer?

[delayed_gcode TMC_SENDDELAY]
initial_duration: 5.
gcode:
  M117 Running startup macro...
  SET_TMC_FIELD STEPPER=stepper_x FIELD=SENDDELAY VALUE=2
  SET_TMC_FIELD STEPPER=stepper_y FIELD=SENDDELAY VALUE=2
  SET_TMC_FIELD STEPPER=stepper_z FIELD=SENDDELAY VALUE=2
  SET_TMC_FIELD STEPPER=extruder FIELD=SENDDELAY VALUE=2
  M117 Startup macro done!

The latest Klipper code automatically sets SENDDELAY to 2 on tmc2209 drivers. So, no manual intervention is necessary.

-Kevin


Jumping into the discussion, I am experiencing similar problems with my TMC2209 drivers on a BTT Big Octopus board. The occurrence of the error is very erratic, even with the same gcode, it does not occur at the same point. Everything seems to point at communications errors with the tmc2209.

This is the error that is reported (always on x):

TMC 'stepper_x' reports error: DRV_STATUS: 001107c3 otpw=1(OvertempWarning!) ot=1(OvertempError!) ola=1(OpenLoad_A!) olb=1(OpenLoad_B!) t120=1 t143=1 t150=1 cs_actual=17
Once the underlying issue is corrected, use the
"FIRMWARE_RESTART" command to reset the firmware, reload the
config, and restart the host software.
Printer is shutdown
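
As a cross-check of the flags in that message, here is a small Python sketch that decodes the raw DRV_STATUS word. The bit layout is my reading of the TMC2209 data sheet and should be treated as an assumption; `decode_drv_status` is a made-up helper, not part of Klipper.

```python
# Assumed TMC2209 DRV_STATUS layout: field name -> (bit shift, width in bits)
DRV_STATUS_FIELDS = {
    "otpw": (0, 1), "ot": (1, 1),
    "s2ga": (2, 1), "s2gb": (3, 1),
    "s2vsa": (4, 1), "s2vsb": (5, 1),
    "ola": (6, 1), "olb": (7, 1),
    "t120": (8, 1), "t143": (9, 1), "t150": (10, 1), "t157": (11, 1),
    "cs_actual": (16, 5),
    "stealth": (30, 1), "stst": (31, 1),
}


def decode_drv_status(raw: int) -> dict:
    """Split a raw 32-bit DRV_STATUS value into its named fields."""
    return {
        name: (raw >> shift) & ((1 << width) - 1)
        for name, (shift, width) in DRV_STATUS_FIELDS.items()
    }


if __name__ == "__main__":
    decoded = decode_drv_status(0x001107C3)
    # Show only the fields that are set, as the error message does.
    print({name: value for name, value in decoded.items() if value})
```

With this layout the non-zero fields of 001107c3 are otpw, ot, ola, olb, t120, t143, t150 and cs_actual=17, which matches the message above.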

Because of this error, I started checking the temperatures of the stepper drivers using thermocouples. It turns out there is no overheating before the error occurs. Only after the error does the driver seem to get locked into a state where it starts heating like crazy (see image below).

I cut the power before anything disastrous could happen.

Things I have tried:

  • Swapping tmc drivers (x drive for one of the z steppers). Error still occurs on x.
  • Swapping x and y cables and adjusting configuration. Error still occurs on x.
  • Increasing motor current. This initially seemed to decrease the frequency of the error, but it still keeps happening.
  • Changing the baud rate of the tmc drivers to 100000. Again this initially seemed to help (I was able to finish an 8 hour print), but the error reoccurred this morning just after the first layer of a print.
  • Changing the SENDDELAY value to 2. This didn’t make a difference, but as I understand from the posts above, this is already implemented in Klipper.

I am a bit at a loss at the moment as to what I can do next. And I am especially concerned about the stepper driver getting locked into a state where it just starts heating up and becomes unresponsive.

I have added the code to tmc_uart.py to enable the logging. What is not clear is where I should be able to find this logging information. Is that supposed to land in klippy.log?

I’m not sure that this is really the same issue. So far I have not seen any reports of false-positive OvertempWarning.
I’m equally not sure whether measuring the temperature on the outside of the drivers is really representative of their internal measurement (with respect to the actual temperature and the time delay until the temperature reaches the probe).

Have you actually tried to add active cooling? At which current do you run the steppers?

The printer (Voron 2.4) has active cooling. The thermocouples were placed directly on top of the driver chip on the underside using heat conductive tape, so I don’t expect much delay / difference when it comes to actual temperatures inside the chip. I have tried currents from 0.8 to 1.4 A. With 1.4 A the drivers get warmer of course, but nothing to worry about. Moreover, with increasing current, the error seemed to occur less frequently.