Intermittent TMC UART weirdness

(Note that I have posted an almost identical message on Discord, but I think this is probably a much better place for it. I don’t think that Discord is the best place for non-real-time messaging.)

There has recently been a bit of activity on Discord, GitHub and seemingly here relating to some reported TMC UART anomalies. I have been using the Duet3D Mini with TMC2209 for a while now and I have initially experienced a number of unrelated issues, all related to significant modifications I installed on my printer at the same time as the Duet Mini installation, the learning curve, etc. All these issues have now been resolved except for two randomly occurring “glitches”.

One of the glitches that I have seen a few times now is related to TMC UART. The difference is that in my case it generally happens extremely infrequently (once every few weeks), it always happens on the same TMC driver (z1 in my config), and it has now happened with two different Duet Mini boards. Another common theme is that it appears to happen when homing after the printer has been powered up and not printing for extended periods of time. I have never seen any errors during printing.

It is also worth noting that in the first attached case there has been a brief power outage overnight that I believe restarted the printer. Unfortunately I am not able to determine the duration of the power outage, but I do know that the Raspberry Pi power supply in my printer has a much shorter hold-up time than the main 24V power supply.

The attached logs show the above referenced event as well as another event that just occurred this morning. Both happened when I tried homing the printer after it has been powered up and idle for some time. I would appreciate if anyone could shed some light on what might be going on…

I have been chasing some EMI issues on the printer, but in this case I don’t believe that EMI is the issue. It always happens on Z1 stepper and thus far always after homing following prolonged idle time.

Thank you,
Peter.

klippy - Unable to read tmc uart ‘stepper_z1’ register IFCNT - 2021-06-24.zip (21.0 KB)


klippy - Unable to read tmc uart ‘stepper_z1’ register IFCNT - 2021-06-27.zip (173.0 KB)

FWIW, I just had a chat with Luke @Lukeslaboratory on Discord and he has also seen similar errors once in a blue moon.

That’s odd.

The log seems to indicate that Klipper is doing what it is intended to do. It reprograms the tmc registers when the drivers are enabled. It fails to get a response from the stepper_z1 driver after 5 retries over a period of 70ms and thus reported an error.

It is odd in that previous commands to that stepper driver completed without issue just a few milliseconds earlier. Both logs show the same pattern.

I did notice that the steppers had started homing by that point, so it’s possible noise from the steppers are interfering with transmission.

One thing you could try is commenting out the tx_pin from all the tmc2209 sections (and thus use uart_pin for both rx and tx).

Separately, note that if motor power is removed from the stepper drivers while the mcu power is still applied, then it can put the drivers into a “half way reset” state which can cause odd things like this. So, if you had removed motor power during these downtimes then it could also explain issues like this.

-Kevin

Thanks very much for taking your time to look at the logs Kevin!

You may recall from one of my posts on GitHub that I have indeed been chasing some strange EMI issues with the BL Touch, but this seems to now be completely resolved. I was wondering if EMI could be a suspect here as well, but it just seems a bit too coincidental that it always occurs on the same stepper driver, always at the beginning of homing, and now on two different samples of the Mini 5+. I have also never had any issues while printing (so far?) and I’ve now accumulated over 30 hours of print time in this configuration at various speeds from 10 mm/s to 100mm/s. My other thoughts were that this type of behaviour could be caused by some corner case comms timing issue (as in for example the driver responding just a little bit too slowly to the MCU perhaps). Timing could be thermally driven, but I’m just speculating here of course.

My USB cable has the 5V power conductor disconnected (covered with Kapton tape), so the drivers and the MCU are always powered from the same main 24V power supply. I also know for certain that in the most recent event there were no power glitches and the printer has only been idle for about two hours.

I commented out all the tx_pin entries from my TMC2209 config sections per your recommendation and I will keep an eye on it.

Thanks again,
Peter.

It just happened again with modified config using single UART pin. Again while homing, but this time the printer has only been idle for less than an hour, so the idle time is likely a red herring…

klippy - Unable to read tmc uart ‘stepper_z1’ register IFCNT - 2021-07-01.log (143.7 KB)

If you modify the code in klippy/extras/tmc.py and change the line that says TMC_BAUD_RATE = 9000 to TMC_BAUD_RATE = 40000 does it fix the issue? (After any code change it is necessary to restart via sudo service klipper restart.)

-Kevin

Thanks Kevin, I am about to do this. Coincidently, while you were typing your response I was putting together a message to ask if this may be of any significance:

// Define the baud rate used to send/receive data to/from the drivers.
// If we assume a worst case clock frequency of 8MHz then the maximum baud rate is 8MHz/16 = 500kbaud.
// We send data via a 1K series resistor. Even if we assume a 200pF load on the shared UART line, this gives a 200ns time constant, which is much less than the 2us bit time @ 500kbaud.
// To write a register we need to send 12 bytes and receive 8 bytes after a programmable delay. To read a register we send 4 bytes and receive 8 bytes after a programmable delay.
// In testing I found that 500kbaud was not reliable. Minimum baud rate is 9000.
constexpr uint32_t DriversBaudRate = 100000; // at 100kbaud a transfer may take up to 2ms
constexpr uint32_t TransferTimeout = 6; // any transfer should complete within 6 ticks @ 1ms/tick. 5 wasn’t quite enough.

This is straight from RRF Pins_Duet3Mini.h

Note that the TMC_BAUD_RATE seems to be in tmc_uart.py. I changed the value to 100000 per the RRF, hope that’s ok.

A quick update. I was not able to induce any TMC UART errors with TMC_BAUD_RATE set to 100,000 over 48 hour period of intense attempts to break it. I then went searching for the upper baud rate limit and it seems that things start to break somewhere above 140,000. At around 160,000 I started getting somewhat frequent UART errors when homing.

I now reverted back to your recommended 40,000 for longer term testing.

I have not experienced any TMC UART errors with TMC_BAUD_RATE set to 40,000. It appears that increasing the baud rate from 9000 to 40000 fixed the intermittent issue.

Would you like me to open an issue on GitHub @koconnor ?

Thank you very much!

Thanks. Certainly an interesting result.

I doubt the baud itself is related to the issue. What I suspect is occurring is that the faster baud allows the initialization sequence to complete before homing starts. The real question for me is - why is the micro-controller unable to communicate with the drivers when they are homing? There doesn’t seem to be any issues during a print. (Or, maybe there is an issue, but that driver stepper_z1 isn’t moving frequently during a print?)

-Kevin

So for the “sake of science” I pulled the latest commit of Klipper from GitHub, effectively reverting back to the baseline baud rate of 9000. Within 30 minutes of fiddling with the printer I received the same error, once again on the z1 stepper.

I was going to swap the z and z1 drivers on the Duet Mini to at least try to isolate the hardware side. If the error moves from z1 stepper to z stepper then it’s likely somehow related to Mini hardware. If the issue remains on z1 stepper then it’s more likely related to software / hardware interaction - perhaps because it’s the last stepper being addressed during homing?

I am also going to let the printer run overnight with a sequence of the following moves:

g1 z349 f7.5
g1 z10 f7.5

This will run z stepper and z1 stepper at 100 microsteps per second (1.9 degree steppers set to 32 micro steps with 8mm lead screws) to keep them “busy”.

Is there anything else you can think of me doing in terms of lower level troubleshooting?

Thanks,
Peter.

If you’re comfortable making code changes, you could try something like this in klippy/extras/tmc_uart.py:

--- a/klippy/extras/tmc_uart.py
+++ b/klippy/extras/tmc_uart.py
@@ -220,10 +220,13 @@ class MCU_TMC_uart:
         reg = self.name_to_reg[reg_name]
         if self.printer.get_start_args().get('debugoutput') is not None:
             return 0
-        for retry in range(5):
+        for retry in range(25):
             val = self.mcu_uart.reg_read(self.instance_id, self.addr, reg)
             if val is not None:
                 return val
+            logging.info("TMC read %d fail on '%s' register %s at %.6f",
+                         self.name, retry,
+                         reg_name, self.printer.get_reactor().monotonic())
         raise self.printer.command_error(
             "Unable to read tmc uart '%s' register %s" % (self.name, reg_name))
     def get_register(self, reg_name):

That should increase the retries and log each failure. Maybe that will give further info on how many intermittent failures there are.

Be sure to do a full sudo service klipper restart after any code change.

-Kevin

Absolutely no issue fiddling with the code, thank you for the patch!

I had a bit of a delay because I needed a bit of a crash course to understand that white space is not necessarily white space in Python. This led to no logging despite the changes being patched-in. I did not realize this until a few days into testing after I finally set the baud rate to 200,000 and was getting consistent failures but no logging. :confused:

Once I fixed the white space (tab) issue I finally got some interesting data, see the attached two log files. Both files were produced with a fresh printer power-on followed by homing, changing the idle timeout to one hour, a (simulated - no extruding) short print with very fast X,Y,Z,Z1 stepper moves, finally followed by one hour of idle time allowing Klipper to power down the steppers.

The 9,000 baud log shows 149 occurrences of single (read 0) TMC read faults on random steppers, with steppers powered-up and even when not printing. The 40,000 baud log on the other hand shows only 3 occurrences of single (read 0) TMC read faults. There were no cases of sequentially increasing (repeating) read failures at all. I was able to induce sequentially increasing read failures leading to printer shutdown once I pushed the baud rate above 160,000.

This data would certainly explain, at least statistically speaking, why increasing the baud rate eliminated the shutdown issues in my (limited duration) testing. Would you expect to see any read faults at all on well functioning setup? Are the very infrequent and seemingly random single read faults normal?

I now changed the baud rate back to 9,000 with the logging patch applied and I will continue tinkering with the printer until I induce a printer shutdown or at least some occurrences of sequentially increasing read failures.

Peter.

klippy-09kb.log (273.0 KB)
klippy-40kb.log (271.4 KB)

That’s certainly surprising. I would have expected less issues with noise at a lower baud rate.

I wonder if we’re hitting some kind of interaction with a low baud or if it is noise related. If you increase the baud just a little, to 12000 baud, does that show a different pattern?

Would you expect to see any read faults at all on well functioning setup? Are the very infrequent and seemingly random single read faults normal?

I’ve not run this kind of test and I don’t know of anyone else that has. So, I can’t say how often retransmits occur. A few retransmits wouldn’t bother me.

Thanks for working on this.
-Kevin

It’s my pleasure, I hope we can get to the bottom of this!

The plot thickens. After changing the UART speed to 12,000 baud and repeating the exact same test I logged 862 (!) occurrences of single (read 0) TMC read faults. This now looks to me like some kind of asynchronous frequency “beating” interference. It seems unlikely to be caused by EMI / electrical noise.

EDIT: Another interesting observation is that the log is suddenly full of ‘Resetting prediction variance’ messages that, if I understand correctly would indicate clock drift between MCU and host? Bizarre!

EDIT2: I am not able to repeatably recreate the occurrence of ‘Resetting prediction variance’ at 12,000 baud. I was able to recreate it several times yesterday but not today.

EDIT3: I just now realized that the ‘Resetting prediction variance’ messages were related to the ‘rpi’ mcu, not the Duet. So much to learn…

Peter.

klippy-12kb.log (341.2 KB)

I was able to capture the ‘event’ today in the logs. The printer was started up this morning with 9000 baud setting and it was idle all day with me randomly homing it. I did not experience a shutdown, but reviewing the log just now there was an occurrence of 20 consecutive ‘stepper_z1’ read failures. This survived the increased threshold of 25 but would have caused printer shutdown with unmodified Klipper. Once again it appears to have occurred immediately after homing.

I have a spare SKR Mini v2.0 board that I will try to power up with nothing attached just to see if it also sees are any TMC2209 UART read errors. I posted a request on Discord in case anyone else was wanting to help with some data but I have not seen any responses yet.

Thank you,
Peter.

klippy-09kb-event.log.zip (156.1 KB)

EDIT: I have set-up my new SKR Mini E3v2 with a fresh FluiddPi installation on a separate Raspberry Pi. The SKR was powered with a linear precision lab power supply and the Pi was powered-up with a high quality switch-mode USB wall adapter. The installation only had the hot end thermistor plugged in with nothing else - see the attached photograph.

After powering-up I simulated homing by manually shorting the limit switch input pins and let it sit idle with stepper drivers powered-up (super long idle timeout). The attached log shows a similar number of retransmits as I am seeing on my Duet Mini at the same baud rate of 9,000. It seems that there is nothing unique about the Duet retransmits.

klippy-09kb-SKRminiE3v2.log (126.4 KB)

I have one more interesting data point. For kicks I set the baud rate to 57,600. I then completed a 6 hour print at reasonably high speed and the log only registered a total of four occurrences of single read faults…

Thanks. Quite the mystery.

I’m wondering if there is an issue with retransmits occurring too fast. If you checkout the code on the work-tmc-20210715 branch, does it improve things for you?

cd ~/klipper ; git fetch ; git checkout origin/work-tmc-20210715 ; sudo service klipper restart

-Kevin

Separately, there isn’t an issue with increasing the baud to ~40000 on 32 bit mcus. It may be worthwhile to do that no matter the outcome of this investigation.

Since Klipper uses bit-banging to implement the tmc uart, higher baud rates do increase cpu load. That’s not an issue for 32bit machines, but older AVR micro-controllers may suffer from a higher baud rate.

-Kevin