Intermittant Timer Too Close Fault during Eddy Home/Tap

Basic Information:

Printer Model: ZeroG Mercury 1.1 370

MCU / Printerboard: Octopus Pro v1.1 H723

USB to CAN bus Bridge @1MBd to SHT36 V3 Max

Host / SBC: RPI 3B

klippy.log

SHT36ttc.log (453.5 KB)

Describe your issue:

I’m running into intermittent Timer Too Close faults with my SHT36 Tool head when using the eddy probe for homing/tap. I posted on discord and Timofey Titavets very helpfully noted the that the last sent trsync entries repeatedly show an earlier clock. Generally it jumps back to the tune of 80-100M ticks. @12Mhz, I believe that’s 6-8 seconds. I was able to reproduce recently by repeatedly triggering eddy probe homing or taps after rolling the log which is attached.

I thought I’d post here for easier tracking.

Welcome Hawk37,

interesting printer.

Is your CAN cable ok?

Hi, thanks for taking a look!

The cable is new for this setup. Its the mellow integrated one with a shielded twisted pair and 24V lines. I have the shield connected at the board side, and CAN lines plug right in to the Octopus v1.1 jst port.

I am not an expert on CAN, but I believe I read that Tx Retries are fairly typical when the bus is busy. The lower priority message is resent if two controllers try to talk at the same time. As I have no RX or TX errors I didn’t see the retries as a red flag.

It looks like the retry count actually started in the log at 835 and ended at 1124. So, while the total is 1124 it actually only increased by 289 counts during the testing. I don’t have another reference, is that abnormally high?

JFYI:
tx_retries, on the RP2040 is an bus arbitration
And by so, it is okay.

Appreciate the discussion. I am planning to take your advise from discord Timofey and start backing out add-ins. Plan to try one or two tonight.

Thank you, I didn’t know that.

So I uninstalled Auto-tune, disabled Auto_speed and motors_sync. I also updated klipper as there was a release on the 12th and again today on the 15th. I was able to run 50 home/tap combos and no faults. I added everything back and ran another 50, still no timer too close. I’m not sure if its just not triggering or if one of one of the recent commits somehow solved it. I’ll keep an eye out for it coming back. :crossed_fingers: for now.

Reoccurred today on my first z tilt. So, unfortunately it just didn’t trigger in those prior 100 sequences. With that, I set up a fresh SD card with vanilla Klipper. So far so good, but unfortunately time will tell. I did add in the motors sync again as it is pretty important for my awd setup. I’ll keep an eye on things. I was potentially thinking of adding in one add-on at a time until it comes back, but with 100 home/tap operations and it not reproducing it might take a while to decide this stripped down setup is not reproducing it.

Some more updates, it occurred again with just the motors sync. So, I setup just vanilla klipper, no add-ins. On the initial Z Tilt it triggered again. I’ll try and run it a few more times to get a cleaner log as this log has my initial bring up efforts in it. Unfortunately it does appear to be something inherent to Klipper itself. @nefelim4ag Anything else you want me to try?

VanillaTimeout.log (496.8 KB)

Reproduced while setting up for a print. Much shorter log file. Interesting that there is a jump forward in this one after the jump back. Looks like it logged one more send.

CleanerVanilla.log (1.1 MB)

The time skipping is a sign of your Pi not giving enough resources to klippy.

Are you running cameras or other USB percrifrails?

While waiting on @nefelim4ag you could try setting i2c_speed: 1000000 in your [probe_eddy_current]. I think that should reduce the number of CAN packets from the toolhead and help the Pi keep up with the U2C.

No, I’m not running any significant peripherals. Aside from the USB going to the MCU the only other one goes to a touch panel for klipper screen.

I don’t have a lot of experience with them, but this doesn’t appear to be the typical timer too close trigger mode. As I understand it, the host is sending a timeout to the MCU for the homing action to be completed by, trsync_set_timeout. Generally that would be in the future, however In this case its sending a clock value that is a good 6-8 seconds behind the last one it sent. That value being so far off appears to be what then triggers the fault.

It is still the same, just with a 32-bit overflow

Sent 93 6124.411547 6124.411385 9: seq: 1c, sos_filter_set_active oid=4 n_sections=0 coeff_frac_bits=0
Sent 94 6124.505381 6124.504664 49: seq: 1d, query_status_ldc1612 oid=1, queue_digital_out oid=17 clock=82903705 on_ticks=140918, trsync_start oid=2 report_clock=83248214 report_ticks=90000 expire_reason=4, trsync_set_timeout oid=2 clock=83548214, trigger_analog_home oid=3 trsync_oid=2 trigger_reason=1 error_reason=5 clock=83248214 monitor_ticks=48000 monitor_max=3
Sent 95 6124.605145 6124.605046 7: seq: 1e, query_status_ldc1612 oid=1
Sent 96 6124.706215 6124.706116 7: seq: 1f, query_status_ldc1612 oid=1
Sent 97 6124.807055 6124.806782 17: seq: 10, query_status_ldc1612 oid=1, queue_digital_out oid=17 clock=86503705 on_ticks=48613
Sent 98 6124.829509 6124.829331 11: seq: 11, trsync_set_timeout oid=2 clock=83545525
Sent 99 6124.832643 6124.832465 11: seq: 12, trsync_set_timeout oid=2 clock=4279446609
...
Receive: 98 6124.829097 6124.807055 13: seq: 11, trsync_state oid=2 can_trigger=1 trigger_reason=0 clock=83248239
Receive: 99 6124.833988 6124.832643 11: seq: 13, shutdown clock=83300079 static_string_id=Timer too close

Until we have an idea of a fix for that, I guess, you can decrease the probability of such an event with:

Z_TILT_ADJUST METHOD=scan

So, there is not trsync involved.

1 << 32 = 4294967296
(1 << 32) - 4279446609 = 15520687
(83300079 + 15520687) / 12_000_000 = 8.235

It still seems like a command from the past.
In other words, I think there is some ordering bug, and for whatever reason, one command is delayed and gets sent only after some time.

As long as the I2C speed/rate is enough to read data in time, there is no difference from a traffic perspective. If it is not enough (100kHz may not be enough), we will miss the sensor data.

-Timofey

Appreciate the potential reduction in faults. I’ll give it a go.

On the positive side, as it only seems to occur during homing, it is only a minor inconvenience when it occurs.

:light_bulb: I can probably avoid it further by homing less frequently too. I’ve seen a some start print macros that forgo homing if the printer is already homed. Might implement those too.

Thanks for testing and providing the log. It does seem like the Klipper software is doing something incorrect. It is not clear what the issue is right now. I’ll take a further look at the logs to see if I can track down what the issue is. Unfortunately, it’s likely still a few days out before I can look at the issue in detail.

-Kevin

No problem! Timofey’s suggestion to change to scan for z tilt along with making my start print homing conditional a has really reduced the frequency.

I really appreciate all the work you folks do to make klipper so great!

I was able to take a look at this issue. Unfortunately, I have been unable to track down the root cause.

The logs seem to indicate a timer rollover issue in the low-level host c code. The error does not appear to be a delayed message being sent at a later time.

In particular, all the logs I’ve seen (4 logs - 3 here and 1 from discord) show the same pattern. There is a tsync_state message from the main mcu (shown here after post-processing using logextract.py):

Receive: 98 1150213.712746 1150213.708287 14: seq: 10(13791), trsync_state oid=16 can_trigger=1 trigger_reason=0 clock=1577674306(1150213.711630)

The code in klippy/chelper/trdispatch.c correctly responds by creating a trsync_set_timeout message for the SHT36 mcu:

mcu 'SHT36': Sent 99 1150213.712969 1150213.712783 12: seq: 11(4465), trsync_set_timeout oid=2 clock=2877267925(1150205.477910)

However, the time sent is ~8.259 seconds less than what it should have been. 8.259 seconds is nearly 2**32 clock ticks on the main mcu.
So, in the extracts above, it seems that trdispatch.c should have converted clock 1577674306 to time 1150213.711630 (as logextract.py does in the snippet above), but instead it seems to convert it to time 1150205.452910 . The mcu then correctly raises an error when it receives the incorrect command.
All the logs show the same high-level pattern.

Unfortunately, I can see no reason why the trdispatch.c code would do that. The code seems correct, this code hasn’t changed for years, and the same code is run on all homing/probing attempts on every printer running Klipper.

You might want to try wiping out the entire klipper/ directory and recloning the code (eg, rm -rf ~/klipper/ ; git clone https://github.com/Klipper3d/klipper ; sudo service klipper restart). I guess it might be possible that something corrupted the chelper build and rebuilding from scratch may clear the issue.

If you are comfortable with programming, you could also try adding debugging information to the trdispatch.c code. Something like the following may help:

--- a/klippy/chelper/trdispatch.c
+++ b/klippy/chelper/trdispatch.c
@@ -88,8 +88,23 @@ handle_trsync_state(struct fastreader *fr, uint8_t *data, int len)
     }
 
     // mcu is still working okay - update last_status_clock
+    struct clock_estimate old_ce;
+    memcpy(&old_ce, &tdm->ce, sizeof(old_ce));
+    uint64_t old_clock = tdm->last_status_clock;
     serialqueue_get_clock_est(tdm->sq, &tdm->ce);
     tdm->last_status_clock = clock_from_clock32(&tdm->ce, clock);
+    if (tdm->last_status_clock <= old_clock)
+        errorf("tr_state %p: lsc=%lld ce=%lld,%lld,%.6f,%.3f"
+               " o=%lld oce=%lld,%lld,%.6f,%.3f"
+               , tdm, (long long)tdm->last_status_clock
+               , (long long)tdm->ce.last_clock
+               , (long long)tdm->ce.conv_clock
+               , tdm->ce.conv_time, tdm->ce.est_freq
+
+               , (long long)old_clock
+               , (long long)old_ce.last_clock
+               , (long long)old_ce.conv_clock
+               , old_ce.conv_time, old_ce.est_freq);
 
     // Determine minimum acknowledged time among all mcus
     double min_time = PR_NEVER, next_min_time = PR_NEVER;
@@ -222,5 +237,8 @@ trdispatch_mcu_setup(struct trdispatch_mcu *tdm
     tdm->expire_ticks = expire_ticks;
     tdm->min_extend_ticks = min_extend_ticks;
     serialqueue_get_clock_est(tdm->sq, &tdm->ce);
+    errorf("tr_mcu_setup %p: ce=%lld,%lld,%.6f,%.3f"
+           , tdm, (long long)tdm->ce.last_clock, (long long)tdm->ce.conv_clock
+           , tdm->ce.conv_time, tdm->ce.est_freq);
     pthread_mutex_unlock(&td->lock);
 }

If changing the code, be sure to do a full restart of the Klipper host software (sude service klipper restart).

Maybe that helps a little,
-Kevin