Timer Too Close during drip moves (Happy Hare related)

Hello, I have been debugging the Timer Too Close (TTC) errors that affect a subset of ERCF users running Happy Hare (Timer too close error at Unload sequence · Issue #280 · moggieuk/Happy-Hare · GitHub), and I have found where the TTC is triggered. During homing, _update_drip_move_time calls reactor.pause to pause for a small amount of time. The greenlet running this function does not resume within a reasonably short period, which delays the next set of queue_step commands being sent to the MCU and triggers a TTC error.

Filament changes are multi-MCU homing moves: the motor belongs to the ERCF MCU, while one or more endstops are on the toolhead.

Investigation

A few things were clear at the beginning:

  • TTC is almost always triggered during a filament unload move.
  • TTC is sometimes triggered at the beginning of a print, during regular XYZ homing.
  • There was no increase in error bytes or bytes retransmitted during a TTC.

Installing Happy Hare was new to me, so I figured it might be doing something wrong. I attempted various software optimisations (including converting klipper-led_effects to use numpy) and wiring changes, and the only variable with any significant effect was whether LEDs were enabled or not.

I tried adding logs to check_timers to see if any task was taking too long to execute, and I noted a few things (some of which will be specific to my setup):

  • When not running a print, the number of tasks in self._timers sits at around ~20.
  • During filament changes (managed by Happy Hare), self._timers grows to ~50 tasks.
  • Sometimes tasks can run for > 0.1 seconds, but my logging didn't pick up any long-running task at the moment a TTC was raised.
  • Sometimes paused tasks would not resume until over a minute later (in one case I saw a task resume after 500s).
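For anyone wanting to reproduce this kind of measurement, a simple way is to wrap timer callbacks before registering them. This is a hypothetical helper I sketched for illustration; Klipper has no such hook built in, and the threshold value is arbitrary:

```python
import logging
import time

def timed(callback, warn_threshold=0.05):
    """Wrap a reactor-style timer callback and log slow invocations.

    Hypothetical instrumentation sketch: you would wrap a callback with
    this before passing it to reactor.register_timer().
    """
    def wrapper(eventtime):
        start = time.monotonic()
        result = callback(eventtime)
        elapsed = time.monotonic() - start
        if elapsed > warn_threshold:
            # Log which callback ran long and for how long
            logging.warning("timer %s took %.4fs (eventtime %.4f)",
                            getattr(callback, "__name__", repr(callback)),
                            elapsed, eventtime)
        return result
    return wrapper
```

The wrapper preserves the callback's return value (the next waketime), so it can be dropped in without changing scheduling behaviour.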

After staring at my logs and the command queue dump in klippy.log I had a realisation: the drip move would continue around ~0.3s after the last time it ran. I increased the buffer time by 0.3s and I no longer get errors.

Given that I could not see any obvious long-running tasks, I am working under the assumption that reactor._check_timers is running nominally. If this were something low level, such as CAN bus issues or kernel bugs, I would expect TTC errors during printing, but they only occur during homing.

The issue

I believe the issue is in how reactor.pause works: it creates a callback, appends it to the end of self._timers, and then the dispatch loop processes the timers list from the beginning.

So if the prior ~40 tasks take 0.3s to run, _update_drip_move_time is delayed for significantly longer than the requested pause time.

And since the pause task is added to the end of the queue, if an earlier task pauses itself, the dispatch loop starts again from the beginning of the queue, further delaying resumption of later pause tasks.
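This end-of-list effect can be illustrated with a small stand-alone model. This is a simplified simulation I wrote for illustration, not Klipper's actual dispatch code; it assumes a loop that always scans timers in registration order, with the pause's wakeup timer registered last:

```python
def simulate_dispatch(n_busy, busy_time, pause_duration):
    """Model a dispatch loop that scans timers in registration order.

    n_busy timers each block the reactor for busy_time seconds and
    immediately reschedule themselves; one pause timer, registered last,
    wants to wake after pause_duration.  Returns the simulated time at
    which the pause timer actually runs.  Assumes n_busy >= 1.
    """
    now = 0.0
    busy_wake = [0.0] * n_busy   # next waketime of each busy timer
    resume_at = pause_duration   # requested resume time of the pause
    while True:
        for i in range(n_busy):
            if busy_wake[i] <= now:
                now += busy_time              # callback blocks the loop
                busy_wake[i] = now + busy_time
        if resume_at <= now:
            return now   # pause timer finally reached at end of the pass
        # nothing due: advance the clock to the next waketime
        now = max(now, min(min(busy_wake), resume_at))
```

With six busy timers at 0.05s each (as in the test script below), a 0.5s pause resumes only at ~0.6s, because the wakeup timer is only reached after a full pass over the busy timers.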

Fortunately it is very easy to reproduce this behaviour. This test creates a few tasks that are "busy" for 0.05s, and a sequence of tasks that pause for 0.5 seconds. Ideally each pausing task resumes in at most 0.5 seconds, since we do not specify a pause duration but rather the time at which the task should resume.

import logging
import time
from reactor import EPollReactor

reactor = EPollReactor()

def end(waketime):
    # Stop the reactor once the test window has elapsed
    logging.info("ending")
    reactor.end()
    return reactor.NEVER

def pausing_callback(waketime):
    start = time.time()
    logging.info(f"{waketime:.4f} pausing")
    # Ask the reactor to resume this greenlet at waketime + 0.5
    reactor.pause(waketime + 0.5)
    stop = time.time()
    logging.info(f"{waketime:.4f} end pause, resumed after {stop - start:.4f}s")
    return reactor.NEVER

def busy_callback(waketime):
    # Block the reactor for 0.05s, then reschedule immediately
    time.sleep(0.05)
    return waketime + 0.05

def test_reactor():
    logging.info("starting")
    start = reactor.monotonic()
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(pausing_callback, start)
    reactor.register_timer(pausing_callback, start + 0.1)
    reactor.register_timer(pausing_callback, start + 0.2)
    reactor.register_timer(end, reactor.monotonic() + 1)
    reactor.run()
    logging.info("done")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(threadName)s %(funcName)s %(levelname)s - %(message)s")
    test_reactor()

The result is that the first pausing task isn’t resumed until ~0.9 seconds after the pause is started:

2024-07-05 14:58:24,515 MainThread test_reactor INFO - starting
2024-07-05 14:58:24,816 MainThread pausing_callback INFO - 9646.3028 pausing
2024-07-05 14:58:25,116 MainThread pausing_callback INFO - 9646.6036 pausing
2024-07-05 14:58:25,417 MainThread pausing_callback INFO - 9646.9043 pausing
2024-07-05 14:58:25,718 MainThread pausing_callback INFO - 9646.3028 end pause, resumed after 0.9022s
2024-07-05 14:58:25,718 MainThread pausing_callback INFO - 9646.6036 end pause, resumed after 0.6017s
2024-07-05 14:58:26,019 MainThread end INFO - ending
2024-07-05 14:58:26,019 MainThread pausing_callback INFO - 9646.9043 end pause, resumed after 0.6017s
2024-07-05 14:58:26,019 MainThread test_reactor INFO - done

I haven’t been able to determine if this is the root cause however; there could be some hardware or other software configuration issue.

Remediation

What has worked for me is to increase the look-ahead time in _update_drip_move_time:

diff --git a/klippy/toolhead.py b/klippy/toolhead.py
index 4149d53b..3ca030d6 100644
--- a/klippy/toolhead.py
+++ b/klippy/toolhead.py
@@ -506,7 +506,7 @@ class ToolHead:
             curtime = self.reactor.monotonic()
             est_print_time = self.mcu.estimated_print_time(curtime)
             wait_time = self.print_time - est_print_time - flush_delay
-            if wait_time > 0. and self.can_pause:
+            if wait_time > 0.3 and self.can_pause:
                 # Pause before sending more steps
                 self.drip_completion.wait(curtime + wait_time)
                 continue

0.3 seconds being the upper bound for the delay I saw from the command queue.

Potential updates in Klipper

The most obvious and easiest change would be to make this look-ahead time configurable. I'm not entirely clear on what the downsides of increasing this look-ahead time are, though.
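As a sketch of what "configurable" could look like, the hard-coded threshold could be read from the printer config. The option name drip_flush_lookahead and the FakeConfig stand-in below are invented for illustration; Klipper's real config wrapper does provide a getfloat method:

```python
class FakeConfig:
    """Minimal stand-in for Klipper's config wrapper (illustration only)."""
    def __init__(self, values):
        self.values = values

    def getfloat(self, name, default, minval=None):
        val = float(self.values.get(name, default))
        if minval is not None and val < minval:
            raise ValueError(f"{name} must be >= {minval}")
        return val

def should_pause(print_time, est_print_time, flush_delay,
                 lookahead, can_pause=True):
    # Mirrors the condition in _update_drip_move_time, with the
    # hard-coded 0. threshold replaced by a configurable look-ahead.
    wait_time = print_time - est_print_time - flush_delay
    return wait_time > lookahead and can_pause
```

With lookahead left at its default of 0. the behaviour would be unchanged, so existing setups would not be affected.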

The next idea is to improve the observability of Klipper itself, such as:

  • Making the log format configurable (especially adding timestamps, as in the test script above).
  • Monitoring timers in reactor._timers to track long-running tasks, the difference between when a task actually starts and when its waketime says it should run, and excessive pause times.
    • I guess the challenge is to add observability with the lowest possible performance impact. This is a bit outside my day job, but maybe eBPF could be a way to implement this?
  • Monitoring reactor.check_timers to keep track of potential greenlet leakage.
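The "waketime vs actual start" metric in particular is cheap to keep. A minimal sketch, with the class name and recording scheme invented here for illustration:

```python
import time

class TimerStats:
    """Hypothetical accounting for timer lateness: records how late each
    named callback ran relative to its requested waketime."""

    def __init__(self):
        self.max_lateness = {}

    def record(self, name, waketime, now=None):
        # 'now' is injectable for testing; defaults to the current clock
        now = time.monotonic() if now is None else now
        late = now - waketime
        # Track only the worst case per callback to keep overhead low
        if late > self.max_lateness.get(name, 0.0):
            self.max_lateness[name] = late
        return late
```

A periodic dump of max_lateness to klippy.log would have made the ~0.3s drip-move delay described above directly visible, instead of requiring manual correlation against the command queue dump.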

I suspect you’ll have to report this to the “happy hare” developers. If you can reproduce the issue on pristine Klipper code, I’ll definitely take a look at it (attach the full klipper log showing the error and the steps you used to reproduce it).

Unfortunately, a modification to the Klipper code (by changing code or adding code) can easily destabilize Klipper causing errors like you’ve described (even causing errors that appear to occur in the common code). So, there’s not much I can do unless the issue is also reproducible in the unmodified code.

Cheers,
-Kevin