Hello, I have been debugging the Timer Too Close (TTC) errors that affect a subset of ERCF users running Happy Hare (Timer too close error at Unload sequence · Issue #280 · moggieuk/Happy-Hare), and I've been able to find where the TTC is triggered. During homing, `_update_drip_move_time` calls `reactor.pause` to pause for a small amount of time. The greenlet running this function does not resume within a reasonably short period, which delays the next set of `queue_step` commands being sent to the MCU and triggers a TTC error.
Filament changes are a multi-MCU homing move, with the motor belonging to the ERCF MCU and one or more endstops on the toolhead.
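To make the failure mode concrete, here is a toy timing model (my own illustration, not Klipper code): the MCU rejects step commands whose scheduled time has already passed by the time they arrive, so a pause that overruns the remaining buffered move time produces a TTC.

```python
# Toy model (not Klipper's actual code) of why an overrunning pause
# triggers "Timer too close": steps scheduled at or before the MCU's
# current clock are rejected.
def mcu_accepts(step_time, mcu_clock):
    """An MCU only accepts steps scheduled strictly in its future."""
    return step_time > mcu_clock

# Suppose 0.25s of moves are buffered when the host pauses. If the
# pause overruns and the host only resumes once the MCU clock has
# advanced 0.30s, the next queue_step commands are scheduled in the
# MCU's past -> TTC.
buffered_until = 0.25  # print time covered by already-queued steps
resume_clock = 0.30    # MCU clock when the host finally resumes
print(mcu_accepts(buffered_until, resume_clock))  # False -> TTC
```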
Investigation
A few things were clear at the beginning:
- TTC is almost always triggered during a filament unload move.
- TTC is sometimes triggered at the beginning of a print, during regular XYZ homing.
- There was no increase in error bytes or bytes re-transmitted during a TTC.
Installing Happy Hare was new to me, so I figured it may be doing something wrong. I attempted various software optimisations (including converting klipper-led_effects to use numpy) and wiring changes, and the only variable that had any significant effect was whether LEDs were enabled or not.
I tried adding logs to `check_timers` to see if there is a task taking a bit too long to execute, and I noted a few things (some of which will be specific to my setup):
- When not running a print, the number of tasks in `self._timers` sits at around ~20.
- During filament changes (managed by Happy Hare), `self._timers` grows to ~50 tasks.
- Sometimes tasks can run for > 0.1 seconds, but my logging didn't pick up any tasks taking a long time at the moment a TTC is raised.
- Sometimes paused tasks would not resume until over a minute later (in one case I saw a task resume after 500s).
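For reference, the instrumentation idea looks roughly like this (a hand-rolled sketch, not the exact patch I ran; the wrapper and threshold are my own):

```python
import logging
import time

SLOW_THRESHOLD = 0.1  # seconds; anything longer gets logged

def timed_callback(callback, name):
    """Wrap a reactor timer callback to log executions that run long."""
    def wrapper(eventtime):
        start = time.monotonic()
        next_waketime = callback(eventtime)
        elapsed = time.monotonic() - start
        if elapsed > SLOW_THRESHOLD:
            logging.warning("timer %s ran for %.4fs", name, elapsed)
        return next_waketime
    return wrapper
```

Each timer would be wrapped like this before registration, so slow callbacks show up in klippy.log with their duration.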
After staring at my logs and the command queue dump in klippy.log, I had a realisation: the drip move would continue around ~0.3s after the last time it ran, so I increased the buffer time by 0.3s and I no longer get errors.
Given that I could not see any obvious long-running tasks, I am running under the assumption that `reactor._check_timers` is running nominally. If this were something low level, such as CAN bus issues or kernel bugs, I would expect TTC errors during printing, but it only occurs during homing.
The issue
I believe the issue is with how `reactor.pause` works: it creates a callback, adds it to the end of `self._timers`, then processes the timers list from the beginning. So if the prior ~40 tasks take 0.3s to run, `_update_drip_move_time` is delayed for significantly longer than the requested pause time. And since the pause task is added to the end of the queue, if an earlier task pauses itself, a new run of the dispatch loop starts from the beginning of the queue, further delaying resumption of later pause tasks.
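The effect can be shown with simple arithmetic (a back-of-the-envelope model, not the reactor's real scheduler): if every due timer ahead of the pause callback at the end of the list runs first, the effective pause is the requested duration plus the sum of those earlier run times.

```python
def effective_pause(pause_duration, earlier_busy_times):
    """Model a front-to-back timer scan: every due timer ahead of the
    pause callback runs first, extending the observed pause."""
    return pause_duration + sum(earlier_busy_times)

# Six 0.05s busy timers ahead of a 0.5s pause -> it resumes after
# roughly 0.8s, in line with the ~0.9s seen in the reproduction below
# (which also pays the overhead of extra scan passes).
print(round(effective_pause(0.5, [0.05] * 6), 2))  # 0.8
```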
Fortunately it is very easy to reproduce this behaviour. This test creates a few tasks that are "busy" for 0.05s, and a sequence of tasks that pause for 0.5 seconds. Ideally each of the pausing tasks resumes in at most 0.5 seconds, since we do not specify a pause duration but rather the time at which the task should resume.
```python
import logging
import time

from reactor import EPollReactor

reactor = EPollReactor()

def end(waketime):
    logging.info("ending")
    reactor.end()
    return reactor.NEVER

def pausing_callback(waketime):
    start = time.time()
    logging.info(f"{waketime:.4f} pausing")
    reactor.pause(waketime + 0.5)
    end = time.time()
    logging.info(f"{waketime:.4f} end pause, resumed after {end - start:.4f}s")
    return reactor.NEVER

def busy_callback(waketime):
    time.sleep(0.05)
    return waketime + 0.05

def test_reactor():
    logging.info("starting")
    start = reactor.monotonic()
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(busy_callback, start)
    reactor.register_timer(pausing_callback, start)
    reactor.register_timer(pausing_callback, start + 0.1)
    reactor.register_timer(pausing_callback, start + 0.2)
    reactor.register_timer(end, reactor.monotonic() + 1)
    reactor.run()
    logging.info("done")

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(threadName)s %(funcName)s %(levelname)s - %(message)s")
    test_reactor()
```
The result is that the first pausing task isn’t resumed until ~0.9 seconds after the pause is started:
```
2024-07-05 14:58:24,515 MainThread test_reactor INFO - starting
2024-07-05 14:58:24,816 MainThread pausing_callback INFO - 9646.3028 pausing
2024-07-05 14:58:25,116 MainThread pausing_callback INFO - 9646.6036 pausing
2024-07-05 14:58:25,417 MainThread pausing_callback INFO - 9646.9043 pausing
2024-07-05 14:58:25,718 MainThread pausing_callback INFO - 9646.3028 end pause, resumed after 0.9022s
2024-07-05 14:58:25,718 MainThread pausing_callback INFO - 9646.6036 end pause, resumed after 0.6017s
2024-07-05 14:58:26,019 MainThread end INFO - ending
2024-07-05 14:58:26,019 MainThread pausing_callback INFO - 9646.9043 end pause, resumed after 0.6017s
2024-07-05 14:58:26,019 MainThread test_reactor INFO - done
```
I haven't been able to determine if this is the root cause, however; there could be some hardware or other software configuration issue at play.
Remediation
What has worked for me is to increase the look-ahead time in `_update_drip_move_time`:
```diff
diff --git a/klippy/toolhead.py b/klippy/toolhead.py
index 4149d53b..3ca030d6 100644
--- a/klippy/toolhead.py
+++ b/klippy/toolhead.py
@@ -506,7 +506,7 @@ class ToolHead:
             curtime = self.reactor.monotonic()
             est_print_time = self.mcu.estimated_print_time(curtime)
             wait_time = self.print_time - est_print_time - flush_delay
-            if wait_time > 0. and self.can_pause:
+            if wait_time > 0.3 and self.can_pause:
                 # Pause before sending more steps
                 self.drip_completion.wait(curtime + wait_time)
                 continue
```
The 0.3 seconds is the upper bound for the delay I saw in the command queue.
Potential updates in Klipper
The most obvious and easiest change would be to make this look-ahead time configurable. I'm not entirely clear on what the downsides of increasing it are, though.
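As a sketch of what the configurable version could look like (the helper and default below are my own illustration, not Klipper's API), the check in `_update_drip_move_time` reduces to a threshold comparison:

```python
def should_pause(print_time, est_print_time, flush_delay, lookahead=0.0):
    """Return True when enough print time is buffered that the host can
    safely pause before sending more steps (mirrors the diff above)."""
    wait_time = print_time - est_print_time - flush_delay
    return wait_time > lookahead

# Stock behaviour: pause whenever any buffer remains.
print(should_pause(10.0, 9.5, 0.2))                 # True  (0.3 > 0.0)
# With a 0.3s look-ahead, the same state keeps feeding steps instead.
print(should_pause(10.0, 9.5, 0.2, lookahead=0.3))  # False (0.3 > 0.3)
```

Wiring the `lookahead` value to a printer config option would let affected users tune it without patching the source.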
The next idea is to improve the observability of Klipper itself, such as:
- Making the log format configurable (especially to add timestamps, as in the test script above).
- Monitoring of timers in `reactor._timers` to keep track of long-running tasks, the time a task is started vs when its `waketime` expects it to be run, and excessive pause times.
  - I guess the challenge is to add observability with the lowest performance impact possible. This is a bit outside of my day job, but maybe eBPF could be the way to implement this?
- Monitoring of `reactor.check_timers` to keep track of potential greenlet leakage.