Configurable timeout

I created a PR to make the timout for single and multi mcu configurable via the config.

Why?
Some people (including me) have (sometimes) issues with a multi mcu setup with a timeout during z-homing.
The default timeout is 25ms, in my case it just don’t workout every time. Increasing the limit to 50ms fixes this issue for me entierly, but needs to be done by modifying the mcu.py file after every klipper update.
This PR makes the timeout configurable via the config while keeping the default values as is.
I am aware of the issues caused by longer timeouts (any already tried every suggested solution), but the user should be able to set this in the config instead of the python code. Additonally it will then be visible in the log and simplify debugging instead of changing it in the python file.

PR

This setting has the potential to cause a serious head crash that could damage a printer, so I personally don’t think it should be user facing. I especially don’t think it should be adjustable without an upper bound on the value. Perhaps the default is too strict, but I have personally never experienced this issue on either of my printers where I’ve used a variety of CAN boards (Huvud 0.5, Huvud 0.61, EBB42, Fly SHT42, and SB2040).

If you are having issues with the default 25ms, then I think it would be good to post the Klipper log files here from sessions containing the error (from code running the pristine Kipper code). The system should be able to reliably respond within 25ms, and if it isn’t there is a low-level reason for that inability. If at all possible I think we want to track down the root cause of the issue instead of adding config options to ignore a real problem.

-Kevin

@koconnor Thank you very much for you help :slight_smile:
I will try to reproduce the problem and post the logs.
To get additonal information, I need to emergency stop after the timeout occures, correct?

Yes, for best results, issue an M112 command (or hit the emergency stop button) immediately after the undesirable event.

-Kevin

Haven’t got any time yet to create the logs.

But I found this recommendation in the Voron discord to adjust the timeout to 50ms für all Happy Hare Users with a can board.

@koconnor Here are the logs, I could reproduce it with 25ms timeout.
If you have some spare time to have a look at it, it would be great! Appreciating your help.
I used by branch, but there is virtually no difference to the main branch when using 25ms timeout.
There are no bytes_retransmit or bytes_invalid, klipper is also pinned to one core of the CPU which reduced this issue significantly.

klippy.log (349.4 KB)
moonraker.log (21.9 KB)

It looks like you’re running an old version of Klipper in “usb to canbus bridge mode” that is known to have problems ( Upgrade if using Klipper USB to CANbus bridge mode ). You should definitely update, as we know there were issues with older versions of the code.

If the issue does reoccur after updating, I really need the logs from a pristine version of Klipper in order to investigate. Issues like this are very hard to track down, and they’re basically impossible if I don’t have the exact code that is running.

-Kevin

@koconnor
Here are the logs for the new Version: v0.12.0-58-g25bc649c
Everything was flashed to the new firmware

The test I was running: Z homing and moving up in a loop.
But I didn’t even get 2 or 3 homes till the Communication timeout during homing z kicked in.

G28
G1 F3000 Z200
G28 Z
G1 F3000 Z200
G28 Z
G1 F3000 Z200
G28 Z
G1 F3000 Z200
...

klippy.log (395.6 KB)
moonraker.log (11.1 KB)

Alas, the log does not contain all the messages being sent at the time of the event. The time between the event and the shutdown request was too long.

Also, the Klipper code was modified with gcode_shell_command.py code. Best to reproduce with the pristine Klipper code.

You can try modifying Klipper to raise a shutdown event immediately after a timeout event - that can make it easier to ensure the log is complete:

--- a/klippy/extras/homing.py
+++ b/klippy/extras/homing.py
@@ -135,6 +135,7 @@ class HomingMove:
             if error is None:
                 error = str(e)
         if error is not None:
+            self.printer.invoke_shutdown("Forcing shutdown for homing debug")
             raise self.printer.command_error(error)
         return trigpos
     def check_no_movement(self):

If modifying the code, be sure to run a full sudo service klipper restart after making any changes.

Also, if you are able to reproduce, it would help if you could provide the canbus log from the event as described at CANBUS Troubleshooting - Klipper documentation . Just compress the resulting log file and attach it here.

Finally, there is a spike in system load just a few seconds before the timeout is reported (without a spike in klipper processing load - which indicates it is external from klipper):


(note the spike of the cyan line near the right side of the graph).
If you’ve got any kind of background processes (webcams, timelapse software, remote monitoring, etc.) then you may want to check if the problem persists with them disabled.

-Kevin

Thank you for your very fast response!

  • The load peak was due to opening up the webcam view, but I also have the issue with deactivated webcam (deactivated in moonraker but still physically connected).
  • The gcode_shell_command.py is only used for config backups being pushed to github, see here, this should not have any effect on the homing.

This time it went straight into the homing error not even having moved 20mm down, everything should now be contained in the logs.

You find all files (including the candump) attached.
klippy.zip (118.4 KB)

I also removed the webcam config from crowsnest, so the webcam should also no longer be active (I think).
Now the system load is less than 10% when the timeout occures.
I also found the docu for the graphs :wink:


klippy.zip (637.0 KB)

The root cause of the error in your log appears to be a communication hiccup of at least 15ms in the USB system (either USB hardware, Linux kernel handling, or Klipper’s CANBUS bridge). Specifically, the main mcu attempted to send a message to the host and the Linux kernel doesn’t report receiving it until 15ms has elapsed. Specifically, this can be seen in the log (after processing with logextract.py):
Receive: 86 850.286296 850.263383 14: seq: 1f(1102), trsync_state oid=13 can_trigger=1 trigger_reason=0 clock=1580327894(850.271477)
The mcu sent the message at 850.271477 but it was not received at the host until 850.286296.

It’s unclear what caused the issue at the (likely) USB level.

However, it’s not ideal that Klipper raised a communication timeout due to 15ms of latency. (The goal is to avoid 25ms of latency.) I think some scheduling improvements within Klipper could, in practice, reduce the chance of a timeout.

Could you try the code at GitHub - KevinOConnor/klipper-dev at work-trsync-20240102 ? (Be sure to issue sudo service klipper restart after making the change.)

-Kevin

I just tested work-trsync-20240102 several times:

  • 01: after sudo service klipper restart
  • 02: after timeout at 01 and firmware restart ← took almost 10 min of homing to get the timeout, but I forgot to activate candump
  • 03: after removing the klipper.service pinned to just 1 core of the CPU (full reboot of the system) CPUAffinity=3 in /etc/systemd/system/klipper.service
  • 04: after timeout at 03 and firmware restart

01:

02:

03:

04:

klipper_dev.zip (1.1 MB)

  • 05: I was just able to home for 25min without candump and without a timeout.
  • 06: Enabling the webcam again I got a timeout.

05:

06:

klipper_logs_05-06.zip (465.3 KB)

Just for completeness, my setup:

  • MKS SKIPR

  • Mellow Fly SB2040v2

  • SKIPR: USB 3.0 → USB Hub → Webcam | WiFi Stick | Touchscreen

  • SKIPR: USB 2.0 → MCU USB Host (CAN via USB bridge), I can’t use the internal (PCB) connection for CAN via UART if I wan’t to use the CAN out of the mcu. Thats something klipper doesn’t support currently as far as I know / understand.

  • SKIPR: CAN → SB2040 v2

Okay, thanks for testing.

Can you summarize your results? Is the code on that development branch an improvement, the same, or worse?

Have you been able to identify a pattern that leads to an increase in timeouts?

Can you pull that development branch again (now at commit a493fe7f) and see if that improves behavior?

FYI, all of the logs I’ve seen so far indicate the same basic behavior - sporadic latency of 15+ms at the (likely) USB interface. This is in contrast to your connection’s normal latency of ~1ms. FWIW, this does seem to be a hardware problem (or system driver problem).

-Kevin

Also, have you tried hooking up the “SKIPR” board to the usb hub? A USB3 hub may actually improve scheduling (many hubs can off-load processing from the host).

-Kevin

@koconnor I have Octopus V1.1 with BTT U2C and BTT SC2209 can bus. I’m getting the Communication time out during probe error. I can limit the error by increasing my timing from .25 to .50 but i know that isn’t a fix. I did try something else you suggested in some of your documentation by changing two lines but that didn’t work because it didn’t like it and i had to update/ installed klipper again to get things running again. If i reboot my pi I will still get the error but if i power off the whole printer i will get get a few prints fine until the error appears again. I have done a lot of research but everyones setup is different. Where can I go from here. Thank you for any direction. I’m trying to learn more with this also not just make it work,
klippy (20).log (4.0 MB)

If you want to try the development branch, you’d need to ssh into the host machine and switch to that branch - typically with something like: sudo service klipper stop ; cd ~/klipper ; git fetch https://github.com/KevinOConnor/klipper-dev.git work-trsync-20240102 ; git checkout work-trsync-20240102 ; sudo service klipper restart

If you are still having problems after changing to that branch, you should remove any local changes (no z_calibrate.py modification), and be sure to run an M112 command immediately after the timeout event occurs. Then attach the resulting log file here.

-Kevin