I created a PR to make the timout for single and multi mcu configurable via the config.
Some people (including me) have (sometimes) issues with a multi mcu setup with a timeout during z-homing.
The default timeout is 25ms, in my case it just don’t workout every time. Increasing the limit to 50ms fixes this issue for me entierly, but needs to be done by modifying the mcu.py file after every klipper update.
This PR makes the timeout configurable via the config while keeping the default values as is.
I am aware of the issues caused by longer timeouts (any already tried every suggested solution), but the user should be able to set this in the config instead of the python code. Additonally it will then be visible in the log and simplify debugging instead of changing it in the python file.
This setting has the potential to cause a serious head crash that could damage a printer, so I personally don’t think it should be user facing. I especially don’t think it should be adjustable without an upper bound on the value. Perhaps the default is too strict, but I have personally never experienced this issue on either of my printers where I’ve used a variety of CAN boards (Huvud 0.5, Huvud 0.61, EBB42, Fly SHT42, and SB2040).
If you are having issues with the default 25ms, then I think it would be good to post the Klipper log files here from sessions containing the error (from code running the pristine Kipper code). The system should be able to reliably respond within 25ms, and if it isn’t there is a low-level reason for that inability. If at all possible I think we want to track down the root cause of the issue instead of adding config options to ignore a real problem.
@koconnor Here are the logs, I could reproduce it with 25ms timeout.
If you have some spare time to have a look at it, it would be great! Appreciating your help.
I used by branch, but there is virtually no difference to the main branch when using 25ms timeout.
There are no bytes_retransmit or bytes_invalid, klipper is also pinned to one core of the CPU which reduced this issue significantly.
It looks like you’re running an old version of Klipper in “usb to canbus bridge mode” that is known to have problems ( Upgrade if using Klipper USB to CANbus bridge mode ). You should definitely update, as we know there were issues with older versions of the code.
If the issue does reoccur after updating, I really need the logs from a pristine version of Klipper in order to investigate. Issues like this are very hard to track down, and they’re basically impossible if I don’t have the exact code that is running.
Alas, the log does not contain all the messages being sent at the time of the event. The time between the event and the shutdown request was too long.
Also, the Klipper code was modified with gcode_shell_command.py code. Best to reproduce with the pristine Klipper code.
You can try modifying Klipper to raise a shutdown event immediately after a timeout event - that can make it easier to ensure the log is complete:
@@ -135,6 +135,7 @@ class HomingMove:
if error is None:
error = str(e)
if error is not None:
+ self.printer.invoke_shutdown("Forcing shutdown for homing debug")
If modifying the code, be sure to run a full sudo service klipper restart after making any changes.
(note the spike of the cyan line near the right side of the graph).
If you’ve got any kind of background processes (webcams, timelapse software, remote monitoring, etc.) then you may want to check if the problem persists with them disabled.
I also removed the webcam config from crowsnest, so the webcam should also no longer be active (I think).
Now the system load is less than 10% when the timeout occures.
I also found the docu for the graphs
The root cause of the error in your log appears to be a communication hiccup of at least 15ms in the USB system (either USB hardware, Linux kernel handling, or Klipper’s CANBUS bridge). Specifically, the main mcu attempted to send a message to the host and the Linux kernel doesn’t report receiving it until 15ms has elapsed. Specifically, this can be seen in the log (after processing with logextract.py): Receive: 86 850.286296 850.263383 14: seq: 1f(1102), trsync_state oid=13 can_trigger=1 trigger_reason=0 clock=1580327894(850.271477)
The mcu sent the message at 850.271477 but it was not received at the host until 850.286296.
It’s unclear what caused the issue at the (likely) USB level.
However, it’s not ideal that Klipper raised a communication timeout due to 15ms of latency. (The goal is to avoid 25ms of latency.) I think some scheduling improvements within Klipper could, in practice, reduce the chance of a timeout.
SKIPR: USB 3.0 → USB Hub → Webcam | WiFi Stick | Touchscreen
SKIPR: USB 2.0 → MCU USB Host (CAN via USB bridge), I can’t use the internal (PCB) connection for CAN via UART if I wan’t to use the CAN out of the mcu. Thats something klipper doesn’t support currently as far as I know / understand.
Can you summarize your results? Is the code on that development branch an improvement, the same, or worse?
Have you been able to identify a pattern that leads to an increase in timeouts?
Can you pull that development branch again (now at commit a493fe7f) and see if that improves behavior?
FYI, all of the logs I’ve seen so far indicate the same basic behavior - sporadic latency of 15+ms at the (likely) USB interface. This is in contrast to your connection’s normal latency of ~1ms. FWIW, this does seem to be a hardware problem (or system driver problem).
@koconnor I have Octopus V1.1 with BTT U2C and BTT SC2209 can bus. I’m getting the Communication time out during probe error. I can limit the error by increasing my timing from .25 to .50 but i know that isn’t a fix. I did try something else you suggested in some of your documentation by changing two lines but that didn’t work because it didn’t like it and i had to update/ installed klipper again to get things running again. If i reboot my pi I will still get the error but if i power off the whole printer i will get get a few prints fine until the error appears again. I have done a lot of research but everyones setup is different. Where can I go from here. Thank you for any direction. I’m trying to learn more with this also not just make it work, klippy (20).log (4.0 MB)
If you want to try the development branch, you’d need to ssh into the host machine and switch to that branch - typically with something like: sudo service klipper stop ; cd ~/klipper ; git fetch https://github.com/KevinOConnor/klipper-dev.git work-trsync-20240102 ; git checkout work-trsync-20240102 ; sudo service klipper restart
If you are still having problems after changing to that branch, you should remove any local changes (no z_calibrate.py modification), and be sure to run an M112 command immediately after the timeout event occurs. Then attach the resulting log file here.