CANBUS Communication timeout while homing Z

Whenever I let my printer heat soak for an hour before printing, the probe always gives a “Communication timeout while homing Z” error. It will fail 5 times in a row. Restarting the machine with a firmware restart immediately fixes the problem. When I try running another print after that one completes, it fails again. Restart and it works again.

I’m running an MKS Canable 1.0 with candlelight firmware at 500000 bps with a txqueuelen of 1000, and a BTT EBB 36 1.0.

I suspect there’s some clock drift that gets worse over time and restarting syncs the two controllers together again. Otherwise how would I begin troubleshooting this?

And also Communication timeout during homing probe

Communication time out during homing fix

Changed the TRSYNC_TIMEOUT value in the “/home/pi/klipper/klippy/mcu.py” file from 0.025 to 0.050 (that is, “TRSYNC_TIMEOUT = 0.025” became “TRSYNC_TIMEOUT = 0.050”), and the error “Communication timeout during homing probe” disappeared.

Is there anywhere I can see the actual “ping” times that I’m getting? While this is nice, it doesn’t solve the problem of clock drift, if that is the problem, or otherwise explain the behavior I’ve now reliably caught.

It’s impossible to say what could cause that without seeing the full Klipper log from the event.

-Kevin

Here’s one log file with many of them in it:

klippy.log.2022-08-09.zip (3.1 MB)

Your log is showing an incrementing “bytes_invalid” counter for the “tool” mcu. This was a symptom of an older version of candlelight firmware that was severely broken (it would reorder packets). You should confirm that the candlelight firmware is the latest and ensure that the “bytes_invalid” counter is no longer incrementing.

If it is that old version of candlelight, then it must be fixed as Klipper’s CANbus implementation is unlikely to be stable with reordered packets (even if you don’t get “communication timeout during homing” errors).

-Kevin
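For anyone who wants to watch the “bytes_invalid” counter (and the round-trip estimate) without reading the log by eye, here is a minimal Python sketch. It assumes the periodic “Stats” lines in klippy.log look like “Stats <time>: ... <mcu name>: key=value ...” with fields such as bytes_invalid and srtt. That layout, the sample lines, and the function names are illustrative assumptions, not something taken from the attached log.

```python
def parse_stats_line(line):
    """Split one assumed 'Stats <time>: ...' line into per-section key/value dicts."""
    if not line.startswith("Stats "):
        return None
    head, _, rest = line.partition(": ")
    try:
        timestamp = float(head.split()[1])
    except (IndexError, ValueError):
        return None
    sections = {}
    current = "global"  # keys seen before the first "name:" header
    for token in rest.split():
        if token.endswith(":") and "=" not in token:
            current = token[:-1]  # a new section, e.g. "mcu:" or "tool:"
        elif "=" in token:
            key, _, value = token.partition("=")
            sections.setdefault(current, {})[key] = value
    return timestamp, sections


def counter_increases(lines, mcu_name, counter="bytes_invalid"):
    """Return (timestamp, delta) for every Stats line where the counter grew."""
    increases = []
    previous = None
    for line in lines:
        parsed = parse_stats_line(line.strip())
        if parsed is None:
            continue
        timestamp, sections = parsed
        raw = sections.get(mcu_name, {}).get(counter)
        if raw is None:
            continue
        value = int(raw)
        if previous is not None and value > previous:
            increases.append((timestamp, value - previous))
        previous = value  # also handles counters resetting after a restart
    return increases


# Made-up sample lines in the assumed format (not real log data).
SAMPLE = [
    "Stats 100.0: gcodein=0  mcu: bytes_invalid=0 srtt=0.001 tool: bytes_invalid=0 srtt=0.002",
    "Stats 101.0: gcodein=0  mcu: bytes_invalid=0 srtt=0.001 tool: bytes_invalid=12 srtt=0.004",
]

if __name__ == "__main__":
    print(counter_increases(SAMPLE, "tool"))
```

To try it on a real log, pass an open file instead of SAMPLE, for example counter_increases(open("klippy.log"), "tool"). The same parser exposes srtt per mcu, which is probably the closest thing to a “ping” time in the log, assuming that field is present.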

I’ll flash the latest version again, but the candlelight firmware isn’t that old; it’s from about a month ago.

I flashed the latest candlelight firmware last night, left the printer overnight, and just tried homing again and got the same issue. I also lowered my transmission rate to 250k so my canbus should be fully compliant with the distance I’m running. Fully twisted pair, termination resistors at each end. Still failing to probe and still incrementing invalid bytes.

Same problem appears to be coming up here

I did some research and I’m also fairly certain that the candlelight firmware Kevin was talking about was from far before I started using canbus toolheads. Unfortunately I don’t think that is where the problem here is.

I analyzed your first log and it shows that the timeout error occurs due to lost messages between toolhead and host. At the same time, the “bytes_invalid” counter increases for the “tool” mcu. So, the homing issue is definitely a direct result of whatever is causing the incrementing “bytes_invalid” counter. Whatever the root issue is, it will need to be fixed to get a stable connection.

It is possible that the “invalid bytes” issue is a result of lost canbus messages between toolhead and host. However, it is odd that there is no indication of lost host messages sent to the toolhead (the retransmit counters are not incrementing).

I do not recommend using a canbus speed below 500000. If anything, you’ll want to go up to 1000000. A speed below 500000 does not provide enough bandwidth to accurately perform adxl345 resonance measurements. Lower speeds also notably increase the message round-trip-time, which tends to exacerbate communication issues.

Finally, if you don’t mind experimenting, you could try flashing Klipper in “usb to canbus bridge” mode to your canbus adapter. If the problem persists with Klipper on the adapter then it would likely rule out any issue with candlelightfw.

Cheers,
-Kevin
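As a rough illustration of the bit-rate advice above, this Python sketch estimates how long a single frame occupies the wire at each speed. It assumes classic CAN data frames with 11-bit identifiers and uses the commonly quoted worst-case bit-stuffing bound; the function names are made up for the example, and it ignores bus load, adapter latency, and host scheduling, so treat the numbers as a lower bound on per-frame delay rather than a measured round-trip time.

```python
def worst_case_frame_bits(data_bytes):
    """Worst-case length in bits of a classic CAN frame with an 11-bit ID."""
    if not 0 <= data_bytes <= 8:
        raise ValueError("classic CAN frames carry 0 to 8 data bytes")
    fixed_bits = 47  # header, CRC, ACK, EOF and the 3-bit interframe space
    stuff_bits = (34 + 8 * data_bytes - 1) // 4  # worst-case stuffing bound
    return 8 * data_bytes + fixed_bits + stuff_bits


def frame_time_us(data_bytes, bitrate):
    """Time on the wire, in microseconds, for one worst-case frame."""
    return worst_case_frame_bits(data_bytes) * 1_000_000 / bitrate


if __name__ == "__main__":
    for bitrate in (250_000, 500_000, 1_000_000):
        print(bitrate, frame_time_us(8, bitrate), "us per 8-byte frame")
```

Under these assumptions a full 8-byte frame takes about twice as long at 250000 as at 500000, and every message in a request/response exchange pays that cost, which is consistent with lower speeds increasing round-trip time.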

Oh, another way for you to debug the issue is to perform a Linux capture of the canbus, and then align the homing failure with the actual messages on the canbus. (If you go this route, you’ll need to research the low-level protocol, research the canbus protocol, and align the timestamps between captures/logs - so expect to invest notable time on it).

The candump utility can be used to take canbus captures - for example: candump -t z -Ddex can0,#FFFFFFFF

Cheers,
-Kevin

I’m curious which part of my log you were looking at. There’s a big portion where the bytes_invalid isn’t incrementing.

I’ve built as CANable_MKS_fw (I have an MKS canable 1) and reflashed and am not seeing bytes_invalid at all. I will leave the printer for a while and see if the problem is reproducible after time.

This theory about bytes_invalid is important, but I don’t see how that relates to the behavior I’m seeing where after a restart and for about 30 minutes it’s perfectly fine and then after waiting, (especially after the printer has been idle for hours) it fails reliably.

This does seem to have fixed the issue. Is it possible the reordering of packets causes clock drift?

Hi guys, I’ve got the same problem with my EBB 1.2 connected to a Raspberry Pi through a USB-C cable. The problem appears randomly while doing a 9x9 bed mesh with my klicky probe.

Do you have any hints? I will provide a log as soon as I can save it to my PC.

Thanks.

candump-2022-09-18_180603.log (115.8 KB)
canbusload_rp2040_ebb36_12_homing.txt (3.6 KB)
klippy-rp2040.log (2.5 MB)
iplinkstats.txt (1.7 KB)

I’m in the same boat with a BTT EBB36 1.2 toolhead board. I’ve tried with a BTT U2C 1.1 adapter, using the firmware on the BTT github first, and then I’ve tried the canable.io web flasher and the candlelight_fw v2.0 release.

The web flasher didn’t result in a usable board for me but the other two flashed from stm32cubeprogrammer worked ok.

My other board is an SKR Mini E3 2.0 and I was using it over uart originally but tried flashing it to USB for troubleshooting.

After reading this thread I was able to flash a Raspberry Pi Pico board with the USB to CAN Bus bridge option and connect a CAN transceiver to it.

All my tries have the same result. I can see in mainsail that the board connects, and I’ve verified that extruder motor, fans, heater, MAX31865 with PT1000 and slideswipe probe all work ok. I get communication timeouts while homing. Sometimes it works for half a second, sometimes for a bit more. A couple of times I could finish homing but then failed to probe a bed mesh.

I’ve tried 250000 and 500000 baud, first without canboot and later with canboot. I’ve used two different DIY cables with different connectors (molex first, jst/ferrules later) to rule out bad crimps. Bus resistance measures 60 ohms between can_h and can_l.

I’m attaching a capture with canbusload and candump while trying to home.

What should I try next?
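On the 60 ohm reading mentioned above: with a 120 ohm terminator at each end of the bus, the two resistors sit in parallel between can_h and can_l, so about 60 ohms is the expected value. The small Python sketch below works through that arithmetic; the 120 ohm terminator value is the usual CAN convention, and the tolerance and function names are assumptions chosen for illustration.

```python
def parallel(*resistors):
    """Equivalent resistance of resistors wired in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors)


def describe_termination(measured_ohms, terminator=120.0, tolerance=0.1):
    """Classify a can_h to can_l reading taken with the bus powered down."""
    both = parallel(terminator, terminator)  # about 60 ohms for 120 ohm parts
    if abs(measured_ohms - both) <= tolerance * both:
        return "both terminators present"
    if abs(measured_ohms - terminator) <= tolerance * terminator:
        return "only one terminator present"
    return "unexpected reading, check wiring"


if __name__ == "__main__":
    print(describe_termination(60.0))
```

A correct reading only confirms the termination and basic continuity; it does not rule out noise or intermittent contact while the toolhead is moving.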

I use 1M baud and have no issues.

Alas, the log did not indicate the cause of the issue. Please retry and issue an “emergency stop” immediately after the “communication timeout” event. This will cause Klipper to write additional information to the log. Please attach that full log here. Please also indicate what usb to canbus adapter you were using during that log (canable, pico, etc.).

-Kevin

Hi Kevin, thanks so much for answering. I’m attaching another capture of a single homing try and e-stop. I’m using the raspberry pico canbus adapter.
ipLinkStatsBeforeAfter.txt (1.8 KB)
canbusload_rp2040_ebb36_12_homing-2.txt (1.4 KB)
candump-2022-09-20_090655.log (33.6 KB)
klippy-can2040-homing-estop.log (152.0 KB)

I’m not sure what else to try. I have a second unopened EBB36 1.2 but do you think it’s a hardware fault?

The log indicates that the canbus was unable to send messages from mcu to host for ~28ms. The micro-controller correctly detected the loss of communication and halted the homing action.

It’s unclear why the bus was unable to communicate for an extended period of time. Eventually all messages were received, so whatever the root cause was, it eventually cleared without any software-based retransmits.

I’d say double check the wiring and terminating resistors, but it sounds like you’ve done that already.

The candump capture you attached correlates with the klippy.log file - both show the communication lapse.

I’m not sure what the underlying issue is. Next step in debugging would be to place a logic analyzer on the can_rx/can_tx lines to see what is happening on the bus at the time of the failure. That is a lot of work though.

-Kevin
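Before reaching for a logic analyzer, a quick way to locate a communication lapse like the ~28ms one described above is to scan a candump capture for unusually long silences. The Python sketch below assumes the compact log-file format that candump writes with its -l or -L options, “(timestamp) interface ID#DATA”, which is not the same as the human-readable output of the “-t z” command quoted earlier. The default threshold of 0.025 seconds mirrors the TRSYNC_TIMEOUT value mentioned in this thread; the sample frames and function name are made up.

```python
import re

# One frame per line in the assumed compact format: "(1663520763.123456) can0 108#0011"
LOG_LINE_RE = re.compile(r"^\((?P<ts>\d+\.\d+)\)\s+(?P<iface>\S+)\s+(?P<frame>\S+)")


def find_gaps(lines, threshold=0.025):
    """Return (start_time, duration) for each silence longer than threshold seconds."""
    gaps = []
    previous = None
    for line in lines:
        match = LOG_LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        timestamp = float(match.group("ts"))
        if previous is not None and timestamp - previous > threshold:
            gaps.append((previous, timestamp - previous))
        previous = timestamp
    return gaps


# Made-up sample frames (not taken from the attached captures).
SAMPLE = [
    "(10.000000) can0 108#00",
    "(10.001000) can0 109#00",
    "(10.029500) can0 109#00",
]

if __name__ == "__main__":
    for start, duration in find_gaps(SAMPLE):
        print(f"silence of {duration * 1000:.1f} ms after t={start:.6f}")
```

A silence is only a hint: an idle bus is quiet by design, so a gap matters mainly when it falls inside a homing move, where the toolhead is expected to report regularly. Filtering the capture to the toolhead’s CAN IDs first would make the result more meaningful.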