CANBUS Communication timeout while homing Z

I had tons of issues with that canbus adapter and just gave up on it. It has a weak processor and no cache or something. TBH I never really cared enough to figure it out. Switched to Canable and used the actual candlelight github to flash it (not the canable.io thing) and have had no issues ever since.

Thanks so much for taking a look, Kevin. Just to be 100% sure…

the canbus was unable to send messages from mcu to host for ~28ms.

When you say “mcu” you mean the EBB36 right? because the skr mini e3 is connected via USB and that’s the one called “mcu” in my config.

If this was a cable issue, the messages would just not arrive, or arrive corrupted, right? This sounds to me more like a software issue, like my Pi 3B has those messages stuck in some OS network queue and is just late to interpret them, or something like that. Is there a way I can analyze the latency on the Linux side? I might try to reinstall MainsailOS, maybe use a different Pi. I don’t think I’ve done anything out of the ordinary, other than using USB boot instead of SD cards in an effort to improve reliability.
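One rough way to gauge host-side scheduling latency (a minimal sketch of mine, assuming only Python 3 on the Pi; this is not a Klipper tool) is to measure how late short sleeps actually return, which gives a crude picture of OS scheduling jitter:

```python
# Rough probe of OS scheduling jitter: request short sleeps and record
# how much later than requested the process actually wakes up. Frequent
# overruns in the tens of milliseconds would point at host-side
# scheduling delays rather than the CAN bus itself.
import time

def sleep_overruns(iterations=1000, sleep_s=0.001):
    """Return a list of overruns (seconds late) for each sleep."""
    overruns = []
    for _ in range(iterations):
        start = time.monotonic()
        time.sleep(sleep_s)
        elapsed = time.monotonic() - start
        overruns.append(max(0.0, elapsed - sleep_s))
    return overruns

if __name__ == "__main__":
    data = sorted(sleep_overruns())
    worst = data[-1]
    p99 = data[int(len(data) * 0.99)]
    print(f"worst overrun: {worst * 1000:.2f} ms, p99: {p99 * 1000:.2f} ms")
```

If overruns approached the ~28 ms gap from the error, that would support the "stuck in a queue" theory; consistently small overruns would point away from it.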

I bought the cheapest AliExpress logic analyzer some months ago and never even got to try it. It must be somewhere around here. TBH to me it sounds like the hardest part would be to learn to decode and understand the canbus protocol, the klipper protocol on top of that… I’d have to invest some time that I don’t have now.

Hi Charles, I started with a BTT U2C 1.1, which has an STM32F072C8T6 MCU, and I’ve since switched to a Raspberry Pi Pico with Klipper’s rp2040 USB-to-CAN bridge firmware and an SN65HVD230 transceiver module. Those are significantly different architectures and the issue remains the same. Which MCU is on the canable board?

The MKS canable I use has an STM32F072C8Tx. It’s possible that my success comes from using the candlelight firmware.

Yes. The EBB36 was unable to transmit to the host for an extended period.

No, because canbus messages that are corrupted are typically automatically retransmitted by the hardware. The most likely explanation is signalling problems on the can bus.

It is possible there is some software or scheduling problems, but it seems less likely to me. Issues like that would have other symptoms in the log, which I don’t see. (Jitter in other transmissions, jitter in log message reporting, high system load, etc.)

Yes - it is a time intensive activity. Alas, I don’t have any further advice. You could certainly try swapping the ebb board and/or reinstall the rpi - I’m not sure that will help though.

If you can hook up the logic analyzer to the ebb canrx/cantx lines and provide a “pulseview capture” of an event (along with the klipper log), I can try to take a look. No promises though, as it is a time intensive activity.

Cheers,
-Kevin

This is bonkers. I’m probably one of the few to use canboot to flash a regular “usb” connected firmware, but I did it and I get the same timeouts during homing. For the test I’m only using the canbus connector for 24V and USB for signal, and it’s the same behaviour. So it’s not a canbus problem. It’s a multi-mcu problem.

I’ll be replacing the EBB36 board next, but the other one I have is probably identical. I’ll let you know.

Well, after trying another EBB36 board and getting nowhere, I flashed a raspberry pi zero 2w with a fresh mainsailos and copied the same klipper config and can interface settings over.

No timeouts this time. I’m homing and probing meshes and… It was the bloody Pi 3b. So basically the cure was to change everything until something works. Arg I’m tired but happy it works at last.

Bufff… I think I’ll reinstall everything and try again.

Sorry if I sounded frustrated before. I still want to thank Kevin for developing such awesome software, enabling this great community, and even being so generous as to offer to debug my problem! That was amazing.
I’m happy we didn’t need to get into the weeds of logic analyzers :smiley:

So in your opinion it’s the Pi board’s CPU (USB bus?) that can’t handle signals from multiple MCUs?

I have been fighting this issue for a bit too. I have yet to get an error during actual printing, nor have I seen a single interface error since moving from the SPI CAN hat to the Klipper CAN bridge. The silly homing error still pops up, though. A couple of things I’ve noticed (these are anecdotal, but the logs don’t seem to have a smoking gun): if my bed dropped or I need to run Z-tilt, it will fail consistently. Once it’s pretty level, it seems to work well. It’s almost like it’s expecting a trigger before it should, and just deciding it must be a comm error rather than the bed actually being another .005mm away. Like timing out before it actually triggers. I really don’t understand how, if this problem was “real”, it wouldn’t impact printing whatsoever.

Bed mesh is the “acid test”; it loves to fail that. I run a per-print mesh, and it’s about 50/50 on whether it will print. When it doesn’t, I just hit reprint and then most of the time it works… I’d much rather figure out what’s really happening here than messing with the timeout in mcu.py, but I dunno where to take it from here. I do have a scope and logic analyzer, a CAN hat, an Octopus in bridge mode, a Pico in bridge mode with a transceiver, EBB36 and SHT36 boards, and even some time. Happy to help here, but I haven’t seen anything that really points at legit comms issues, so that didn’t seem like a useful exercise in isolation.

The log is critical to diagnose issues like this. Make sure you are running unmodified Klipper code, reproduce the error, issue an “emergency stop” immediately after the event, then attach the full unmodified log here.

-Kevin

klippy (18).log (617.6 KB)
Here is a log with the issue. It timed out on the last probe of the bed, right before starting the print (adaptive mesh). The only non-stock thing I’m running is the dockable probe module. I was able to reset and reprint the same gcode right after this error. Let me know if there is anything else you’d like me to check/capture.

Looks like I’m getting the same “incrementing invalid bytes, no retries” thing. I am running the Klipper USB CAN bridge, so it shouldn’t be a candlelight issue as suggested above. I also do not see any error frames on the interface after hundreds of MB transferred. I suspect my next step would be trying candump. Does Wireshark support CAN? (Dumping a network interface makes me think of Ethereal/WS, and it’s familiar to me.) I also have CAN decode on my scope, but I doubt it will give us a “dump” in a useful format.

PS - I’m not doing any canboot right now, have no bootloader at all on the toolboard, and stock SD flashing bootloader on the octopus. I don’t know if this is relevant.

That’s odd. Yes, it would help if you could reproduce the issue while running candump (candump -t z -Ddex can0,#FFFFFFFF > can0.log) and then attach both the klipper log and the can log here. Also, please be sure to issue the “emergency stop” immediately after the event (within 2 to 3 seconds).

FYI, the Klipper “usb to can bridge” code does not report interface errors to Linux, so you’ll never see that increment.
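For completeness, the counters that Linux would normally increment for a CAN adapter live under sysfs. A minimal sketch (assuming a standard Linux sysfs layout; the `base` parameter and `read_if_stats` name are mine, added for illustration) to read them:

```python
# Read Linux network-interface counters from sysfs
# (/sys/class/net/<iface>/statistics/*). Note: per Kevin's remark,
# Klipper's "usb to can bridge" never reports errors to Linux, so
# these counters stay at zero on such a setup regardless of bus health.
from pathlib import Path

def read_if_stats(iface="can0", base="/sys/class/net"):
    """Return {counter_name: int} for the given interface."""
    stats_dir = Path(base) / iface / "statistics"
    counters = {}
    for entry in stats_dir.iterdir():
        counters[entry.name] = int(entry.read_text().strip())
    return counters

# Example (on the printer host):
#   counters = read_if_stats("can0")
#   print(counters.get("rx_errors", 0), counters.get("tx_errors", 0))
```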

-Kevin

I don’t know if candump monitoring had an effect or I just got real (un)lucky this time. I got the error on the first homing run.
can0.log (782.2 KB)
klippy.canerr.log (125.4 KB)
I was planning on running a mesh since that likes to trigger it, but this was… prompt. I’m gonna upload the files and test a lil more with candump running. Thanks for looking into this…

@tshackelton

Shot in the dark here, but did you use the jumper for the termination resistor on your tool head side of the canbus?

Yes, it’s terminated and verified with a meter. In testing with candump I’ve also found that although monitoring doesn’t impact anything, the load of writing all the CAN messages to SD does cause the error to pop up more often. I don’t know if that’s related to the real root cause or not (it’s a good card). I ran a few longer prints, a few hours each, with candump running and set to dump any errors to console, and got nothing. I’m running a Pi 3B currently, but I may try a fresh build on a Pi 4 for kicks and a sanity check. It’s not a spare, though, so I kinda hope it doesn’t work.

Can you sanity check the RPI with a PC running klipper? Then you won’t have to buy a $100 RPI 4 just to test.

Unfortunately, the error here seems to be related to something strange occurring in the Linux Kernel. During the failure, two consecutive status messages from the main mcu got reordered and thus could not be parsed by the host. The host correctly aborted the homing operation as a result.

The first occurrence is:

 (030.470769)  can0  RX - -  109   [6]  A2 F3 10 0D 48 7E
 (030.470643)  can0  RX - -  109   [8]  0E 14 58 0B 01 00 8E 8F

And the second:

 (030.480394)  can0  RX - -  109   [6]  90 E1 46 F2 B3 7E
 (030.480264)  can0  RX - -  109   [8]  0E 15 58 0B 01 00 8E 90

Oddly, it seems Linux knows that it has reordered the packets as the timestamps are not in incrementing order.
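The reordering visible in the excerpts above can be found mechanically. A minimal sketch (assuming the candump `-t z` line format shown above; `find_reordered` and `parse_ts` are hypothetical helpers of mine, not part of can-utils) that flags frames whose timestamps go backwards:

```python
# Scan a candump "-t z" log for frames delivered out of timestamp
# order, i.e. the reordering seen in the excerpts above.
# Assumed line format: "(ssss.uuuuuu)  can0  RX - -  ID  [len]  bytes..."

def parse_ts(line):
    """Extract the seconds timestamp from a candump '-t z' line, or None."""
    line = line.strip()
    if not line.startswith("("):
        return None
    try:
        return float(line.split()[0].strip("()"))
    except (IndexError, ValueError):
        return None

def find_reordered(lines):
    """Return (earlier_line, later_line) pairs where the timestamp decreased."""
    reordered = []
    prev_ts = prev_line = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev_ts is not None and ts < prev_ts:
            reordered.append((prev_line, line))
        prev_ts, prev_line = ts, line
    return reordered

# Example (against the attached capture):
#   with open("can0.log") as f:
#       for before, after in find_reordered(f):
#           print("out of order:", before.strip(), "->", after.strip())
```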

What is the host hardware you are using? What operating system are you using (uname -a ; cat /etc/os-release)?

-Kevin