CANBUS Communication timeout while homing Z

Nothing special, a Pi 3B running Pi OS bullseye.

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
pi@boxy:~/klipper_logs $

Stock current kernel:
Linux boxy 5.15.61-v8+ #1579 SMP PREEMPT Fri Aug 26 11:16:44 BST 2022 aarch64 GNU/Linux

Not that this necessarily has anything to do with your issue, but it appears that you are running the 64-bit version of the bullseye kernel?

I can tell you that even on a Pi Zero 2 W and a Pi 3B (not +) it works fine here. BUT I get the occasional timeout error on homing as well. However, the only times I ever encountered an error during printing were with a USB adapter… I now use an SPI-connected board (CAN hat on the 3, generic MCP2515 board on the Zero 2), and while I very rarely get the timeout on homing with these, I have not once had a problem during a long print.

I did remove the SPI speed entry in the boot.cfg and set the tx buffer to 1024 in the can network entry, though.
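For anyone wanting to confirm that a queue length change like this actually took effect, here is a minimal sketch that reads the value back from sysfs. The interface name `can0` and the helper names are assumptions for illustration; `tx_queue_len` is the standard per-interface attribute under `/sys/class/net`.

```python
from pathlib import Path


def read_tx_queue_len(iface="can0", sysfs_root="/sys/class/net"):
    """Return the transmit queue length the kernel reports for iface."""
    # Every Linux network interface, CAN included, exposes this attribute.
    path = Path(sysfs_root) / iface / "tx_queue_len"
    return int(path.read_text().strip())


def queue_is_at_least(wanted, iface="can0", sysfs_root="/sys/class/net"):
    """True if the interface's queue is at least `wanted` frames long."""
    return read_tx_queue_len(iface, sysfs_root) >= wanted


if __name__ == "__main__":
    try:
        print("can0 txqueuelen:", read_tx_queue_len())
    except OSError:
        # Raised when the interface does not exist on this machine.
        print("can0 not found; is the interface configured and up?")
```

Running it after a reboot shows whether the network entry was picked up or the default is still in use.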

I was told to come here as I am having a timeout issue as well. I recreated the issue and hit e-stop. Hardware: Fly UTOC-1 and Fly SB2040 on a Raspberry Pi 3 using the recommended config. I have tried moving the speed to 1M with no change, and the only way I can get around it is to change TRSYNC_TIMEOUT in mcu.py as well.
klippy.log (141.1 KB)

I started out with the CAN hat, and it just wasn’t stable for me. I fiddled with the settings, but someone else mentioned that the SPI ones were problematic, and I had planned on doing the bridge anyway, so I went in that direction. I haven’t had a single print issue since moving to the Klipper USB bridge, just these silly homing timeouts.

@ReXT3D Yes, it’s 64bit…

I’m going to dig around in the kernel CAN code and see if there are any buffers or queuing that can be disabled on that side of the house. Thanks for all the suggestions!!! Seems like this issue is popping up more as CAN bus picks up steam, so hopefully we can put it to bed.

The candleLight_fw GitHub might be a good place to drop by and perhaps ask a question since it seems that a number of Linux kernel developers are hanging out there supporting CAN development:

Issues · candle-usb/candleLight_fw (github.com)

So I might be able to help here. I have been messing with my HermitCrab CAN, and have had the most success with CAN bus rates that end up with a full bit, e.g. 115200, 230400, 460800, 576000, etc. My HermitCrab is very happy at 460800, with the timeout stuff set to 0.050. Worth a shot.

@tshackelton @koconnor

I am going to run some testing over the weekend (or as time permits ASAP) and post my results. I suspect that EMI noise or something else like that might be to blame. Or maybe it is certain hardware that doesn’t play nice together?

Any tests, logs or variables you want to see that aren’t listed below?

I don’t have any timeout issues running at 250k, either with a Waveshare CAN hat on an RPi 4 4GB or with the Octopus MCU’s CAN bus, both with an EBB36 on the tool head. Not even during input shaper (IS) tuning. I am also pulling fan tachometer data.

I will post for each setup I have implemented:

Tool head Canbus PCB make/model (Have Huvud, EBB42, EBB36).
RPI type/model
CanHat type/model
MCU Octopus model
USB canbus adapter make/model
Canbus wiring and connectors source/type.

Canbus speeds ranging from 250k to 1M.

Canbus H & L line resistances measured with multimeter.

Canbus comm. logs.
Klipper logs.

Let me know if there is anything else that I can do to help hunt this bug(s) down.

I don’t suspect EMI at this point, or I’d have real errors. I never see any printing symptoms or retries in the Klipper logs. I have induced some while playing around, and it definitely caused all kinds of trouble. I’ve also had the same thought and rerouted cabling away from stepper leads and such. No difference at all.

I’m thinking it’s something on the Linux side. All this stuff is piped through the whole networking stack, and I bet it’s getting goofed up in there. This is a really good read: https://www.kernel.org/doc/Documentation/networking/can.txt

So I’ve got the issue running a Pi 3B and 64-bit Pi OS Lite bullseye, and had worse issues with the SPI hat than with the USB bridge. What are you running on your Pi 4? I’ll duplicate that side of it, and see if my issue persists. If it goes away I’ll try that card back in the 3B.

Used the imager for mainsail.

PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian

I forklifted my Klipper from 64-bit bullseye to 32-bit buster, and have been running meshes all day. Homing issues are gone, no invalid_bytes in the logs. So it’s either aarch64 or bullseye (Pi OS 11) that’s the culprit with this packet reordering thing.

Just a quick update… My system hasn’t had a peep of trouble running on 32-bit buster, although I’m still not sure if it’s the revision or the bitness. So I’ll keep trying to narrow it down. I also had to bump my baud rate to 1M in order to get input shaper to run. Otherwise it would overrun the bus at higher frequencies and crash out. It was pretty obvious when checking canbusload while running. The higher baud rate has been solid, which also confirms that it wasn’t a bus issue causing the homing timeouts previously.
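To put rough numbers on why a higher bitrate leaves more headroom, here is a back-of-the-envelope sketch. It assumes classic CAN frames with 11-bit identifiers and worst-case bit stuffing; the 1500 frames per second figure is made up for illustration and is not a measurement from this setup.

```python
def worst_case_frame_bits(data_len=8):
    """Worst-case bits on the wire for a classic CAN frame, 11-bit ID."""
    if not 0 <= data_len <= 8:
        raise ValueError("classic CAN carries 0 to 8 data bytes")
    unstuffed = 44 + 8 * data_len      # fixed fields (SOF..EOF) plus data
    stuffable = 34 + 8 * data_len      # SOF through CRC is subject to stuffing
    stuff_bits = (stuffable - 1) // 4  # worst case: a stuff bit every 4 bits
    interframe = 3                     # mandatory gap before the next frame
    return unstuffed + stuff_bits + interframe


def bus_load(frames_per_second, bitrate, data_len=8):
    """Fraction of the bus consumed; values near or above 1.0 mean saturated."""
    return frames_per_second * worst_case_frame_bits(data_len) / bitrate


if __name__ == "__main__":
    hypothetical_rate = 1500  # frames/s, an assumed figure for illustration
    for bitrate in (250_000, 500_000, 1_000_000):
        load = bus_load(hypothetical_rate, bitrate)
        print(f"{bitrate:>9} bit/s: {load:.0%} of the bus")
```

The same traffic that nearly fills a 250k bus uses a quarter of that share at 1M, which lines up with what canbusload shows during input shaper runs.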

I’m having the exact same symptoms as you and the others in this thread. Is your 32bit system still stable?

I’m not sure if this thread is still searched by people who are struggling… I had a multitude of issues getting CAN bus stable, and I spent hours searching for answers and trying every possible combination.

I started with an SKR 1.3 with an RPi 3B and a Waveshare CAN hat. It worked but was incredibly unreliable. I tried every combination of settings out there: 250k, 500k, long txq length, short… modifying mcu.py (this just masks your problem). All for naught… Some things made it more stable, others not… I learnt a LOT about Linux, CAN, Klipper and Mainsail/Moonraker in a short time. In the end, the logs and error messages are somewhat cryptic, and often a bit ambiguous.

In the end, I replaced my SKR 1.3 with an Octopus 1.1 to use the onboard CAN, but before that, I opted for an Opto-isolated dual CAN hat (made no difference).

Just to throw my experience also into the fold, in the end getting things “stable” was replacing my 5V power supply. It was “noisy”, and as soon as it was under some load, seemed to cause random issues, not just related to CAN. I only saw this after putting it on a scope as a desperate measure.

The long and the short is that time-critical CAN messaging, with a heavy duty cycle, is sensitive. It really has to be fine-tuned.

After doing away with the power supply, an issue only manifested when doing shaper calibration, as this puts the CAN comms under a LOT of pressure. Here’s the kicker: in the end, increasing the CAN speed to 1000 000 from 500 000 made it handle the load, and not run itself into knots. One would think that would decrease reliability.

I think the CAN mechanisms need to mature a bit still, and we are the early adopters. For now I got EVERYTHING to work as I want it to, and it’s a full house of features:

  • CoreXY with 3 point bed
  • Octopus 1.1 (CAN Bridge mode)
  • EBB42 (BLTouch, Revo, BMG)
  • 5-inch TFT (KlipperScreen)
  • 36 LED addressable LED strip
  • Servo-activated brush (to clean the nozzle… no space in my frame for a fixed mount)
  • Relay control to disarm 24V PSU

The point of listing all of this is not anything other than to state that ALL of this can play nicely together with CAN.

Getting everything in Klipper working can make you humble, but so worth it.
CAN modules are the future.

In summary:

  • Excellent quality CAN comms and as little RF interference as possible are critical (twisted pair, resistance measuring 60 ohm, good quality power source)
  • Speed is king. Definitely at least 500 000, I recommend 1000 000
  • The Waveshare CAN hat is flirting with the edge of capable… If you have a choice, get a USB CAN adapter, or better yet use CAN bridge mode on a compatible board
  • Changing the mcu.py code will make it better, but it’s not the answer.
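On the 60 ohm figure: it comes from the two 120 ohm terminators at either end of the bus sitting in parallel across CAN H and CAN L. A small sketch of that arithmetic, with a hypothetical helper to interpret a multimeter reading (the 10% tolerance is an assumption):

```python
def parallel_resistance(*ohms):
    """Combined resistance of resistors wired in parallel."""
    return 1.0 / sum(1.0 / r for r in ohms)


def diagnose_termination(measured_ohms, tolerance=0.1):
    """Interpret a CAN H to CAN L reading taken with the printer powered off."""
    both = parallel_resistance(120, 120)  # two terminators: about 60 ohm
    single = 120.0                        # only one terminator present
    if abs(measured_ohms - both) <= both * tolerance:
        return "ok"
    if abs(measured_ohms - single) <= single * tolerance:
        return "one terminator missing"
    return "check wiring"


if __name__ == "__main__":
    for reading in (60.5, 119.0, 40.0):
        print(reading, "->", diagnose_termination(reading))
```

A reading near 120 ohm usually means one end is unterminated, and one near 40 ohm suggests a third terminator is enabled somewhere.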

And I got tired of editing mcu.py after each update of Klipper, changing the TRSYNC_TIMEOUT value to 0.050, and therefore I added 2 more wires to the 4 CAN wires for the limit switch and BL-Touch signals. Now I don’t have to change mcu.py, the “Communication timeout while homing…” error is gone, and there is no more stepper motor overshoot. And on the freed pin from the BL-Touch signal, I connected the RPM signal from the extruder cooling fan. In general, some pluses. :smiley:

Just to chime back in… my setup hasn’t seen a single “invalid_bytes” count, nor had a single homing error, since I moved back to the 32-bit kernel. It’s been over a month without any issues for me. Rock solid, nothing changed in mcu.py. I dunno what’s up with the 64-bit ARM kernel and CAN, but it “can’t” :slight_smile:

Have there been any updates on this? I also have the same issue when homing z. Setup is rock solid otherwise. Editing TRSYNC_TIMEOUT helped a lot, but that feels more like a hack than fixing a problem. I find it curious that the 32-bit kernel apparently fixes the issue, but I don’t think that’s an option with the SBC I’m using

Edit: The error just happened to me now. My klippy.log file is apparently too large to attach. Here’s a truncated version - hope that’s enough

truncated.log (338.6 KB)

Just for some details on my setup:
Printer is a Voron 2.4 (r1 with most of the r2 updates)
Host = OrangePi 5 running Armbian
CAN-bus is using RK3588’s native can-bus with a TCAN337 as the transceiver
BTT Octopus Pro F429 connected via CAN-bus, this is running the A/B motors and the Z motors
BTT EBB sb2240 at the toolhead

I had similar issues using a mellow utoc with a raspberry pi 4, as well as when I was running a mellow SB2040

Sorry for the multiple posts, but I can semi-reliably trigger an error and have attached a candump and klippy log from the event. I hope that is useful
klippy (6).log (255.2 KB)
candump-2023-02-01_144603.log (1.8 MB)

Edit: I wonder if this is what we’re running into: [BUG] pfifo_fast may cause out-of-order CAN frame transmission — Linux CAN Development

The log seems to indicate that the host code didn’t get any processing time for about 60ms (or possibly a communication outage to both MCUs for that amount of time). The MCUs correctly aborted the homing attempt as a result.

Also, it doesn’t look like you are running code from the main Klipper branch. You’ll need to reproduce using the pristine Klipper code for any further assistance. Also, it’s unclear what type of host computer you are using with Klipper.

-Kevin

Impossible for me to use mainline klipper until tmc2240 support is added. I don’t believe any CAN code was altered by BTT. I noted the host computer above “OrangePi 5 running Armbian”

Raspberry Pi 4 with usb can board gave me flakier performance.

I guess the question is why is there sporadically a 60ms delay only when homing z / probing? Obviously I’m not the only one with a similar issue. I just don’t know if it’s klipper, linux, hardware, or some combo.
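For anyone wanting to look for such a delay in their own capture, here is a minimal sketch that scans a candump log for silent stretches on the bus. It assumes the `candump -l` log format, `(timestamp) interface id#data`; the 0.025 s threshold and the sample frames are made up for illustration.

```python
import re
from decimal import Decimal

# Matches candump -l lines such as: (1675280763.100000) can0 10A#0011
LINE_RE = re.compile(r"^\((\d+\.\d+)\)\s+(\S+)\s+([0-9A-Fa-f]+)#(\S*)")


def find_gaps(lines, threshold=Decimal("0.025")):
    """Return (previous_ts, ts, gap) wherever consecutive frames are far apart."""
    gaps = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        stamp = Decimal(match.group(1))  # Decimal keeps microseconds exact
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


if __name__ == "__main__":
    sample = [  # hypothetical frames, not taken from the attached log
        "(1675280763.100000) can0 10A#00",
        "(1675280763.110000) can0 10B#01",
        "(1675280763.175000) can0 10A#02",
    ]
    for before, after, gap in find_gaps(sample):
        print(f"{gap} s of silence between {before} and {after}")
```

A gap in the capture only shows that nothing crossed the bus in that window. It does not say whether the host, the adapter, or an MCU was the one that went quiet, so it is best read alongside the klippy.log timestamps.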

Edit: It occurs to me I could try running mainline and just commenting out the extruder stuff in my cfg for testing purposes. I can do that over the next couple days