Wow, tons of good info! Much of it helped me hammer out the debugging process below.
After 20 years as a full-time Linux systems administrator I have gotten good at bug hunting, so I started running this one to ground. My test was simple: load CanBoot, flash Klipper 3 times over the CAN bus, then check for bus errors (ifconfig can0). I realize this is a VERY simple test; I will use test prints for more in-depth testing. I made a simple 120 ohm jumper for the Pi CAN Hat as a terminator and directly connected the Octopus 1.1 or EBB board to the Pi CAN Hat.
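For anyone who wants to repeat this check without eyeballing the output each time, here is a minimal sketch that pulls the error counters out of `ifconfig can0` text. The function name `parse_can_errors` and the `SAMPLE` constant are made up for illustration, and it assumes the counter lines look exactly like the ones pasted later in this post.

```python
# Hypothetical helper: extract the RX/TX counters from `ifconfig can0` output.
SAMPLE = """\
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 128 (UNSPEC)
RX packets 472032 bytes 2948451 (2.8 MiB)
RX errors 12 dropped 0 overruns 0 frame 12
TX packets 1047331 bytes 7780160 (7.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
"""

def parse_can_errors(ifconfig_text):
    """Return {"RX": {...}, "TX": {...}} built from the 'errors' lines."""
    counters = {}
    for line in ifconfig_text.splitlines():
        words = line.split()
        # Error lines look like: "RX errors 12 dropped 0 overruns 0 frame 12"
        if len(words) >= 3 and words[0] in ("RX", "TX") and words[1] == "errors":
            names = words[1::2]    # errors, dropped, overruns, ...
            values = words[2::2]   # the number following each name
            counters[words[0]] = {n: int(v) for n, v in zip(names, values)}
    return counters

result = parse_can_errors(SAMPLE)
print(result["RX"])
print(result["TX"])
```

Running it before and after each flash and comparing the two dictionaries shows whether any counter moved during that flash.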
Baseline testing
Testing at 250K - Oct
Flashed clean, TX errors 1 dropped 1 overruns 0 carrier 1 collisions 0
Flashed clean
Flashed clean
Testing at 500K - Oct
Flashing ERROR:root:Can, Read Error, RX errors 1 dropped 0 overruns 0 frame 1
Flashed clean
Flashed clean, RX errors 1 dropped 0 overruns 0 frame 1
Testing at 750K - Oct
Flashing ERROR:root:Can, RX errors 5 dropped 0 overruns 0 frame 5
Flashing ERROR:root:Can, then flashed but failed to re-connect for verify, RX errors 13 dropped 0 overruns 0 frame 13
Flashing ERROR:root:Can, RX errors 5 dropped 0 overruns 0 frame 5
I got errors at ALL speeds tested. At this point I started debugging the data path, starting with the Raspberry Pi’s SPI interface. The first chip in the path is the MCP2515, connected via SPI. Its data sheet calls out “High-Speed SPI Interface (10 MHz)”, but the CAN Hat instructions call for a 2MHz SPI frequency. That’s WRONG! Running the bus at 500K with a 2MHz SPI link won’t be stable, as the SPI clock is just too slow for reliable reads.
Let’s crank up the SPI interface speed!
#CAN Hat enable (/boot/config.txt)
#dtoverlay=mcp2515-can0,oscillator=12000000,interrupt=25,spimaxfrequency=2000000
dtoverlay=mcp2515-can0,oscillator=12000000,interrupt=25,spimaxfrequency=5000000
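To get a feel for why the SPI clock matters, here is a rough back-of-the-envelope sketch, not a measurement. It assumes a classic CAN frame with an 11-bit ID and 8 data bytes is about 111 bits on the wire (ignoring stuff bits), and that reading one frame out of the MCP2515 takes about 14 bytes over SPI (1 instruction byte plus a 13-byte receive buffer). Both constants are assumptions chosen for illustration.

```python
# Rough timing estimate; the frame sizes below are assumptions, not measurements.
CAN_FRAME_BITS = 111   # classic CAN, 11-bit ID, 8 data bytes, no stuff bits
SPI_READ_BYTES = 14    # assumed: 1 instruction byte + 13-byte MCP2515 RX buffer

def frame_time_us(can_bitrate):
    """Time one full CAN frame occupies the bus, in microseconds."""
    return CAN_FRAME_BITS / can_bitrate * 1e6

def spi_read_time_us(spi_hz):
    """Raw clock time to shift one RX buffer out over SPI, in microseconds."""
    return SPI_READ_BYTES * 8 / spi_hz * 1e6

for spi_hz in (2_000_000, 5_000_000, 10_000_000):
    for bitrate in (250_000, 500_000, 750_000, 1_000_000):
        share = spi_read_time_us(spi_hz) / frame_time_us(bitrate)
        print(f"SPI {spi_hz / 1e6:>4.0f} MHz, CAN {bitrate / 1000:>5.0f}K: "
              f"one read uses {share:.0%} of a frame time")
```

This counts only the raw clocking for a single read. Chip-select handling, interrupt latency on the Pi, and any extra register reads or writes the driver does per frame are not included, so the real margin is smaller than these percentages suggest.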
Testing at 750K - Oct + spimaxfrequency=5000000
Flashed clean
Flashed clean
Flashed clean
Testing at 750K - Oct + spimaxfrequency=10000000 #10MHz
Flashed clean
Flashed clean
Flashed clean
Testing at 1M - Oct + spimaxfrequency=10000000 #10MHz
Flashing ERROR:root:Can, but still finished successfully, no interface errors
ERROR:root:Can Flash Error
Flashing ERROR:root:Can, RX errors 3 dropped 0 overruns 0 frame 3
Well, well, well! The data shows that the CAN Hat plus Octopus combination is stable up to 750k (with this extremely limited, short test). Does the same apply to the EBB?
Testing with a 10MHz SPI interface speed
Testing at 250K - EBB at spimaxfrequency=10000000
Flashed clean
Flashed clean
Flashed clean
Testing at 500K - EBB at spimaxfrequency=10000000
Flashed clean
Flashed clean
Flashed clean
Testing at 750K - EBB at spimaxfrequency=10000000
Flashed clean
Flashed clean
Flashed clean
Testing at 1M - EBB at spimaxfrequency=10000000
ERROR:root:Can Read Error, but still finished successfully, no interface errors
Flashed clean
Flashed clean
Based on this new information I am going to make an assumption that my configuration is “stable” at 750k. Now to put everything back on my CAN bus and see what happens at 750k.
750K at 10MHz SPI testing
#Here goes nothing!
Testing at 750K - EBB + Octopus @ spimaxfrequency=10000000
EBB, Flashed clean
EBB, Flashed clean
EBB, Flashed clean
OCT, Flashed clean
OCT, Flashed clean
OCT, Flashed clean
ZERO bus errors!!!
IT LIVES!!! Klipper is up, both MCUs are running, and homing plus a full 7x7 bed leveling passed with flying colors. Then it happened:
CRAP!
12:12:43 Communication timeout during homing probe
12:12:43 Communication timeout during homing probe
12:12:43 Communication timeout during homing probe
12:09:30 Communication timeout during homing z
12:09:30 Communication timeout during homing z
12:09:30 Communication timeout during homing z
And bus errors… ARG!
ratos:~ $ ifconfig can0
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 128 (UNSPEC)
RX packets 472032 bytes 2948451 (2.8 MiB)
RX errors 12 dropped 0 overruns 0 frame 12
TX packets 1047331 bytes 7780160 (7.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Flashed everything back to 500k. Ran a print that took 46 minutes, but I am still getting some RX errors.
Errors @ 500K with speaker wire connecting to the EBB
ratos:~ $ ifconfig can0
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 128 (UNSPEC)
RX packets 586568 bytes 3417757 (3.2 MiB)
RX errors 13 dropped 0 overruns 0 frame 13
TX packets 2452833 bytes 18673304 (17.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Errors are still happening, even at 500k. Without the interface status, specifically the error counters, on the Octopus and EBB I am not able to localize where the issue(s) are. I was using leftover red/black 24 AWG wire (AKA speaker wire) to make the long run to the EBB. It was about 100cm. This should REALLY be twisted pair, so I replaced the EBB connection wire with a single multi-strand twisted pair that I removed from a CAT5 network cable. Started the print again. It took 49m and still got some RX errors, but nothing that affected the print. Did the new cable help? 13 vs 10 errors? That’s way too close to call:
Errors @ 500K with twisted pair wires connecting to the EBB
ratos:~ $ ifconfig can0
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 128 (UNSPEC)
RX packets 592436 bytes 3449166 (3.2 MiB)
RX errors 10 dropped 0 overruns 0 frame 10
TX packets 2429379 bytes 18500590 (17.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
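One way to put the two runs on a like-for-like footing is to normalize the RX error count by the number of RX packets in each run. This is only a sketch using the counters pasted above; `errors_per_million` is a made-up helper name, and dividing by RX packets is a rough normalization of my own choosing.

```python
def errors_per_million(rx_errors, rx_packets):
    """RX errors per one million received frames."""
    return rx_errors / rx_packets * 1_000_000

# Counters copied from the two ifconfig outputs at 500K.
speaker_wire = errors_per_million(13, 586568)
twisted_pair = errors_per_million(10, 592436)

print(f"speaker wire: {speaker_wire:.1f} errors per million frames")
print(f"twisted pair: {twisted_pair:.1f} errors per million frames")
```

Both come out at roughly 20 errors per million frames. With counts this small, one print per cable is not enough to tell the two apart.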
At this point my printer is running, stable~ish. Both MCUs are on the CAN bus @ 500K. I think finding the SPI speed problem was a BIG win; before that nothing was working. I also know that the CAN protocol has tons of error checking built in, as it was intended for less-than-nice environments. Without more information I cannot say that those 10 RX errors are bad. They could be any number of errors, many of which will not affect the print.
Klipper data rate? In 2940 seconds Klipper transmitted around 17.6MiB of data. That’s about 6,290 bytes per second, or roughly 50 kbit/s. My current CAN bus is set to 500,000 bits per second… We’re good! Unfortunately I think I still need 500K to keep the response time up, otherwise Klipper will time out.
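The arithmetic can be sketched as below, using the counters from the twisted-pair run. Two things here are assumptions: that the ifconfig byte counters cover frame data only, and that each frame carries about 47 bits of overhead (11-bit ID, no stuff bits). Treat the utilization figure as a ballpark, not a measurement.

```python
# Counters from the twisted-pair run at 500K (49 minutes = 2940 s).
DURATION_S = 2940
TX_PACKETS, TX_BYTES = 2429379, 18500590
RX_PACKETS, RX_BYTES = 592436, 3449166
BITRATE = 500_000      # CAN bus speed in bits per second
OVERHEAD_BITS = 47     # assumed per-frame overhead: 11-bit ID, no stuff bits

# Data Klipper pushed out per second, as seen by the Pi.
tx_bytes_per_s = TX_BYTES / DURATION_S

# Estimated bits on the wire in both directions, including frame overhead.
frames = TX_PACKETS + RX_PACKETS
wire_bits = frames * OVERHEAD_BITS + 8 * (TX_BYTES + RX_BYTES)
utilization = wire_bits / DURATION_S / BITRATE

print(f"TX data rate: {tx_bytes_per_s:.0f} bytes/s")
print(f"estimated bus utilization: {utilization:.0%}")
```

Under these assumptions the bus sits at somewhere around a fifth of its capacity at 500K. Bit stuffing would push the real figure a little higher.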
I am not sure where to go from here. For my part I am VERY happy with the outcome. I will try some larger 10+ hour prints in the coming days and see if anything nasty shows up. I hope this information will be useful to others on their CAN-Klipper journeys.