CAN-Bus suddenly losing connection ~3 Hours into Prints

  • Voron Trident 300
  • Raspberry Pi 4 1GB RAM
  • Fysetc Spider 2.2
  • CAN Setup with Mellow FLY-SB2040-V2 and Mellow FLY-UTOC-V1
  • Twisted the default Mellow Wires as per their instructions
  • Installation as per their instructions
  • Has a Phaetus Rapido and a Sunon 10k RPM fan attached, as well as Klicky, KNOMI, Stealthburner LEDs and an X endstop
  • UTOC-V1 and Spider board are each connected to the Pi over their own USB connection
  • CAN-Bus resistance measured to be 60.03 Ohms
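
For context on that reading: a CAN bus with one terminator at each end should measure close to the parallel value of the two resistors. A quick sanity check, assuming the usual nominal 120 Ohm terminators (that value is an assumption, not something measured on this printer; parallel_resistance is just an illustrative helper):

```python
def parallel_resistance(*resistors_ohm: float) -> float:
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


# Assumption: one nominal 120 ohm terminator at each end of the bus.
expected = parallel_resistance(120.0, 120.0)
measured = 60.03  # value reported in the list above

deviation_pct = abs(measured - expected) / expected * 100
print(f"expected {expected:.2f} ohm, measured {measured:.2f} ohm, "
      f"deviation {deviation_pct:.2f} %")
```

A deviation well under one percent suggests the termination itself is plausible, although a static resistance check says nothing about intermittent faults.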

candump-2024-04-02_134904.log (62.1 KB)
klippy (5).log (3.5 MB)

The Issue:
Ever since I installed this CAN combo, Klipper shuts down about 3 hours into a print with the error "Lost Connection with MCU ‘can’".
The Imgur link shows my recent prints and when the error occurred.

Since then, I have tried to rule out everything I could without buying different hardware. I have attached a candump log of the error as well, but candump simply stops logging when the error happens. It seems to stack error frames up to 128 and then disconnect from the bus for whatever reason, but I don't really know how to read the candump log, or how to set it up so that I get more information on what actually happens.
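
One possible reading of the 128, offered as an assumption rather than something the log confirms: CAN nodes keep transmit and receive error counters, and a node becomes "error-passive" once either counter exceeds 127, then goes "bus-off" once the transmit counter exceeds 255. A minimal sketch of those thresholds (can_error_state is a hypothetical helper name; the "error-warning" level at 96 is the extra state Linux SocketCAN reports):

```python
def can_error_state(tec: int, rec: int) -> str:
    """Classify a CAN node's fault-confinement state from its error counters.

    tec: transmit error counter, rec: receive error counter.
    """
    if tec > 255:
        return "bus-off"        # node stops taking part in bus traffic
    if tec > 127 or rec > 127:
        return "error-passive"  # node may only signal errors passively
    if tec >= 96 or rec >= 96:
        return "error-warning"  # warning level, still fully active
    return "error-active"       # normal operation


for tec in (0, 96, 128, 256):
    print(tec, can_error_state(tec, rec=0))
```

If the counter in the log really is one of these, reaching 128 would mark the transition to error-passive rather than the disconnect itself.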

At first I thought this might be a connection issue, so I had candump running for the entire print while I was doing other things on the computer. Not once did I see an error flag being set in candump. Likewise, in klippy.log, bytes_retransmit stays at 0 for the can MCU throughout the entire print until the connection suddenly breaks down. In that window bytes_retransmit rises rapidly, because everything gets retransmitted until klippy aborts the print. The weird thing is that if I press “firmware restart” immediately afterwards, communication with the CAN board is back to normal as if it had never been disconnected, although the log says that sending the reset to the CAN board failed.
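
The bytes_retransmit check described above can be automated instead of scrolling through the log. A minimal sketch, assuming klippy.log "Stats" lines contain one "name: key=value ..." section per MCU; the sample lines are made up for illustration, and retransmit_series and first_increase are hypothetical helper names:

```python
import re

STATS_RE = re.compile(r"^Stats (\d+(?:\.\d+)?):")


def retransmit_series(lines, mcu_name):
    """Yield (timestamp, bytes_retransmit) for one MCU section per Stats line."""
    for line in lines:
        m = STATS_RE.match(line)
        if not m:
            continue
        section = None
        for token in line[m.end():].split():
            if token.endswith(":") and "=" not in token:
                section = token[:-1]  # section header such as "mcu:" or "can:"
            elif section == mcu_name and token.startswith("bytes_retransmit="):
                yield float(m.group(1)), int(token.split("=", 1)[1])
                break


def first_increase(series):
    """Return the first (timestamp, value) where the counter grew, else None."""
    prev = None
    for ts, value in series:
        if prev is not None and value > prev:
            return ts, value
        prev = value
    return None


# Made-up sample lines in the assumed format, not taken from the attached log.
sample = [
    "Stats 100.0: gcodein=0  mcu: mcu_awake=0.002 bytes_retransmit=9 bytes_invalid=0  can: mcu_awake=0.001 bytes_retransmit=0 bytes_invalid=0",
    "Stats 101.0: gcodein=0  mcu: mcu_awake=0.002 bytes_retransmit=9 bytes_invalid=0  can: mcu_awake=0.001 bytes_retransmit=0 bytes_invalid=0",
    "Stats 102.0: gcodein=0  mcu: mcu_awake=0.002 bytes_retransmit=9 bytes_invalid=0  can: mcu_awake=0.001 bytes_retransmit=412 bytes_invalid=0",
]
print(first_increase(retransmit_series(sample, "can")))
```

For a real log, pass an open file object instead of the sample list. The timestamp of the first increase can then be lined up against the candump timestamps.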

I also tried lowering the baud rate to 500k, with little success, and removing all 5V loads from the toolhead board, since someone in another thread said that the board's 5V regulator is a bit underpowered for running the MCU plus additional 5V peripherals. Neither changed anything.

“canbusload” shows the CAN bus chugging along at 5% load at most, so this shouldn't be the issue either. I have tried replacing the USB cable of the UTOC bridge, with little success, as well as redoing all the ferrules in the bridge's terminals. It doesn't seem like a wiring issue to me.

So this is where I am stuck. My closest guess is that some component suddenly reboots hours into a print, but I don't see how I can check that. If someone knows anything that would help with the issue, or just how I can diagnose this further, I would be incredibly grateful!

If there’s anything else of importance I could provide or log regarding this, please let me know.

Imgur folder, since new users are restricted from posting inline images and links.

Mellow Installation Manual

Did you produce your candump with this command?

The candump log must be produced using the -tz -Ddex command-line arguments (for example: candump -tz -Ddex can0,#FFFFFFFF ) in order to use the tool.

If not, then capture that log again. For the analysis I need the last 60 seconds before the issue occurred and your “firmware restart” call in the same session. I also need your dict file (usually at ./out/klipper.dict) and the klippy.log that contains that session. It would also be useful to include the output of the dmesg command after the issue has happened.
Re-read Obtaining candump logs,
generate the logs and send the files. I will look at the data and maybe find something in it.

From your description and log, it seems the CAN bus data became corrupted, or the host/MCU was sending nonsense data.

I did run candump for the entirety of the last 4-hour print, which is what is attached, but to keep the file size down it was only logging error frames, or more precisely bit stuffing violations in frames.

While I still don't know what caused it, I swapped the CAN bridge board for a CANable board and switched the wiring to IGUS CAN cable, and the problem is gone. I did one after the other: with only the bridge board swapped the problem was pretty much gone, and although I still received a couple of error frames, there was no complete timeout. I had already ordered the wiring by then, so I swapped it as well. Either way, the issue is no longer present. If someone finds this post in the future and has the same setup, it's probably the Mellow FLY-UTOC-V1.

I also dumped the entire flash content of the UTOC, then “installed” their firmware again. I say “again” because the official firmware seems to be the only one that works on the board, but the firmware on that page was NOT the same as what was on my UTOC from the factory. I dumped the contents, flashed, then dumped again, and the two dumps gave different hashes. So a firmware update might also have done the trick, but once again, I already had parts to swap, so I swapped them.
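
For anyone who wants to repeat the before/after comparison, here is a minimal sketch of hashing two flash dumps. The file names are placeholders written to a temporary directory purely for the demo, and sha256_of and same_image are hypothetical helper names:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_image(path_a, path_b):
    """True if both dump files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with stand-in files; replace these with the real dump paths.
with tempfile.TemporaryDirectory() as tmp:
    factory = Path(tmp) / "factory_dump.bin"
    reflashed = Path(tmp) / "reflashed_dump.bin"
    factory.write_bytes(b"\x00\x01\x02\x03")
    reflashed.write_bytes(b"\x00\x01\x02\xff")
    different = not same_image(factory, reflashed)
    identical = same_image(factory, factory)
    print("dumps differ:", different)
```

A differing hash only shows that the bytes differ somewhere in the dump, not where or why, so it does not by itself prove the firmware code changed.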

If it does happen again, I will definitely go down that path and mention you again. Thanks for the additional debugging steps.