CAN-Bus suddenly losing connection ~3 Hours into Prints

  • Voron Trident 300
  • Raspberry Pi 4 1GB RAM
  • Fysetc Spider 2.2
  • CAN Setup with Mellow FLY-SB2040-V2 and Mellow FLY-UTOC-V1
  • Twisted the default Mellow Wires as per their instructions
  • Installation as per their instructions
  • Has a Phaetus Rapido and a Sunon 10k RPM fan attached, as well as Klicky, KNOMI, Stealthburner LEDs, and an X endstop
  • UTOC-V1 and Spider board each connected to the Pi over their own USB cable
  • CAN-Bus resistance measured to be 60.03 Ohms

candump-2024-04-02_134904.log (62.1 KB)
klippy (5).log (3.5 MB)

The Issue:
Since installing this CAN combo, Klipper shuts down about 3 hours into a print and I get the error “Lost Connection with MCU ‘can’”.
The Imgur link shows my recent prints and when the error occurred.

Since then, I have tried to rule out everything I could without buying different hardware. I have attached a candump log of the error as well, but candump simply stops logging when the error happens. It seems to stack up error frames to 128 and then disconnect from the bus for whatever reason, but I don't really know how to read the candump log, or how to set it up so that I get a bit more information on what actually happens.

At first I thought this might be a connection issue, so I left candump running for an entire print while I was doing other things on the computer. Not once did I see an error flag set in candump. And if you check the klippy.log, bytes_retransmit stays at 0 for the can MCU throughout the entire print; only when the connection suddenly breaks down does bytes_retransmit rise rapidly, because Klipper keeps retrying everything until it aborts the print. The weird thing is that if I press “firmware restart” immediately afterwards, communication with the CAN board is back to normal as if it had never been “disconnected”, although the log says that sending a reset to the CAN board failed.
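(If it helps anyone trying to reproduce this: I just watched the periodic stats lines in klippy.log, where each MCU reports its own bytes_retransmit counter. Something like

grep "Stats " klippy.log | tail -n 5

shows the last few samples before the shutdown; the log file name is whatever yours happens to be.)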

I also tried turning the baud rate down to 500k with little success, as well as removing all 5V loads from the toolhead board, since someone in another thread said that the 5V regulator on the board is a bit underpowered for the MCU plus additional 5V peripherals. Neither change made any difference.
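(For reference, lowering the bit rate meant rebuilding the firmware on both boards with the matching CAN bus speed and bringing the interface back up with the new rate - roughly, assuming the interface is can0:

sudo ip link set can0 down
sudo ip link set can0 up type can bitrate 500000

plus the equivalent change in whatever brings can0 up at boot.)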

“canbusload” shows that the CAN bus is chugging along at a load of at most 5%, so this shouldn't be an issue either. I have tried replacing the USB cable from the UTOC bridge with little success, as well as redoing all the ferrules in the bridge's terminals. It doesn't seem like a wiring issue to me.
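(canbusload is part of can-utils; I ran it roughly like this, with the bitrate matching the interface:

canbusload can0@1000000

and it periodically prints the measured load per interface.)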

So now this is where I am stuck. I believe that some component is suddenly rebooting hours into a print, but I don't see a way to check that, and it's just the closest guess I have. If someone knows anything that would help with this issue, or even just how I can diagnose it further, I would be incredibly grateful!

If there's anything else of importance I could provide or log regarding this, please let me know.

Imgur folder, since new users are restricted from posting inline images and links.

Mellow Installation Manual

Did you produce your candump with this command?

The candump log must be produced using the -tz -Ddex command-line arguments (for example: candump -tz -Ddex can0,#FFFFFFFF) in order to use the parsecandump.py tool.

If not, then capture that log again. For analysis I need the last 60 seconds before the issue occurred and your call to “firmware restart” in the same session. I also need your dict file, usually found at ./out/klipper.dict, and don't forget to include the klippy.log that contains that session. It would also be useful to include the dmesg output from after the issue has happened.
Re-read Obtaining candump logs,
generate the logs and send the files - I will look at the data and maybe find something in it.
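Roughly, the collection could look like this (paths and file names here are only examples - adjust them to your setup):

candump -tz -Ddex can0,#FFFFFFFF > candump-debug.log
cp ./out/klipper.dict klipper-debug.dict
dmesg > dmesg-after-failure.txt

The dict has to come from the same build that is currently flashed on the board, otherwise the decoded traffic won't line up.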

From your description and log, it seems like the CAN bus data became corrupt or the host/MCU was sending nonsense data.

I did run candump for the entirety of the last 4-hour print, which is what is attached, but to keep the file size down it was only logging error frames - or rather, explicitly flagged bit-stuffing violations.

While I still don't know what caused it, I switched the CAN bridge board for a Canable board and switched the wiring to IGUS CAN wiring, and the issue is gone. I did one change after the other: it was pretty much gone with just the bridge board swapped (I still received a couple of error frames, but no complete timeout), and since I had already ordered the wiring at that point I swapped it as well. Either way, the issue is not present anymore. If someone finds this post in the future and it matches their setup, the culprit is probably the Mellow FLY-UTOC-V1.

I also dumped the entire flash content of the UTOC and then “installed” their firmware again. I say “again” because the official firmware seems to be the only one that works for the board, but the firmware on that page was NOT the same as what was on my UTOC from the factory. I dumped the contents, flashed, then dumped again, and the two dumps gave different hashes, so a firmware update alone might also have done the trick - but once again, I already had replacement parts to install, so I installed them.
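(The comparison itself was nothing fancy, just hashing the two dump files - the file names here are only examples:

sha256sum utoc-flash-before.bin utoc-flash-after.bin

and the two checksums did not match.)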

If it does happen again, I will definitely go down that path and mention you in it. Thanks for the additional debugging steps.

I have changed the toolhead board from a Mellow FLY SB2040 v2 to a v3, changed the Mellow U2C to an MKS Canable, changed its USB cable multiple times, changed the CAN cable twice, changed and upgraded the Pi, and switched the MCU firmware between 32-bit and 64-bit because someone said that might be the issue. I have changed virtually every single thing that can be changed.

The issue still persists. It can't be solely a problem on my end if I can replace EVERYTHING and it's still there.

Here's the folder with a log, the candump log, and my dict file… I just need any help I can get at this point with a six-week-old problem. I am literally at my wits' end.

https://drive.google.com/drive/folders/1WddBAL_0O8ECXyvmECd-fDKu0-8jkSY-?usp=sharing

Hi,
I'm not a great expert on CAN bus, but from what I see in your latest data, at a certain moment the RPi sent a message onto the CAN bus that was not acknowledged by the other side (probably the RP2040), and two flags were raised: BRS and ESI.
BRS - Bit rate switch
ESI - Error State Indicator

(6961.476757) can0 TX B E 108 [8] 7E 3F 18 16 08 85 5A 48

Usually those errors indicate that there is no confirmation from the other side at the CAN bus layer - either the device is dead/hung or the CAN wiring is disconnected.

The puzzling part is this statement:

It's telling that the other side is still alive and accepting messages!
But your candump data shows that the CAN bus is trying to send something to the other side and is stuck - and if it is stuck, the “firmware restart” message should not be possible to send!
I see confirmation of this in your klippy.log:

Unable to issue reset command on MCU ‘can’

How exactly did you restore CAN bus communication?
Did you restart just the Klipper service? Reboot the RPi? Cut the power? Something else?

In my experience, the candump tool randomly reports B and E in its output. I recommend completely ignoring those columns during analysis.

-Kevin

How exactly did you restore CAN bus communication?

Pretty much anything works. I can simply do FIRMWARE_RESTART and it will work again; I can reboot and it will work again. I don't have to physically touch anything for the board to reconnect, even though the log says it shouldn't have been able to send a reset in the first place. Something really odd is happening. The weird thing is that this issue is now happening with 100% different hardware - nothing is the same as it was the first time I experienced this.

With the limited CAN bus knowledge I have, I have also read out the error counters on both the bridge and the toolhead board, and the highest error count I ever saw was 16 on the bridge. That not only means it was a transmit error on the bridge side, but it is also a far cry from the 2^7 that would put a CAN device into error-passive mode. Something software-related just seems to disconnect for no apparent reason. The bridge board also shouldn't be the issue, as I've switched from the Mellow UTOC to the MKS Canable and put Candlelight and every other reputable bridge firmware I could find on it - no change.
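(For the host side, the error counters and bus state can usually be read with iproute2 - assuming the interface is can0:

ip -details -statistics link show can0

which, depending on the adapter driver, shows the bus state and a berr-counter line with the TX/RX error counts.)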

At this point I am not even sure what else I could try that would change anything.

I've also wondered whether it might be some extruder state, since the last successful sends and the sends without a valid response are queue_step commands, so I tried setting the extruder to relative positioning to maybe avoid some edge case. That ran into the error within half an hour, compared to maybe 3 or 4 hours with absolute extruder positioning.

Then I noticed that when the error happens, the data packet that is being sent over and over would already violate CAN 2.0 bit stuffing, as it has 6 consecutive zeroes (dominant bits) somewhere in the middle - so a forced error frame? It also has 6 consecutive ones at the start, which makes me think the stuff bits have already been removed? Or does that mean an already invalid packet is being sent? I don't really know what the data is; any insight on that could help. Or maybe the messages are already stripped of stuff bits and the stuffing only happens at a lower transmission layer.

7E 3F 18 16 08 85 5A 48

011111100011111100011-000000-1011000001000100001010101101001001000

Deconstructing the message with an extended packet ID, the run of zeroes that would trigger the bit-stuffing violation occurs right at the start of the data portion of the packet.
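(For anyone who wants to double-check my bit fiddling, I just expanded the payload bytes like this:

python3 -c "print(' '.join(format(b, '08b') for b in bytes.fromhex('7E3F181608855A48')))"

which prints 01111110 00111111 00011000 00010110 00001000 10000101 01011010 01001000. As far as I understand it, the controller inserts a stuff bit on the wire after every 5 identical bits, so long runs in the raw payload are still legal at this level - only the wire representation gets stuffed.)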

I really don't know at this point. I've had this issue for 6 weeks now and I'm just tired of it. I am so close to just going back to wiring everything in parallel like normal, with a passive toolhead breakout board, but I also really, really don't like not knowing what is wrong.

Well, then this means a software reset of the CAN bus connection is enough.
Did you change the OS image version when you changed the RPi?
It's possible that you are using the same OS image on both RPis and it contains some bug in the kernel/CAN layer - in that case you could try switching the OS/kernel.

As additional testing of your setup, I would suggest a long-term stress test of the communication with your toolhead over the CAN bus.
You can run the Command dispatch benchmark, but change it so that it runs for hours…

I first tried the ~November version of 64-bit MainsailOS, which is what I had installed, then the current version of 64-bit MainsailOS, then 32-bit MainsailOS (as I read somewhere that this fixed a similar issue for someone else), then the current version of plain 64-bit Raspberry Pi OS, then the 32-bit one.

Then I swapped the 2GB Raspberry Pi 4 for an 8GB Raspberry Pi 5 with a different SD card and went through pretty much the same steps again. No change.

During that I also tried the old CAN bridge on and off. No change.

I'll try the command dispatch benchmark, although I doubt throughput is a limiting factor, as the CAN bus traffic is maybe 1 kb/s.

I'm suggesting the test because you need some reliable method to reproduce the issue without losing a physical 3D print job, and you can leave it running for a long time; additionally, increased usage of the CAN bus can make the issue appear sooner.
You can also throttle it to whatever throughput you like.

FLOOD semantics:

FLOOD counter delay
counter - number of requests
delay - fraction of MCU freq

delay=0.1 - should be 10msg/sec
delay=0.0001 - should be 10000msg/sec
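As a rough sketch, a long-running session could look like this (the numbers are only examples; see the Benchmarks document for how to point console.py at your CAN node):

get_uptime
FLOOD 3600000 0.001 debug_nop
get_uptime

counter=3600000 at delay=0.001 (~1000 msg/sec) keeps the bus busy for about an hour; scale both numbers for longer runs.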

Also, most probably you can replace “debug_nop” with “debug_ping data=.......” to send a data payload in a ping/pong manner.

I have not tested debug_ping, but I did see it in the firmware.

While the test is running, you can collect error rates, monitor the system or the CAN bus, simulate some issues, etc.…