CAN network failures - M8P CB2 EBB42

Basic Information:

Printer Model: Highly customized Anycubic Chiron
MCU / Printerboard: BTT Manta M8P v2 (mcu), BTT EBB42 v1.2 (EBB_X, EBB_Y, EBB_E)
Host / SBC: BTT CB2
klippy.log (9.9 MB)

The printer is a testbed. The hardware and config change frequently, but the CAN network failures have persisted in every configuration since I installed the M8P/CB2. I am using the M8P as the CAN adapter and do not have a separate CAN board.

The CAN network fails at random. I get a connection failure from one of the devices on the CAN bus, usually the standard timeout failure. The error usually comes from the main “mcu” MCU rather than the “EBB_X”, “EBB_Y”, or “EBB_E” device, although it can come from any of them. It has also occurred with no EBB42 boards connected, when the only MCU was the M8P itself. No matter which MCU generates the initial error, when the firmware is restarted I get a connection error on the “mcu” MCU. As this MCU is physically on the same board as the CAN adapter, this should be impossible. The problem can only be corrected by a power cycle; none of the reboot options clear it. By contrast, when there is a physical connection issue on the CAN network, reconnecting the device and performing a firmware restart corrects the error.
…

Update:

The error happened again. The initial shutdown message is “MCU ‘mcu’ shutdown: Missed scheduling of next digital out event”.

I clicked the “Firmware Restart” button.

After the system restarts it goes to a waiting message.

After a couple of minutes, it gives another CAN error message saying it is unable to connect to MCU ‘mcu’.

The error persists after every reboot until the system is hard power cycled.

The Klippy log is too big (38 MB). If someone can give me instructions, I’m more than happy to alter it or upload it another way. The new klippy log was downloaded after a successful reboot from a hard power cycle of the system, a failed full reboot from the web UI, and a failed firmware reboot from the error message UI.

Since the error I have installed can-utils and will try to set up its monitoring features, but I am out of my depth here. Any help or troubleshooting guidance would be appreciated.

Here is the initial shutdown error from the log file:

Receive: 73 825805.557532 825805.557211 11: seq: 17, clock clock=1286237664
Receive: 74 825805.601452 825805.557211 14: seq: 17, analog_in_state oid=7 next_clock=1307800642 value=9510
Receive: 75 825805.901528 825805.653788 14: seq: 18, analog_in_state oid=7 next_clock=1327000642 value=9513
Receive: 76 825806.044960 825806.040861 18: seq: 1a, tmcuart_response oid=2 read=b’\n\xfa\xef-\x808\x02\x08 \xbb’
Receive: 77 825806.051317 825806.047211 18: seq: 1b, tmcuart_response oid=2 read=b’\n\xfa/ \x80\x00\x02\x08\xa0\x89’
Receive: 78 825806.201467 825806.118678 14: seq: 18, analog_in_state oid=7 next_clock=1346200642 value=9511
Receive: 79 825806.224706 825806.224220 11: seq: 19, canbus_status rx_error=0 tx_error=0 tx_retries=0 canbus_bus_state=active
Receive: 80 825806.501526 825806.224220 14: seq: 19, analog_in_state oid=7 next_clock=1365400642 value=9512
Receive: 81 825806.541002 825806.540623 11: seq: 1a, clock clock=1349172257
Receive: 82 825806.801469 825806.540623 14: seq: 1a, analog_in_state oid=7 next_clock=1384600642 value=9514
Receive: 83 825807.045758 825807.041600 18: seq: 1c, tmcuart_response oid=2 read=b’\n\xfa\xef-\x808\x02\x08 \xbb’
Receive: 84 825807.052206 825807.048060 18: seq: 1d, tmcuart_response oid=2 read=b’\n\xfa/ \x80\x00\x02\x08\xa0\x89’
Receive: 85 825807.101518 825807.048060 14: seq: 1d, analog_in_state oid=7 next_clock=1403800642 value=9512
Receive: 86 825807.227701 825807.227248 11: seq: 1f, canbus_status rx_error=0 tx_error=0 tx_retries=0 canbus_bus_state=active
Receive: 87 825807.401472 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1423000642 value=9515
Receive: 88 825807.701575 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1442200642 value=9514
Receive: 89 825808.001487 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1461400642 value=9513
Receive: 90 825808.301727 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1480600642 value=9511
Receive: 91 825808.601486 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1499800642 value=9514
Receive: 92 825808.901593 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1519000642 value=9512
Receive: 93 825809.201537 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1538200642 value=9512
Receive: 94 825809.501699 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1557400642 value=9514
Receive: 95 825809.801659 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1576600642 value=9513
Receive: 96 825810.101791 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1595800642 value=9512
Receive: 97 825810.101837 0.000000 15: seq: 16, stats count=427 sum=491811 sumsq=5264979
Receive: 98 825810.401642 0.000000 14: seq: 16, analog_in_state oid=7 next_clock=1615000642 value=9513
Receive: 99 825810.402734 0.000000 12: seq: 16, shutdown clock=1596312989 static_string_id=Missed scheduling of next digital out event
Stats 825810.5: gcodein=0 canstat_mcu: bus_state=unknown rx_error=0 tx_error=0 tx_retries=0 mcu: mcu_awake=0.012 mcu_task_avg=0.000002 mcu_task_stddev=0.000001 bytes_write=269823397 bytes_read=58277823 bytes_retransmit=1281 bytes_invalid=0 send_seq=5332751 receive_seq=5332748 retransmit_seq=5332751 srtt=0.001 rttvar=0.001 rto=3.200 ready_bytes=635 upcoming_bytes=6119 freq=399992819 canstat_EBB42_E: bus_state=unknown rx_error=0 tx_error=0 tx_retries=0 EBB42_E: mcu_awake=0.008 mcu_task_avg=0.000018 mcu_task_stddev=0.000021 bytes_write=70657764 bytes_read=23606417 bytes_retransmit=1199 bytes_invalid=0 send_seq=1658089 receive_seq=1658086 retransmit_seq=1658089 srtt=0.001 rttvar=0.001 rto=3.200 ready_bytes=0 upcoming_bytes=1551 freq=63999150 adj=64000630 heater_bed: target=100 temp=100.9 pwm=0.000 sysload=0.88 cputime=33944.513 memavail=1588880 print_time=157127.733 buffer_time=1.633 print_stall=0 extruder: target=245 temp=245.1 pwm=0.000
Timeout with MCU ‘mcu’ (eventtime=825811.519333)
Timeout with MCU ‘EBB42_E’ (eventtime=825811.519333)
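
One way to locate the turning point in a dump like the one above is to look for the first “Receive:” entry whose second timestamp is 0.000000. The sketch below is a minimal example, assuming the columns are index, receive time, sent time, and message length, followed by a hex sequence number. That column layout is an assumption read off the excerpt, and the helper names are made up for illustration.

```python
import re

# Assumed layout of a klippy.log "Receive:" dump line (read off the excerpt,
# not confirmed anywhere in this thread):
#   Receive: <index> <receive_time> <sent_time> <length>: seq: <hex>, <message>
RECEIVE_RE = re.compile(
    r"^Receive: (\d+) ([\d.]+) ([\d.]+) (\d+): seq: ([0-9a-f]+), (.*)$"
)

def parse_receive(line):
    """Parse one dump line into a dict, or return None if it does not match."""
    m = RECEIVE_RE.match(line.strip())
    if m is None:
        return None
    idx, rx_time, tx_time, length, seq, message = m.groups()
    return {
        "index": int(idx),
        "receive_time": float(rx_time),
        "sent_time": float(tx_time),
        "length": int(length),
        "seq": int(seq, 16),
        "message": message,
    }

def first_zero_sent_time(lines):
    """Return the first parsed entry whose sent_time is 0.0, or None."""
    for line in lines:
        entry = parse_receive(line)
        if entry is not None and entry["sent_time"] == 0.0:
            return entry
    return None

# Two lines copied from the excerpt above.
sample = [
    "Receive: 86 825807.227701 825807.227248 11: seq: 1f, canbus_status "
    "rx_error=0 tx_error=0 tx_retries=0 canbus_bus_state=active",
    "Receive: 87 825807.401472 0.000000 14: seq: 16, analog_in_state "
    "oid=7 next_clock=1423000642 value=9515",
]

hit = first_zero_sent_time(sample)
print(hit["index"], hit["receive_time"], hex(hit["seq"]))
```

In the excerpt above the first such entry is index 87 at 825807.401472, roughly three seconds before the shutdown message arrives at 825810.402734, and the sequence number stays at 16 from there on.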

Welcome Bacheshatonee,

Even zipped?

Here it is zipped.

klippy(8).zip (3.2 MB)

Digging into the file as far as I am able, I see the following as the actual start of the errors where the print fails:

mycanlog.log (320.4 KB)

Stats 825809.5: gcodein=0 canstat_mcu: bus_state=active rx_error=0 tx_error=0 tx_retries=0 mcu: mcu_awake=0.012 mcu_task_avg=0.000002 mcu_task_stddev=0.000001 bytes_write=269823397 bytes_read=58277781 bytes_retransmit=1098 bytes_invalid=0 send_seq=5332751 receive_seq=5332748 retransmit_seq=5332751 srtt=0.001 rttvar=0.001 rto=1.600 ready_bytes=635 upcoming_bytes=5539 freq=399992819 canstat_EBB42_E: bus_state=active rx_error=0 tx_error=0 tx_retries=0 EBB42_E: mcu_awake=0.011 mcu_task_avg=0.000018 mcu_task_stddev=0.000021 bytes_write=70657764 bytes_read=23606348 bytes_retransmit=1054 bytes_invalid=0 send_seq=1658089 receive_seq=1658086 retransmit_seq=1658089 srtt=0.001 rttvar=0.001 rto=1.600 ready_bytes=0 upcoming_bytes=1374 freq=63999150 adj=64000649 sd_pos=3976233 heater_bed: target=100 temp=100.6 pwm=1.000 sysload=0.88 cputime=33944.417 memavail=1588568 print_time=157127.358 buffer_time=2.259 print_stall=0 extruder: target=245 temp=245.2 pwm=0.297
MCU ‘mcu’ shutdown: Missed scheduling of next digital out event

I see this at line #51705

Based on the “bytes_retransmit=1054” field, it looks like the actual failure occurs between lines #51702 and #51703. Nothing else in there seems to indicate any kind of error to me.
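
The comparison described above, watching where bytes_retransmit starts to climb, can be automated rather than done by eye. The following is a minimal sketch, assuming each “Stats” line is a run of `name:` section markers followed by `key=value` pairs, which is how the quoted lines look. The function names are hypothetical, and the sample lines are abbreviated copies of the two Stats lines quoted in this thread.

```python
import re

STATS_RE = re.compile(r"^Stats ([\d.]+): (.*)$")

def parse_stats(line):
    """Split one 'Stats' line into (eventtime, {section: {key: value}})."""
    m = STATS_RE.match(line.strip())
    if m is None:
        return None
    eventtime, rest = float(m.group(1)), m.group(2)
    sections = {}
    current = sections.setdefault("", {})  # keys before the first "name:" token
    for token in rest.split():
        if token.endswith(":"):
            current = sections.setdefault(token[:-1], {})
        elif "=" in token:
            key, value = token.split("=", 1)
            # Simplification: global keys such as sysload end up in whatever
            # section precedes them. That does not matter for this counter.
            current[key] = value
    return eventtime, sections

def counter_increases(lines, field="bytes_retransmit"):
    """Yield (eventtime, section, increase) each time the counter grows."""
    previous = {}
    for line in lines:
        parsed = parse_stats(line)
        if parsed is None:
            continue
        eventtime, sections = parsed
        for name, values in sections.items():
            if field not in values:
                continue
            now = int(values[field])
            if name in previous and now > previous[name]:
                yield eventtime, name, now - previous[name]
            previous[name] = now

# Abbreviated from the two Stats lines quoted in this thread.
sample = [
    "Stats 825809.5: gcodein=0 canstat_mcu: bus_state=active rx_error=0 "
    "mcu: mcu_awake=0.012 bytes_retransmit=1098 rto=1.600 "
    "canstat_EBB42_E: bus_state=active rx_error=0 "
    "EBB42_E: mcu_awake=0.011 bytes_retransmit=1054 rto=1.600",
    "Stats 825810.5: gcodein=0 canstat_mcu: bus_state=unknown rx_error=0 "
    "mcu: mcu_awake=0.012 bytes_retransmit=1281 rto=3.200 "
    "canstat_EBB42_E: bus_state=unknown rx_error=0 "
    "EBB42_E: mcu_awake=0.008 bytes_retransmit=1199 rto=3.200",
]

for eventtime, name, increase in counter_increases(sample):
    print(f"{eventtime} {name} +{increase}")
```

On those two lines it reports increases of 183 bytes for mcu and 145 bytes for EBB42_E within the same one-second window. Run over the whole log, the same loop would list every interval in which either counter moved.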

I have the CAN monitoring working and can record what appears to be a full packet capture of the CAN network. I can’t see having this running 24/7, but am happy to take any suggestions on how it can be used. The attached file was captured after the system was rebooted from the error. There is a possibility it has some data from before the reboot at the beginning, but that depends on whether the file is appended to or overwritten by the stock command from the Klipper documentation at Klipper CANBUS troubleshooting. It was run once before the power cycle and again after the system was completely back online. The file as created has no extension; I added the .log in order to upload it.
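
To sidestep the question of whether an earlier capture gets overwritten, each run can write to its own timestamped file. This is a minimal sketch, assuming the capture command is `candump -tz -Ddex can0,#FFFFFFFF` as recalled from the Klipper CANBUS troubleshooting page (worth checking against that page), and that the interface is named can0. The function names and the file-name pattern are made up for illustration.

```python
import subprocess
import time

def capture_path(prefix="mycanlog", now=None):
    """Build a file name such as mycanlog-20250101-120000.log (local time)."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    return f"{prefix}-{stamp}.log"

def candump_command(interface="can0"):
    """Assumed capture command; verify the flags against the Klipper docs."""
    return ["candump", "-tz", "-Ddex", f"{interface},#FFFFFFFF"]

def start_capture(interface="can0", prefix="mycanlog"):
    """Start candump in the background, writing to a fresh file each run."""
    path = capture_path(prefix)
    out = open(path, "w")  # a new name per run, so nothing is overwritten
    proc = subprocess.Popen(candump_command(interface), stdout=out)
    return proc, path

# Example use on the printer host (requires can-utils to be installed):
#   proc, path = start_capture()
#   ... wait for the fault to occur ...
#   proc.terminate()
```

Because every run gets a new name, a capture taken before a power cycle and one taken after it stay in separate files.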

Does not help much

(000.000000) can0 RX - - 109 [8] 14 1D 70 06 88 D9 93 89
(000.000014) can0 RX - - 109 [8] 40 87 BF 5C 88 D6 80 9F
(000.000091) can0 RX - - 109 [4] 00 5C AB 7E
(000.054988) can0 RX - - 10B [8] 0E 19 77 07 89 E6 97 BC
(000.055023) can0 RX - - 10B [6] 65 B2 58 FB C7 7E
(000.060223) can0 TX - - 108 [6] 06 1D 05 9D AE 7E
(000.060246) can0 RX - - 109 [8] 0B 1E 63 88 E4 AC BE 2A
(000.060257) can0 RX - - 109 [8] F1 85 7E 05 1E 77 FF 7E
(000.151129) can0 RX - - 109 [8] 0F 1E 6A 0F 89 AD C6 A6
(000.151144) can0 RX - - 109 [7] 00 81 ED 45 0D D1 7E

and so on.

This is a result of the CANBUS monitoring tool from the klipper CANBUS troubleshooting page. Like I said, it’s a packet capture of the CAN network traffic. Kinda like PCAP for LAN. Beyond my knowledge to interpret, but per their troubleshooting guide, the tool is available.
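
Even without decoding the payloads, a capture like this can be summarized by CAN ID and by gaps in the traffic. Below is a minimal sketch, assuming every line follows the layout seen in the excerpt (timestamp, interface, RX/TX, two flag columns, hex ID, length in brackets, data bytes). The helper names are hypothetical, and the sample is the ten lines quoted above.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt in this thread:
#   (000.060223) can0 TX - - 108 [6] 06 1D 05 9D AE 7E
CANDUMP_RE = re.compile(
    r"^\((\d+\.\d+)\)\s+(\S+)\s+(RX|TX)\s+\S+\s+\S+\s+"
    r"([0-9A-Fa-f]+)\s+\[(\d+)\]\s*(.*)$"
)

def parse_frame(line):
    """Parse one candump line into a dict, or return None if malformed."""
    m = CANDUMP_RE.match(line.strip())
    if m is None:
        return None
    stamp, iface, direction, can_id, length, data = m.groups()
    payload = bytes.fromhex(data.replace(" ", ""))
    if len(payload) != int(length):
        return None  # data bytes do not match the declared length
    return {"time": float(stamp), "iface": iface, "direction": direction,
            "can_id": can_id.upper(), "data": payload}

def count_by_id(frames):
    """Count frames per (CAN ID, direction)."""
    return Counter((f["can_id"], f["direction"]) for f in frames)

def largest_gap(frames):
    """Return (gap_seconds, time_of_frame_that_ended_the_gap)."""
    best = (0.0, None)
    for prev, cur in zip(frames, frames[1:]):
        gap = cur["time"] - prev["time"]
        if gap > best[0]:
            best = (gap, cur["time"])
    return best

sample = """\
(000.000000) can0 RX - - 109 [8] 14 1D 70 06 88 D9 93 89
(000.000014) can0 RX - - 109 [8] 40 87 BF 5C 88 D6 80 9F
(000.000091) can0 RX - - 109 [4] 00 5C AB 7E
(000.054988) can0 RX - - 10B [8] 0E 19 77 07 89 E6 97 BC
(000.055023) can0 RX - - 10B [6] 65 B2 58 FB C7 7E
(000.060223) can0 TX - - 108 [6] 06 1D 05 9D AE 7E
(000.060246) can0 RX - - 109 [8] 0B 1E 63 88 E4 AC BE 2A
(000.060257) can0 RX - - 109 [8] F1 85 7E 05 1E 77 FF 7E
(000.151129) can0 RX - - 109 [8] 0F 1E 6A 0F 89 AD C6 A6
(000.151144) can0 RX - - 109 [7] 00 81 ED 45 0D D1 7E
""".splitlines()

frames = [f for f in map(parse_frame, sample) if f is not None]
for (can_id, direction), n in sorted(count_by_id(frames).items()):
    print(can_id, direction, n)
print("largest gap:", largest_gap(frames))
```

On the ten quoted lines this counts seven RX frames on ID 109, two RX frames on 10B, and one TX frame on 108. Over a capture that spans a failure, an unusually long gap, or an ID that stops appearing, would mark where to look more closely.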

I believe you are mistaken.

Your configuration shows the only “route” to the MCU is via CAN. I do see a passing mention of Using SPI to CAN in the CB2 documentation. That leads to instructions on how to edit the Armbian config to add a module.

The generic-bigtreetech-manta-m8p-V2_0.cfg includes the section

[mcu]
serial: /dev/serial/by-id/usb-Klipper_Klipper_firmware_12345-if00

I don’t understand how the CB2 ↔ M8P CAN connection could ever work as I do not find any mention of the required 120 ohm terminating resistors for either device.

Hopefully someone will explain things here soon

I tried running “./scripts/parsecandump.py mycanlog 108 ./out/klipper.dict” but I get a bunch of errors that I didn’t think appropriate to post here until someone needs them.

@cardoc It was installed using the USB to CANBUS Bridge mode. I can link to the YouTube video I used for the install if needed. It’s been running “mostly” OK since Jan.

@cardoc With this hardware setup the resistor is on the “end” device on the CAN network. If there are no additional MCUs other than the M8P, no resistor is needed. I assume there is something in the software/drivers that allows that to work.

Here is the link to the setup I used. The only mod I made to it is to install the OS directly to the eMMC via their flashing tool, so no SD card used at all.

[A Guide:] Klipper Board Setup with BTT Manta M8P V2 + CB2 Compute Module

I’m really not conversant with CANopen. I am however very familiar with J1939 (SAE Spec) as I worked in the global tech call center for Freightliner. Modern trucks (real trucks not a Ridgeline or F150) have 20 to 50 CAN nodes on the main chassis bus.

In that “universe” 90% of the communication issues were either too many or not enough terminating resistors. CAN is extremely fault tolerant but needs a proper electrical circuit to work.
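
A quick check for that, with the printer powered off, is to measure the resistance between CAN high and CAN low: the reading should be the parallel combination of however many terminators are on the bus. The sketch below works through that arithmetic, assuming ideal 120 ohm terminators and ignoring wire and transceiver resistance. The function name is made up for illustration.

```python
def bus_resistance(terminators, ohms_each=120.0):
    """Resistance across CAN high/low with N equal terminators in parallel."""
    if terminators < 1:
        raise ValueError("an unterminated bus reads as open circuit")
    return ohms_each / terminators

for n in (1, 2, 3):
    print(n, "terminator(s):", bus_resistance(n), "ohm")
# 1 terminator(s): 120.0 ohm
# 2 terminator(s): 60.0 ohm
# 3 terminator(s): 40.0 ohm
```

A reading near 60 ohm points to the usual two terminators, near 120 ohm to one missing, and near 40 ohm to one too many.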

This error occurred with a single EBB42 with the resistor pins bridged on the end of a straight cable containing 2 twisted pairs of wire, one 18ga and the other 14ga. The 14ga are 24v and ground. The 18ga are Tx/Rx. The shielding is grounded at the M8P side only.

When I use more than 1 EBB42 the same cabling is used in this configuration:

Wire connections are soldered and shrink wrapped. Openings at splices are less than 3cm, and the shielding is spliced as well, with a single path to ground at the M8P. Only the “end” unit has the resistor pins bridged. It runs without errors other than these random ones and the occasional connection error when I monkey with something while the power’s on.

I think I understand the CB2<->M8P connection.

It seems the connection is SPI using the CAN data protocol but with no CAN transceivers. So electrically the data moves as a single-ended serial connection, not a differential CAN signal. That eliminates the terminating resistors from that branch.

That makes sense.

I assume you have read this

The fact that you occasionally get the error without a toolhead node connected leaves only the connection between the boards, the board traces, and the solder joints at the chips as a source of communication loss.

EDIT: A quick and dirty fix is to add a U2C and point Klipper to it. You “should” edit the Armbian configuration to remove the overlay for the SPI connection.

The only problem with a physical connection error as the explanation is that this fault requires a full power cycle to reset. To me that feels much more like a software fault than a hardware fault.

When I cause a physical connection error, a firmware reboot fixes it 100% of the time, as long as the underlying physical connection is fixed. With these errors, even issuing a reboot command from the Linux prompt via ssh doesn’t fix it. Only a full hard power cycle will do it.

You may read https://klipper.discourse.group/t/missed-scheduling-of-next-digital-out-event.

@cardoc I ordered an Orange Pi CM4 to try. I’ve seen a couple of other posts, and I’m thinking it’s an issue either with BTT’s fork of the Klipper image or with the CB2 itself.

Because the SPI route is connected “deeper” into the OS than the other method.

@hcet14 I have read through that thread and its linked topic. I’ve taken every step I can find to harden and troubleshoot the connections themselves. I do get occasional retransmits, as can be seen in the klipper log. System utilization is low throughout the log, and I don’t see anything that appears unable to keep up. Memory utilization seems fine as well. Even the CAN network utilization isn’t what I would consider unreasonably high. I am NOT an expert, though. Overheating is possible on the EBB42, as it’s in a 50 °C chamber without a fan, but my guess is the TMC2209 would overheat first. (I have heated one up enough to start throwing errors, but it was fine once it cooled down.)

When this error occurs, both MCUs fail at exactly the same time. When I cause a physical disconnect, only the EBB42 MCU shows as a failure, and connections are reestablished when a firmware restart is performed. The M8P is also on a physically different network than the EBB42. If the M8P is using an SPI connection but just sending CAN data packets, any connection fault on the physical CAN network should have no effect on it.

@cardoc Would there be any way to identify if physical faults from traces/connections on the M8P itself are the cause?