Can a failing MCU cause 'bytes_retransmit'?

Basic Information:

Printer Model: Voron 2.4
MCU / Printerboard: Spider v1.0, SB2209, Cartographer, MMB
Host / SBC RPI
klippy.log

klippy.log.zip (1.0 MB)

Klipper v0.12.0-256

RPI → BTT U2C v2.1 → BTT CEB

I had the printer running great for a while, UART to the Spider board and CAN to the toolhead/cart probe and MMB. Power for the RPI also from the Spider. Then I ran into TTC errors on a multi-color print job, always about 3 hours into it, just after a tool change, but after a couple of tests never at the same spot. So I started digging in (and it has gone downhill since then). The TTC errors seemed to point to the main MCU, not the MMB, so I focused there. Also no bytes_invalid or bytes_retransmit (at that time).

I saw a lot of ‘Neopixel update did not succeed’ errors and started to worry that my LED string (although only 71 pixels) was causing problems, so I did 3 things specifically for that: I installed an RS25 dedicated to powering the RPI (left the LEDs on the 5V bus off of the MCU board), lowered the frame rate on all effects to 12 (halving the number of updates per second), and put in a Mellow Fly-D5 just to handle the Neopixel traffic and moved the LEDs to that board. I had also been using commercial CAN bus cables that were all pretty long, so I shortened them and recrimped the terminations.

I started testing and was getting:
Timeout on wait for ‘tmcuart_response’ response
Lost communication with MCU ‘mcu’

At first I thought that I had pulled a wire loose or bumped a driver out of its socket, or anything along those lines. I went around “tugging” on the stepper cables and found 2 wires in one connector that were a little loose (I pulled them out of their pins, albeit with a fair bit of force). I stripped and recrimped that connector. The others all looked good and all of the drivers were fully seated in their sockets on the board.

Then I started to see bytes_invalid for the [mcu]. So I crimped up a CAN transceiver, reflashed the Spider with Klipper configured for a CAN connection, and connected it (using the transceiver) to the CAN bus. That worked (at least Klipper would connect to it and start, and I could talk to the Spider). But any test now shows the bytes_retransmit numbers growing rapidly, followed within seconds by ‘Lost communication with MCU’. Sometimes I can home and do a couple of other tasks (e.g. heat up the hot end), but this last time I turned on the printer, walked away to grab some food, and came back to find the printer had already shut down.

So my main question is whether the retransmits are a possible indication of a failing control board? Or is the problem somewhere else downstream from the main board?

Thanks in advance.

bytes_retransmit is the number of bytes that had to be sent twice because, for whatever reason, a command had to be retransmitted from the host to the MCU.
Usually this indicates some sort of bandwidth issue, because the host decided to send a command again as it did not receive an acknowledgement in a timely manner.
In itself, it might not be severe unless followed up by “real” errors.
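To see how these counters develop over a session, a minimal sketch like the one below could pull them out of klippy.log. It assumes the periodic `Stats` lines carry one section per MCU in the form `<name>: mcu_awake=... bytes_retransmit=... bytes_invalid=...`; the function names and the way growth is computed are illustrative only, not part of Klipper.

```python
import re

# Assumed layout of a klippy.log stats line (one section per MCU):
#   Stats 100.0: gcodein=0  mcu: mcu_awake=... bytes_retransmit=9 bytes_invalid=0 ... other: mcu_awake=...
SECTION_RE = re.compile(r"(\w+): mcu_awake=")
COUNTER_RE = re.compile(r"(bytes_retransmit|bytes_invalid)=(\d+)")


def parse_stats_line(line):
    """Return {mcu_name: {counter: value}} for one 'Stats' line, else {}."""
    if not line.startswith("Stats "):
        return {}
    result = {}
    matches = list(SECTION_RE.finditer(line))
    for i, m in enumerate(matches):
        # A section runs until the next "<name>: mcu_awake=" or end of line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        section = line[m.start():end]
        result[m.group(1)] = {k: int(v) for k, v in COUNTER_RE.findall(section)}
    return result


def counter_growth(lines):
    """Return {mcu_name: {counter: last - first}} across all stats lines.

    Counters start over when Klipper restarts, so feed in a single session.
    """
    first, last = {}, {}
    for line in lines:
        for mcu, counters in parse_stats_line(line).items():
            first.setdefault(mcu, counters)
            last[mcu] = counters
    return {
        mcu: {k: v - first[mcu].get(k, 0) for k, v in last[mcu].items()}
        for mcu in last
    }
```

Calling `counter_growth(open("klippy.log"))` on one session's log would then show which MCU section accumulates retransmits and which stays at zero.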

Timeout with MCU / Lost communication with MCU occurs when the host no longer receives data from the MCU. It can be the wiring, or something made the MCU crash (e.g. due to some faulty hardware), so it just no longer reacts.
Unfortunately there is no deeper indication as to what could be the reason.
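On a CAN setup, one place to look for hints on the wiring side is the error and drop counters the Linux kernel keeps for each network interface under `/sys/class/net/<iface>/statistics/`. The sketch below reads a few of them twice so that only counters that moved are reported. The interface name `can0` is an assumption about this setup, the helper names are made up, and rising counters point toward the bus rather than proving anything about the MCU.

```python
from pathlib import Path

# Standard per-interface counters exposed by the Linux kernel in sysfs.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_link_counters(iface="can0", root="/sys/class/net"):
    """Read error/drop counters for one network interface from sysfs."""
    stats_dir = Path(root) / iface / "statistics"
    values = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


def changed_counters(before, after):
    """Return only the counters that increased between two readings."""
    return {
        k: after[k] - before[k]
        for k in after
        if k in before and after[k] > before[k]
    }
```

Taking one reading before a test print and one after the failure, then passing both to `changed_counters`, gives a quick view of whether the host side of the bus saw errors during that window.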

Side notes:

  • Happy Hare currently does not seem to play well with Klipper
  • From the crash dumps it seems that Klipper was doing nothing other than “playing Neopixel” using 3rd party extras that do not belong to Klipper mainline and as such are neither supported nor guaranteed to be free of side effects, especially if several such 3rd party extras are stacked.

I appreciate the information; I had already read many of those links. However, I had not found the Timeout with MCU link in my searches, although I am not sure how that slipped through my search ‘net’. Thanks for that.

I just wonder whether one of the TMC2209 drivers going bad could cause this issue. The first errors were tmcuart_response timeouts, which made me think maybe the driver itself went bad. I have had bad drivers before, but usually they show up as missed steps or just being “off” in terms of accuracy.

I had meant to follow up last night with a few additional points.

  • As part of my first attempts to resolve the TTC I had added active cooling to the RPI and reimaged onto a brand new Sandisk Ultra card. (was running on a Sandisk Ultra already, but an older one).
  • This was built “Doom” style, so there are no high voltage wires in the same compartment as the electronics. All mains power is underneath, all LV is on top. So EMI is less of a suspect, but I still try not to run the CAN cables off on their own, away from the 24vdc/5vdc power lines.
  • I have read about HH causing issues with Klipper. Unfortunately I found that out after I had ripped out half my wiring and purchased the extra electronics.
  • I understand that 3rd party ‘add ons’ are not supported or guaranteed. I am not looking for support for them, I was just asking a specific question about bytes_retransmit. :slight_smile:

This link that you provided does seem to support my suspicion that the MCU (Spider 1.0 board) is faulty/dying.

Timeout with MCU / Lost communication with MCU occurs when the host no longer receives data from the MCU. It can be the wiring, or something made the MCU crash (e.g. due to some faulty hardware), so it just no longer reacts.

Specifically the bit about “or something made the MCU crash (e.g. due to some faulty hardware)”. I started to suspect that when I saw errors on UART and then on CAN. I swapped to CAN because I had modified my UART cable to take out the 5V power when I added the RS25 to power the RPI, and thought I had possibly borked the connector, cable or connection somehow, especially given the bytes_invalid on the UART connection. I wish there was an MCU-specific log file to scrape/review, like putting a specific MCU into “debug logging” mode. I am learning more about the various deeper-dive Klipper troubleshooting tools. Does something like that exist?

I am going to reflash the main board for USB communication and see if there are still errors (I am anticipating that there will be). But because I may have screwed up the UART cable (not sure how I could have just by removing the 5V pins, but who knows) and the CAN transceiver pins on the 1.0 Spider boards were not really recommended (I forgot where I read that), it is good to rule out all possibilities.

I have also been thinking about changing from SD card to an NVMe on the RPI. That should help with the HH related issues, and now that I no longer use the UART pins on the RPI I can use an NVMe hat. Oh, and the RPI is a 4B 2GB. I never see the memory completely flatline, but I was thinking of upgrading to a 4GB model, though that is just a last resort.

I am very much open to suggestion here and appreciate the feedback. Thanks!

I read the MCU timeout link.

"This error often appears in conjunction with the Got EOF when reading from device errors"

No EOF in the logs. I had checked for that previously.

Most of the suggestions/points were referring to USB connections, which is not relevant for my setup.

The only USB in use is to the U2C, and I was not using USB to the MCU that is losing communication, at least not until later this morning.

I would start by simplifying the setup, beginning with doing away with the Neopixel. I do not believe this is a driver issue.
