EBB36 CAN cold start issue FIX

Basic Information:

Printer Model: home built bed slinger aka “ the …Franken8”
MCU / Printer board: BTT Manta M5P | EBB36 toolhead | USB-CAN bridge
Host / SBC: CB1 on board
klippy.log: attached

Describe your issue: After updating my MCUs from v0.12 to v0.13, Klipper reported that it was unable to connect to the EBB36.

Straight up, I am no software dev. Much of what follows is me blindly driving a console, ‘vibe debugging’ at Claude’s direction, but it seems to have resolved the issue. The AI-generated summary of the debugging and the fix applied follows.

Bug Report: usb_canbus stall detection causes CAN toolhead board connection failure on cold power cycle

Summary

The stall detection introduced in commit 2c90c97c (usb_canbus: Detect canbus stalls when in usb to canbus bridge mode) causes CAN toolhead boards to fail to connect after a cold power cycle when using a USB-to-CAN bridge mainboard. The 50ms stall timeout is too short for toolhead boards that require more than 50ms to boot and assert CAN bus presence on cold start.

Environment

Component           | Details
Host                | BTT CB1 (Linux/ARM)
Mainboard           | BTT Manta M5P V1.0, STM32G0B1, USB-to-CAN bridge mode
Toolhead board      | BTT EBB36, STM32G0B1, CAN node
Bootloader          | Katapult v0.0.1-110-gb0bf421 (EBB36)
Klipper host        | v0.13.0-572-g88a71c3ce
Manta MCU firmware  | v0.13.0-572-g88a71c3ce
EBBCan MCU firmware | v0.13.0-572-g88a71c3ce
CAN bus speed       | 500000 baud
CAN interface       | gs_usb via Manta M5P USB-CAN bridge

Symptom

After a cold power cycle (PSU off, then on), Klipper fails to connect to the EBBCan toolhead MCU with repeated timeout errors:

mcu 'EBBCan': Wait for identify_response
mcu 'EBBCan': Serial connection closed
mcu 'EBBCan': Timeout on connect
mcu 'EBBCan': Unable to connect

Warm restarts (firmware restart, Klipper service restart, or system reboot without PSU off) work correctly every time. The issue is exclusively triggered by a cold power cycle.

Diagnosis

CAN bus traffic analysis

candump captured during a failed cold start shows the Manta bridge correctly receives and broadcasts SET_NODEID commands to the EBBCan (0x3F0), and the host correctly sends identify requests to EBBCan’s assigned CAN ID (0x10A). However, EBBCan never transmits any response on its transmit channel (0x10B):

3F0#0126D998A69C7A05  ← SET_NODEID broadcast: UUID=26d998a69c7a, nodeid=5
10A#081101002842247E  ← host → EBBCan: identify request (MCU_RX)
10A#7E08110100284224  ← host → EBBCan: identify request (MCU_RX)
10A#7E               ← host → EBBCan: identify request (MCU_RX)
[... repeated with exponential backoff, no 0x10B traffic ever appears ...]

Querying Katapult after a failed cold start shows EBBCan CAN hardware is functional — it responds correctly on the admin channel (0x3F1):

3F0#00               ← QUERY_UNASSIGNED broadcast
3F1#2026D998A69C7A11 ← EBBCan responds: UUID=26d998a69c7a (CAN TX working)

This rules out EBBCan hardware or firmware as the cause.
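For anyone reading the captures above by hand, here is a minimal sketch of how the SET_NODEID payload can be picked apart. It assumes the layout implied by the annotations in this report (one command byte, a 6-byte UUID, one nodeid byte) and assumes the host-to-MCU CAN ID is 0x100 + 2 * nodeid with the MCU-to-host ID one higher, which matches nodeid=5, 0x10A and 0x10B in the dump. The struct and function names are made up for illustration and are not taken from Klipper.

```c
#include <stdint.h>
#include <string.h>

/* Command/response bytes as they appear in the captures above. */
enum {
    CMD_QUERY_UNASSIGNED = 0x00, /* 3F0#00 */
    CMD_SET_NODEID       = 0x01, /* first byte of the 3F0 SET_NODEID frame */
    RESP_NEED_NODEID     = 0x20  /* first byte of the 3F1 response */
};

/* Assumed mapping: host->MCU ID = 0x100 + 2 * nodeid, MCU->host = +1. */
static uint32_t nodeid_to_rx_id(uint8_t nodeid)
{
    return 0x100u + 2u * nodeid;
}

static uint32_t nodeid_to_tx_id(uint8_t nodeid)
{
    return nodeid_to_rx_id(nodeid) + 1u;
}

/* Hypothetical decoded form of an 8-byte SET_NODEID payload. */
struct set_nodeid {
    uint8_t uuid[6];
    uint8_t nodeid;
};

/* Returns 0 on success, -1 if the payload is not a SET_NODEID frame. */
static int parse_set_nodeid(const uint8_t *data, size_t len,
                            struct set_nodeid *out)
{
    if (len != 8 || data[0] != CMD_SET_NODEID)
        return -1;
    memcpy(out->uuid, &data[1], sizeof(out->uuid));
    out->nodeid = data[7];
    return 0;
}
```

Feeding it the bytes 01 26 D9 98 A6 9C 7A 05 from the first candump line gives UUID 26d998a69c7a and nodeid 5, so under the assumed mapping the toolhead should receive on 0x10A and transmit on 0x10B, which is where the missing traffic was expected.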

Root cause

The frames sent by the host to EBBCan (0x10A) are being silently discarded by the Manta USB-CAN bridge before they reach the CAN bus.

Commit 2c90c97c introduced a stall detection mechanism in src/generic/usb_canbus.c. When the bridge cannot successfully send a CAN frame for 50ms, it transitions to a BSS_DISCARDING state and silently drops all subsequent outbound frames until a successful send occurs.

On cold power-on, the EBBCan STM32G0B1 requires more than 50ms to complete its boot sequence (clock initialisation, Katapult bootloader check, Klipper firmware startup, FDCAN peripheral initialisation). During this window, the EBBCan cannot acknowledge CAN frames. The Manta bridge attempts to send early frames, fails to get bus acknowledgement within 50ms, enters BSS_DISCARDING, and then drops all subsequent host-to-EBBCan frames — including the SET_NODEID and identify commands that Klipper depends on for connection establishment.

The relevant code path in src/generic/usb_canbus.c:

if (UsbCan.bus_send_state == BSS_READY) {
    // Just starting to block - setup stall detection after 50ms
    UsbCan.bus_send_state = BSS_BLOCKING;
    UsbCan.bus_send_discard_time = timer_read_time() + timer_from_us(50000); // ← too short
}

The state only clears back to BSS_READY on a successful send — which cannot happen if all frames are being discarded.

Warm restarts are unaffected because the EBBCan firmware is already running before Klipper reconnects, so it can immediately acknowledge CAN frames and no stall occurs.
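To make the timing argument concrete, here is a small host-side model of the three states named above. It is a simplified sketch, not the code in src/generic/usb_canbus.c: the struct, the function names, the 1 ms frame spacing and the 200 ms boot time used in the note below are all assumptions for illustration. It only shows when the deadline trips for a given timeout and makes no claim about how a real bridge behaves once it is discarding.

```c
#include <stdbool.h>
#include <stdint.h>

enum bus_send_state { BSS_READY, BSS_BLOCKING, BSS_DISCARDING };

/* Hypothetical model of the bridge's send state. */
struct bridge_model {
    enum bus_send_state state;
    uint32_t discard_time_us; /* deadline set when blocking starts */
    uint32_t timeout_us;      /* 50000 in the snippet quoted above */
};

/* Feed the model one send attempt. Returns true if the frame is dropped. */
static bool model_send(struct bridge_model *b, uint32_t now_us, bool send_ok)
{
    if (send_ok) {
        b->state = BSS_READY; /* a successful send clears the stall */
        return false;
    }
    if (b->state == BSS_READY) {
        b->state = BSS_BLOCKING;
        b->discard_time_us = now_us + b->timeout_us;
    }
    /* Wrap-safe "deadline reached" comparison. */
    if (b->state == BSS_BLOCKING
        && (int32_t)(now_us - b->discard_time_us) >= 0)
        b->state = BSS_DISCARDING;
    return b->state == BSS_DISCARDING;
}

/* Offer one frame every 1 ms until end_us; sends only succeed once the
 * toolhead has booted at boot_us. Returns how many frames were dropped. */
static unsigned count_dropped(uint32_t timeout_us, uint32_t boot_us,
                              uint32_t end_us)
{
    struct bridge_model b = { BSS_READY, 0, timeout_us };
    unsigned dropped = 0;
    for (uint32_t t = 0; t < end_us; t += 1000)
        dropped += model_send(&b, t, t >= boot_us);
    return dropped;
}
```

With a made-up 200 ms toolhead boot time, count_dropped(50000, 200000, 300000) drops every frame offered between 50 ms and 200 ms, while count_dropped(5000000, 200000, 300000) drops none because the deadline is never reached.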

Fix

Increasing the stall timeout from 50ms to 5000ms in src/generic/usb_canbus.c resolves the issue:

// Before:
UsbCan.bus_send_discard_time = timer_read_time() + timer_from_us(50000);

// After:
UsbCan.bus_send_discard_time = timer_read_time() + timer_from_us(5000000);

After rebuilding and reflashing the Manta M5P with this change, cold power cycle connection succeeds consistently.

Alternative fixes to consider

A more robust solution might be to reset bus_send_state to BSS_READY whenever a new UUID assignment cycle begins (i.e. when a SET_NODEID admin frame is sent), so that the discarding state cannot persist across a fresh connection attempt. This would preserve the stall detection benefit for genuine mid-session bus failures while avoiding false positives during boot.
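A rough sketch of what that alternative could look like, written as a standalone model rather than a patch. The struct, the on_host_frame hook and the macro names are hypothetical and do not exist in Klipper under these names; only the BSS_* state names, the 0x3F0 admin ID and the 0x01 command byte come from the material above. It has not been tested on hardware, and it only restarts the stall window; whether that is safe for the bridge's message queue is not established here.

```c
#include <stdbool.h>
#include <stdint.h>

enum bus_send_state { BSS_READY, BSS_BLOCKING, BSS_DISCARDING };

/* Hypothetical stand-in for the bridge's send-state fields. */
struct bridge_state {
    enum bus_send_state bus_send_state;
    uint32_t bus_send_discard_time;
};

#define ADMIN_CAN_ID   0x3F0u /* admin broadcast ID seen in the candump */
#define CMD_SET_NODEID 0x01u  /* first payload byte of SET_NODEID */

/* True if a host frame looks like the SET_NODEID admin broadcast. */
static bool is_set_nodeid(uint32_t can_id, const uint8_t *data, uint8_t dlc)
{
    return can_id == ADMIN_CAN_ID && dlc == 8 && data[0] == CMD_SET_NODEID;
}

/* Hypothetical hook, called for each frame the host hands to the bridge. */
static void on_host_frame(struct bridge_state *s, uint32_t can_id,
                          const uint8_t *data, uint8_t dlc)
{
    if (is_set_nodeid(can_id, data, dlc))
        s->bus_send_state = BSS_READY; /* restart the stall window */
}
```

In this model an identify request on 0x10A leaves a discarding bridge untouched, while a SET_NODEID broadcast on 0x3F0 puts it back in BSS_READY so the next failed send starts a fresh timeout.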

Reproduction steps

  1. BTT Manta M5P in USB-to-CAN bridge mode with stock v0.13 firmware

  2. BTT EBB36 (or similar STM32G0B1 toolhead board) as a CAN node

  3. Power cycle the PSU completely (not just a warm reboot)

  4. Observe Klipper failing to connect to the toolhead MCU

  5. Confirm candump shows zero traffic on the toolhead’s transmit CAN ID

Additional notes

  • The issue was initially mistaken for an EBBCan firmware problem (bus-off, FDCAN clock misconfiguration, Katapult timing) and required extensive CAN bus traffic analysis to correctly attribute to the bridge’s discard logic.

  • The problem affects any configuration where a toolhead board takes longer than 50ms to boot relative to the USB-CAN bridge board on cold start. This is likely to affect other STM32-based toolhead boards beyond the EBB36.

  • Tested and confirmed on Klipper v0.13.0-572-g88a71c3ce. The stall detection was not present in v0.12, which did not exhibit this issue.

Hope it helps resolve this for others.

Interesting.

Can I ask, how did you set up your system?

What instructions did you use for configuring the CB1 for CAN operations?

I don’t have a link to the original source I used when I built the machine about 20 months ago on v0.12, but for the update I followed Esoterical’s CANBus Guide, which seems to be the current go-to for CAN bus builds.

My bitrate is 500000, as advised for the original v0.12 build and carried over to the v0.13 update. The machine worked flawlessly until the update, so I thought keeping this value would avoid ‘hardware’ issues, i.e. cabling and interference from other sources.

I had similar issues when upgrading from v0.12 to v0.13 on a very similar machine to yours (M8P, CB1 and EBB42) a few months ago. I was running CAN at 1 Mbps.

I ended up reimaging the CB1 following the Esoterical guide and everything has been running fine since then.

OK, useful to know, thanks. I have been accepting pretty much all updates through Mainsail, and “apt upgrade” reported nothing to update, so I assumed there was little benefit in a bare-metal rebuild.

Need a few more cold starts to convince myself all is well.

FWIW,
The MCU CAN bridge has a frame/message FIFO queue.
If the message can’t be sent to the CAN physical layer, the queue is blocked, and the CAN BRIDGE can’t receive the message either.

To make progress, there is a discarding state, where the bridge will drop messages until the CAN physical layer is no longer faulty.

This time can’t be increased, otherwise we will lose the connection to the CAN bridge MCU when there are CAN issues.
One can’t safely work around that or reorder packets (see Pull Request #7145, “usb_canbus: async local/hw message processing” by nefelim4ag, Klipper3d/klipper on GitHub), because we have bandwidth issues with GS_USB (we can transfer only one CAN frame or one echo frame per USB frame), and the CAN bridge can overload the USB connection with frames.
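The queue behaviour described above can be pictured with a toy model. This is only a sketch of head-of-line blocking in a generic in-order queue, assuming messages for the bridge MCU itself and messages for the CAN bus share one FIFO; the types and the drain function are invented for illustration and are not Klipper code.

```c
#include <stdbool.h>
#include <stddef.h>

enum dest { DEST_BRIDGE_MCU, DEST_CAN_BUS };

/* Hypothetical in-order queue of pending host messages. */
struct fifo {
    const enum dest *msgs;
    size_t count;
    size_t head;
};

/* Process messages strictly in order. A CAN-bound message at the head that
 * cannot be sent stops everything behind it unless discarding is enabled.
 * Returns the number of messages delivered to the bridge MCU itself. */
static size_t drain(struct fifo *q, bool can_bus_ok, bool discarding)
{
    size_t local = 0;
    while (q->head < q->count) {
        enum dest d = q->msgs[q->head];
        if (d == DEST_CAN_BUS && !can_bus_ok && !discarding)
            break;   /* head-of-line blocking: nothing behind it moves */
        if (d == DEST_BRIDGE_MCU)
            local++;
        q->head++;   /* sent, delivered locally, or dropped */
    }
    return local;
}
```

With one unsendable CAN frame queued ahead of two messages for the bridge MCU, the blocking case delivers nothing to the bridge, while the discarding case drops the CAN frame and delivers both. That is the trade-off: a longer timeout means a longer period in which the bridge MCU itself is unreachable.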

The real question is:
Why do you have 2 identical MCUs (STM32G0), and why does one start on time while the other, as you assume, does not?

Hope that explains something,
-Timofey

Thanks Timofey,

Appreciate your explanation.

On the question you pose about the MCUs: I get your point, but whatever the root cause, the exact same hardware ran flawlessly for ~20 months under v0.12 and broke as soon as v0.13 was applied.

I am certainly open to ideas on other ways to make a clean v0.13 build work. Patching the code on each update, or re-flashing the MCU every time I hit the power button, is not where I want to be. But the timer change is working in that I can now power on and run, albeit at risk, as you explain. And I have been printing all day.

Happy to collaborate on further debugging with anyone who is willing.