CAN toolboards falling off of bus after query_unassigned

Hi!

I’m investigating an issue where it seems that with many devices (in my setup 14pcs of the BTT EBB42 / STM32G0B1) on a single CAN bus, they seem to go silent after handling multiple query_unassigned messages.

I firstly created a ticket on Katapult github: toolboards falling off the can bus · Issue #84 · Arksine/katapult · GitHub, where investigation was done. But I understand that it is better to come to a solution here and if a change would be in order that it can then be backported to Katapult.

In short I see the following happening:

  • Send a query_unassigned message with for example scripts/canbus_test.py.
  • All 14 devices answer, I notice that the UUIDs in the responses are sent in consecutive order - as if arbitration happens over the UUID (data payload).
  • When I send more of these requests by running the canbus_test.py script, i see fewer and fewer toolboards answering. I notice that the ones with highest UUID seem to fall off the bus. This goes on until one or sometimes two or three (so the ones with lowest UUID) are left responsive.
  • The devices that don’t respond anymore are now also unreachable with for example klippy/console.py.
  • I had an st-link v2 debugger connected and saw that the toolboards did not crash, they just don’t communicate on CAN anymore.
  • After i reset (button press) the toolboards, they all start responding again.

I am thinking that the devices become unresponsive because they are transmitting with the same ID (0x3F1) at exactly the same time. After sending the ID, they start sending the rest of the fields in the CAN frame and only there the bits start to differ. The ones with recessive bit still bailout and retry later, but I think they increase an error counter resulting in later going silent. @Arksine pointed me to the BUS_OFF state (BO ) after which I found info that the device may go into that state when a transmit error counter (TEC) goes over 255 and that the TEC counter is incremented if it notices bit errors, where having discrepancy of what’s being sent vs what’s being monitored (after arbitration) is counted as a bit error.

If the above is correct (I wasn’t able to prove it yet by printing the TEC counter or BO field in the PSR register), then wouldn’t it be an idea to make the response more unique ? I was thinking to take for example the first 29 bits of the 64 bit UUID and put that in an extended ID ? Then in the data perhaps still keep the 0x20 (CANBUS_RESP_NEED_NODEID) followed by the remaining 35 bits of the UUID.

I understand that this would be a protocol change which is probably not preferred, but maybe a new “CANBUS_CMD_QUERY_UNASSIGNED_EXTENDED” (0x03) could be added to avoid backward incompatibility ? I think it’s nice to have the possibility to have many toolboards connected :slight_smile:

Looking forward to your thoughts on this.

Thanks,

Leon

Nice issue. What are you doing with so many EBBs?

1 Like

Thanks :slight_smile: I’m using them in a device where I need to control a lot of steppers with endstops, hardware pwm and fans. It’s not in a printer and with custom control software on the Linux side. I really like how Klipper abstracts a microcontroller in such a generic way and the toolboards are relatively inexpensive :slight_smile:

Hi again :slight_smile: I have made the following change that works for me:

Think I ran it like 30 times where the issue has gone away.

It does take the first 29 bits of the 48 bit uuid and puts it in an extended id. The rest I left intact so that handling the response data also doesn’t need much change, so there is some duplication in the frame. But since it doesn’t happen that often I don’t think it matters much ?

I copied the scripts/canbus_query.py into scripts/canbus_test.py - in here i removed the initial filter but lateron filter on is_extended.

Would something like this be suitable for a pr ? or would req/resp need more changes ? or change applied in more places ?

Thank you,

Leon

I was also thinking, maybe making some fasthash29() and use a 29bits uuid everywhere would be cleaner way but that would not be backwards compatible, breaking everyones printer config :grimacing:

29bit is still 1/8th of the whole ipv4 address space, small chance for collision :slight_smile:

Anyone interested in this ?

Thanks for providing the information and testing results. As you’ve indicated, the current query for unassigned devices is a bit of a “hack”. Excessive use of it could result in devices going into an error state. However, the intent of the command was to identify unassigned devices and then immediately assign ids to them. It was never intended to keep multiple devices in an unassigned state.

So, unless I’m missing something, I think the best solution would be to avoid issuing unnecessary query unassigned commands.

Cheers,
-Kevin

Hi Kevin, thanks for your reply. I wouldn’t want to call it a hack, I really appreciate and like what you created :slight_smile: I agree that the issue is really an edge case that most people will never encounter, but I do think it’s a bug nonetheless and fixing it will certainly prevent someone pulling their hair our at some point in time :wink:

In my use-case I have custom software on Linux side that controls multiple toolboards, it performs the query-unassigned request every two seconds to see who’s new on the bus, automatically flashes to a version, etc. This resulted for me to have many occurrences where the toolboards would stop their CAN interface due to errors.

Using the extended id really resolves the issue, but I do think it’s not as clean as it could be. Would you not like a proper fix for this ?

If you want I’d be willing to spend the time to add the query_unassigned_extended where-ever it’s used, while keeping the current one in Klipper for backwards compatibility, create PR for Klipper and create PR for Katapult to have the same change. Then perhaps later-on the current method can be deprecated.

Or maybe there is a cleaner way ?

Thanks again,

Leon

The CMD_QUERY_UNASSIGNED message is intended to facilitate initial system building. It was not intended to be used as a general “ping” message. That is, it was intended to be used manually by users during a one-time process to obtain each chip’s canbus id.

I suppose it is possible to convert it to support general “ping” requests, but I suspect that will add a bunch of complexity and code churn to the existing code. (In particular, the mcus and host don’t currently support any “id masks” for filtering ranges of message ids.)

FWIW, it might be simpler to build an explicit “ping” message system, or implement a “ping” system using the existing klipper protocol after issuing a SET_KLIPPER_NODEID. That is, if you obtain the full list of all possible canbus ids then you can send explicit messages to each of those canbus ids to detect their presence instead of sending a general query.

Cheers,
-Kevin

I understand - in regular printer use after initial fetching of the id, it’s put in a static config and the query unassigned is then not necessary anymore.

It is possible to filter on just the fact whether a message has extended id or not, which I’m doing in the pr I created. Then having a message with that bit set actually means it’s a query-unassigned-response. Even though it works, I understand that that is not a nice way of doing it (because it prevents any future use of extended ids for other purposes?)

Only pinging nodes that are already known does not resolve finding new devices on the bus.

Thinking what would be an elegant way, I also don’t want to be intrusive especially now that I realize it’s really an edge case that is clearly not a problem for normal use.

I could keep the fork on my side, but it would be a pity for such a small change.

thanks again,

Leon

I do read now that on the host it is possible to set a filter on both can_id and can_mask:

Would that not be an option ?

For example that the response type (like what is used now for CANBUS_ID_ADMIN_RESP 0x3f1) is part of the extended id, for example the last 12 bits are used for this message type identifier, while the first remaining 17 bits are used to make the extended_id unique which could then be constructed of part of the serial hash uuid. Then the can_mask could be 0xfff ?