SerialQueue: Limit message payload on retry/errors

Currently, Klipper encodes multiple commands in message blocks.
Each message block can be up to 64 bytes,
of which 59 are usable: 59 = (MESSAGE_MAX - MESSAGE_HEADER_SIZE - MESSAGE_TRAILER_SIZE).
In the limit, that translates to 59 noop commands :smiley:
Or, more probably, 10-20 normal ones.

In the case of a CAN network, that message will be split into up to 8 CAN frames, because each CAN frame can only carry 8 data bytes.
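To make the sizes above concrete, here is a small Python sketch of the arithmetic. The 2/3 split between header and trailer is an assumption (the post only implies 5 framing bytes in total), and `max_commands` and `can_frames` are hypothetical helper names, not Klipper functions.

```python
import math

MESSAGE_MAX = 64
MESSAGE_HEADER_SIZE = 2    # assumed split of the 5 framing bytes
MESSAGE_TRAILER_SIZE = 3   # assumed split of the 5 framing bytes
CAN_DATA_BYTES = 8         # data bytes per classic CAN frame

# Usable bytes per message block
MESSAGE_PAYLOAD_MAX = MESSAGE_MAX - MESSAGE_HEADER_SIZE - MESSAGE_TRAILER_SIZE


def max_commands(command_size):
    """How many commands of a given encoded size fit in one block."""
    return MESSAGE_PAYLOAD_MAX // command_size


def can_frames(block_size):
    """How many CAN frames are needed to carry a block of this size."""
    return math.ceil(block_size / CAN_DATA_BYTES)


print(MESSAGE_PAYLOAD_MAX)      # 59
print(max_commands(1))          # 59 one-byte noop commands
print(can_frames(MESSAGE_MAX))  # 8 frames for a full block
```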

On top of that, there is a sort of congestion control, which enforces waits between data sends, to allow the network and MCU to process the data.

Several message blocks can be in flight at once, up to MAX_PENDING_BLOCKS (12), which acts as a sort of TCP send window.

Transmission errors can happen for different reasons. But as long as the line is working, some data should pass end-to-end.
I assume that a single frame has a higher chance of passing end-to-end than a multi-frame message.

So, the basic idea is to provide a sort of feedback loop from the retransmit code to the send code.
(This is also a sort of congestion control).
If the line is “bad”, it should be possible to reduce the size of the following messages.
It is not possible to limit the actual size in bytes, because some commands would not fit.
But it is possible to reduce the number of commands per message block.
In the end, that could come down to 1 command per CAN frame.

Which sounds suboptimal, but it should allow the machine to progress farther, instead of retrying indefinitely and dying.
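A minimal Python sketch of the feedback idea, not the actual patch (the real code lives in the C serialqueue). The class `CommandLimiter`, the function `build_block`, and the halve-on-retransmit / add-one-on-ack policy are all assumptions made for illustration; the post only says that the number of commands per block should go down when the line is bad.

```python
from collections import deque

MESSAGE_PAYLOAD_MAX = 59  # usable bytes per message block, from the post


class CommandLimiter:
    """Hypothetical state shared by the retransmit path and the send path."""

    def __init__(self, ceiling=MESSAGE_PAYLOAD_MAX, floor=1):
        self.ceiling = ceiling
        self.floor = floor
        self.limit = ceiling  # current max commands per block

    def on_retransmit(self):
        # Assumed policy: halve the limit on every retransmit, never below 1
        self.limit = max(self.floor, self.limit // 2)

    def on_ack(self):
        # Assumed policy: recover slowly, one command per acknowledged block
        self.limit = min(self.ceiling, self.limit + 1)


def build_block(pending, limiter):
    """Move encoded commands from the front of `pending` into one block.

    Stops at the byte limit or at the limiter's command count, whichever
    comes first. Assumes no single command exceeds MESSAGE_PAYLOAD_MAX.
    """
    block, used = [], 0
    while pending and len(block) < limiter.limit:
        if used + len(pending[0]) > MESSAGE_PAYLOAD_MAX:
            break
        cmd = pending.popleft()
        block.append(cmd)
        used += len(cmd)
    return block


limiter = CommandLimiter()
pending = deque(bytes(7) for _ in range(20))  # twenty 7-byte dummy commands
print(len(build_block(pending, limiter)))     # 8: limited by the 59 bytes

for _ in range(6):                            # a run of retransmits
    limiter.on_retransmit()
print(len(build_block(pending, limiter)))     # 1: limited by the command count
```

With this policy a healthy line behaves exactly as before, since the byte limit is reached long before the command limit.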

This is a PoC Patch V2

The main concern that I can see, and agree with, is that this is not a fix for communication issues. It will mask them, even more than retries already do.
Maybe fail early, fail fast, fail hard is still better here.

Also, there will be high transmission overhead, so if the communication line is bad enough, there will be a failure anyway.

Any thoughts?

Thanks.

What is a “bad” line?

Is it an intermittent connection? Is it crosstalk between the High/Low lines or with other signals in the printer? In these cases, is it due to the CAN cable/toolhead being at a specific location in the printer, so that once the toolhead moves, the problem goes away and normal transmission can resume? This is where I can see the value in this PR.

Otherwise, my opinion is to fail early and hard.

I’m not sure that I have the right definition there.
I mean a communication link with some issue that causes the retransmission count to constantly increase.
We don’t know why. EMI noise, grounding issues, cross-talk, etc.
I think it will always be machine-specific.

I have no way to know why someone has experienced them, I just see them often.
Even on integrated boards like the MKS Skipr, where the MCU and SBC are on the same physical board and there is a direct hardware UART link between them.

Hope that makes it clear.

(I do not mean that the existence of retransmissions is bad and always leads to a failure. I just mean they exist, sometimes more often than I would expect.)

Could this pose the risk of reducing the effective bandwidth by a factor of 8 in the worst case, i.e., turning 500k into 62.5k? Consequently, could it potentially lead to bandwidth issues that are hard to diagnose?

You know my opinion on band-aid solutions :wink:

It is a good question.
We can use only a bit more than half of the CAN bandwidth, because a frame is about 14 bytes on the wire, and only 8 of those bytes carry data.
So, effectively, there is ~285k usable bandwidth at 500k speed (500 * 8 / 14).

With the current format, the message framing itself consumes 5 bytes.
So, at a 64-byte limit, there would be … 285 * 59 / 64 ≈ 262.7k.
With the proposed per-message command limit it is trickier: commands have different lengths, because their arguments are encoded with a variable number of bytes, and I would expect most of them not to fit in the 3 bytes left in a single frame.

But I would expect them all to roughly fit in 2 frames.
So, 285 * (16-5) / 16 ≈ 195.9
Or 285 * (8-5) / 8 ≈ 106.9 for short commands like get_clock

Not so dramatic.
Also, it should only happen at a high retransmission rate.
So, all of that will still be visible on the bandwidth graph.


Also note that retransmissions consume usable bandwidth too. So, technically, if there are fewer of them, there is more usable bandwidth. So, maybe in the end, from a bandwidth perspective, it will be roughly the same.
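The estimates in this post can be reproduced with a few lines of Python. This is only a sketch of the same back-of-the-envelope arithmetic: the 14-byte frame size and the rounded 285k figure are taken from the post as-is, and the function names are hypothetical.

```python
CAN_FRAME_BYTES = 14   # approximate on-wire frame size, as assumed in the post
CAN_DATA_BYTES = 8     # data bytes per frame
MESSAGE_OVERHEAD = 5   # framing bytes per message block


def can_usable_k(bitrate_k):
    """Share of the raw bitrate that carries CAN data bytes."""
    return bitrate_k * CAN_DATA_BYTES / CAN_FRAME_BYTES


def payload_k(block_bytes, usable_k=285.0):
    """Bandwidth left for commands when every block is `block_bytes` long."""
    return usable_k * (block_bytes - MESSAGE_OVERHEAD) / block_bytes


print(int(can_usable_k(500)))  # 285
print(payload_k(64))           # 262.734375 (full 64-byte blocks)
print(payload_k(16))           # 195.9375   (one command in 2 frames)
print(payload_k(8))            # 106.875    (one short command in 1 frame)
```

So the worst case in this model is roughly a 2.5x reduction against full blocks, not 8x, because the 5 framing bytes are the only extra cost per message.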

Do you have an example system that you can test this on? Unless you have some hard examples that you can demonstrate your fix on and how it performs both in systems with demonstrated problems and ones without, you won’t be able to show its value.

I’ve seen a lot of them as well. The problems manifest in a variety of ways, and it takes time to drill down with the person having the problem to properly understand their system. I have seen communication errors that are a result of:

  • Power supply issues
  • Various host issues (overloaded host/apps that preempt Klipper/marginal hosts, e.g. old laptops running Mint)
  • Poorly flashed MCUs (both main controller board and toolhead) where the flash image doesn’t execute correctly
  • Dirty Klipper installations, either corrupted code on the host or user modified

I think I get what your motivation is here; you see cases where Klipper should be able to power through with a few communication retries.

I’m just not sure that without strong evidence and use/error cases that it’s worth working on.

Unfortunately, no, I can only play in the “lab” where I introduce errors myself. For example, a Linux MCU patched to occasionally corrupt some bytes.

There are no issues on my side, maybe it is time to build a CAN setup :smiley:

:plus:

I’m still not sure myself if it would be helpful. Too many variables.
I can only argue that we mimic TCP, and it is another part of mimicry :smiley:


I think it would probably make more sense to also decrease the send window size (768 bytes right now) on error. That would reduce the amount of data that has to be retransmitted to something more digestible. So, the patch is updated.

Indeed there is a 64 byte max and a 5 byte header. However, a debug_nop command is 1 byte and thus one can reliably fit 59 of these commands in a message block (this is the basis of the command dispatch benchmark on the Benchmarks page of the Klipper documentation).

When I last checked, a queue_step command generally took between 7-8 bytes (1 byte for queue_step id, 1 byte for oid, 2-3 bytes for interval, 1 byte for count, 1 byte for add, 1 byte for amortized header cost). It’s generally expected that the bulk of the traffic will be in queue_step commands, so I’m generally expecting about 8 or so commands per message block at peak times.
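The byte budget above can be tallied in Python as a rough check. The field sizes are the ones listed in this post; dividing the full 64-byte block by a per-command size that already includes the amortized header byte is a simplifying assumption.

```python
MESSAGE_MAX = 64

# (min, max) encoded size in bytes for each part of a queue_step command,
# as listed in the post
QUEUE_STEP_FIELDS = {
    "queue_step id": (1, 1),
    "oid": (1, 1),
    "interval": (2, 3),
    "count": (1, 1),
    "add": (1, 1),
    "amortized header cost": (1, 1),
}

smallest = sum(lo for lo, hi in QUEUE_STEP_FIELDS.values())
largest = sum(hi for lo, hi in QUEUE_STEP_FIELDS.values())

print(smallest, largest)        # 7 8
print(MESSAGE_MAX // largest)   # 8 commands per block
print(MESSAGE_MAX // smallest)  # 9 commands per block
```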

The amount of data that can be outstanding is primarily limited by the RECEIVE_WINDOW constant defined by the mcu - which is 192 bytes for both uart and canbus setups. Only USB (and “linux mcu”) would be able to utilize the full 12 blocks. So, in practice, the host should not send more than 192 bytes to a uart/canbus mcu without receiving an acknowledgement. (On canbus this is 192 bytes per MCU, so if one has several MCUs on the same canbus then the total amount of outstanding data on the bus as a whole may be larger.)
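As an illustration of how the two limits interact, here is a small Python sketch. The constants are the ones quoted in this thread; the helper names are hypothetical, and treating "no receive window given" as "the 12-block host limit is the binding one" is an assumption made for the USB / linux mcu case.

```python
MESSAGE_MAX = 64
MAX_PENDING_BLOCKS = 12
UART_CANBUS_RECEIVE_WINDOW = 192  # bytes, per MCU


def max_outstanding_bytes(receive_window=None):
    """Unacknowledged bytes the host may have in flight to one MCU."""
    host_limit = MAX_PENDING_BLOCKS * MESSAGE_MAX
    if receive_window is None:
        return host_limit
    return min(host_limit, receive_window)


def bus_outstanding_bytes(n_mcus, receive_window=UART_CANBUS_RECEIVE_WINDOW):
    """Upper bound for a shared bus: each MCU has its own window."""
    return n_mcus * max_outstanding_bytes(receive_window)


print(max_outstanding_bytes())                       # 768 (12 full blocks)
print(max_outstanding_bytes(192))                    # 192
print(max_outstanding_bytes(192) // MESSAGE_MAX)     # 3 full blocks
print(bus_outstanding_bytes(3))                      # 576 with three MCUs
```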

For what it is worth, 59 byte blocks and 192 bytes pending already seems “ridiculously small” to me. If one can’t reliably transfer that amount of data, I don’t see how anything could work. YMMV.

Cheers,
-Kevin
