Klipper: communication bus tests

There are many discussions around Klipper, serial/CAN/USB, and bandwidth.
My 2 cents: real bandwidth usage with a heavily loaded MCU will be around 50 Kbps.

I wrote a benchmark module: GitHub - nefelim4ag/klipper at bus-benchmark
The module is Python-only and does not require you to update your MCU firmware;
it just reuses existing MCU functions.
So, here are the results with my thoughts.

# printer.cfg
# Linux MCU (host) working over a pseudo terminal
[bus_benchmark host]
mcu: host

# stm32h723, USB-to-CAN bridge (1_000_000 bit/s)
[bus_benchmark mcu]
mcu: mcu

# rp2040 USB Serial Emulation.
[bus_benchmark rp2040]
mcu: rp2040
$ lsusb -t
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/2p, 480M
    |__ Port 1: Dev 12, If 0, Class=Communications, Driver=cdc_acm, 12M
    |__ Port 1: Dev 12, If 1, Class=CDC Data, Driver=cdc_acm, 12M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/2p, 480M
    |__ Port 1: Dev 29, If 0, Class=Vendor Specific Class, Driver=gs_usb, 12M

Linux reports both the STM32 and the RP2040 as 12 Mbps (USB full-speed) devices.

$ BUS_LATENCY MCU=host COUNT=200
// Latency: mean=0.000121020s stddev=0.000061630s

$ BUS_LATENCY MCU=mcu COUNT=200
// Latency: mean=0.000249133s stddev=0.000028107s

$ BUS_LATENCY MCU=rp2040 COUNT=200
// Latency: mean=0.000996884s stddev=0.000029622s

So, USB serial is the slowest. The latencies are roughly 120 µs (host), 250 µs (USB-to-CAN bridge), and 996 µs (USB serial).

Bandwidth

$ BUS_BANDWIDTH MCU=host COUNT=1000
// Latency: mean=0.000073312s stddev=0.000051271s
// Approximate bandwidth: 10475829.50 bps
$ BUS_BANDWIDTH MCU=mcu COUNT=1000
// Latency: mean=0.000730682s stddev=0.000053912s
// Approximate bandwidth: 1051072.65 bps
$ BUS_BANDWIDTH MCU=rp2040 COUNT=500
// Latency: mean=0.000998789s stddev=0.000063828s
// Approximate bandwidth: 768931.21 bps

nload can0 screenshot.

Linux (the host MCU) is the fastest, as expected, at around 10 Mbps.
The MCU in USB-to-CAN bridge mode got slightly higher results than expected (there is CRC overhead for the Klipper serial protocol plus the CAN CRC).
However, nload clearly shows that the USB-to-CAN bridge works faster than the nominal 1 Mbps.
USB serial saturates at around 760 Kbps.
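
For reference, the numbers above are consistent with a simple round-trip model: 48 bytes of payload travel each way per request, so bandwidth is roughly payload * 2 * 8 / mean latency. A rough sketch (plain Python; the 48-byte payload size is taken from the Transferred lines further down, and the small mismatches in the last digits come from rounding of the displayed mean):

# Sanity check: reported bandwidth ~= payload * 2 (both directions) * 8 bits
# divided by the mean round-trip latency.
PAYLOAD_BYTES = 48  # per direction, per request

def approx_bandwidth(mean_latency_s, payload_bytes=PAYLOAD_BYTES):
    return payload_bytes * 2 * 8 / mean_latency_s

for name, mean_latency in (("host", 0.000073312),
                           ("mcu", 0.000730682),
                           ("rp2040", 0.000998789)):
    # prints roughly 10.5 Mbps, 1.05 Mbps and 0.77 Mbps respectively
    print(name, round(approx_bandwidth(mean_latency), 2), "bps")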


I did not hit the CPU limit with this test, simply because of how it works.
The CPU runs with the "performance" governor at 2.9 GHz during this test.

Addressing the concerns below about sample count and precision:

BUS_BANDWIDTH MCU=mcu COUNT=8
Latency: mean=0.000676792s stddev=0.000038650s
Approximate bandwidth: 1134765.61 bps
Transferred: 48 * 8 * 2 = 768.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=16
Latency: mean=0.000668934s stddev=0.000027414s
Approximate bandwidth: 1148095.65 bps
Transferred: 48 * 16 * 2 = 1536.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=32
Latency: mean=0.000669885s stddev=0.000087903s
Approximate bandwidth: 1146464.74 bps
Transferred: 48 * 32 * 2 = 3072.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=64
Latency: mean=0.000666482s stddev=0.000034197s
Approximate bandwidth: 1152319.49 bps
Transferred: 48 * 64 * 2 = 6144.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=128
Latency: mean=0.000666031s stddev=0.000030229s
Approximate bandwidth: 1153100.32 bps
Transferred: 48 * 128 * 2 = 12288.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=256
Latency: mean=0.000673456s stddev=0.000080476s
Approximate bandwidth: 1140386.33 bps
Transferred: 48 * 256 * 2 = 24576.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=512
Latency: mean=0.000675128s stddev=0.000075948s
Approximate bandwidth: 1137562.27 bps
Transferred: 48 * 512 * 2 = 49152.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=1024
Latency: mean=0.000666847s stddev=0.000066104s
Approximate bandwidth: 1151688.84 bps
Transferred: 48 * 1024 * 2 = 98304.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=2048
Latency: mean=0.000650181s stddev=0.000042222s
Approximate bandwidth: 1181208.94 bps
Transferred: 48 * 2048 * 2 = 196608.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=4096
Latency: mean=0.000658150s stddev=0.000069231s
Approximate bandwidth: 1166906.72 bps
Transferred: 48 * 4096 * 2 = 393216.00 bytes

BUS_BANDWIDTH MCU=mcu COUNT=8192
Latency: mean=0.000655775s stddev=0.000048133s
Approximate bandwidth: 1171133.33 bps
Transferred: 48 * 8192 * 2 = 786432.00 bytes

On my setup, with an Octopus connected over USB and running in USB-to-CAN bridge mode, there is no significant dependency between the sample count and the result.


I think anyone concerned about available bandwidth can now test it themselves.
Hope someone finds it useful.

Thanks.


You should run a lot more samples and a lot more data if you want any sort of useful numbers. 10000 for the latency test seems to yield consistent results; with less than that the run-to-run variation is huge.

(screenshot: BUS_LATENCY results with 10000 samples)

Those are both running USB (STM32G0B1 and STM32H743).

Similarly, here's using 10k for data (notice how you used 500 for the rp2040 and 1000 for the others, which gives further misleading results):

(screenshot: BUS_BANDWIDTH results with 10k samples)

That's still way slower than it should be, however, as native USB isn't bandwidth limited (the baud rate is ignored). I'm guessing it's severely bottlenecked by host-side code (the Klipper processes are using 100% CPU on my Pi4 test machine while running these tests), at least when compared to the results posted here: Benchmarks - Klipper documentation (klipper3d.org).

I tried running that, but the test commands didn't seem to do anything. Suppressing the analog_in_state reports worked fine though, so I'm not sure what I'm doing wrong.

You should …

I should give you a tool, and that's what I've done :smiley:
Everything else is out of my control, actually.

any sort of useful numbers.

The purpose of benchmarking is to highlight differences, so I hope the results are useful to you or anyone else interested in comparing USB/CAN/Serial performance.

lot more samples and a lot more data

In this test, we're measuring Klipper data payload bandwidth with all the protocol/encoding/decoding overhead (Klipper serial, CAN, USB). While the Klipper protocol does impose some limitations, the data payload size represents a typical batch from sensors like ADXL.

In my setup, I can't increase the sample count because it simply doesn't work. But comparing mean latency, I see that your USB setup is about twice as fast as mine over USB. For bandwidth, increasing the sample count improves precision, but it doesn't change the actual bandwidth. My results don't vary much as the count changes; the mean remains relatively stable, aside from small changes in stddev.

As for bandwidth, it's calculated by dividing the total amount of data by the sum of the individual transfer times. So increasing the number of samples improves the precision of the result but doesn't change the actual bandwidth measurement. In my tests, increasing the sample count doesn't alter the average much, only the standard deviation.
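
To illustrate why the count mostly affects precision: with a fixed per-transfer bandwidth, more samples shrink the uncertainty of the mean (roughly stddev / sqrt(N)) but do not move the mean itself. A toy sketch with simulated latencies (not real measurements; the 0.00067 s mean, the jitter, and the 48-byte payload are only placeholders):

import random, statistics

random.seed(1)

def simulated_run(count, payload_bytes=48, true_mean=0.00067, jitter=0.00005):
    # stand-in for BUS_BANDWIDTH: 'count' round trips with noisy latencies
    samples = [max(1e-6, random.gauss(true_mean, jitter)) for _ in range(count)]
    total_bits = payload_bytes * count * 2 * 8
    return statistics.mean(samples), statistics.stdev(samples), total_bits / sum(samples)

for count in (8, 64, 512, 4096):
    mean, stddev, bw = simulated_run(count)
    print(f"count={count:4d} mean={mean:.6f}s stddev={stddev:.6f}s bw={bw:.0f} bps")

# The bandwidth estimate hovers around the same value for every count;
# only its run-to-run spread narrows as the count grows.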

which gives further misleading results

It's great that you're able to get higher numbers. My results may seem conservative, but they should be stable for others and avoid errors like 'Timer too close'.
I'm a little curious about your runs at lower sample counts and how they differ. Sharing those details would clarify the 'misleading' aspect.

If you check Klipper's benchmarks closely, you'll see they use nop (code), which means "do nothing." As explained in the documentation, the benchmark measures "dummy" command processing by the microcontroller. So it's not directly comparable to actual data payload tests.

I'm not sure that Klipper's console.py works over CAN.


P.S. Just a thought: are you running RatOS on a Raspberry Pi with a 32-bit image? I noticed 32-bit images on GitHub. I'm using the latest official 64-bit Raspbian with Python 3.12, which might explain some differences in our test results.
(This post on CAN communication timeouts between 32-bit and 64-bit Pis might be relevant: https://klipper.discourse.group/t/psa-canbus-on-32-bit-vs-64-bit-pis-communication-timeout-errors)


I will try to make some graphs and add them to the post, to clarify the relation between the output and the COUNT parameter.


Sorry, I thought you posted this to compare data between the protocols? The lead-in to your post mentions debates around the different communication protocols, and for that the sample size used is just too small, at least when run on the machines I have here - it wildly depends on host-side load. It appears we're measuring Python and host code performance more than anything.

Klipper's message protocol is limited to 64 bytes, and it's boxed in by a minimum of 5 bytes of headers etc., so 48 bytes is probably the max we'll get. Still, a higher sample count significantly reduces run variance; the test should run long enough to include all the random process peaks on the system. This is evident from your own numbers: obviously it makes no sense that USB → CAN is faster and has lower latency than plain USB, so either the collection and sample count need adjustment or there's something wrong somewhere. What's your take on that?

If you check Klipper's benchmarks closely, you'll see they use nop (code), which means "do nothing." As explained in the documentation, the benchmark measures "dummy" command processing by the microcontroller. So it's not directly comparable to actual data payload tests.

Regardless, 870k messages of a minimum size of 5 bytes is ~4 MB/s. All the ping command does is parse the message data and send it back.

The purpose of benchmarking is to highlight differences

Exactly. It takes more samples / a longer run time to do that in an end-to-end test like this, as it has a ton of unknowns and seems to be CPU-bound on the host - as is clear from the variance between consecutive runs with a low sample size.

Itā€™s great that youā€™re able to get higher numbers.

That's not really the point.

My results may seem conservative, but they should be stable for others and avoid errors like 'Timer too close'.

It's not about the scale of the numbers, it's about consistency. AFAIK you'd need another process to bog down the system in order to provoke a "timer too close" in this scenario. However, that could be a fun experiment!

P.S. Just a thought: are you running RatOS on a Raspberry Pi with a 32-bit image?

Yes. 64-bit is still full of 32-bit packages, random odd quirks, etc. I'm also running Bullseye, so there are definitely some discrepancies there as well.

I will try to make some graphs and add them to the post, to clarify the relation between the output and the COUNT parameter.

Awesome!

I don't think that is accurate. In practice, the limit on commands that can be issued is dominated by the host CPU performance (and how fast it can run Python programs). The benchmarks indicate that most commonly available MCUs can fully saturate the available communication interfaces (and commonly available communication interfaces are dramatically faster than 50,000 bits per second).

If you can provide the output of console.py, I'll take a look.

The console.py tool does work with canbus (console.py -c can0 <uuid> or run console.py --help for the list of options).

The reports in the table are the number of commands sent per second. Each debug_nop command is one byte long. Commands are packed into 64-byte message blocks that contain a 5-byte header. So, in practice, the total number of bytes sent is command_rate / 59 * 64. One can see this with the "serial" tests: 250000 (baud) / 10 (bits per byte with 8N1 encoding) / 64 (Klipper block size) * 59 (commands per block) == 23046.875, which is exactly the available commands per second reported by the serial tests.

A commands-per-second report of 870K is thus ~943728 bytes per second of application data (870000 / 59 * 64). Add in USB overhead and the value is effectively the 12 Mbit per second that USB Full Speed runs at.
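
That conversion is easy to keep around as a small helper (a sketch that just restates the arithmetic above: 64-byte blocks, a 5-byte header, 59 one-byte commands per block, 10 bits per byte on an 8N1 serial link):

BLOCK_SIZE = 64                                  # Klipper message block size
HEADER_BYTES = 5
COMMANDS_PER_BLOCK = BLOCK_SIZE - HEADER_BYTES   # 59 one-byte debug_nop commands per block

def commands_to_app_bytes(command_rate):
    # application bytes per second for a given commands-per-second rate
    return command_rate / COMMANDS_PER_BLOCK * BLOCK_SIZE

def serial_command_rate(baud, bits_per_byte=10):  # 8N1 framing: 10 bits per byte
    return baud / bits_per_byte / BLOCK_SIZE * COMMANDS_PER_BLOCK

print(serial_command_rate(250000))     # 23046.875 commands per second
print(commands_to_app_bytes(870000))   # ~943728 bytes per second of application data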

Hope that helps,
-Kevin


will be around 50 Kbps.
I don't mean a limit of the MCU; I mean real, normal bus usage.

I get ~50 kbit/s of usage with normal "fast" printing.
Here are actual tests.

I can't test with 256 microstepping, because I get the "timer too close" error, but I can do tests with 128 microstepping.

With nload on can0 I see numbers like this:
(nload screenshot)
The step rate should be (1000 / 40) * 200 * 128 = 640000 steps/s,
x2 because there are 2 motors.
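
Spelling out that estimate (a sketch; I am reading the 1000 as the velocity in mm/s, 40 as the rotation distance in mm, 200 as full steps per rotation and 128 as the microstepping, so treat the parameter names as assumptions):

def step_rate(velocity_mm_s, rotation_distance_mm, full_steps, microsteps):
    # steps per second for a single motor moving at constant velocity
    return velocity_mm_s / rotation_distance_mm * full_steps * microsteps

per_motor = step_rate(1000, 40, 200, 128)   # 640000 steps/s
print(per_motor, per_motor * 2)             # x2 for two motors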

This is the macro:

[gcode_macro LOOP_TEST]
gcode:
    G28
    {% set vel = (1000, 1000) %} # set velocity values to be tested (mm/s)
    {% set accel = range(15000,25000) %} # set accel values to be tested (mm/s^2)
    {% set min_x = printer.toolhead.axis_minimum.x|float +20 %}
    {% set max_x = printer.toolhead.axis_maximum.x|float -20 %}
    {% set min_y = printer.toolhead.axis_minimum.y|float +20 %}
    {% set max_y = printer.toolhead.axis_maximum.y|float -20 %}
    G1 X{min_x} Y{min_y} F15000

     {% for V in vel %}
        {% set machineVel = V*60 %}
        {% for A in accel %}
            RESPOND MSG="velocity:{V}mm/s, accel:{A}mm/s^2"
            SET_VELOCITY_LIMIT VELOCITY={V}
            SET_VELOCITY_LIMIT ACCEL={A}
            G1 X{max_x} Y{min_y} F{machineVel}
            G1 X{max_x} Y{max_y} F{machineVel}
            G1 X{min_x} Y{min_y} F{machineVel}
            G1 X{min_x} Y{max_y} F{machineVel}
            G1 X{max_x} Y{min_y} F{machineVel}
            G1 X{max_x} Y{min_y} F{machineVel}
            G1 X{max_x} Y{max_y} F{machineVel}
            G1 X{min_x} Y{max_y} F{machineVel}
            G1 X{min_x} Y{min_y} F{machineVel}
            G4 P200
        {% endfor %}
    {% endfor %}

    # restore original limits
    SET_VELOCITY_LIMIT VELOCITY={printer.configfile.settings.printer.max_velocity}
    SET_VELOCITY_LIMIT ACCEL={printer.configfile.settings.printer.max_accel}

With different tests, with arcs and 0.1 segments, there are a "lot" of small moves, and I got:
(nload screenshot)
This is 3 times higher than the peak value above,
but still within the available bus bandwidth limits, even at 250k.

G28 X Y
G1 F24000
# I just pasted it several times
G1 X238.153 Y141.283
G3 X230.916 Y144.084 I-0.294 J9.992 F48000
G1 X238.153 Y141.283
G3 X230.916 Y144.084 I-0.294 J40.992 F48000
G1 X225.153 Y141.283

This may be a wrong assumption, but with sane usage, on a normal printer, with normal microstepping, a user will not hit any bus limits, because under normal circumstances the step rate and other traffic will not produce such a load.

BTW, I also use USB everywhere: the USB-to-CAN bridge for the Octopus, and USB serial for the RP2040 dev board above.

Thank you for your help with console.py and for the clarification of the calculations.

DELAY {clock + 2*freq} get_uptime
FLOOD 200000 0.0 debug_nop
get_uptime 

I reproduced the benchmark from the documentation, and with your help I got:

>>> 200_000 * 400_000_000 / (3125864828 - 2484949551) / 59 * 64
135399.58264553765

So, only 135399.58 * 8 = 1083196.64 bits/s (~1.08 Mbit/s),
which matches nload:
(nload screenshot)

FLOOD looks CPU-limited in my case.
(And these values are comparable to my benchmark results above.)
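
The same arithmetic as a small helper, combining the FLOOD timing with the 59/64 packing factor explained above (a sketch; the constants are the ones from this run):

def flood_app_bandwidth(count, clock_start, clock_end, mcu_freq):
    # approximate application-data bandwidth (bits/s) from a FLOOD run
    elapsed_s = (clock_end - clock_start) / mcu_freq
    commands_per_sec = count / elapsed_s
    bytes_per_sec = commands_per_sec / 59 * 64   # 59 command bytes per 64-byte block
    return bytes_per_sec * 8

print(flood_app_bandwidth(200_000, 2484949551, 3125864828, 400_000_000))
# ~1.08e6 bits/s, matching the value computed above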


Awesome!

@miklschmidt - I added a graph.
I'm just pointing out that "misleading" was a bit of an early accusation.

Ran it again; looks like it works, I just got confused by the garbled input/output mixed with the spam from analog_in_state:

pi@RatOS-pro:~/klipper/klippy/extras $ ~/klippy-env/bin/python ../console.py /dev/RatOS/btt-skr-3-ez

  This is a debugging console for the Klipper micro-controller.
  In addition to mcu commands, the following artificial commands are
  available:
    DELAY : Send a command at a clock time (eg, "DELAY 9999 get_uptime")
    FLOOD : Send a command many times (eg, "FLOOD 22 .01 get_uptime")
    SUPPRESS : Suppress a response message (eg, "SUPPRESS analog_in_state 4")
    SET   : Create a local variable (eg, "SET myvar 123.4")
    DUMP  : Dump memory (eg, "DUMP 0x12345678 100 32")
    FILEDUMP : Dump to file (eg, "FILEDUMP data.bin 0x12345678 100 32")
    STATS : Report serial statistics
    LIST  : List available mcu commands, local commands, and local variables
    HELP  : Show this text
  All commands also support evaluation by enclosing an expression in { }.
  For example, "reset_step_clock oid=4 clock={clock + freq}".  In addition
  to user defined variables (via the SET command) the following builtin
  variables may be used in expressions:
    clock : The current mcu clock time (as estimated by the host)
    freq  : The mcu clock frequency

==================== attempting to connect ====================
INFO:root:Starting serial connect
Loaded 113 commands (v0.12.0-208-g49c0ad636 / gcc: (15:8-2019-q3-1+b1) 8.3.1 20190703 (release) [gcc-8-branch revision 273027] binutils: (2.34-4+rpi1+14) 2.34)
MCU config: ADC_MAX=4095 BUS_PINS_i2c1_PB6_PB7=PB6,PB7 BUS_PINS_i2c1_PB8_PB9=PB8,PB9 BUS_PINS_i2c2_PB10_PB11=PB10,PB11 BUS_PINS_i2c3_PA8_PC9=PA8,PC9 BUS_PINS_spi1=PA6,PA7,PA5 BUS_PINS_spi1a=PB4,PB5,PB3 BUS_PINS_spi2=PB14,PB15,PB13 BUS_PINS_spi2a=PC2,PC3,PB10 BUS_PINS_spi2b=PI2,PI3,PI1 BUS_PINS_spi3a=PC11,PC12,PC10 BUS_PINS_spi4=PE13,PE14,PE12 BUS_PINS_spi5=PF8,PF9,PF7 BUS_PINS_spi5a=PH7,PF11,PH6 BUS_PINS_spi6=PG12,PG14,PG13 CLOCK_FREQ=400000000 MCU=stm32h743xx PWM_MAX=255 RESERVE_PINS_USB=PA11,PA12 RESERVE_PINS_crystal=PH0,PH1 STATS_SUMSQ_BASE=256 STEPPER_BOTH_EDGE=1
WARNING:root:got {'oid': 19, 'next_clock': 511908096, 'value': 32760, '#name': 'analog_in_state', '#sent_time': 2422932.987732393, '#receive_time': 2422932.988790615}
WARNING:root:got {'oid': 18, 'next_clock': 627908096, 'value': 6456, '#name': 'analog_in_state', '#sent_time': 2422933.241154634, '#receive_time': 2422933.278811208}
WARNING:root:got {'oid': 19, 'next_clock': 631908096, 'value': 32760, '#name': 'analog_in_state', '#sent_time': 2422933.241154634, '#receive_time': 2422933.288781541}
====================       connected       ====================
000.827: analog_in_state oid=18 next_clock=747908096 value=6452
000.837: analog_in_state oid=19 next_clock=751908096 value=32760
001.127: analog_in_state oid=18 next_clock=867908096 value=6452
001.137: analog_in_state oid=19 next_clock=871908096 value=32759
001.427: analog_in_state oid=18 next_clock=987908096 value=6454
001.437: analog_in_state oid=19 next_clock=991908096 value=32760
001.727: analog_in_state oid=18 next_clock=1107908096 value=6450
001.737: analog_in_state oid=19 next_clock=1111908096 value=32760
SUPPRESS analog_in_state 18
SUPPRESS analog_in_state 19
DELAY {clock + 2*freq} get_uptime
FLOOD Eval: DELAY 27700126140393 get_uptime
100000 0.0 debug_nop
get_uptime
003.440: stats count=323 sum=267965 sumsq=1323253
003.968: uptime high=6449 clock=1887396844
004.081: uptime high=6449 clock=1932579610
008.540: stats count=5214 sum=8040340 sumsq=99032938
013.573: stats count=130 sum=67515 sumsq=214581
^C

003.968: uptime high=6449 clock=1887396844
004.081: uptime high=6449 clock=1932579610

frequency = 400006467

100000 * mcu_frequency / clock_diff
100000 * 400006467 / (1932579610 - 1887396844) = 100000 * 400006467 / 45182766 = 885307.61

So 885k

A commands per second report of 870K is thus ~943728 bytes per second of application data (870000 / 59 * 64). Add in USB overhead and the value is effectively the 12Mbit per second that USB Full Speed runs at.

Yep, that tracks! Thanks for the explanation; I wasn't sure of the details, so I just used the minimum size / worst case as an indicator that something seemed off :slight_smile:

Sorry, I wasn't suggesting you were trying to mislead people, rather that the numbers were suggesting something that wasn't true because of the sample count. This is what I get if I run 200 samples multiple times on the STM32G0B1 in USB mode:

BUS_LATENCY MCU=toolboard_t0 COUNT=200
Latency: mean=0.000416088s stddev=0.000193513s
BUS_LATENCY MCU=toolboard_t0 COUNT=200
Latency: mean=0.000403709s stddev=0.000202556s
BUS_LATENCY MCU=toolboard_t0 COUNT=200
Latency: mean=0.000392506s stddev=0.000047312s
BUS_LATENCY MCU=toolboard_t0 COUNT=200
Latency: mean=0.000711638s stddev=0.000432278s
BUS_LATENCY MCU=toolboard_t0 COUNT=200
Latency: mean=0.000571161s stddev=0.000283995s
BUS_LATENCY MCU=toolboard_t0 COUNT=200

The graph doesn't address that, unfortunately.


For what it is worth, I'm a little confused about what you are testing.

For simple G1 X99 type moves, the limitation is going to be in the C step compression code (see Code overview - Klipper documentation). In particular, increasing the velocity will have little impact on the total bandwidth consumed because many steps are compressed into just a few queue_step commands. Also note that the C code will generate the queue_step commands, so this code path doesn't stress the Python code much.

For many small arcs, however, it'll likely start to stress the host CPU's ability to run the Python code. Depending on the processor, the host CPU may max out before bandwidth is saturated.

Testing the bandwidth to a "USB to CANbus bridge" MCU is not at all a test of a CAN bus. The results will not correlate with a bandwidth test to an MCU actually using CAN bus (that is, an MCU compiled for "Communication Interface: CAN").

An MCU compiled for "USB to CANbus bridge" communicates over USB and nothing purposely rate-limits that connection (not Linux, not Klipper, and not the MCU code). The reason the numbers are so much lower than normal "USB" mode is due to the goofy "gs_usb" protocol - for each 8-byte CANbus packet, two full USB packets must be transmitted on the USB bus. Thus, a benchmark of an MCU in "USB to CANbus bridge" mode is actually a test of how well the host USB hardware can schedule back-to-back USB packets and how well the MCU USB hardware can schedule back-to-back USB acknowledgement packets. It's stressing a corner-case of USB. It's plenty fast for our purposes of sending packets on CANbus, but I don't think it's representative of other use-cases.

Cheers,
-Kevin


I will add my runs here, but I did not get such a value distribution.
I think that variance is because of the schedutil governor.

$ BUS_LATENCY MCU=mcu COUNT=16
// Latency: mean=0.000257218s stddev=0.000024770s
$ BUS_LATENCY MCU=mcu COUNT=16
// Latency: mean=0.000232436s stddev=0.000052232s
$ BUS_LATENCY MCU=mcu COUNT=16
// Latency: mean=0.000248304s stddev=0.000045641s
$ BUS_LATENCY MCU=mcu COUNT=16
// Latency: mean=0.000221044s stddev=0.000045164s
$ BUS_LATENCY MCU=mcu COUNT=16
// Latency: mean=0.000210681s stddev=0.000032338s
$ BUS_LATENCY MCU=mcu COUNT=16
// Latency: mean=0.000236262s stddev=0.000056476s
$ BUS_LATENCY MCU=mcu COUNT=16
// Latency: mean=0.000250272s stddev=0.000044057s
BUS_LATENCY MCU=mcu COUNT=4     mean=0.000331722  stddev=0.000134367
BUS_LATENCY MCU=mcu COUNT=8     mean=0.000205673  stddev=0.000026219
BUS_LATENCY MCU=mcu COUNT=16    mean=0.000207444  stddev=0.000042051
BUS_LATENCY MCU=mcu COUNT=32    mean=0.000229691  stddev=0.000043577
BUS_LATENCY MCU=mcu COUNT=64    mean=0.000224012  stddev=0.000079367
BUS_LATENCY MCU=mcu COUNT=128   mean=0.000203064  stddev=0.000022567
BUS_LATENCY MCU=mcu COUNT=256   mean=0.000205901  stddev=0.000022508
BUS_LATENCY MCU=mcu COUNT=512   mean=0.000204455  stddev=0.000034917
BUS_LATENCY MCU=mcu COUNT=1024  mean=0.000205433  stddev=0.000028755
BUS_LATENCY MCU=mcu COUNT=2048  mean=0.000205368  stddev=0.000035601
BUS_LATENCY MCU=mcu COUNT=4096  mean=0.000215199  stddev=0.000030328
BUS_LATENCY MCU=mcu COUNT=8192  mean=0.000226695  stddev=0.000046440

For what it is worth, I'm a little confused about what you are testing.

I saw several topics about CAN bandwidth, USB, etc.
(In my small chats there are also a lot of them, sometimes about reliability - people often assume automotive CAN is equivalent to the CAN used on printer boards.)
So, instead of arguing, I actually try to show "data".

the limitation is going to be in the C step compression code

step_compress is bounded from above at ~65k steps, so a large number of steps (high microstepping) and more movements should just increase the number of output compressed steps.

For many small arcs, however, itā€™ll likely start to stress the host CPUā€™s ability to run the Python code.

Python is pretty fast; most CPU time is spent in C.
A friend of mine and I spent some time benchmarking it; one of the fruits of this was my patch to the arc generator. That was the simplest part :smiley:

(We actually tested it with a resolution of 0.025 and really large arcs with a radius of 150 mm.)

To be precise, in the perf data most of the time is spent in:
static inline double get_axis_position_across_moves(struct move *m, int axis, double time)
and
static double pa_range_integrate(struct move *m, double move_time, struct list_head *pa_list, double hst)


But, to my surprise, this is not because the list traversal is slow or anything like that.

The issue is here:

static int32_t
itersolve_gen_steps_range(struct stepper_kinematics *sk, struct move *m
                          , double abs_start, double abs_end)

Because of the large number of small moves, it can iterate 1000-3000 times over the same path until it finds what it wants, and the lists plus the large number of small movements only make it worse, because the functions above iterate from a fresh start every time.

We came up with a dirty patch which fixed this by caching the last move entry position.
But because it's dirty and the original author does not have time to refactor it, it is still not in a PR :smiley:

Before patch: (perf profile screenshot)

After patch: (perf profile screenshot)

From our testing, this gives about a 3x difference (really huge; we expected something like a 20% win), but it only caches for get_axis_position_across_moves. The pa_range_integrate caching part is missing, and that branch is the next biggest CPU consumer in such cases.


normal ā€œUSBā€ mode is due to the goofy ā€œgs_usbā€ protocol ā€¦

Wow :fire: that is interesting

I'm not sure what tools you used to benchmark the code. Be aware that some tools modify the code to gather usage statistics, and these tools can dramatically alter the relative performance of the programs themselves. They can therefore provide misleading results. I typically measure performance using the Linux perf tool to avoid those issues. I'm surprised the C code was so much higher than the Python code - though I'm not surprised that the itersolve and step compression code are heavy consumers of CPU time.

Yeah - the itersolve system currently resets its predictions at the start of every gcode move. It's difficult to make predictions across moves though, because we expect each gcode move to change the pattern of steps. (Putting aside extreme cases like over-specified arcs, the purpose of each new gcode move is to alter the pattern of toolhead movement...)

Cheers,
-Kevin
