Hi,

Klipper generates and executes stepper steps in the following manner: first, an iterative solver generates the time of each step based on the moves and the printer kinematics, then the step compression code compresses the steps for transmission over serial/USB, and finally the MCU executes the compressed steps. The steps are compressed to the format `step[i] = step[0] + interval * i + add * i * (i-1) / 2`, `i=1..count`, and in this case only `interval`, `add` and `count` are transmitted to the MCU. The compression is lossy, and each step `i` after compression ends up between `(true_step[i] + true_step[i-1]) / 2` and `true_step[i]`.
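To make the format concrete, here is a minimal Python sketch of how a receiver could expand one `(interval, add, count)` triple back into step times. The function name `expand_baseline` is my own, and this only mirrors the formula above; it is not taken from the MCU code.

```python
def expand_baseline(step0, interval, add, count):
    """Return step[1..count] for step[i] = step0 + interval*i + add*i*(i-1)/2."""
    steps = []
    t = step0
    for _ in range(count):
        t += interval        # advance by the current interval
        steps.append(t)
        interval += add      # the interval changes by a constant amount per step
    return steps


if __name__ == "__main__":
    print(expand_baseline(0, 100, -2, 4))
```

Accumulating `interval` and then bumping it by `add` gives the same values as the closed-form expression, because the sum of `interval + add * k` for `k = 0..i-1` is `interval * i + add * i * (i-1) / 2`.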

In general, this scheme is really not bad and produces very good positional accuracy. However, when I was inspecting the velocity and acceleration profiles of the motion produced by the compressed steps, I noticed that the compression introduces systematic artifacts (this is a corner where Y decelerates from 100 mm/sec to 0 and X accelerates from 0 to 100 mm/sec, with the EI input shaper on X and 2HUMP_EI on Y, and with 256 microstepping):

Namely, the acceleration drifts over some interval: it starts from a value relatively far from the expected one, changes until it reaches the expected value, then overshoots and goes further until the maximum error is exceeded, and then the cycle repeats. Also, there are often small discontinuities in the stepper velocity. This happens because the true change of the step time during acceleration should be

```
dt = dx / (v0 + a * t) ~= dx / v0 * (1 - a / v0 * t + (a / v0 * t)^2 - ...)
```

which, due to the compression scheme, is effectively approximated with only the first-order term of the Taylor series expansion. I do not think this can affect the dimensional accuracy of prints, but I thought that it may negatively affect the efficacy of the input shaper, for instance, which is fairly sensitive to the magnitude of the pulses and their timing.
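A small numeric sketch of that effect, under my own illustrative assumptions (constant acceleration, made-up values for `v0`, `accel` and `dx`, and a naive choice of `add` taken from the first two intervals, which is not how stepcompress actually picks its parameters): the true step intervals are a curved function of the step index, so any model where the interval changes linearly drifts away from them.

```python
import math


def true_step_times(v0, accel, dx, count):
    """Times at which positions 0, dx, 2*dx, ... are reached under constant acceleration."""
    # Solve v0*t + accel*t^2/2 = i*dx for t >= 0.
    return [(-v0 + math.sqrt(v0 * v0 + 2.0 * accel * dx * i)) / accel
            for i in range(count + 1)]


def linear_interval_errors(v0, accel, dx, count):
    """Absolute error of a linearly changing interval model, per step."""
    times = true_step_times(v0, accel, dx, count)
    intervals = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    add = intervals[1] - intervals[0]   # naive constant change per step (assumption)
    return [abs(intervals[0] + add * k - true) for k, true in enumerate(intervals)]


if __name__ == "__main__":
    errs = linear_interval_errors(v0=10.0, accel=3000.0, dx=0.0025, count=200)
    for k in (2, 10, 50, 199):
        print(k, errs[k])
```

The error is zero at the start and grows with the step index, which is why a sequence has to be cut short once the allowed error is exceeded.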

So, I attempted to improve the step compression scheme by introducing extra parameters:

```
step[i] = step[0] + ((interval * i + add * i * (i-1) / 2 + add2 * i * (i-1) * (i-2) / 6) >> shift)
```

which is implemented approximately as follows (with the rounding and remainder handled appropriately):

```
for i in range(count):
    # emit the integer part; carry the fractional bits over to the next step
    step += (interval + remainder) >> shift
    remainder = (interval + remainder) & ((1 << shift) - 1)
    interval += add
    add += add2
```

This effectively adds the next, second-order term of the Taylor expansion, and uses fixed-point arithmetic to increase precision over long sequences of steps. As a result, we can achieve much higher precision of the step times: `step[i]` lies in approximately `[true_step[i] - 0.015 * (true_step[i]-true_step[i-1]); true_step[i] + 0.015 * (true_step[i+1]-true_step[i])]`, so it has an error of approximately +/- 1.5% of the step interval.
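As a sanity check on the loop versus the closed-form expression, here is a self-contained Python sketch. The function names, the zero initial remainder and the sample parameter values are my own assumptions for illustration; the real MCU code may handle rounding differently.

```python
def expand_fixed_point(step0, interval, add, add2, shift, count):
    """Run the accumulate-and-carry loop and return step[1..count]."""
    steps = []
    step, remainder = step0, 0          # assumption: the remainder starts at zero
    mask = (1 << shift) - 1
    for _ in range(count):
        total = interval + remainder
        step += total >> shift          # integer part advances the step time
        remainder = total & mask        # fractional bits carry over
        steps.append(step)
        interval += add
        add += add2
    return steps


def closed_form(step0, interval, add, add2, shift, i):
    """step[i] evaluated directly from the closed-form expression."""
    acc = (interval * i
           + add * i * (i - 1) // 2
           + add2 * i * (i - 1) * (i - 2) // 6)
    return step0 + (acc >> shift)


if __name__ == "__main__":
    params = dict(step0=0, interval=(1000 << 8) + 37, add=-300, add2=1, shift=8)
    loop = expand_fixed_point(count=50, **params)
    direct = [closed_form(i=i, **params) for i in range(1, 51)]
    print(loop == direct)
```

Carrying the remainder means the truncation error never accumulates: after every step, the integer step time plus the carried fraction equals the exact fixed-point sum.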

This is how it looks with the new protocol:

Here we can see all the acceleration levels from input shaping on X and Y, and there are no unexpected stepper velocity jumps: the only jumps come from input shaping of the square corner velocity of 5 mm/sec.

It looks similar with 16 microstepping (I had interpolation enabled):

And another point in the test with 256 microstepping:

Of course, the new protocol is more computationally expensive (at least on MCUs) and increases the amount of data to be transferred to the MCU. I did make some optimizations to the MCU code and was able to achieve reasonable performance (e.g. 150 mm/sec printing speeds and 220 mm/sec move speeds on an Ender 3 with an LPC1769 at 120 MHz and 256 microstepping; ~100 mm/sec printing speeds and 220 mm/sec move speeds on a delta with an ATmega2560 at 16 MHz and 16 microstepping, controlled by a Raspberry Pi Zero W). I didn't try to push it and find the maximum performance, so I cannot tell exactly how the new code fares against the baseline in terms of performance. On the other hand, I also didn't want to spend too much time on optimizations right now, considering that the code may not be accepted into mainline Klipper. However, I think there is some room for further optimization and code simplification.

I also ran some tests in Klipper batch mode. I sliced 3D Benchy for a cartesian and a delta printer (2 different gcode files with slightly different parameters), ran the gcode files in batch mode with cartesian and delta configs at 16 and 256 microstepping using both the baseline and the updated stepcompress code, and obtained the following stats:

10361446 bytes, 1280262 queue_step, 3DBenchy_delta_baseline_16.serial

13513534 bytes, 1073562 queue_step, 3DBenchy_delta_stepcompress_16.serial

24911292 bytes, 3532198 queue_step, 3DBenchy_delta_baseline_256.serial

46399165 bytes, 3516039 queue_step, 3DBenchy_delta_stepcompress_256.serial

9937719 bytes, 1239940 queue_step, 3DBenchy_ender3_baseline_16.serial

14451059 bytes, 1151310 queue_step, 3DBenchy_ender3_stepcompress_16.serial

28077698 bytes, 4046116 queue_step, 3DBenchy_ender3_baseline_256.serial

45639795 bytes, 3500590 queue_step, 3DBenchy_ender3_stepcompress_256.serial

The "bytes" value is just the raw serial file size, and "queue_step" is the number of queue_step commands in the corresponding serial file. So, we can see that with the updated stepcompress code we get somewhat fewer commands on average, but the amount of data to be transferred increases by roughly 30-85% (about 55% on average), depending on the configuration.
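For reference, the relative changes per configuration can be recomputed from the figures listed above. A small sketch (the dictionary layout and function names are my own; the numbers are copied from the stats):

```python
# (bytes, queue_step) pairs copied from the stats above: baseline first, updated second
STATS = {
    "delta_16": ((10361446, 1280262), (13513534, 1073562)),
    "delta_256": ((24911292, 3532198), (46399165, 3516039)),
    "ender3_16": ((9937719, 1239940), (14451059, 1151310)),
    "ender3_256": ((28077698, 4046116), (45639795, 3500590)),
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def summarize(stats):
    """Map each configuration to (bytes change %, queue_step change %)."""
    return {
        name: (percent_change(base[0], new[0]), percent_change(base[1], new[1]))
        for name, (base, new) in stats.items()
    }


if __name__ == "__main__":
    for name, (d_bytes, d_cmds) in summarize(STATS).items():
        print(f"{name}: bytes {d_bytes:+.1f}%, queue_step {d_cmds:+.1f}%")
```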

I have been printing with this new code a lot lately, and at least all apparent bugs have been fixed (I cannot be 100% certain that *all* bugs were found and eliminated though). TBH, I must admit that I did not notice any obvious improvements in prints (no degradations either). It is possible that the ringing test looks a tiny bit better with the new code, but I am not sure if I am simply imagining it. It is also possible that I do not have a sufficiently rigid printer to set high accelerations where it would be important to reduce ringing further.

In either case, it would be great to have more tests on different machines to see whether it actually improves anything or is just not worth the trouble. So, if you are interested, please give it a try (branch). Naturally, reflashing the MCU is required due to the change in the protocol. Also, I'd say that StealthChop mode introduces other delays and inconsistencies and will probably not be affected by these changes, so it probably makes sense to test in the SpreadCycle mode of the TMC drivers.

@koconnor FYI, and it would be great to hear your thoughts on it.

Implementation-wise, the new code uses a binary search on `count`, and then uses the least-squares method to determine all other parameters. Fortunately, the matrices for each `count`, as well as their LDLT decompositions, can be precomputed, so it is not too computationally intensive on the host side.
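To illustrate why the precomputation works, here is a rough Python sketch of the least-squares part only (the binary search on `count` and the error-bound check are left out). All names are hypothetical and it uses exact `Fraction` arithmetic for clarity, which the actual host code presumably does not. The point is that the normal matrix depends only on `count`, so its LDLT factors can be cached and reused for every fit with that `count`.

```python
from fractions import Fraction


def basis(i):
    """Coefficients of (interval, add, add2) in the closed form for step i."""
    return (i, i * (i - 1) // 2, i * (i - 1) * (i - 2) // 6)


def normal_matrix(count):
    """A^T A for the design matrix with rows basis(1..count); depends only on count."""
    rows = [basis(i) for i in range(1, count + 1)]
    return [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]


def ldlt(m):
    """Factor a symmetric positive definite matrix as L * diag(D) * L^T."""
    n = len(m)
    L = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    D = [Fraction(0)] * n
    for j in range(n):
        D[j] = Fraction(m[j][j]) - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (m[i][j] - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D


def ldlt_solve(L, D, b):
    """Solve (L D L^T) x = b by forward, diagonal and backward substitution."""
    n = len(b)
    y = [Fraction(v) for v in b]
    for i in range(n):
        y[i] -= sum(L[i][k] * y[k] for k in range(i))
    x = [y[i] / D[i] for i in range(n)]
    for i in reversed(range(n)):
        x[i] -= sum(L[k][i] * x[k] for k in range(i + 1, n))
    return x


# Precomputed once; count >= 3 is needed for three parameters to be determined.
FACTORS = {count: ldlt(normal_matrix(count)) for count in range(3, 65)}


def fit(targets):
    """Least-squares (interval, add, add2) for step times given relative to step[0]."""
    L, D = FACTORS[len(targets)]
    rows = [basis(i) for i in range(1, len(targets) + 1)]
    rhs = [sum(r[a] * t for r, t in zip(rows, targets)) for a in range(3)]
    return ldlt_solve(L, D, rhs)
```

With the factors cached, each fit costs one pass over the targets to build the right-hand side plus a fixed-size triangular solve.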