Improved stepcompress implementation

I have done some experiments with step compression. This is a new attempt, inspired by Dmitry’s code here, my previous work on step compression, and the MARS algorithm. The code is at: GitHub - KevinOConnor/klipper-dev at work-stepcompress-20220831

This new code reduces the “micro velocity bursts” between queue_step commands with only host code changes. The mcu code and message protocol are unchanged; it is not necessary to reflash the mcu.

The code internally ties upcoming queue_step “interval” parameters to past step intervals, thus effectively forcing velocity to remain continuous between move segments. The code implements the same absolute step scheduling constraints as the current code. It uses “least squares” to find add1,count1 sequences such that the following add2,count2 sequence has the maximal step “total count”.
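
For reference, a minimal sketch of how a queue_step triplet expands into step times on the mcu side (illustrative code, not the actual mcu implementation; the struct simply mirrors the interval/count/add fields of the protocol):

#include <stdint.h>

struct step_move { uint32_t interval; uint16_t count; int16_t add; };

// Expand one queue_step command: each step occurs 'interval' ticks after
// the previous one, and 'interval' grows by 'add' after every step, so
// step i (1-based) lands at last_clock + i*interval + add*i*(i-1)/2.
static uint64_t
expand_queue_step(uint64_t last_clock, struct step_move m, uint64_t *out)
{
    uint32_t interval = m.interval;
    for (uint16_t k = 0; k < m.count; k++) {
        last_clock += interval;
        out[k] = last_clock;   // absolute clock of this step
        interval += m.add;
    }
    return last_clock;
}

With that expansion, keeping velocity continuous across commands roughly means the next command's interval should be close to prev.interval + prev.count*prev.add, i.e. the natural continuation of the previous quadratic schedule.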

This branch is probably only of interest to developers or those experimenting with the low-level kinematics. I’ve run a bunch of unit tests, but have only lightly tested it on real hardware.

I wanted to share some of my results - maybe it will inspire other ideas or implementations.

-Kevin

I ran a test on my Voron Zero using the various step compression algorithms. Below are some snippets from the test.

Original Klipper algorithm:

Code from my work-stepcompress-20220831 branch (e31be989):

Code from Dmitry’s stepcompress branch (83f7bd50):

Some high-level thoughts:

  • My work-stepcompress-20220831 branch does have smoother velocity compared to the baseline code. It is not as smooth as Dmitry’s code.
  • None of the branches seem to improve the jitter picked up by the accelerometer and angle sensors on this machine. It seems that my Voron Zero has an increase in mechanical jitter when moving around 60-80mm/s that is not related to step compression.

Cheers,
-Kevin

This is interesting; I’ll take a look at your code, Kevin. Just out of curiosity, what microstepping did you use for your test on the Voron Zero?

I used settings like the following for both x and y (this is a corexy printer):

[stepper_x]
step_pin: PB3
dir_pin: PB4
enable_pin: !PD2
microsteps: 128
rotation_distance: 40
endstop_pin: PC0
position_endstop: 120
position_max: 120
homing_speed: 20   #Max 100

[tmc2209 stepper_x]
uart_pin: PC11
tx_pin: PC10
uart_address: 3
run_current: 0.800
interpolate: False
driver_TOFF: 4
driver_TBL: 0
driver_HSTRT: 0
driver_HEND: 0
stealthchop_threshold: 999999

The test is with diagonal moves. (The graph shows the accelerometer data for the “x” direction, even though the move is diagonal.)

-Kevin

I’m just sharing what I have so far.

While trying to grasp/rewrite the stepcompress code, I slightly modified queue_step.py to compare two different outputs, so now I can directly compare the two decoded serial outputs against each other.
I also converted the output to an arbitrary speed, because I think it is much easier to compare derivatives here than the actual clock output.

# Example decoded log generation
./klippy/parsedump.py out/klipper.dict test.serial > test.log
./scripts/graph_steps_cmp.py -O 2 -s 1000000 -c 500000 perfect.log test.log

graph_steps_cmp.zip (2.9 KB)

To generate perfect output, which is large, uncompressed, and inefficient, I did this:

Much simplified compress_bisect_add
// Fit quadratic function
static struct step_move
compress_bisect_add(struct stepcompress *sc)
{
    const uint32_t max_count = 0xffff;
    uint32_t *qlast = sc->queue_next;
    if (qlast > sc->queue_pos + max_count)
        qlast = sc->queue_pos + max_count;
    uint32_t *pos = sc->queue_pos;
    uint32_t lsc = sc->last_step_clock;
    int64_t add = 0, minadd = -0x8000, maxadd = 0x7fff;

    struct step_move best_move = {
        .interval = pos[0] - lsc,
        .count = 1,
        .add = 0,
    };

    if (qlast - sc->queue_pos == 1)
        goto out;

    // Try perfect fit 2 points
    add = (int32_t)(pos[1] - lsc) - 2 * best_move.interval;
    if (add <= minadd || add >= maxadd)
        goto out;

    best_move.count = 2;
    best_move.add = add;
    // The "|| 1" intentionally stops here, so each command covers at most
    // 2 steps - large and uncompressed, but an exact "perfect" reference
    if (qlast - sc->queue_pos == 2 || 1)
        goto out;

out:
    return best_move;
}

Thanks.

I have been somewhat interested in this problem for some time.
After testing Dmitry’s smoother branch (smooth IS), I noticed that I couldn’t see the difference without adjusting max_error.
I was puzzled. Could I finally do something with it?

I was initially unable to grasp the original code in a way that would allow me to reimplement it. I have written different solutions from scratch several times :D, until I got the idea.
To be fair, I think I am still not able to write a good add fit function.
Now I do have some understanding, and maybe I can add something useful.

If we define our problem as velocity jumps, then sure, it is partly due to the greedy algorithm.
But the other half of the problem is the limited quantization and the quadratic function.
For example, we cannot encode interval=1.5 or add=0.5.
So there will always be compression artifacts.
For the same reason, there is a theoretical limit to how well we can compress if we try to fit the raw data line.
I agree that simply decreasing max_error could improve the situation, but it will drop the compression ratio.

I can speculate on how much, depending on the MCU clock rate, microstepping, etc.
But I suspect the main limitation of that approach is that it will limit the allowed fit for oscillating flat lines.
Example:

// max error is mcu_freq * 0.000_025s
// Real data line is 2344, 2343, 2344, 2343... max error is 3750
// with 150MHz rp2350
queue_step oid=2 interval=2344 count=1172 add=0
queue_step oid=2 interval=2343 count=1172 add=0
queue_step oid=2 interval=2344 count=2231 add=0
// RP2040, max_error=300
queue_step oid=2 interval=188 count=188 add=0
queue_step oid=2 interval=187 count=188 add=0
queue_step oid=2 interval=188 count=188 add=0
queue_step oid=2 interval=187 count=188 add=0
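
Assuming both snippets encode the same nominal flat rate (the numbers suggest they do), the arithmetic lines up: 150 MHz / 2343.75 = 12 MHz / 187.5 = 64000 steps/s, and neither 2343.75 nor 187.5 is representable as an integer interval, so the compressor has to alternate 2344/2343 (or 188/187) commands. The max_error values likewise follow from 0.000025 s × mcu_freq: 3750 ticks at 150 MHz and 300 ticks at 12 MHz.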

There is a patch that trims the outliers/underfitted tails.
That is, if our fit overshoots and the real data underneath has a different curvature/slope, the tail of the sequence is cut off.
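
For illustration, a minimal sketch of that kind of tail trimming, assuming we can recompute the predicted clock of each step from the fitted triplet (hypothetical helper names, not the actual patch):

#include <stdint.h>

struct step_move { uint32_t interval; uint16_t count; int16_t add; };

// Predicted clock of step i (1-based), relative to last_step_clock, for a
// fitted (interval, add) schedule: i*interval + add*i*(i-1)/2
static int64_t
predicted_clock(struct step_move m, uint32_t i)
{
    return (int64_t)m.interval * i + (int64_t)m.add * i * (i - 1) / 2;
}

// Drop steps from the end of the fitted move while the prediction for the
// last kept step deviates from the requested step time by more than
// 'trim_error' ticks.  'req' holds the requested step clocks relative to
// last_step_clock.
static struct step_move
trim_tail(struct step_move m, const int64_t *req, int64_t trim_error)
{
    while (m.count > 1) {
        int64_t err = predicted_clock(m, m.count) - req[m.count - 1];
        if (err < 0)
            err = -err;
        if (err <= trim_error)
            break;
        m.count--;
    }
    return m;
}

Presumably the trimmed steps stay in the queue and get picked up by the next command, which is where the extra serial data reported below comes from.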

Below are the graphs.
Microstepping is set to 64. RP2040/RP2350 are just examples of different timer resolutions.
For the compression ratio, I use the size of the serial output:

# Example 20min empty cube print
$ du -hs test_12Mhz*.serial
18M     test_12Mhz.serial
20M     test_12Mhz_trim.serial # +10%
$ du -hs perfect_150Mhz.serial test_150Mhz.serial test_150Mhz_trim.serial  
792M    perfect_150Mhz.serial # 1-2 steps per cmd
11M     test_150Mhz.serial
17M     test_150Mhz_trim.serial # +35%
# Sort of curve torture test, torus, gcode_arcs, gyroid infill 40min
$ du -hs Torus_12Mhz*
32M     Torus_12Mhz.serial
36M     Torus_12Mhz_trim.serial # +12%
$ du -hs Torus_150Mhz*
26M     Torus_150Mhz.serial
40M     Torus_150Mhz_trim.serial # +35%
Borked metrics from above script
test_12Mhz.log vs Reference:
  Speed Comparison:
    RMSE: 0.000000
    MAE:  0.000000
    Max Error: 0.000001
    Correlation: 0.999919
  Interval Comparison:
    RMSE: 6.031088
    MAE:  0.969365
    Max Error: 299.000000
    Correlation: 1.000000
  Add Comparison:
    RMSE: 8.210030
    MAE:  0.806903
    Max Error: 597.000000
    Correlation: 1.000000
test_12Mhz_trim.log vs Reference:
  Speed Comparison:
    RMSE: 0.000000
    MAE:  0.000000
    Max Error: 0.000001
    Correlation: 0.999921
  Interval Comparison:
    RMSE: 5.781919
    MAE:  0.893169
    Max Error: 299.000000
    Correlation: 1.000000
  Add Comparison:
    RMSE: 8.080921
    MAE:  0.791200
    Max Error: 597.000000
    Correlation: 1.000000
Speed Graphs, speed is constant / interval.

Green is where I zoom. I messed up file names. This is a 150MHz RP2350.

This is a 12MHz RP2040.

I messed up file names. This is a 150MHz RP2350.

Green is where I zoom, because with the trim, there is an overshoot.

It seems to improve the fit. It also introduces its own artifacts, where the trimmed tail sometimes allows the next function to start overshooting. But overall, simply because of the more aggressive segmentation, it is smoother.
(They also could be dumped, but that, unfortunately, would increase the data rate further).

Thanks.


Now I also have my own implementation, but nothing special.

Implementation.
More code, but I hope it’s simpler. Slightly slower. Also, less compression, but a closer fit.
The basic idea is heuristic analysis:

  1. Don’t bother with add extrema; limit the max count.
  2. Work with derivatives (intervals), not real numbers.
  3. Split by flat lines, cache intervals, and min intervals (minp).
  4. Guess. Opportunistic split.
  5. Refit with Linear Least Squares (see the sketch after this list).
  6. Trim. Well, this is where I got it.
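
As an illustration of step 5, here is a minimal sketch of a least-squares refit for a fixed count, using the same quadratic model the protocol encodes. This is hypothetical code, not the implementation from the branch; the result would still need rounding and re-checking against the scheduling bounds.

#include <math.h>
#include <stdint.h>

// Hypothetical helper: least-squares fit of (interval, add) for a fixed
// step count, minimizing sum_i (t_i - interval*i - add*i*(i-1)/2)^2, where
// t_i is the i-th requested step clock relative to last_step_clock.
// Returns nonzero when the system is degenerate (count < 2).
static int
lsq_fit_interval_add(const int64_t *t, int count, double *interval, double *add)
{
    double sxx = 0., sxy = 0., syy = 0., sxt = 0., syt = 0.;
    for (int i = 1; i <= count; i++) {
        double x = i;                        // coefficient of 'interval'
        double y = (double)i * (i - 1) / 2.; // coefficient of 'add'
        double ti = (double)t[i - 1];
        sxx += x * x; sxy += x * y; syy += y * y;
        sxt += x * ti; syt += y * ti;
    }
    double det = sxx * syy - sxy * sxy;
    if (fabs(det) < 1e-9)
        return -1;
    *interval = (sxt * syy - syt * sxy) / det;  // Cramer's rule, 2x2 system
    *add = (syt * sxx - sxt * sxy) / det;
    return 0;
}

The rounded result then has to be validated against the per-step error window (the minmax_point bounds in stepcompress.c), since the unconstrained optimum can land outside it.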

I hesitate to say it, but I’ve probably found a solution, or at least something that seems suitable.
The change of speed is the acceleration.
That means that the error function should account for acceleration.
Branch: GitHub - nefelim4ag/klipper at stepcompress-accel-error-20250824
(Modified Kevin’s graph script, lossless compressor, patch)

Patch (short diff)
diff --git a/klippy/chelper/stepcompress.c b/klippy/chelper/stepcompress.c
index d7ff4f810..8baf410ee 100644
--- a/klippy/chelper/stepcompress.c
+++ b/klippy/chelper/stepcompress.c
@@ -95,6 +95,14 @@ minmax_point(struct stepcompress *sc, uint32_t *pos)
     uint32_t max_error = (point - prevpoint) / 2;
     if (max_error > sc->max_error)
         max_error = sc->max_error;
+    // Narrow by acceleration
+    if (pos > sc->queue_pos) {
+        uint32_t I_orig = sc->queue_pos[1] - sc->queue_pos[0];
+        uint32_t I_cur = point - prevpoint;
+        int32_t add = (int32_t)(I_cur) - (int32_t)(I_orig);
+        add = abs(add);
+        max_error = max_error - add;
+    }
     return (struct points){ point - max_error, point };
 }

This is a hack in some sense because I limit the accumulated error over the curvy sequences.
It provides better results than tail trimming.

Graphs

./scripts/graph_steps.py -O 2 --diff 2 -f 12000000 -c 1000000 12Mhz/perfect.log 12Mhz/add.log

./scripts/graph_steps.py -O 2 --diff 2 -f 64000000 -c 1000000 64Mhz/perfect.log 64Mhz/add.log

./scripts/graph_steps.py -O 2 --diff 2 -f 150000000 -c 1000000 150Mhz/perfect.log 150Mhz/orig.log

Any clock frequency can be produced with jq '.config.CLOCK_FREQ = 12000000' out/klipper.dict

Compressed sizes
13M     12Mhz/orig.serial
15M     12Mhz/add.serial +15%
418M    12Mhz/perfect.serial
8.5M    64Mhz/orig.serial
11M     64Mhz/add.serial +30%
338M    64Mhz/perfect.serial
7.8M    150Mhz/orig.serial
9.6M    150Mhz/add.serial +23%
455M    150Mhz/perfect.serial

I’m not sure it is possible to do a substantially better fit with the current quadratic sequence and the integer restriction (well, I’ve failed at it).
So, it seems that some error propagation is required to ensure a “better” fit.
Of course, questions start to arise: how to define the fit, and what “better” means over the broad range of input sequences.

But generally it seems that, with the current limits, sequences have to be shorter to get a better fit.
It is undesirable to touch the maximum error and, with it, constant velocity.
It seems hard to guide the algorithm and restrict the maximum error in a meaningful way inside the fit function.
This solution, or at least this sort of solution, seems dirt cheap in every way.

Thanks.


It is possible to utilize the same idea to penalize the algorithm by accumulated jerk
(because our polynomial can’t encode the jerk, i.e., the step acceleration difference).

Patch
diff --git a/klippy/chelper/stepcompress.c b/klippy/chelper/stepcompress.c
index 9d097195c..a6dd6b72a 100644
--- a/klippy/chelper/stepcompress.c
+++ b/klippy/chelper/stepcompress.c
@@ -141,6 +141,11 @@ compress_bisect_add(struct stepcompress *sc)
     int32_t bestinterval = 0, bestcount = 1, bestadd = 1, bestreach = INT32_MIN;
     int32_t zerointerval = 0, zerocount = 0;
 
+    // Fine by 3rd degree derivative (Jerk)
+    int32_t J_sum = 0;
+    int32_t Ji = 0;
+    int32_t I_prev = sc->queue_pos[0] - sc->last_step_clock;
+    int32_t A_prev = 0;
     for (;;) {
         // Find the longest valid sequence with the given 'add'
         struct points nextpoint;
@@ -154,6 +159,21 @@ compress_bisect_add(struct stepcompress *sc)
                 return (struct step_move){ interval, count, add };
             }
             nextpoint = minmax_point(sc, sc->queue_pos + nextcount - 1);
+            if (nextcount > 2) {
+                for (; Ji < nextcount-1; Ji++) {
+                    int32_t I = sc->queue_pos[Ji+1] - sc->queue_pos[Ji];
+                    int32_t A = I - I_prev;
+                    J_sum += A - A_prev;
+                    I_prev = I;
+                    A_prev = A;
+                }
+                uint32_t max_error = nextpoint.maxp - nextpoint.minp;
+                if (max_error > abs(J_sum))
+                    max_error = max_error - abs(J_sum);
+                else
+                    max_error = 1;
+                nextpoint.minp = nextpoint.maxp - max_error;
+            }
             int32_t nextaddfactor = nextcount*(nextcount-1)/2;
             int32_t c = add*nextaddfactor;
             if (nextmininterval*nextcount < nextpoint.minp - c)

It does improve things, but only a little. The accumulated difference is too small.

Graph

Hmmm, I’m exploring peeking at the next output. I think I got the initial idea somewhere above, or probably in a conversation with Dmitry.

The idea is simple: check the next triplet, based on the output of the current one.
Then search for a local “optimum” of the total reach:
“How much can we cut from the current move count to make the next guess better?”
This is already okay on its own, but it sometimes cuts “too much” from the current count.
So there is a jerk filter, which also tries to minimize the difference between the current and upcoming triplets.

There is no refit of the current sequence (that would probably require a non-greedy algorithm and create a different problem).
Also, I did not reuse the latest output, which could probably decrease the computational cost.
Branch: GitHub - nefelim4ag/klipper at stepcompress-peeker-28082025
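
A minimal sketch of that lookahead, to make the idea concrete. Everything here is hypothetical: compress_fixed_count() is a stand-in for a greedy fit limited to at most 'limit' queued steps (not a function from the branch), and the jerk filter mentioned above is omitted.

#include <stdint.h>

struct step_move { uint32_t interval; uint16_t count; int16_t add; };
struct stepcompress;   // opaque here

// Hypothetical stand-in: best greedy fit that may use at most 'limit'
// queued steps, starting 'offset' steps into the queue.
struct step_move compress_fixed_count(struct stepcompress *sc,
                                      uint32_t offset, uint32_t limit);

// Peek at the next triplet: try cutting steps from the current move and
// keep the split that maximizes the combined reach (total step count) of
// the current triplet plus the following one.
static struct step_move
compress_with_peek(struct stepcompress *sc, uint32_t queue_len)
{
    struct step_move best = compress_fixed_count(sc, 0, queue_len);
    if (best.count >= queue_len)
        return best;                    // nothing left to peek at
    uint32_t base = best.count;
    uint32_t best_total = base
        + compress_fixed_count(sc, base, queue_len - base).count;
    for (uint32_t cut = 1; cut < base; cut++) {
        struct step_move cur = compress_fixed_count(sc, 0, base - cut);
        struct step_move nxt = compress_fixed_count(sc, cur.count,
                                                    queue_len - cur.count);
        uint32_t total = (uint32_t)cur.count + nxt.count;
        if (total > best_total) {
            best_total = total;
            best = cur;
        }
    }
    return best;
}

In the branch there is additionally a jerk filter that rejects splits where the current and upcoming triplets differ too much, so the search does not cut too aggressively from the current count.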

It seems promising to me. But it is computationally heavy.
The output seems comparable to the window error reduction by the “add” difference above.
But it has better compression.

Sizes
13M     12Mhz/orig.serial
14M     12Mhz/peek.serial # +7%
418M    12Mhz/perfect.serial
8.5M    64Mhz/orig.serial
9.0M    64Mhz/peek.serial # +6%
338M    64Mhz/perfect.serial
7.8M    150Mhz/orig.serial
8.2M    150Mhz/peek.serial # +13%
455M    150Mhz/perfect.serial
Graphs

Thanks.
