Improved stepcompress implementation


Klipper generates and executes stepper steps in the following manner: first, the iterative solver generates the time of each step based on the moves and the printer kinematics, then the step compression code compresses the steps for transmission over serial/USB, and finally the MCU executes the compressed steps. The steps are compressed to the form step[i] = step[0] + interval * i + add * i * (i-1) / 2, i = 1..count, so that only interval, add and count need to be transmitted to the MCU. The compression is lossy, and each step i after compression ends up between (true_step[i] + true_step[i-1]) / 2 and true_step[i].

In general, this schema is really not bad and produces very good positional accuracy. However, when I was inspecting the velocity and acceleration profiles of the motion after step compression, I noticed that the compression introduces systematic artifacts (this is a corner where Y decelerates from 100 mm/sec to 0 and X accelerates from 0 to 100 mm/sec, with the EI input shaper on X and 2HUMP_EI on Y, and with 256 microstepping):

Namely, the acceleration changes within each interval: it starts from a value relatively far from the expected one, changes until it reaches the expected value, then overshoots and keeps going until the maximum error is exceeded, at which point the cycle repeats. Also, there are often small discontinuities in the stepper velocity. This is happening because the true change of the time step during acceleration should be

dt = dx / (v0 + a * t) ~= dx / v0 * (1 - a / v0 * t + (a / v0 * t)^2 - ...)

which, due to the compression schema, is effectively approximated with only the first term of the Taylor series expansion. I do not think this could affect the dimensional accuracy of the prints, but I thought that it may negatively affect the efficacy of the input shaper, for instance, which is fairly sensitive to the magnitude of the pulses and their timing.
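To make this concrete, here is a quick numeric sanity check (with arbitrary illustrative values for dx, v0, a and t) showing that keeping the second term of the expansion substantially reduces the approximation error:

```python
# Arbitrary illustrative values: step distance, initial velocity,
# acceleration, and time since the start of acceleration.
dx, v0, a, t = 0.02, 50.0, 3000.0, 0.004   # mm, mm/s, mm/s^2, s

exact = dx / (v0 + a * t)              # true time to the next step
x = a * t / v0
one_term = dx / v0 * (1 - x)           # what a quadratic schema can capture
two_terms = dx / v0 * (1 - x + x * x)  # with the next term of the expansion

# two_terms lands much closer to exact than one_term does
print(exact, one_term, two_terms)
```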

So, I attempted to improve the step compression schema by introducing extra parameters:

step[i] = step[0] + (interval * i + add * i * (i-1) / 2 + add2 * i * (i-1) * (i-2) / 6) >> shift

which is essentially implemented (with the rounding and remainder handled appropriately) as

for i in range(count):
  step += (interval + remainder) >> shift
  remainder = (interval + remainder) & ((1 << shift) - 1)
  interval += add
  add += add2

This effectively adds the next, second term of the Taylor expansion, and uses fixed-point arithmetic to increase precision over long sequences of steps. As a result, we can achieve much higher precision of the step times: step[i] ~ [true_step[i] - 0.015 * (true_step[i]-true_step[i-1]); true_step[i] + 0.015 * (true_step[i+1]-true_step[i])], so the error is approximately +/- 1.5%.
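For reference, the decoding loop above can be wrapped into a small self-contained sketch (the function name and signature here are made up, just for illustration):

```python
def decode_steps(start_clock, interval, add, add2, shift, count):
    """Absolute step clocks for one command under the fixed-point scheme."""
    step, remainder, times = start_clock, 0, []
    for _ in range(count):
        step += (interval + remainder) >> shift       # integer ticks taken
        remainder = (interval + remainder) & ((1 << shift) - 1)  # leftover
        interval += add
        add += add2
        times.append(step)
    return times

# With shift=0 and add2=0 this reduces to the classic interval/add scheme:
print(decode_steps(0, 100, -2, 0, 0, 5))  # -> [100, 198, 294, 388, 480]
# With shift=8, a fractional interval of 100.5 ticks alternates 100/101:
print(decode_steps(0, (100 << 8) + 128, 0, 0, 8, 2))  # -> [100, 201]
```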

This is how it looks with the new protocol:

Here we can see all the acceleration levels from input shaping on X and Y, and there are no unexpected stepper velocity jumps: the only jumps come from the input shaping of the square corner velocity of 5 mm/sec.

It looks similar with 16 microstepping (I had interpolation enabled):

And another point in the test with 256 microstepping:

Of course, the new protocol is more computationally expensive (at least on MCUs) and increases the amount of data to be transferred to the MCU. I did make some optimizations to the MCU code, and was able to achieve reasonable performance (e.g. 150 mm/sec printing speeds and 220 mm/sec move speeds on the Ender 3 with an LPC1769 at 120 MHz and 256 microstepping; ~100 mm/sec printing speeds and 220 mm/sec move speeds on a Delta with an ATMEGA 2560 at 16 MHz and 16 microstepping, controlled by a Raspberry Pi Zero W). I didn’t try to push it too hard to get the maximum performance, so I cannot tell exactly how the new code fares against the baseline. On the other hand, I also didn’t want to spend too much time on optimizations right now, considering that the code may not be accepted into mainline Klipper. However, I think there is some room for further optimizations and code simplification.

I also made some tests with the Klipper batch mode. I sliced 3D Benchy for a cartesian and a delta printer (2 different gcode files with slightly different parameters), ran the gcodes in batch mode with the cartesian and delta configs at 16 and 256 microstepping with the baseline and updated stepcompress code, and obtained the following stats:

10361446 bytes, 1280262 queue_step, 3DBenchy_delta_baseline_16.serial
13513534 bytes, 1073562 queue_step, 3DBenchy_delta_stepcompress_16.serial

24911292 bytes, 3532198 queue_step, 3DBenchy_delta_baseline_256.serial
46399165 bytes, 3516039 queue_step, 3DBenchy_delta_stepcompress_256.serial

9937719 bytes, 1239940 queue_step, 3DBenchy_ender3_baseline_16.serial
14451059 bytes, 1151310 queue_step, 3DBenchy_ender3_stepcompress_16.serial

28077698 bytes, 4046116 queue_step, 3DBenchy_ender3_baseline_256.serial
45639795 bytes, 3500590 queue_step, 3DBenchy_ender3_stepcompress_256.serial

Here, ‘bytes’ is just the raw serial file size, and ‘queue_step’ is the number of queue_step commands in the corresponding serial file. So, we can see that with the updated stepcompress code we get, on average, a tiny bit fewer commands, but the amount of data to be transferred increases by 40-60%, depending on the configuration.

I have been printing with this new code a lot lately, and at least all apparent bugs were fixed (I cannot be 100% certain that all bugs were found and eliminated, though). TBH, I must admit that I did not notice any obvious improvements in prints (no degradations either). It is possible that the ringing test looks a tiny bit better with the new code, but I am not sure if I am simply imagining it. It is also possible that I do not possess a sufficiently rigid printer to set the high accelerations where it would be important to reduce ringing further.

In either case, it would be great to have more tests of it on different machines to see whether it actually improves anything, or is just not worth the trouble. So, if you are interested, please give it a try (branch). Naturally, MCU reflashing is required due to the change in the protocol. Also, I’d say that the StealthChop mode introduces other delays and inconsistencies and will probably not be affected by these changes, so it probably makes sense to test in the SpreadCycle mode of the TMC drivers.

@koconnor FYI, and it would be great to hear your thoughts on it.

Implementation-wise, the new code uses a binary search on count, and then uses the least-squares method to determine all the other parameters. Fortunately, the matrices for each count, as well as their LDLT decompositions, can be precomputed, so it is not too computationally intensive on the host side.
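As an illustration of that fitting step only (ignoring the binary search on count and the fixed-point/shift encoding; all names here are made up, and the real code works in fixed point with precomputed LDLT factorizations), a least-squares fit of (interval, add, add2) against a set of step times might look like this:

```python
# Illustrative host-side sketch: fit (interval, add, add2) for a candidate
# `count` by least squares over the cubic step-time model
#   t[i] ~ interval*i + add*i*(i-1)/2 + add2*i*(i-1)*(i-2)/6
# and check a +/-1.5% per-step error bound.

def solve3(M, b):
    # Cramer's rule for a 3x3 system - fine for a demo of this size.
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(M)
    sols = []
    for col in range(3):
        Mc = [row[:] for row in M]
        for r in range(3):
            Mc[r][col] = b[r]
        sols.append(det(Mc) / d)
    return sols

def fit_step_move(times, count, rel_err=0.015):
    basis = [(i, i * (i - 1) / 2.0, i * (i - 1) * (i - 2) / 6.0)
             for i in range(1, count + 1)]
    t = list(times[:count])
    # Normal equations of the least-squares problem
    M = [[sum(a[r] * a[c] for a in basis) for c in range(3)] for r in range(3)]
    b = [sum(a[r] * ti for a, ti in zip(basis, t)) for r in range(3)]
    interval, add, add2 = solve3(M, b)
    # Verify the per-step error bound: a fraction of the local step interval
    prev, ok = 0.0, True
    for a, ti in zip(basis, t):
        pred = interval * a[0] + add * a[1] + add2 * a[2]
        if abs(pred - ti) > rel_err * (ti - prev):
            ok = False
        prev = ti
    return (interval, add, add2), ok

# Steps generated from an exactly-cubic profile are recovered (almost) exactly:
times = [1000 * i - 5 * i * (i - 1) / 2 + 0.1 * i * (i - 1) * (i - 2) / 6
         for i in range(1, 11)]
coef, ok = fit_step_move(times, 10)
print(coef, ok)
```

A real implementation would additionally round the fitted parameters to the integer encoding and re-verify them against the hard per-step error bounds, rejecting the candidate count if they are violated.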


And the files that I used to generate the charts: (4.6 MB)

The charts were generated with commands like

python3 klipper/scripts/motan/ test_rt_x_sprcl_256_interp_stepcompress -d 30 -g '[["adxl345(hotend,x)","trapq(toolhead,x_accel)","derivative(derivative(stepq(stepper_x)))"], ["adxl345(bed,y)","trapq(toolhead,y_accel)","derivative(derivative(stepq(stepper_y)))"], ["trapq(toolhead,x_velocity)","derivative(stepq(stepper_x))"], ["trapq(toolhead,y_velocity)","derivative(stepq(stepper_y))"]]'

and zooming and panning the chart appropriately

Wow. Very interesting.

It’ll take me some time to review your code to better understand the system you’ve built. In the meantime, I’ll make a few posts describing my past experiences with the stepcompress code. Maybe you’ll find something interesting in that experience, or be able to describe how you’ve approached the past issues I’ve run into.

Looking briefly at your code, it seems to use a least squares system to calculate the interval/add/count (or equivalent) parameters. In the past, I’ve been leery of using least squares because of my goal of ensuring that there are hard bounds on the error range of every step. It seems you’ve managed to deploy a least squares implementation that also validates the maximum error ranges. That is interesting.

As above, I’ll follow up with additional posts on my past experiences.



A few years ago I explored converting the step compression code from using “piecewise quadratic functions” to “piecewise cubic functions”. That is, I mocked up code to convert from the current queue_step commands with interval/count/add to queue_step commands with interval/count/add1/add2. Ultimately, I abandoned the work.

At a high-level I view the stepcompression code as a “compression” system and not part of the “kinematics”. That is, I view the main goal is to efficiently communicate the array of step times (or a close approximation of them) to the micro-controller and to facilitate the micro-controller executing those steps at the requested times.

One can view the stepcompression code as creating a “piecewise math function”. Something like:

step_time(s) = { f1(s) if s in n1..n2; f2(s) if s in n2..n3; f3(s) ... }

After working on a “cubic” implementation, I came to the conclusion that I could likely obtain any desirable level of precision by increasing the number of “pieces” instead of increasing the complexity of each “piece function”.

Said another way, I concluded that more “quadratic pieces” would be a better alternative to “cubic pieces”. One of the reasons for that conclusion is the cost of the additional “add1 += add2” in the micro-controller. That math can actually be significant - mostly because the current stepper function is so efficient. For example, it takes the samd51 just 39 cycles to execute a step today. If that increases by just 5 cycles (2 cycles to load add2, 1 cycle to perform the add, and 2 cycles to write add1 back to memory) then it would only be an advantage if the code saved enough “stepper_load_next()” calls to make up for the extra work done on each step. Said another way, if a print has 10 million steps in it, one would need to save 50 million cycles by reducing “stepper_load_next()” calls in order for the trade-off to be effective. My conclusion is that this was unlikely to occur in practice.

One interesting test you may want to run is to put the current code into “lossless compression mode”. There’s an undocumented config option in the mcu config section called max_stepper_error. You should be able to set it to zero in the config and generate mcu test output. This will, of course, increase the size of the resulting output - both total bytes and total number of queue_step commands. But I’ve found it useful for getting a feel for an “upper bound” of the current implementation.

(As an aside, the “max_stepper_error” option goes all the way back to the initial public release of Klipper. Trying to actually run prints with it set to some non-standard value is likely to result in unpredictable results. I’ve only found it interesting during low-level development.)

Another useful tool is the scripts/ program. It takes a text version of the “klippy batch” output and provides some statistics on the resulting queue_step commands. For example, something like:

~/klippy-env/bin/python klippy/ printer.cfg -i test.gcode -o myoutput -v -d out/klipper.dict
~/klippy-env/bin/python klippy/ out/klipper.dict myoutput > myoutput-parsed
scripts/ myoutput-parsed

And, for example, if comparing stats between different runs, one should see the same number of steps taken, the same number of direction changes, the same number of steps in each direction - but the total number of queue_step commands may change depending on stepcompress changes.

Finally, I should point out that my earlier attempts at the “cubic code” were fairly primitive. I was able to generate test files, but I didn’t come up with an efficient implementation (I vaguely recall using some exponential running time code that took hours to process large files). I also didn’t implement the precision tricks using a “shift field”. It was effectively just “count” iterations of “interval += add1; add1 += add2”. It is possible my analysis was flawed.

Separately, during those tests a few years ago it occurred to me that maybe “quadratic pieces” was too complex and instead one could just use many “linear pieces”. I concluded this was not the case, however. As another high-level observation, I concluded that, over short time frames, “add” was roughly equivalent to linear acceleration and that one could meaningfully approximate a real acceleration curve using it. Using purely linear would result in so many “pieces” during acceleration that it would burden the mcu. In case anyone is curious, I wrote about this at:


I have also noticed timing artifacts in the “motan analysis” that I think are related to step compression. (I have not observed anything in print quality that I could attribute to step compression, though.) In case anyone is curious, I wrote about this at:

I spent a bunch of time earlier this year working on possible alternatives. I’ll spoil the ending though - I ran into challenges and “shelved it” about 3 months ago. I did have some successes, but I was not ultimately able to come up with an efficient algorithm with sufficient benefits to justify the code churn. (Said another way, it was slow, it still had scheduling quirks, and was likely buggy.) I did want to return to it, but so far I haven’t found time.

FWIW, I approached the problem a little differently. My thinking is that the issue isn’t with the “queue_step protocol”, but is instead a result of the current compress_bisect_add() “greedy algorithm”. The current code builds “queue_step” commands one at a time with the goal of finding “interval” and “add” parameters to obtain the longest possible “count” sequence. I concluded that this leads to problematic results when the algorithm “takes too many steps” in a queue_step command. That is, I observed many examples where the code did a good job of selecting “interval” and “add” parameters that closely matched the requested motion, but then kept taking steps with that “interval” and “add” even after the pattern of requested motion changed. In many cases, this behavior didn’t even reduce the bandwidth or total number of queue_step commands because the next queue_step command (which would match the updated motion pattern) could have absorbed the previous steps without penalty. Said another way, I found many examples where queue_step1 ... count=n ; queue_step2 ... count=m could have been replaced with something like queue_step1 ... count=n-3 ; queue_step2 ... count=m+3 with an overall better match to the requested motion. This behaviour isn’t surprising as it’s a known limitation of “greedy algorithms”.

FWIW, my observation was not that the resulting acceleration was unsmooth - my thinking was that the problem was changes in direction of acceleration. That is, during a toolhead acceleration one would expect the velocity to increase smoothly. I’m not too concerned about the velocity increasing in slightly unsmooth increments, but I was finding cases where the velocity between two steps would actually be lower than the velocity between the previous two steps. This could occur when a queue_step command took too many steps with one pattern such that the next pattern started off with an interval that was greater than the last step of the previous pattern. Said another way, there could be these “micro velocity spikes” between each “queue_step” command. In particular, I was concerned that “stealthchop” could magnify these velocity spikes.

(As an aside, for any internet search algorithms, I think Klipper does an excellent job at step timing - scheduling each step within a few microseconds of its ideal time - and I haven’t seen print artifacts from the above scheduling quirks - my interest is in seeing if the system can be further improved.)

Anyway, that was the approach I took earlier this year - trying to see if there was an alternative to compress_bisect_add() with improved “smoothness” between queue_step commands - ideally without a significant increase in bandwidth.


In my last post I mentioned I did have “some successes” with my efforts earlier this year. In case there is interest, I can give a high-level description of the approach I was pursuing.

The current code produces “queue_step” commands with interval/count/add parameters. For example, one can see this in the batch output - here’s an example of a simple X homing move on a cartesian printer.

queue_step oid=1 interval=4379818 count=1 add=0
queue_step oid=1 interval=31827 count=2 add=-9492
queue_step oid=1 interval=17575 count=3 add=-1793
queue_step oid=1 interval=12292 count=7 add=-599
queue_step oid=1 interval=8496 count=8 add=-216
queue_step oid=1 interval=7201 count=138 add=0
queue_step oid=1 interval=7200 count=111 add=0
queue_step oid=1 interval=7200 count=111 add=0
queue_step oid=1 interval=7200 count=111 add=0

The above results in acceleration followed by constant velocity moves (ultimately arriving at a constant interval of 7200 clock ticks between steps). Although the above can look cryptic, it kinda makes sense. As the stepper accelerates, the interval between steps decreases. This is seen both as an “interval” parameter that decreases as well as a negative “add” parameter.

Looking a little closer, one can see that the “interval” parameter is generally close to the interval of the previous sequence. This makes sense as we expect the motion to be smooth. This leads to an opportunity to reduce overall bandwidth as the code could transmit just the change in interval instead of having to send the (potentially lengthy) interval value explicitly. (I mentioned this in that blog post years ago - ).

The work I did earlier this year used this insight in an effort to design a different host compression algorithm. That is, the current system of “interval”, “count”, and “add” parameters could instead be viewed as “first_add”, “count”, and “add”, where “first_add” is the difference between the interval of the first step in the new sequence and the interval of the last step of the previous sequence.

If we were to encode the example X acceleration steps above with this new way of looking at the data we’d get:

queue_step oid=1 (interval=4379818) first_add=? count=1 add=0
queue_step oid=1 (interval=31827) first_add=-4347991 count=2 add=-9492
queue_step oid=1 (interval=17575) first_add=-4760 count=3 add=-1793
queue_step oid=1 (interval=12292) first_add=-1697 count=7 add=-599
queue_step oid=1 (interval=8496) first_add=-202 count=8 add=-216
queue_step oid=1 (interval=7201) first_add=217 count=138 add=0
queue_step oid=1 (interval=7200) first_add=-1 count=111 add=0
queue_step oid=1 (interval=7200) first_add=0 count=111 add=0
queue_step oid=1 (interval=7200) first_add=0 count=111 add=0

I’ve included the original “interval” parameter for reference - it wouldn’t need to be transmitted explicitly as it could be calculated from the other parameters.
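The first_add values above can be reproduced mechanically: the last interval of a command is interval + add * (count - 1), and first_add is the next command's interval relative to that. A quick sketch (hypothetical helper, for illustration):

```python
def to_first_add(cmds):
    """Compute first_add for a list of (interval, count, add) commands."""
    out, prev_last = [], None
    for interval, count, add in cmds:
        # first_add: this command's first interval vs the previous
        # command's last interval (undefined for the very first command)
        out.append(None if prev_last is None else interval - prev_last)
        prev_last = interval + add * (count - 1)
    return out

# The X homing example from above:
cmds = [(4379818, 1, 0), (31827, 2, -9492), (17575, 3, -1793),
        (12292, 7, -599), (8496, 8, -216), (7201, 138, 0),
        (7200, 111, 0), (7200, 111, 0), (7200, 111, 0)]
print(to_first_add(cmds))
# -> [None, -4347991, -4760, -1697, -202, 217, -1, 0, 0]
```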

The above transformation shows two interesting features - first, that the newer encoding generally uses smaller integers and thus could reduce the amount of data sent to the mcu, and second, that there are indeed quirks in the velocity data. In particular, the “add” (or change between intervals) isn’t steadily decreasing - at one point we actually increase it (first_add=217). Strikingly, if the previous queue_step had just been count=7 then the resulting interval would have been exactly 7200 and there could also have been one less queue_step command.

This is an example of the “micro velocity jumps” due to “taking too many steps” that I mentioned in the previous post.

Taking the above encoding further, one could imagine adding a “count1” parameter alongside the “first_add” parameter - something like: queue_step add1=-1697 count1=1 add2=-599 count2=6 (or, equivalently, send two “queue_step” commands that just have “add” and “count” parameters).

This was the “gist” of my attempt earlier this year to rework step compression. (Or, at least, my most successful attempt.) More concretely, I attempted to find “add1”, “add2”, and “count1” parameters such that I could maximize “totalcount” (count1+count2). It then accepted the “add1/count1” sequence and attempted to find parameters for the longest possible “count2+count3” sequence. (Then accepted “add2/count2” and went on to find longest possible “count3+count4”, and so on.) Alas, trying to maximize “totalcount” with three free variables while still keeping the absolute range constraints proved challenging. I could approximate it, but the resulting data was still quirky and it wasn’t particularly fast. I ultimately paused the work.

Crucially, by not sending an explicit “interval” the algorithm should be more likely to produce smooth transitions between queue_step commands.

Anyway, this concludes my posts on past work. I thought I’d give a little context on my previous experiences. (With the obvious caveat that those attempts had limitations.) Hopefully I haven’t careened too far off topic.



Kevin, thank you, this was very insightful. I will definitely take a look at scripts/ (previously, I just used sed/awk to verify that the number of steps stays the same). I wouldn’t be able to cover all your points in a single reply, but I will post a few random things that may be of interest.

I saw max_stepper_error earlier, but I think it is probably not realistic to have it set to 0? However, it seems possible to modify the minmax_point function to have a smaller error:

     uint32_t lsc = sc->last_step_clock, point = *pos - lsc;
     uint32_t prevpoint = pos > sc->queue_pos ? *(pos-1) - lsc : 0;
-    uint32_t max_error = (point - prevpoint) / 2;
+    uint32_t max_error = (point - prevpoint) / 32;
     if (max_error > sc->max_error)
         max_error = sc->max_error;
     return (struct points){ point - max_error, point };

If it does what I think it does, this should give comparable precision ((-3%, +0%) vs (-1.5%, +1.5%) in my code), though I didn’t test it - I only ran it through the Klipper batch mode. It seems to increase the size for the 3DBenchy_ender3 case to 12196363 bytes (a 22% size increase vs the baseline; my code increases the size by 45% in this case), and the number of generated queue_step commands is 1568486 (vs 1239940 in the baseline and 1151310 in my branch). Overall, not too bad, but I didn’t check if and how it works in practice, or what the velocity and acceleration profiles look like in this case. I should look more into it, though (and run it through the max velocity test described below).

FWIW, I noticed that shift is pretty much mandatory in this case. For example, the difference between add2 == 1 and add2 == 2 is already too great if we consider, say, 128 steps (~340K ticks on the 128th step). So it becomes too coarse and pretty much useless at this rate - at least that’s what I concluded.
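For a concrete sense of the granularity: the add2 term contributes add2 * i * (i-1) * (i-2) / 6 ticks at step i, so without a shift even the smallest possible change of add2 moves the 128th step by an enormous amount:

```python
# Contribution of the cubic term at step i, per unit of add2 (no shift).
def cubic_term(i, add2=1):
    return add2 * i * (i - 1) * (i - 2) // 6

print(cubic_term(128))  # -> 341376, i.e. ~340K ticks at the 128th step
```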

In principle, I was a bit concerned with both (that the required acceleration profile is not maintained, and that the velocity spikes by a noticeable amount), in the sense that either could possibly affect the input shaping. I do not have any substantial evidence of it, though. FWIW, a more complex step encoding schema (which doesn’t have to be tied to the kinematics, but rather should be able to accurately encode the requested steps) may be beneficial for ‘shaping the stepper motion’ to reduce VFA artifacts. I made some experiments with that last year which turned out to be unsuccessful at the time (but that was unrelated to step compression, at least I think so).

Anyway, I agree that the greedy algorithm makes the aforementioned artifacts more apparent, and I thought about this issue too. However, I also didn’t manage to devise a robust solution for it, TBH. The way the ‘cubic’ stepcompress algorithm approaches it (in a very limited fashion) is that it decreases the number of steps generated for the previous sequence by 1, and then adds the last (removed) step to the set of constraints the least-squares algorithm optimizes for - that is, the first step in the generated sequence should be close not only to the next actual step, but also to that “removed” step. This reduces the jitter between moves a bit, though it is not perfect, of course. But also, due to the small maintained error (+/- 1.5%) throughout the generated step sequence, the jumps in velocity become much less apparent.

I did look at the lpc176x code (I have an SKR v1.4 Pro board). After I eliminated some weird code that is unneeded (I assume Klipper runs only on two’s complement MCUs? It wouldn’t have worked on other architectures even previously), the difference between the baseline and my code is 12 instructions along the longest possible branch, if I didn’t make a mistake (I could have, though). So, I’d say it is likely more than your best-case scenario estimate. That said, I find the existing benchmarks in Klipper a bit ‘benchmarky’ and unrealistic. That is, if we want to benchmark the move from (x_min, y_min) to (x_max, y_max) and find the maximum velocity, they are probably very good. However, in real prints users are likely to hit ‘Rescheduled timer in the past’ errors and such if they use the speed limits from the benchmarks, and they will be forced to reduce the maximum printing speed.

So, in order to make a bit more realistic comparison of the two codes, I made the following test:

In essence, it is simply a cylinder 200 mm in diameter and 200 mm in height, sliced in vase mode, at a speed of 2 m/sec and otherwise fairly usual parameters - 0.25 mm layer height, 0.4 mm line width, etc.

Now, before ‘printing’ the test (without filament, and with the heater targets at 0 degrees), I run the following commands:


and then start the print. Basically, set the acceleration limits high enough that they don’t limit the print, and instead increase the maximum velocity limit by 20 mm/sec every 2.5 mm, starting from 100 mm/sec. The bands of 2.5 mm height will be printed at pretty much constant velocity, and the bands give Klipper enough opportunity to fail if it really fails at a given velocity.

The nice thing about this print is that it is relatively demanding on the stepper motion (it uses all 4 axes of the printer, and in a complex fashion - except the Z axis, which is used very lightly but could mimic bed meshing), yet it is not obscure - that is, every 3D printer owner has a chance to face something similar, so it is quite representative of real prints. So I’d say that the velocity limits obtained from this test can really be used in real life, with some safety margin added, naturally. My results were fairly interesting, I think.

  1. Ender 3 Pro, SKR v1.4 Pro, RPi 3, 40 mm rotation distance with 1.8° steppers, 256 microstepping on all axes.
    • baseline code failed at 520 mm/sec
    • my cubic stepcompression code failed at 500 mm/sec
  2. Delta printer, RUMBA (ATmega 2560) MCU, RPi Zero, 40 mm rotation distance with 1.8° steppers, 32 microstepping on a, b, c and 16 on the extruder.
    • baseline code failed at 300 mm/sec
    • my cubic stepcompression code failed at 200 mm/sec

So, while the relative code size increase on the 32-bit ARM MCU seems to be significant, I guess in this case it was not the main decisive factor. I am not 100% sure, but I suspect that stepper_load_next may have sufficient latency that, above a certain step rate, it will make the MCU miss the scheduling cycles. For example, if two or three stepper_load_next calls for several steppers overlap, they may eventually block the scheduled steps for other steppers, causing an error. And even though stepper_load_next is also affected by my code changes, perhaps the effect on its latency is much less noticeable. This is just my theory, though, and I could easily be wrong. In the end, the maximum velocity drops only by ~4% with the new code.

Then, on the 8-bit board, the effect on the maximum velocity is a lot more noticeable. I think that’s just because the new code adds a lot of operations on large numbers, which don’t fit the 8-bit architecture particularly well. I did some optimizations for AVR earlier, but I didn’t spend too much time on it. I doubt that the current code could be optimized significantly beyond what is there already (that is, 10-20% performance optimizations are probably possible, but that’s not enough to cover a 33% velocity drop). However, it is probably possible to write slightly different code specifically for AVR that will have better performance, potentially at the expense of something else. For instance, maybe it could limit shift to 8 and use shorter integers to save precious cycles, and even potentially use only shift = 0, 8 and -8 values to avoid slow shifts altogether. But as I mentioned previously, I’d rather not spend an enormous amount of time polishing the code at this point, since it is unclear whether it even makes sense to integrate it into the mainline. But should such a decision be made, it will be worth exploring further.

Note that the test above is only a suggestion; I’d be happy to hear other proposals on how to reasonably benchmark the stepping code. Here are the logs from my benchmarking attempts (just klippy.log logs): (322.4 KB)

If the goal was to avoid issues with the input shaper, have you tried to measure a resonance graph to see if the new algorithm has any effect on it?

To follow up a bit on this topic. I actually ran the test with uint32_t max_error = (point - prevpoint) / 32; - reduced by a factor of 16 vs the mainline. The results are as follows.

While there are clear improvements over the mainline code (see the charts in my first post), the acceleration is still worse than with the ‘cubic’ implementation. But that’s not all: when I ran it through the cylinder ‘performance test’, it failed at only 340 mm/sec (vs 540 mm/sec in the baseline and 500 mm/sec in my stepcompress branch) on the Ender 3 with 256 microstepping, SKR v1.4 Pro board and RPi 3. The klippy.log indicates that the RPi 3 was very loaded (sysload=1.07 cputime=254.921), but I’m not sure whether the host overload was the reason for the failure. The dump of the stepper queues also indicates that there were lots of short sequences of moves scheduled, which could be the reason the MCU was not able to keep up with them:

queue_step 2119: t=26614896211 p=123932 i=275 c=9 a=0
queue_step 2120: t=26614898687 p=123941 i=276 c=145 a=0
queue_step 2121: t=26614938706 p=124086 i=275 c=9 a=0
queue_step 2122: t=26614941182 p=124095 i=276 c=103 a=0
queue_step 2123: t=26614969609 p=124198 i=275 c=10 a=0
queue_step 2124: t=26614972360 p=124208 i=276 c=72 a=0
queue_step 2125: t=26614992231 p=124280 i=275 c=10 a=0

In any case, my conclusion is that it is likely not very realistic to just reduce the size of the segments and obtain better precision - the acceleration/velocity issues with overshooting do not entirely go away, and in addition it severely increases the required processing power (likely on both the host and MCU side).


I definitely agree with that. As I was trying to improve the performance of my stepcompress branch, I came to the conclusion that it is very critical to generate queue_step commands with “large” count values. That does not mean that the efficiency of scheduling a single step on the MCU isn’t important, but the former plays as large a role, if not bigger. In fact, I was able to improve the maximum attainable velocity in the ‘cylinder test’ on my branch by optimizing the 32-bit version of the MCU code a bit and noticeably improving the step_move generation code (by replacing the binary search with a golden-ratio search). Now it has a maximum velocity comparable with mainline Klipper (it also fails at 540 mm/sec). I will polish the code a bit, make some AVR optimizations, and push it to my branch.
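For anyone curious what is meant by a golden-ratio search here, below is a generic sketch (not the actual branch code; it assumes the objective - e.g. how long a step sequence a candidate count yields - is unimodal in count, and for brevity it re-evaluates the objective instead of caching evaluations as a true golden-section search would):

```python
import math

INVPHI = (math.sqrt(5) - 1) / 2  # ~0.618, the inverse golden ratio

def golden_section_max(f, lo, hi):
    """Integer argmax of a unimodal f on [lo, hi]."""
    while hi - lo > 2:
        gap = hi - lo
        # Probe points placed by the golden ratio; the max() keeps
        # x1 and x2 distinct on small integer ranges.
        d = max(round(INVPHI * gap), gap // 2 + 1)
        x1, x2 = hi - d, lo + d
        if f(x1) < f(x2):
            lo = x1   # maximum cannot be left of x1
        else:
            hi = x2   # maximum cannot be right of x2
    return max(range(lo, hi + 1), key=f)

print(golden_section_max(lambda c: -(c - 37) ** 2, 0, 100))  # -> 37
```

Compared to a plain binary search on count, this only needs the objective to be unimodal rather than monotonic, which is presumably why it suits the "longest encodable sequence" problem better.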

Thanks, that is a good suggestion. I think I did run such a test and noticed some differences, but I will repeat the tests with the latest code and post the results. Additionally, it would likely be even more interesting to measure the acceleration during a [short part of a] print, compute the ‘ideal’ acceleration profile as it should be, subtract one from the other, and compute and plot the frequency response charts for mainline Klipper vs my branch. That may reveal some patterns of differences between the two branches ‘in the real world’.

BTW, here are the logs from my tests with the modified Klipper mainline code: (1.4 MB)

Well, I did not mean that the current code could be easily tweaked to produce the “desired level of precision”. I meant that, in the general case, it should be possible for code in the host to generate a sequence of quadratic pieces to any desirable level of precision. I agree that the current code does poorly with transitions between queue_step commands. (As before, my conclusion is that the greedy algorithm with an absolute time range invariant leads to poor choices that cause notable velocity swings between queue_step commands.)

Because of my high-level conclusion, my focus has been on the host code and not on the mcu code.

The change to max_error that you’ve shown doesn’t look complete to me. The current code ensures that every step is within a time range of 25us from its ideal time, or within a range of half the time since the last step - whichever range is smaller. It seems you are reducing the second part of that invariant (half the time since the last step converted to 1/32nd the time since the last step), but don’t seem to be changing the first part (the 25us). There is an undocumented max_stepper_error in the mcu config that you can test with if you wish. However, as above, I agree this isn’t a solution in itself, as it only constrains the absolute time range and doesn’t significantly constrain the velocity jumps.

Agreed. This was the issue I was having with my “cubic” code from years ago. It was not significantly reducing the number of queue_step commands - certainly not enough to make up for the higher overhead on each step. Worse, my cubic code was best at reducing the number of queue_step commands when the toolhead was moving slowly, and had little impact when the toolhead was moving fast. That pattern isn’t of much use because there is oodles of extra cpu time when moving slowly - the constraints occur when the toolhead is moving at high speed. That said, it is possible my old cubic code was too primitive to get meaningful results.

Alas, I haven’t had time to look through your code. I definitely would like to understand it.


I was looking through your step compress code. I think I have a high-level understanding of the major parts of it:

  1. The mcu code and protocol have been converted from a queue_step command with interval/count/add (ie, “quadratic”) to interval/count/add1/add2 (ie, “cubic”).
  2. The mcu code and protocol have been extended to support scalable precision for add1 and add2 (ie, “shift”). Thus, both add1 and add2 can describe interval increments with precision greater than a single mcu clock tick.
  3. The host code selects interval/count/add1/add2 parameters for each queue_step command with the goal of finding the longest valid sequence (greatest count). In order for the sequence to be valid, each actual step time must be within a strict time range of the requested step time. In this regard it is similar to the current compress_bisect_add() greedy algorithm. It finds the interval/add1/add2 parameters by solving a sequence of linear equations (ie, “least squares”).
  4. Due to the overhead of solving the equations, the code implements an exponential search followed by a binary search. That is, it tries exponentially larger values of count until it finds a value of count that is too large, and then bisects within the last count range to find the likeliest longest valid “count”.
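If I've read it right, that search strategy could be sketched roughly like this (a simplified sketch with an assumed validity callback and an assumed monotonicity of validity in count, not the actual code from the branch - the thread later notes validity isn't strictly monotone in practice):

```c
#include <stdbool.h>

/* Hypothetical validity predicate: in the real branch this would fit
 * interval/add1/add2 by least squares for the given count and simulate
 * the resulting steps against the error bounds; here it is just a
 * callback so the search structure is visible. */
typedef bool (*valid_fn)(int count);

/* Exponential search to bracket the longest valid count, followed by a
 * binary search inside the bracket. */
static int find_longest_count(valid_fn valid, int max_count)
{
    if (max_count < 1 || !valid(1))
        return 0;
    int lo = 1, hi = 2;
    /* Exponential phase: double the attempted count until it fails */
    while (hi <= max_count && valid(hi)) {
        lo = hi;
        hi *= 2;
    }
    if (hi > max_count + 1)
        hi = max_count + 1;
    /* Binary phase: the longest valid count is now in [lo, hi) */
    while (lo + 1 < hi) {
        int mid = lo + (hi - lo) / 2;
        if (valid(mid))
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Demo predicate: pretend sequences of up to 37 steps are encodable. */
static bool demo_valid(int count) { return count <= 37; }
```

With the demo predicate, `find_longest_count(demo_valid, 100)` performs about 7 doublings and bisections instead of 37 linear probes.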

Some interesting additions - the “least squares” seems to be both “method” and “invariant” in that it seems to be used both to find valid queue_step parameters and to ensure that the queue_step parameters largely follow the requested motion. There is also a mechanism to encourage the first step in a queue_step to be smooth with respect to the last step in the previous queue_step. Also, even if a set of interval/add1/add2 parameters isn’t valid for the full attempted count length, it can still compete for the “longest range” - up to the number of steps that are valid for the given parameters. This raises the potential that future steps could influence the sequence of steps leading up to them.

Did I understand the above correctly?


Thank you, Kevin. In general, your high-level summary is accurate. Some comments inline below:

Yes, and interval too. Perhaps less useful; however, it can now better approximate, e.g., interval=700.5. The current code has to interleave queue_step commands with interval=700 and interval=701.

Yes, correct. It is pretty much a greedy algorithm too. Just the way to find corresponding interval/add1/add2 parameters is different from the current Klipper code. Also, with stricter constraints, it is less likely to ‘overshoot’, and due to the cubic approximation of steps, it can recover from overshooting more gracefully (that is, it doesn’t have to start with a vastly different next interval to account for the overshooting).

Yes. FWIW, the exponential search and the bisection-like algorithm are meant to reduce the algorithmic complexity of finding count to O(count * log(count)) vs O(count^2) in the case of simply testing all count=1, 2, … values until an optimum is found. You could say this is to reduce the overhead of the Least Squares method, and that is true to a large degree, but finding the parameters for a single count value only requires O(1) operations. The real cost is that testing those parameters requires simulating all steps i=1,…,count, which is where O(count^2) comes from. In any case, it probably speeds up the algorithm by a factor of 10-20 for count=100-200.

Since the Least Squares method does not guarantee any error bounds for individual points, the only way to enforce them is to actually check the steps one by one. Depending on the actual steps and the sequence length, such a test may fail at the very beginning, somewhere in the middle, or closer to the end of the step sequence, because minimizing the total squared error may make it beneficial to let the errors be large in some specific part of the sequence. Failures at the beginning cannot be recovered from and make the generated parameters unusable; to reduce them, I added the part that tries to make the first step close to the expected last step of the previous sequence (which is within the tolerated error bounds), and it also ensures better smoothness of the intervals. A failure in the middle of the sequence cannot be helped and typically indicates that the length of the sequence is too great. A failure closer to the end of the sequence still allows the generated parameters to be used for the most part (and the code takes advantage of that), and it is also often beneficial to increase the length of the sequence being approximated in the hope that the errors will be pushed towards the end of the new, longer sequence, increasing its usable length. That’s what the exponential and binary searches are essentially trying to do.
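As a rough sketch of that per-step validation (the function name is illustrative, and the comparison against the requested times is done in floating point here for brevity; the actual code works in fixed point throughout):

```c
#include <math.h>
#include <stdint.h>

/* Simulate the cubic stepping with fitted parameters and return how
 * many leading steps stay within max_error of the requested step times.
 * interval/add1/add2 are kept in a fixed-point scale, shifted left by
 * 'shift' bits; req_times are relative to the sequence start. */
static int usable_count(int64_t interval, int64_t add1, int64_t add2,
                        int shift, const double *req_times, int count,
                        double max_error)
{
    int64_t t = 0;  /* accumulated step time, in shifted units */
    for (int i = 0; i < count; i++) {
        t += interval;
        interval += add1;  /* cubic update, as in the mcu inner loop */
        add1 += add2;
        double step_time = (double)t / (double)(1ll << shift);
        if (fabs(step_time - req_times[i]) > max_error)
            return i;  /* failing here truncates the usable sequence */
    }
    return count;
}
```

A failure at i=0 makes the parameters unusable; a failure near the end still leaves a long usable prefix, which is what the exponential/binary search exploits.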

Unfortunately, the function usable_count = F(attempted_count) is not necessarily smooth and does not necessarily have a single local extremum, because depending on the step sequence, the maximum error may be left at different parts of the sequence by the least squares method for close attempted_count values. Then the fixed point arithmetic may play out a bit differently for different interval/count/add1/add2 parameters. As a result, the bisect/exponential algorithm may not find an optimal value, but rather hopefully some ‘good enough’ value.

Yes, that is quite possible, and may further improve the smoothness of the generated step sequence.

BTW, I pushed a slightly more optimized version of the MCU code, which, in case of 32 bit arm, compared to the baseline, has

  • 5 more instructions to execute - in stepper_load_next, and 9-10 more instructions - in stepper_event_full,
  • 12-14 more instructions in SysTick_Handler to execute,
  • 16-17 more instructions - in command_queue_step in the longest codepath.

Separately I had a question on your comment:

I indeed only did the first part, as I was trying to reduce the errors in velocities and accelerations. It seems that this improved the velocity errors substantially (somewhat less so for the accelerations). However, max_stepper_error serves as an upper bound on the errors, right? So, I imagine it has the most impact on the positional accuracy (which is already quite good), but less so on the velocity and acceleration errors. But maybe I am missing something? If yes, how would you suggest adjusting it? Also reduce it by a factor of 16? FWIW, I’m not changing this parameter in my stepcompress code changes, and still it seems that all errors are improved quite substantially just by constraining the relative errors between the generated and the actual requested steps; so at least in that case it seems to be sufficient.

Ah, I missed that.

FWIW, I’m not fully understanding the “shift” code. Couldn’t one just do:

    s->interval += s->add1;
    s->add1 += s->add2;
    s->time.waketime += s->interval >> s->shift;

For that matter, I’d guess one could reduce CLOCK_DIFF_MAX in stepcompress.c and perform a hard-coded shift on all moves that have a count greater than 1 (eg, making the last line s->time.waketime += s->interval >> 8).

I suspect I’m missing something.

Yes - an upper bound. I’d guess that going to 1/32nd between steps is severely constraining step timing (and thus increasing mcu load_next_step() overhead). It might be possible to get similar precision improvements by, for example, halving max_stepper_error and only going to 1/8th between steps. However, as before, constraining the absolute time is unlikely to noticeably reduce velocity jumps between queue_step commands, so probably not worth spending time on.


Just so that we are on the same page, did you look at the latest code I pushed (commit c8f8f760fbbbe0f984407cd3b67a875384028ff1)? I basically did something similar there. I guess initially I wanted to avoid a non-constant shift (>>) because AVR only has a shift-by-one. But later I suspected that I had over-engineered the code. Now the implementation does a similar thing to your suggestion on 32 bit architectures with a varying shift parameter, and pre-shifts on AVR so that it can have only shift==0, 8 or 16.

Then there is a separate variable int_low_acc which stores and accumulates the low part of the interval (basically, interval & ((1 << shift) - 1)) that does not get added to s->time.waketime. This part is necessary to hold and accumulate the fractional part of the interval. Let’s say we have interval = 200.5, add = 0, add2 = 0. If we just ignore the fractional part, over 100 moves we will accumulate an error of 50 clocks. By maintaining int_low_acc, we can generate steps of 200 and 201 ticks as appropriate and maintain the high accuracy of the stepping.
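A minimal sketch of that accumulation (the struct and field names here are assumptions for illustration, not the actual MCU code):

```c
#include <stdint.h>

/* Sketch of fractional interval accumulation.  'interval' is fixed
 * point, scaled by 1 << shift; only whole clock ticks are added to
 * waketime, and int_low_acc carries the remaining fraction so that no
 * ticks are lost over a long step sequence. */
struct sched_sketch {
    uint64_t waketime;
    uint32_t interval;     /* fixed point: ticks << shift */
    uint32_t int_low_acc;  /* accumulated fraction, < (1 << shift) */
    uint32_t shift;
};

static void advance_one_step(struct sched_sketch *s)
{
    uint32_t frac_mask = (1u << s->shift) - 1;
    uint32_t sum = s->int_low_acc + (s->interval & frac_mask);
    /* integer part of the interval, plus any carry from the fraction */
    s->waketime += (s->interval >> s->shift) + (sum >> s->shift);
    s->int_low_acc = sum & frac_mask;
}
```

With interval = 200.5 in 16.16 fixed point ((200 << 16) + (1 << 15)), the steps alternate between 200 and 201 ticks, and 100 steps advance waketime by exactly 20050 ticks rather than 20000.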

Possibly, but I agree that it is perhaps not worth spending too much time on it. At the end of the day, the magnitude of the velocity jumps today is primarily driven by the maximum relative step error.

Thanks - that’s the part I was missing. The fractional part has to be accumulated for both the contributions to interval and to interval’s contribution to waketime.


I got a chance to look a little further at the math in the stepcompress.c code.

As I understand it:

  1. compute_rhs_3() is used to generate the “right hand side” of the linear equation being solved. That is, if we think of the linear equations as A * B = C then compute_rhs_3() is calculating the C 3x1 matrix. In terms of “least squares”, compute_rhs_3() calculates the 3x1 matrix resulting from matrix_transpose(constants_matrix) * step_times.
  2. fill_least_squares_matrix_3x3() calculates the A matrix of the linear equation being solved. In terms of “least squares” it is calculating the 3x3 matrix resulting from matrix_transpose(constants_matrix) * constants_matrix.
  3. The linear equation (eg, A * B = C) is solved for B by first calculating the LDL decomposition of the A matrix and then using forward and back substitution. The former is done by compute_ldl_3x3() and the latter by solve_3x3().
  4. As an optimization, since the A matrix only depends on count and does not depend on step_times, the A matrices for the first 1024 possible count sequences are pre-calculated.
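At a sketch level, I believe the linear algebra looks like the following (using a generic quadratic basis y ~ b0 + b1*x + b2*x^2 rather than the branch's actual basis functions, so the function bodies here only mirror the roles of the real routines):

```c
/* Build the normal equations A = C^T*C and c = C^T*y for a quadratic
 * least-squares fit over n sample points - the roles of
 * fill_least_squares_matrix_3x3() and compute_rhs_3(). */
static void build_normal_eqs(int n, const double *xs, const double *ys,
                             double A[3][3], double c[3])
{
    for (int r = 0; r < 3; r++) {
        c[r] = 0.;
        for (int s = 0; s < 3; s++)
            A[r][s] = 0.;
    }
    for (int i = 0; i < n; i++) {
        double basis[3] = { 1., xs[i], xs[i] * xs[i] };
        for (int r = 0; r < 3; r++) {
            c[r] += basis[r] * ys[i];
            for (int s = 0; s < 3; s++)
                A[r][s] += basis[r] * basis[s];
        }
    }
}

/* Solve the symmetric system A*b = c via an LDL^T decomposition with
 * forward and back substitution - the roles of compute_ldl_3x3() and
 * solve_3x3(). */
static void solve_ldl_3x3(const double A[3][3], const double c[3],
                          double b[3])
{
    double d[3], L10, L20, L21, y[3];
    d[0] = A[0][0];
    L10 = A[1][0] / d[0];
    L20 = A[2][0] / d[0];
    d[1] = A[1][1] - L10 * L10 * d[0];
    L21 = (A[2][1] - L20 * L10 * d[0]) / d[1];
    d[2] = A[2][2] - L20 * L20 * d[0] - L21 * L21 * d[1];
    /* Forward substitution: L*y = c */
    y[0] = c[0];
    y[1] = c[1] - L10 * y[0];
    y[2] = c[2] - L20 * y[0] - L21 * y[1];
    /* Diagonal scaling and back substitution: L^T*b = D^-1*y */
    b[2] = y[2] / d[2];
    b[1] = y[1] / d[1] - L21 * b[2];
    b[0] = y[0] / d[0] - L10 * b[1] - L20 * b[2];
}
```

Since A depends only on the basis (ie, on count) and not on the sampled step times, the decomposition part can be precomputed per count, as in point 4.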

Did I understand this correctly?

Seems like a clever system.

At a high-level, it seems that this system is a notable change from the existing system, in that the existing system has a very fast mcu step update, while this new code requires more work on each step. (The current code has just a handful of instructions to calculate next_wake += interval; interval += add.) It seems this new code does more calculation during the mcu step update, with the goal of having fewer overall queue_step commands. As mentioned earlier, I’m not sure of the overall benefits of making that trade-off.

Out of curiosity, do you find that there are still “velocity jumps” between queue_step commands with this new code? Since it uses the “greedy algorithm” to determine the maximum number of steps to take in a queue_step sequence, I would guess that it may still “take too many steps” and have to correct for that with a velocity jump at the start of the next queue_step sequence.

Separately, I do wonder if “least squares” could be used to enhance the current “piecewise quadratic” queue_step system. Specifically, I’m wondering about using “least squares” both to choose the “add” parameter and in choosing the “count” parameter so as to avoid velocity spikes between queue_step commands. I’ll try to put together some test code as I get time.


FWIW I tried your branch @dmbutyugin and printed some parts.
Worked nicely with my setup but I’m not sure that I noticed any differences (might be a tad smoother in the movements but pretty much subjective)

Side note:
I tried to do the Klipper benchmark with my board (BTT SKR 1.4T - LPC1769). The default branch worked, although I did not reach the values given in the documentation (Ticks 222): Ticks 231 → Test result is: 1558K

The stepcompress branch I could not get to work:

allocate_oids count=3
config_stepper oid=0 step_pin=P1.20 dir_pin=P1.18 invert_step=-1 step_pulse_ticks=0
config_stepper oid=1 step_pin=P1.21 dir_pin=P1.18 invert_step=-1 step_pulse_ticks=0
config_stepper oid=2 step_pin=P1.23 dir_pin=P1.18 invert_step=-1 step_pulse_ticks=0
finalize_config crc=0
028.933: stats count=82 sum=84968 sumsq=5504041
033.934: stats count=67 sum=32363 sumsq=152732
038.934: stats count=67 sum=32358 sumsq=152679
043.934: stats count=67 sum=32358 sumsq=152679
SET start_clock {clock+freq}
SET ticks 231

reset_step_clock oid=0 clock={start_clock}
set_next_step_dir oid=0 dir=0
queue_step oid=0 interval={ticks} count=60000 add=0
set_next_step_dir oid=0 dir=1
queue_step oid=0 interval=3000 count=1 add=0

reset_step_clock oid=1 clock={start_clock}
set_next_step_dir oid=1 dir=0
queue_step oid=1 interval={ticks} count=60000 add=0
set_next_step_dir oid=1 dir=1
queue_step oid=1 interval=3000 count=1 add=0

reset_step_clock oid=2 clock={start_clock}
set_next_step_dir oid=2 dir=0
queue_step oid=2 interval={ticks} count=60000 add=0
set_next_step_dir oid=2 dir=1
queue_step oid=2 interval=3000 count=1 add=0
Eval: SET start_clock 6752409628
Eval: reset_step_clock oid=0 clock=6752409628
Eval: queue_step oid=0 interval=231 count=60000 add=0
Error: Unable to encode: queue_step
Error: Unable to encode: queue_step
Eval: reset_step_clock oid=1 clock=6752409628
Eval: queue_step oid=1 interval=231 count=60000 add=0
Error: Unable to encode: queue_step
Error: Unable to encode: queue_step
Eval: reset_step_clock oid=2 clock=6752409628
Eval: queue_step oid=2 interval=231 count=60000 add=0
Error: Unable to encode: queue_step
Error: Unable to encode: queue_step

FYI, to run the benchmark on @dmbutyugin’s code, find and replace any commands that look like queue_step oid=2 interval=231 count=60000 add=0 with commands that look like queue_step oid=2 interval=231 count=60000 add=0 add2=0 shift=0. That is, add add2=0 shift=0 to all queue_step commands.



Thanks, Kevin!

Yes, you got all above correct.

FWIW, the goal wasn’t really to reduce the number of generated queue_step commands, but rather to increase the precision of the stepping, as it seems that the current schema does not allow reducing the stepping errors by, say, 10x while maintaining similar stepping rates. Fewer queue_step commands do help improve performance a bit, but that is insufficient to reach the performance of the current code, since indeed it spends only a few cycles to generate the next step. In principle, my point was that for practical printing, people are likely limited not by the maximum attainable step rate, but rather by other factors - e.g. if stepper_move is depleted for 2 or 3 steppers simultaneously (which may happen) and the MCU has to take the slow stepper_load_next path, then it is possible that some timers will be scheduled in the past if the step rate is too high. So it might be that in real setups the performance hit will not be that significant. However, I do agree that just complicating the code and making it slower for no apparent benefits wouldn’t make a lot of sense.

Well, due to the much tighter constraints on the step timing, I wouldn’t say the new code really generates ‘velocity jumps’. Also, it is harder to conclude that the greedy algorithm ‘overshot’ because the third-order polynomial schema can recover from it more naturally, without a need to introduce a velocity jump. However, what I did observe was that the new code would from time to time generate (just an illustrative example):

queue_step interval=20000000 count=200 add=100 add2=10 shift=16
queue_step interval=20000000 count=200 add=100 add2=-10 shift=16
queue_step interval=20316160 count=20 add=-20000 add2=1000 shift=16
queue_step interval=20000000 count=200 add=100 add2=10 shift=16

Basically, sometimes it would generate a queue_step command with a small count parameter and large add and add2 values, surrounded by queue_step commands with large count and small add and add2 parameters. Upon closer inspection, such cases turned out to be similar to ‘overshooting’ cases, meaning that the first step in that irregular queue_step command had to be noticeably different from the rest of the steps (due to the accumulation of stepping errors in the previous command). So, I made this commit that tries to truncate queue_step commands more intelligently to prevent that kind of overshooting. And I think it improved the situation substantially (though I’m unsure whether it entirely eliminated that possibility).

Maybe that can be done. But the current code, perhaps, does a better job at handling integer values of interval and add.

Thank you, Sineos, for giving it a try. FWIW, I’m unsure if my code changes could result in ‘smoother movements’. So, it might be that your impression wouldn’t pass a blind test :slight_smile: I thought that the mainline Klipper code could affect input shaping a bit (but I didn’t observe improvements on my printer), and that it may also create some audible noise (a few hundred or a few thousand Hz) if the steppers are running in spreadcycle mode, due to the periodic velocity jumps; however, that may not be noticeable over the noise produced by a stepper motor itself when it’s running in spreadcycle mode.

Yes, as written above, a quite subjective statement :innocent:
Let’s agree on: I did not notice any detrimental effects and the prints came out as expected.

Does not work for me.

allocate_oids count=3
config_stepper oid=0 step_pin=P1.20 dir_pin=P1.18 invert_step=-1 step_pulse_ticks=0
config_stepper oid=1 step_pin=P1.21 dir_pin=P1.18 invert_step=-1 step_pulse_ticks=0
config_stepper oid=2 step_pin=P1.23 dir_pin=P1.18 invert_step=-1 step_pulse_ticks=0
finalize_config crc=0
349.392: stats count=82 sum=85008 sumsq=5504992
354.393: stats count=67 sum=32335 sumsq=152086
SET start_clock {clock+freq}
SET ticks 231

reset_step_clock oid=0 clock={start_clock}
set_next_step_dir oid=0 dir=0
queue_step oid=0 interval={ticks} count=60000 add=0 add2=0 shift=0
set_next_step_dir oid=0 dir=1
queue_step oid=0 interval=3000 count=1 add=0

reset_step_clock oid=1 clock={start_clock}
set_next_step_dir oid=1 dir=0
queue_step oid=1 interval={ticks} count=60000 add=0 add2=0 shift=0
set_next_step_dir oid=1 dir=1
queue_step oid=1 interval=3000 count=1 add=0

reset_step_clock oid=2 clock={start_clock}
set_next_step_dir oid=2 dir=0
queue_step oid=2 interval={ticks} count=60000 add=0 add2=0 shift=0
set_next_step_dir oid=2 dir=1
queue_step oid=2 interval=3000 count=1 add=0
Eval: SET start_clock 43821357194
Eval: reset_step_clock oid=0 clock=43821357194
Eval: queue_step oid=0 interval=231 count=60000 add=0 add2=0 shift=0
Error: Unable to encode: queue_step
358.575: shutdown clock=752173603 static_string_id=Invalid count parameter
Eval: reset_step_clock oid=1 clock=43821357194
358.575: is_shutdown static_string_id=Invalid count parameter
358.577: is_shutdown static_string_id=Invalid count parameter
Eval: queue_step oid=1 interval=231 count=60000 add=0 add2=0 shift=0
358.579: is_shutdown static_string_id=Invalid count parameter
358.579: is_shutdown static_string_id=Invalid count parameter
Error: Unable to encode: queue_step
358.581: is_shutdown static_string_id=Invalid count parameter
Eval: reset_step_clock oid=2 clock=43821357194
Eval: queue_step oid=2 interval=231 count=60000 add=0 add2=0 shift=0
358.583: is_shutdown static_string_id=Invalid count parameter
358.583: is_shutdown static_string_id=Invalid count parameter
358.585: is_shutdown static_string_id=Invalid count parameter
358.586: is_shutdown static_string_id=Invalid count parameter
359.393: stats count=102 sum=71864 sumsq=455272