Itersolve/InputShaper/PressureAdvance processing overhead

The “problem”: the itersolver can take a significant amount of CPU time.
How much really depends on the specific printer config, print speed, and slicer precision.

How I profile the code (Python >= 3.12):

$ export PYTHONPERFSUPPORT=1
$ time perf record -F 9999 -g -- python3 -X perf klippy/klippy.py rp2040_printer.cfg -i short.gcode -d out/klipper.dict -o test.serial -v
$ perf script > out.perf

How I view the output: speedscope

I tested with arc fitting disabled, with the example model:

It takes ~57s:

With patched InputShaper, it takes ~49s.

49/57 ≈ 0.86, i.e. about 14% less run time.
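To make the arithmetic explicit, here is a minimal sketch that turns the two timings above into a run-time reduction and a speedup factor. The helper names are hypothetical and only the 57 s and 49 s figures come from the measurements.

```python
def runtime_reduction(before_s: float, after_s: float) -> float:
    """Fraction of wall-clock time saved relative to the baseline run."""
    return 1.0 - after_s / before_s


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the patched run is than the baseline."""
    return before_s / after_s


# Timings from the runs above: ~57 s unpatched, ~49 s patched.
print(f"{runtime_reduction(57, 49):.1%}")  # 14.0%
print(f"{speedup(57, 49):.2f}x")           # 1.16x
```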

What’s the patch?
It caches the input shaper's intermediate position between calls.
I don’t think it is a clever solution. It is just what I could come up with.
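To illustrate the general idea only (this is not the actual patch, which lives in the C chelper code), here is a toy sketch of a shaper that remembers its last result. It assumes the shaped position is a weighted sum of the unshaped position sampled at a few time offsets, and that the solver sometimes asks for the same time twice. `CachedShaper`, `base_position`, and the pulse values are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


class CachedShaper:
    """Toy shaper: position(t) = sum(amp * base_position(t + offset))."""

    def __init__(self, base_position: Callable[[float], float],
                 pulses: Sequence[Tuple[float, float]]):
        self.base_position = base_position
        self.pulses = list(pulses)   # (amplitude, time offset) pairs
        self.base_calls = 0          # how often the unshaped position was computed
        self._last_time: Optional[float] = None
        self._last_pos = 0.0

    def position(self, t: float) -> float:
        # Reuse the previous result if the same time is requested again.
        if t == self._last_time:
            return self._last_pos
        total = 0.0
        for amp, offset in self.pulses:
            self.base_calls += 1
            total += amp * self.base_position(t + offset)
        self._last_time, self._last_pos = t, total
        return total


# Hypothetical two-pulse shaper over a constant-velocity move.
shaper = CachedShaper(lambda t: 10.0 * t, [(0.5, -0.01), (0.5, 0.01)])
first = shaper.position(1.0)
second = shaper.position(1.0)   # served from the cache
print(shaper.base_calls)        # 2, not 4
```

A single-entry cache like this only helps when repeated queries hit exactly the same time; whether that matches the real call pattern is an assumption here.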

With chelper -O0, not patched:

So, from my previous investigations:
Basically, the itersolver can make multiple tries around the same move, and if there are many small moves to travel through, the two counts multiply.
For example, 100 tries × 100 moves ≈ 10,000 iterations.
(In practice it is more tries and fewer moves, but it shows the idea.)
Arcs also tend to produce small moves,
which can also create a high load there.
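The multiplication can be shown with a toy counter. This is only a sketch: plain bisection with a fixed number of tries stands in for the real step search in chelper, which may work differently, and `count_evaluations` with its parameters is hypothetical.

```python
def count_evaluations(num_moves: int, tries_per_move: int,
                      move_time: float = 0.001, velocity: float = 50.0) -> int:
    """Count position evaluations when every small move needs its own search."""
    calls = 0

    def position(t: float) -> float:
        nonlocal calls
        calls += 1
        return velocity * t  # constant-velocity toy kinematics

    for i in range(num_moves):
        lo, hi = i * move_time, (i + 1) * move_time
        target = velocity * (lo + 0.5 * move_time)  # one step in the middle of the move
        for _ in range(tries_per_move):
            mid = 0.5 * (lo + hi)
            if position(mid) < target:
                lo = mid
            else:
                hi = mid
    return calls


print(count_evaluations(100, 100))  # 10000
```

With an input shaper in front, each of those evaluations would in turn sample the underlying position once per shaper pulse, so the cost per evaluation grows as well.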

So, I’m not sure how stable the patch is, though it does pass the tests.
It is faster on average (I tried several different models).

PA range integration can also take a significant amount of time, but for now I do not think there is an easy fix.

Thanks.


I already mentioned this overhead here: Klipper: communication bus tests - #11 by nefelim4ag
I agree that arcs are bad test cases, so I cleaned up the patches and tested on something more “general”.

Could you share the gcode that you tested this with, and the test config? In general, the selected input shaper(s) and their frequencies will affect the performance, as will whether Pressure Advance is enabled.

As a general comment, it might not be worthwhile to complicate the code with caching just for a 10% performance improvement in some corner cases.

I also made a prototype patch that does not cache the moves, but instead performs input shaping in one sweep over the moves, regardless of the number of input shaper pulses, with an implementation similar to Pressure Advance integration. Unfortunately, I observed only a ~2-3% performance improvement with the 2HUMP_EI input shaper on a spline test (the accuracy of discretization is a bit of an open question, which is why I’d like to test on your gcode). However, I at least did not observe any performance degradation with MZV shapers and the like, or on other, simpler gcodes.


rp2040_printer.cfg (1.9 KB)
cube_no_arcs.zip (2.0 MB)

I mostly just copied sections from my printer and just changed the pins.
I hope there is nothing special.