Klipper 400MHz limitations

trofen · April 16, 2024, 12:29pm

According to this post, the microcontroller cannot run faster than 400 MHz.

I was curious about why and where these restrictions appear. And is it possible to fix it somehow?

The first thing I realized is that the STM32H7 microcontrollers do not have a 64-bit hardware timer. That is not a problem in itself: some microcontrollers don’t even have 32-bit timers, and Klipper already has a solution for them.

Secondly, as far as I understand, the Klipper source code always uses the uint32_t data type for time. Is it safe to change this type to 64-bit? Will there be any problems with 8-bit architectures? And would the host code need any changes?

@koconnor said it’s about 10 seconds, but I don’t understand where this number comes from. Is there any way we can change this?

TheFuzzyGiggler · April 16, 2024, 1:38pm

Klipper at its core relies on strict timing between all sections (separate MCUs, stepper pulses, etc.).

I don’t have an example of exactly where the 10 second limit he’s referring to comes from, as there is a lot of timing-related code and checks in various locations.

But 400 Mhz = 2.5ns per clock tick
10 seconds / 2.5 ns = 4 x 10^9
uint32 can hold 4.294967296 × 10^9 values

Hence a clock faster than about 429 MHz would overflow a uint32 counter within that expected 10 second timing window.
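
The arithmetic above can be sketched in C. This is only an illustration: `wrap_period_seconds` is a hypothetical helper, not a function from the Klipper source, and it assumes a free-running 32-bit counter clocked at the MCU frequency.

```c
#include <stdint.h>

// Seconds until a free-running 32-bit tick counter wraps back to zero.
// freq_hz is the counter clock in Hz (assumed nonzero); the counter
// holds 2^32 distinct values before it rolls over.
double
wrap_period_seconds(uint32_t freq_hz)
{
    return 4294967296.0 / (double)freq_hz;
}
```

At 400 MHz this gives a little under 10.74 seconds, while a 16 MHz counter would take several minutes to wrap.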

Will there be any problems with 8-bit architectures?

No, because they stay within the timing window specified above.

When the timer hits its max value on MCUs faster than 400 MHz, it rolls over to zero and starts again (a simplistic view; it depends on the MCU, which timer is used, etc.).

When Klipper sees discrepancies in the timing sync, it throws an error and shuts down.

Is it safe to change this type to 64-bit?
Is there any way we can change this?

What are you trying to do? Keep in mind that everything in Klipper is about driving stepper motors to move things around. That’s what it’s built for, that’s what it does.

As stated in the other post

which is very close to the minimum supported by the trinamic drivers (~103ns)

Of course there is some overhead in the MCU, so it isn’t one stepper pulse per clock cycle, but some of the faster MCUs are already at the limit of the stepper drivers. If you’re curious, you can see the existing processor benchmarks here.

https://www.klipper3d.org/Benchmarks.html#micro-controller-benchmarks

One of the best is the RP2040 and it’s a 133 Mhz processor

rp2040	ticks
1 stepper	5
3 stepper	22

So really, It all comes down to this… What are you trying to do?

A faster mcu isn’t going to result in faster prints because you can’t drive the stepper motors faster than certain step speeds. Your physical printer setup will be a barrier long before the software/processor is any kind of impediment.

Edit: Also see here for more details about the timing. I forgot the code does handle timer roll over (it would have to), and it converts 32 bit timers to 64 bits and vice versa.

https://www.klipper3d.org/Code_Overview.html#time

koconnor · April 23, 2024, 10:37pm

I think this was mostly answered above. But to repeat, the micro-controllers use a signed 32bit integer to track times. Currently heaters program a maximum “heater pin on time” of 5 seconds in the mcu. So, the mcu has to be able to schedule an event at least 5 seconds into the future without rolling over a signed 32bit integer. At 400Mhz, five seconds is 2,000,000,000 ticks, which fits below 2^31 (2,147,483,648). Indeed, one could go up to about 429Mhz and still fit. If using a clock rate faster than that, then an event scheduled 5 seconds in advance would appear to be in the past and result in errors.
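
A rough sketch of that limit in C, assuming only what is stated above (a signed 32-bit tick difference and a scheduling horizon given in whole seconds). Both function names are hypothetical and do not come from the Klipper source.

```c
#include <stdint.h>

// Highest clock rate (Hz) at which an event horizon_s seconds ahead
// is still at most INT32_MAX ticks away. horizon_s must be nonzero.
uint32_t
max_freq_for_horizon(uint32_t horizon_s)
{
    return (uint32_t)INT32_MAX / horizon_s;
}

// Nonzero if horizon_s seconds of ticks at freq_hz fits in a positive
// signed 32-bit value. The product is computed in 64 bits so that the
// check itself cannot overflow.
int
horizon_fits(uint32_t freq_hz, uint32_t horizon_s)
{
    return (uint64_t)freq_hz * horizon_s <= (uint64_t)INT32_MAX;
}
```

With a 5 second horizon the first function returns 429496729, which matches the "about 429Mhz" figure, and `horizon_fits(400000000, 5)` holds while `horizon_fits(435000000, 5)` does not.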

Changing to something other than 32bit signed integers in the mcu code is unlikely to be a solution as it would likely decrease code efficiency significantly. It may be possible to audit all the code and never use a timer that far in advance (eg, limit to 3 seconds in the future), but that would be a bit of work.

In most cases, we don’t need to run the micro-controllers at over 400Mhz as they are already “more than fast enough” at 400Mhz.

Cheers,
-Kevin

trofen · April 26, 2024, 2:22pm

@TheFuzzyGiggler @koconnor thanks for the detailed answer!

However, I did not understand one thing. Kevin mentioned a signed 32bit integer. Why not an unsigned 32bit integer?

TheFuzzyGiggler · April 26, 2024, 3:55pm

Consistency in calculations
Unambiguous
The ability to count down as well as up

koconnor · April 26, 2024, 6:15pm

Times in the mcu are tracked as 32bit numbers. The code can’t do a direct comparison between times (eg, now_time == scheduled_time) because one can’t guarantee the comparison will be at just the right tick (the desired time may have been a few ticks earlier). So, the code needs to compare if the desired time has elapsed. However, directly doing that does not work (eg, now_time >= scheduled_time) because of 32bit counter rollovers (eg, now_time might be less than scheduled_time only because of a counter rollover). So, the code performs comparisons by checking time differences (eg, (int32_t)(now_time - scheduled_time) >= 0). This is efficient to calculate and works reliably even during counter rollovers, but it does limit the maximum schedule time to no more than 2^31 ticks in the future.

-Kevin
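
The comparison described above can be tried in isolation with a small C sketch. `time_has_elapsed` is a hypothetical name used only for illustration, and the cast relies on the usual two's complement conversion that the toolchains used for these MCUs provide.

```c
#include <stdint.h>

// Nonzero once scheduled_time has been reached, as seen from now_time.
// The unsigned subtraction wraps modulo 2^32; reinterpreting the result
// as signed gives the distance between the two times, which is only
// meaningful while they are less than 2^31 ticks apart.
int
time_has_elapsed(uint32_t now_time, uint32_t scheduled_time)
{
    return (int32_t)(now_time - scheduled_time) >= 0;
}
```

With `now_time = 0xFFFFFFF0` and `scheduled_time = 0x00000010` (32 ticks ahead, just past the rollover) a plain `now_time >= scheduled_time` would wrongly report the event as elapsed, while the difference check reports it as still pending. An event scheduled more than 2^31 ticks ahead, such as `time_has_elapsed(0, 0x80000001)`, is reported as already elapsed, which is the failure mode discussed in this thread.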

trofen · April 27, 2024, 7:50am

I looked at /src/generic/timer_irq.c.
If I understood correctly, this is a kind of trick: A is treated as earlier than B only as long as the distance between the two events is less than half of the virtual timeline. Right?
[Image: MCU timeline]

TheFuzzyGiggler · April 28, 2024, 8:24am

Again, this is one of those situations where, if you said what you were trying to DO, you’d probably get a better, more direct answer.

Are you just curious about how Klipper works and why it’s set up like that?

Are you trying to modify something for a specific end goal?

Help us understand what the underlying question is to give you the answer you’re actually looking for.

Again, that question might actually be just how the timers work that way and why, but it seems like you’re trying to achieve something because you keep asking about changing things.

trofen · May 2, 2024, 6:58am

As you said, I’m really interested in understanding how it works.

And yes, I would like to use more than 400 MHz just because the MCU can.

However, unfortunately, I still do not see a simple and elegant way to do this. 64-bit timestamps really would greatly reduce the performance of weak MCUs. Using 64-bit timestamps only for the H7 MCU would mean supporting 2 host-to-MCU communication APIs, which is bad. We could use the same trick as on the RP2040 and use a separate clock source for the timer. Even if it is 10 times slower than the MCU frequency, then, if I understand correctly how it works, it would do no harm and would really increase the performance of the MCU and the maximum step rate. But that solution still looks somehow not quite right.

P.S. I understand that it really won’t be possible to use all this performance for one driver. The change would more likely benefit AWD systems, where 5 drivers are used at high speeds (X, Y, X1, Y1, E), and complex IDEX assemblies, where there can be 10 motors on one MCU in use simultaneously (X, U, Y, Y1, Z, Z1, Z2, Z3, E, E1).

koconnor · May 2, 2024, 3:29pm

FYI, if one were to increase the mcu frequency above 400Mhz, then one would likely need to disable the CONFIG_HAVE_STEPPER_BOTH_EDGE optimization. Currently that optimization is used to improve the performance of Trinamic stepper motor drivers. Those drivers need 100ns between steps (40 clock ticks at 400Mhz). The optimization takes advantage of the fact that the mcu can’t complete the Klipper step pulse and scheduling work in less than 100ns.

However, if the mcu frequency is notably increased then the optimization would need to be disabled. The optimization roughly doubles the number of steps per second that the mcu can schedule. So, it’s unlikely a faster mcu speed would increase stepping performance unless one were to go over 800Mhz.
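
To see how the 100ns requirement translates into clock ticks at different frequencies, here is a minimal C sketch. `ns_to_ticks_ceil` is a hypothetical helper (not the conversion used in Klipper); it rounds up so that the returned tick count never undershoots the requested time.

```c
#include <stdint.h>

// Convert a duration in nanoseconds to clock ticks at freq_hz,
// rounding up. The multiplication is done in 64 bits to avoid
// overflow for the clock rates discussed here.
uint32_t
ns_to_ticks_ceil(uint32_t freq_hz, uint32_t ns)
{
    return (uint32_t)(((uint64_t)freq_hz * ns + 999999999u) / 1000000000u);
}
```

This gives 40 ticks for 100ns at 400 MHz, 44 ticks at 435 MHz, and 55 ticks at 550 MHz.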

Cheers,
-Kevin

P.S. - check out Benchmarks - Klipper documentation for the measurements of the number of clock ticks per Klipper stepper pulse on each mcu.

trofen · May 3, 2024, 7:36am

Kevin, thank you so much for your reply.

I found the discussion where there is information on this optimization. The only thing I did not understand is whether the 100ns timing would be violated only when the stepper motors are moving at high speeds, or whether this problem would arise anyway. I mean, can we change the step pin output more often than at the speed at which we want to rotate the stepper motor?

P.S. About benchmarks. Are the RP2040 results in processor ticks or in clock cycles of the 12 MHz counter? If they are counter cycles, that is somewhat confusing when viewing these results.

TheFuzzyGiggler · May 3, 2024, 10:13am

The 100ns is built into the chip itself. Not to mention that Stepper motors can physically only be driven so fast.

From a cursory reading it seems like any faster than that the chip will consider it spurious noise and try to filter it out.

100ns is 10 million steps per second, or for a 1.8 degree stepper motor (200 steps per rotation), that’s 50,000 RPS (per second, not per minute).

Not that I’ve ever tried or will ever try but for one, I don’t think a stepper motor could turn anywhere near that fast due to the physics of their construction and two, if they ever did spin that fast they’d almost certainly tear themselves apart from centrifugal force and be an extremely deadly high speed shrapnel weapon if the case didn’t contain it.

I’m sure the 100ns is due to micro-stepping, but that’s still nearly 200 RPS. Far faster than a stepper could feasibly turn and even if it tried to approach that you’d likely have zero torque so it would be useless.
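
Both rotation figures above follow from the same formula, sketched here in C. `revs_per_second` is a hypothetical helper, and the 256 microstep setting is an assumption chosen because it reproduces the "nearly 200 RPS" number.

```c
#include <stdint.h>

// Rotation speed implied by one step every step_interval_ns nanoseconds,
// for a motor with full_steps_per_rev full steps per revolution driven
// at the given microstep setting.
double
revs_per_second(uint32_t step_interval_ns, uint32_t full_steps_per_rev,
                uint32_t microsteps)
{
    double steps_per_second = 1e9 / (double)step_interval_ns;
    return steps_per_second / ((double)full_steps_per_rev * microsteps);
}
```

`revs_per_second(100, 200, 1)` gives 50,000 and `revs_per_second(100, 200, 256)` gives about 195.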

can we change the step pin output more often than at the speed at which we want to rotate the stepper motor?

Why would you want to? And per the datasheet, no: it would see it as noise and try to filter it (if I’m reading that right).

trofen · May 3, 2024, 10:51am

@TheFuzzyGiggler I think you completely misunderstood me (or I explained it badly).

All calculations about 50,000 RPS do not make any sense, because no one controls the motors in full-step mode.
No one is going to spin a stepper motor to the speed of light. Kevin said that CONFIG_HAVE_STEPPER_BOTH_EDGE only works reliably because the MCU cannot schedule events faster than 100ns. So my question was: why would we ever control the step pin with an interval of less than 100ns, if there is no purpose in turning the stepper motor at these speeds?

Thus, is this limitation of MCU performance a “natural” protection mechanism against violating the 100 ns limit, or is it an important “feature” without which the optimization will not work at all?

koconnor · May 3, 2024, 5:38pm

It’s a general limitation even if not stepping at those rates. As an example, if the motor is stepping once every 50us, but the mcu gets busy for a few 100us, then the klipper mcu will “catch up” by running all the pending timers as fast as it can - thus the code needs the timing enforcement regardless of desired scheduling.
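
A rough way to picture the catch-up case is to count how many step events are already overdue once the mcu gets back to its timers. The sketch below is only an illustration: `pending_events` is a hypothetical helper, not Klipper code, and it assumes events spaced a fixed `interval` apart.

```c
#include <stdint.h>

// Number of step events, spaced interval ticks apart and starting at
// first_wake, that are already due at time now. Uses the same signed
// difference trick as the timer comparisons, so it tolerates counter
// rollover. interval must be nonzero.
uint32_t
pending_events(uint32_t now, uint32_t first_wake, uint32_t interval)
{
    int32_t late = (int32_t)(now - first_wake);
    if (late < 0)
        return 0;
    return (uint32_t)late / interval + 1;
}
```

At 400 MHz, 50us is 20,000 ticks and 200us is 80,000 ticks, so `pending_events(81000, 1000, 20000)` returns 5. Those five events would then run back to back, separated only by however long the handler takes, which is why the minimum spacing matters even at a modest step rate.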

Yes, the numbers are in “scheduling ticks” and not “processor ticks”.

-Kevin

TheFuzzyGiggler · May 4, 2024, 2:45am

You made me curious.

Turns out, assuming a standard GT2 pulley with a diameter in the grooves of 12mm, the linear speed at a point on that circumference for a stepper turning at 50,000 RPS would be ~1885 m/s.

Doing the math, that works out to a centripetal acceleration of about 6.04 x 10^7 g.

To get up to the speed of light at a point on the circumference you’d need a pulley with a diameter of ~2km or a little over a mile.

Don’t think that’ll fit in my printer, nor do I think my NEMA 17 steppers have the torque to turn it.

Plus, I’m pretty sure that would be considered creating a WMD. Which is generally frowned upon.

No relativistic 3d printing speeds for me.

Edit: For fun I went a little further…

Assume a 1/8" ball bearing attached somehow at a point on the circumference of our imaginary pulley.

A 1/8" ball bearing has a mass of 0.131 grams, or 0.000131 kg.

Using the relativistic kinetic energy equation KE = (γ - 1)mc^2, where γ is the Lorentz factor, γ = 1/sqrt(1 - v^2/c^2)…

Linear speed of that pulley would be 2.78x10^8 m/s, just baaaaarely under light speed.

Lorentz factor in that case would be 2.67… Plug it into the equation with the ball bearing mass and c and you get…

1.97x10^13 joules or a 4.7 kiloton nuke

Definitely a WMD for a 1/8" ball bearing

Could wipe LAX off the map
NukeMap

nefelim4ag · August 20, 2024, 2:12am

I think this topic is just missing a part where someone wants to remove the 400 Mhz limit, to hit the limits.
This is not so important of course but may have value, like for mad deltas with 256 microstepping, or AWD.
Let’s ignore the practical side here.

RP2040 is a bad example because these ticks are in internal timer ticks, not processor clocks.
Actually, it is 5 times slower than STM32H723 (550Mhz) where we can hit that limit.
For reference, RP2040 runs at 150 Mhz but has 2 cores (I think only one is used).

About the TMC limit, if TMC has an external clock reference faster than 12Mhz - it can support “faster” timings, but that’s for hardware guys, I think no one here has actually tried that.

About Klippy, it is tricky.
But at least Edge optimization is fairly easy - a patchwork solution:

// patch just in case
diff --git a/src/stepper.c b/src/stepper.c
index 00a8ff01..5cc4a42f 100644
--- a/src/stepper.c
+++ b/src/stepper.c
@@ -29,6 +29,8 @@
  #define HAVE_AVR_OPTIMIZATION 0
 #endif
 
+#define HAVE_TOO_FAST_MCU (CONFIG_CLOCK_FREQ > 430000000)
+
 struct stepper_move {
     struct move_node node;
     uint32_t interval;
@@ -77,6 +79,10 @@ stepper_load_next(struct stepper *s)
     s->interval = m->interval + m->add;
     if (HAVE_SINGLE_SCHEDULE && s->flags & SF_SINGLE_SCHED) {
         s->time.waketime += m->interval;
+        if (HAVE_TOO_FAST_MCU) {
+            s->next_step_time += m->interval;
+            s->time.waketime = s->next_step_time;
+        }
         if (HAVE_AVR_OPTIMIZATION)
             s->flags = m->add ? s->flags|SF_HAVE_ADD : s->flags & ~SF_HAVE_ADD;
         s->count = m->count;
@@ -99,6 +105,12 @@ stepper_load_next(struct stepper *s)
     return SF_RESCHEDULE;
 }
 
+static unsigned int
+nsecs_to_ticks(uint32_t ns)
+{
+    return timer_from_us(ns * 1000) / 1000000;
+}
+
 // Optimized step function to step on each step pin edge
 uint_fast8_t
 stepper_event_edge(struct timer *t)
@@ -106,13 +118,38 @@ stepper_event_edge(struct timer *t)
     struct stepper *s = container_of(t, struct stepper, time);
     gpio_out_toggle_noirq(s->step_pin);
     uint32_t count = s->count - 1;
-    if (likely(count)) {
+    if (likely(count) && !HAVE_TOO_FAST_MCU) {
         s->count = count;
         s->time.waketime += s->interval;
         s->interval += s->add;
         return SF_RESCHEDULE;
     }
-    return stepper_load_next(s);
+    if (likely(count) && HAVE_TOO_FAST_MCU) {
+        uint32_t curtime = timer_read_time();
+        uint32_t min_next_time = curtime + nsecs_to_ticks(100);
+        s->count = count;
+        s->next_step_time += s->interval;
+        s->interval += s->add;
+        // The next step event is too close - push it back
+        if (unlikely(timer_is_before(s->next_step_time, min_next_time)))
+            s->time.waketime = min_next_time;
+        else
+            s->time.waketime = s->next_step_time;
+        return SF_RESCHEDULE;
+    }
+
+    if (HAVE_TOO_FAST_MCU) {
+        uint32_t curtime = timer_read_time();
+        uint32_t min_next_time = curtime + nsecs_to_ticks(100);
+        uint_fast8_t ret = stepper_load_next(s);
+        if (ret == SF_DONE || !timer_is_before(s->time.waketime, min_next_time))
+            return ret;
+        // Next step event is too close to the last unstep
+        int32_t diff = s->time.waketime - min_next_time;
+        if (diff < (int32_t)-timer_from_us(1000))
+            shutdown("Stepper too far in past");
+    }
+    return SF_RESCHEDULE;
 }
 
 #define AVR_STEP_INSNS 40 // minimum instructions between step gpio pulses

It will still be faster than dedge=0, but slower than the optimized version; I’m not sure how much slower.
*For reference, dedge=0 stepper_event_full() has 51.
Counting assembly instructions, 20 → 30, so slower by 33%.
400 * 1.33 = 532 Mhz
Taking branching into account, slightly less.

*I’m not sure which part of the code reschedules a timer whose time has already passed; I still do not fully understand how everything works together underneath.

BTW, patch works

nefelim4ag · January 23, 2025, 12:16am

I got curious, so:

arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:12.2.rel1-1) 12.2.1 20221205

allocate_oids count=3
config_stepper oid=0 step_pin=PG0 dir_pin=PG1 invert_step=-1 step_pulse_ticks=0
config_stepper oid=1 step_pin=PF13 dir_pin=PF12 invert_step=-1 step_pulse_ticks=0
config_stepper oid=2 step_pin=PC13 dir_pin=PF0 invert_step=-1 step_pulse_ticks=0
finalize_config crc=0
...
ECHO Test result is: {"%.0fK" % (3. * freq / ticks / 1000.)}

I got 44 and 185 ticks respectively for STM32H7 (slightly faster than currently in docs).
Master: 1 stepper 9091K and 3 steppers 6452K

I overclocked to 435Mhz, to still account for the TMC optimization, and fixed several small things in the code to make the timers work.
Patched: 1 stepper 9886K and 3 steppers 7054K; the tick counts are the same.
So, ~8%.
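
The step rates quoted here follow the same formula as the ECHO line above (steppers * freq / ticks / 1000). A minimal C sketch, with `step_rate_k` as a hypothetical helper name and rounding to the nearest thousand to mirror the "%.0fK" format:

```c
#include <stdint.h>

// Steps per second, in thousands, for n_steppers steppers when one
// benchmark pass over all of them costs ticks clock ticks at freq_hz.
// Rounded to nearest, computed in 64 bits to avoid overflow.
uint32_t
step_rate_k(uint32_t n_steppers, uint32_t freq_hz, uint32_t ticks)
{
    uint64_t steps_times_ticks = (uint64_t)n_steppers * freq_hz;
    uint64_t divisor = (uint64_t)ticks * 1000u;
    return (uint32_t)((steps_times_ticks + divisor / 2) / divisor);
}
```

`step_rate_k(1, 400000000, 44)` gives 9091, `step_rate_k(1, 435000000, 44)` gives 9886, and `step_rate_k(3, 435000000, 185)` gives 7054. With the tick count unchanged, the gain is just the clock ratio 435/400, which is where the roughly 8% comes from.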

(far fewer changes than I expected)
Patched branch: GitHub - nefelim4ag/klipper at stm32h7-overclock

I think it may be better to compute “MAX_TIME” based on the MCU clock, but that is more like overkill.
Right now I changed 5 → 4.5 because of 2000_000_000 / 435_000_000 = 4.59
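
Deriving the limit from the clock could look roughly like the C sketch below. This is an assumption about how such a computation might be written, not code from the patched branch; `max_horizon_ms` is a hypothetical name. It uses the full signed 32-bit range (2^31 - 1 ticks) rather than the rounder 2,000,000,000 used in the estimate above, so it leaves no safety margin.

```c
#include <stdint.h>

// Longest scheduling horizon, in milliseconds, that still fits in a
// positive signed 32-bit tick count at freq_hz (assumed nonzero).
uint32_t
max_horizon_ms(uint32_t freq_hz)
{
    return (uint32_t)((uint64_t)INT32_MAX * 1000u / freq_hz);
}
```

At 435 MHz this gives 4936 ms and at 400 MHz it gives 5368 ms, so a 4.5 second limit stays inside the range at 435 MHz with some headroom.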

Backstory: a buddy of mine runs servos with really high microstepping because of some servo controller limitation (like 80000 per rotation). Servo controllers use pulses instead of edges for stepping, so it easily runs into “Stepper too far in past”.

So, for reference, with “normal” stepper drivers (pulse limit is 44 ticks ~ 100ns) the limit is:

171 ticks or 2339K (400Mhz) → 2544K (435Mhz)
361 ticks or 2216K → 2410K
535 ticks or 2242K → 2439K

So, I think performance depends on the clock slightly differently in edge mode than in pulse mode.

nefelim4ag · January 23, 2025, 11:46pm

I have updated the above patch, now it is “tested” and works as expected.

Maybe there is room to optimize the assembly.
This patch costs 85 ticks at 550Mhz, and 75 at 450Mhz for some reason.
For 3 steppers it is better, there are 197 ticks (450Mhz) and 195 (550Mhz).
So, the step rate is:

1 * 550_000_000 / 85 // 1000 = 6470.0
3 * 550_000_000 / 195 // 1000 = 8461.0

FWIW, 550 is too much and there is a little more work to make it reliable if that makes sense at all.
For example, it makes software SPI a little bit too fast. I think there is no rate limit: currently on stm32h7 at 400Mhz it runs at around 4Mhz, and on a 550Mhz MCU it runs at ~5.6Mhz. So it looks like the SPI TMC (with a limit of 6Mhz at a 12Mhz internal clock, I think) returns some noise, and the printer can’t be homed because of SPI write errors.
500 Mhz same.
450 Mhz works, it even still prints.

I tried a busy wait in the edge event. It limits the frequency: there are 210ns per pulse, so 105 ns per step, ~ 57 ticks (I used an oscilloscope probe to verify).
For 3 steppers 195 ticks, same as above.

It works up to 49 ticks, so formally there are 11224K

Patch

diff --git a/src/stepper.c b/src/stepper.c
index 00a8ff01..2c1d8991 100644
--- a/src/stepper.c
+++ b/src/stepper.c
@@ -29,6 +29,8 @@
  #define HAVE_AVR_OPTIMIZATION 0
 #endif
 
+#define HAVE_TOO_FAST_MCU (CONFIG_CLOCK_FREQ > 430000000)
+
 struct stepper_move {
     struct move_node node;
     uint32_t interval;
@@ -77,6 +79,8 @@ stepper_load_next(struct stepper *s)
     s->interval = m->interval + m->add;
     if (HAVE_SINGLE_SCHEDULE && s->flags & SF_SINGLE_SCHED) {
         s->time.waketime += m->interval;
+        if (HAVE_TOO_FAST_MCU)
+            s->next_step_time = s->time.waketime;
         if (HAVE_AVR_OPTIMIZATION)
             s->flags = m->add ? s->flags|SF_HAVE_ADD : s->flags & ~SF_HAVE_ADD;
         s->count = m->count;
@@ -99,11 +103,23 @@ stepper_load_next(struct stepper *s)
     return SF_RESCHEDULE;
 }
 
+static unsigned int
+nsecs_to_ticks(uint32_t ns)
+{
+    return timer_from_us(ns * 1000) / 1000000;
+}
+
 // Optimized step function to step on each step pin edge
 uint_fast8_t
 stepper_event_edge(struct timer *t)
 {
     struct stepper *s = container_of(t, struct stepper, time);
+    if (HAVE_TOO_FAST_MCU) {
+        uint32_t ctime = timer_read_time();
+        while (unlikely(timer_is_before(ctime, s->next_step_time)))
+            ctime = timer_read_time();
+        s->next_step_time = ctime + nsecs_to_ticks(100);
+    }
     gpio_out_toggle_noirq(s->step_pin);
     uint32_t count = s->count - 1;
     if (likely(count)) {

software spi fixed & works.
hardware i2c has overflow of presc value.

480_000_000 / 4 / 8_000_000
15.0

The branch is updated.

Hmm, something else is broken at 550Mhz, so it errors out randomly.
Maybe in the end going this high is not worth the effort.

480Mhz survived 2 prints, each several hours long.

Arduinix · January 29, 2025, 12:03pm

So why does BTT sell boards with a 550Mhz MCU? How do they manage to run Klipper on it?

LifeOfBrian · January 29, 2025, 12:12pm

Other firmwares can profit from the higher speed and Klipper just limits it to 400 MHz afaik.
