Z stepper microstep shift on multi-mcu homing / loadcell / ldc1612

Well, I think I can reproduce this. I see microstep drift across some of the z steppers. The first z stepper never seems to drift, but the other steppers do.

klipper log: test_microstep_drift.log (688.0 KB)
console log: test_microstep_drift.console.log (17.9 KB)

Keep in mind that this is NOT stock klipper, so you could dismiss this as just a bug in the load cell code. It’s using custom endstop code, however nothing I’ve done alters homing.py, the kinematics, or itersolve, so nothing should impact the microstep position of an individual stepper. It also uses trigger_count=1, so this might not be the same issue. It could be something to do with the multi-stepper implementation.

I disabled the z-axis thermal compensation and tried skipping the QGL at machine startup, but neither made a difference. I put some M400s in the macro, as it seems that GET_POSITION might otherwise run before the step queue drains and report an in-flight position.

# Before
// mcu: stepper_x:6410 stepper_y:249602 stepper_z:248 stepper_z1:260 stepper_z2:6407 stepper_z3:263
// stepper: stepper_x:275.000000 stepper_y:225.000000 stepper_z:4.999982 stepper_z1:4.999982 stepper_z2:4.999982 stepper_z3:4.999982
// kinematic: X:250.000000 Y:25.000000 Z:4.999982

# After
// mcu: stepper_x:6410 stepper_y:249602 stepper_z:248 stepper_z1:262 stepper_z2:6409 stepper_z3:266
// stepper: stepper_x:275.000000 stepper_y:225.000000 stepper_z:4.999982 stepper_z1:4.999982 stepper_z2:4.999982 stepper_z3:4.999982
// kinematic: X:250.000000 Y:25.000000 Z:4.999982

# Before
12:44:49  // MSCNT: 000000dc mscnt=220
12:44:49  // MSCNT: 00000252 mscnt=594
12:44:49  // MSCNT: 000001df mscnt=479
12:44:50  // MSCNT: 000000e1 mscnt=225

# After
12:49:12  // MSCNT: 000000dc mscnt=220  # 0 microsteps
12:49:12  // MSCNT: 00000250 mscnt=592  # -2 microsteps
12:49:12  // MSCNT: 000001e1 mscnt=481  # -2 microsteps
12:49:13  // MSCNT: 000000de mscnt=222  # -3 microsteps
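One wrinkle when reading these dumps: MSCNT is a 10-bit counter that wraps at 1024, so deltas near the wrap point need to be unwound, and a direction-inverted motor shows the opposite sign for the same physical drift (which is why 479→481 is annotated as -2 above). A small, purely illustrative sketch of computing the signed change:

```python
def mscnt_delta(before, after, table_size=1024):
    """Signed change in the TMC MSCNT register, taking the shortest
    path around the wrapping 0..1023 counter."""
    diff = (after - before) % table_size
    if diff > table_size // 2:
        diff -= table_size  # counter wrapped the other way
    return diff

# Deltas for the dumps above (sign flips on direction-inverted motors):
print([mscnt_delta(b, a)
       for b, a in [(220, 220), (594, 592), (479, 481), (225, 222)]])
# -> [0, -2, 2, -3]
```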

Test macro, for a Voron 2.4 with 4 steppers:

[gcode_macro TEST_MICROSTEP_DRIFT]
gcode:
    G1 Z5
    M400
    G1 Z5
    M400
    GET_POSITION

    DUMP_TMC STEPPER=stepper_z
    DUMP_TMC STEPPER=stepper_z1
    DUMP_TMC STEPPER=stepper_z2
    DUMP_TMC STEPPER=stepper_z3

    PROBE_ACCURACY SAMPLES=100

    G1 Z5
    M400
    G1 Z5
    M400
    GET_POSITION
    DUMP_TMC STEPPER=stepper_z
    DUMP_TMC STEPPER=stepper_z1
    DUMP_TMC STEPPER=stepper_z2
    DUMP_TMC STEPPER=stepper_z3
    
    M112

edit: switched to PROBE_ACCURACY

P.S. I tried setting trigger_count=2. This makes things drastically worse than the stock endstop because the ADC sensor only produces a sample every ~3 milliseconds (3000us). But it doesn’t make a difference in my case; I get the same results. The first stepper is in sync but the others seem to trail behind.
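To put that ~3 ms sample interval in perspective: each extra confirming sample can add up to one full interval of continued descent before the trigger fires. A rough sketch of that worst-case overshoot, where the probing speed is an assumed example value, not from the post:

```python
SAMPLE_INTERVAL_S = 0.003   # ~3 ms per ADC sample, as noted above
PROBE_SPEED_MM_S = 5.0      # hypothetical probing speed

def extra_travel_mm(extra_samples, speed=PROBE_SPEED_MM_S,
                    interval=SAMPLE_INTERVAL_S):
    """Worst-case extra descent while waiting for confirming samples."""
    return extra_samples * speed * interval

print(extra_travel_mm(1))  # 0.015 mm of extra travel per extra sample
```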

Interesting, thanks. I’m moving this to a new thread because at first glance it doesn’t seem to have the same symptoms as the original report. It’s also using the experimental loadcell code. It could be related though, so I think it is worth tracking down.

What you’re reporting definitely looks like an error in the software. The relative Z positions should not shift like that during a probe sequence. Unfortunately, I can’t reproduce this on my Voron2 with klicky probe (on toolhead mcu).

Good catch on using M400. I’ve updated the original post to include that. Note that it is a good idea to go to an exact position including XY (eg, G1 X50 Y50 Z5) before issuing GET_POSITION/DUMP_TMC. It may not matter on corexy, but it can definitely have an impact on other kinematics.

I’m not sure what could cause the different Z motors to get out of sync. The only thing that I can think of is if the trsync notification was somehow executed after some Z stepper had taken a step but before all steppers completed that step. That shouldn’t be possible though.

You could try adding some additional debugging to the code:

--- a/klippy/extras/homing.py
+++ b/klippy/extras/homing.py
@@ -122,8 +122,8 @@ class HomingMove:
             if trig_steps != halt_steps:
                 haltpos = self.calc_toolhead_pos(kin_spos, halt_steps)
             logging.info("Probe move: %s",
-                         (sp.stepper_name, sp.start_pos, sp.trig_pos, sp.halt_pos)
-                         for sp in self.stepper_positions)
+                tuple(((sp.stepper_name, sp.start_pos, sp.trig_pos, sp.halt_pos)
+                       for sp in self.stepper_positions)))
         else:
             haltpos = trigpos = movepos
             over_steps = {sp.stepper_name: sp.halt_pos - sp.trig_pos

The other thing to look for is if this continues from run to run. (That is, was it just a one time thing on the first set of probes.)

Finally, you could try issuing an M112 in the middle of a probing sequence - that way we can look at the low level queue_step commands and verify they are being sent to the mcu with exact step timings for all z motors.

-Kevin

Actually, it did show up on my voron2 with Klicky. It doesn’t happen frequently. I’ll try to track it down.

-Kevin


My machine is using single MCU homing, not a toolhead MCU, if that matters?

I can reproduce this on my Voron2 as well. It’s equipped with a BTT Eddy. I’m running Z at 128 microsteps, hence the MSCNT difference is double the MCU position difference.

klippy.log (376.9 KB)

# before
mcu: stepper_x:-18782 stepper_y:9850 stepper_z:-234064 stepper_z1:-234064 stepper_z2:-234064 stepper_z3:-234064
stepper: stepper_x:350.000000 stepper_y:-0.000000 stepper_z:5.000000 stepper_z1:5.000000 stepper_z2:5.000000 stepper_z3:5.000000
kinematic: X:175.000000 Y:175.000000 Z:5.000000
toolhead: X:175.000000 Y:175.000000 Z:5.000000 E:0.000000

MSCNT:      00000361 mscnt=865
MSCNT:      000000a1 mscnt=161
MSCNT:      00000361 mscnt=865
MSCNT:      000000a1 mscnt=161

# after
mcu: stepper_x:-18782 stepper_y:9850 stepper_z:-234064 stepper_z1:-234064 stepper_z2:-234062 stepper_z3:-234060
stepper: stepper_x:350.000000 stepper_y:-0.000000 stepper_z:5.000000 stepper_z1:5.000000 stepper_z2:5.000000 stepper_z3:5.000000
kinematic: X:175.000000 Y:175.000000 Z:5.000000
toolhead: X:175.000000 Y:175.000000 Z:5.000000 E:0.000000

MSCNT:      00000361 mscnt=865
MSCNT:      000000a1 mscnt=161
MSCNT:      00000365 mscnt=869
MSCNT:      00000099 mscnt=153
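The doubling follows from the TMC microstep table having 256 entries per full step, so MSCNT advances by 256/microsteps per step pulse. A small sketch of that relation:

```python
MSCNT_PER_FULL_STEP = 256  # TMC drivers: 1024-entry table over 4 full steps

def mscnt_increment(microsteps):
    """MSCNT ticks per MCU step pulse at a given microstep setting."""
    return MSCNT_PER_FULL_STEP // microsteps

# stepper_z2 above moved +2 mcu steps; at 128 microsteps that is a
# 4-tick MSCNT change (865 -> 869), matching the dump. (The sign can
# flip on direction-inverted motors, as with stepper_z3.)
print(mscnt_increment(128))      # 2
print(2 * mscnt_increment(128))  # 4
```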

I concur with Kevin that this is different from the reported “drift” bug. That said, I really like that debugging macro, it might be a good idea to add it to the OP in the other thread.

Doesn’t the load cell have its own MCU?

I was able to track this down. It only impacts devices that call trsync_do_trigger() from task context. That’s multi-mcu homing and homing with descent on ldc1612. The experimental loadcell code is thus also caught up.

The issue is that the host code expects trigger processing to be atomic when there are multiple Z steppers. There is a “loophole” here that can occur if the mcu timer scheduler becomes backlogged - it has the ability to pause timers to give tasks some priority. In that case it is possible for some of the z steppers to take a step, tasks to be given priority, a task to signal a trigger event, and then the remaining steppers never take that step. I was aware of this “loophole” but thought the probability was too low to be concerned about. However, upon further inspection, due to the pattern of Z steppers and the “scheduler backlog detection”, the chance is actually about 1 in 100. (Specifically, TIMER_IDLE_REPEAT_TICKS / TIMER_DEFER_REPEAT_TICKS.) So, if one performs enough probe attempts, a slight shift between the multiple z steppers becomes statistically likely.
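At roughly 1-in-100 per probe, the chance of seeing at least one skew grows quickly over a long probing run. A quick sketch of that arithmetic (the per-probe probability is just the rough estimate from above):

```python
def p_at_least_one_skew(p_per_probe, n_probes):
    """Probability of one or more skew events over n independent probes."""
    return 1.0 - (1.0 - p_per_probe) ** n_probes

# With the ~1 in 100 estimate, a 100-sample PROBE_ACCURACY run is more
# likely than not to introduce at least one shift.
print(round(p_at_least_one_skew(0.01, 100), 2))  # ~0.63
```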

This should not impact traditional single-mcu endstop based probing as trsync_do_trigger() is issued from timer context (not task context).

Purely for testing purposes, one can try to reproduce the problem with the patch below (this only addresses stm32 machines, and it totally breaks scheduler priority, so it’s not recommended to run outside of a test):

--- a/src/generic/armcm_timer.c
+++ b/src/generic/armcm_timer.c
@@ -137,7 +137,7 @@ timer_dispatch_many(void)
             // Check if there are too many repeat timers
             if (diff < (int32_t)(-timer_from_us(1000)))
                 try_shutdown("Rescheduled timer in the past");
-            if (sched_tasks_busy()) {
+            if (0 && sched_tasks_busy()) {
                 timer_repeat_until = now + TIMER_REPEAT_TICKS;
                 return TIMER_DEFER_REPEAT_TICKS;
             }
diff --git a/src/generic/timer_irq.c b/src/generic/timer_irq.c
index 40e9336b2..84014df54 100644
--- a/src/generic/timer_irq.c
+++ b/src/generic/timer_irq.c
@@ -55,7 +55,7 @@ timer_dispatch_many(void)
             // Check if there are too many repeat timers
             if (diff < (int32_t)(-timer_from_us(1000)))
                 try_shutdown("Rescheduled timer in the past");
-            if (sched_tasks_busy()) {
+            if (0 && sched_tasks_busy()) {
                 timer_repeat_until = now + TIMER_REPEAT_TICKS;
                 return now + TIMER_DEFER_REPEAT_TICKS;
             }

Unfortunately, a real fix is a bit involved. It’ll take some time to figure out how to address this issue.

-Kevin

I also like the macro. I updated the original post to recommend using the macro. Thanks!

As for the fix, my initial thinking is that there are two ways to address this.

  1. Work could be done to improve the scheduler backlog detection. This is probably worthwhile in general, as giving tasks priority when the scheduler isn’t actually backlogged is not ideal. However, this won’t truly close the “loophole” - just make it drastically less likely. It would also require an mcu reflash, and there are minor variants of the scheduler code on different architectures, so testing becomes annoying.
  2. The host homing code could be changed to detect this skew and correct it. The Klipper host code does know of the skew (its knowledge of the position matches the stepper driver’s knowledge of the position). This has several advantages in general - it would allow multiple z steppers to be on multiple different mcus. Unfortunately it is quite a bit of work, as we’d have to schedule the correction moves and the homing code isn’t really set up to schedule movements at that point in the homing logic. Ideally, a skew correction would occur simultaneously with the probe lift action - but that’s even more work, I suspect.
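The detection half of option 2 amounts to comparing per-stepper halt positions after the probe and computing a correction for any stepper that drifted relative to the others. A minimal sketch under assumed inputs - the dict of step counts and the steps-per-mm value are hypothetical, and the real homing code would also have to schedule the correction moves:

```python
def detect_z_skew(halt_steps, steps_per_mm):
    """Per-stepper correction (mm) that would realign all Z steppers
    to the first stepper's position. Purely illustrative."""
    names = sorted(halt_steps)
    reference = halt_steps[names[0]]
    return {name: (reference - halt_steps[name]) / steps_per_mm
            for name in names}

# Hypothetical numbers in the spirit of the dumps above:
skew = detect_z_skew(
    {"stepper_z": 248, "stepper_z1": 262, "stepper_z2": 250, "stepper_z3": 266},
    steps_per_mm=400.0)
print(skew)  # stepper_z1 needs a -0.035 mm correction, etc.
```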

Not sure.
-Kevin

I could call trsync_do_trigger() from the timer callback. I already have a watchdog timer running that monitors for loss of signal from the ADC sensor (e.g. due to an error). I could schedule that to run at the step rate (similar to the existing endstop timer) and trigger from in there. Not nearly as big a lift as scheduling additional steps to fix the drift.

Maybe you can do something similar for multi-mcu homing? command_trsync_trigger() would have to schedule a timer, if I’m understanding how that works?

The main test rig is just a tool board.

That is an interesting idea I hadn’t thought of. It would be a hack, but would likely work in practice. Technically it doesn’t fully close the loophole, as the task could still schedule the trigger timer midway between steps, but that’d be a 1 in a billion type of thing.

So, another possibility to consider. It would require mcu reflash.

Thanks,
-Kevin

Here’s an initial version of the improved scheduler backlog detection:

--- a/src/sched.c
+++ b/src/sched.c
@@ -19,7 +19,7 @@ static struct timer periodic_timer, sentinel_timer, deleted_timer;
 
 static struct {
     struct timer *timer_list, *last_insert;
-    int8_t tasks_status;
+    int8_t tasks_status, tasks_busy;
     uint8_t shutdown_status, shutdown_reason;
 } SchedStatus = {.timer_list = &periodic_timer, .last_insert = &periodic_timer};
 
@@ -209,7 +209,10 @@ sched_wake_tasks(void)
 uint8_t
 sched_tasks_busy(void)
 {
-    return SchedStatus.tasks_status >= TS_REQUESTED;
+    if (SchedStatus.tasks_busy >= TS_REQUESTED)
+        return 1;
+    SchedStatus.tasks_busy = SchedStatus.tasks_status;
+    return 0;
 }
 
 // Note that a task is ready to run
@@ -243,7 +246,7 @@ run_tasks(void)
             irq_disable();
             if (SchedStatus.tasks_status != TS_REQUESTED) {
                 // Sleep processor (only run timers) until tasks woken
-                SchedStatus.tasks_status = TS_IDLE;
+                SchedStatus.tasks_status = SchedStatus.tasks_busy = TS_IDLE;
                 do {
                     irq_wait();
                 } while (SchedStatus.tasks_status != TS_REQUESTED);

It should be possible to use this instead of the timer_irq.c / armcm_timer.c changes I posted above.

This should squash the priority issue in practice - but can’t fully guarantee atomicity. (If the mcu actually is overloaded then it could still schedule tasks between steps but the chances should be dramatically lower.)
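For anyone following along, the key change is that sched_tasks_busy() now reports busy only when tasks were already pending on the *previous* check, so a single burst of task work no longer demotes timers immediately. A rough Python model of that two-stage latch - it mirrors the logic of the C patch with simplified stand-in state values, not the real API:

```python
TS_IDLE, TS_REQUESTED = 0, 1  # simplified stand-ins for the C values

class SchedModel:
    """Toy model of the patched sched_tasks_busy() latch."""
    def __init__(self):
        self.tasks_status = TS_IDLE
        self.tasks_busy = TS_IDLE

    def wake_tasks(self):
        # Corresponds to sched_wake_tasks() requesting task work.
        self.tasks_status = TS_REQUESTED

    def go_idle(self):
        # Corresponds to run_tasks() sleeping: both flags cleared.
        self.tasks_status = self.tasks_busy = TS_IDLE

    def sched_tasks_busy(self):
        if self.tasks_busy >= TS_REQUESTED:
            return True
        self.tasks_busy = self.tasks_status  # latch for the next check
        return False

s = SchedModel()
s.wake_tasks()
print(s.sched_tasks_busy())  # False: first check only latches the request
print(s.sched_tasks_busy())  # True: still pending on the second check
```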

-Kevin

EDIT: Improved patch.

Assuming the probabilities were low is what allowed this bug in the first place - is it safe to assume it won’t happen again? Isn’t there a safer way to handle it where such a probability doesn’t even exist, like the option 2 (detect in host code) you mentioned earlier?

While what you are saying definitely makes sense, there might be more issues.
I can only speak to my personal experience, as stated in the earlier bug report:
with klicky my back left corner was always too low, which immediately went away when switching to beacon.
I didn’t change any other hardware, and klicky was never on a separate MCU - everything was on the same mainboard.
I have no idea what causes that issue, but since it was extremely consistent with klicky, even across multiple different build plates, I would assume it’s not the probe; and since it immediately went away with beacon, I would assume it was not the frame or bed either.
It is really amazing you already discovered that, but there might be more issues at play here.

Just to be clear, the issue here does not appear similar to the issue discussed at [BUG] Accumulating microstep shift during probe moves related to endstop oversampling - #3 by koconnor . The issue here would not impact the typical single-mcu endstop setups. The issue here is primarily confined to multi-mcu homing setups.

In either case, if concerned about impact, best to run the test contained at the link above - it should detect any kind of “accumulating slippage”.

And yes, all of the possible solutions have trade-offs which is why they need to be considered.

Cheers,
-Kevin

I tried the sched.c patch and that was unsuccessful for me:
klippy_sched_c_patch.log (860.9 KB)

I also tried moving the trigger inside the timer callback, and that too failed. If anything it looks worse, which is not what we expected:

# Before
// mcu: stepper_x:149 stepper_y:-3 stepper_z:-77 stepper_z1:14 stepper_z2:-13 stepper_z3:7
// stepper: stepper_x:270.000000 stepper_y:-10.000000 stepper_z:4.999733 stepper_z1:4.999733 stepper_z2:4.999733 stepper_z3:4.999733

# After
// mcu: stepper_x:149 stepper_y:-3 stepper_z:-66 stepper_z1:25 stepper_z2:-2 stepper_z3:18
// stepper: stepper_x:270.000000 stepper_y:-10.000000 stepper_z:5.001452 stepper_z1:5.001452 stepper_z2:5.001452 stepper_z3:5.001452

# Before
// MSCNT: 0000013d mscnt=317
// MSCNT: 00000074 mscnt=116
// MSCNT: 0000000d mscnt=13
// MSCNT: 000002dd mscnt=733

# After
// MSCNT: 00000148 mscnt=328
// MSCNT: 00000069 mscnt=105
// MSCNT: 00000018 mscnt=24
// MSCNT: 000002d2 mscnt=722

klippy_timer_trigger_patch.log (741.0 KB)

I did not speed up the timer interval; I left it running at the sensor’s sample rate. I figured this shouldn’t matter at all.


I changed the macro to run PROBE_ACCURACY, as we should see an improvement when we get this right and I want before/after measurements. No one would be mad about these results over 100 probes:

probe accuracy results:
maximum 0.003631, minimum 0.001310,
range 0.002321, average 0.002613, median 0.002596,
standard deviation 0.000448

But consider how many lost microsteps are hiding in those results?!
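For a sense of scale, a rough conversion from microsteps to Z travel, using assumed example values for a belted Voron 2.4 Z drive (40 mm rotation_distance, 200-step motor, 80:16 gearing, 32 microsteps - all hypothetical, not from the logs):

```python
def mm_per_microstep(rotation_distance, full_steps, microsteps, gear_ratio=1.0):
    """Linear travel of a single microstep for a given drive setup."""
    return rotation_distance / (full_steps * microsteps * gear_ratio)

# Hypothetical Voron 2.4 Z drive: at these values a 2-microstep slip
# is only 2.5 um of Z error -- small enough to hide inside the
# sub-3-um spread reported above.
step_mm = mm_per_microstep(40.0, 200, 32, gear_ratio=5.0)
print(step_mm)  # 0.00125
```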

Kevin’s patch worked for me:

# before

mcu: stepper_x:15 stepper_y:13 stepper_z:-12 stepper_z1:-12 stepper_z2:-12 stepper_z3:-12
stepper: stepper_x:350.000000 stepper_y:-0.000000 stepper_z:5.000000 stepper_z1:5.000000 stepper_z2:5.000000 stepper_z3:5.000000
kinematic: X:175.000000 Y:175.000000 Z:5.000000
toolhead: X:175.000000 Y:175.000000 Z:5.000000 E:0.000000

MSCNT:      00000321 mscnt=801
MSCNT:      000000e1 mscnt=225
MSCNT:      00000325 mscnt=805
MSCNT:      000000d9 mscnt=217

# after

mcu: stepper_x:15 stepper_y:13 stepper_z:-12 stepper_z1:-12 stepper_z2:-12 stepper_z3:-12
stepper: stepper_x:350.000000 stepper_y:-0.000000 stepper_z:5.000000 stepper_z1:5.000000 stepper_z2:5.000000 stepper_z3:5.000000
kinematic: X:175.000000 Y:175.000000 Z:5.000000
toolhead: X:175.000000 Y:175.000000 Z:5.000000 E:0.000000

MSCNT:      00000321 mscnt=801
MSCNT:      000000e1 mscnt=225
MSCNT:      00000325 mscnt=805
MSCNT:      000000d9 mscnt=217

@garethky It looks like after your patch you consistently lost 11 steps on all z steppers. This would be closer to what was reported in the original “drift” issue.

It looks like something was off in your test:

toolhead: X:130.000000 Y:140.000000 Z:5.001502 E:0.000000
gcode: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000

and:

toolhead: X:130.000000 Y:140.000000 Z:4.999663 E:0.000000
gcode: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000

Note the discrepancy between the requested gcode position and the requested toolhead position. So, you didn’t make it back to the same starting position when you ran the second get_position. Sounds like a bed_mesh, z_thermal_adjust, or something like that was in effect.

Also, I would recommend changing the macro to go to an explicit XYZ position (not just Z). See [BUG] Accumulating microstep shift during probe moves related to endstop oversampling - #3 by koconnor

FWIW, your logs look like the patch was successful for you, as the initial report was a relative change in Z stepper locations, whereas now you have an absolute change, which appears to be the result of some gcode transform in effect.

Cheers,
-Kevin


That’s probably it! I re-enabled the z-thermal-comp sections because I was actually printing something for once and forgot to turn them off.

I’ll re-test with the patch.

Re-tested the sched.c patch and I’m still getting a discrepancy:

// mcu: stepper_x:6 stepper_y:492 stepper_z:365 stepper_z1:1174 stepper_z2:1150 stepper_z3:304
// stepper: stepper_x:270.000000 stepper_y:-10.000000 stepper_z:5.000027 stepper_z1:5.000027 stepper_z2:5.000027 stepper_z3:5.000027
// toolhead: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000
// gcode: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000

// MSCNT: 0000016d mscnt=365
// MSCNT: 0000038a mscnt=906
// MSCNT: 0000009e mscnt=158
// MSCNT: 000002d0 mscnt=720

// mcu: stepper_x:6 stepper_y:492 stepper_z:365 stepper_z1:1178 stepper_z2:1156 stepper_z3:312
// stepper: stepper_x:270.000000 stepper_y:-10.000000 stepper_z:5.000027 stepper_z1:5.000027 stepper_z2:5.000027 stepper_z3:5.000027
// toolhead: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000
// gcode: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000

// MSCNT: 0000016d mscnt=365
// MSCNT: 00000386 mscnt=902
// MSCNT: 000000a4 mscnt=164
// MSCNT: 000002c8 mscnt=712

Question: my setup is running the HX711 sensor code that does a lot of IRQ disable blocks for bit-bang. Are either of you using neopixels or similar in your machines?

Why am I asking that: when I was making up the timer patch I tried 2 versions, with and without disabling the IRQs. There is a note on trsync_do_trigger that the caller must disable IRQs:

// Activate a trigger (caller must disable IRQs)
void
trsync_do_trigger(struct trsync *ts, uint8_t reason)

But the version that does that is the one that lost the 11 microsteps.

I re-tested with the code as it is in endstop.c, without the IRQ fence, and it lost no microsteps (I think my original test was foiled by z_thermal_adjust):

# Before
// mcu: stepper_x:-56 stepper_y:6 stepper_z:21 stepper_z1:-100 stepper_z2:136 stepper_z3:38
// stepper: stepper_x:270.000000 stepper_y:-10.000000 stepper_z:5.000004 stepper_z1:5.000004 stepper_z2:5.000004 stepper_z3:5.000004
// kinematic: X:130.000000 Y:140.000000 Z:5.000004
// toolhead: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000

// MSCNT: 00000183 mscnt=387
// MSCNT: 000003ea mscnt=1002
// MSCNT: 0000012c mscnt=300
// MSCNT: 000002a2 mscnt=674

# After
// mcu: stepper_x:-56 stepper_y:6 stepper_z:21 stepper_z1:-100 stepper_z2:136 stepper_z3:38
// stepper: stepper_x:270.000000 stepper_y:-10.000000 stepper_z:5.000004 stepper_z1:5.000004 stepper_z2:5.000004 stepper_z3:5.000004
// kinematic: X:130.000000 Y:140.000000 Z:5.000004
// toolhead: X:130.000000 Y:140.000000 Z:5.000000 E:0.000000

// MSCNT: 00000183 mscnt=387
// MSCNT: 000003ea mscnt=1002
// MSCNT: 0000012c mscnt=300
// MSCNT: 000002a2 mscnt=674

Could bit-bang code be interrupting the scheduler in cases where we think the timing needs to be tight?

That might explain the pattern of steppers being delayed in order. The sensor read task interrupts the stepper halting and takes far longer than would be ideal. When the steppers are halted from inside the timer callback, they get halted before the next run of the bit-bang task.


Probe accuracy wasn’t any better with the patch:

// probe accuracy results:
maximum 0.002330, minimum -0.000102,
range 0.002432, average 0.001223, median 0.001294,
standard deviation 0.000527

However, plotting the results shows a slow shift from around 0 to 0.0015, which accounts for much of the range. The worst probe-to-probe variance is lower (I assume high probe-to-probe variance is really lost microsteps). At this scale we could be measuring the physical machine moving, hysteresis of the physical components, etc.
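One way to separate that slow shift from probe-to-probe noise is to fit a line to the sample sequence and compare the total drift implied by the slope with the residual scatter. A stdlib-only sketch on synthetic data (the real samples live in the logs and aren't reproduced here):

```python
def fit_drift(samples):
    """Least-squares line fit; returns (total_drift, residual_std)."""
    n = len(samples)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(samples) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, samples))
    slope = sxy / sxx
    resid = [y - (my + slope * (x - mx)) for x, y in zip(xs, samples)]
    rstd = (sum(r * r for r in resid) / n) ** 0.5
    return slope * (n - 1), rstd

# Synthetic example: 0.0015 mm drift over 100 probes plus +/-0.0002 noise.
data = [0.0015 * i / 99 + (0.0002 if i % 2 else -0.0002) for i in range(100)]
drift, noise = fit_drift(data)
print(round(drift, 4), round(noise, 4))
```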