[BUG] Accumulating microstep shift during probe moves related to endstop oversampling

ray_boy · October 15, 2024, 9:55pm

This is not my issue but it is a bug which could explain some odd behavior by many many users. I am also creating the discussion post to meet the criteria setout by the project moderator so it can get progressed. (edit: Beacon is not affected by this bug)

Github Issues Link: [BUG] Accumulating microstep shift during probe moves related to endstop oversampling. · Issue #6711 · Klipper3d/klipper · GitHub

(edit 2:) Unfortunately the linked issue above has been closed but Arksine has posted a PR to address this issue: mcu: fix endstop probing by Arksine · Pull Request #6712 · Klipper3d/klipper · GitHub

Abbreviated issue details, refer to github issues for all details

Issue Description

During Z-axis probing operations, there’s a consistent microstep difference between the trigger position and the halt position that should not be present, depending on driver microstep settings and probing speed. This difference is added to the commanded stepper position which causes it to accumulate over multiple probe moves, resulting in inaccurate bed meshes (with probes that require movement in Z) and potentially affecting first layer consistency and printed geometry.

Reproduction Steps

Configure Z steppers to use 64 microsteps
Set probing speed to 3 mm/s
Apply diff in the relevant code section to log the microstep difference.
Perform Z probing operations (e.g loop PROBE and optionally GET_POSITION at the same x/y position).
grep "Probe move" ~/printer_data/logs/klippy.log to show the step difference, and grep "Setting toolhead position" ~/printer_data/logs/klippy.log, to show the continuous drift.

Note: this may require a sufficiently high resolution z-axis or faster probe speeds to reproduce, this bug is consistently reproduced with TR8*4 leadscrews (a rotation distance of 4mm), 1.8 degree steppers, 64 microsteps and 3 mm/s of probing speed. Anything resulting in more steps generated pr sec will show the issue. The microstep shift scales with steps per second.

Observed Behavior

A consistent microstep shift occurs between trigger and halt positions during probing
The shift accumulates with each probe move, moving the nozzle closer to the bed
Bed mesh results are shifted, with error scaling with point count and probe samples

Expected Behavior

Trigger and halt positions should align without a consistent shift during probing operations. Unsure about multi-mcu setups, but the issue would be present there as well.

Additional Observations

The issue is specific to probing; homing exhibits the same shift but doesn’t have major real-world impact as the position is reset, so no accumulation.
Reducing microsteps to 16 eliminates the issue when probing at 3 mm/s.
Driver interpolation does not affect the outcome.
Setting ENDSTOP_SAMPLE_COUNT to 1 eliminates the issue at the cost of disabling noise filtering (which seems unnecessary on modern printer hardware)
Increasing ENDSTOP_SAMPLE_COUNT exacerbates the problem (e.g., 32 samples increases error to 4-5 microsteps per probe move in our case)
The issue has been reproduced on several machines with different stepper drivers (TMC2209, TMC5160)
Some setups exhibit larger errors than others despite using the same stepper and driver configurations
Increased probing speeds may increase the error.
There are possible edge cases where X and Y axes may drift as well.

garethky · October 18, 2024, 5:00pm

This code is the particular kind of tricky that is most difficult to reason about: temporally tricky. I spent a lot of time reading the endstop and homing code for the load cell work. I have to agree with what kevin said: the Trigger Position and the Halt Position should not be the same. Since ENDSTOP_SAMPLE_COUNT is hard coded to 4, you should end up not having those two things being equal as there should be ~3 samples worth of overshoot in the Halt Position.

I had always assumed that the MCU had to be queried to find out the exact number of microsteps that were executed before halting, since only it knows where it got to in the decompressed step queue. I don’t know the itersolve.c part of the codebase at all so I cant say that’s whats going on when stepper.get_mcu_position() is called. But from a glance it doesn’t look like it.

~~What would make this worse? Setting ENDSTOP_SAMPLE_COUNT = 16? Setting microsteps to 256? Can you give kevin something that produces a consistent gross error (vs a small error by chance?)~~ (You have these things in the tail of the post)

(I’m selfishly watching this issue to see if it has an impact on load cell probing accuracy/repeatability)

koconnor · October 19, 2024, 11:19pm

I’ve spent many hours researching this issue and unfortunately I am not able to reproduce a problem. In conversations with @Arksine, he has also spent time on this and is also unable to reproduce a problem. So, at this point, we need more information to track down the cause. I am definitely interested in identifying and fixing any problems in this critical area of the code.

We can both readily reproduce the consistent microstep shift occurs between trigger and halt positions during probing however we can not reproduce a problem resulting from that. It is normal for these two internal positions to be different - it’s why the code tracks two positions.

It would help to have the full Klipper log from an event. With the full information in the log it can help me locally recreate a closer replica of the impacted printers. Unfortunately I have not yet had access to a complete log.

It would also help to run some simple tests to check for a “shift accumulation”. The process is pretty simple - add the following macro to the printer.cfg and run it:

[gcode_macro TEST_MICROSTEP_DRIFT]
gcode:
    G90
    G1 X50 Y50 Z25
    M400
    GET_POSITION
    DUMP_TMC STEPPER=stepper_z

    PROBE SAMPLES=70

    G1 X50 Y50 Z25
    M400
    GET_POSITION
    DUMP_TMC STEPPER=stepper_z

During the above test, record the text reported from GET_POSITION and the MSCNT report from DUMP_TMC (eg, MSCNT: 0000003b mscnt=59). The key to the test is that the reported positions from the pre-probe report and the post-probe report should be identical. These commands report Klipper’s position tracking and the stepper motor driver’s position tracking. A “shift accumulation” will show up as a deviation between the two sets of values. A deviation generally indicates that Klipper has an internal error.

If a deviation is reported then I definitely want to see the full Klipper log from the event.

It may also be the case that a simple PROBE doesn’t reveal the problem. (At this point, we can’t reproduce so it’s hard to say for sure what is causing the problem.) Feel free to run other probe commands to see if some other sequence generates the result. Just make sure that the intervening commands don’t purposely change the internal positions (eg, don’t run G28, don’t run QUAD_GANTRY_LEVEL, Z_TILT_ADJUST, FORCE_MOVE, SET_KINEMATIC_POSITION, use a SET_GCODE_OFFSET, activate a bed mesh, or similar) and make sure to always return to the same position after probing (eg, G1 X50 Y50 Z25) prior to running GET_POSITION.

If a problem is found, best to issue an M112 command and then attach the full log here. (Details for generating the log are at Contact - Klipper documentation ).

-Kevin

EDIT: New updates to test macro from garethky.

koconnor · October 21, 2024, 1:58am

2 posts were split to a new topic: Z stepper microstep shift on loadcell

Sineos · October 22, 2024, 4:58am

A post was split to a new topic: Tmc_autotune extra

koconnor · October 27, 2024, 2:17am

FYI, the mainline version of the Klipper code will now warn if it detects stepper motors getting out of sync after a probe attempt. If one sees a log message like Stepper 'stepper_z1' position skew after probe: pos -200631 now -200630 it is a result of this new check. If you do get the message, make sure your micro-controllers are flashed to the lasted version of the code. If they are on the latest version of the code, please open a report on Klipper Discourse and be sure to attach the full Klipper log.

Cheers,
-Kevin

tinkerman · October 28, 2024, 8:59pm

For a long time I had issues with bed mesh and was thrown off to diagnose issues with inconsistent Z distances during print by many other factors:

Temperature affecting readings of my inductive probe
Thermal expansion
inconsistent bed levelling
Crappy lead screw nuts
Probe movement on its fixation point

After upgrading my printer to have 3 independent Zed drivers, and auto levelling the bed (Z tilt), the issue persisted.

In a strike of luck I realized that during bed mesh sometimes the first trigger of the probe sometimes would be slightly inaccurate. So I increased the sampling and voila: After so long I could finally use my entire bed and had perfect first layers. I made a mesh for each temperature / plate, and automated the macro to load the proper mesh according to print bed temperature.

This had been working flawlessly until recently. I had to reflash my board so it got a version from a couple weeks back now and now I am having issues again. Multi sampling no longer produces an accurate mesh, the skew due to my bed warp gets amplified and I am now back limited to using a small area of the bed.

I am wondering if it could be related to this issue and if there were semi recent updates on this area that could be causing this.

I still don’t know how to properly debug this again again beyond painstakingly measuring the accuracy of each point (12x10).

I am going to experiment a bit, perhaps reflash the board again, to see if I can get back to my short lived dream of perfect first layer.

garethky · October 30, 2024, 5:06am

What did you increase exactly? Was it the samples value in the probe section?:

[probe]
samples: 4

Could you follow the testing procedure with the macro and see if your setup is impacted?

tinkerman · October 30, 2024, 6:56am

Yes, it wasn’t defined before, I have set it to 2. I have now increased it to 3 to reduce the odds of having adhesion issues. Probing speed is 5 mm/s (Z)

Right, for some reason I completely missed, will try it out

tinkerman · October 30, 2024, 7:56pm

If I understood the goal of the test correctly, then I got a reproducible result. I even had to reduce the samples in half because 70 would simply cause too big of a drift that was hitting retry limit. Initial zed positions:

stepper_z:25.000081
stepper_z1:25.000081
stepper_z2:25.000081

MSCNT: 000000ec mscnt=236

After 35 samples:

stepper_z:25.000000
stepper_z1:25.000625
stepper_z2:25.000625

MSCNT: 000000ec mscnt=236

MSCNT didn’t change, but positions did.

I wish I could repeat this test when I was not having this issue.

I rolled over the log file before this test.

Hopefully there is something to uncover from it (even if my own blunder).

klippy.log (154.3 KB)

Edit: I noticed an odd pattern on the drift.z1 and z2 drift by the same amount whilst Z drifts separately (causing a tilt that will significantly get worse over time, the taller the print is. Z tilt gets bigger the taller the previous print was). The reason being, I believe is that z1 and z2 lead screws having matching pitches, whilst Z has it’s own pitch (Z, z1 and z2 rotation distances reflect this)

koconnor · October 31, 2024, 2:11am

Thanks. Your log does seem to indicate something incorrect occurred. However there were some code changes recently in this area (both mcu and host) that you don’t have on your install. Could you update to the latest code, reflash the micro-controllers, and retry?

-Kevin

tinkerman · October 31, 2024, 8:55am

Hi Kevin,

I have updated both Host and MCU and ran the test again (35 samples). Result:

Before:
stepper_z:83
stepper_z1:41
stepper_z2:56

After:
stepper_z:83
stepper_z1:33
stepper_z2:48

This result make very little sense as this way too big of a difference, which I would be able to visually distinguish.

klippy_0.12.0-347.log (140.2 KB)

Sineos · October 31, 2024, 9:29am

Are you sure with these results? Are you able to reproduce and validate them?

Side note:
You can already see in the log the warning of the newly introduced check:

Stepper 'stepper_z1' position skew after probe: pos -17312 now -17313
Stepper 'stepper_z2' position skew after probe: pos -17297 now -17298

Edit:
When you rerun it, it might make sense to issue a M112 right after the test finished. Maybe the extended debug information are helpful.

tinkerman · October 31, 2024, 9:33am

I ran Z_TILT_ADJUST right after the previous test and then re-ran the test and found that the positions are whole numbers (again), which I find extremely unlikely.

before:
stepper_z:-2
stepper_z1:-4
stepper_z2:22

after: retry exceeded.

So I tried again and got:

before:
stepper_z:-2
stepper_z1:-30
stepper_z2:-4

after:
stepper_z:-2
stepper_z1:-44
stepper_z2:-18

Z seems consistent, but Z1 and z2 are all over the place.
Gonna powercycle everything and test again and rerun M112 (don’t know what it does)…

Klippy (no rollover):
klippy_0.12.0-347_retest.log (380.4 KB)

Sineos · October 31, 2024, 9:36am

It is the emergency shutdown command. In addition to the shutdown, it will log extended information that might be helpful for further diagnosis. Not sure if it has an impact in this case, but at least in would not hurt.

tinkerman · October 31, 2024, 9:49am

Ahh, that’s helpful

Learning everyday, thanks.

So, I was looking at the wrong numbers (the whole number part). Either way, after power cycling:

before:
stepper_z:25.000168
stepper_z1:24.999543
stepper_z2:24.999543

after:
stepper_z:25.000168
stepper_z1:25.000168
stepper_z2:25.000168

It already looks better to me. Z now matches and z1 and z2 drift together. What are the odds there is an electrical issue here? Though I find strange the drift together. The motors are of different models and run under different currents.

Hum… Had reboot the host, after M112. Mainsail was working, camera was off and restarting klipper didn’t have any effect.

klippy_0.12.0-347_powercycle.log (399.8 KB)

Edit: I will double check this test. Find it odd that z1 and z2 realigned with Z after the test. I expect them to differ as part of tilt adjustment.

Edit 2: Tested again and for the first time I got a matching before and after. Gonna increase sampling back to 70 and retest

tinkerman · October 31, 2024, 10:36am

70 samples doesn’t work, I think it’s just a statistical impediment due to hardware limitation (it eventually hiccups with the tolerance I set).

But after running it twice (neither finished), The starting values for all z positions matched.

I will post here the rolled logs and start a real test, with a new bed mesh and print. I am optimistic that the issue is resolved.

klippy_0.12.0-347_70samples.log (255.5 KB)

tinkerman · October 31, 2024, 12:45pm

Now, this is what I like to see:

Even though not everything is perfectly tuned, I am getting perfectly parallel parts in relation to the bed and perfectly flush layers. The sound of a well calibrated printer is when only the fan is clearly audible. No sound of the nozzle scraping the parts during travel.

This print earlier today failed in that beautiful string we all grown very intimate of. It had happened by this layer already. Now I feel more confident to boost the speed a bit more.

Thanks for the hard and good work on this. Truly appreciate the effort.

I have perfect first layer again and next print I will use the full area of the bed.

Sineos · October 31, 2024, 1:01pm

Great that it is working for you. Just your log would indicate the contrary at a cursory glance

LifeOfBrian · October 31, 2024, 1:38pm

Is your fan duct blocken by the sock of the heating block?
Does not look like the air can reach a spot below the nozzle tip.

Topic		Replies	Views
Testers Needed! Multi_MCU Homing/Probing Fix Developers	1	366	December 31, 2024
Probe triggered prior to movement only at 1 spot General Discussion	3	1876	April 10, 2024
Z stepper microstep shift on multi-mcu homing / loadcell / ldc1612 General Discussion	27	804	December 26, 2024
One second pause during probing General Discussion	17	1910	May 6, 2023
BTT Microprobe Configuration Problem General Discussion	12	1601	January 16, 2024