Intermittent stepcompress "Invalid sequence" on CoreXY+input_shaper+probe_eddy_current — bed_mesh fade strongly affects MTBF

Basic Information:

Printer Model: Ender 5 Max (CoreXY)
MCU / Printerboard: Creality CR4NS200323C10 mainboard (120 MHz STM32) + CR-NOZZLE_V21 (120 MHz) + BTT Eddy USB (RP2040)
Host / SBC: Creality Nebula Pad (Ingenic T31X, MIPS XBurst2, single-core)
Klipper version: v0.13.0-628-g373f200ca + 1 local commit (see note below)
klippy.log: attached (4 crashes consolidated)

Issue summary

Intermittent stepcompress oid=X i=<negative> c=1 a=0: Invalid sequence shutdowns
during long prints, with the same C-level signature across 4 occurrences but
randomized magnitudes. The error always hits a CoreXY stepper (X or Y), never Z
or E. Reducing bed_mesh fade_end from 20 to 10 raised the time-to-crash from
~3.5 h on average (3 crashes) to 17 h (1 crash), roughly 5×, without
eliminating the bug.
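
As a quick arithmetic check on that ratio, here is a minimal sketch using the elapsed times from the crash table in this post. Crash 1 is only known as "~7 h", and there is a single fade_end=10 data point, so the result is an indication rather than a statistic. The variable names are illustrative only.

```python
# Elapsed print time before each crash, in minutes, copied from the crash
# table in this post. Crash 1 is only known approximately (~7 h).
crashes_fade_end_20 = {1: 7 * 60, 2: 2 * 60 + 41, 3: 1 * 60 + 53}
crash_fade_end_10 = 17 * 60 + 11


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# All three fade_end=20 crashes, regardless of microsteps/accel settings.
mean_all = mean(crashes_fade_end_20.values())
# Only crashes 2 and 3, which share microsteps=16 and accel 4700 with crash 4.
mean_comparable = mean(v for k, v in crashes_fade_end_20.items() if k != 1)

ratio_all = crash_fade_end_10 / mean_all
ratio_comparable = crash_fade_end_10 / mean_comparable

print(f"mean (all three):   {mean_all / 60:.2f} h, ratio {ratio_all:.1f}x")
print(f"mean (crashes 2-3): {mean_comparable / 60:.2f} h, ratio {ratio_comparable:.1f}x")
```

The ratio depends noticeably on whether crash 1, which ran with different microsteps and acceleration, is included in the baseline.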

Related (now closed) thread:

The other report there is on Voron 2.4 + Cartographer; same C-level signature,
different setup, same oid=8 hint that XY motors are specifically affected.

Setup

  • Kinematics: corexy
  • Steppers: TMC2209, interpolate: True, microsteps 16 (reduced from 32 as mitigation)
  • Input shaper: ZV @ 50.6 Hz on X, MZV @ 41.8 Hz on Y
  • Probe: [probe_eddy_current] (BTT Eddy USB), tap_threshold: 28
  • Bed mesh: rapid_scan, 15×15, fade_start: 1.0, currently fade_end: 10
  • max_velocity: 1000, max_accel: 4700 (reduced from 6500)

“Dirty” disclosure

I’m aware of the policy. My setup deviates from pristine in two ways:

  1. One local commit on top of master that filters a STEPPER_STEP_BOTH_EDGE
    warning for an older Creality MCU firmware. Pure cosmetics in
    klippy/configfile.py, 4 lines, no functional change.
  2. One untracked module (klippy/extras/probe_eddy_auto_calibrate.py),
    which is only invoked during calibration commands — never loaded or
    touched during the prints that crash.

I can rebuild and test without the cosmetic patch on request. The untracked
module is inert during prints by design.

Crash signature

All 4 crashes share the identical stack trace:
b'stepcompress o=X i=<negative> c=1 a=0: Invalid sequence'
b"Error in syncemitter 'stepper_<X|Y>' step generation"
Exception in flush_handler
Traceback (most recent call last):
  File "/usr/data/klipper/klippy/extras/motion_queuing.py", line 198, in _flush_handler
    self._advance_flush_time(0., want_sg_time)
  File "/usr/data/klipper/klippy/extras/motion_queuing.py", line 156, in _advance_flush_time
    raise self.mcu.error("Internal error in stepcompress")
mcu.error: Internal error in stepcompress

#  Stepper (oid)  i (signed)  i (uint32 hex)  Print elapsed  Notes
1  x (oid=5)      (negative)  n/a             ~7 h           microsteps=32, accel 6500, fade_end=20
2  y (oid=8)      -60092      0xFFFF1544      2 h 41 min     microsteps=16, accel 4700, fade_end=20
3  x (oid=5)      -16778133   0xFEFFFC6B      1 h 53 min     microsteps=16, accel 4700, fade_end=20
4  y (oid=8)      -2092144    0xFFE01390      17 h 11 min    microsteps=16, accel 4700, fade_end=10

Key observations

  1. Always X or Y, never Z or E. 4/4. With CoreXY, X and Y are the two motors
    driven by the input shaper’s joint convolution; Z and E use independent trapqs.

  2. i values look like data corruption, not deterministic overflow. The hex
    values 0xFFFF1544, 0xFEFFFC6B, 0xFFE01390 (the logged i values reinterpreted
    as uint32) don’t share a modular structure or a common bit pattern. They look
    like a uint32 read that captured a partially updated state.

  3. bed_mesh fade_end is in the causal chain. Reducing fade_end from 20
    to 10 changed average time-to-crash from ~3.5 h (3 crashes) to 17 h (1 crash),
    roughly a 5× improvement, but did not eliminate the bug.

  4. Other mitigations reduce frequency but don’t fix the bug:

    • microsteps 32 → 16
    • max_accel 6500 → 4700
    • jerks lowered to 5

    None eliminated it. mcu_awake is consistently 0.000–0.001 across all four
    crashes, so the MCU is nowhere near saturation.

  5. memavail is healthy at crash time (127 MB+ free). Not a memory issue.
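
To make the i values easier to interpret, here is a minimal sketch that reinterprets the logged signed values as uint32 and compares them with the 0x80000000 limit mentioned for check_line. It assumes the logged i is the 32-bit interval printed as a signed integer. The helper names (to_uint32, to_int32, wrapped_interval) are illustrative and not part of Klipper. Its output can be used to cross-check the hex column of the crash table.

```python
MASK32 = 0xFFFFFFFF
INTERVAL_LIMIT = 0x80000000  # threshold quoted in this post for check_line


def to_uint32(value):
    """Reinterpret a Python int as an unsigned 32-bit value."""
    return value & MASK32


def to_int32(value):
    """Reinterpret the low 32 bits of a Python int as a signed 32-bit value."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def wrapped_interval(step_clock, last_step_clock):
    """uint32 difference, in the spirit of `clock32 - (uint32_t)last_step_clock`."""
    return to_uint32(step_clock - last_step_clock)


# Signed i values as logged in crashes 2, 3 and 4.
for logged_i in (-60092, -16778133, -2092144):
    raw = to_uint32(logged_i)
    print(f"i={logged_i:>10}  uint32=0x{raw:08X}  over_limit={raw >= INTERVAL_LIMIT}")

# A step clock that lands N ticks *before* the last step clock wraps to
# 2**32 - N, which reads as -N when interpreted as a signed 32-bit integer.
example = wrapped_interval(step_clock=1_000_000, last_step_clock=1_060_092)
print(f"wrapped=0x{example:08X}  as_int32={to_int32(example)}")
```

Read this way, a small negative i could also mean a step that was scheduled slightly before the previous step clock, rather than an arbitrary corrupted word. Either reading would need the debug dump to confirm.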

Hypothesis (offered for discussion, not as a conclusion)

After reading through chelper/stepcompress.c, itersolve.c, kin_shaper.c,
kin_corexy.c, steppersync.c and motion_queuing.py, I couldn’t find a
single-threaded path that produces move.interval >= 0x80000000 in check_line.
The iterative solver maintains low_time monotonically within an invocation,
and last_flush_time is monotone across invocations.

The fact that:

  • the issue is statistical and only triggers on shared-trapq steppers (X/Y on
    CoreXY)
  • Z and E (separate trapqs) never trigger
  • fade_end (more shared bed_mesh state per stepper-frame) strongly affects MTBF
  • the corruption pattern looks like a partial read of pos->clock32 - (uint32_t)last_step_clock

…is consistent with a race in the per-stepper step generation introduced by
PR #6992 (a89694ac6, Sep 3, 2025). The follow-up fix 3c01f71d9
(Sep 24, 2025) is present in my build, plus 220 days of additional commits.

I freely admit this is speculation. The Voron+Cartographer case in the closed
thread points the same way (oid=8, same stack, post-multi-thread Klipper),
and Sineos there did mention that “it could be some race condition that may hit
home or not, depending on circumstances not fully understood”.

What I’d find most useful

  1. Diagnostic instrumentation. Is there a recommended way to instrument
    check_line() or compress_bisect_add() to dump the queue state and the
    most recent moves at error time? My setup is reliably reproducible (one crash
    per 3–17 h of printing), so capturing a structured dump on the next occurrence
    would be much more informative than further code reading.

  2. Sequential step generation as a diagnostic. A build flag (or even a
    one-off patch) that interleaves se_start_gen_steps and se_finalize_gen_steps
    per syncemitter in steppersyncmgr_gen_steps() would conclusively answer the
    multi-thread-race vs other-cause question. Has anyone tried this?

  3. Cross-check from another reporter. If anyone else here is hitting this
    on a CoreXY + input_shaper + probe_eddy_current/cartographer/beacon setup,
    I’d love a confirmation that fade_end also affects the MTBF on their side.
    That would strongly support the trapq-sharing hypothesis.

Happy to provide additional logs, extracts, run any test (including a vanilla
rebuild to drop the dirty flag) and report back. Targeting the underlying issue,
not asking for a hand-holding workaround.

cc @nefelim4ag @koconnor

crash4_extract.log (63.9 KB)

Alas, AI is not helpful.
If you would like, I can generate you a longer answer without a word of meaning.

Otherwise, we need a full, unmodified log that contains the error/reproduction in the first place,
from the start of the machine to the crash.
You can zip it if it is too large.

-Timofey

Thank you for answering :)

I understand AI is not always the way, and you are right. But it is useful sometimes ^^

My English is not always good enough to discuss technical things, so I used Claude to translate and summarize.

klippy_log.zip (2.5 MB)

The full log is attached as a zip.

Hmmm, if I ignore all of the modifications,
I guess the only clue that I have is the frequent updates of SCV.
IIRC, I’ve seen a similar log and a similar issue.

For now, I can only suggest eliminating the frequent SCV updates.
It is possible that there is a bug, but I’m unable to reproduce it even if I do SCV updates every other line of G-code.

Thank you for taking the time!

I checked the source G-code: SET_VELOCITY_LIMIT has 110717 occurrences, with only 5 distinct values (5/7/9/15/20).

Does that sound normal to you?

Do you think a cached wrapper macro for SET_VELOCITY_LIMIT would make a difference?
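
For reference, a count like the one above can be reproduced with a short script. This is a minimal sketch, assuming the five values are SQUARE_CORNER_VELOCITY arguments (the "SCV" mentioned earlier in the thread). The function name summarize_scv and the sample lines are made up for illustration. It also counts back-to-back repeats of the same value, which is what a cached wrapper macro would skip.

```python
import re
from collections import Counter

# Matches e.g. "SET_VELOCITY_LIMIT SQUARE_CORNER_VELOCITY=5" (case-insensitive).
SCV_RE = re.compile(
    r"^\s*SET_VELOCITY_LIMIT\b.*?\bSQUARE_CORNER_VELOCITY=([0-9.]+)",
    re.IGNORECASE,
)


def summarize_scv(lines):
    """Count SCV updates, distinct values, and back-to-back repeats."""
    values = Counter()
    total = 0
    redundant = 0  # same value as the previous update: a cache would skip these
    previous = None
    for line in lines:
        code = line.split(";", 1)[0]  # drop G-code comments
        match = SCV_RE.match(code)
        if not match:
            continue
        value = float(match.group(1))
        total += 1
        values[value] += 1
        if value == previous:
            redundant += 1
        previous = value
    return total, values, redundant


# Tiny made-up sample; for a real file, pass the open file object instead.
sample = """\
SET_VELOCITY_LIMIT SQUARE_CORNER_VELOCITY=5
G1 X10 Y10 E0.5
SET_VELOCITY_LIMIT SQUARE_CORNER_VELOCITY=5 ; unchanged
G1 X20 Y10 E0.5
SET_VELOCITY_LIMIT SQUARE_CORNER_VELOCITY=9
""".splitlines()

total, values, redundant = summarize_scv(sample)
print(total, sorted(values), redundant)
```

If most of the 110717 updates turn out to be redundant repeats, caching would remove them; if the value genuinely alternates between moves, a cache would change little.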

May I ask about another point related to this issue: reducing bed_mesh fade_end from 20 to 10 let the print run for 17 h instead of ~3.5 h. Could that be an additional clue for you?

Thanks, by the way: your commit 4cc47cf56 validates the approach we had in our custom module.

Alas, these types of errors are quite hard to debug. There have been sporadic reports of individuals running into the issue, but I’ve not seen any widespread reports of problems.

It’s possible that a “race condition” could corrupt memory like that, but I don’t see any issues with the code. In particular the time updating looks okay to me.

I’ve added a new debugging PR (Klipper3d/klipper Pull Request #7271, “Improved debugging on "Internal error in stepcompress"”) for additional debugging during one of these events (it should dump the trapq and past stepper moves).

Other than that, one can add errorf() calls to any part of the C code.

Well, I guess you could change steppersyncmgr_gen_steps() to call se_finalize_gen_steps() immediately after se_start_gen_steps(). It’d still be multi-threaded, but at least it would be sequential.
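
The difference between the two orderings can be pictured with a small Python analogy. This is only a sketch of the scheduling idea, not Klipper's C code: generate_steps and both gen_steps_* functions are hypothetical stand-ins, with submit() playing the role of se_start_gen_steps() and result() the role of se_finalize_gen_steps().

```python
from concurrent.futures import ThreadPoolExecutor


def generate_steps(name, flush_time):
    """Hypothetical stand-in for one emitter's step generation work."""
    return f"{name}@{flush_time}"


def gen_steps_overlapped(pool, emitters, flush_time):
    """Start every emitter first, then collect: workers may run concurrently."""
    futures = [pool.submit(generate_steps, e, flush_time) for e in emitters]  # "start"
    return [f.result() for f in futures]  # "finalize"


def gen_steps_sequential(pool, emitters, flush_time):
    """Finalize each emitter right after starting it: one worker busy at a time."""
    results = []
    for e in emitters:
        future = pool.submit(generate_steps, e, flush_time)  # "start"
        results.append(future.result())  # "finalize" before the next start
    return results


emitters = ["stepper_x", "stepper_y", "stepper_z", "extruder"]
with ThreadPoolExecutor(max_workers=4) as pool:
    overlapped = gen_steps_overlapped(pool, emitters, 1.25)
    sequential = gen_steps_sequential(pool, emitters, 1.25)

# Same results either way; only the overlap between workers differs.
print(overlapped == sequential)
```

In the sequential variant the work still runs on worker threads, but no two emitters are ever in flight at once. If the error persisted with that ordering, a race between emitter threads would become a less likely explanation.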

For what it is worth, it seems very unusual that you can repeatedly get this error, when so many other machines don’t ever get the error. If I had to guess, it’d be due to something on your host machine (maybe something quirky in the architecture, something in gcc exposing an issue, or similar).

I can’t think of any way that a fade_end change or many SET_VELOCITY_LIMIT commands would cause an internal step compress error.

Finally, if you can produce a log using pristine code then I’ll try to take a look at it. (It’d have to be pristine for me to look at it, because errors like this require many hours of debugging and I can’t afford to do that if there is any possibility of unknown code.)

Maybe that helps a little,
-Kevin