Intermittent stepcompress "Invalid sequence" on CoreXY+input_shaper+probe_eddy_current — bed_mesh fade strongly affects MTBF

Basic Information:

Printer Model: Ender 5 Max (CoreXY)
MCU / Printerboard: Creality CR4NS200323C10 mainboard (120 MHz STM32) + CR-NOZZLE_V21 (120 MHz) + BTT Eddy USB (RP2040)
Host / SBC: Creality Nebula Pad (Ingenic T31X, MIPS XBurst2, single-core)
Klipper version: v0.13.0-628-g373f200ca + 1 local commit (see note below)
klippy.log: attached (4 crashes consolidated)

Issue summary

Intermittent stepcompress oid=X i=<negative> c=1 a=0: Invalid sequence shutdowns
during long prints, with the same C-level signature across 4 occurrences but
randomized magnitudes. The error always hits a CoreXY stepper (X or Y), never Z
or E. Reducing bed_mesh fade_end from 20 to 10 multiplied mean-time-to-crash
by roughly 8× without eliminating the bug.

Related (now closed) thread:

The other report there is on Voron 2.4 + Cartographer; same C-level signature,
different setup, same oid=8 hint that XY motors are specifically affected.

Setup

  • Kinematics: corexy
  • Steppers: TMC2209, interpolate: True, microsteps 16 (reduced from 32 as mitigation)
  • Input shaper: ZV @ 50.6 Hz on X, MZV @ 41.8 Hz on Y
  • Probe: [probe_eddy_current] (BTT Eddy USB), tap_threshold: 28
  • Bed mesh: rapid_scan, 15×15, fade_start: 1.0, currently fade_end: 10
  • max_velocity: 1000, max_accel: 4700 (reduced from 6500)

“Dirty” disclosure

I’m aware of the policy. My setup deviates from pristine in two ways:

  1. One local commit on top of master that filters a STEPPER_STEP_BOTH_EDGE
    warning for an older Creality MCU firmware. Pure cosmetics in
    klippy/configfile.py, 4 lines, no functional change.
  2. One untracked module (klippy/extras/probe_eddy_auto_calibrate.py),
    which is only invoked during calibration commands — never loaded or
    touched during the prints that crash.

I can rebuild and test without the cosmetic patch on request. The untracked
module is inert during prints by design.

Crash signature

All 4 crashes share the identical stack trace:

    b'stepcompress o=X i=<negative> c=1 a=0: Invalid sequence'
    b"Error in syncemitter 'stepper_<X|Y>' step generation"
    Exception in flush_handler
    Traceback (most recent call last):
      File "/usr/data/klipper/klippy/extras/motion_queuing.py", line 198, in _flush_handler
        self._advance_flush_time(0., want_sg_time)
      File "/usr/data/klipper/klippy/extras/motion_queuing.py", line 156, in _advance_flush_time
        raise self.mcu.error("Internal error in stepcompress")
    mcu.error: Internal error in stepcompress
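
For anyone who wants to pull the same fields out of their own klippy.log, here is a minimal sketch. It assumes the error line has the form shown in the trace above; the helper name find_invalid_sequences and the field names (oid, interval, count, add) are my own labels, not Klipper identifiers.

```python
import re

# Matches the stepcompress error line as quoted in the trace above.
# Reading o/i/c/a as oid/interval/count/add is an assumption for illustration.
STEPCOMPRESS_RE = re.compile(
    r"stepcompress o=(?P<oid>\d+) i=(?P<interval>-?\d+) "
    r"c=(?P<count>\d+) a=(?P<add>-?\d+): Invalid sequence"
)


def find_invalid_sequences(lines):
    """Yield (line_number, fields) for every 'Invalid sequence' line."""
    for lineno, line in enumerate(lines, start=1):
        match = STEPCOMPRESS_RE.search(line)
        if match:
            # Convert every captured field to int for easier comparison.
            yield lineno, {k: int(v) for k, v in match.groupdict().items()}


if __name__ == "__main__":
    # Sample built from crash 2 in the table; replace with open("klippy.log").
    sample = [
        "Stats 1234.5: ...",
        "b'stepcompress o=8 i=-60092 c=1 a=0: Invalid sequence'",
    ]
    for lineno, fields in find_invalid_sequences(sample):
        print(lineno, fields)
```

Running it over several logs gives one row per crash, which makes it easier to compare occurrences side by side.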

| # | Stepper (oid) | i (signed) | i (uint32 hex) | Print elapsed | Notes |
|---|---------------|------------|----------------|---------------|-------|
| 1 | x (oid=5) | (negative) | - | ~7 h | microsteps=32, accel 6500, fade_end=20 |
| 2 | y (oid=8) | -60092 | 0xFFFF1544 | 2 h 41 min | microsteps=16, accel 4700, fade_end=20 |
| 3 | x (oid=5) | -16778133 | 0xFEFFFC6B | 1 h 53 min | microsteps=16, accel 4700, fade_end=20 |
| 4 | y (oid=8) | -2092144 | 0xFFE01390 | 17 h 11 min | microsteps=16, accel 4700, fade_end=10 |

Key observations

  1. Always X or Y, never Z or E. 4/4. With CoreXY, X and Y are the two motors
    driven by the input shaper’s joint convolution; Z and E use independent trapqs.

  2. i values look like data corruption, not deterministic overflow. The hex
    values 0xFFFF1544, 0xFEFFFC6B, 0xFFE01390 (the two's-complement form of the
    signed values in the table) don’t share a modular structure or a common bit
    pattern. They look like a uint32 read that captured a partially updated state.

  3. bed_mesh fade_end is in the causal chain. Reducing fade_end from 20
    to 10 changed average time-to-crash from ~3.5 h (3 crashes) to 17 h (1 crash),
    roughly an 8× improvement, but did not eliminate the bug.

  4. Other mitigations reduce frequency but don’t fix the bug:

    • microsteps 32 → 16
    • max_accel 6500 → 4700
    • jerks lowered to 5

    None of these eliminated it. mcu_awake is consistently 0.000–0.001 across
    all four crashes, so the MCU is nowhere near saturation.
  5. memavail is healthy at crash time (127 MB+ free). Not a memory issue.
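
To cross-check the signed and uint32 views of i from observation 2, here is a small sketch. It only assumes that the logged i is a 32-bit value printed as a signed integer; the helper names are made up for illustration.

```python
def as_uint32_hex(value):
    """Return the two's-complement uint32 form of a signed 32-bit value."""
    return f"0x{value & 0xFFFFFFFF:08X}"


def as_int32(raw):
    """Interpret a raw uint32 as a signed 32-bit integer."""
    raw &= 0xFFFFFFFF
    return raw - 0x100000000 if raw & 0x80000000 else raw


if __name__ == "__main__":
    # Signed i values taken from crashes 2-4 in the table.
    for i in (-60092, -16778133, -2092144):
        # Print signed value, hex form, and the full 32-bit pattern.
        print(i, as_uint32_hex(i), f"{i & 0xFFFFFFFF:032b}")
```

Printing the binary form next to the hex makes it easier to judge whether the values share any bit pattern.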

Hypothesis (offered for discussion, not as a conclusion)

After reading through chelper/stepcompress.c, itersolve.c, kin_shaper.c,
kin_corexy.c, steppersync.c and motion_queuing.py, I couldn’t find a
single-threaded path that produces move.interval >= 0x80000000 in check_line.
The iterative solver maintains low_time monotonically within an invocation,
and last_flush_time is monotone across invocations.
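
To make the condition concrete, here is a Python model of the check as I understand it. Only the interval >= 0x80000000 test is taken from the reading above; the other two conditions are my recollection of check_line and should be verified against chelper/stepcompress.c. The function names are hypothetical.

```python
def is_invalid_sequence(interval, count, add):
    """Model of the 'Invalid sequence' test for one queued move."""
    interval &= 0xFFFFFFFF  # the C field is treated as an unsigned 32-bit value
    return (
        count == 0                                      # assumed: empty move
        or (interval == 0 and add == 0 and count > 1)   # assumed: zero-time run
        or interval >= 0x80000000                       # the case discussed here
    )


def interval_between(last_step_clock, next_step_clock):
    """uint32 distance between two step clocks (wraps on underflow)."""
    return (next_step_clock - last_step_clock) & 0xFFFFFFFF


if __name__ == "__main__":
    # With c=1 and a=0, only the interval test can trip.
    print(is_invalid_sequence(-60092, 1, 0))
    # A step scheduled 60092 ticks before the previous one wraps around
    # to a huge uint32, which prints as a negative number with %d.
    print(hex(interval_between(1_000_000, 1_000_000 - 60092)))
```

In this model, any step whose clock lands before the previous step's clock produces exactly the kind of "negative" i seen in the logs, whatever the reason for the ordering problem.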

The fact that:

  • the issue is statistical and only triggers on shared-trapq steppers (X/Y on
    CoreXY)
  • Z and E (separate trapqs) never trigger
  • fade_end (more shared bed_mesh state per stepper-frame) strongly affects MTBF
  • the corruption pattern looks like a partial read of pos->clock32 - (uint32_t)last_step_clock

…is consistent with a race in the per-stepper step generation introduced by
PR #6992 (a89694ac6, Sep 3, 2025). The follow-up fix 3c01f71d9
(Sep 24, 2025) is present in my build, plus 220 days of additional commits.

I freely admit this is speculation. The Voron+Cartographer case in the closed
thread points the same way (oid=8, same stack, post-multi-thread Klipper),
and Sineos there did mention “it could be some race condition that may hit
home or not, depending on circumstances not fully understood”.

What I’d find most useful

  1. Diagnostic instrumentation. Is there a recommended way to instrument
    check_line() or compress_bisect_add() to dump the queue state and the
    most recent moves at error time? My setup is reliably reproducible (one crash
    per 3–17 h of printing), so capturing a structured dump on the next occurrence
    would be much more informative than further code reading.

  2. Sequential step generation as a diagnostic. A build flag (or even a
    one-off patch) that interleaves se_start_gen_steps and se_finalize_gen_steps
    per syncemitter in steppersyncmgr_gen_steps() would conclusively answer the
    multi-thread-race vs other-cause question. Has anyone tried this?

  3. Cross-check from another reporter. If anyone else here is hitting this
    on a CoreXY + input_shaper + probe_eddy_current/cartographer/beacon setup,
    I’d love a confirmation that fade_end also affects the MTBF on their side.
    That would strongly support the trapq-sharing hypothesis.

Happy to provide additional logs, extracts, run any test (including a vanilla
rebuild to drop the dirty flag) and report back. Targeting the underlying issue,
not asking for a hand-holding workaround.

cc @nefelim4ag @koconnor

crash4_extract.log (63.9 KB)

Alas, AI is not helpful.
If you would like, I can generate a longer answer for you without a word of meaning.

Otherwise, we need a full, unmodified log that contains the error/reproduction in the first place,
from the start of the machine to the crash.
You can zip it if it is too large.

-Timofey

Thank you for answering :)

I understand AI is not always the way, and you are right. But it is useful sometimes ^^

My English is not always good enough to discuss technical things, so I used Claude to translate and summarize.

klippy_log.zip (2.5 MB)

The full log is attached as a zip.

Hmmm, if I ignore all of the modifications,
I guess the only clue that I have is the frequent updates of SCV (square corner velocity).
IIRC, I’ve seen a similar log and a similar issue.

For now, I can only suggest eliminating the frequent SCV updates.
It is possible that there is a bug, but I’m unable to reproduce it even if I do SCV updates every other line of G-code.

Thank you for taking the time!

I checked the source G-code: SET_VELOCITY_LIMIT occurs 110717 times, with only 5 distinct values (5/7/9/15/20).

Does that sound normal to you?

Do you think a cached wrapper macro for SET_VELOCITY_LIMIT would make a difference?
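
As an alternative to a macro, the redundant calls could be stripped from the G-code file before printing. This is a minimal sketch, assuming that the limits are changed only by SET_VELOCITY_LIMIT lines in the file itself (not by macros or the UI); drop_redundant_limits is a hypothetical helper, and SQUARE_CORNER_VELOCITY in the example is my guess at the parameter being set.

```python
import re

# Captures the arguments of a SET_VELOCITY_LIMIT line, up to any ';' comment.
SVL_RE = re.compile(r"^\s*SET_VELOCITY_LIMIT\b(?P<args>[^;]*)", re.IGNORECASE)


def parse_params(args):
    """Split 'KEY=value KEY2=value2' into a dict with upper-case keys."""
    params = {}
    for token in args.split():
        key, sep, value = token.partition("=")
        if sep:
            params[key.upper()] = value
    return params


def drop_redundant_limits(lines):
    """Return (kept_lines, dropped_count), skipping calls that change nothing."""
    current = {}  # last value seen for each parameter, compared as text
    kept, dropped = [], 0
    for line in lines:
        match = SVL_RE.match(line)
        if match:
            params = parse_params(match.group("args"))
            if params and all(current.get(k) == v for k, v in params.items()):
                dropped += 1
                continue
            current.update(params)
        kept.append(line)
    return kept, dropped


if __name__ == "__main__":
    gcode = [
        "SET_VELOCITY_LIMIT SQUARE_CORNER_VELOCITY=5",
        "G1 X10 Y10",
        "SET_VELOCITY_LIMIT SQUARE_CORNER_VELOCITY=5",
        "G1 X20 Y10",
        "SET_VELOCITY_LIMIT SQUARE_CORNER_VELOCITY=7",
    ]
    kept, dropped = drop_redundant_limits(gcode)
    print(dropped, kept)
```

Values are compared as text, so "5" and "5.0" count as different and both calls are kept, which errs on the side of not dropping anything that might matter.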

May I ask about another point related to this issue: reducing bed_mesh fade_end from 20 to 10 let the print run for 17 h instead of ~3.5 h before crashing. Could that be an additional clue for you?


Thanks, by the way: your commit 4cc47cf56 validates the approach we had in our custom module.