Load Cell caused 'Rescheduled timer in the past'?

Basic Information:

:white_check_mark: Printer Model: Voron 2.4 + Nextruder
:white_check_mark: MCU / Printerboard: Octopus Pro / stm32f446xx @ 180 MHz
:white_check_mark: Host / SBC: Raspberry Pi 4 Model B Rev 1.4
:white_check_mark: klippy.log: klippy-error.log (669.1 KB)

Describe your issue:

I’m hoping one of you that’s better than me at parsing that log file can give me some insights into what is going on.

In this PR (Load Cell Probe by garethky · Pull Request #6871 · Klipper3d/klipper · GitHub) I rebased on top of mainline, which has the step-on-both-edges optimization turned on. When I did this I got a crash while printing: MCU 'mcu' shutdown: Rescheduled timer in the past. I also noticed higher CPU usage on the MCU while printing.

The printer got through the initial startup routine, including bed meshing, without issue. The first crash happened on a rapid at the start of the print, so I dialed back the rapid speeds. Then a second crash happened while it was printing a first layer perimeter, so not a high step rate. That makes me suspicious that it's just random bad timing luck.

I tried to narrow down the root cause with a couple of experiments:

  • Disabling the step on both edges optimization and re-flashing fixes the issue
  • Hacking the code to turn off the load cell sensor while not probing also fixes the issue.

The latest code in the PR actually stops using the sensor data while a homing move is not happening: code
This should mean that this version is essentially the same as not having this PR at all and just running a [load_cell] with the HX71x sensor while printing. So perhaps the root cause is already in mainline. Unfortunately I only have the one test machine and no alternate way to home Z, or I'd air-print with the sensor running to verify this idea.

I have to admit that I'm finding it very hard not to say "see, Rescheduled timer in the past" :face_with_peeking_eye:

Good luck and thanks for keeping this up with such dedication. I look forward to seeing the final results.

For my chip (STM32F446) the 3-stepper benchmark figure is 205 ticks. I think the lowest i value in the log is i=320, on stepper_y.

mcu_awake=0.363 mcu_task_avg=0.000130
I believe this means the MCU was awake 36.3% of the time, which seems very high. With ‘both edges’ turned off I think it's more like 2%.

Well, the log seems to indicate the error message is as it says - the timer scheduler got backlogged to the point it couldn't keep up with the requested schedule.

The way I diagnose issues like this is by using the scripts/logextract.py tool. That will reorganize the log by the ordering occurring on the micro-controller. Debugging - Klipper documentation
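For example, a rough filter like this (just a sketch, not part of logextract; it assumes the queue_step lines keep the format shown below) can pull out the smallest step interval seen per stepper:

```python
# Rough sketch: scan a logextract-style dump for queue_step commands and
# report the smallest step interval (the i= field) seen for each stepper.
# Assumes lines of the form:
#   stepper_y queue_step 134: t=74001650230(43463.758944) p=26241 i=321 c=-831 a=0
import re
import sys

QUEUE_STEP = re.compile(r"^(\S+) queue_step \d+:.* i=(-?\d+)")

def min_intervals(path):
    smallest = {}
    with open(path) as f:
        for line in f:
            m = QUEUE_STEP.match(line.strip())
            if m is None:
                continue
            name, interval = m.group(1), int(m.group(2))
            if 0 < interval < smallest.get(name, 1 << 62):
                smallest[name] = interval
    return smallest

if __name__ == "__main__":
    for name, interval in sorted(min_intervals(sys.argv[1]).items()):
        print(f"{name}: smallest interval {interval} ticks")
```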

From that analysis, at the time of the shutdown it looks like stepper_y was heavily active:

stepper_y queue_step 134: t=74001650230(43463.758944) p=26241 i=321 c=-831 a=0

That is an interval of 321 - which is a step every ~1.78us on the stm32f446 (which runs at 180MHz). The log doesn't seem to indicate much other stepper activity at the time (it seems stepper_z through stepper_z3 are running, but at a relatively slow step rate).

That is a high step rate (~560K steps per second to just the y), but the stm32f446 should have been able to handle it (the benchmarks show 1 stepper active can handle an interval=46 and 3 steppers active can handle an interval=205). The “rescheduled timer in the past” error is only raised if the scheduler falls behind by 1ms. So, by the time of the error, stepper_y was likely ~560 steps behind.
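Back-of-the-envelope, using just those numbers (a rough sketch, nothing Klipper-specific):

```python
# Back-of-the-envelope numbers for the queue_step line above.
MCU_FREQ = 180_000_000   # stm32f446 clock in Hz
INTERVAL = 321           # ticks between steps on stepper_y

step_period_us = INTERVAL / MCU_FREQ * 1e6
step_rate = MCU_FREQ / INTERVAL
# The shutdown fires once the scheduler is 1ms behind the requested schedule.
steps_behind = step_rate * 0.001

print(f"one step every {step_period_us:.2f}us")      # ~1.78us
print(f"~{step_rate / 1000:.0f}K steps per second")  # ~561K
print(f"~{steps_behind:.0f} steps behind at the 1ms shutdown threshold")
```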

The reason disabling “step on both edges” changed the behavior is subtle - the different stepper driver implementations have a slightly different pattern to how they check that they are keeping up. So, it likely changed the pattern of error detection, but in no way fixed the underlying problem. (The stepper was likely still wildly behind on its requested schedule.)

So, the big question is, what other timers were running on the mcu at the time of the fault? This is where knowing the exact code running and exact configuration is critical. So, I’m not able to offer much help. What you are looking for is all the timers that may be running and what their “rest_ticks” are set to. Tasks are unlikely to be a culprit (unless they are holding irqs disabled for 100s of microseconds). It’s also possible that general coding errors (like enabling irqs when not supposed to, or disabling irqs and forgetting to reenable them, forgetting to set a timer to its correct rest time, etc.) could cause problems like this.

Hope that helps a little. If you can reproduce on code with just the minimal load_cell implementation (such that I can have good confidence I know exactly all the code running) then I may be able to offer more insight.

Cheers,
-Kevin

That does look suspicious, but it is unclear if that is a symptom or a cause. You can see the raw mcu message that generated that statistic in the log:

Receive: 24 43462.742937 43462.736409 13: seq: 12, stats count=2789 sum=65403029 sumsq=4294967295

This indicates that over the last 5 seconds the task code went idle 2789 times and was busy for 65403029 ticks when not idle. So, roughly 1/14th of the last 5 seconds were spent with tasks not idle (180000000 * 5 / 65403029).
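Rough arithmetic for that (a sketch, using the 5 second stats window and the 180MHz clock):

```python
# Rough arithmetic for the stats message above.
MCU_FREQ = 180_000_000   # Hz
WINDOW = 5               # stats are reported over a 5 second window
BUSY_TICKS = 65403029    # "sum" field from the stats message

busy_fraction = BUSY_TICKS / (MCU_FREQ * WINDOW)
print(f"tasks not idle for {busy_fraction:.1%} of the window"
      f" (about 1/{1 / busy_fraction:.0f})")
# -> tasks not idle for 7.3% of the window (about 1/14)
```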

Alas, it doesn’t indicate if tasks were busy because timers were so active they weren’t yielding to tasks, or if the tasks themselves were just super busy.

My interpretation is the former, as timers have a higher priority than tasks. Timers will yield to tasks, but for only 5% of the cpu (in src/generic/armcm_timer.c see TIMER_DEFER_REPEAT_TICKS / TIMER_REPEAT_TICKS == 5%).

Even with 5% of the cpu given to tasks, the stm32f446 should have been able to complete the step schedule it was requested. So, something seems amiss with the active timers.
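As a rough sanity check on that (a sketch that treats the benchmarked interval=205 as the 3-stepper limit and the 5% task share as the only extra load):

```python
# Rough sanity check: even if 5% of the cpu is handed to tasks, the
# effective minimum interval for 3 active steppers only rises from the
# benchmarked 205 ticks to about 216 ticks - still well under the
# observed i=321 on stepper_y.
BENCH_INTERVAL = 205   # benchmark: 3 steppers active, stm32f446
TASK_SHARE = 0.05      # portion of the cpu yielded to tasks

effective_min = BENCH_INTERVAL / (1 - TASK_SHARE)
print(f"effective minimum interval ~{effective_min:.0f} ticks (observed 321)")
```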

-Kevin

Ok, sorry to say, I can't replicate the issue now. I tried to get the printer back into the failing state by running make clean, make menuconfig, make & flash. I tried it twice; both print jobs completed successfully.

I wrote a macro to move the machine in a 200mm square at 500mm/s with the Z axis in motion as well. The macro produced the spikes you can see in this graph:


Awake time is pretty comparable to the crash case. But I couldn’t make it crash with this test.

When actually printing, the load is back to what I expect: around 2% steady state. This is a print start routine including a bed mesh. Peak awake time is 4%.

Heisenbug? :disappointed_face:

Maybe I failed to make clean after the rebase and some sort of cruft got compiled in?

Well, I guess that’s good news.

Do you know what is causing those “awake time” spikes though? Seems very odd.

-Kevin
