Is Linux still working after commit 913d099

printer: custom
mcu: IPC RT-Linux Debian 12
host: IPC RT-Linux Debian 12
klippy.log: see below

We get errors on an IPC when using Klipper. After some diggings, we found out that there were some big changes in the commit 913d099: linux: Use Unix signals to notify when a timer is pending.
To our understanding, the variable must_wake_timers should be declared as atomic. The use of the thread_fence() function also improve the behavior.
After these changes, the error happens but not so often. We have the impression that there are some other bugs.

Did anyone try the last version on a Raspberry PI?

printer_tsn.cfg (1.2 KB)

INFO:root:Starting Klippy...
WARNING:root:No log file specified! Severe timing issues may result!
INFO:root:Start printer at Mon Mar  6 05:32:26 2023 (1678098746.9 5438.9)
INFO:root:mcu 'mcu': Starting connect
INFO:root:Loaded MCU 'mcu' 111 commands (v0.11.0-123-g6b75f54e-dirty-20230306_045721-debian / gcc: (Debian 12.2.0-14) 12.2.0 binutils: (GNU Binutils for Debian) 2.40)
MCU 'mcu' config: ADC_MAX=4095 CLOCK_FREQ=50000000 MCU=linux PCA9685_MAX=4096 PWM_MAX=32768 STATS_SUMSQ_BASE=256
INFO:root:Sending MCU 'mcu' printer configuration...
INFO:root:Configured MCU 'mcu' (804 moves)
INFO:root:Stats 5439.4: gcodein=0  mcu: mcu_awake=0.000 mcu_task_avg=0.000000 mcu_task_stddev=0.000000 bytes_write=910 bytes_read=4450 bytes_retransmit=0 bytes_invalid=0 send_seq=117 receive_seq=117 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=50012581 sysload=0.99 cputime=0.172 memavail=6377172 print_time=0.000 buffer_time=0.000 print_stall=0
INFO:root:Stats 5440.4: gcodein=0  mcu: mcu_awake=0.000 mcu_task_avg=0.000000 mcu_task_stddev=0.000000 bytes_write=916 bytes_read=4466 bytes_retransmit=0 bytes_invalid=0 send_seq=118 receive_seq=118 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=50004218 sysload=0.99 cputime=0.173 memavail=6377172 print_time=0.000 buffer_time=0.000 print_stall=0
INFO:root:Stats 5441.4: gcodein=0  mcu: mcu_awake=0.001 mcu_task_avg=0.000006 mcu_task_stddev=0.000007 bytes_write=922 bytes_read=4496 bytes_retransmit=0 bytes_invalid=0 send_seq=119 receive_seq=119 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=50002339 sysload=0.99 cputime=0.174 memavail=6377172 print_time=0.000 buffer_time=0.000 print_stall=0
INFO:root:Stats 5442.4: gcodein=0  mcu: mcu_awake=0.001 mcu_task_avg=0.000006 mcu_task_stddev=0.000007 bytes_write=928 bytes_read=4512 bytes_retransmit=0 bytes_invalid=0 send_seq=120 receive_seq=120 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=50001836 sysload=0.99 cputime=0.175 memavail=6377172 print_time=0.000 buffer_time=0.000 print_stall=0
INFO:root:Stats 5443.4: gcodein=0  mcu: mcu_awake=0.001 mcu_task_avg=0.000006 mcu_task_stddev=0.000007 bytes_write=934 bytes_read=4528 bytes_retransmit=0 bytes_invalid=0 send_seq=121 receive_seq=121 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=50001468 sysload=0.99 cputime=0.176 memavail=6376932 print_time=0.000 buffer_time=0.000 print_stall=0
INFO:root:Stats 5444.4: gcodein=7  mcu: mcu_awake=0.001 mcu_task_avg=0.000006 mcu_task_stddev=0.000007 bytes_write=1397 bytes_read=4675 bytes_retransmit=0 bytes_invalid=0 send_seq=137 receive_seq=137 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0 freq=50001084 sysload=1.00 cputime=0.184 memavail=6376436 print_time=8.899 buffer_time=0.000 print_stall=0
INFO:root:MCU 'mcu' shutdown: Stepper too far in past
clocksync state: mcu_freq=50000000 last_clock=424369783 clock_est=(5439.597 187302430 50001084.588) min_half_rtt=0.000018 min_rtt_time=5444.338 time_avg=5439.597(1.804) clock_avg=187302430.244(90189134.398) pred_variance=1569082861.216
Dumping serial stats: bytes_write=1479 bytes_read=4702 bytes_retransmit=0 bytes_invalid=0 send_seq=140 receive_seq=140 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 stalled_bytes=0
Dumping send queue 100 messages
Sent 0 5438.992786 5438.992786 9: seq: 18, identify offset=960 count=40
Sent 1 5438.992867 5438.992867 9: seq: 19, identify offset=1000 count=40
Sent 2 5438.992948 5438.992948 9: seq: 1a, identify offset=1040 count=40
Sent 3 5438.993029 5438.993029 9: seq: 1b, identify offset=1080 count=40
Sent 4 5438.993109 5438.993109 9: seq: 1c, identify offset=1120 count=40
Sent 5 5438.993189 5438.993189 9: seq: 1d, identify offset=1160 count=40
Sent 6 5438.993269 5438.993269 9: seq: 1e, identify offset=1200 count=40
Sent 7 5438.993350 5438.993350 9: seq: 1f, identify offset=1240 count=40
Sent 8 5438.993431 5438.993431 9: seq: 10, identify offset=1280 count=40
Sent 9 5438.993512 5438.993512 9: seq: 11, identify offset=1320 count=40
Sent 10 5438.993592 5438.993592 9: seq: 12, identify offset=1360 count=40
Sent 11 5438.993673 5438.993673 9: seq: 13, identify offset=1400 count=40
Sent 12 5438.993753 5438.993753 9: seq: 14, identify offset=1440 count=40
Sent 13 5438.993832 5438.993832 9: seq: 15, identify offset=1480 count=40
Sent 14 5438.993912 5438.993912 9: seq: 16, identify offset=1520 count=40
Sent 15 5438.994019 5438.994019 9: seq: 17, identify offset=1560 count=40
Sent 16 5438.994103 5438.994103 9: seq: 18, identify offset=1600 count=40
Sent 17 5438.994186 5438.994186 9: seq: 19, identify offset=1640 count=40
Sent 18 5438.994267 5438.994267 9: seq: 1a, identify offset=1680 count=40
Sent 19 5438.994347 5438.994347 9: seq: 1b, identify offset=1720 count=40
Sent 20 5438.994427 5438.994427 9: seq: 1c, identify offset=1760 count=40
Sent 21 5438.994507 5438.994507 9: seq: 1d, identify offset=1800 count=40
Sent 22 5438.994588 5438.994588 9: seq: 1e, identify offset=1840 count=40
Sent 23 5438.995083 5438.995083 9: seq: 1f, identify offset=1880 count=40
Sent 24 5438.995256 5438.995256 9: seq: 10, identify offset=1920 count=40
Sent 25 5438.995357 5438.995357 9: seq: 11, identify offset=1960 count=40
Sent 26 5438.995469 5438.995469 9: seq: 12, identify offset=2000 count=40
Sent 27 5438.995565 5438.995565 9: seq: 13, identify offset=2040 count=40
Sent 28 5438.995655 5438.995655 9: seq: 14, identify offset=2080 count=40
Sent 29 5438.995742 5438.995742 9: seq: 15, identify offset=2120 count=40
Sent 30 5438.995828 5438.995828 9: seq: 16, identify offset=2160 count=40
Sent 31 5438.995913 5438.995913 9: seq: 17, identify offset=2200 count=40
Sent 32 5438.995999 5438.995999 9: seq: 18, identify offset=2240 count=40
Sent 33 5438.996086 5438.996086 9: seq: 19, identify offset=2280 count=40
Sent 34 5438.996202 5438.996202 9: seq: 1a, identify offset=2320 count=40
Sent 35 5438.996336 5438.996336 9: seq: 1b, identify offset=2360 count=40
Sent 36 5438.996429 5438.996429 9: seq: 1c, identify offset=2400 count=40
Sent 37 5438.996556 5438.996556 9: seq: 1d, identify offset=2440 count=40
Sent 38 5438.996680 5438.996680 9: seq: 1e, identify offset=2480 count=40
Sent 39 5438.996804 5438.996804 9: seq: 1f, identify offset=2520 count=40
Sent 40 5438.996929 5438.996929 9: seq: 10, identify offset=2560 count=40
Sent 41 5438.997047 5438.997047 9: seq: 11, identify offset=2600 count=40
Sent 42 5438.997138 5438.997138 9: seq: 12, identify offset=2640 count=40
Sent 43 5438.997223 5438.997223 9: seq: 13, identify offset=2680 count=40
Sent 44 5438.997307 5438.997307 9: seq: 14, identify offset=2720 count=40
Sent 45 5438.997390 5438.997390 9: seq: 15, identify offset=2760 count=40
Sent 46 5438.997478 5438.997478 9: seq: 16, identify offset=2800 count=40
Sent 47 5438.997561 5438.997561 9: seq: 17, identify offset=2840 count=40
Sent 48 5438.997645 5438.997645 9: seq: 18, identify offset=2880 count=40
Sent 49 5438.997729 5438.997729 9: seq: 19, identify offset=2920 count=40
Sent 50 5438.997811 5438.997811 9: seq: 1a, identify offset=2960 count=40
Sent 51 5438.997895 5438.997895 9: seq: 1b, identify offset=3000 count=40
Sent 52 5438.997999 5438.997999 9: seq: 1c, identify offset=3040 count=40
Sent 53 5438.998131 5438.998131 9: seq: 1d, identify offset=3080 count=40
Sent 54 5438.998220 5438.998220 9: seq: 1e, identify offset=3081 count=40
Sent 55 5439.005002 5439.005002 6: seq: 1f, get_uptime
Sent 56 5439.055478 5439.055478 6: seq: 10, get_clock
Sent 57 5439.106425 5439.106425 6: seq: 11, get_clock
Sent 58 5439.157199 5439.157199 6: seq: 12, get_clock
Sent 59 5439.207887 5439.207887 6: seq: 13, get_clock
Sent 60 5439.258668 5439.258668 6: seq: 14, get_clock
Sent 61 5439.309370 5439.309370 6: seq: 15, get_clock
Sent 62 5439.360161 5439.360161 6: seq: 16, get_clock
Sent 63 5439.410785 5439.410785 6: seq: 17, get_clock
Sent 64 5439.411312 5439.411312 7: seq: 18, get_config, get_clock
Sent 65 5439.412888 5439.412888 7: seq: 19, allocate_oids count=12
Sent 66 5439.412933 5439.412933 13: seq: 1a, config_stepper oid=0 step_pin=gpiochip5/gpio0 dir_pin=gpiochip5/gpio1 invert_step=0 step_pulse_ticks=0
Sent 67 5439.412981 5439.412981 12: seq: 1b, config_digital_out oid=9 pin=gpiochip5/gpio2 value=0 default_value=0 max_duration=0
Sent 68 5439.413045 5439.413045 20: seq: 1c, config_endstop oid=1 pin=gpiochip5/gpio3 pull_up=1, config_trsync oid=2, config_stepper oid=3 step_pin=gpiochip6/gpio0 dir_pin=gpiochip6/gpio1 invert_step=0 step_pulse_ticks=0
Sent 69 5439.413082 5439.413082 12: seq: 1d, config_digital_out oid=10 pin=gpiochip6/gpio2 value=0 default_value=0 max_duration=0
Sent 70 5439.413115 5439.413115 10: seq: 1e, config_endstop oid=4 pin=gpiochip6/gpio3 pull_up=1
Sent 71 5439.413155 5439.413155 15: seq: 1f, config_trsync oid=5, config_stepper oid=6 step_pin=gpiochip7/gpio0 dir_pin=gpiochip7/gpio1 invert_step=0 step_pulse_ticks=0
Sent 72 5439.413250 5439.413250 25: seq: 10, config_digital_out oid=11 pin=gpiochip7/gpio2 value=0 default_value=0 max_duration=0, config_endstop oid=7 pin=gpiochip7/gpio3 pull_up=1, config_trsync oid=8, finalize_config crc=706548883
Sent 73 5439.413286 5439.413286 6: seq: 11, get_config
Sent 74 5439.414149 5439.414149 7: seq: 12, stepper_get_position oid=0
Sent 75 5439.414735 5439.414735 7: seq: 13, stepper_get_position oid=3
Sent 76 5439.415258 5439.415258 7: seq: 14, stepper_get_position oid=6
Sent 77 5440.396408 5440.396408 6: seq: 15, get_clock
Sent 78 5441.382025 5441.382025 6: seq: 16, get_clock
Sent 79 5442.367840 5442.367840 6: seq: 17, get_clock
Sent 80 5443.353471 5443.353471 6: seq: 18, get_clock
Sent 81 5443.713516 5443.713516 42: seq: 19, trsync_start oid=5 report_clock=405403501 report_ticks=5000000 expire_reason=2, stepper_stop_on_trigger oid=3 trsync_oid=5, trsync_set_timeout oid=5 clock=417903501, endstop_home oid=4 clock=405403501 sample_ticks=750 sample_count=4 rest_ticks=1250 pin_value=0 trsync_oid=5 trigger_reason=1
Sent 82 5443.810952 5443.810952 13: seq: 1a, queue_digital_out oid=10 clock=405514889 on_ticks=1

Please follow the directions given when you opened the thread here

Does this formatting follow better the directions?

This does not look like a pristine klippy.log. Where does it come from?

Sorry see attachment
klippy.log (78.2 KB)

Is this a dedicated Klipper host or running in a VM?

It is a dedicated host.

Commit 913d0992 was implemented over a year and a half ago - so yes, many many people use the Linux MCU with that code.

From your log: MCU 'mcu' shutdown: Stepper too far in past

The “Linux MCU” system was not designed to control stepper motors.

To support the precise timing requirements of a stepper motor would likely require a “real-time Linux kernel”, locking the process into ram, possibly locking the process to a single cpu core, and possibly additional Klipper code changes. I don’t know of anyone currently working on that.

-Kevin

Isn’t he using a real-time OS? I was looking in the www with “IPC RT-Linux Debian 12”, but didn’t get a hit. But RT smells like real-time OS.

@guillaume What kind of OS are you using?

Kind regards, hcet14

We are using Debian Bookworm and we installed the package linux-image-rt-amd64 with apt.

I agree with @koconnor. We had some good behaviour using cpuset to isolate the CPUs. However there are still sporadic errors. But this goes beyond this thread.

1 Like

Just to be sure, did you run the linux mcu code with the -r parameter to request real-time scheduling?

-Kevin

Yes.

We also tested using with a higher priority for in linux/main.c:
sp.sched_priority = 88;

Okay. Just as reference, there was a similar discussion on this at RPi 3 multiple mcu randomly throws "MCU 'host' shutdown: Stepper too far in past" · Issue #3387 · Klipper3d/klipper · GitHub .

-Kevin

EDIT: Also possibly of interest is: Use Unix signals in linux mcu code to notify when a timer is pending by KevinOConnor · Pull Request #4862 · Klipper3d/klipper · GitHub , the PRs linked from there, and its predecessor Convert linux mcu code to use signal based "irq" handling by KevinOConnor · Pull Request #4190 · Klipper3d/klipper · GitHub .

Just throwing this out but if you want actual real-time support on Linux you need to switch the Klipper process to a realtime scheduler, such as SCHED_DEADLINE.

We had major improvements with following:

  • Isolating the CPU using systemd slices
  • Compile with gnu17
  • set the scheduler priority to 88 in src/linux/main.c
  • declare must_wake_timers as atomic_int
  • verify that the timer is expired directly using timer_gettime() function
  • added some atomic_thread_fence() - certainly too many

We are still working on it…

2 Likes

Reverting 913d099 solved the last issues.

Interesting. It’s a little surprising that signals would cause worse scheduling.

What were the final steps that you needed? What kind of benchmarks are you now obtaining?

-Kevin