Graceful I2C Error Handling for Non-Critical Sensors

The I2C support in Klipper was originally written (by me) to configure a handful of DAC devices to support setting stepper motor current levels. Since then, that basic low-level i2c support has been utilized for a long list of devices - accelerometers, displays, temperature sensors, probes, etc. The original support didn’t do much with error handling because there was no need to do much with error handling (not being able to configure the requested stepper current was cause for a full shutdown). It can also be quite painful to properly handle i2c errors due to the large variety of micro-controller hardware and obtuse rules for their error handling.

Since that time, @nefelim4ag has done a lot of work to improve the i2c error handling. Most errors are now passed up to the host for handling. There’s still work ongoing to further improve that handling (both in the mcu code and in the host), but the general error handling support is much better.

At a high-level, I don’t see an issue with retrying failed attempts for bme280 type temperature devices. Just be aware that it’ll take some time to make those changes, test them, and have confidence in them. In a nutshell, we want to make sure that devices that do have an appropriate error handling system can handle errors, but we also want to make sure that critical devices that fail aren’t silently ignored.

Maybe that helps a little,
-Kevin

2 Likes

Hi Kevin,

First things first, this does help and on a more general note thank you for all your work on Klipper. I’m sure you get that a bunch, but seriously, this is an amazing piece of software/firmware.

I have been following these PRs (thank you @nefelim4ag!) and I’m very excited about that better error handling that’s been worked on and the direction things are moving.

I do want to be clear that my intent in creating this thread or the ensuing discussion was not to fault anyone for how I2C error handling works in Klipper - far from it. I had thought that based on the host handling errors now that I could handle exceptions directly but couldn’t figure out why I kept still getting shutdowns and wanted to understand. Timofey helped me figure out what’s going on there.

About the history of the code - thanks for that and got it - I wasn’t aware that DACs controlling motors were the first set of devices implemented and I can see why you went with the auto-shutdown approach based on that. An error in setting motor current could be considered critical and could do real harm. I don’t want to dismiss that as a concern and am not advocating for unwinding protections for critical devices.

I think we agree about the bme280-style devices and I absolutely recognize this is not a weekend thing. I am happy to test what I can to try and validate any solutions and fully expect this is a “decent chunk of ‘26” thing. Good code beats quick code for a shipping product IMO.

An idea I had been kicking around was if there was some way (and desire) to add a boolean config option for “non_critical” or similar. False by default, only exposed on specific device types that could be potentially diagnostic/informational and/or don’t control motion/heaters/etc (think accelerometers, sensors, displays, etc). If set to true, allow for a (admittedly more complicated) different error handling mode where NACKs result in either a retry beyond the 5x already present, skipped measurement or a soft reset if supported; followed by a degrade in place if not. If non_critical is false, auto-shutdown it is. This could reduce the burden of writing error handlers for each device type because you could exclude known critical devices and add them later if a non-critical application is found.

I still don’t know if START NACKs necessarily indicate failure so much as noise/device not ready, but I admit I haven’t scrubbed all the code to figure out what breaks if they happen and aren’t caught. At least for the SGP40/BME280, the nevermore controller project (which runs on a Pi Pico W) seems to drop failed measurements and retry 1s later without adverse effect.

Thanks again,

Nick

Hi!

For fun I decided to try removing the auto-shutdown from bus.py on an I2C NACK to see what would happen with my machine. I only have BME280 and SGP40s on I2C, everything else is using UART or another interface.

Interestingly, throughout an 06h 26m 03s print where the 2x BME280s and 2x SGP40s get sampled every 1s (at least 138,000 I2C reads/writes), I encountered 6x START_READ_NACKs or START_NACKs with no adverse effects. The errors appeared transient and no problems encountered.

I did hit 1x genuine NACK from the SGP40 when writing to the device and following this, the next read from the BME280 fails with a BUS_TIMEOUT. Something broke here, but I don’t know what exactly. At any rate, 1s later the SGP40 recovered from this (all future reads/writes succeed), but the BME280 degrades in place and returns self.reactor.NEVER, so it stops updating.

Obviously more testing is needed, but my initial smoke test seems to indicate that START NACKs in this context are benign and while there’s more to see with the true NACK + BUS_TIMEOUT, it also appeared to be a transient error.

Thinking on it a bit more, if I had to hypothesize:

  • The MCU issues an I2C write to the SGP40 with some data
  • The write either gets corrupted or the device isn’t ready - unclear what exactly happens - and we get a NACK
  • The next read to the BME280 doesn’t get a START_READ_NACK, it gets a BUS_TIMEOUT - something on the bus is hung.
  • 1s later, the SPG40 code gets back up and running and the next SGP40 write/read cycle (and all afterwards) succeed.

To me, this chain of events indicates:

  • whatever the error was, it was transient (at least for the SGP40).
  • either we had a bus conflict, or the bus itself was locked up somehow, but not fatally.

Relevant Log Section:

Stats 96814.4: gcodein=0 canstat_mcu: bus_state=active rx_error=48 tx_error=0 tx_retries=0 mcu: mcu_awake=0.018 mcu_task_avg=0.000005 mcu_task_stddev=0.000006 bytes_write=43686427 bytes_read=14097910 bytes_retransmit=0 bytes_invalid=0 send_seq=949739 receive_seq=949739 retransmit_seq=0 srtt=0.002 rttvar=0.001 rto=0.025 ready_bytes=22 upcoming_bytes=0 freq=180000227 rpi: mcu_awake=0.000 mcu_task_avg=0.000008 mcu_task_stddev=0.000010 bytes_write=155020 bytes_read=479651 bytes_retransmit=0 bytes_invalid=0 send_seq=25801 receive_seq=25801 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=49999572 adj=49999447 canstat_toolhead: bus_state=active rx_error=3 tx_error=0 tx_retries=4354903 toolhead: mcu_awake=0.008 mcu_task_avg=0.000006 mcu_task_stddev=0.000011 bytes_write=11869601 bytes_read=5878691 bytes_retransmit=0 bytes_invalid=0 send_seq=279058 receive_seq=279058 retransmit_seq=0 srtt=0.006 rttvar=0.004 rto=0.025 ready_bytes=53 upcoming_bytes=0 freq=12000204 adj=12000135 canstat_stealthmax: bus_state=active rx_error=3 tx_error=0 tx_retries=0 stealthmax: mcu_awake=0.027 mcu_task_avg=0.000123 mcu_task_stddev=0.000387 bytes_write=2148631 bytes_read=4243846 bytes_retransmit=2917 bytes_invalid=0 send_seq=199052 receive_seq=199052 retransmit_seq=198952 srtt=0.001 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=63998657 adj=63998409 sd_pos=11732019 heater_bed: target=110 temp=110.0 pwm=0.457 Raspberry_Pi: temp=38.5 mcu: temp=30.0 toolhead_mcu: temp=82.6 Chamber: temp=64.6 BME_OUT: temp=54.6 BME_IN: temp=55.5 SGP_OUT: temp=477.0 SGP_IN: temp=480.0 print_time=25281.331 buffer_time=1.579 print_stall=0 extruder: target=250 temp=250.1 pwm=0.542 sysload=0.26 cputime=7176.384 memavail=205628

Stats 96815.4: gcodein=0 canstat_mcu: bus_state=active rx_error=48 tx_error=0 tx_retries=0 mcu: mcu_awake=0.018 mcu_task_avg=0.000005 mcu_task_stddev=0.000006 bytes_write=43691108 bytes_read=14098736 bytes_retransmit=0 bytes_invalid=0 send_seq=949826 receive_seq=949826 retransmit_seq=0 srtt=0.002 rttvar=0.001 rto=0.025 ready_bytes=15 upcoming_bytes=0 freq=180000315 rpi: mcu_awake=0.000 mcu_task_avg=0.000008 mcu_task_stddev=0.000010 bytes_write=155026 bytes_read=479680 bytes_retransmit=0 bytes_invalid=0 send_seq=25802 receive_seq=25802 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=49999571 adj=49999439 canstat_toolhead: bus_state=active rx_error=3 tx_error=0 tx_retries=4355314 toolhead: mcu_awake=0.008 mcu_task_avg=0.000006 mcu_task_stddev=0.000011 bytes_write=11870643 bytes_read=5878953 bytes_retransmit=0 bytes_invalid=0 send_seq=279077 receive_seq=279077 retransmit_seq=0 srtt=0.005 rttvar=0.003 rto=0.025 ready_bytes=22 upcoming_bytes=0 freq=12000201 adj=12000151 canstat_stealthmax: bus_state=active rx_error=3 tx_error=0 tx_retries=0 stealthmax: mcu_awake=0.027 mcu_task_avg=0.000123 mcu_task_stddev=0.000387 bytes_write=2148715 bytes_read=4244006 bytes_retransmit=2917 bytes_invalid=0 send_seq=199060 receive_seq=199060 retransmit_seq=198952 srtt=0.003 rttvar=0.003 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=63998659 adj=63998470 sd_pos=11733178 heater_bed: target=110 temp=110.0 pwm=0.427 Raspberry_Pi: temp=38.5 mcu: temp=30.2 toolhead_mcu: temp=82.7 Chamber: temp=64.6 BME_OUT: temp=54.6 BME_IN: temp=55.5 SGP_OUT: temp=477.0 SGP_IN: temp=480.0 print_time=25281.868 buffer_time=1.116 print_stall=0 extruder: target=250 temp=250.3 pwm=0.486 sysload=0.26 cputime=7176.478 memavail=205628

Stats 96816.4: gcodein=0 canstat_mcu: bus_state=active rx_error=48 tx_error=0 tx_retries=0 mcu: mcu_awake=0.018 mcu_task_avg=0.000005 mcu_task_stddev=0.000006 bytes_write=43693767 bytes_read=14099444 bytes_retransmit=0 bytes_invalid=0 send_seq=949880 receive_seq=949880 retransmit_seq=0 srtt=0.002 rttvar=0.001 rto=0.025 ready_bytes=49 upcoming_bytes=0 freq=180000290 rpi: mcu_awake=0.000 mcu_task_avg=0.000008 mcu_task_stddev=0.000010 bytes_write=155032 bytes_read=479696 bytes_retransmit=0 bytes_invalid=0 send_seq=25803 receive_seq=25803 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=49999570 adj=49999051 canstat_toolhead: bus_state=active rx_error=3 tx_error=0 tx_retries=4355650 toolhead: mcu_awake=0.008 mcu_task_avg=0.000006 mcu_task_stddev=0.000011 bytes_write=11871756 bytes_read=5879225 bytes_retransmit=0 bytes_invalid=0 send_seq=279098 receive_seq=279098 retransmit_seq=0 srtt=0.005 rttvar=0.003 rto=0.025 ready_bytes=7 upcoming_bytes=0 freq=12000201 adj=12000020 canstat_stealthmax: bus_state=active rx_error=3 tx_error=0 tx_retries=0 stealthmax: mcu_awake=0.027 mcu_task_avg=0.000123 mcu_task_stddev=0.000387 bytes_write=2148794 bytes_read=4244161 bytes_retransmit=2917 bytes_invalid=0 send_seq=199067 receive_seq=199067 retransmit_seq=198952 srtt=0.003 rttvar=0.003 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=63998659 adj=63998033 sd_pos=11735019 heater_bed: target=110 temp=109.9 pwm=0.427 Raspberry_Pi: temp=39.4 mcu: temp=30.2 toolhead_mcu: temp=82.6 Chamber: temp=64.7 BME_OUT: temp=54.6 BME_IN: temp=55.5 SGP_OUT: temp=477.0 SGP_IN: temp=480.0 print_time=25282.897 buffer_time=1.144 print_stall=0 extruder: target=250 temp=250.0 pwm=0.637 sysload=0.26 cputime=7176.589 memavail=205628

SGP40: Error writing data

Traceback (most recent call last):

  File "/home/nick/klipper/klippy/extras/sgp40/__init__.py", line 270, in _handle_step

    self.i2c.i2c_write(cmd)

  File "/home/nick/klipper/klippy/extras/bus.py", line 262, in i2c_write

    self.i2c_transfer(data, minclock=minclock, reqclock=reqclock,

  File "/home/nick/klipper/klippy/extras/bus.py", line 291, in i2c_transfer

    raise mcu.error(err_msg)

mcu.error: MCU 'stealthmax' I2C request to addr 89 reports error NACK

Stats 96817.4: gcodein=0 canstat_mcu: bus_state=active rx_error=48 tx_error=0 tx_retries=0 mcu: mcu_awake=0.018 mcu_task_avg=0.000005 mcu_task_stddev=0.000006 bytes_write=43696830 bytes_read=14100146 bytes_retransmit=0 bytes_invalid=0 send_seq=949939 receive_seq=949939 retransmit_seq=0 srtt=0.002 rttvar=0.001 rto=0.025 ready_bytes=34 upcoming_bytes=0 freq=180000387 rpi: mcu_awake=0.000 mcu_task_avg=0.000008 mcu_task_stddev=0.000010 bytes_write=155038 bytes_read=479712 bytes_retransmit=0 bytes_invalid=0 send_seq=25804 receive_seq=25804 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=49999569 adj=49999254 canstat_toolhead: bus_state=active rx_error=3 tx_error=0 tx_retries=4355990 toolhead: mcu_awake=0.008 mcu_task_avg=0.000006 mcu_task_stddev=0.000011 bytes_write=11872937 bytes_read=5879525 bytes_retransmit=0 bytes_invalid=0 send_seq=279119 receive_seq=279119 retransmit_seq=0 srtt=0.005 rttvar=0.003 rto=0.025 ready_bytes=43 upcoming_bytes=0 freq=12000197 adj=12000085 canstat_stealthmax: bus_state=active rx_error=3 tx_error=0 tx_retries=0 stealthmax: mcu_awake=0.027 mcu_task_avg=0.000123 mcu_task_stddev=0.000387 bytes_write=2148885 bytes_read=4244316 bytes_retransmit=2917 bytes_invalid=0 send_seq=199074 receive_seq=199074 retransmit_seq=198952 srtt=0.003 rttvar=0.004 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=63998656 adj=63998303 sd_pos=11736982 heater_bed: target=110 temp=109.9 pwm=0.442 Raspberry_Pi: temp=38.9 mcu: temp=30.1 toolhead_mcu: temp=82.6 Chamber: temp=64.7 BME_OUT: temp=54.6 BME_IN: temp=55.5 SGP_OUT: temp=477.0 SGP_IN: temp=480.0 print_time=25284.011 buffer_time=1.258 print_stall=0 extruder: target=250 temp=249.9 pwm=0.648 sysload=0.26 cputime=7176.718 memavail=205428

BME280: Error reading data

Traceback (most recent call last):

  File "/home/nick/klipper/klippy/extras/bme280.py", line 415, in _sample_bme280

    data = self.read_register('PRESSURE_MSB', 8)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/nick/klipper/klippy/extras/bme280.py", line 771, in read_register

    params = self.i2c.i2c_read(regs, read_len)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/nick/klipper/klippy/extras/bus.py", line 271, in i2c_read

    return self.i2c_transfer(write, read_len, retry=retry)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/nick/klipper/klippy/extras/bus.py", line 291, in i2c_transfer

    raise mcu.error(err_msg)

mcu.error: MCU 'stealthmax' I2C request to addr 119 reports error BUS_TIMEOUT

toolhead: max_velocity: 300.000000

max_accel: 5000.000000

minimum_cruise_ratio: 0.500000

square_corner_velocity: 5.000000

Stats 96818.4: gcodein=0 canstat_mcu: bus_state=active rx_error=48 tx_error=0 tx_retries=0 mcu: mcu_awake=0.018 mcu_task_avg=0.000005 mcu_task_stddev=0.000006 bytes_write=43700253 bytes_read=14100862 bytes_retransmit=0 bytes_invalid=0 send_seq=950004 receive_seq=950004 retransmit_seq=0 srtt=0.002 rttvar=0.001 rto=0.025 ready_bytes=51 upcoming_bytes=0 freq=180000356 rpi: mcu_awake=0.000 mcu_task_avg=0.000008 mcu_task_stddev=0.000010 bytes_write=155044 bytes_read=479728 bytes_retransmit=0 bytes_invalid=0 send_seq=25805 receive_seq=25805 retransmit_seq=0 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=49999568 adj=49998857 canstat_toolhead: bus_state=active rx_error=3 tx_error=0 tx_retries=4356439 toolhead: mcu_awake=0.007 mcu_task_avg=0.000006 mcu_task_stddev=0.000011 bytes_write=11874024 bytes_read=5879811 bytes_retransmit=0 bytes_invalid=0 send_seq=279140 receive_seq=279140 retransmit_seq=0 srtt=0.005 rttvar=0.004 rto=0.025 ready_bytes=28 upcoming_bytes=0 freq=12000200 adj=11999945 canstat_stealthmax: bus_state=active rx_error=3 tx_error=0 tx_retries=0 stealthmax: mcu_awake=0.031 mcu_task_avg=0.000138 mcu_task_stddev=0.000525 bytes_write=2148965 bytes_read=4244493 bytes_retransmit=2917 bytes_invalid=0 send_seq=199081 receive_seq=199081 retransmit_seq=198952 srtt=0.002 rttvar=0.002 rto=0.025 ready_bytes=0 upcoming_bytes=0 freq=63998656 adj=63997762 sd_pos=11739227 heater_bed: target=110 temp=109.9 pwm=0.442 Raspberry_Pi: temp=38.9 mcu: temp=30.1 toolhead_mcu: temp=82.7 Chamber: temp=64.7 BME_OUT: temp=54.6 BME_IN: temp=55.5 SGP_OUT: temp=477.0 SGP_IN: temp=480.0 print_time=25284.803 buffer_time=1.049 print_stall=0 extruder: target=250 temp=249.9 pwm=0.638 sysload=0.26 cputime=7176.834 memavail=205444