The I2C support in Klipper was originally written (by me) to configure a handful of DAC devices to support setting stepper motor current levels. Since then, that basic low-level I2C support has been utilized for a long list of devices - accelerometers, displays, temperature sensors, probes, etc. The original support didn't do much with error handling because there was no need to do much with error handling (not being able to configure the requested stepper current was cause for a full shutdown). It can also be quite painful to properly handle I2C errors due to the large variety of micro-controller hardware and the obtuse rules for their error handling.
Since that time, @nefelim4ag has done a lot of work to improve the I2C error handling. Most errors are now passed up to the host for handling. There's still work ongoing to further improve that handling (both in the mcu code and in the host), but the general error handling support is much better.
At a high level, I don't see an issue with retrying failed attempts for BME280-type temperature devices. Just be aware that it'll take some time to make those changes, test them, and have confidence in them. In a nutshell, we want to make sure that devices that do have an appropriate error handling system can handle errors, but we also want to make sure that critical devices that fail aren't silently ignored.
First things first, this does help and on a more general note thank you for all your work on Klipper. I’m sure you get that a bunch, but seriously, this is an amazing piece of software/firmware.
I have been following these PRs (thank you @nefelim4ag!) and I'm very excited about the improved error handling being worked on and the direction things are moving.
I do want to be clear that my intent in creating this thread and the ensuing discussion was not to fault anyone for how I2C error handling works in Klipper - far from it. I had thought that, with the host now handling errors, I could catch exceptions directly, but couldn't figure out why I was still getting shutdowns and wanted to understand. Timofey helped me figure out what's going on there.
About the history of the code - thanks for that, and got it. I wasn't aware that DACs controlling motors were the first set of devices implemented, and I can see why you went with the auto-shutdown approach based on that. An error in setting motor current could be considered critical and could do real harm. I don't want to dismiss that concern and am not advocating for unwinding protections for critical devices.
I think we agree about the BME280-style devices and I absolutely recognize this is not a weekend thing. I am happy to test what I can to try and validate any solutions and fully expect this is a "decent chunk of '26" thing. Good code beats quick code for a shipping product IMO.
An idea I had been kicking around: is there some way (and desire) to add a boolean config option for "non_critical" or similar? It would default to false, and would only be exposed on specific device types that are potentially diagnostic/informational and/or don't control motion/heaters/etc. (think accelerometers, sensors, displays). If set to true, it would enable an (admittedly more complicated) error handling mode where NACKs result in a retry beyond the 5x already present, a skipped measurement, or a soft reset if supported, followed by degrading in place if not. If non_critical is false, auto-shutdown it is. This could reduce the burden of writing error handlers for each device type, because known critical devices could be excluded up front and added later if a non-critical application is found.
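To make the idea concrete, here's a rough sketch - not actual Klipper code. The class, the read_fn callable, and the use of IOError as a stand-in for an I2C NACK are all illustrative; the part that matters is the decision after the retries are exhausted: a non-critical device drops the sample and carries on, while a critical one re-raises so the existing shutdown path still fires.

```python
# Illustrative sketch of the proposed "non_critical" option - NOT real
# Klipper code. NonCriticalI2C, read_fn, and IOError-as-NACK are all
# stand-ins for the actual host-side I2C plumbing.

class NonCriticalI2C:
    MAX_RETRIES = 5  # mirrors the 5x retry already present

    def __init__(self, read_fn, non_critical=False):
        self.read_fn = read_fn            # underlying I2C read callable
        self.non_critical = non_critical  # the proposed config option

    def read(self, reg, length):
        last_err = None
        for _ in range(self.MAX_RETRIES):
            try:
                return self.read_fn(reg, length)
            except IOError as e:          # stand-in for an I2C NACK
                last_err = e
        if self.non_critical:
            return None                   # skip this measurement; try next cycle
        raise last_err                    # critical device: shut down as today
```

The appeal (to me) is that the per-device modules wouldn't each need bespoke error handlers; they'd just need to tolerate a skipped sample.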
I still don’t know if START NACKs necessarily indicate failure so much as noise/device not ready, but I admit I haven’t scrubbed all the code to figure out what breaks if they happen and aren’t caught. At least for the SGP40/BME280, the nevermore controller project (which runs on a Pi Pico W) seems to drop failed measurements and retry 1s later without adverse effect.
For fun I decided to try removing the auto-shutdown from bus.py on an I2C NACK to see what would happen with my machine. I only have BME280 and SGP40s on I2C, everything else is using UART or another interface.
Interestingly, throughout a 6h 26m 03s print where the 2x BME280s and 2x SGP40s get sampled every 1s (at least 138,000 I2C reads/writes), I encountered 6x START_READ_NACKs or START_NACKs. The errors appeared transient and caused no problems.
I did hit 1x genuine NACK from the SGP40 when writing to the device, and following this, the next read from the BME280 failed with a BUS_TIMEOUT. Something broke here, but I don't know what exactly. At any rate, 1s later the SGP40 recovered (all future reads/writes succeeded), but the BME280 degraded in place - its sample timer returns self.reactor.NEVER, so it stops updating.
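To show why that one failure leaves the BME280 frozen, here's a toy model (not the real sensor code) of a Klipper-style sample timer: the callback returns the next time it should run, and returning NEVER means the reactor never calls it again. A retry-in-1s policy recovers like the SGP40 did; returning NEVER degrades in place like the BME280 did.

```python
# Toy model of the two behaviors I observed - not the actual sensor code.
# A Klipper-style timer callback returns its next wake time; returning
# NEVER stops all future samples.

NEVER = float('inf')  # stand-in for self.reactor.NEVER

def sample_timer(eventtime, read_fn, retry_on_error):
    try:
        read_fn()
    except IOError:                     # stand-in for a failed I2C transaction
        if retry_on_error:
            return eventtime + 1.0      # SGP40-style: just try again in 1s
        return NEVER                    # BME280-style: degrade in place
    return eventtime + 1.0              # normal 1s sample cadence
```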
Obviously more testing is needed, but my initial smoke test seems to indicate that START NACKs in this context are benign, and while there's more to investigate with the true NACK + BUS_TIMEOUT case, it also appeared to be transient.