Graceful I2C Error Handling for Non-Critical Sensors

Hi Klipper devs,

Been really happy with my machine recently, but I’ve been trying to solve a particularly annoying issue with intermittent I2C errors killing prints. I am running a Nevermore StealthMax which uses a pair of BME280+SGP40 sensors.

There’s currently PR #6738 and a separate klipper module for SGP40 sensors, klipper-sgp40. That module has largely tracked the changes introduced in Klipper over the last year and a new PR is being worked on to try and handle I2C errors without crashing. I’ve been testing the code on my printer by manually connecting and disconnecting the wires to an SGP40 during runtime.

From what I can tell, with some changes I mentioned in the PR, the code correctly catches certain exceptions raised by failed I2C reads. However, the exception serialhdl.error: Unable to obtain 'i2c_response' response doesn’t get caught, even with a blanket try...except Exception as e: wrapper around the I2C measurement code in the __init__.py file.

Aside from hacking i2ccmds.c’s i2c_shutdown_on_err, I can’t seem to figure out how to catch this error and discard the failed measurement. In my case, these sensors are non-critical. If they drop, they only affect my satisfaction knowing how much air pollution my printer is spewing out.

Would anyone with deeper knowledge of the serialhdl and i2c_transfer functions be able to help point me in the right direction? Happy to dig deeper in the code, but multi-threaded python code is not my forte and I’m having trouble following the function calls.

Right now, host side I2C does directly invoke shutdown: klipper/klippy/extras/bus.py at master · Klipper3d/klipper · GitHub

So, you can’t intercept it.

From my PoV, it should crash in this case.
A better test is to pull SDA low to GND with tweezers; that is a fault at the I2C level.
A disconnected sensor, by contrast, is undefined behavior (UB) in general.
For example, a BME280/SHT31 has to be initialized and then holds state.

Normally, I think the existing I2C error/state reporting should be enough to work out what went wrong. In general, my guess is weak pull-ups on long wires.

MCU side shutdown will happen for the sensors that work with I2C directly from the MCU, and do not have appropriate I2C error handling.
MPU series, for example, will trigger the shutdown.
LDC1612 - will not.

Otherwise, the host-side I2C IO should not be able to trigger MCU shutdown after the: I2C klippy's side error handling by nefelim4ag · Pull Request #7013 · Klipper3d/klipper · GitHub

Hope that helps,
-Timofey


Ref SGP40 topic: Support for SGP40 Sensors


Ah, and there is no “threading” mostly.
From a sensor perspective, your code is linear.

The only other thread is real IO, where/when your code is blocked upon IO.

So, you can interpret code as if it were single-threaded.

You can check: Code overview - Klipper documentation
And see the real threads with names inside the htop, for example.


Ah, and this one sounds like a communication issue with MCU
I’ve heard people use Bluetooth now to connect the nevermore

Thanks for the quick reply!

Got it, that would explain a lot. So if status doesn’t come back up as a success we fail out automatically before we can return to the caller.

Agreed, this was a quick and dirty hack to see what would happen without having to probe up the PCB. I happen to have easy access to the DuPont connector so it made it preferable.

I also think it could be weak pull ups, but I need to check the total resistance on the line to see.

This is interesting. I didn’t think I was getting MCU side shutdown, but admittedly I hadn’t dug super deep into it.

Yeah, that’s one mechanism. The nevermore-controller project is what uses Bluetooth, but it relies on a Pi Pico W. isik’s Tech in the US makes a Nevermore PCB that relies on an STM32 microcontroller and runs native klipper over CAN. I didn’t think I was losing comms to the MCU but admittedly I didn’t look too deeply here.

Thanks so much for talking through this. I hadn’t realized that a lot of the code was single-threaded so that makes things way easier to parse. I’ll check out that resource you linked as well!

Okay so digging more into it - I’m seeing that the code is now modeled on how the BME280 does measurements in principle. In this case, if we get an I2C NACK, I personally don’t care since this sensor is non-critical. Today, however, an I2C NACK is a fatal error if I understand your last message and bus.py correctly.

The behavior we’re going for in the Nevermore community (so far as I’m aware), is:

  • During init, we absolutely cannot tolerate I2C NACKs without retry since we have to set up the sensor (similar to the BMExxx series of chips, I think).

  • During operation, if we miss a single measurement because of an I2C NACK (bus stuck high, measurement not ready, etc), this is OK because we are non-critical (only control a filtering fan).

  • My preference from here is that if we have a true failure (i.e. repeated I2C NACKs/errors) then we should degrade in place (this appears to be what the BME280 code does where it sets its data output to 0, then effectively stops all future measurements):
        # In normal mode data shadowing is performed
        # So reading can be done while measurements are in process
        try:
            if self.chip_type == 'BME280':
                data = self.read_register('PRESSURE_MSB', 8)
            elif self.chip_type == 'BMP280':
                data = self.read_register('PRESSURE_MSB', 6)
            else:
                return self.reactor.NEVER
        except Exception:
            logging.exception("BME280: Error reading data")
            self.temp = self.pressure = self.humidity = .0
            return self.reactor.NEVER
    

Is there some middle ground exception state where we can flag NACKs as non-fatal? Would this be of interest to the devs? It would need to be targeted, but for specific non-critical sensors would prevent a lot of wasted plastic.
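To make the runtime part of that list concrete, here is a minimal sketch of "skip a single miss, degrade in place after repeated failures". It is not Klipper code: `TolerantSampler`, `read_fn` and the threshold of 3 are hypothetical names and values chosen for illustration, and a real module would return `self.reactor.NEVER` (as the BME280 snippet above does) where this sketch returns `False`.

```python
import logging


class TolerantSampler:
    """Hypothetical helper, not part of Klipper: tolerate isolated read
    failures, stop sampling after too many failures in a row."""

    def __init__(self, read_fn, max_consecutive_failures=3):
        self.read_fn = read_fn          # callable: returns a value or raises
        self.max_failures = max_consecutive_failures
        self.failures = 0               # consecutive failures so far
        self.value = None               # last good measurement
        self.degraded = False           # True once we have given up

    def sample(self):
        """Return True if sampling should continue, False once degraded."""
        if self.degraded:
            return False
        try:
            self.value = self.read_fn()
            self.failures = 0           # any success resets the counter
        except Exception:
            self.failures += 1
            logging.warning("sensor read failed (%d/%d)",
                            self.failures, self.max_failures)
            if self.failures >= self.max_failures:
                # Degrade in place: clear the output and stop polling
                self.value = None
                self.degraded = True
                return False
        return True


def make_scripted_reader(results):
    """Test double: yields values, raises any Exception instances."""
    it = iter(results)

    def read():
        item = next(it)
        if isinstance(item, Exception):
            raise item
        return item
    return read


reads = [21.5, IOError("NACK"), 22.0,
         IOError("NACK"), IOError("NACK"), IOError("NACK")]
sampler = TolerantSampler(make_scripted_reader(reads))
history = [sampler.sample() for _ in reads]
print(history, sampler.degraded)
```

The single failure between two good reads is skipped, while three failures in a row stop the sampler without raising anything that could reach a shutdown path.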

In short, there was a plan to retry NACKs on the host side: I2C klippy's side error handling by nefelim4ag · Pull Request #7013 · Klipper3d/klipper · GitHub.
Where some transactional device-specific quirks can be implemented like so:
klipper/klippy/extras/sht3x.py at master · Klipper3d/klipper · GitHub

Probably, in the foreseeable future, it will be enabled.

Then, if you do personally experience NACKs, consider this:

  • If sensor-specific support is correct, and there are weak pull-ups, then one can reduce the I2C rate.
  • One can use SW I2C to workaround HW I2C issues or get an arbitrary I2C rate.

Often there is not a big difference between HW and SW I2C, as long as they run at the same speed/rate. SW I2C can be more predictable, which makes it easier to test a sensor-specific implementation.
The only performance differences are that HW I2C continues the transfer while the firmware is inside an interrupt context, and that the MCU may be too weak to provide a 400k rate. Otherwise the time spent in the transfer, and so the load, is roughly the same.

If one suspects that you can’t read from the device too often (Ex: sht3x: reads should be retried with at least 0.5s pause · Klipper3d/klipper@2585acc · GitHub), one can try to reproduce that and read often.
As far as I’m aware, BME280 does not have such a quirk.

If the device returns NACK after specific valid sequences (above), and one can reproduce that, it is a bug in implementation.

Hope that helps,
-Timofey

Thanks so much for your reply Timofey!

Really looking forward to this, I had seen that that PR was implementing phase 2 of the 3 phase plan you and Kevin were discussing, so I know there’s more going on.

This makes sense. If the device is NACKing us either it’s not ready or there was an error (depending on the circumstances).


In my case, however, I’m getting NACKs that appear to be line noise related (not speed related) as I’m running at 100K I2C and no other work is being done by this MCU besides providing PWM/tach readings for a fan.

Taking the SGP40 as an example because I’ve been staring at it so long…Whether for a read or a write, we start with our address byte:

In this case, after the address byte is sent, I think generally an I2C_BUS_START_NACK or I2C_BUS_START_READ_NACK can be disregarded and retried. Many things could have caused it:

  • Device is busy and unable to respond to the request
  • Line noise or bus conflict on SDA could have flipped a bit causing the device to not get its address properly (so it doesn’t know it needs to ACK)
  • Line noise or bus conflict caused the ACK bit to flip.

Either way, the device isn’t mis-configured or being instructed to do something harmful with this instruction, so this is something where I’d feel fine retrying it until something else needs the bus. If it can’t be recovered, degrade in place and flag a warning, but don’t kill the print since nothing bad has necessarily happened - the device is just “missing”.
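A minimal sketch of that retry policy, for discussion only. The status names are the ones quoted in this thread; the enum itself, `I2CTransferError`, `transfer_with_retry` and the `transfer_fn` callable are hypothetical stand-ins and not Klipper's API, and the retry count of 5 is arbitrary.

```python
import enum


class I2CStatus(enum.Enum):
    # Illustrative enum; only the names in the strings come from this thread.
    OK = "OK"
    NACK = "I2C_BUS_NACK"
    TIMEOUT = "I2C_BUS_TIMEOUT"
    START_NACK = "I2C_BUS_START_NACK"
    START_READ_NACK = "I2C_BUS_START_READ_NACK"


# Address-phase NACKs: the device never accepted a command, so retrying
# cannot leave it in a half-configured state.
RETRYABLE = {I2CStatus.START_NACK, I2CStatus.START_READ_NACK}


class I2CTransferError(Exception):
    """Raised to the calling module instead of shutting down."""

    def __init__(self, status):
        super().__init__(status.value)
        self.status = status


def transfer_with_retry(transfer_fn, retries=5):
    """Call transfer_fn() -> (status, data); retry only address-phase NACKs."""
    last = None
    for _ in range(1 + retries):
        status, data = transfer_fn()
        if status is I2CStatus.OK:
            return data
        last = status
        if status not in RETRYABLE:
            break                       # data NACK or timeout: do not retry
    raise I2CTransferError(last)


def scripted(responses):
    """Test double that replays a fixed list of (status, data) tuples."""
    it = iter(responses)
    return lambda: next(it)


# Two noisy attempts, then success
data = transfer_with_retry(scripted([
    (I2CStatus.START_NACK, None),
    (I2CStatus.START_NACK, None),
    (I2CStatus.OK, b"\x12\x34"),
]))

# A timeout is surfaced immediately so the module can decide what to do
try:
    transfer_with_retry(scripted([(I2CStatus.TIMEOUT, None)]))
    timeout_status = None
except I2CTransferError as exc:
    timeout_status = exc.status

print(data, timeout_status)
```

The point of the split is that the caller, not the bus layer, decides whether a persistent failure means "degrade and warn" or "stop the print".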


In the case of a I2C_BUS_NACK (which I assume are the NACKs back from the device), that’s more complicated. The below is the end of the I2C write command to start a measurement from the SGP40:

Depending on the device here, we could have NACKs for a bunch of reasons:

  • Attempting to write to a non-existent register
  • Incorrect command bytes (either from line noise or mistaken code)
  • Bad CRC from line noise, etc.
  • Sending a command that the device isn’t ready for (because it’s processing another command).

These require more care, but don’t necessarily indicate a HW failure, in my opinion. Raising an exception for the host to handle would allow for a range of solutions (retries beyond the 5x already in the code, reset commands, etc) before we consider the device “dead”.


The I2C_BUS_TIMEOUT is one that I’m not sure how to handle. Is this a stuck bus like during a read? Or the I2C hardware is locked up on the MCU itself?

Thanks so much,

Nick

Most of the time, yes, HW is stuck for whatever reason.
So, it is necessary to spend time reproducing that and dump the HW status registers/check datasheets.

For example: stm32: f0 i2c clean nackcf interrupt on handle by nefelim4ag · Pull Request #7108 · Klipper3d/klipper · GitHub

-Timofey

Got it, that makes sense. And that’s something I would definitely flag an error on and maybe kill the print for (since we don’t necessarily know what else is on that bus in a generic sense).

For the I2C_BUS_START_NACK and I2C_BUS_START_READ_NACK I can’t think of a scenario where we could actually cause harm severe enough to warrant killing a print universally vs raising an exception and letting the individual module code handle what happens next.

Today, a bad EMI environment and long wires can create a scenario where we flip some bits, but in the address specifically I think this is potentially acceptable depending on the module, right?

I’ve been watching this thread and waiting to come in and ask about how the wiring was done for the two BME280 sensors because it seems that you have a hardware problem that you’re trying to address with a software solution.

I2C is really designed to connect devices on the same PCB. It is very noise tolerant for devices connected within 30cm or so on a reasonably designed PCB. If you’re connecting the BME280 sensors with a long pair of wires, you’re gonna have problems because you are using the devices out of specification.

The ideal situation would be a PCB with an MCU that is local to the sensors and can connect to your host using USB, CAN or serial. This way you avoid the long wires that are going to add to the line capacitance as well as become more susceptible to motor and AC noise.

Can I suggest that you get an Arduino, load it with the Klipper firmware and connect it to your host using USB? Remember to make sure that you go into “Optional features” in make menuconfig to only use I2C along with any other features that will make things simpler for you to minimize the .bin file size so it will fit in the AVR.

You could also use a Toolhead Controller Module (ie an EBB36/42 or FLY SHT36/42), load Klipper on it and connect the I2C pins on the board to the BME280 sensors and then connect to the host using USB.

I’m saying that you should use USB for the connection because you don’t need any high voltage/high currents for driving a motor.

Apologies I was rewriting that and accidentally sent. I didn’t mean for that to sound the way it did.

I agree that I2C was not designed for this and we’re out of spec. I’ve done I2C over long runs professionally including via cables, but it was with shielded cables and Signal Integrity work done to make sure that it was robust, knowing what we were doing was off-spec.

Unfortunately (and frustratingly), I think the standalone sensor + wires + PCB board setup is the exact environment that a lot of typical Klipper users find themselves in today. I don’t think I’ve ever seen a BME280 natively integrated into a board with an MCU marketed at 3D printing, yet plenty of folks try to use them with Klipper (myself included). I can’t speak for every sensor that Klipper supports over I2C but many come in small 4 pin PCBs with just the sensor and no smarts onboard.

The Nevermore Stealthmax design unfortunately doesn’t have the space (at least in the V1 iteration) to mount 2 extra MCU boards to access the 2 separate SGP40+BME280 sensor stacks. But because the individual measurements are non-critical, it really doesn’t matter in practice if we get a NACK, which is why I’m looking for a software/firmware solution to get existing hardware working.

I think those events are unlikely. If there are no pull-ups, it is a hardware design problem (Ex: Eddy).
But it can still work at a slower rate.

EMI also sounds like a weak catch-all explanation. It can happen, but it is much less likely in normal circumstances than one might expect.

I do have long wires (1 meter). I had 10k pull-ups at 400k and it worked, with the wires routed in the same sleeve as the DC power lines and 2 fans to the heater (Spool Dryer).
My wires are spaghetti wires, twisted together about as well as a child does braids.
But I did replace the pull-ups with 4.7k because I want a nice 3.3V peak in the signal.

IDK which modules and how they are connected (Nevermore docs), but most available had 10k by default.
So, 1 BME + 1 SGP40 should have ~5k pull-up resistance, which should be more than enough for a normal use case, even more so at a 100k rate.
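The ~5k figure is just two 10k pull-ups in parallel. A quick check, assuming each breakout board carries its own 10k pull-up per line (a guess here, not verified against the Nevermore hardware). The lower bound uses the I2C-bus specification's standard/fast-mode requirement that a device can sink 3 mA at a 0.4 V low level; the function names are illustrative.

```python
def parallel_resistance(*resistors_ohm):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


def min_pullup_ohm(vdd=3.3, vol_max=0.4, iol_a=0.003):
    """Smallest pull-up a device sinking iol_a can still pull below vol_max."""
    return (vdd - vol_max) / iol_a


two_boards = parallel_resistance(10_000, 10_000)   # one BME280 + one SGP40 board
lower_bound = min_pullup_ohm()

print(f"combined pull-up: {two_boards:.0f} ohm")
print(f"minimum allowed:  {lower_bound:.0f} ohm")
```

So a combined value around 5k sits well above the roughly 1k floor; adding more boards with their own pull-ups lowers it further.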

Where it is even possible to do 10 meters with impedance matching: https://www.youtube.com/watch?v=RMcq6Ab88KM&t=779s

I’m not saying it is impossible to have electrical issues.
It is possible; I’m saying it should work.
All issues that I had were because of my own errors in SHT3x implementation.

So, again, I’m suggesting reproducing the errors that you suspect and getting an idea of what they look like. For example:

  • One can send a command with a wrong CRC.
  • One can validate the read with CRC. If the errors you assume are really there, I would expect CRC failures to show up.
  • One can do a double read/write in a row (sensor not ready).

The current code should be simpler to work with, because every I2C write/read is now blocking, so simple pauses in the code should work as expected.
Also, since an ACK is the device pulling the line low, a bit flip there seems unlikely to me: it would take a pretty powerful disturbance to overdrive a line pull-down of 3.3mA+, I think.

I’m not a real electrical engineer,
-Timofey

And probably, if there are several devices on the same line, they should experience issues ~equally in case of electrical issues, like EMI doesn’t know with whom you are talking, actually.

Why do you think that’s a bad thing? Personally, I think it’s one of Klipper’s strengths.

You haven’t shared any images, but I can’t imagine adding an Arduino or EBB/FLY SHT board in a 3D printed case would be impossible to fit or any more unsightly than running wires from the StealthMax.

If you can’t guarantee that your NAKs are valid, how are you validating any of the data that you are receiving? This includes system initialization (where I believe correct communication is critical).

You’re looking to add complexity and customized operation for a single use case based on an out of specification situation to fairly simple code that is stable, robust and well tested.

It just doesn’t seem like the right corrective action to the issue that you’re describing.

Without seeing how the wiring is implemented or looking at oscilloscope images it’s impossible to know for sure what’s going on here.

My guess would be that the waveform at the main controller board is sufficiently distorted by poor cabling (sorry @nick for characterizing it this way) that the edges are rounded to the point where the polled voltage for a bit (or the ACK) is at an indeterminate voltage, outside the specific “Vol” or “Voh” ranges of the MCU. Noise or ringing may be contributing factors, but most likely it’s an impedance mismatch between what is expected at the sensor and MCU chips and the wires in between them.

@nick Can you 'scope the lines?

If you can’t, would you be willing to set up an Arduino as an I2C controller as an experiment to see if that confirms that you have a wiring problem?

Maybe there’s something else happening here and your assumption that it’s a wiring problem is wrong.

I’m saying it’s unfortunate because when users do this (like myself and all other Stealthmax users) the common reply is “this is out of spec and shouldn’t work anyway.” Which I think sets very unrealistic expectations - IMO if the firmware supports a sensor, either it should support the use case that the vast majority of folks deploy it in, or a note should be made explaining exactly what behavior to expect.

Unfortunately it is. Here’s an image of the interior of the stealthmax with wiring covers removed. All space that isn’t occupied by sensors/devices is needed for adequate airflow through the filter.

I’ve added a pic with a Pi 3B+ for scale, but that central sensor stack is in the middle of a filter basket with maybe 3” diameter to work with and every inch needed for airflow through the fan chamber.

The data is validated by the device and the host via an 8-bit CRC appended to all data words. If the CRC is incorrect we toss the measurement and/or the device will toss the command (and possibly NACK, although I have to re-read the spec on that 2nd bit).
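For reference, the 8-bit CRC that Sensirion documents for its I2C sensors uses polynomial 0x31 with initial value 0xFF, computed over each 2-byte word; the datasheet example is CRC(0xBEEF) = 0x92. A minimal sketch of the host-side check, where `sensirion_crc8` and `parse_word` are illustrative names and not taken from klipper-sgp40:

```python
def sensirion_crc8(data: bytes) -> int:
    """CRC-8 with polynomial 0x31, init 0xFF, no reflection, no final XOR."""
    crc = 0xFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x31) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def parse_word(frame: bytes):
    """Return the 16-bit value from a 3-byte frame, or None on CRC mismatch."""
    if len(frame) != 3 or sensirion_crc8(frame[:2]) != frame[2]:
        return None                      # toss the measurement
    return (frame[0] << 8) | frame[1]


good = parse_word(bytes([0xBE, 0xEF, 0x92]))
bad = parse_word(bytes([0xBE, 0xEE, 0x92]))   # one bit flipped on the wire

print(hex(sensirion_crc8(b"\xBE\xEF")), good, bad)
```

Returning `None` rather than raising keeps a corrupted word on the "skip this measurement" path.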

I’m not saying that I can’t guarantee NACKs are valid, I’m saying they’re not critical. If we lose data from a NACK, we can simply toss the measurement and retry. We’re not controlling heaters or moving toolheads, just estimating air quality. If my measurements are slightly off because of a blip on the line, that’s fine, and I’d imagine the same goes for most people running BME280s as chamber temp sensors.

But I absolutely agree that during system init, it’s critical we do not lose data - this case makes complete sense to have NACKs kill the machine - but they’re limited to the time before we start melting plastic.

Yes I’m adding complexity, but I’d argue not unreasonably. It is a completely normal use case within the Klipper community (see BME280s as chamber temp sensors). The code is robust but I’d argue that’s partially because it is nuclear right now - if we get a NACK because a device wasn’t ready for any reason, we kill the machine. I can’t think of another system where that is how I2C errors are handled.

The Nevermore is definitely a special case and way more involved than most applications, I agree. But the use of sensors that Klipper supports while out of spec for I2C is very much normal within the Klipper community to date.

Unfortunately I’m not able to at the moment (just recovering from a cold and need the printer back up and operational for a job), but I’ll see what I can do over the next couple of months. I don’t have a functioning Arduino to test, but I do have a Pi that I could try using for an I2C controller.

No offense taken re: poor cabling, I don’t love the cabling either. I would have at a minimum preferred a shielded cable or at least twisted pair with good conductors and definitely not berg pins. Your assertion that the waveform is distorted or marginal definitely could be the case and I wouldn’t rule out impedance mismatches either.

Agreed and oof (re: the Eddy). I’m checking the board to see if we have pull ups, but I’d be shocked if they were completely absent.

I love this description and I think that’s way better than my current wiring (pic above)

Fair point. I’ll take a shot at repro’ing the specific errors like you mention as well. Really glad to know that each I2C read/write is blocking - at least now bus conflicts should be reduced.

What’s frustrating is I can have the sensors running for hours during a print and it will get a random NACK at 4.5hrs into an 8hr print and kill the print. This happened back when I was using standalone BME280s for chamber temp sensors as well as now, although I understand way more of the mechanisms now :slight_smile:

Btw, from my small experience, you want to tighten the metal part of the DuPont connectors with tweezers so they are hard to disconnect, and maybe apply some thermal glue so they can’t rattle from vibrations.

Thanks for the tip! I’ll give that a shot as well.

I just probed the i2c lines, looks like total pull up resistance is 9.9 or 10K which is probably too high unless cap is low and I don’t have that info to hand.

You should go through:

Make sure you do a line capacitance calculation.

Yep, I’m familiar with picking I2C pull ups and line capacitance. I should be in spec (<100pF).

Per this calculator and assuming I’m not wildly off with the Dk of FEP wire, I should be around 20-50pF max for my wiring. Which should be fine even with 10K pull ups
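As a rough cross-check of those numbers: for an RC-limited bus, the rise time between 30% and 70% of VDD is t_r = ln(7/3)·R·C ≈ 0.8473·R·C, and the I2C-bus specification allows up to 1000 ns in standard mode (100 kHz) and 300 ns in fast mode (400 kHz). The sketch below plugs in the 10k pull-up and the 20-50 pF wiring estimate from this post. It ignores pin and sensor input capacitance, so treat it as an estimate only.

```python
import math

# 30% -> 70% rise of an RC charge curve takes ln(0.7 / 0.3) time constants
RISE_FACTOR = math.log(0.7 / 0.3)        # about 0.8473

MAX_RISE_NS = {"standard (100 kHz)": 1000, "fast (400 kHz)": 300}


def rise_time_ns(pullup_ohm, bus_capacitance_pf):
    """Estimated 30%-70% rise time in nanoseconds."""
    return RISE_FACTOR * pullup_ohm * bus_capacitance_pf * 1e-12 * 1e9


for c_pf in (20, 50, 100):
    t_r = rise_time_ns(10_000, c_pf)
    verdict = ", ".join(
        f"{mode}: {'ok' if t_r <= limit else 'too slow'}"
        for mode, limit in MAX_RISE_NS.items()
    )
    print(f"{c_pf:>3} pF -> {t_r:6.1f} ns  ({verdict})")
```

Under these assumptions a 10k pull-up stays inside the standard-mode limit even at 100 pF, but only the 20 pF case would meet the fast-mode limit.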

In any case, whatever the reason for a NACK (bit flip in the I2C shift register, thermal event leading to delayed sensor processing, voltage fluctuation from a particularly high current transient and insufficient decoupling caps, etc), I think logging an exception after retry on at least I2C_BUS_START_NACK or I2C_BUS_START_READ_NACK is probably safe, since in this scenario we haven’t done anything other than make some unacknowledged noise on the bus. The device may be down, but what to do then can vary by device, I should think.

For regular NACKs when we do write commands, there’s a bit more nuance and what to do probably depends on the criticality of the sensor, the operation in progress in the host code, risk of damage, etc