Trying to understand Lost communication with MCU

I’m trying to understand why I keep getting Lost communication with MCU.

Looking at the code it can only happen here:

def check_active(self, print_time, eventtime):
        if self._steppersync is None:
            return
        offset, freq = self._clocksync.calibrate_clock(print_time, eventtime)
        self._ffi_lib.steppersync_set_time(self._steppersync, offset, freq)
        if (self._clocksync.is_active() or self.is_fileoutput()
            or self._is_timeout):
            return
        self._is_timeout = True
        logging.info("Timeout with MCU '%s' (eventtime=%f)",
                     self._name, eventtime)
        self._printer.invoke_shutdown("Lost communication with MCU '%s'" % (
            self._name,))

For this to happen clocksync.is_active() has to be false, witch means:

def is_active(self):
        return self.queries_pending <= 4

Has to be false.

Im trying to understand if that 4 is arbitrary or if it some hard condition that needs to be met.

The only place where queries are added is here:

# MCU clock querying (_handle_clock is invoked from background thread)
    def _get_clock_event(self, eventtime):
        self.serial.raw_send(self.get_clock_cmd, 0, 0, self.cmd_queue)
        self.queries_pending += 1
        # Use an unusual time for the next event so clock messages
        # don't resonate with other periodic events.
        return eventtime + .9839

Witch leads me to belive it might be some race condition where too many events are queried and that causes the connection to drop?

Hope I made myself clear.
Kind regards,
Franco.

This is a code to sync time, which happens to report lost of communication.
Klipper tries to support different ways of connecting MCUs and different MCUs by themself.

So, every “second” we query the current time from MCU.
If the current time can’t be synced in N seconds, we assume there is a serious issue and just do “stop the world”.

4 queries, probably because:

  • more than 4 seconds of unresponsiveness is too much.
  • Klipper currently builds around timings, so one of the expectations is that the timer does not wrap in 10 seconds. We can control time only in a frame of 1<<31 ticks, so 5 seconds in the end.

That is it.

Witch leads me to belive it might be some race condition where too many events are queried and that causes the connection to drop?

Under normal conditions, you can take a look at your bandwidth and MCU load. But if the bus is really busy, it is most likely to get a Timer Too Close error (PWM updates should happen in <300ms).
Klippy busy (fail to schedule in time, cpu/memory pressure & etc) - same error.
If MCU is busy Rescheduled timer in the past (if MCU lagging behind in >1ms).

So, most probably your error is a lost of communication with MCU.

I did not know it occured 1 each seccond, then my suspicion does not make sense.
Thanks a lot for the explanation. I understand that the communication is lost but I’m curious as why is it lost. And why in 4 seconds is unable to recover in that case.
I’ll try taking a look at the MCU load as you sugested.

1 Like

I may kindly suggest reproducing it and showing logs in the general section.
It may be also helpful to show how your things are connected if there is something unusual.
Generally, with USB, a loss of communication is a logical disconnect from the system, so the error will be a little different. TTY is lost - write error, can0 lost write error. (Dmesg may help with USB).
In the case of UART GPIO there is no data about the line state at the Linux side AFAIK.

You may find it helpful to use a log visualizer, like this one https://sineos.github.io

Anyway, it is better to move support questions there, if there are any.

2 Likes