Potential issue with Klipper timing

I am trying to understand the host<->MCU communication and operation, and I am confused by something that I am seeing.

Right after the configuration phase, the host issues an ADC query cycle to start polling one of the temperature sensors. However, the start time of the request is based on a clock value that was queried before the entire configuration phase started.

I modified the serialqueue.c code to dump request and response packets; there are no other modifications to any of the Klipper code. The log has the form:

[<timestamp>] <msg type>: <msg content>

What I get is this:

[250266027803227] request: [7 24 2 20 114 66 126]
[250266027822714] request: [8 25 38 12 83 76 54 126]
[250266027843672] request: [14 26 26 13 3 0 0 156 206 142 0 38 114 126]
[250266027852691] request: [10 27 27 13 201 159 0 129 127 126]
[250266027875498] request: [10 28 31 0 128 111 0 221 81 126]
[250266027895330] request: [7 29 21 1 212 74 126]
[250266027946184] request: [11 30 14 2 55 37 0 24 122 190 126]
[250266027958382] request: [11 31 26 14 54 0 0 0 122 179 126]
[250266027978920] request: [9 16 31 3 34 1 6 117 126]
[250266028000812] request: [7 17 21 4 38 68 126]
[250266028057593] request: [13 18 14 5 128 96 128 97 0 24 135 242 126]
[250266028066745] request: [11 19 26 15 95 0 0 0 160 234 126]
[250266028087446] request: [9 20 31 6 35 1 11 0 126]
[250266028292303] request: [60 21 21 7 14 8 19 18 0 24 26 16 5 0 0 0 31 9 36 1 21 10 38 17 84 26 18 2 0 0 156 206 142 0 27 18 201 159 0 14 11 66 67 0 24 26 19 52 0 0 0 5 142 144 251 140 111 92 90 126]
[250266528749898] request: [59 22 39 12 185 244 142 0 221 96 8 129 219 221 0 128 243 57 129 252 13 4 28 13 182 248 231 102 0 39 17 186 152 221 64 221 96 8 129 219 221 0 136 91 129 249 111 4 28 18 182 249 142 24 0 3 245 33 126]

The decoded stream is:

[112748328] request[24]: allocate_oids {'count': 20}
[112748912] request[25]: config_analog_in {'oid': 12, 'pin': 'PF3'}
[112750960] request[26]: config_digital_out {'oid': 13, 'pin': 'PA3', 'value': 0, 'default_value': 0, 'max_duration': 60000000}
[118759176] request[27]: set_digital_out_pwm_cycle {'oid': 13, 'cycle_ticks': 1200000}
[118759648] request[28]: config_endstop {'oid': 0, 'pin': 'PG15', 'pull_up': 0}
[118762536] request[29]: config_trsync {'oid': 1}
[118763064] request[30]: config_stepper {'oid': 2, 'step_pin': 'PD7', 'dir_pin': 'PC5', 'invert_step': 0, 'step_pulse_ticks': 24}
[124768296] request[31]: config_digital_out {'oid': 14, 'pin': 'PD6', 'value': 0, 'default_value': 0, 'max_duration': 0}
[124772160] request[16]: config_endstop {'oid': 3, 'pin': 'PC2', 'pull_up': 1}
[124774120] request[17]: config_trsync {'oid': 4}
[124774968] request[18]: config_stepper {'oid': 5, 'step_pin': 'PG0', 'dir_pin': 'PG1', 'invert_step': 0, 'step_pulse_ticks': 24}
[130779912] request[19]: config_digital_out {'oid': 15, 'pin': 'PF15', 'value': 0, 'default_value': 0, 'max_duration': 0}
[130783024] request[20]: config_endstop {'oid': 6, 'pin': 'PC3', 'pull_up': 1}
[130785224] request[21]: config_trsync {'oid': 7}
[130785568] request[21]: config_stepper {'oid': 8, 'step_pin': 'PB3', 'dir_pin': 'PB2', 'invert_step': 0, 'step_pulse_ticks': 24}
[136789968] request[21]: config_digital_out {'oid': 16, 'pin': 'PA5', 'value': 0, 'default_value': 0, 'max_duration': 0}
[136792800] request[21]: config_endstop {'oid': 9, 'pin': 'PC4', 'pull_up': 1}
[136794544] request[21]: config_trsync {'oid': 10}
[136794832] request[21]: config_analog_in {'oid': 17, 'pin': 'PF4'}
[136795904] request[21]: config_digital_out {'oid': 18, 'pin': 'PA2', 'value': 0, 'default_value': 0, 'max_duration': 60000000}
[142802464] request[21]: set_digital_out_pwm_cycle {'oid': 18, 'cycle_ticks': 1200000}
[142802752] request[21]: config_stepper {'oid': 11, 'step_pin': 'PE2', 'dir_pin': 'PE3', 'invert_step': 0, 'step_pulse_ticks': 24}
[148807440] request[21]: config_digital_out {'oid': 19, 'pin': 'PD4', 'value': 0, 'default_value': 0, 'max_duration': 0}
[148810336] request[21]: finalize_config {'crc': 3793667695}
[148817408] request[22]: query_analog_in {'oid': 12, 'clock': 121440000, 'sample_ticks': 12000, 'sample_count': 8, 'rest_ticks': 3600000, 'min_value': 14777, 'max_value': 32269, 'range_check_count': 4}

In the raw stream, the timestamp is wall-clock time in nanoseconds. In the decoded stream, the timestamp is the controller cycle count. The request[X] string in the decoded stream shows the block sequence number.

According to the wall-clock time, there are 250266528749898 - 250266027822714 = 500927184 nanoseconds between the host configuring the sensor and requesting the start of the poll cycle. At roughly 12 MHz, that is about 6035267 controller clock cycles.

According to the Klipper log file, the last controller clock that was received was last_clock=112738400 and the query cycle start time is 121440000.

So, Klipper somehow thinks that the controller cannot possibly take more than 121440000 - 112738400 = 8701600 cycles to finish the entire configuration phase.
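
As a quick sanity check, here is that arithmetic spelled out (a minimal sketch; the 12 MHz figure is the nominal controller frequency, which is an assumption, so the result differs slightly from the cycle count above, which presumably reflects the measured frequency):

    NS_PER_S = 1_000_000_000
    MCU_FREQ = 12_000_000  # assumed nominal controller frequency in Hz

    wall_gap_ns = 250266528749898 - 250266027822714  # 500927184 ns on the host side
    gap_cycles = wall_gap_ns * MCU_FREQ // NS_PER_S   # ~6.0 million cycles at 12 MHz
    clock_budget = 121440000 - 112738400              # 8701600 cycles granted by the host
    print(gap_cycles, clock_budget, clock_budget - gap_cycles)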

I don’t see how that is OK. It may work on currently supported controllers, but if there is ever one that takes more cycles to do the same work, then this would result in an error.

Shouldn’t Klipper query the controller clock after the configuration has been finalized and ack’ed by the controller before issuing any commands?

From a high level, it happens somewhere around here:

    def _build_config(self):
        if not self._sample_count:
            return
        self._oid = self._mcu.create_oid()
        self._mcu.add_config_cmd("config_analog_in oid=%d pin=%s" % (
            self._oid, self._pin))
        clock = self._mcu.get_query_slot(self._oid)
        sample_ticks = self._mcu.seconds_to_clock(self._sample_time)
        mcu_adc_max = self._mcu.get_constant_float("ADC_MAX")
        max_adc = self._sample_count * mcu_adc_max
        self._inv_max_adc = 1.0 / max_adc
        self._report_clock = self._mcu.seconds_to_clock(self._report_time)
        min_sample = max(0, min(0xffff, int(self._min_sample * max_adc)))
        max_sample = max(0, min(0xffff, int(
            math.ceil(self._max_sample * max_adc))))
        self._mcu.add_config_cmd(
            "query_analog_in oid=%d clock=%d sample_ticks=%d sample_count=%d"
            " rest_ticks=%d min_value=%d max_value=%d range_check_count=%d" % (
                self._oid, clock, sample_ticks, self._sample_count,
                self._report_clock, min_sample, max_sample,
                self._range_check_count), is_init=True)
    def get_query_slot(self, oid):
        slot = self.seconds_to_clock(oid * .01)
        t = int(self.estimated_print_time(self._reactor.monotonic()) + 1.5)
        return self.print_time_to_clock(t) + slot
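
Plugging the numbers from my log into that helper reproduces the clock=121440000 seen in the query_analog_in command. This is only a back-of-the-envelope sketch: it assumes the nominal 12 MHz, that print_time_to_clock is a plain frequency scaling for the main MCU, and that the estimated print time at that moment was close to last_clock / frequency:

    MCU_FREQ = 12_000_000                    # assumed nominal controller frequency in Hz
    oid = 12                                 # oid of the analog_in in my log

    est_print_time = 112738400 / MCU_FREQ    # ~9.39 s, derived from last_clock
    slot = oid * MCU_FREQ // 100             # per-oid offset (oid * 0.01 s): 1,440,000 ticks
    t = int(est_print_time + 1.5)            # truncates to 10 s
    clock = t * MCU_FREQ + slot              # 121,440,000 -- matches the log
    print(clock)

In other words, the whole budget comes from that 1.5 s lookahead applied to a clock estimate taken before the configuration commands were sent.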

and here is where the configuration is received:

    def _get_identify_data(self, eventtime):
        # Query the "data dictionary" from the micro-controller
        identify_data = b""
        while 1:
            msg = "identify offset=%d count=%d" % (len(identify_data), 40)
            try:
                params = self.send_with_response(msg, 'identify_response')
            except error as e:
                logging.exception("%sWait for identify_response",
                                  self.warn_prefix)
                return None
            if params['offset'] == len(identify_data):
                msgdata = params['data']
                if not msgdata:
                    # Done
                    return identify_data
                identify_data += msgdata
    def _start_session(self, serial_dev, serial_fd_type=b'u', client_id=0):
        self.serial_dev = serial_dev
        self.serialqueue = self.ffi_main.gc(
            self.ffi_lib.serialqueue_alloc(serial_dev.fileno(),
                                           serial_fd_type, client_id),
            self.ffi_lib.serialqueue_free)
        self.background_thread = threading.Thread(target=self._bg_thread)
        self.background_thread.start()
        # Obtain and load the data dictionary from the firmware
        completion = self.reactor.register_callback(self._get_identify_data)
        identify_data = completion.wait(self.reactor.monotonic() + 5.)
        if identify_data is None:
            logging.info("%sTimeout on connect", self.warn_prefix)
            self.disconnect()
            return False
        msgparser = msgproto.MessageParser(warn_prefix=self.warn_prefix)
        msgparser.process_identify(identify_data)

What I want to say is that there is no connection between these events.
Within those 5 seconds, the host should receive the controller dictionary with the supported commands and the actual compiled configuration (pins, frequency).
The host starts doing things as soon as it has received the full config and knows the supported command list.

Then we send config_analog_in and, after a pause of around 1.5 s, query the actual sensor state.

finalize_config is only needed to enable the move queue:

static void
move_finalize(void)
{
    if (is_finalized())
        shutdown("Already finalized");
    struct move_queue_head dummy;
    move_queue_setup(&dummy, sizeof(*move_free_list));
    move_list = alloc_chunks(move_item_size, 1024, &move_count);
    move_reset();
}

void
command_finalize_config(uint32_t *args)
{
    move_finalize();
    config_crc = args[0];
}
DECL_COMMAND(command_finalize_config, "finalize_config crc=%u");

There is no connection between the query_analog_in and finalize_config events: the MCU will be able to respond to query_analog_in as long as it happens after config_analog_in.

The only issue that can happen is if the query_analog_in command arrives too late, in which case you will receive “Timer too close”.

This logic doesn’t make sense to me.

In my mind, there is an implicit connection between finalize_config and query_analog_in. That connection is that the controller processes commands serially. Therefore, the controller cannot process the query_analog_in command until all other commands have completed.

So, again, the clock value in the query_analog_in command is built on the assumption that the controller will only take a certain amount of time to complete all previous commands, and that is the problem I am highlighting.

Since the controller processes commands serially, the clock value in the query_analog_in command means that the host is expecting the controller to complete all previously sent commands within 8701600 cycles.

All commands that the host sends to the controller that depend on controller time should be sent after the host has confirmed that the controller has finished all configuration commands.

In other words, after the host has sent all configuration commands, the host should:

  1. send finalize_config,
  2. send get_config,
  3. wait for is_config,
  4. query the controller clock.

Only then should any commands dependent on the controller clock be sent.
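
In pseudocode, the ordering I am proposing would look roughly like this (a sketch only; the mcu object and its methods are hypothetical stand-ins for illustration, not the actual klippy API):

    # Sketch of the proposed ordering; helper names are illustrative only.
    def send_config_then_schedule(mcu, config_cmds, init_cmd_builders, lookahead=1.5):
        for cmd in config_cmds:                   # config_analog_in, config_stepper, ...
            mcu.send(cmd)
        mcu.send("finalize_config crc=%d" % mcu.config_crc)
        params = mcu.send_with_response("get_config", "config")
        assert params["is_config"]                # controller has processed everything above
        params = mcu.send_with_response("get_clock", "clock")
        base_clock = params["clock"]              # fresh, post-configuration clock reference
        for build in init_cmd_builders:           # e.g. query_analog_in, query_counter, ...
            # every clock-dependent command is scheduled relative to base_clock
            mcu.send(build(base_clock + int(lookahead * mcu.freq)))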

In my mind, there is an implicit connection between finalize_config and query_analog_in. That connection is that the controller processes commands serially. Therefore, the controller cannot process the query_analog_in command until all other commands have completed.

The source of truth is in the code.

You can try to find code in the MCU that would guarantee that behavior and refuse to answer query_analog_in until finalize_config.

If there is a theoretical issue with the current solution, it would be helpful if you describe it, like: “If A happens before B, then C happens”.

If query_analog_in arrives too late (its timer would be placed before all other timers), then a “Timer too close” error happens. Because of this, it is scheduled 1.5 s in the future.

Right now what you describe above sounds like a personal opinion, like:
“If query_analog_in arrives before finalize_config, it is wrong”

The source of truth is in the code.

That’s just saying “if it’s coded that way, it’s correct because the code can’t be wrong.”

If there is a theoretical issue with the current solution, it would be helpful if you describe it, like: “If A happens before B, then C happens”.

I’ve already described it in my original post but here it is again:

“The host sends the query_analog_in command without confirming that the controller has finished processing all previous commands. Since the query_analog_in command contains an absolute controller clock value which is calculated based on the last controller time received before the configuration phase began, if the controller takes longer to complete all previous commands than the host expects, adding the analog query timer will result in “Timer too close”.”

If query_analog_in arrives too late (its timer would be placed before all other timers), then a “Timer too close” error happens. Because of this, it is scheduled 1.5 s in the future.

This is exactly my point. It’s an arbitrary time in the future. Maybe it has been derived from some experiments, but there is no guarantee that it will always be sufficient. A much more robust implementation would be to query the controller clock after the is_config reply has been received.

Right now what you describe above sounds like a personal opinion, like:
“If query_analog_in arrives before finalize_config, it is wrong”

I am not sure that it is. I am pretty sure that if you mentally follow my description, you’ll arrive at the same conclusion.

Ask yourself the question: is there any guarantee that the 1.5 s that the host uses is, and will always be, sufficient? If your answer is “Yes”, prove it without pointing to the current source code.

I guess the real question is: is it a real-world problem or just a theoretical one? I’m not aware of any reports of a TTC error at such an early stage.

There are probably plenty of areas where something like this could be improved, but only by investing a lot of time and taking the risk of introducing regressions.

Okay, I made a simple test:

diff --git a/klippy/mcu.py b/klippy/mcu.py
index 09585348f..08e15a3b3 100644
--- a/klippy/mcu.py
+++ b/klippy/mcu.py
@@ -708,7 +708,14 @@ class MCU:
             if prev_crc is None:
                 logging.info("Sending MCU '%s' printer configuration...",
                              self._name)
+                delayed = []
                 for c in self._config_cmds:
+                    if "finalize_config" in c:
+                        logging.info("delay finalize config")
+                        delayed.append(c)
+                        continue
+                    self._serial.send(c)
+                for c in delayed:
                     self._serial.send(c)
             else:
                 for c in self._restart_cmds:

This one specifically reorders things so that finalize_config happens after query_analog_in.
It just works.

If we do something like this:

diff --git a/klippy/mcu.py b/klippy/mcu.py
index 09585348..22afb7e3 100644
--- a/klippy/mcu.py
+++ b/klippy/mcu.py
@@ -709,6 +709,9 @@ class MCU:
                 logging.info("Sending MCU '%s' printer configuration...",
                              self._name)
                 for c in self._config_cmds:
+                    if "finalize_config" in c:
+                        logging.info("delay finalize config")
+                        self._reactor.pause(self._reactor.monotonic() + 3)
                     self._serial.send(c)
             else:
                 for c in self._restart_cmds:

Because I delayed the commands by 3 seconds, it gets “Timer too close”.
This is technically what you are reporting: that somehow there is a slow connection or too much stuff to initialize, and query_analog_in arrives late.

Receive queue
Receive: 80 1030.380877 1030.380634 11: seq: 11, uptime high=1 clock=136760487
Receive: 81 1030.431552 1030.431013 10: seq: 12, clock clock=157091418
Receive: 82 1030.482554 1030.482231 10: seq: 13, clock clock=177493401
Receive: 83 1030.533557 1030.533182 10: seq: 14, clock clock=197895359
Receive: 84 1030.584558 1030.583688 11: seq: 15, clock clock=218297369
Receive: 85 1030.634765 1030.634675 11: seq: 16, clock clock=238380989
Receive: 86 1030.685563 1030.684890 11: seq: 17, clock clock=258701325
Receive: 87 1030.736565 1030.735707 11: seq: 18, clock clock=279103281
Receive: 88 1030.787568 1030.787065 11: seq: 19, clock clock=299505291
Receive: 89 1030.800570 1030.800383 11: seq: 1a, clock clock=304705807
Receive: 90 1031.784713 1031.784605 11: seq: 1b, clock clock=698381495
Receive: 91 1031.862623 1031.861825 8: seq: 1c, debug_result val=766
Receive: 92 1031.862787 1031.862693 8: seq: 1d, debug_result val=1018
Receive: 93 1031.863615 1031.863047 10: seq: 1e, config is_config=0 crc=0 is_shutdown=0 move_count=0
Receive: 94 1032.769662 1032.769442 11: seq: 1b, clock clock=1092382631
Receive: 95 1033.754062 1033.753535 11: seq: 1c, clock clock=1486021051
Receive: 96 1034.401695 1033.753535 15: seq: 1c, stats count=353 sum=298902 sumsq=2801466
Receive: 97 1034.738762 1034.737976 11: seq: 1d, clock clock=1880059679
Receive: 98 1034.867279 1034.866255 12: seq: 1f, shutdown clock=1931280214 static_string_id=Timer too close
Send queue
Sent 83 1032.769442 1032.769442 6: seq: 1a, get_clock
Sent 84 1033.753535 1033.753535 6: seq: 1b, get_clock
Sent 85 1034.737976 1034.737976 6: seq: 1c, get_clock
Sent 86 1034.866215 1034.866215 11: seq: 1d, finalize_config crc=2316766483
Sent 87 1034.866255 1034.866255 26: seq: 1e, query_analog_in oid=22 clock=1393032704 sample_ticks=400000 sample_count=8 rest_ticks=120000000 min_value=8962 max_value=32269 range_check_count=4
Sent 88 1034.866278 1034.866278 13: seq: 1f, queue_digital_out oid=23 clock=810174213 on_ticks=0
Sent 89 1034.866294 1034.866294 13: seq: 10, queue_digital_out oid=24 clock=810184082 on_ticks=0
Sent 90 1034.866307 1034.866307 13: seq: 11, queue_digital_out oid=25 clock=810188978 on_ticks=0
Sent 91 1034.866322 1034.866322 13: seq: 12, queue_digital_out oid=26 clock=810192795 on_ticks=0
Sent 92 1034.866349 1034.866349 26: seq: 13, query_analog_in oid=27 clock=1413032704 sample_ticks=400000 sample_count=8 rest_ticks=120000000 min_value=3605 max_value=32269 range_check_count=4
Sent 93 1034.866365 1034.866365 15: seq: 14, queue_digital_out oid=28 clock=810201501 on_ticks=400000
Sent 94 1034.866386 1034.866386 19: seq: 15, query_counter oid=1 clock=1309032704 poll_ticks=4000 sample_ticks=400000000
Sent 95 1034.866402 1034.866402 15: seq: 16, queue_digital_out oid=29 clock=810210948 on_ticks=114000
Sent 96 1034.866414 1034.866414 13: seq: 17, queue_digital_out oid=30 clock=810214734 on_ticks=0
Sent 97 1034.866439 1034.866439 25: seq: 18, query_analog_in oid=31 clock=1429032704 sample_ticks=400000 sample_count=8 rest_ticks=120000000 min_value=5523 max_value=8044 range_check_count=4
Sent 98 1034.866766 1034.866766 64: seq: 19, query_analog_in oid=32 clock=1433032704 sample_ticks=400000 sample_count=8 rest_ticks=120000000 min_value=13677 max_value=32269 range_check_count=4, query_counter oid=2 clock=1313032704 poll_ticks=80000 sample_ticks=400000000, query_analog_in oid=34 clock=1441032704 sample_ticks=400000 sample_count=8 rest_ticks=120000000 min_value=13677 max_value=31983 range_check_count=4
Sent 99 1034.867278 1034.867278 14: seq: 1a, queue_digital_out oid=35 clock=810288768 on_ticks=0, get_config

For me it looks like, in this hypothetical situation where things cannot happen in time (like initialization taking 2 seconds longer), I’m not sure you would be able to print at all.
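
For reference, the failure condition is visible directly in the numbers above: by the time query_analog_in goes out (around 1034.87), the last controller clock reported back is already far past the clock the command asks for, so the timer cannot be scheduled. A rough check with the values from the log:

    # Values taken from the send/receive queues above.
    requested_clock = 1393032704   # clock= in the first query_analog_in sent
    last_seen_clock = 1880059679   # last "clock" response before the shutdown
    shutdown_clock = 1931280214    # clock reported with "Timer too close"

    print(requested_clock - last_seen_clock)  # negative: the requested timer lies in the past
    print(shutdown_clock - requested_clock)   # how far past it was at shutdown time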

OK, if you guys think that this will never be an issue, that’s fine. We can drop it.

However, I did want to point out that in my log, the time difference between Klipper sending the config command and the query command for the same OID is not 1.5 s; it’s 0.5 s. So, obviously, the code path is different.

I really appreciate your work. Trust me, I highly appreciate each and every measure that would reduce the number of TTC errors, as they are one of the seven plagues for every Klipper believer.

It just needs to make a difference, and here I am not sure (but I also cannot prove the opposite). Nevertheless, thanks! :+1:

Having found this theoretical issue, my gut feeling is that there might be other such interdependencies lurking throughout the code.

Of course, I don’t have a shred of proof beyond what I’ve outlined here, but I just don’t understand how the host could calculate time ahead of the controller clock since it is running so much faster.

Also, in a lot of cases, the reports that we get on the Voron Discord say that the Pi was barely loaded (< 10% load), and users keep getting TTC errors even after configuring Klipper to run on a dedicated core.

Look at the host CPU loading at the time of the TTC (use the klippy.log analysis tool that @Sineos has written) - often there is a spike at that time. Your host may run at an average of 10%, but it is the load spikes that cause the TTCs.

Could you give me a pointer to the tool?

https://sineos.github.io/index.html
