There are some communication-related PRs #6823 and #6811, and I just ask myself a question:
“How can I determine if there is a bus issue? Like retransmits, kernel bus waiting, can internal stuff?”
Like, maybe I do not understand something, but srtt/rttvar/rto are interesting, but most of the time are just constants srtt=0.001 rttvar=0.001 rto=0.025
, bandwidth is complicated because of different communication possibilities available now and cumulative, so feels unresponsive.
So, I got to the thought, that maybe, in case of bus issues (does not matter where they are between Klippy and MCU), there should be a latency spike and it is enough to say, there is something in between, and distinguish issues like python gc slow, CPU overloaded, memory pressure & etc.
This is the suggested solution, maybe there is a better one:
diff --git a/klippy/chelper/serialqueue.c b/klippy/chelper/serialqueue.c
index c207495c..26a9c128 100644
--- a/klippy/chelper/serialqueue.c
+++ b/klippy/chelper/serialqueue.c
@@ -56,7 +56,7 @@ struct serialqueue {
uint64_t send_seq, receive_seq;
uint64_t ignore_nak_seq, last_ack_seq, retransmit_seq, rtt_sample_seq;
struct list_head sent_queue;
- double srtt, rttvar, rto;
+ double srtt, rttvar, rto, maxrtt;
// Pending transmission message queues
struct list_head pending_queues;
int ready_bytes, upcoming_bytes, need_ack_bytes, last_ack_bytes;
@@ -204,6 +204,11 @@ update_receive_seq(struct serialqueue *sq, double eventtime, uint64_t rseq)
else if (sq->rto > MAX_RTO)
sq->rto = MAX_RTO;
sq->rtt_sample_seq = 0;
+ // Track spikes
+ double half_rtt = delta / 2.0;
+ sq->maxrtt *= exp(-0.1 * half_rtt);
+ if (half_rtt > sq->maxrtt)
+ sq->maxrtt = half_rtt;
}
if (list_empty(&sq->sent_queue)) {
pollreactor_update_timer(sq->pr, SQPT_RETRANSMIT, PR_NEVER);
@@ -932,13 +937,13 @@ serialqueue_get_stats(struct serialqueue *sq, char *buf, int len)
snprintf(buf, len, "bytes_write=%u bytes_read=%u"
" bytes_retransmit=%u bytes_invalid=%u"
" send_seq=%u receive_seq=%u retransmit_seq=%u"
- " srtt=%.3f rttvar=%.3f rto=%.3f"
+ " srtt=%.3f rttvar=%.3f rto=%.3f maxrtt=%.3f"
" ready_bytes=%u upcoming_bytes=%u"
, stats.bytes_write, stats.bytes_read
, stats.bytes_retransmit, stats.bytes_invalid
, (int)stats.send_seq, (int)stats.receive_seq
, (int)stats.retransmit_seq
- , stats.srtt, stats.rttvar, stats.rto
+ , stats.srtt, stats.rttvar, stats.rto, stats.maxrtt
, stats.ready_bytes, stats.upcoming_bytes);
}
Just track max rtt with continuous decay, I hope it is slow enough to show in the stats when there is a one-time spike once per second.
There will be a spike in case of missed scheduling of the next digital event, timer too close, or anything, CAN-related #6810.
In case of uart connection and too high stepping, which overloads the bus (who knows), there will be a spike.
Like, I think it would behave like that. If there is a way to reproduce some things, like trsync_timeout
, it would be interesting to look at this.
From my surface testing, it shows the spikes when there is a lot of data transmission from the data_logger.py
, my hacky bus_bandwidth tests, or homing under dump load:
mcu: mcu_awake=2.628 mcu_task_avg=0.000011 mcu_task_stddev=0.000000 bytes_write=2236826 bytes_read=30768335 bytes_retransmit=9 bytes_invalid=0 send_seq=42316 receive_seq=42316 retransmit_seq=2 srtt=0.001 rttvar=0.000 rto=0.025 maxrtt=0.001
mcu: mcu_awake=2.645 mcu_task_avg=0.000011 mcu_task_stddev=0.000000 bytes_write=2061871 bytes_read=17998939 bytes_retransmit=9 bytes_invalid=0 send_seq=38529 receive_seq=38528 retransmit_seq=2 srtt=0.000 rttvar=0.000 rto=0.025 maxrtt=0.002
Thanks.
I use USB serial and can test USB CAN if needed, but I lack a real can/serial setup.
P.S.
Looks like this code only tracks rtt for a valid transmission, so, there should be something a little different to track max rtt with accounting for retransmissions, if it is possible to do so reliably.