MCU ‘sb2040’ shutdown: Timer too close

Basic Information:

Printer Model: Voron Trident 350
MCU / Printerboard: Spider2.2/Fly sb2040 pro/Fly-ercf-can
Host / SBC: Raspberry Pi 3B+
klippy.log

Describe your issue:

When I print a file, the printer ramdomly shutdown as “MCU ‘sb2040’ shutdown: Timer too close” after printing 1~3 hrs.
I try to troubleshoot by using graphstats.py, but it didn’t help. The can bus, host loading, temperature and buffers are all looks normal.
Also I try following actions but didn’t change the result:

  1. Limit the microstep of extruder stepper(which is on sb2040) from 16 to 8
  2. Stop KlipperScreen

klippy.log.zip (3.8 MB)

Check your wiring, there is a communication issue somewhere. You’ve got bytes that are building up in the serial queue over time until it can’t keep up.

sb2040: 
mcu_awake=0.009 mcu_task_avg=0.000013 
mcu_task_stddev=0.000021 bytes_write=18161615 
bytes_read=6470898 bytes_retransmit=7240 
bytes_invalid=0 send_seq=456818 
receive_seq=456818 retransmit_seq=432300 
srtt=0.002 rttvar=0.001 rto=0.025 
ready_bytes=17 
**upcoming_bytes=187798** 
freq=11999976 adj=11999623 
sd_pos=12701115 

upcoming_bytes should be 0 ideally, those are stalled messages in the queue.
It’s clearing them out 17 bytes at a time (via the ready_bytes) but it looks like it’s building up faster than it can clear them out.

You mean there are trafic jam in serial port (CAN),right? And this is possiblly caused by hw wiring not stable?
I’ll change another CAN cable and try again tomorrow.

Yes, Means the CAN bus isn’t having clear communications with your sb2040 for whatever reason. When Klipper talks to the mcu it expects an acknowledgement from the mcu that it recieved it successfully.

When it doesn’t get that it queues it up to resend it at the next possible time.

If it gets too bad the timing of the functions gets off (Klipper says do X at Y time, mcu says I did X at Z time) and if they’re way off you get a timer error.

It looks like your queue is filling up, but probably not at a fast enough rate to immediately trigger the error as it clears it out as it can. I didn’t go back far enough to see if it started getting extremely bad after a certain point.

Underlying issue usually is a flaky connection somewhere.

Okay I just graphed it instead.

You got the “Timer too close” and restart around the 30000 timestamp mark.
So it did build up rather suddenly but you can see a small blip earlier on too.

Once you had a restart it was fine.

Crystal clear!
After temporarily change another CAN for sb2040, the “Timer Too Close” gone. I checked the Upcoming bytes back to less than 10 all the time.
BTW, how to generate the Upcoming Bytes diagram? any command?

I just made a Python script. Here you go, you’ll just need to drop it on your Pi somewhere.

log.zip (1.0 KB)

You’ll need to install matplotlib

pip install matplotlib

Then navigate to wherever you extracted the file and you use it like

python log.py <path-to-log-file> <mcu-name>

as in

python log.py ~/printer_data/klippy.log mcu
or
python log.py ~/tmp/klippy.log mcu

I didn’t want to jack with the Regex more than it was so it just uses the first part of the mcu name before any “_”, in other words, your ercf_mcu would just be “ercf”.

If it can’t find the mcu it will tell you which ones it can find so you can pick from that.

One of these days I’ll get around to making a website where people can upload their klippy.log and it gives them all the info from it. I’ve been thinking about that for a while.

The upcoming_bytes counter indicates data that the Klipper host knows it will need to eventually send to the mcu, but also knows it shouldn’t yet send to the mcu. It is not an indication of an error; it is normal for it to be non-zero. In contrast, a very high (eg, more than 64) value in ready_bytes indicates an issue with communications - this is the counter of bytes that the host would like to send to the mcu, but is unable to due to delays on the communication line.

As to the error reported here - the Klipper host code has been heavily modified (with various ercf and other modifications), so it’s difficult to guess what the host code was attempting at the time of the failure. It does appear the host got overloaded, but why is not clear.

Cheers,
-Kevin

Thanks for the clarification, I was mostly going off of the comparison to the stats from my log and a cursory look at the serialhdl file. I must have gotten the ready_bytes and upcoming bytes mixed up.

From ready_bytes, it looks normal:


From upcoming_bytes, it indicate abnormal:

After I change the CAN Cable, everything back to normal:


1 Like

Here, I updated the script and renamed it to something more obvious.

loggraph.zip (1.3 KB)

It now takes 3 parameters as defined below.
You can run

python loggraph.py -?

To get this same output in the future

Usage: python script.py <log_file_path> <mcu_name> <data_field>
Available data fields: mcu_awake, mcu_task_avg, mcu_task_stddev, bytes_write, bytes_read, bytes_retransmit, bytes_invalid, send_seq, receive_seq, retransmit_seq, srtt, rttvar, rto, ready_bytes, upcoming_bytes, freq

It will now also automatically label the chart right based on what you choose.

@Sineos - You might like this as well. I have some future ideas regarding something similar.

It needs Matplotlib though

pip install matplotlib

perfect!
I’ll try this new one.
Thank you so much

1 Like

klippy.log.20240610.zip (144.3 KB)
This error happen again.
It’s strange as I change another brand new CAN wire and use new CAN solution.
CAN speed: 1M
USB2CAN: Spider v2.2 with TJA1050 CAN transmitter, work in USB CAN bridge mode.
Here I attach my modified files from clean klipper repo which is the latest.
modified_files.zip (56.2 KB)
Also I attached my gcode file.
Dragon-Z-flat.zip (5.2 MB)
Dragon-Z-flat.z01.zip (7.5 MB)

I monitor the whole print progress and extract the klipper.log every 10 mins. Now I find the Upcoming Bytes raise up, I immediately pause the print.


Funny thing is happen: Even the print is paused, the PART FAN is still work with dynamic speed. It looks like the BIG QUEUED command is dynamic fan control.
Now I got this diagram while print pause.

After a while, the dynamic PART FAN stopped.
I got the normal Upcoming_Bytes:

During the whole things happen, I got nothing abnormal on CAN bus:


For sure, the CAN works normally.

What model USB2CAN and how is it wired to your Spider board, and how is your CAN bus wired to the SB2040?

If you’re using the wiring supplied with the SB2040, did you twist the yellow and white wires at roughly even intervals when you ran the wiring?

Well, I rework all of my CAN wires and make sure each H/L are twist together.
I’ll print a large file again to check if any communication jam happen again.

After rework all CAN wires, the upcoming_bytes usually below 1000 and ready_bytes usually under 800. The “Timer Too Closed” and “MCU Lost communication” disappeared.
But, if I enable “dynamic hangout fan”, the issue come out again.
anyway, I’m happy that the wiring issue was gone. Looks like too many M106 would cause sb2040 become stucked.

Same problem

I prepaired simple script to repit problem without printing
Head SB2040 V2

n=10000000
for i in $(seq 1 $n); do 
               echo  "M106 S$[ $RANDOM % 255 + 100 ]";       
                echo  "G4 P$[ $RANDOM % 50  ]" ;    
 done >test.gcode

upcoming_bytes grow very fast when it try quickly change speed of fan

And it fail to timer to close at the end
Is there any timer with 0,1 sec step?
It seems that it fail with 16 bit counter

aha, you really reproduce this issue in a simple & quick way.