Possible threading issue in Klippy?

We are a couple of people trying to make a nice and clean setup on the Creality CR-10 Smart Pro, the discussion is over on Reddit. Two different errors keep plaguing the installs, a “Move out of range” error and a “Timer too close” error. We are running both Mainsail and Fluidd on the internal WiFi board (OpenWRT) and we see the same type of errors. We are using the Creality installed 8GB SD card for storage.

For the former error, (at least) one of the coordinates will be completely out of whack, for instance the error may look like this:
Move out of range: 122.074 3000.000 69.984 [6018.842]

For me it typically happens before the G1 line in

      BED_MESH_CALIBRATE
      BED_MESH_PROFILE LOAD=default

      G1 X295 Y5 Z50 F5000

in the startup script. Once it has happened, only saving&restart of the printer.cfg file will cure it. If only a new attempt at printing is made, it will fail with the same coordinates just after probing is done. A friend of mine with an Ender 3 S1 and an RPi3 also sees the same problem on occasion. This is NOT the same as the configuration mistake typically made when forgetting to account for the probe offset when probing the bed. It will work more than 9 out of 10 times.

The “Timer too close” has happened both when uploading a file from PrusaSlicer and during printing of a Benchy.

The fact that the errors are not exclusive to the Creality WiFi card, but happens more frequently along with the random data corruption suggests to me either a threading problem or a stray pointer somewhere, but I do not have the knowledge of python to quickly find any apparent issues. The two errors may be caused by the same issue or they could be completely separate matters.

Running the latest from GitHub.

It’s not possible to offer any help without a log file showing the errors.

Also when you say internal wifi board, is that the same as the Creality wi-fi box? If it is, you’re not going to have a good experience running Klipper on it. That only has a single core 580Mhz mips processor with 128MB of RAM. That’s half the speed of the original Pi Zero which is already too slow to handle Klipper and a webcam at the same time.

It is the same - not running a webcam on it though. The speed shouldn’t cause commanding random coordinates, although it could explain the MCU shutdowns.

Will capture a set of logs the next time I see the problem, could be today, could be in a week.

Yes, the out of range error is likely unrelated to the hadware you’re using. Without a log though it’s impossible to offer any advice.

The timer too close error is usually either a communication issue, or their wasn’t enough processor time available.

I guess the post was first and foremost to get reports of the problem out of the woodwork, of the three people I know running Klipper all have seen it. So I think it may be underreported and possibly a hint to a larger underlying problem.

Or it is as simple as @jakep_82 hinted at: The box is simply too slow to run Klipper.

Edit:
Another spitball: Guess you are running this OpenWRT fork on it. Maybe just something in there that does not play nice with Klipper.

I’m actively running Klipper on 3 different printers, and I don’t have this problem on any of them. I’m also a helper on Discord where we have 12,000+ members and I’m not seeing this reported by anyone running common hardware. I think it’s more likely something specific to your hardware and/or configuration.

With respect to the hardware being slow; yes, by today’s standards it is, 25 years ago the CPU spec on the Creality WiFi Box would have been found on a mid to high-end SGI workstation. It is plenty fast enough to keep up with the demands of a single 250kbps serial port, it is just a matter of priority. And AFAICT Klipper - at least in KlipperWRT form - does not make any priorities. Everything, hard realtime, soft realtime and non-realtime runs at the same system priority. Only klipper.py is hard realtime, everything else is just a nuisance if they are delayed.

I changed /etc/init.d/klipper to invoke klipper with nice --9, just randomly picking a priority higher than default, I could then successfully print with sustained 100% CPU utilization, a load factor >4 and five dashboards. Dashboards were lagging, of course, but the main thing was that the operation of the printer itself was unaffected. Before I did the change, running find / on it while printing would cause an immediate MCU shutdown.

Hmmm. Didn’t quite expect that one. Overstressed it with external loads yesterday without issue, now it died again, timer too close.

klippy-log.zip (255.8 KB)

Here’s the stat graph. Not sure what’s happening, but you can see the problem at the end.

Stats

There are some tools you can use to analyze the Klipper log (details at Debugging - Klipper documentation ). For example, your log shows ( scripts/graphstats.py klippy-3.log -s ):

The host cpu is clearly struggling to keep up during the entire duration (blue system load average is regularly above 100%). It’s actually fairly impressive that Klipper ran as long as it did. Ultimately, it looks like you also went into memory swap (note the decreasing yellow line) which finally killed it.

FWIW, low end host cpus aren’t a target for me. I don’t know of anyone currently working on it.

-Kevin

Interesting graph. I will admit to being very fresh to Klipper - like 8 days in…

Given a large enough print, that available memory line could become a problem for any CPU, it could look like a memory leak. It could, of course, also be a garbage collector not having time to run… Load above 1 isn’t a problem as long as the processes waiting are soft realtime or less, IMHO there is something very wrong if a sudden spike of incoming web interface requests will cause a disruption to the hard realtime application, it should have explicit priority. That was what I attempted with my change to the startup script, and it almost worked. I just didn’t foresee the swap issue. The Benchy Gcode is only 2.5MB, so not in anyway massive.

I have a NUC running Ubuntu and a few Atlassian applications sitting next to the printer, I guess the next step is to hook that up to the printer and try again.

Given a large enough print, that available memory line could become a problem for any CPU, it could look like a memory leak. It could, of course, also be a garbage collector not having time to run… Load above 1 isn’t a problem as long as the processes waiting are soft realtime or less, IMHO there is something very wrong if a sudden spike of incoming web interface requests will cause a disruption to the hard realtime application, it should have explicit priority. That was what I attempted with my change to the startup script, and it almost worked. I just didn’t foresee the swap issue. The Benchy Gcode is only 2.5MB, so not in anyway massive.

The graph Kevin posted clearly shows your system is heavily loaded. If you run some number crunching application that doesn’t care about latency and processes getting temporarily no computing time or has to wait on disk thats okay. With software like klipper which needs to calculate and send stuff out with not too much delay, that’s not a really good idea.

You can now spend days and weeks analyzing and improving Linux scheduling, nice, ionice, cgroups, running latencytop and whatnot, compile kernels, change schedulers and options, Hz numbers, ticks, noticks, disk cache settings, vm dirty ratio and whatnot. Maybe you get it right. Maybe not because you run into one of the many Linux scheduling issues, especially with direct attached sdcard or mmc storage (i.e. heavy disk access will stall everything no matter how much you optimize everything you can just mitigate it a bit, not fix it).

I’d do some more basic checks i.e. if it’s swapping etc. but if there’s nothing obvious I wouldn’t spend the time and just use a more powerful machine. Much easier and more reliable.

I have a NUC running Ubuntu and a few Atlassian applications sitting next to the printer, I guess the next step is to hook that up to the printer and try again.

You mean running klipper on the same machine that runs Atlassian applications? Those Atlassian Java memory and CPU hogs? That sounds like asking for trouble again :wink:

General info about running Klipper on something other than a Pi can be found at this link.

Take special note of the warnings about running intensive general-purpose computing tasks on the host computer.

enderbender, I don´t disagree that it’s heavily loaded, I’m just saying it should be able to handle it because the fraction taken by the hard realtime system is manageable. Unless the chimney on the Benchy is very compute intensive, which I don’t think it is. I could be mistaken.

But I did do a few more checks and I found that logd was using a 51200kiB memory buffer. Reduced that by a factor 10 and instantly had a lot more RAM available. Printing the bench again with exactly the same Gcode now, so we’ll see how we fare. By the looks of things, it was RAM that took us out.

It’s not easy to get hold of a Pi these days.

Well. This is an interesting coincidence. Now I just had that out of range error you reported subwooferbone. But I have plenty of procesing power, it’s a Pi4. It also happened inside the startscript. Usually first it homes XY, then goes to the front right. It only went to the right, but not to the front. The next move then gave the out of range error.

Edit:
It happened here:

[gcode_macro START_PRINT]
gcode:
  {% set BED_TEMP = params.BED_TEMP|default(70)|float %}
  {% set EXTRUDER_TEMP = params.EXTRUDER_TEMP|default(220)|float %}
  M140 S{BED_TEMP}; Bett Temp setzen ohne warten auf Ziel Temp
  M104 S150; Extruder Temp auf 150 Grad setzen ohne warten auf Ziel Temp
  G28 X Y; Home XY Achse
  G1 X260 Y0 F5000; Nach ganz vorne Rechts fahren

Instead of going to the front right, it only went to the right, not to the front, then it continued heating, after heating was done, the out of range error occured.

klippy-outofrange.log (5.1 MB)

Now I just pressed the “restart” button on the klipperscreen and it’s giving me the exact same error again.

I’ll leave the printer and the Pi on now in case somebody wants me to try or do something to further narrow this down.

Are you sure the printer is in absolute mode (G90) and not relative (G91)?

Issuing G1 X260 Y0 F5000 while in relative mode would do exactly what you describe, and the subsequent G1 X259.5 Y0 F6000 would result in the printer attempting to move X to 519.5 which matches your error.

Looking at your config, you have G91 in your PAUSE macro, but I don’t see a G90 anywhere to put the printer back into absolute. You might want to add that to your START_PRINT macro if it’s not explicitly added by your slicer.