Memory leak, klippy OOM'd

I believe I’ve found a memory leak in Klipper. About 2.5 hours into a 4-hour print yesterday, Klippy was killed by the Linux OOM killer after growing to about 400 MB. Unfortunately I didn’t save the dmesg buffer before rebooting, but those logs are pretty inscrutable anyway. I’ve been able to confirm and reproduce the memory growth using klippy’s batch mode, with the help of memory_profiler and scalene, on both Python 2.7.16 and 3.8.0.

A plot from memory_profiler of a run against the attached gcode file shows memory usage climbing steadily for the length of the job.

profiler setup

~/klippy-env/bin/pip install memory-profiler scalene

repro steps

The following resources are contained in the attached zip file:

  • failed-job.gcode – print job that triggered the OOM
  • firmware-btt-octopus-11.dict, huvud.dict – dict files from my two MCUs
  • processed-from-log.cfg – single config file captured from my klipper install, modified slightly to work with batch mode

repro.zip (6.4 MB)

memory_profiler

This’ll show the memory growth over time; it generates plots like the one above.

~/klippy-env/bin/mprof run ~/klippy-env/bin/python \
     /home/klipper/klipper/klippy/klippy.py \
        processed-from-log.cfg \
        -i failed-job.gcode \
        -o test.serial \
        -d firmware-btt-octopus-11.dict \
        -d huvud=huvud.dict    

Generate the plot with mprof plot.
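
To save the figure to a file instead of displaying it (the output filename here is just an example):

~/klippy-env/bin/mprof plot -o klippy-memory.png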

scalene

I used scalene with klipper in a python3 venv; I didn’t try it with python2, but it may be compatible if installed from the repo (see scalene#358).

~/klippy-env/bin/scalene --outfile scalene-klippy.json --json \
     /home/klipper/klipper/klippy/klippy.py \
        processed-from-log.cfg \
        -i failed-job.gcode \
        -o test.serial \
        -d firmware-btt-octopus-11.dict \
        -d huvud=huvud.dict    

Use the scalene gui to view the results of the profiler run. You can use the data from one of my profiler runs to get a feel for what’s available: scalene-klippy.zip (contains scalene-klippy.json)

postulation

It looks to me like MCU_stepper.generate_steps is leaking memory.

@koconnor @Sineos can you take a look at this?

I’ve been crawling through the C code trying to understand how the queues work. Is the sent_queue ever emptied? It appears to grow without bound. I’m using the following to count the entries:

// Count the entries on one of the serialqueue's chelper lists. In
// klippy/chelper/list.h the head embeds a sentinel "root" node, so
// iteration starts at root.next and stops on wrapping back to the root.
static unsigned long
count_list(struct list_head *lh)
{
    unsigned long count = 0;
    struct list_node *ln;
    for (ln = lh->root.next; ln != &lh->root; ln = ln->next)
        count++;
    return count;
}

…

static double
command_event(struct serialqueue *sq, double eventtime)
{
    …

    fprintf(
        stderr,
        ">>>>> sent_queue,pending_queues,notify_queue,receive_queue,fast_readers,old_sent,old_receive %ld,%ld,%ld,%ld,%ld,%ld,%ld\n",
        count_list(&sq->sent_queue),
        count_list(&sq->pending_queues),
        count_list(&sq->notify_queue),
        count_list(&sq->receive_queue),
        count_list(&sq->fast_readers),
        count_list(&sq->old_sent),
        count_list(&sq->old_receive)
    );

    pthread_mutex_unlock(&sq->lock);
    return waketime;
}

The last line of output during my repro is:

>>>>> sent_queue,pending_queues,notify_queue,receive_queue,fast_readers,old_sent,old_receive \
    1691730,0,0,0,0,100,100

I never see that number decrease. But I also don’t see where any of these messages are freed, so I’m definitely missing something.

Thanks for tagging me, but this is definitely above my pay grade… handing it over to the master of the Klipper code, @koconnor :wink:


I’m not sure that “batch” mode is a good way to check if there is a memory leak. Klipper disables the Python garbage collector and only arranges to run it when the main process is considered idle - but in batch mode it is never going to idle. Also, there are various cleanup checks based on timers that I suspect won’t run in batch mode.

It may help if you can reproduce the issue on a real print, running the pristine code, and attach the full Klipper log from the resulting session.

-Kevin

What’s the best way to do that? There’s nothing in the logs during a normal print that’s different from what I see in batch mode. The error at the time of the OOM came only from the kernel; Klipper didn’t write anything to the log before being killed.

Or do you mean my hack to count the number of messages in the sent queue? Am I iterating through the linked list correctly?

When should those messages be freed? What is called from Python-land that would initiate that? From what I’ve been able to ascertain, the only time that happens is when the MCU is disconnected and the session ends: that wouldn’t happen automatically during idle or after a print, would it?

Messages on the serialqueue.c sent_queue are intended to be moved to the old_sent list, which is kept at a constant length (see debug_queue_add()).
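
Roughly, the mechanism looks like this (a simplified sketch, not the exact debug_queue_add() code; the explicit counter and function name are just for illustration):

// Simplified sketch of a size-bounded debug list (illustrative only).
// Append the new message; once the list exceeds max_size, unlink and
// free the oldest entry so the list length stays constant.
static void
bounded_debug_add(struct list_head *root, struct queue_message *qm,
                  int *count, int max_size)
{
    list_add_tail(&qm->node, root);
    if (++(*count) > max_size) {
        struct queue_message *old = list_first_entry(
            root, struct queue_message, node);
        list_del(&old->node);
        message_free(old);
        --(*count);
    }
}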

-Kevin

To be clear, it would seem batch mode will never call handle_message() → update_receive_seq() → debug_queue_add() → message_free(), as it doesn’t read input. However, this looks like a quirk of batch mode and does not seem to be an indicator of a problem during normal processing.

-Kevin

Thanks, @koconnor. The batch mode limitation was indeed confusing; I thought I was reproducing the issue, but it turns out I wasted a lot of time chasing a problem that only exists in a test scenario. :pensive: I ran the job in a more realistic way (although with the heaters barely above ambient and the filament unloaded) and did not observe any memory growth over the several hours it took. I’ve identified another culprit that I’ll scrutinize more closely now. I’ve also modified the systemd unit to prevent Klipper from being OOM-killed (by setting its OOM score adjustment to -1000); hopefully that limits the damage the next time this happens.
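
For reference, I did that with a systemd drop-in along these lines (assuming the service is named klipper; edit with sudo systemctl edit klipper):

[Service]
# Tell the kernel OOM killer to never pick this process
OOMScoreAdjust=-1000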

Thanks for your time!