Memory leak, klippy OOM'd

I believe I’ve found a memory leak in Klipper. About 2.5 hours into a 4-hour print yesterday, Klippy was killed by the Linux OOM killer after growing to about 400 MB. Unfortunately I didn’t save the dmesg buffer before rebooting, but those logs are pretty inscrutable anyway. I’ve been able to confirm and reproduce the memory growth using klippy’s batch mode, with the help of memory_profiler and scalene, on both Python 2.7.16 and 3.8.0.

A plot from memory_profiler of a run against the attached gcode file shows memory usage climbing steadily for the length of the job.

profiler setup

~/klippy-env/bin/pip install memory-profiler scalene

repro steps

The following resources are contained in the attached zip file:

  • failed-job.gcode – print job that triggered the OOM
  • firmware-btt-octopus-11.dict, huvud.dict – dict files from my two MCUs
  • processed-from-log.cfg – single config file captured from my klipper install, modified slightly to work with batch mode

repro.zip (6.4 MB)

memory_profiler

This’ll show the memory growth over time; it generates plots like the one above.

~/klippy-env/bin/mprof run ~/klippy-env/bin/python \
     /home/klipper/klipper/klippy/klippy.py \
        processed-from-log.cfg \
        -i failed-job.gcode \
        -o test.serial \
        -d firmware-btt-octopus-11.dict \
        -d huvud=huvud.dict    

Generate the plot with mprof plot.
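
To save the figure to a file instead of displaying it (the output filename here is just an example):

~/klippy-env/bin/mprof plot -o klippy-memory.png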

scalene

I used scalene with klipper in a python3 venv; I didn’t try it with python2, but it may be compatible if installed from the repo (see scalene#358).

~/klippy-env/bin/scalene --outfile scalene-klippy.json --json \
     /home/klipper/klipper/klippy/klippy.py \
        processed-from-log.cfg \
        -i failed-job.gcode \
        -o test.serial \
        -d firmware-btt-octopus-11.dict \
        -d huvud=huvud.dict    

Use the scalene gui to view the results of the profiler run. You can use the data from one of my profiler runs to get a feel for what’s available: scalene-klippy.zip (contains scalene-klippy.json)

postulation

It looks to me like MCU_stepper.generate_steps is leaking memory.

@koconnor @Sineos can you take a look at this?

I’ve been crawling through the C code trying to understand how the queues work. Is the sent_queue ever emptied? It appears to grow without bound. I’m using the following to count the entries:

// Count the entries on one of the serialqueue's chelper lists. In
// klippy/chelper/list.h the head embeds a sentinel "root" node, so
// iteration starts at root.next and stops on wrapping back to the root.
static unsigned long
count_list(struct list_head *lh)
{
    unsigned long count = 0;
    struct list_node *ln;
    for (ln = lh->root.next; ln != &lh->root; ln = ln->next)
        count++;
    return count;
}

…

static double
command_event(struct serialqueue *sq, double eventtime)
{
    …

    fprintf(
        stderr,
        ">>>>> sent_queue,pending_queues,notify_queue,receive_queue,fast_readers,old_sent,old_receive %ld,%ld,%ld,%ld,%ld,%ld,%ld\n",
        count_list(&sq->sent_queue),
        count_list(&sq->pending_queues),
        count_list(&sq->notify_queue),
        count_list(&sq->receive_queue),
        count_list(&sq->fast_readers),
        count_list(&sq->old_sent),
        count_list(&sq->old_receive)
    );

    pthread_mutex_unlock(&sq->lock);
    return waketime;
}

The last line of output during my repro is:

>>>>> sent_queue,pending_queues,notify_queue,receive_queue,fast_readers,old_sent,old_receive \
    1691730,0,0,0,0,100,100

I never see that number decrease. But I also don’t see where any of these messages are freed, so I’m definitely missing something.

Thanks for tagging me, but this is definitely above my pay grade… handing it over to the master of the Klipper code, @koconnor :wink:


I’m not sure that “batch” mode is a good way to check if there is a memory leak. Klipper disables the Python garbage collector and only arranges to run it when the main process is considered idle - but in batch mode it is never going to idle. Also, there are various cleanup checks based on timers that I suspect won’t run in batch mode.

It may help if you can reproduce the issue on a real print, running the pristine code, and attach the full Klipper log from the resulting session.

-Kevin

What’s the best way to do that? There’s nothing in the logs during a normal print that’s different from what I see in batch mode. The error at the time of the OOM came only from the kernel; Klipper didn’t write anything to the log before being killed.

Or do you mean my hack to count the number of messages in the sent queue? Am I iterating through the linked list correctly?

When should those messages be freed? What is called from Python-land that would initiate that? From what I’ve been able to ascertain, the only time that happens is when the MCU is disconnected and the session ends: that wouldn’t happen automatically during idle or after a print, would it?

Messages on the serialqueue.c sent_queue are intended to be moved to the old_sent list, which is kept at a constant length (see debug_queue_add()).
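
Roughly, the mechanism looks like this (a simplified sketch, not the exact debug_queue_add() code; the explicit counter and function name are just for illustration):

// Simplified sketch of a size-bounded debug list (illustrative only).
// Append the new message; once the list exceeds max_size, unlink and
// free the oldest entry so the list length stays constant.
static void
bounded_debug_add(struct list_head *root, struct queue_message *qm,
                  int *count, int max_size)
{
    list_add_tail(&qm->node, root);
    if (++(*count) > max_size) {
        struct queue_message *old = list_first_entry(
            root, struct queue_message, node);
        list_del(&old->node);
        message_free(old);
        --(*count);
    }
}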

-Kevin

To be clear, it would seem batch mode will never call handle_message() → update_receive_seq() → debug_queue_add() → message_free(), as it doesn’t read input. However, this looks like a quirk of batch mode and does not seem to be an indicator of a problem during normal processing.

-Kevin

Thanks, @koconnor. The batch mode limitation was indeed confusing; I thought I was reproducing the issue, but it turns out I wasted a lot of time chasing a problem that only exists in a test scenario. :pensive: I ran the job in a more realistic way (although with the heaters barely above ambient and the filament unloaded) and did not observe any memory growth over the several hours it took. I’ve identified another culprit that I’ll scrutinize more closely now. I’ve also modified the systemd unit to prevent Klipper from being OOM-killed (by setting its OOM score adjustment to -1000); hopefully that limits the damage the next time this happens.
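
For reference, I did that with a systemd drop-in along these lines (assuming the service is named klipper; edit with sudo systemctl edit klipper):

[Service]
# Tell the kernel OOM killer to never pick this process
OOMScoreAdjust=-1000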

Thanks for your time!