Lack of memory barriers?

I was digging through the Klipper source code. The current code to write to memory looks roughly like this:

#define barrier() __asm__ __volatile__("": : :"memory")

static inline void writel(void *addr, uint32_t val) {
    barrier();
    *(volatile uint32_t *)addr = val;
}
static inline uint32_t readl(const void *addr) {
    uint32_t val = *(volatile const uint32_t *)addr;
    barrier();
    return val;
}

Using volatile prevents the compiler from assuming the value is unchanged between accesses, so it can't cache the value in a register or elide the access.

barrier() prevents the compiler from re-ordering memory accesses across it, but only at the language level.

But there’s nothing here to stop the CPU from re-ordering reads and writes. On architectures like ARM, with their weak memory models, there’s no guarantee the hardware won’t re-order self-contained memory accesses like these into an arbitrary order.

In the case above, a memory barrier may not be needed for readl/readw/readb and writel/writew/writeb if the accesses use naturally atomic types (a machine word, perhaps) and the code calling these functions knows it doesn’t need ordering.

In a single-threaded program, using these functions is fine anyway. However, most of the pain comes from accessing hardware registers in things like drivers.
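For contrast, here is a hedged sketch (GCC/Clang builtins only; writel_mb/readl_mb are made-up names, not Klipper’s API) of what accessors with a real fence would look like next to the compiler-only barrier:

```c
#include <stdint.h>

// Compiler-only barrier: blocks compiler reordering, emits no instruction.
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

// Full fence: blocks compiler reordering AND emits a hardware barrier
// instruction where the ISA needs one (e.g. "dmb ish" on ARM).
#define full_fence() __atomic_thread_fence(__ATOMIC_SEQ_CST)

static inline void writel_mb(void *addr, uint32_t val) {
    full_fence();
    *(volatile uint32_t *)addr = val;
}

static inline uint32_t readl_mb(const void *addr) {
    uint32_t val = *(volatile const uint32_t *)addr;
    full_fence();
    return val;
}
```

On a strongly ordered ISA the fence may cost little or nothing; on a weakly ordered one it’s the difference between ordered and unordered device access.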

A lot of driver code seems to believe barrier() enforces a memory barrier. As a random example, we see commits like rp2040: add barrier in usb_read_ep0_setup. This commit only introduces a compiler barrier, not a memory barrier!

An easier-to-understand example is the commit stm32: Protect message ram with barrier() calls instead of volatile in fdcan.c .

We have this code segment:

    txfifo->data[0] = msg->data32[0];
    txfifo->data[1] = msg->data32[1];
    barrier();
    SOC_CAN->TXBAR = ((uint32_t)1 << w_index);
    return CANMSG_DATA_LEN(msg);

This compiles to:

   txfifo->id_section = ids;
  34:   f84e 3005       str.w   r3, [lr, r5]
    txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
    txfifo->data[0] = msg->data32[0];
  38:   e9d0 3401       ldrd    r3, r4, [r0, #4]
    txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
  3c:   041b            lsls    r3, r3, #16
    txfifo->data[0] = msg->data32[0];
  3e:   608c            str     r4, [r1, #8]
    txfifo->data[1] = msg->data32[1];
  40:   68c4            ldr     r4, [r0, #12]
    txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
  42:   f403 2370       and.w   r3, r3, #983040 @ 0xf0000
    txfifo->data[1] = msg->data32[1];
  46:   60cc            str     r4, [r1, #12]
    txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
  48:   604b            str     r3, [r1, #4]
    barrier();
    SOC_CAN->TXBAR = ((uint32_t)1 << w_index);
  4a:   2301            movs    r3, #1
  4c:   4908            ldr     r1, [pc, #32]   @ (70 <canhw_send+0x70>)
  4e:   4093            lsls    r3, r2
  50:   f8c1 30d0       str.w   r3, [r1, #208]  @ 0xd0
    return CANMSG_DATA_LEN(msg);

The code is a little verbose because the compiler interleaves multiple store/load operations; some of this is pointer chasing and address generation. I’m assuming this is okay and the order of the stores here isn’t too important, otherwise there would already be calls to barrier(). Importantly for this commit we see this:

    txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
  48:   604b            str     r3, [r1, #4]
    barrier();
    SOC_CAN->TXBAR = ((uint32_t)1 << w_index);
  4a:   2301            movs    r3, #1
  4c:   4908            ldr     r1, [pc, #32]   @ (70 <canhw_send+0x70>)
  4e:   4093            lsls    r3, r2
  50:   f8c1 30d0       str.w   r3, [r1, #208]  @ 0xd0
    return CANMSG_DATA_LEN(msg);

There is a store to dlc_section; then (1 << w_index) is calculated here, rather than earlier, due to the barrier; then a store to TXBAR. There are no memory barriers to be seen. The CPU is free to re-order the write to dlc_section to after TXBAR if it sees fit. All the barrier() call does is force the pointer chasing and left shift to run later than they need to. In this case there should probably be a DMB between these two instructions.
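A sketch of that fix, using a mocked-up register layout (the real fdcan.c structs, SOC_CAN, and CANMSG_DATA_LEN differ) and __atomic_thread_fence as a portable stand-in for the CMSIS __DMB() intrinsic Klipper uses elsewhere:

```c
#include <stdint.h>

// Hypothetical layouts, mocked for illustration only; the real fdcan.c
// message RAM and CAN peripheral structures differ.
struct txfifo_s { volatile uint32_t data[2]; };
struct can_regs { volatile uint32_t TXBAR; };

// __atomic_thread_fence(__ATOMIC_SEQ_CST) compiles to "dmb ish" on ARM;
// on Cortex-M firmware the usual spelling is the CMSIS __DMB() intrinsic.
#define mb() __atomic_thread_fence(__ATOMIC_SEQ_CST)

static uint32_t can_send(struct txfifo_s *txfifo, struct can_regs *can,
                         const uint32_t *data32, uint32_t w_index) {
    txfifo->data[0] = data32[0];
    txfifo->data[1] = data32[1];
    mb();  // ensure the message RAM stores complete before ringing TXBAR
    can->TXBAR = (uint32_t)1 << w_index;
    return 8;  // the real code returns CANMSG_DATA_LEN(msg)
}
```

The fence guarantees the message payload is visible to the peripheral before the transmit request bit is set, which is exactly what the compiler-only barrier() fails to guarantee on a weakly ordered CPU.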

In some cases barrier() does nothing really useful, such as the rp2040 bootrom code:

    // Set up flash so we can work with it without XIP getting in the way
    flash_enter_xip_prepare();
    irq_disable();
    barrier();
    connect_internal_flash();

The compiler generally doesn’t re-order function calls like this, so the barrier() here doesn’t do anything:

    asm volatile("cpsid i" ::: "memory");
20007c3c:       b672            cpsid   i

    // Set up flash so we can work with it without XIP getting in the way
    flash_enter_xip_prepare();
    irq_disable();
    barrier();
    connect_internal_flash();
20007c3e:       f7ff ffb3       bl      20007ba8 <connect_internal_flash>
    flash_exit_xip();
20007c42:       f7ff ffb9       bl      20007bb8 <flash_exit_xip>
    flash_cs_force(0);
20007c46:       2000            movs    r0, #0
20007c48:       f7ff ffc6       bl      20007bd8 <flash_cs_force>

It seems like a lot of code is written with the impression that there is a memory barrier, but there isn’t. The compiler doesn’t specify it and there’s no barrier in the assembly.

Maybe Klipper should redefine barrier() to issue a dmb?

I haven’t tested running this yet but this patch should portably add a memory and compiler fence:

diff --git a/klippy/chelper/compiler.h b/klippy/chelper/compiler.h
index ab16efbd4..c27afacdd 100644
--- a/klippy/chelper/compiler.h
+++ b/klippy/chelper/compiler.h
@@ -2,7 +2,7 @@
 #define __COMPILER_H
 // Low level definitions for C languange and gcc compiler.
 
-#define barrier() __asm__ __volatile__("": : :"memory")
+#define barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)
 
 #define likely(x)       __builtin_expect(!!(x), 1)
 #define unlikely(x)     __builtin_expect(!!(x), 0)
diff --git a/src/compiler.h b/src/compiler.h
index ab16efbd4..c27afacdd 100644
--- a/src/compiler.h
+++ b/src/compiler.h
@@ -2,7 +2,7 @@
 #define __COMPILER_H
 // Low level definitions for C languange and gcc compiler.
 
-#define barrier() __asm__ __volatile__("": : :"memory")
+#define barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)
 
 #define likely(x)       __builtin_expect(!!(x), 1)
 #define unlikely(x)     __builtin_expect(!!(x), 0)

It’s a semi-related issue, but the Linux timer signal handler doesn’t use a memory barrier or signal-safe access to variables. There was a PR over at linux/timer: Use volatile sig_atomic_t in signal handler to avoid undefined behavior by mattaw · Pull Request #6108 · Klipper3d/klipper · GitHub which deliberately avoided using volatile and instead just assumed there was a memory barrier in readl/writel, which there isn’t.
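For reference, the signal-safe pattern that PR title describes can be sketched like this (run_once is a hypothetical helper for illustration, not code from the PR):

```c
#include <signal.h>

// volatile sig_atomic_t is the one object type C guarantees can be read
// and written atomically with respect to an asynchronous signal handler.
static volatile sig_atomic_t timer_fired;

static void on_alarm(int sig) {
    (void)sig;
    timer_fired = 1;  // the only safe thing to do here is set the flag
}

// Hypothetical helper: install the handler, deliver one signal, and
// report whether the flag was observed. raise() delivers the signal
// synchronously, so the handler runs before raise() returns.
static int run_once(void) {
    signal(SIGALRM, on_alarm);
    raise(SIGALRM);
    return timer_fired;
}
```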

Edit: Running this seems fine.

Thanks for reviewing.

The Klipper mcu code is written for specific micro-controllers (typically ARM cortex-m, avr, or similar). So, the code is only intended to run on these specific types of platforms.

It is my general understanding that the MCU CPUs that Klipper supports will never reorder memory reads/writes with respect to themselves. That is, if the assembler does something like x=0; y=1; x=2; y=3; ... ; print(x); print(y); it’s never going to report x=2 and y=1. The majority of places Klipper issues an explicit memory barrier is to avoid the compiler reordering instructions because the compiler doesn’t know that an interrupt handler could run. In these situations we don’t have to worry about the cpu reordering writes, because the cpu never reorders writes with respect to itself. That is, an interrupt handler could “interrupt” normal task execution, but any read/write issued by the task is atomic wrt the interrupt handler and vice-versa.

Similarly, it is my general understanding that these mcu cpus do not reorder writes to device memory (as long as those writes are to the same device). That is, at the assembler level, something like USB_REG1 = 1; USB_REG2 = 2; will never get reordered. (If this were not the case, then nearly every single write to device memory would need a barrier, and we don’t see the vendors do that in their own code.)

There are some places in the Klipper code where we do issue cpu specific memory barriers (eg, git grep -i '__[di][sm]b'). In these cases, we’re protecting against specific races that could plausibly occur. It is of course possible that we’ve missed some specific cases where a race could occur and that should have a more stringent cpu barrier. If you know of a specific case, please let me know.

I don’t think it is a good idea to bulk add cpu barrier instructions to the code. There are certain areas of the code that have been heavily optimized - we’ve optimized for individual instructions on the scheduler and stepper code paths for example. So, I think we should add barriers where they are needed, and try to limit them to only the places where they are needed.

Cheers,
-Kevin

The two examples you cited do look like they would be safer with a __DMB().

In fdcan.c, it looks like there are writes to two different devices (general memory vs SOC_CAN), and we have seen some issues with reordering when that happens. Interestingly, it seems changes were made to src/atsam/fdcan.c and src/atsamd/fdcan.c, but it looks like they weren’t ported to the original src/stm32/fdcan.c . It may also be possible that the stm32 cpus don’t in practice reorder in that case, but no harm in using a more explicit barrier.

Similarly, the rp2040 usb code also appears to be writes to two different devices (usb_dpram vs usb_hw). We’ve seen this can be an issue on some devices (for example see commit 52af688245a663ad2a2b44f6b7c939d5c3e17f6f). It might not be an issue on rp2040/rp2350, but also not an issue to use a more explicit barrier.

Cheers,
-Kevin

There’s no way to know whether a CPU reorders writes or reads without access to micro-architectural information, which ARM doesn’t release. They instead expect you to program according to their memory model and use memory barriers.

Some CPUs, such as high-end Cortex-M parts, have deep pipelines and superscalar architectures. Do these reorder memory? We don’t know, and we can’t measure it in Klipper’s case: the firmware is single-threaded, and the CPU always preserves sequential semantics from the current thread’s perspective.

Saying that these CPUs don’t reorder memory is a belief we can’t really verify and one that gets less convincing with each new CPU that needs fast memory access.

Yes, if you don’t want register writes to be reordered you do need barriers. It is worrying to find out there are a lot of memory accesses that should be ordered but aren’t. I know vendor BSPs don’t do ordering, but these can be buggy too. Klipper’s core code should at least be using proper barriers for IO access.

You refer to barrier() as a memory barrier again, which it isn’t, and that gives people the wrong impression of what it does. I showed in the opening post that barrier() is treated as a memory barrier and used in places where a memory barrier is needed. It doesn’t make much sense to call barrier() in the rp2040 bootrom code if it isn’t expected to be a memory barrier.

A good first step to fixing this problem would be to start looking at the code and adding proper memory barriers for IO access. Using a full barrier is a decent short-term fix until the code can be evaluated to find out where barrier() is mistakenly or intentionally used. Even if you know it’s not a memory barrier, it seems other developers have assumed it is one.

Regarding the performance impact of memory barriers, I’m not sure how bad it would be and whether it’s offset by deeper pipelines. It could even be that in-order CPUs no-op the instructions.

Regardless it’s important to write code according to the ISAs memory model, not a memory model that isn’t guaranteed to be used.

Reading AN4838 - Introduction to memory protection unit management on STM32 MCUs we can see this warning:

The Cortex-M7 implements the speculative prefetch feature, which allows speculative accesses to normal memory locations (for example: FMC, Quad-SPI devices). When a speculative prefetch happens, it may impact memories or devices that are sensitive to multiple accesses (such as FIFOs, LCD controller). It may also disturb the traffic generated by another masters such as LCD-TFT or DMA2D with higher bandwidth consumption when a speculative prefetch happens. In order to protect normal memories from a speculative prefetch, it is recommended to change memory attributes from normal to a strongly ordered or to device memory thanks to the MPU. For more details about configuring memory attributes, refer to Section 6: MPU setting example with STM32Cube HAL on Armv6 and Armv7 architectures

It turns out that, on Cortex-M devices at least, we can use the memory protection unit to mark an address range as device memory, which enforces strongly-ordered accesses. This may be one way to mitigate some of these problems. It wouldn’t fix the issue on non-Cortex-M targets, such as Linux, but it would at least rule out a wide swath of issues. The compiler won’t re-order volatile memory accesses, so I think this would make most drivers work okay.
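As a rough configuration sketch (ARMv7-M MPU register addresses and bit fields per the architecture manual; the region number, base, and size here are placeholders, not taken from any real board):

```c
#include <stdint.h>

// ARMv7-M MPU setup marking a peripheral window as Device memory.
// SCS register addresses from the ARMv7-M ARM; region 0 is a placeholder.
#define MPU_CTRL  (*(volatile uint32_t *)0xE000ED94)
#define MPU_RNR   (*(volatile uint32_t *)0xE000ED98)
#define MPU_RBAR  (*(volatile uint32_t *)0xE000ED9C)
#define MPU_RASR  (*(volatile uint32_t *)0xE000EDA0)

static void mpu_mark_device(uint32_t base, uint32_t log2_size) {
    MPU_RNR  = 0;              // select region 0 (placeholder choice)
    MPU_RBAR = base & ~0x1Fu;  // region base, at least 32-byte aligned
    // XN=1 (bit 28), AP=0b011 full access (bits 26:24), TEX=0 C=0 B=1
    // (shareable Device memory), SIZE = log2(size)-1 (bits 5:1), ENABLE
    MPU_RASR = (1u << 28) | (3u << 24) | (1u << 16)
             | ((log2_size - 1u) << 1) | 1u;
    MPU_CTRL = (1u << 2) | 1u; // PRIVDEFENA | ENABLE
}
```

This is a hardware configuration fragment, so treat the attribute encoding as a starting point to check against the reference manual for the specific part.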

But to address another point above, I thought a bit more about this code from the rp2040 INTR fix:

    usb_hw->sie_status = USB_SIE_STATUS_SETUP_REC_BITS;
    barrier();
    dpram_memcpy(data, (void*)usb_dpram->setup_packet, max_len);
    barrier();
    if (usb_hw->intr & USB_INTR_SETUP_REQ_BITS) {
        // Raced with next setup packet
        usb_notify_ep0();
        return -1;
    }

I think the introduction of dpram_memcpy with volatile semantics would make these barrier() calls useless? I’m not sure.

Just a small update from me: adding memory barriers (not just compiler barriers) won’t stop all out-of-order memory access. The main reason is that memory barriers don’t stop load speculation, so even if everything were ordered correctly the CPU may still speculate loads anyway.

The proper solution for this is to mark the IO-register address space as strongly-ordered device memory. This can be done on ARM chips in mostly the same way, but the address range differs per chip.