I was digging through the Klipper source code. The current code to write to memory looks roughly like this:
#define barrier() __asm__ __volatile__("": : :"memory")
static inline void writel(void *addr, uint32_t val) {
barrier();
*(volatile uint32_t *)addr = val;
}
static inline uint32_t readl(const void *addr) {
uint32_t val = *(volatile const uint32_t *)addr;
barrier();
return val;
}
Using volatile prevents the compiler from assuming the value is unchanged between reads. So it won’t try and re-read the value.
Barrier prevents the compiler from re-ordering variables on a language level.
But there’s nothing here to stop the CPU from re-ordering reads and writes. On architectures like ARM with weak memory models there’s not guarantee it won’t re-order most self-contained memory accesses like these in to an arbitrary order.
In the case above for the readl/readw/readb and writel/writew/writeb, a memory barrier may not be needed if the accesses are using atomic data types (maybe a machine word?) and the code calling these functions knows it doesn’t need a barrier.
On a single-threaded program using these functions will be fine anyway. However most the pain comes from accessing hardware registers for things like drivers.
A lot of driver code seems to believe barrier() enforces a memory barrier. As a random example, we see commits like rp2040: add barrier in usb_read_ep0_setup. This commit only introduces a compiler barrier, not a memory barrier!
An easier to understand example is https://stm32: Protect message ram with barrier() calls instead of voltaile in fdcan.c .
We have this code segment:
txfifo->data[0] = msg->data32[0];
txfifo->data[1] = msg->data32[1];
barrier();
SOC_CAN->TXBAR = ((uint32_t)1 << w_index);
return CANMSG_DATA_LEN(msg);
This compiles to:
txfifo->id_section = ids;
34: f84e 3005 str.w r3, [lr, r5]
txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
txfifo->data[0] = msg->data32[0];
38: e9d0 3401 ldrd r3, r4, [r0, #4]
txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
3c: 041b lsls r3, r3, #16
txfifo->data[0] = msg->data32[0];
3e: 608c str r4, [r1, #8]
txfifo->data[1] = msg->data32[1];
40: 68c4 ldr r4, [r0, #12]
txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
42: f403 2370 and.w r3, r3, #983040 @ 0xf0000
txfifo->data[1] = msg->data32[1];
46: 60cc str r4, [r1, #12]
txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
48: 604b str r3, [r1, #4]
barrier();
SOC_CAN->TXBAR = ((uint32_t)1 << w_index);
4a: 2301 movs r3, #1
4c: 4908 ldr r1, [pc, #32] @ (70 <canhw_send+0x70>)
4e: 4093 lsls r3, r2
50: f8c1 30d0 str.w r3, [r1, #208] @ 0xd0
return CANMSG_DATA_LEN(msg);
The code is a little verbose as the compiler is interleaving multiple store/load operations as some of this is pointer chasing and generating addresses. I’m assuming this is okay and the order to store data here isn’t too important, otherwise there would be calls to barrier(). Importantly for this commit we see this:
txfifo->dlc_section = (msg->dlc & 0x0f) << 16;
48: 604b str r3, [r1, #4]
barrier();
SOC_CAN->TXBAR = ((uint32_t)1 << w_index);
4a: 2301 movs r3, #1
4c: 4908 ldr r1, [pc, #32] @ (70 <canhw_send+0x70>)
4e: 4093 lsls r3, r2
50: f8c1 30d0 str.w r3, [r1, #208] @ 0xd0
return CANMSG_DATA_LEN(msg);
There is a store to dlc_section, then (1 << w_index) is calculated here rather than earlier due to the barrier, then a store to TXBAR. There are no memory barriers to be seen. The CPU is free to re-order the write to dlc_section to after TXBAR if it sees fit. All the barrier() call is doing is forcing the pointer chasing and left shift to run later than it needs to. In this case there should probably be a DMB between these two instructions.
In some cases barrier() does nothing really useful, such as the rp2040 bootrom code:
// Set up flash so we can work with it without XIP getting in the way
flash_enter_xip_prepare();
irq_disable();
barrier();
connect_internal_flash();
The compiler generally doesn’t re-order function calls like this so it doesn’t do anything:
asm volatile("cpsid i" ::: "memory");
20007c3c: b672 cpsid i
// Set up flash so we can work with it without XIP getting in the way
flash_enter_xip_prepare();
irq_disable();
barrier();
connect_internal_flash();
20007c3e: f7ff ffb3 bl 20007ba8 <connect_internal_flash>
flash_exit_xip();
20007c42: f7ff ffb9 bl 20007bb8 <flash_exit_xip>
flash_cs_force(0);
20007c46: 2000 movs r0, #0
20007c48: f7ff ffc6 bl 20007bd8 <flash_cs_force>
It seems like a lot of code is written with the impression that there is a memory barrier, but there isn’t. The compiler doesn’t specify it and there’s no barrier in the assembly.
Maybe Klipper should rewrite barrier to do dmb?