RFC: Klipper MCU (eBPF) side interpreter/runtime code sideload

There is a blurry line around when and which code should be implemented in the MCU.
It seems to be cumbersome to implement support for every additional Accelerometer, Angle sensor, display, whatever.

There are roughly 2 issues:

  • Code bloat and limited MCU flash/ram memory
  • MCU code should be as simple as possible

So, I’ve thought about this idea for a while, and it seems that adding any other language to the existing C, Python, and Jinja2 would increase the complexity and the toolchain even further.

I initially had hopes for Forth, but it would be too much; it is another family of languages.
It is already hard enough to ask people to work with C and Python.

But there is Linux eBPF and the related infrastructure, including a Clang compiler backend,
where one can write “arbitrary” code in a subset of C.
So, the only limitation is that it is Linux only.
Well, the ISA is open and fairly simple.

So, I went ahead and implemented a portable version:

From the host side view, one can implement basic sensor support like so:

#define timer_read_time() (BPF_CALL_N(1))
#define debug(a) (BPF_CALL_N1(2, a))

#define i2c_dev_write(a, b, c)      (BPF_CALL_N3(3, a, b, c))
#define i2c_dev_read(a, b, c, d, e) (BPF_CALL_N5(4, a, b, c, d, e))
// Other shared definitions

__section("prog")
int32_t
task(struct i2cdev_s *i2c)
{
    uint8_t reg_len = 1;
    uint8_t reg[2] = {0x0, 0x1};
    uint8_t read_len = 6;
    uint8_t resp[read_len];
    int ret = i2c_dev_read(i2c, reg_len, reg, read_len, resp);
    // Do something, call bulk report & etc
    return ret;
}

Then Python should compile that:
clang -O0 -Wall --target=bpf -c bpf.c -o bpf.elf
It is possible to debug, disassemble, and run the program on the host; that only requires mocking the platform calls.
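Running the task natively on the host with a mocked platform call could look roughly like the sketch below. Everything in it is an assumption for illustration: the layout of `struct i2cdev_s`, the fake register file, and the extra `out` parameter are hypothetical, and the helper is a plain C function rather than a `BPF_CALL_N5` trampoline.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the MCU's I2C device handle. */
struct i2cdev_s {
    uint8_t addr;
    uint8_t regs[16];   /* fake register file served by the mock */
};

/* Host mock of the platform call: copies bytes out of the fake registers. */
static int
i2c_dev_read(struct i2cdev_s *i2c, uint8_t reg_len, uint8_t *reg,
             uint8_t read_len, uint8_t *resp)
{
    assert(i2c != NULL && reg != NULL && resp != NULL);
    if (reg_len < 1 || (size_t)reg[0] + read_len > sizeof(i2c->regs))
        return -1;
    memcpy(resp, &i2c->regs[reg[0]], read_len);
    return 0;
}

/* Same shape as the task above, compiled natively instead of for BPF.
 * The out buffer is only here so a test can inspect what was read. */
static int32_t
task(struct i2cdev_s *i2c, uint8_t *out, uint8_t out_len)
{
    uint8_t reg[2] = {0x0, 0x1};
    uint8_t resp[6];
    int ret = i2c_dev_read(i2c, 1, reg, sizeof(resp), resp);
    if (ret == 0 && out_len >= sizeof(resp))
        memcpy(out, resp, sizeof(resp));
    return ret;
}
```

With a mock like this, the same source file can be stepped through in a normal debugger before it is ever compiled for the BPF target.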

Then we can allocate memory on the MCU and submit the program.
There are 2 places where it can/should be able to work:

TX event hook:

Task hook:

  • To not compile in accelerometers or angle sensors, it is possible to implement them that way

Basically, that is it.

Limitations, caveats:
Original ISA is 64-bit, and registers are 64-bit; every pointer manipulation is 64-bit.
Right now, I lean toward simply ignoring that, because there is no alternative and no way to ask clang to emit only 32-bit ALU pointer arithmetic.

It is possible, though, to map ALU64 to ALU32 operations on a 32-bit target, and thereby reduce the register size and overall overhead.

But I guess, for initial implementation and interoperability with Linux MCU, it is better not to do that at this stage.
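A minimal sketch of what mapping ALU64 to ALU32 could mean at the opcode level, assuming the standard eBPF encoding where the low three bits of the opcode select the instruction class (0x04 for ALU, 0x07 for ALU64). The function name `fold_alu64` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BPF_CLASS(op) ((op) & 0x07)
#define BPF_ALU       0x04   /* 32-bit ALU class */
#define BPF_ALU64     0x07   /* 64-bit ALU class */

/* Rewrite an ALU64 opcode into the matching ALU32 opcode.
 * Opcodes of any other class pass through unchanged. */
static uint8_t
fold_alu64(uint8_t opcode)
{
    if (BPF_CLASS(opcode) == BPF_ALU64)
        return (uint8_t)((opcode & ~0x07) | BPF_ALU);
    return opcode;
}
```

For example, the `ALU64 MOV` opcode 0xbf would become the `ALU MOV` opcode 0xbc. As far as I can tell, the BPF ISA masks 32-bit shift counts to 5 bits, so a folded `<<= 32` / `>>= 32` sign-extension pair would degenerate into a no-op, but that is worth verifying against real compiler output.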

Thanks!
-Timofey


P.S. I think it is possible to convert the ISA to a mixed 32-bit/64-bit encoding for more efficient bytecode storage,
because it seems that the immediate value is often unused and thus zeroed.

opcode src_r dst_r offset immediate value       assembly
  0x63     1    10 0xfffc      0x00000000       STX   MEM *(u32 *) (r10 + -4) = r1
  0x63     2    10 0xfff8      0x00000000       STX   MEM *(u32 *) (r10 + -8) = r2
  0x61    10     1 0xfffc      0x00000000       LDX   MEM r1 = *(u32 *) (r10 + -4)
  0x61    10     3 0xfff8      0x00000000       LDX   MEM r3 = *(u32 *) (r10 + -8)
  0xbc     1     2 0x0000      0x00000000       ALU   MOV r2 = r1
  0x0c     3     2 0x0000      0x00000000       ALU   ADD r2 += r3
  0x04     0     2 0x0000      0x00000001       ALU   ADD r2 += 1
  0x04     0     1 0x0000      0xffffffff       ALU   ADD r1 += 4294967295
  0xbc     1     4 0x0000      0x00000000       ALU   MOV r4 = r1
  0x67     0     4 0x0000      0x00000020       ALU64 LSH r4 <<= ((1 << 64 - 1) & 32
  0xc7     0     4 0x0000      0x00000020       ALU64 ARSH r4 >>= ((1 << 64 - 1) & 32
  0xbf    10     1 0x0000      0x00000000       ALU64 MOV r1 = r10
...
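To make the listing above easier to follow, here is a sketch that decodes the standard 8-byte little-endian eBPF instruction into the same columns, plus a rough size estimate for the mixed encoding from the P.S. The `packed_size` helper is purely hypothetical: it assumes an instruction with a zero immediate could be stored in 4 bytes, and it ignores both the flag bit needed to tell short from long forms and the two-slot 64-bit immediate load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded eBPF instruction (8 bytes on the wire, little-endian). */
struct bpf_insn_fields {
    uint8_t opcode;
    uint8_t dst_reg;   /* low nibble of byte 1 */
    uint8_t src_reg;   /* high nibble of byte 1 */
    int16_t off;
    int32_t imm;
};

static struct bpf_insn_fields
bpf_decode(const uint8_t *p)
{
    assert(p != NULL);
    struct bpf_insn_fields f;
    f.opcode = p[0];
    f.dst_reg = p[1] & 0x0f;
    f.src_reg = p[1] >> 4;
    f.off = (int16_t)(uint16_t)(p[2] | (p[3] << 8));
    f.imm = (int32_t)((uint32_t)p[4] | ((uint32_t)p[5] << 8)
                      | ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24));
    return f;
}

/* Hypothetical estimate: 4 bytes when imm == 0, 8 bytes otherwise. */
static size_t
packed_size(const uint8_t *code, size_t n_insns)
{
    size_t total = 0;
    for (size_t i = 0; i < n_insns; i++)
        total += bpf_decode(code + i * 8).imm == 0 ? 4 : 8;
    return total;
}
```

Running `packed_size` over a compiled program would give a quick upper bound on how much the mixed encoding could save before committing to it.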

Interesting. Thanks.

Yeah, I had similar thoughts. When I last looked at this I was looking at Forth or a similar “stack interpreter” in the mcu. I was also very concerned about the complexity of having to write and maintain code in some obscure language (such as Forth).

I also considered having some host mechanism to translate a python-like or c-like language into a byte-code for the mcu. It’s probably a lot of work though.

That is interesting. I had not thought of using something like eBPF.

For what it is worth, there are likely to be some challenges with using eBPF and clang though - the mcu memory required is likely to be a bit bloated (due to 64-bit instructions), the interpreter code size and performance are likely to suffer (due to 64-bit registers), and the host requirement to install clang may stress disk requirements on some low-end host computers.

Cheers,
-Kevin


TLDR, Agreed.

I did end up fighting too much with the eBPF compiler.
btw, there is a GCC backend, but I didn’t test it.

I realized that one can apply the same restrictions to anything.
So,
after a short search for a suitable ISA,
I’ve chosen the RISC-V RV32e:

  • 32bit ISA
  • Integer only
  • Embedded, thus 32 → 16 registers.
  • It is possible to implement a “compressed” mixed 16/32bit ISA, but I’m not sure it’s worth it (tradeoff interpreter complexity for the instruction size)
  • Should be supported by any: gcc-riscv64-linux-gnu, gcc-riscv64-unknown-elf

Another PoC.

So, to sum things up, it seems that the basic eBPF idea is somewhat easy to implement.
Any ISA can be converted that way, and thereby any toolchain can be forced to output “bare” code.

Caveats:

  • Based on my experience with eBPF above, the tricky part can be passing pointers (or structures containing pointers) around, and if the pointer width is mixed - RIP.
    It seems that in the kernel, this is solved with plenty of helpers that do all the heavy lifting.
    For the MCU, that can be cumbersome, so I’m trying to avoid them.

But, I guess, if one can pass -m32 to the Linux MCU gcc, we will have 32-bit-wide pointers (or less, Arduino? :D) everywhere.
That way, it can work with raw pointers being passed around.

So, in my head, I think being able to pass a stack memory pointer to the “i2c_write”, so it can write back raw data to the preallocated space, is crucial.

Regards,
-Timofey


This was educational, now I “know” 2 assembly languages :smiley:.


I’ve implemented most of the ISA (except for some HW-related ops: fence, system status, mode, etc.).
I’ve tried to compile it in the firmware (STM32G0B1):
43668 → 44912, 1244 bytes (With additional MCU commands, there will be more of course).

For comparison (I’ve just commented out lines in the Makefile):

  • HX71x 1172 bytes
  • icm20948 1080 bytes
  • BMI160 1260 bytes
  • LIS2DW/LIS3DH 1380 bytes
  • ADXL345 944 bytes
  • Thermocouple 1004 bytes
  • Angle sensors 1928 bytes

But I’m suspicious that part of that is the dict data for additional commands.

I’ve made a prototype with some “basic” integration on top of RISC-V RV32e VM: GitHub - nefelim4ag/klipper at riscv-evm-20260504 · GitHub

I’ve tested it on a Linux MCU and the RP2040.

Host side code is defined inside klippy/extras/evm, MCU side is at src/riscv_evm.c
lib code is outside of the lib folder, because I was unable to convince make to compile it =_=

It seems to me that implementing “syscalls” is cumbersome, because each one requires a manual definition on both the MCU and the host side, and the MCU side also requires some careful type conversions.
I’m not sure if it’s worth it to dynamically define the syscall list inside the dictionary and export it to the host from each MCU.

The next problems, I would guess, would be:

  • Where/when should the byte-code be compiled?
  • If one tries to integrate it, how should it be integrated with other code? I guess it would look similar to trigger analog with SOS filters.
  • How to implement proper data packing and unpacking on the host side, if one tries to build the IO on top of the VM, for example.

Those are open questions, for me at least (I mean, there is no answer right now).

Syscalls definition
void
platform_ecall(uint32_t id, uint32_t *a0, uint32_t a1,
               uint32_t a2, uint32_t a3, uint32_t a4)
{
    uint32_t ret = *a0;
    switch (id) {
        case 0:
            printf("%u\n", (unsigned) ret);
            break;
        case 1:
            ret = timer_read_time();
            break;
        case 2: {
            // Braces give the declarations below their own scope
            uint8_t oid = *a0;
            uint8_t rlen = a1;
            uint8_t *rdata = (uint8_t *) a2;
            sendf("evm_response oid=%c response=%*s",
                    oid, rlen, rdata);
            break;
        }
    }
    *a0 = ret;
}

Where the uapi.h can look like so:

#define _timer_read_time() (RISC_V_EVM_CALL_N(1))

static inline uint32_t
timer_read_time()
{
    return _timer_read_time();
}

#define _sendf(oid, rlen, rdata) (RISC_V_EVM_CALL_N3(2, oid, rlen, rdata))

static inline void
sendf(uint8_t oid, uint8_t rlen, uint8_t *rdata)
{
    _sendf(oid, rlen, rdata);
}
How to play with it locally

Goofy config:

[mcu]
serial: /tmp/klipper_mcu

[printer]
kinematics: none
max_velocity: 1
max_accel: 1

[evm test]
mcu: mcu

So, it is possible to play with it:

  1. Compile & Run Linux MCU: make; ./out/klipper.elf -I /tmp/klipper_mcu
  2. Run klippy: python3 klippy/klippy.py evm_printer.cfg -I /tmp/klippy.serial

Output will look like so:

WARNING:root:mcu 'mcu': got {'oid': 0, 'response': b'1234\x004\xa8\xb85', '#name': 'evm_response', '#sent_time': 27290.791607075, '#receive_time': 27290.791752928}

So, I’ve basically sent the string and received it back with some additional data:

#include <stdint.h>
#include "uapi.h"

__section("prog")
void task(uint32_t *args, uint8_t *data) {
....
    new_data[data_len] = 0;
    uint32_t time = timer_read_time();
    for (uint8_t i = 0; i < sizeof(time); i++) {
        new_data[data_len + 1 + i] = (time >> (i * 8)) & 0xff;
    }
    sendf(oid, data_len + 1 + 4, new_data);
}
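The loop above packs the timestamp least-significant byte first, so whoever receives the response has to reverse it. A small sketch of both directions, with hypothetical helper names `put_le32` and `get_le32`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a 32-bit value least-significant byte first, as the loop above does. */
static void
put_le32(uint8_t *dst, uint32_t v)
{
    assert(dst != NULL);
    for (size_t i = 0; i < sizeof(v); i++)
        dst[i] = (v >> (i * 8)) & 0xff;
}

/* Inverse operation for the receiving side. */
static uint32_t
get_le32(const uint8_t *src)
{
    assert(src != NULL);
    uint32_t v = 0;
    for (size_t i = 0; i < sizeof(v); i++)
        v |= (uint32_t)src[i] << (i * 8);
    return v;
}
```

Applied to the logged response, the four bytes after the NUL terminator are 0x34 0xa8 0xb8 0x35, which decode to the timer value 0x35b8a834.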

Caveats:

  • VLA for byte code is disabled because, for whatever reason, it can trigger stack smashing (it seems to read past the VM stack pointer upon return, or there is a bug and I’m blind).
  • I think -Wstack-usage=256 or something should be set to prevent any code that can overflow the stack.

Thanks,
-Timofey

That’s an interesting idea - use a small existing instruction set to implement a simple virtual machine. I’d guess it would be an improvement over eBPF, but still have similar host package installation overhead (eg, gcc) and still have some mcu size overhead (32bit instructions). I agree that being able to use an existing C compiler is a benefit though.

I’ve only looked at this briefly, so take everything I say here with “a grain of salt”.

Being able to dynamically define “syscalls” would seem useful to me (eg, DECL_EVM(sensor_bulk_report, "void sensor_bulk_report(struct sensor_bulk *sb, uint8_t oid)")). Admittedly, implementing that could be tricky though.

I’d guess the host python code could gather all the vm code during configuration, compile it, and upload it as part of the mcu config. That is, just as the host figures out how many oids there are, it could also figure out what the full set of vm code is. Admittedly, calling gcc on each mcu config could be slow on slow host machines.

I’d guess the VM should be able to call whatever internal code (ie, other vm functions) it wants, but only be able to invoke mcu C functions that have been explicitly declared (eg, DECL_EVM()). I’d also look at having the VM enforce that all memory reads/writes are to internal vm addresses. So, only way for the VM to communicate with the outside world is through declared C functions (eg, DECL_EVM()).
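Enforcing that every VM load and store stays inside VM memory could be as simple as the bounds check sketched below, assuming the VM memory is one contiguous window and VM addresses are offsets into it. Both `struct vm_mem` and `vm_translate` are hypothetical names, not part of the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VM memory descriptor: one contiguous RAM window. */
struct vm_mem {
    uint8_t *base;    /* real address backing the VM memory */
    uint32_t size;    /* window size in bytes */
};

/* Translate a VM address range to a real pointer, or return NULL if any
 * byte of [addr, addr + len) falls outside the window. The comparison is
 * written so that addr + len cannot overflow. */
static uint8_t *
vm_translate(const struct vm_mem *m, uint32_t addr, uint32_t len)
{
    assert(m != NULL && m->base != NULL);
    if (addr > m->size || len > m->size - addr)
        return NULL;
    return m->base + addr;
}
```

The interpreter would call this on every load and store, and each declared C function would call it on every pointer argument before dereferencing it.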

Again - just some random thoughts that should not be taken too seriously.
-Kevin

So, I’ve made some progress: GitHub - nefelim4ag/klipper at riscv-evm-20260509 · GitHub

I’ve implemented a syscall generator, a little bit ugly, but it works.
One can define an arbitrary helper:

uint32_t get_next_wake_time(uint8_t oid) {
    struct evm *vm = evm_oid_lookup(oid);
    return vm->timer.waketime;
}
DECL_FW_CALL1(uint32_t, get_next_wake_time, uint8_t oid);

It will generate the necessary wrappers in compile_time_request.c and define them inside the dict:

    "fw_calls": {
      "int|i2c_dev_read|struct i2cdev_s *i2c|uint8_t reg_len|uint8_t *reg|uint8_t read_len|uint8_t *read": 4,
      "int|i2c_dev_write|struct i2cdev_s *i2c|uint8_t write_len|uint8_t *data": 5,
...
      "void|sched_wake_evm": 8,
      "void|spidev_transfer|struct spidev_s *spi|uint8_t receive_data|uint8_t data_len|uint8_t *data": 10
    },

This allows the host to generate UAPI headers for this specific MCU.
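As an illustration of that header generation step, the sketch below turns one `fw_calls` key and its id into a wrapper macro in the style of the uapi.h shown earlier. It is written in C only to stay in one language; the function name `fw_call_to_macro`, the `a0, a1, ...` parameter naming, and the exact output format are assumptions, not what the branch actually emits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Turn one "ret|name|arg|arg..." dictionary key plus its id into a wrapper
 * macro line. Returns the number of arguments, or -1 on malformed input
 * or if a buffer is too small. */
static int
fw_call_to_macro(const char *sig, int id, char *out, size_t out_len)
{
    assert(sig != NULL && out != NULL);
    const char *name = strchr(sig, '|');
    if (name == NULL)
        return -1;
    name++;
    size_t name_len = strcspn(name, "|");
    if (name_len == 0)
        return -1;

    /* Every '|' after the name introduces one argument. */
    int nargs = 0;
    for (const char *p = name + name_len; *p; p++)
        nargs += (*p == '|');

    /* Build the "a0, a1, ..." parameter list. */
    char params[64] = "";
    size_t used = 0;
    for (int i = 0; i < nargs; i++) {
        int n = snprintf(params + used, sizeof(params) - used,
                         "%sa%d", i ? ", " : "", i);
        if (n < 0 || (size_t)n >= sizeof(params) - used)
            return -1;
        used += (size_t)n;
    }

    int n;
    if (nargs == 0)
        n = snprintf(out, out_len, "#define _%.*s() (RISC_V_EVM_CALL_N(%d))",
                     (int)name_len, name, id);
    else
        n = snprintf(out, out_len,
                     "#define _%.*s(%s) (RISC_V_EVM_CALL_N%d(%d, %s))",
                     (int)name_len, name, params, nargs, id, params);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    return nargs;
}
```

A real generator would also emit the typed `static inline` wrapper from the return type and argument declarations, which this sketch skips.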

I’ve implemented support for function calls, and along with that added section support, so one can define specific (predefined) sections, and the code will load the required section depending on the context.

Not sure about one global bytecode binary, it seems tricky to me to define, link, and then extract the necessary ENTRY addresses for each entry point.

...
static uint8_t max_invalid, invalid_count;
struct spidev_s *dev;

__section(".command")
void command(uint8_t oid, uint8_t data_len, uint8_t *data)
{
    uint32_t *ptr = (uint32_t *)data;
    min_value = ptr[0];
    data = &data[4];
...
__section(".task")
void task(uint8_t oid, struct evm *vm)
{
    update_evm_flags(oid, 0, EVM_PENDING);
    if (CHIP_TYPE == 0) {
        uint8_t msg[4] = { MAX31865_RTDMSB_REG, 0x00, 0x00, 0x00 };
        spidev_transfer(dev, 1, 3, msg);

I’m not sure it is possible, or worth it, to really enforce the isolation (one can still pass wrong data to an external function, and it will crash the machine).
Otherwise, I think we can partially enforce it in the following way.
For example, in the current implementation, working with SPI basically looks like so:

struct spidev_s; // anon
struct spidev_s *dev = spidev_oid_lookup(spi_oid);
uint8_t msg[4] = { MAX31865_RTDMSB_REG, 0x00, 0x00, 0x00 };
spidev_transfer(dev, 1, 3, msg);
...

And in my head, most helpers can use a similar pattern, where we do not allow direct access to the struct fields, but we can request some of them:

// VM side, MCU side defined above
uint32_t next_wake_time = get_next_wake_time(oid);

So, I’ve implemented a basic skeleton, and I’ve ported the MAX31865 support on top of it (in the branch).
It seems to work on RP2040 (SW SPI, no real MAX device) and STM32G0B1 (real one).
There is no shutdown call, so it should be possible to “test” it on any chip, I guess.

In the current state, the MCU VM on STM32G0B1 takes about 45812 - 43676 = 2136 bytes (dict overhead is included),
and I hope it is enough to also cover the accelerometers.
The sideloaded bytecode takes 300 bytes for MAX31865, +32 bytes per VM instance.

BTW, it is handy to test new code on the fly by a simple restart of the klipper :D.

I tried compiling the code for an RV32ec target to enable compressed instructions, which shaves 300 - 232 = 68 bytes, or about 1 - 232/300 ≈ 0.23, i.e. ~23%.
So, I’m still thinking whether implementing the compressed ops is worth the trouble.

I guess, any further changes are about a tradeoff to make:

  • Adding custom ecalls - inflate the dict, but can allow for reducing the byte-code size
  • Additional ISA work (translation to a more SW-friendly state, or support of compressed ops), would either complicate the toolchain further (preprocessor), or inflate the interpreter (addition of compressed ops for example)
  • Converting any existing code to amortize the cost, where code can be generated for a specific target with specific oids, for example, LIS2DW with I2C, where the I2C device ID is a compilation-time constant.

I think the painful point is the bytecode RAM usage, where I am less concerned about the flash usage.

Regards,
-Timofey


I’ve implemented RV32ec (compressed ops) support: GitHub - nefelim4ag/klipper at riscv-evm-20260511 · GitHub
It was a big refactor, with some code size optimizations.

The interpreter code grew by an additional 46904 - 45812 = 1092 bytes, so about 3228 bytes in total (with dict).

  6712: 0800ae5d  2248 FUNC    GLOBAL HIDDEN     1 evm_interpreter.[...]

The spi_temperature got squeezed 280 → 198 bytes.

I’ve fixed some goofy bugs/caveats:

  • Unaligned bytecode read 32u → 2x16u
  • BSS got dropped from the output binary, so the bytecode has tried to access past the allocated memory.
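The first fix follows from how mixed 16/32-bit code is laid out: once compressed ops are in the stream, a 32-bit instruction may start on any 2-byte boundary. A sketch of a fetch routine built from 16-bit parcels, relying on the RISC-V rule that a parcel whose two lowest bits are not 0b11 is a compressed instruction. The names `fetch16` and `fetch_insn` are hypothetical, and this version reads bytewise so it is also endian-independent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one little-endian 16-bit parcel from the bytecode. */
static uint16_t
fetch16(const uint8_t *code, uint32_t pc)
{
    return (uint16_t)(code[pc] | (code[pc + 1] << 8));
}

/* Fetch the instruction at pc. Returns its length in bytes (2 or 4),
 * or 0 if it would run past the end of the bytecode. */
static int
fetch_insn(const uint8_t *code, size_t code_len, uint32_t pc, uint32_t *insn)
{
    assert(code != NULL && insn != NULL);
    if (pc > code_len || code_len - pc < 2)
        return 0;
    uint16_t lo = fetch16(code, pc);
    if ((lo & 0x3) != 0x3) {       /* low bits != 0b11: compressed op */
        *insn = lo;
        return 2;
    }
    if (code_len - pc < 4)
        return 0;
    *insn = lo | ((uint32_t)fetch16(code, pc + 2) << 16);
    return 4;
}
```

The returned length doubles as the program counter increment, so the interpreter loop does not need a separate size lookup.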

Just for comparison, the same byte code compiled with arm-none-eabi-gcc -O2 -mcpu=cortex-m0plus takes 168 bytes.


=_=
I’m not sure it’s worth it to reimplement it a third time with a proprietary ISA (m0).
I guess in such a case, it is simpler to try to reproduce “kernel modules” on top of that interface, and simply compile raw platform-specific code.
The only hard part, it seems to me, is guaranteeing ABI compatibility in that case.

Like, directly compile the PIE code, load it into memory, and call a function directly, as I already pass entry points/offsets from the host.
