Squeezing Bytes: How I Optimised an eBPF Driver in Falco

Believe in life after death? How? Look at your child—they aren't a ghost from years ago. It's just you forking your DNA into a new process. Am I wrong? You came here for low-level internals, but don't try to apply system logic to your messy human life. 😁

1. How I Found This Opportunity

I wanted to make my first open source contribution, so I forked a few eBPF projects, one of them being falcosecurity/libs. After compiling and running Falco, I dumped the sched_process_exec sensor's xlated bytecode and its JIT-compiled assembly into two separate text files for analysis.

Why sched_process_exec?

If you have read my article on sched_process_exit, you already know how critical this sensor is. A half-cleaned map can leave you with no visibility at all.

1.1 What I Noticed

While analysing the xlated bytecode, I came across this pattern:

; return settings->boot_time;
  55: (79) r7 = *(u64 *)(r0 +0)
; hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
  56: (85) call bpf_ktime_get_boot_ns#306080
; hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
  57: (0f) r0 += r7
; hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
  58: (73) *(u8 *)(r8 +0) = r0
  59: (bf) r1 = r0
  60: (77) r1 >>= 56
  61: (73) *(u8 *)(r8 +7) = r1
  62: (bf) r1 = r0
  63: (77) r1 >>= 48
  64: (73) *(u8 *)(r8 +6) = r1
  65: (bf) r1 = r0
  66: (77) r1 >>= 40
  67: (73) *(u8 *)(r8 +5) = r1
  68: (bf) r1 = r0
  69: (77) r1 >>= 32
  70: (73) *(u8 *)(r8 +4) = r1
  71: (bf) r1 = r0
  72: (77) r1 >>= 24
  73: (73) *(u8 *)(r8 +3) = r1
  74: (bf) r1 = r0
  75: (77) r1 >>= 16
  76: (73) *(u8 *)(r8 +2) = r1
  77: (77) r0 >>= 8
  78: (73) *(u8 *)(r8 +1) = r0
; auxmap->event_type = event_type;
  79: (bf) r7 = r8
  80: (07) r7 += 131082
; hdr->tid = bpf_get_current_pid_tgid() & 0xffffffff;
  81: (85) call bpf_get_current_pid_tgid#305536
  82: (b4) w1 = 293
; auxmap->event_type = event_type;
  83: (6b) *(u16 *)(r7 +0) = r1
  84: (b4) w1 = 0
; hdr->nparams = nparams;
  85: (73) *(u8 *)(r8 +25) = r1
  86: (73) *(u8 *)(r8 +24) = r1
  87: (73) *(u8 *)(r8 +23) = r1
  88: (b4) w1 = 1
; hdr->type = event_type;
  89: (73) *(u8 *)(r8 +21) = r1
  90: (b4) w1 = 37
  91: (73) *(u8 *)(r8 +20) = r1
; hdr->tid = bpf_get_current_pid_tgid() & 0xffffffff;
  92: (73) *(u8 *)(r8 +15) = r6
  93: (73) *(u8 *)(r8 +14) = r6
  94: (73) *(u8 *)(r8 +13) = r6
  95: (73) *(u8 *)(r8 +12) = r6
  96: (73) *(u8 *)(r8 +8) = r0
  97: (bf) r1 = r0
  98: (77) r1 >>= 24
  99: (73) *(u8 *)(r8 +11) = r1
 100: (bf) r1 = r0
 101: (77) r1 >>= 16
 102: (73) *(u8 *)(r8 +10) = r1
 103: (77) r0 >>= 8
 104: (73) *(u8 *)(r8 +9) = r0
; hdr->nparams = nparams;
 105: (73) *(u8 *)(r8 +22) = r9

The first thing I noticed was that the compiler was repeatedly copying a register value and right-shifting it. To understand why, I looked up the relevant BPF opcodes:

  • 79 = BPF_LDX | BPF_DW | BPF_MEM: 64-bit memory load

  • 0f = BPF_ALU64 | BPF_X | BPF_ADD: 64-bit register-to-register add

  • 73 = BPF_STX | BPF_B | BPF_MEM: single-byte memory store

  • bf = BPF_ALU64 | BPF_X | BPF_MOV: 64-bit register-to-register move

  • 77 = BPF_ALU64 | BPF_K | BPF_RSH: 64-bit logical right shift by an immediate
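To make the opcode breakdown concrete, here is a minimal sketch of how one opcode byte splits into its fields. The DEMO_ names are local copies made up for this illustration, not the kernel's own macros, and only the values needed for the five opcodes above are included.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the eBPF opcode field masks and values, for illustration
 * only. The DEMO_ prefix marks them as hypothetical names. */
enum {
	DEMO_CLASS_MASK = 0x07,
	DEMO_SIZE_MASK = 0x18, /* meaningful for load/store classes */
	DEMO_MODE_MASK = 0xe0, /* meaningful for load/store classes */
	DEMO_SRC_MASK = 0x08,  /* meaningful for ALU classes */
	DEMO_OP_MASK = 0xf0,   /* meaningful for ALU classes */

	DEMO_LDX = 0x01, DEMO_STX = 0x03, DEMO_ALU64 = 0x07, /* classes */
	DEMO_B = 0x10, DEMO_DW = 0x18,                       /* access sizes */
	DEMO_MEM = 0x60,                                     /* addressing mode */
	DEMO_K = 0x00, DEMO_X = 0x08,                        /* immediate vs register source */
	DEMO_ADD = 0x00, DEMO_RSH = 0x70, DEMO_MOV = 0xb0,   /* ALU operations */
};

static inline uint8_t demo_class(uint8_t opcode) { return opcode & DEMO_CLASS_MASK; }
static inline uint8_t demo_size(uint8_t opcode) { return opcode & DEMO_SIZE_MASK; }
static inline uint8_t demo_mode(uint8_t opcode) { return opcode & DEMO_MODE_MASK; }
static inline uint8_t demo_src(uint8_t opcode) { return opcode & DEMO_SRC_MASK; }
static inline uint8_t demo_op(uint8_t opcode) { return opcode & DEMO_OP_MASK; }

/* Returns 1 when the opcode is a 1-byte register-to-memory store (0x73). */
static inline int demo_is_byte_store(uint8_t opcode) {
	return demo_class(opcode) == DEMO_STX && demo_size(opcode) == DEMO_B &&
	       demo_mode(opcode) == DEMO_MEM;
}

/* Usage example: classify three of the opcodes seen in the listing. */
static inline void demo_opcode_example(void) {
	assert(demo_class(0x79) == DEMO_LDX && demo_size(0x79) == DEMO_DW); /* 64-bit load */
	assert(demo_is_byte_store(0x73));                                   /* 1-byte store */
	assert(demo_class(0x77) == DEMO_ALU64 && demo_op(0x77) == DEMO_RSH &&
	       demo_src(0x77) == DEMO_K);                                   /* shift by immediate */
}
```

A helper like demo_is_byte_store is handy when skimming a long dump: every instruction it matches inside a run of shifts is a candidate for the pattern described here.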

The pattern becomes clear when you trace through it:

  • r0 holds the 64-bit timestamp value

  • r8 + 0 is the destination memory address

  • The compiler copies r0 into r1, right-shifts to extract each byte, then stores that byte individually

  • This repeats 8 times — once per byte of the 64-bit value

This is byte-by-byte assignment. A single 64-bit value written as 8 separate 1-byte stores. That immediately raised a question — why?
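In plain C, the sequence above amounts to something like the following sketch. The function names are hypothetical and this is user-space code, not BPF; it only illustrates that eight shifted byte stores and one wide store leave the same bytes in memory on a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the xlated sequence: one 1-byte store per byte of the value,
 * least significant byte at offset 0. */
static inline void demo_store_u64_bytewise(uint8_t *dst, uint64_t value) {
	for (int i = 0; i < 8; i++) {
		dst[i] = (uint8_t)(value >> (8 * i));
	}
}

/* What a single 64-bit store would do. memcpy keeps this well-defined in
 * portable C even if dst is not 8-byte aligned. */
static inline void demo_store_u64_wide(uint8_t *dst, uint64_t value) {
	memcpy(dst, &value, sizeof(value));
}

/* Usage example; the final check assumes a little-endian host. */
static inline void demo_store_example(void) {
	uint8_t bytewise[8], wide[8];
	demo_store_u64_bytewise(bytewise, 0x1122334455667788ULL);
	demo_store_u64_wide(wide, 0x1122334455667788ULL);
	assert(bytewise[0] == 0x88 && bytewise[7] == 0x11);
	assert(memcmp(bytewise, wide, sizeof(wide)) == 0);
}
```

The result is identical either way; the byte-wise version just spends many more instructions getting there.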


2. Finding the Root Cause

My next step was to find where this code originated. I opened sched_process_exec.bpf.c but found no reference to maps__get_boot_time() + bpf_ktime_get_boot_ns(). So I grepped the codebase for that expression and traced the call chain:

  • sched_process_exec.bpf.c calls auxmap__preload_event_header(auxmap, PPME_SYSCALL_EXECVE_19_X)

  • That function is defined in driver/modern_bpf/helpers/store/auxmap_store_params.h

static __always_inline void auxmap__preload_event_header(struct auxiliary_map *auxmap,
                                                         uint16_t event_type) {
	struct ppm_evt_hdr *hdr = (struct ppm_evt_hdr *)auxmap->data;
	uint8_t nparams = maps__get_event_num_params(event_type);
	hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
	hdr->tid = bpf_get_current_pid_tgid() & 0xffffffff;
	hdr->type = event_type;
	hdr->nparams = nparams;
	auxmap->payload_pos = sizeof(struct ppm_evt_hdr) + nparams * sizeof(uint16_t);
	auxmap->lengths_pos = sizeof(struct ppm_evt_hdr);
	auxmap->event_type = event_type;
}

2.1 Identifying the Gap

The answer lies in these two struct definitions:

struct auxiliary_map {
	uint8_t data[AUXILIARY_MAP_SIZE]; /* raw space to save our variable-size event. */
	uint64_t payload_pos;             /* position of the first empty byte in the `data` buf. */
	uint8_t lengths_pos; /* position the first empty slot into the lengths array of the event. */
	uint16_t event_type; /* event type we want to send to userspace */
};

#if defined _MSC_VER
#pragma pack(push)
#pragma pack(1)
#else
#pragma pack(push, 1)
#endif
struct ppm_evt_hdr {
#ifdef PPM_ENABLE_SENTINEL
	uint32_t sentinel_begin;
#endif
	uint64_t ts;      /* timestamp, in nanoseconds from epoch */
	uint64_t tid;     /* the tid of the thread that generated this event */
	uint32_t len;     /* the event len, including the header */
	uint16_t type;    /* the event type */
	uint32_t nparams; /* the number of parameters of the event */
};
#pragma pack(pop)

Now look at this line in the function:

struct ppm_evt_hdr *hdr = (struct ppm_evt_hdr *)auxmap->data;

Strip away the high-level abstraction and this is what it actually does:

  • auxmap->data is a raw byte array

  • hdr is a pointer to the start of that byte array

  • We are casting an unaligned byte array pointer to a packed struct pointer

And there is the problem. struct ppm_evt_hdr is declared with #pragma pack(1), which removes all padding and drops the struct's alignment requirement to 1. When clang sees a store through a pointer to a packed struct, it has to assume the field may sit at any address. The BPF target does not let the compiler assume that unaligned accesses are safe, so instead of a single 64-bit store it decomposes every field write into individual byte stores, which are correct regardless of alignment.
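The layout claims can be checked at compile time. The sketch below uses hypothetical demo_ copies of the two structs (without the sentinel field) and assumes a 64-bit target. The 128 KiB value for the map size is an assumption on my part, inferred from the `r7 += 131082` offset in the listing rather than taken from the Falco headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128 KiB, inferred from the 131082 offset in the xlated dump. */
#define DEMO_AUXILIARY_MAP_SIZE (128 * 1024)

/* Hypothetical mirror of struct auxiliary_map. */
struct demo_auxiliary_map {
	uint8_t data[DEMO_AUXILIARY_MAP_SIZE];
	uint64_t payload_pos;
	uint8_t lengths_pos;
	uint16_t event_type;
};

/* Hypothetical mirror of struct ppm_evt_hdr, sentinel omitted. */
#pragma pack(push, 1)
struct demo_evt_hdr {
	uint64_t ts;
	uint64_t tid;
	uint32_t len;
	uint16_t type;
	uint32_t nparams;
};
#pragma pack(pop)

/* Packing removes padding and lowers the alignment requirement to 1. */
static_assert(sizeof(struct demo_evt_hdr) == 26, "8 + 8 + 4 + 2 + 4, no padding");
static_assert(_Alignof(struct demo_evt_hdr) == 1, "packed struct has alignment 1");

/* Field offsets match the store offsets in the xlated listing. */
static_assert(offsetof(struct demo_evt_hdr, tid) == 8, "tid at +8");
static_assert(offsetof(struct demo_evt_hdr, len) == 16, "len at +16");
static_assert(offsetof(struct demo_evt_hdr, type) == 20, "type at +20");
static_assert(offsetof(struct demo_evt_hdr, nparams) == 22, "nparams at +22, not 4-byte aligned");

/* The outer map struct keeps its natural 8-byte alignment, with data at offset 0. */
static_assert(_Alignof(struct demo_auxiliary_map) == 8, "aligned by its uint64_t member");
static_assert(offsetof(struct demo_auxiliary_map, data) == 0, "data starts the struct");
static_assert(offsetof(struct demo_auxiliary_map, event_type) == 131082, "matches r7 += 131082");
```

Note the asymmetry: the buffer itself starts at an 8-byte-aligned address, but the packed header type viewed through the cast advertises an alignment of only 1.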


3. The Fix

After trying several approaches, the method that produced the best result — 32 fewer instructions — was this:

static __always_inline void auxmap__preload_event_header(struct auxiliary_map *auxmap,
                                                         uint16_t event_type) {
	uint8_t nparams = maps__get_event_num_params(event_type);

	/*
	 * Avoid byte-by-byte stores from writing packed header fields directly.
	 * `struct ppm_evt_hdr` is packed, so field assignments are lowered to
	 * byte stores. By building the header as a local value and copying it with
	 * __builtin_memcpy, clang can lower the copy into wider stores at naturally
	 * aligned destination offsets (ts/tid/len/type).
	 */
	struct ppm_evt_hdr hdr = {0};
	hdr.ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
	hdr.tid = bpf_get_current_pid_tgid() & 0xffffffff;
	hdr.type = event_type;
	hdr.nparams = nparams;
	__builtin_memcpy(&auxmap->data[0], &hdr, sizeof(struct ppm_evt_hdr));

	auxmap->payload_pos = sizeof(struct ppm_evt_hdr) + nparams * sizeof(uint16_t);
	auxmap->lengths_pos = sizeof(struct ppm_evt_hdr);
	auxmap->event_type = event_type;
}
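A quick way to convince yourself that this rewrite leaves the same bytes in the buffer is a small user-space comparison. The sketch below is not driver code: struct demo_evt_hdr is a hypothetical mirror of ppm_evt_hdr without the sentinel, and both function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of struct ppm_evt_hdr, sentinel omitted. */
#pragma pack(push, 1)
struct demo_evt_hdr {
	uint64_t ts;
	uint64_t tid;
	uint32_t len;
	uint16_t type;
	uint32_t nparams;
};
#pragma pack(pop)

/* Old shape: assign fields through a packed-struct pointer into the buffer.
 * Like the original function, this leaves `len` untouched. */
static inline void demo_fill_via_pointer(uint8_t *buf, uint64_t ts, uint64_t tid,
                                         uint16_t type, uint32_t nparams) {
	struct demo_evt_hdr *hdr = (struct demo_evt_hdr *)buf;
	hdr->ts = ts;
	hdr->tid = tid;
	hdr->type = type;
	hdr->nparams = nparams;
}

/* New shape: build a local value, then copy it in one go.
 * The `{0}` initialiser means `len` is written as zero here. */
static inline void demo_fill_via_local(uint8_t *buf, uint64_t ts, uint64_t tid,
                                       uint16_t type, uint32_t nparams) {
	struct demo_evt_hdr hdr = {0};
	hdr.ts = ts;
	hdr.tid = tid;
	hdr.type = type;
	hdr.nparams = nparams;
	memcpy(buf, &hdr, sizeof(hdr));
}

/* Usage example: with zeroed buffers, both shapes produce identical bytes. */
static inline void demo_fill_example(void) {
	uint8_t a[sizeof(struct demo_evt_hdr)] = {0};
	uint8_t b[sizeof(struct demo_evt_hdr)] = {0};
	demo_fill_via_pointer(a, 0x1122334455667788ULL, 1234, 293, 23);
	demo_fill_via_local(b, 0x1122334455667788ULL, 1234, 293, 23);
	assert(memcmp(a, b, sizeof(a)) == 0);
}
```

The one observable difference is the len field: the local-value version writes it as zero, while the pointer version leaves whatever was already in the buffer. That matches the extra 32-bit zero store at offset 16 in the new bytecode.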

4. Verification

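After applying the fix, the xlated bytecode looks like this (the source annotations show the tid assignment written as a (uint32_t) cast, which is equivalent to the & 0xffffffff mask used above):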

; return settings->boot_time;
  54: (79) r7 = *(u64 *)(r0 +0)
; hdr.ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
  55: (85) call bpf_ktime_get_boot_ns#306080
  56: (bf) r9 = r0
; hdr.ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
  57: (0f) r9 += r7
; auxmap->event_type = event_type;
  58: (bf) r7 = r8
  59: (07) r7 += 131082
; hdr.tid = (uint32_t)bpf_get_current_pid_tgid();
  60: (85) call bpf_get_current_pid_tgid#305536
  61: (b4) w1 = 293
; auxmap->event_type = event_type;
  62: (6b) *(u16 *)(r7 +0) = r1
; __builtin_memcpy(&auxmap->data[0], &hdr, sizeof(struct ppm_evt_hdr));
  63: (6b) *(u16 *)(r8 +20) = r1
  64: (b4) w1 = 0
  65: (6b) *(u16 *)(r8 +24) = r1
  66: (63) *(u32 *)(r8 +16) = r1
  67: (7b) *(u64 *)(r8 +0) = r9
; hdr.tid = (uint32_t)bpf_get_current_pid_tgid();
  68: (67) r0 <<= 32
  69: (77) r0 >>= 32
; __builtin_memcpy(&auxmap->data[0], &hdr, sizeof(struct ppm_evt_hdr));
  70: (7b) *(u64 *)(r8 +8) = r0
  71: (6b) *(u16 *)(r8 +22) = r6

The program is 32 instructions shorter, and the shift-and-store chains are gone: ts and tid are each written with a single 64-bit store.


5. Why It Works

The key insight is what changes when we declare the struct on the stack:

struct ppm_evt_hdr hdr = {0};

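Before the fix, every write went through hdr, a pointer to a packed struct, so the compiler had to assume each field could sit at any address. After the fix, hdr is a stack-allocated value. Clang knows exactly where that local lives (the BPF stack is 8-byte aligned), so building the header no longer needs byte stores; in the bytecode above the fields simply stay in registers.

The __builtin_memcpy then copies the fully built struct into auxmap->data. The destination is &auxmap->data[0], offset 0 of struct auxiliary_map, which is 8-byte aligned because of its uint64_t member. With both sides known to be aligned, clang lowers the copy into wide stores: ts and tid at offsets 0 and 8 become single 64-bit stores, len at offset 16 a 32-bit store, and type at offset 20 a 16-bit store. Only nparams, a uint32_t sitting at offset 22, is still split, and only into two 16-bit stores.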

In short: the trick is to build the struct somewhere the compiler trusts (the stack), then copy it to where you need it. One extra copy in the source, but 32 fewer BPF instructions per invocation. Since auxmap__preload_event_header is called from all 87 syscall probe sites in the modern BPF driver, the reduction applies to every syscall event Falco captures: a small change with system-wide impact.


6. Final Thoughts

I have been fascinated by how computers work at the lowest level for as long as I can remember. There is something uniquely satisfying about reading raw bytecode, finding a pattern that doesn't feel right, tracing it all the way back to a struct declaration, and then watching the disassembly change after a fix. Not because someone asked me to — just because the curiosity wouldn't let go.

I can't tell everyone to start submitting PRs on the strength of one first experience, but it's worth giving it a shot.

7. Merged PR

The fix was reviewed, tested across multiple kernel and distro combinations, and merged into falcosecurity/libs:

https://github.com/falcosecurity/libs/pull/3022