Squeezing Bytes: How I Optimised an eBPF Driver in Falco

1. How I Found This Opportunity
I wanted to make my first open source contribution, so I decided to fork a few eBPF projects — one of them being falcosecurity/libs. After compiling and running Falco, I copied the sched_process_exec sensor's xlated and JIT assembly views into two separate text files for analysis.
Why sched_process_exec?
If you have read my article on sched_process_exit, you already know how critical this sensor is. A map that is only half cleaned up can leave you completely blind.
1.1 What I Noticed
While analysing the xlated bytecode, I came across this pattern:
; return settings->boot_time;
55: (79) r7 = *(u64 *)(r0 +0)
; hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
56: (85) call bpf_ktime_get_boot_ns#306080
; hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
57: (0f) r0 += r7
; hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
58: (73) *(u8 *)(r8 +0) = r0
59: (bf) r1 = r0
60: (77) r1 >>= 56
61: (73) *(u8 *)(r8 +7) = r1
62: (bf) r1 = r0
63: (77) r1 >>= 48
64: (73) *(u8 *)(r8 +6) = r1
65: (bf) r1 = r0
66: (77) r1 >>= 40
67: (73) *(u8 *)(r8 +5) = r1
68: (bf) r1 = r0
69: (77) r1 >>= 32
70: (73) *(u8 *)(r8 +4) = r1
71: (bf) r1 = r0
72: (77) r1 >>= 24
73: (73) *(u8 *)(r8 +3) = r1
74: (bf) r1 = r0
75: (77) r1 >>= 16
76: (73) *(u8 *)(r8 +2) = r1
77: (77) r0 >>= 8
78: (73) *(u8 *)(r8 +1) = r0
; auxmap->event_type = event_type;
79: (bf) r7 = r8
80: (07) r7 += 131082
; hdr->tid = bpf_get_current_pid_tgid() & 0xffffffff;
81: (85) call bpf_get_current_pid_tgid#305536
82: (b4) w1 = 293
; auxmap->event_type = event_type;
83: (6b) *(u16 *)(r7 +0) = r1
84: (b4) w1 = 0
; hdr->nparams = nparams;
85: (73) *(u8 *)(r8 +25) = r1
86: (73) *(u8 *)(r8 +24) = r1
87: (73) *(u8 *)(r8 +23) = r1
88: (b4) w1 = 1
; hdr->type = event_type;
89: (73) *(u8 *)(r8 +21) = r1
90: (b4) w1 = 37
91: (73) *(u8 *)(r8 +20) = r1
; hdr->tid = bpf_get_current_pid_tgid() & 0xffffffff;
92: (73) *(u8 *)(r8 +15) = r6
93: (73) *(u8 *)(r8 +14) = r6
94: (73) *(u8 *)(r8 +13) = r6
95: (73) *(u8 *)(r8 +12) = r6
96: (73) *(u8 *)(r8 +8) = r0
97: (bf) r1 = r0
98: (77) r1 >>= 24
99: (73) *(u8 *)(r8 +11) = r1
100: (bf) r1 = r0
101: (77) r1 >>= 16
102: (73) *(u8 *)(r8 +10) = r1
103: (77) r0 >>= 8
104: (73) *(u8 *)(r8 +9) = r0
; hdr->nparams = nparams;
105: (73) *(u8 *)(r8 +22) = r9
The first thing I noticed was that the compiler was repeatedly copying a register value and right-shifting it. To understand why, I looked up the relevant BPF opcodes:
- 79 = BPF_LDX | BPF_DW | BPF_MEM: 64-bit memory load
- 0f = BPF_ALU64 | BPF_X | BPF_ADD: 64-bit register add
- 73 = BPF_STX | BPF_B | BPF_MEM: single-byte memory store
- bf = BPF_ALU64 | BPF_X | BPF_MOV: 64-bit register move
- 77 = BPF_ALU64 | BPF_K | BPF_RSH: 64-bit right shift
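To double-check a decoding like the one above, the opcode byte can be split into its bit fields. This is a minimal userspace sketch, assuming the standard eBPF opcode layout (class in the low 3 bits, size in bits 3 and 4 for loads and stores, source flag in bit 3 and operation in the high nibble for ALU instructions). The helper names are my own and are not part of the kernel headers or of Falco.

```c
#include <assert.h>
#include <stdint.h>

/* Instruction classes live in the low 3 bits of the opcode byte. */
enum { CLS_LD = 0, CLS_LDX = 1, CLS_ST = 2, CLS_STX = 3,
       CLS_ALU = 4, CLS_JMP = 5, CLS_JMP32 = 6, CLS_ALU64 = 7 };

static int op_class(uint8_t op) { return op & 0x07; }

/* ALU/ALU64 only: bit 3 selects the source (0 = immediate K, 1 = register X). */
static int alu_uses_register(uint8_t op) { return (op & 0x08) != 0; }

/* ALU/ALU64 only: the high nibble is the operation (0x00 ADD, 0x70 RSH, 0xb0 MOV). */
static int alu_operation(uint8_t op) { return op & 0xf0; }

/* Loads and stores only: bits 3-4 encode the access width, returned here in bytes. */
static int access_bytes(uint8_t op) {
    assert(op_class(op) <= CLS_STX); /* only meaningful for LD/LDX/ST/STX */
    switch (op & 0x18) {
    case 0x00: return 4; /* W  */
    case 0x08: return 2; /* H  */
    case 0x10: return 1; /* B  */
    default:   return 8; /* DW */
    }
}
```

With these helpers, 0x73 decodes as a store of width 1, which is exactly the opcode that dominates the dump.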
The pattern becomes clear when you trace through it:
- r0 holds the 64-bit timestamp value
- r8 + 0 is the destination memory address
- The compiler copies r0 into r1, right-shifts to extract each byte, then stores that byte individually
- This repeats 8 times, once per byte of the 64-bit value
This is byte-by-byte assignment. A single 64-bit value written as 8 separate 1-byte stores. That immediately raised a question — why?
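In C terms, the shift-and-store sequence behaves roughly like the first function below, while the second is what a single wide store would amount to. This is a userspace sketch for illustration only; the function names are hypothetical, and the claim that both produce the same bytes assumes a little-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the xlated pattern: one 1-byte store per shift, lowest byte first. */
static void store_u64_bytewise(uint8_t *dst, uint64_t value) {
    assert(dst != NULL);
    for (int i = 0; i < 8; i++) {
        dst[i] = (uint8_t)(value >> (8 * i));
    }
}

/* One 8-byte copy; memcpy keeps this well-defined for any destination alignment. */
static void store_u64_wide(uint8_t *dst, uint64_t value) {
    assert(dst != NULL);
    memcpy(dst, &value, sizeof(value));
}

/* Returns 1 when both strategies leave identical bytes (expected on little-endian hosts). */
static int stores_match(uint64_t value) {
    uint8_t a[8], b[8];
    store_u64_bytewise(a, value);
    store_u64_wide(b, value);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

The result in memory is the same either way; the difference is eight stores plus the shifts and moves around them versus one store.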
2. Finding the Root Cause
My next step was to find where this code originated. I opened sched_process_exec.bpf.c but found no reference to maps__get_boot_time() + bpf_ktime_get_boot_ns(). So I grepped the codebase for that expression and traced the call chain:
- sched_process_exec.bpf.c calls auxmap__preload_event_header(auxmap, PPME_SYSCALL_EXECVE_19_X)
- That function is defined in driver/modern_bpf/helpers/store/auxmap_store_params.h
static __always_inline void auxmap__preload_event_header(struct auxiliary_map *auxmap,
                                                         uint16_t event_type) {
    struct ppm_evt_hdr *hdr = (struct ppm_evt_hdr *)auxmap->data;
    uint8_t nparams = maps__get_event_num_params(event_type);
    hdr->ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
    hdr->tid = bpf_get_current_pid_tgid() & 0xffffffff;
    hdr->type = event_type;
    hdr->nparams = nparams;
    auxmap->payload_pos = sizeof(struct ppm_evt_hdr) + nparams * sizeof(uint16_t);
    auxmap->lengths_pos = sizeof(struct ppm_evt_hdr);
    auxmap->event_type = event_type;
}
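The last lines of that function lay out the buffer as header, then an array of one uint16_t length per parameter, then the payload. The sketch below works through that arithmetic in plain C. It assumes a 26-byte packed header (8 + 8 + 4 + 2 + 4, with no sentinel field); the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed header size: ts (8) + tid (8) + len (4) + type (2) + nparams (4), packed. */
#define HDR_SIZE 26u

_Static_assert(HDR_SIZE == 8 + 8 + 4 + 2 + 4, "header size must match the field sum");

/* Hypothetical holder for the two positions the function computes. */
struct positions {
    uint64_t lengths_pos; /* where the per-parameter length array starts */
    uint64_t payload_pos; /* where the first parameter value will be written */
};

static struct positions header_positions(uint8_t nparams) {
    struct positions p;
    p.lengths_pos = HDR_SIZE;
    p.payload_pos = HDR_SIZE + (uint64_t)nparams * sizeof(uint16_t);
    return p;
}
```

For an event with 3 parameters, the lengths array would occupy bytes 26 to 31 and the payload would start at byte 32.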
2.1 Identifying the Gap
The answer lies in these two struct definitions:
struct auxiliary_map {
    uint8_t data[AUXILIARY_MAP_SIZE]; /* raw space to save our variable-size event. */
    uint64_t payload_pos; /* position of the first empty byte in the `data` buf. */
    uint8_t lengths_pos; /* position the first empty slot into the lengths array of the event. */
    uint16_t event_type; /* event type we want to send to userspace */
};

#if defined _MSC_VER
#pragma pack(push)
#pragma pack(1)
#else
#pragma pack(push, 1)
#endif
struct ppm_evt_hdr {
#ifdef PPM_ENABLE_SENTINEL
    uint32_t sentinel_begin;
#endif
    uint64_t ts; /* timestamp, in nanoseconds from epoch */
    uint64_t tid; /* the tid of the thread that generated this event */
    uint32_t len; /* the event len, including the header */
    uint16_t type; /* the event type */
    uint32_t nparams; /* the number of parameters of the event */
};
#pragma pack(pop)
Now look at this line in the function:
struct ppm_evt_hdr *hdr = (struct ppm_evt_hdr *)auxmap->data;
Strip away the high-level abstraction and this is what it actually does:
- auxmap->data is a raw byte array
- hdr is a pointer to the start of that byte array
- We are casting an unaligned byte array pointer to a packed struct pointer
And there is the problem. struct ppm_evt_hdr is declared with #pragma pack(1), which removes all padding and alignment guarantees. When the BPF compiler sees a pointer to a packed struct, it cannot prove that the memory is naturally aligned. As a result, it cannot safely emit a single 64-bit store — instead it decomposes every field write into individual byte stores to guarantee correctness regardless of alignment.
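The effect of the pragma on layout is easy to check in userspace. The sketch below declares two hypothetical copies of the header (without the sentinel field), one packed and one not, and exposes their size and alignment. The struct names are mine; the only thing taken from Falco is the field list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed copy of the header, mirroring the field order above. */
#pragma pack(push, 1)
struct hdr_packed {
    uint64_t ts;
    uint64_t tid;
    uint32_t len;
    uint16_t type;
    uint32_t nparams;
};
#pragma pack(pop)

/* Same fields with default (natural) alignment, for comparison. */
struct hdr_natural {
    uint64_t ts;
    uint64_t tid;
    uint32_t len;
    uint16_t type;
    uint32_t nparams;
};

/* Packing drops the required alignment to 1 and removes the padding before nparams. */
static size_t packed_alignment(void) { return _Alignof(struct hdr_packed); }
static size_t packed_size(void) { return sizeof(struct hdr_packed); }
static size_t packed_nparams_offset(void) { return offsetof(struct hdr_packed, nparams); }
static size_t natural_nparams_offset(void) { return offsetof(struct hdr_natural, nparams); }
```

In the packed version nparams sits at offset 22, which is not a multiple of 4, and the struct as a whole only promises 1-byte alignment. That is the information the compiler has to work with when it sees a write through a pointer of that type.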
3. The Fix
After trying several approaches, the method that produced the best result — 32 fewer instructions — was this:
static __always_inline void auxmap__preload_event_header(struct auxiliary_map *auxmap,
                                                         uint16_t event_type) {
    uint8_t nparams = maps__get_event_num_params(event_type);
    /*
     * Avoid byte-by-byte stores from writing packed header fields directly.
     * `struct ppm_evt_hdr` is packed, so field assignments are lowered to
     * byte stores. By building the header as a local value and copying it with
     * __builtin_memcpy, clang can lower the copy into wider stores at naturally
     * aligned destination offsets (ts/tid/len/type).
     */
    struct ppm_evt_hdr hdr = {0};
    hdr.ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
    hdr.tid = bpf_get_current_pid_tgid() & 0xffffffff;
    hdr.type = event_type;
    hdr.nparams = nparams;
    __builtin_memcpy(&auxmap->data[0], &hdr, sizeof(struct ppm_evt_hdr));
    auxmap->payload_pos = sizeof(struct ppm_evt_hdr) + nparams * sizeof(uint16_t);
    auxmap->lengths_pos = sizeof(struct ppm_evt_hdr);
    auxmap->event_type = event_type;
}
4. Verification
After applying the fix, the xlated bytecode now looks like this:
; return settings->boot_time;
54: (79) r7 = *(u64 *)(r0 +0)
; hdr.ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
55: (85) call bpf_ktime_get_boot_ns#306080
56: (bf) r9 = r0
; hdr.ts = maps__get_boot_time() + bpf_ktime_get_boot_ns();
57: (0f) r9 += r7
; auxmap->event_type = event_type;
58: (bf) r7 = r8
59: (07) r7 += 131082
; hdr.tid = (uint32_t)bpf_get_current_pid_tgid();
60: (85) call bpf_get_current_pid_tgid#305536
61: (b4) w1 = 293
; auxmap->event_type = event_type;
62: (6b) *(u16 *)(r7 +0) = r1
; __builtin_memcpy(&auxmap->data[0], &hdr, sizeof(struct ppm_evt_hdr));
63: (6b) *(u16 *)(r8 +20) = r1
64: (b4) w1 = 0
65: (6b) *(u16 *)(r8 +24) = r1
66: (63) *(u32 *)(r8 +16) = r1
67: (7b) *(u64 *)(r8 +0) = r9
; hdr.tid = (uint32_t)bpf_get_current_pid_tgid();
68: (67) r0 <<= 32
69: (77) r0 >>= 32
; __builtin_memcpy(&auxmap->data[0], &hdr, sizeof(struct ppm_evt_hdr));
70: (7b) *(u64 *)(r8 +8) = r0
71: (6b) *(u16 *)(r8 +22) = r6
The instruction count dropped by 32 per invocation.
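The new dump writes the header with six stores to r8: u64 at +0 and +8, u32 at +16, and u16 at +20, +22 and +24. A simple model reproduces that count: for each field, use the widest store that is naturally aligned at the current offset and still fits. The sketch below is only that model, written to sanity-check the dump. It is not how clang decides, and the field table assumes the 26-byte layout without the sentinel.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power-of-two width (at most 8) that divides `offset` and fits in `remaining`. */
static size_t widest_aligned_store(size_t offset, size_t remaining) {
    size_t width = 8;
    assert(remaining > 0);
    while (width > 1 && (offset % width != 0 || width > remaining)) {
        width /= 2;
    }
    return width;
}

/* Number of naturally aligned stores needed to write `size` bytes starting at `offset`. */
static size_t count_stores(size_t offset, size_t size) {
    size_t stores = 0;
    while (size > 0) {
        size_t width = widest_aligned_store(offset, size);
        offset += width;
        size -= width;
        stores++;
    }
    return stores;
}

struct field { size_t offset; size_t size; };

/* Assumed packed layout: ts, tid, len, type, nparams. */
static size_t header_store_count(void) {
    static const struct field fields[] = {
        {0, 8}, {8, 8}, {16, 4}, {20, 2}, {22, 4},
    };
    size_t total = 0;
    for (size_t i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
        total += count_stores(fields[i].offset, fields[i].size);
    }
    return total;
}
```

The only field that needs two stores is nparams: offset 22 is 2-byte aligned but not 4-byte aligned, which matches the pair of u16 stores at +22 and +24 in the dump.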
5. Why It Works
The key insight is what changes when we declare the struct on the stack:
struct ppm_evt_hdr hdr = {0};
Before the fix, hdr was a pointer into a byte array — an unaligned destination the compiler couldn't trust. After the fix, hdr is a stack-allocated value. The BPF stack is naturally 8-byte aligned, so the compiler can now prove the memory is aligned and safely emit full-width 64-bit stores when writing to hdr.
The __builtin_memcpy then copies the fully built struct from the stack into auxmap->data. Because the source is now naturally aligned, clang optimizes the copy into wide stores rather than byte-by-byte moves.
In short: the trick is to build the struct somewhere the compiler trusts (the stack), then copy it to where you need it. One extra copy, but 32 fewer BPF instructions per invocation. Since auxmap__preload_event_header is called across all 87 syscall probe sites in the modern BPF driver, the reduction applies to every single syscall event Falco captures, making this a small change with system-wide impact.
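The same build-then-copy pattern can be tried outside BPF. The sketch below is a userspace approximation with a hypothetical evt_hdr struct and plain memcpy standing in for __builtin_memcpy. It says nothing about what instructions a given compiler will emit; it only shows the shape of the code and one side effect worth knowing about.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the packed event header, without the sentinel field. */
#pragma pack(push, 1)
struct evt_hdr {
    uint64_t ts;
    uint64_t tid;
    uint32_t len;
    uint16_t type;
    uint32_t nparams;
};
#pragma pack(pop)

/* Build the header as a local value, then copy it into the byte buffer in one go. */
static void fill_header(uint8_t *buf, uint64_t ts, uint64_t tid,
                        uint16_t type, uint32_t nparams) {
    struct evt_hdr hdr = {0}; /* every field, including len, starts at zero */
    assert(buf != NULL);
    hdr.ts = ts;
    hdr.tid = tid & 0xffffffff; /* keep only the low 32 bits, as the driver does */
    hdr.type = type;
    hdr.nparams = nparams;
    memcpy(buf, &hdr, sizeof(hdr));
}

/* Read the header back without casting the byte buffer to a struct pointer. */
static struct evt_hdr read_header(const uint8_t *buf) {
    struct evt_hdr out;
    assert(buf != NULL);
    memcpy(&out, buf, sizeof(out));
    return out;
}
```

One behavioural difference from the pointer-cast version: because the local struct is zero-initialised and copied whole, len is now written as 0 as well. That is the u32 store at +16 in the new dump, which the old code never touched.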
6. Final Thoughts
I have been fascinated by how computers work at the lowest level for as long as I can remember. There is something uniquely satisfying about reading raw bytecode, finding a pattern that doesn't feel right, tracing it all the way back to a struct declaration, and then watching the disassembly change after a fix. Not because someone asked me to — just because the curiosity wouldn't let go.
One good first experience is not enough for me to tell everyone to start submitting PRs, but it is worth giving it a shot.
7. Merged PR
The fix was reviewed, tested across multiple kernel and distro combinations, and merged into falcosecurity/libs.




