From: Alexei Starovoitov
Date: Mon, 11 May 2026 22:25:24 +0000 (-0700)
Subject: Merge branch 'selftests-bpf-add-xdp-load-balancer-benchmark'
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=a982dda833e48f3948db2d17715346beb71de12b;p=thirdparty%2Flinux.git

Merge branch 'selftests-bpf-add-xdp-load-balancer-benchmark'

Puranjay Mohan says:

====================
selftests/bpf: Add XDP load-balancer benchmark

Changelog:

RFC: https://lore.kernel.org/all/20260420111726.2118636-1-puranjay@kernel.org/

Changes in v1:
- Replace bpf_get_cpu_time_counter() with bpf_ktime_get_ns()
- Replace bpf_repeat() with plain for loop and may_goto
- Refactor collect_measurements() to reuse bench_force_done()
- Remove histogram, verbose calibration output, and per-scenario status
  prints
- Trim run script table to p50/stddev/p99
- Set env.quiet when --machine-readable is passed
- Add || true to run script benchmark invocation for set -e safety
- Add bpf-nop benchmark as timing overhead baseline (patch 3)
- Use named struct for LRU inner map to fix build on older toolchains

This series adds an XDP load-balancer benchmark (based on Katran) to the
BPF selftest bench framework.

Motivation
----------

Existing BPF bench tests measure individual operations (map lookups,
kprobes, ring buffers) in isolation. Production BPF programs combine
parsing, map lookups, branching, and packet rewriting in a single call
chain. The performance characteristics of such programs depend on the
interaction of these operations -- register pressure, spills, inlining
decisions, branch layout -- which isolated micro-benchmarks do not
capture.

This benchmark implements a simplified L4 load-balancer modeled after
katran [1].
The BPF program reproduces katran's core datapath:

  L3/L4 parsing -> VIP hash lookup -> per-CPU LRU connection table with
  consistent-hash fallback -> real server selection -> per-VIP and
  per-real stats -> IPIP/IP6IP6 encapsulation

The BPF code exercises hash maps, array-of-maps (per-CPU LRU), percpu
arrays, jhash, bpf_xdp_adjust_head(), bpf_ktime_get_ns(), and
bpf_get_smp_processor_id() in a single pipeline.

This is intended as the first in a series of BPF workload benchmarks
covering other use cases (sched_ext, etc.).

Design
------

A userspace loop calling bpf_prog_test_run_opts(repeat=1) would measure
syscall overhead, not BPF program cost -- the ~4 ns early-exit paths
would be buried under kernel entry/exit. Using repeat=N is also
unsuitable: the kernel re-runs the same packet without resetting state
between iterations, so the second iteration of an encap scenario would
process an already-encapsulated packet.

Instead, timing is measured inside the BPF program using
bpf_ktime_get_ns(). BENCH_BPF_LOOP() brackets N iterations with
timestamp reads using a plain for loop with may_goto, runs a
caller-supplied reset block between iterations to undo side effects
(e.g. strip encapsulation), and records the elapsed time per batch. One
extra untimed iteration runs afterward for output validation.

Auto-calibration picks a batch size targeting ~10 ms per invocation. A
proportionality sanity check verifies that 2N iterations take ~2x as
long as N.

24 scenarios cover the code-path matrix:

  - Protocol: TCP, UDP
  - Address family: IPv4, IPv6, cross-AF (IPv4-in-IPv6)
  - LRU state: hit, miss (16M flow space), diverse (4K flows), cold
  - Consistent-hash: direct (LRU bypass)
  - TCP flags: SYN (skip LRU, force CH), RST (skip LRU insert)
  - Early exits: unknown VIP, non-IP, ICMP, fragments, IP options

Each scenario validates correctness before benchmarking by comparing the
output packet byte-for-byte against a pre-built expected packet and
checking BPF map counters.
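
The auto-calibration and the 2N proportionality check described above
can be illustrated with a minimal userspace sketch. This is not the code
from the series: calibrate_batch(), is_proportional(), TARGET_BATCH_NS
and the tol_pct tolerance are hypothetical names and parameters chosen
for illustration, and the linear scaling from a single probe batch is an
assumption about how such a calibration might work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ~10 ms per invocation, as stated in the text (name is hypothetical). */
#define TARGET_BATCH_NS 10000000ULL

/*
 * Hypothetical helper: given a probe batch of probe_iters iterations that
 * took probe_ns nanoseconds, scale linearly to a batch size whose elapsed
 * time should land near TARGET_BATCH_NS. Always returns at least 1.
 * Assumes probe_iters is small enough that the multiplication fits in u64.
 */
static uint64_t calibrate_batch(uint64_t probe_iters, uint64_t probe_ns)
{
	uint64_t n;

	assert(probe_iters > 0);
	if (probe_ns == 0)
		probe_ns = 1; /* clock too coarse for the probe; avoid div by 0 */

	n = probe_iters * TARGET_BATCH_NS / probe_ns;
	return n ? n : 1;
}

/*
 * Hypothetical helper: true when a 2N-iteration batch took roughly twice
 * as long as an N-iteration batch, within tol_pct percent (assumed knob).
 */
static bool is_proportional(uint64_t ns_n, uint64_t ns_2n, unsigned int tol_pct)
{
	uint64_t expect = 2 * ns_n;
	uint64_t diff = ns_2n > expect ? ns_2n - expect : expect - ns_2n;

	return diff * 100 <= expect * tol_pct;
}
```

For example, a 1000-iteration probe that takes 74000 ns (about 74 ns/op)
would yield a batch of roughly 135K iterations under this sketch, while a
single iteration that already exceeds the target keeps the batch at 1.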

Sample single-scenario output:

  $ sudo ./bench xdp-lb --scenario tcp-v4-lru-hit
  Setting up benchmark 'xdp-lb'...
  Benchmark 'xdp-lb' started.
  tcp-v4-lru-hit: median 74.51 ns/op, stddev 0.11, p99 74.81 (202 samples)

Sample run script output:

  $ ./benchs/run_bench_xdp_lb.sh

  XDP load-balancer benchmark
  ===========================

  +----------------------------------+----------+---------+----------+
  | Single-flow baseline             |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-lru-hit                   |    74.30 |    0.08 |    74.48 |
  | tcp-v4-ch                        |   101.73 |    0.11 |   102.01 |
  | tcp-v6-lru-hit                   |    76.77 |    0.14 |    77.04 |
  | tcp-v6-ch                        |   121.40 |    0.10 |   121.65 |
  | udp-v4-lru-hit                   |   107.42 |    0.22 |   107.90 |
  | udp-v6-lru-hit                   |   110.21 |    0.12 |   110.45 |
  | tcp-v4v6-lru-hit                 |    74.82 |    0.35 |    75.43 |
  +----------------------------------+----------+---------+----------+
  | Diverse flows (4K src addrs)     |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-lru-diverse               |    86.63 |    0.37 |    89.04 |
  | tcp-v4-ch-diverse                |   104.09 |    0.19 |   105.67 |
  | tcp-v6-lru-diverse               |    89.34 |    0.42 |    90.70 |
  | tcp-v6-ch-diverse                |   122.20 |    0.21 |   123.78 |
  | udp-v4-lru-diverse               |   119.37 |    0.58 |   123.10 |
  +----------------------------------+----------+---------+----------+
  | TCP flags                        |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-syn                       |   165.52 |   15.68 |   198.34 |
  | tcp-v4-rst-miss                  |   161.34 |    2.69 |   172.64 |
  +----------------------------------+----------+---------+----------+
  | LRU stress                       |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-lru-miss                  |   440.39 |   35.75 |   550.62 |
  | udp-v4-lru-miss                  |   571.88 |   57.38 |   680.61 |
  | tcp-v4-lru-warmup                |   317.75 |    9.55 |   356.20 |
  +----------------------------------+----------+---------+----------+
  | Early exits                      |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | pass-v4-no-vip                   |    18.26 |    0.13 |    18.66 |
  | pass-v6-no-vip                   |    19.08 |    0.01 |    19.10 |
  | pass-v4-icmp                     |     6.81 |    0.02 |     6.86 |
  | pass-non-ip                      |     5.71 |    0.03 |     5.76 |
  | drop-v4-frag                     |     6.09 |    0.01 |     6.10 |
  | drop-v4-options                  |     5.88 |    0.00 |     5.89 |
  | drop-v6-frag                     |     6.00 |    0.03 |     6.04 |
  +----------------------------------+----------+---------+----------+

Patches
-------

Patch 1 adds bench_force_done() to the bench framework so benchmarks can
signal early completion when enough samples have been collected.

Patch 2 adds the shared BPF batch-timing library (BPF-side timing
arrays, BENCH_BPF_LOOP macro, userspace statistics and calibration).

Patch 3 adds a bpf-nop benchmark as a timing overhead baseline and usage
example for the timing library.

Patch 4 adds the common header shared between the BPF program and
userspace (flow_key, vip_definition, real_definition, encap helpers).

Patch 5 adds the XDP load-balancer BPF program.

Patch 6 adds the userspace benchmark driver with 24 scenarios, packet
construction, validation, and bench framework integration.

Patch 7 adds the run script for running all scenarios.

[1] https://github.com/facebookincubator/katran
====================

Link: https://patch.msgid.link/20260427232313.1582588-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov
---

a982dda833e48f3948db2d17715346beb71de12b