From: Alexei Starovoitov
Date: Fri, 5 Jun 2026 15:00:09 +0000 (-0700)
Subject: Merge branch 'bpf-introduce-resizable-hash-map'
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=87d119abc42fa9b4a0acd3b0c038a8584d62568e;p=thirdparty%2Fkernel%2Flinux.git

Merge branch 'bpf-introduce-resizable-hash-map'

Mykyta Yatsenko says:

====================
bpf: Introduce resizable hash map

This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide a resizable hash map for BPF.

The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined
at map creation time. While this works well for many use cases, it
presents challenges when:

1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
   under-provisioning hurts performance)

BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.

The implementation wraps the kernel's rhashtable with BPF map operations:

- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete,
  get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires the BPF_F_NO_PREALLOC flag (elements are allocated on demand)
- Updates values in place for improved performance
- max_entries serves as a hard limit, not a bucket count
- Uses bit_spin_lock() + local_irq_save() for bucket locking, similar to
  the existing BPF hashmap's raw_spin_lock_irqsave(); insertions and
  deletions may fail
- Iterations are best-effort: if resizes, insertions, or deletions take
  place concurrently, an iteration may visit the same element multiple
  times or skip elements
- Locks out insertions while the special-fields destructor runs, to
  guarantee its completion
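
To make the distinction between the bucket count and the max_entries
limit concrete, here is a minimal user-space sketch of a load-factor
resize policy. It is a toy model, not the kernel code: the struct, the
toy_* function names, and the 75 % grow / 30 % shrink thresholds are
assumptions chosen for illustration, and the real rhashtable performs
its resizes asynchronously rather than inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: counters only, no real buckets or hashing. */
struct toy_table {
	size_t size;        /* current bucket count */
	size_t nelems;      /* elements currently stored */
	size_t min_size;    /* bucket count never shrinks below this */
	size_t max_entries; /* hard cap on elements, unrelated to size */
};

/* Grow once the table is more than 75 % full (assumed threshold). */
static bool toy_should_grow(const struct toy_table *t)
{
	return t->nelems > t->size / 4 * 3;
}

/* Shrink once the table is less than 30 % full (assumed threshold). */
static bool toy_should_shrink(const struct toy_table *t)
{
	return t->nelems < t->size * 3 / 10 && t->size > t->min_size;
}

/* Returns 0 on success, -1 if the max_entries limit is already reached. */
int toy_insert(struct toy_table *t)
{
	if (t->nelems >= t->max_entries)
		return -1;
	t->nelems++;
	if (toy_should_grow(t))
		t->size *= 2;
	return 0;
}

/* Returns 0 on success, -1 if the table is empty. */
int toy_delete(struct toy_table *t)
{
	if (t->nelems == 0)
		return -1;
	t->nelems--;
	if (toy_should_shrink(t))
		t->size /= 2;
	assert(t->size >= t->min_size);
	return 0;
}
```

Starting from size = 4, min_size = 4 and max_entries = 1000, 1200 calls
to toy_insert() leave 1000 elements in 2048 buckets with 200 inserts
rejected, and deleting every element brings the bucket count back to 4.
The element cap and the bucket count move independently, which is the
property the max_entries bullet above describes.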

The series includes comprehensive tests:

- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests

Signed-off-by: Mykyta Yatsenko
---

Update implementation
---------------------

The current implementation of BPF_MAP_TYPE_RHASH does not provide the
same strong guarantees on value consistency under concurrent reads and
writes as BPF_MAP_TYPE_HASH. BPF_MAP_TYPE_HASH allocates a new element
and atomically swaps the pointer. BPF_MAP_TYPE_RHASH does a memcpy in
place with no lock held. rhash trades consistency for speed: concurrent
readers can observe partially updated data, and two concurrent writers
to the same key can interleave, producing mixed values. This is similar
to the arraymap update implementation, including the handling of special
fields. As a solution, users may use BPF_F_LOCK to guarantee consistent
reads and write serialization.

Summary of the read consistency guarantees:

  map type     | write mechanism  | read consistency
  -------------+------------------+---------------------------
  htab         | alloc, swap ptr  | always consistent (RCU)
  htab F_LOCK  | in-place + lock  | consistent if reader locks
  -------------+------------------+---------------------------
  rhtab        | in-place memcpy  | torn reads
  rhtab F_LOCK | in-place + lock  | consistent if reader locks

Benchmarks
----------

1. LOOKUP (single producer, M events/sec)

  key | max | nr   | htab   | rhtab | ratio | delta
  ----+-----+------+--------+-------+-------+-------
    8 | 1K  | 750  |  99.85 | 81.92 | 0.82x |  -18 %
    8 | 1K  | 1K   | 100.71 | 80.19 | 0.80x |  -20 %
    8 | 1M  | 750K |  23.37 | 72.09 | 3.08x | +208 %
    8 | 1M  | 1M   |  13.39 | 53.72 | 4.01x | +301 %
   32 | 1K  | 750  |  51.57 | 42.78 | 0.83x |  -17 %
   32 | 1K  | 1K   |  50.81 | 45.83 | 0.90x |  -10 %
   32 | 1M  | 750K |  11.27 | 15.29 | 1.36x |  +36 %
   32 | 1M  | 1M   |   7.32 |  8.75 | 1.19x |  +19 %
  256 | 1K  | 750  |   7.58 |  7.88 | 1.04x |   +4 %
  256 | 1K  | 1K   |   7.43 |  7.81 | 1.05x |   +5 %
  256 | 1M  | 750K |   3.69 |  4.27 | 1.16x |  +16 %
  256 | 1M  | 1M   |   2.60 |  3.12 | 1.20x |  +20 %

Pattern:
* Small map (1K): htab wins for 8 / 32 byte keys by 10-20 % because the
  preallocated bucket array fits in L1. Equalises at 256 byte keys.
* Large map (1M): rhtab wins everywhere, up to 4x at high load factor
  with 8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the bucket
  array; htab stays at user-declared max.

2. FULL UPDATE (M events/sec per producer, -p 7)

  htab per-producer:
    20.33 22.02 19.27 23.61 24.18 23.17 21.07
    mean 21.94   range 19.27 - 24.18

  rhtab per-producer:
    133.51 129.47 74.52 129.29 102.26 129.98 107.64
    mean 115.24  range 74.52 - 133.51

  speedup (mean): 5.25x (+425 %)

In-place memcpy avoids the per-update alloc + RCU pointer swap that htab
pays.

3. MEMORY (overwrite, -p 8, no --preallocated)

  value_size | htab ops/s  | rhtab ops/s | htab mem | rhtab mem
  -----------+-------------+-------------+----------+----------
        32 B | 122.87 k/s  | 133.04 k/s  | 2.47 MiB | 2.49 MiB
      4096 B |  64.43 k/s  |  65.38 k/s  | 6.74 MiB | 6.44 MiB

  rhtab/htab : +8 % ops, +0.8 % mem (32 B)
               +1 % ops, -4 % mem   (4096 B)

SUMMARY

* Small / well-fitting map: htab is faster (cache-friendly fixed bucket
  array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster (1.2x to
  4x) because rhashtable resizes to keep the load factor sane while htab
  stays stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer via in-place
  memcpy.
* Memory benchmark: effectively on par

---
Changes in v7:
- rhashtable_next_key: move into lib/rhashtable.c, drop params argument
  (Herbert).
- rhashtable_next_key: kdoc clarifies that behavior on tables with
  duplicate keys is undefined (sashiko).
- rhashtable: include Herbert's "Use irq work for shrinking" patch so
  __rhashtable_remove_fast_one() can fire the shrink path from NMI
  context (Herbert).
- hashtab: fix u32 multiply overflow in
  __rhtab_map_lookup_and_delete_batch copy_to_user; cast total to size_t
  before multiplying by key_size / value_size (sashiko, bot+bpf-ci).
- hashtab: allow kptr/refcount fields in rhtab values (same model as
  array map).
- Link to v6: https://patch.msgid.link/20260602-rhash-v6-0-1bfd35a4184f@meta.com

Changes in v6:
- rhashtable_next_key: advance past duplicate keys in the main bucket
  chain to avoid an infinite loop when there are duplicate keys
  (sashiko).
- rhashtable_next_key: return ERR_PTR(-EOPNOTSUPP) on rhltable (sashiko).
- rhashtable: selftest pre-sizes the table to avoid concurrent rehash
  triggering spurious failures (sashiko).
- hashtab: real rhtab_map_mem_usage in the basic commit; move
  bpf_map_free_internal_structs from rhtab_free_elem into the
  special-fields commit where it does meaningful work (bot+bpf-ci).
- bpf_iter (seq_file): switch to rhashtable_walk_* for stronger coverage
  under concurrent rehash; get_next_key and batch keep
  rhashtable_next_key (sashiko).
- iter ops: rhtab_map_get_next_key adds IS_ERR check before
  dereferencing the element pointer (sashiko).
- iter ops: bpf_each_rhash_elem removes cond_resched() (sashiko).
- iter ops: batch returns -EAGAIN (not -ENOENT) on cursor delete, so
  userspace can distinguish lost cursor from end-of-iteration and
  restart from NULL (sashiko).
- Link to v5: https://patch.msgid.link/20260528-rhash-v5-0-7205191b6c57@meta.com

Changes in v5:
- rhashtable_next_key: add kdoc WARNING to highlight lack of rehash
  detection and unbounded iteration (Herbert).
- rhashtable: selftest now checks IS_ERR() before PTR_ERR comparison on
  the missing-key path (bot+bpf-ci).
- hashtab: drop dead stub bodies and unused map_ops registrations from
  the basic commit; iteration commit adds bodies, structs, and
  registrations together. .map_get_next_key keeps a stub registration in
  the basic commit because the syscall dispatcher does not NULL-check
  it; iteration commit replaces the stub body with the real
  implementation (bot+bpf-ci).
- hashtab: fix batch cursor advancement. v4 stashed the lookahead
  element key but then resumed via next_key(cursor), skipping that
  element across batch boundaries and orphaning it on
  lookup_and_delete_batch. v5 stashes the lookahead key and looks it up
  directly on the next batch entry (bot+bpf-ci, sashiko v3).
- hashtab: document torn-read race in rhtab_map_update_existing,
  matching arraymap semantics (bot+bpf-ci).
- Link to v4: https://patch.msgid.link/20260513-rhash-v4-0-dd3d541ccb0b@meta.com

Changes in v4:
- rhashtable: introduce rhashtable_next_key(), drop walker-based
  iteration for BPF (also drops earlier rhashtable_walk_enter_from()
  proposal).
- map_extra: presize hint via lower 32 bits (nelem_hint), capped at
  U16_MAX.
- Automatic shrinking enabled (was missing despite being advertised).
- Reject key_size > U16_MAX (rhashtable_params.key_len is u16).
- Replace irqs_disabled() guard with bpf_disable_instrumentation around
  bucket-lock paths: closes same-CPU NMI tracing recursion without
  rejecting legitimate IRQ-context callers.
- lookup_and_delete reordered: unlink before copy to avoid populating
  user buffer on concurrent-unlink -ENOENT.
- update_existing reordered: copy then free_fields, matching arraymap.
- Word-sized key fast path (sizeof(long) bytes), inlined hashfn/cmpfn
  via static-const rhashtable_params; works on both 32-bit and 64-bit.
- check_and_init_map_value() on insert (zero special-field bytes from
  recycled bpf_mem_alloc memory; previously bpf_spin_lock could read
  garbage and qspinlock would deadlock).
- BPF_SPIN_LOCK / BPF_RES_SPIN_LOCK allowlist moved to the
  special-fields commit so each commit is bisect-safe.
- Link to v3: https://patch.msgid.link/20260424-rhash-v3-0-d0fa0ce4379b@meta.com

Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize
  (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
  fail on resize where the table is fully iterated inside the kernel,
  don't fail on resize where iteration goes through userspace.
  Exception: rhtab_map_free_internal_structs() is safe to iterate fully
  in the kernel, with no risk of an infinite loop, because no user holds
  a reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/

Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable, use walk API, instead
  of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() into rhashtable to support O(1)
  iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com

---
====================

Link: https://patch.msgid.link/20260605-rhash-v7-0-5b8e05f8630d@meta.com
Signed-off-by: Alexei Starovoitov
---

87d119abc42fa9b4a0acd3b0c038a8584d62568e