MEDIUM: migrate the pattern reference to cebs_tree
A cebs_tree node is 24 bytes smaller than an ebst_tree one (16B vs 40B), and
pattern references are only used during map/acl updates, so their storage is
pure loss between updates (which, most of the time, never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes per
entry depending on alignment (here it's 24 per struct but 16 in practice,
as malloc's alignment keeps 8 bytes unused).
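As a rough illustration of where the 16B vs 40B figures come from, below is
a simplified, standalone model of the two node shapes on a 64-bit target
(these are not the real cebtree/ebtree declarations, just size-equivalent
placeholders):

  #include <stdio.h>

  /* compact tree node: only two child pointers; everything else is
   * rediscovered by comparing keys while descending the tree */
  struct ceb_node_model {
          void *branch[2];              /* 16 bytes */
  };

  /* classic ebtree node: branches plus up-pointers for the node and
   * leaf parts, and the split bit/prefix stored in the node itself */
  struct eb_node_model {
          void *branch[2];              /* 16 bytes */
          void *node_p;                 /*  8 bytes */
          void *leaf_p;                 /*  8 bytes */
          short bit;                    /*  2 bytes */
          unsigned short pfx;           /*  2 bytes */
  };                                    /* 36 bytes, 40 once aligned */

  int main(void)
  {
          /* prints 16 and 40 on a typical LP64 system */
          printf("compact: %zu, ebtree: %zu\n",
                 sizeof(struct ceb_node_model),
                 sizeof(struct eb_node_model));
          return 0;
  }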
Tested on a Core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M unique):
$ time ./haproxy -c -f acl-ip.cfg
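For reference, acl-ip.cfg itself is not part of this commit; a minimal
configuration for this kind of test could look like the sketch below (the
list file name "ip-list.txt" and the port are hypothetical):

  defaults
      mode http
      timeout client  10s
      timeout connect 10s
      timeout server  10s

  frontend fe
      bind :8080
      # ip-list.txt holds one IP address per line (17.7M entries here)
      acl bad_src src -f ip-list.txt
      http-request deny if bad_src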
The change saves 280 MB of RAM for 17.7M IP addresses, and slightly speeds
up startup (5.8%, from 19.2s to 18.2s), part of which may be attributed to
having less memory to write. Note that this is on small strings. On larger
ones such as user-agents, ebtree doesn't reread the whole key (the split
bit is stored in the node) and might be more efficient.
Before:
RAM (VSZ/RSS):
4443912 3912444
real 0m19.211s
user 0m18.138s
sys 0m1.068s
Overhead Command Shared Object Symbol
44.79% haproxy haproxy [.] ebst_insert
25.07% haproxy haproxy [.] ebmb_insert_prefix
3.44% haproxy libc-2.33.so [.] __libc_calloc
2.71% haproxy libc-2.33.so [.] _int_malloc
2.33% haproxy haproxy [.] free_pattern_tree
1.78% haproxy libc-2.33.so [.] inet_pton4
1.62% haproxy libc-2.33.so [.] _IO_fgets
1.58% haproxy libc-2.33.so [.] _int_free
1.56% haproxy haproxy [.] pat_ref_push
1.35% haproxy libc-2.33.so [.] malloc_consolidate
1.16% haproxy libc-2.33.so [.] __strlen_avx2
0.79% haproxy haproxy [.] pat_idx_tree_ip
0.76% haproxy haproxy [.] pat_ref_read_from_file
0.60% haproxy libc-2.33.so [.] __strrchr_avx2
0.55% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.54% haproxy libc-2.33.so [.] __memchr_avx2
0.46% haproxy haproxy [.] pat_ref_append
After:
RAM (VSZ/RSS):
4166108 3634768
real 0m18.114s
user 0m17.113s
sys 0m0.996s
Overhead Command Shared Object Symbol
38.99% haproxy haproxy [.] cebs_insert
27.09% haproxy haproxy [.] ebmb_insert_prefix
3.63% haproxy libc-2.33.so [.] __libc_calloc
3.18% haproxy libc-2.33.so [.] _int_malloc
2.69% haproxy haproxy [.] free_pattern_tree
1.99% haproxy libc-2.33.so [.] inet_pton4
1.74% haproxy libc-2.33.so [.] _IO_fgets
1.73% haproxy libc-2.33.so [.] _int_free
1.57% haproxy haproxy [.] pat_ref_push
1.48% haproxy libc-2.33.so [.] malloc_consolidate
1.22% haproxy libc-2.33.so [.] __strlen_avx2
1.05% haproxy libc-2.33.so [.] __strcmp_avx2
0.80% haproxy haproxy [.] pat_idx_tree_ip
0.74% haproxy libc-2.33.so [.] __memchr_avx2
0.69% haproxy libc-2.33.so [.] __strrchr_avx2
0.69% haproxy libc-2.33.so [.] _IO_getline_info
0.62% haproxy haproxy [.] pat_ref_read_from_file
0.56% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.56% haproxy libc-2.33.so [.] cfree@GLIBC_2.2.5
0.46% haproxy haproxy [.] pat_ref_append
If the addresses are totally disordered (via "shuf" on the input file),
both implementations reach exactly 68.0s (slower due to a much higher
cache miss ratio).
On large strings such as user agents (1 million here), it's now slightly
slower (+9%):
Before:
real 0m2.475s
user 0m2.316s
sys 0m0.155s
After:
real 0m2.696s
user 0m2.544s
sys 0m0.147s
But such patterns are much less common than short ones, and the memory
savings do still count.
Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and to only enumerate them by walking the
tree, saving 16 extra bytes per entry, that's not possible because
insertion ordering is critical (think of overlapping regexes such as
/index.* and /index.html). It's currently not possible to proceed
differently anyway, because patterns are first pre-loaded into the pat_ref
via pat_ref_read_from_file_smp() and only later indexed by
pattern_read_from_file(), which only has to redo this second part for
maps/acls declared multiple times.
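For instance, with a hypothetical map file relying on overlapping regexes,
the first matching entry is the one whose value is used:

  /index\.html   static
  /index.*       dynamic

Enumerating such entries in tree (lexicographic) order instead of file
order during re-indexing could let the second pattern shadow the first,
hence the need to keep following the list.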