MINOR: fd: don't scan the full fdtab on all threads
During tests, it's pretty visible that with many threads and a large
number of FDs, the process may take time to be ready. The reason for
this is that the full fdtab array is scanned by each and every thread
at boot in fd_reregister_all() in order to make each thread-local
poller adopt the FDs that are relevant to it. The problem is that
when dealing with 1-2M FDs and 64+ threads, it starts to represent
quite a number of loops, and usually the fdtab array doesn't entirely
fit in the CPU's L3 cache, causing extra memory accesses.
It's particularly visible when issuing debugging commands to the CLI
because usually the first one fails while the CPU is at 100% for half
a second (which also is socat's timeout). A quick test with this:
global
stats socket /tmp/sock1 level admin mode 666
stats timeout 1h
maxconn
2000000
And the following script started in another window:
while ! time socat -t5 - /tmp/sock1 <<< "show version";do date -Ins;done
shows that it takes 1.58s for the socat instance that succeeds on an
Ampere Altra with 80 cores, this requires to change the timeout (defaults
to half a second) otherwise it returns nothing. In addition it also means
that during reloads, some CPU spikes will be noticed.
Adding a prefetch of the current FD + 16 improves the startup time by 30%
but that's far from being sufficient.
In practice all of this is performed at boot time, a moment at which we
know that extremely few FDs are registered (basically just the listeners),
so FD numbers are usually very low and the rest of the table is scanned
for no benefit. Ideally, knowing upfront how many FDs we have should be
sufficient.
A first approach would consist in counting the entries on a single thread
before registering pollers. It's not necessarily efficient and would take
time anyway.
This patch takes a different approach. It consists in keeping a thread-local
max ("fd_highest") that is updated whenever fd_insert() is called with a
larger number. Of course this is not correct once all threads have started,
but it will remain valid during boot since the same value is used during
startup and is cloned for each thread, and no scheduling happens anywhere
during this period, so that all threads are aware of the highest FD they've
seen registered, even if it had been done in some init code, and this without
having to deal with a shared variable.
Here on the test platform, the script gets its response in 10ms vs 1580
before.