OPTIM: mux-h2: use tasklet_wakeup_after() in h2s_notify_recv()
This reduces the avg wakeup latency of sc_conn_io_cb() from 1900 to 51us.
The L2 cache misses from from 1.4 to 1.2 billion for 20k req. But the
perf is not better. Also there are situations where we must not perform
such wakeup, these may only be done from h2_io_cb, hence the test on the
next_tasklet pointer and its reset when leaving the function. In practice
all callers to h2s_close() or h2s_destroy() can reach that code, this
includes h2_detach, h2_snd_buf, h2_shut etc.
Another test with 40 concurrent connections, transferring 40k 1MB objects
at different concurrency levels from 1 to 80 also showed a 21% drop in L2
cache misses, and a 2% perf improvement:
Before:
329,510,887,528 instructions
50,907,966,181 branches
843,515,912 branch-misses
2,753,360,222 cache-misses
19,306,172,474 L1-icache-load-misses
17,321,132,742 L1-dcache-load-misses
951,787,350 LLC-load-misses
44.
660469000 seconds user
62.
459354000 seconds sys
=> avg perf: 373 MB/s
After:
331,310,219,157 instructions
51,343,396,257 branches
851,567,572 branch-misses
2,183,369,149 cache-misses
19,129,827,134 L1-icache-load-misses
17,441,877,512 L1-dcache-load-misses
906,923,115 LLC-load-misses
42.
795458000 seconds user
62.
277983000 seconds sys
=> avg perf: 380 MB/s
With small requests, it's the L1 and L3 cache misses which reduced by
3% and 7% respectively, and the performance went up by 3%.