BUG/MEDIUM: peers: hold the refcnt until updating ts->seen
In peer_treat_updatemsg(), we call stktable_touch_remote() after
releasing the write lock on the TS, asking it to decrement the
refcnt, then we update ts->seen. Unfortunately this is racy and
causes the issue that Christian reported in issue #2959.
The sequence of events is very hard to trigger manually, but what happens
is the following:
T1. stktable_touch_remote(table, ts, 1);
-> at this point the entry is in the mt_list, and the refcnt is zero.
T2. stktable_trash_oldest() or process_table_expire()
-> these can run, because the refcnt is now zero.
The entry is cleanly deleted and freed.
T1. HA_ATOMIC_STORE(&ts->seen, 1)
-> we dereference freed memory.
A first attempt at a fix was made by keeping the refcnt held during
all the time the entry is in the mt_list, but this is expensive as
such entries cannot be purged, causing lots of skips during
trash_oldest_data(). This managed to trigger watchdogs, and was only
hiding the real cause of the problem.
The correct approach clearly is to maintain the ref_cnt until we
touch ->seen. That's what this patch does. It does not decrement
the refcnt, while calling stktable_touch_remote(), and does it
manually after touching ->seen. With this the problem is gone.
Note that a reproducer involves the following:
- a config with 10 stick-ctr tracking the same table with a
random key between 10M and 100M depending on the machine.
- the expiration should be between 10 and 20s. http_req_cnt
is stored and shared with the peers.
- 4 total processes with such a config on the local machine,
each corresponding to a different peer. 3 of the peers are
bound to half of the cores (all threads) and share the same
threads; the last process is bound to the other half with
its own threads.
- injecting at full load, ~256 conn, on the shared listening
port. After ~2x expiration time to 1 minute the lone process
should segfault in pools code due to a corrupted by_lru list.
This problem already exists in earlier versions but the race looks
narrower. Given how difficult it is to trigger on a given machine
in its current form, it's likely that it only happens once in a
while on stable branches. The fix must be backported wherever the
code is similar, and there's no hope to reproduce it to validate
the backport.