From: Willy Tarreau Date: Tue, 6 May 2025 09:32:34 +0000 (+0200) Subject: BUG/MEDIUM: peers: hold the refcnt until updating ts->seen X-Git-Tag: v3.2-dev15~8 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=006a3acbde309f11190a003c4ac1c026480444e4;p=thirdparty%2Fhaproxy.git BUG/MEDIUM: peers: hold the refcnt until updating ts->seen In peer_treat_updatemsg(), we call stktable_touch_remote() after releasing the write lock on the TS, asking it to decrement the refcnt, then we update ts->seen. Unfortunately this is racy and causes the issue that Christian reported in issue #2959. The sequence of events is very hard to trigger manually, but what happens is the following: T1. stktable_touch_remote(table, ts, 1); -> at this point the entry is in the mt_list, and the refcnt is zero. T2. stktable_trash_oldest() or process_table_expire() -> these can run, because the refcnt is now zero. The entry is cleanly deleted and freed. T1. HA_ATOMIC_STORE(&ts->seen, 1) -> we dereference freed memory. A first attempt at a fix was made by keeping the refcnt held during all the time the entry is in the mt_list, but this is expensive as such entries cannot be purged, causing lots of skips during trash_oldest_data(). This managed to trigger watchdogs, and was only hiding the real cause of the problem. The correct approach clearly is to maintain the ref_cnt until we touch ->seen. That's what this patch does. It does not decrement the refcnt, while calling stktable_touch_remote(), and does it manually after touching ->seen. With this the problem is gone. Note that a reproducer involves the following: - a config with 10 stick-ctr tracking the same table with a random key between 10M and 100M depending on the machine. - the expiration should be between 10 and 20s. http_req_cnt is stored and shared with the peers. - 4 total processes with such a config on the local machine, each corresponding to a different peer. 3 of the peers are bound to half of the cores (all threads) and share the same threads; the last process is bound to the other half with its own threads. - injecting at full load, ~256 conn, on the shared listening port. After ~2x expiration time to 1 minute the lone process should segfault in pools code due to a corrupted by_lru list. This problem already exists in earlier versions but the race looks narrower. Given how difficult it is to trigger on a given machine in its current form, it's likely that it only happens once in a while on stable branches. The fix must be backported wherever the code is similar, and there's no hope to reproduce it to validate the backport. Thanks again to Christian for his amazing help! --- diff --git a/src/peers.c b/src/peers.c index f13d9d393..6f22e1751 100644 --- a/src/peers.c +++ b/src/peers.c @@ -2067,7 +2067,11 @@ static int peer_treat_updatemsg(struct appctx *appctx, struct peer *p, int updt, ts->expire = tick_add(now_ms, expire); HA_RWLOCK_WRUNLOCK(STK_SESS_LOCK, &ts->lock); - stktable_touch_remote(table, ts, 1); + + /* we MUST NOT dec the refcnt yet because stktable_trash_oldest() or + * process_table_expire() could execute between the two next lines. + */ + stktable_touch_remote(table, ts, 0); /* Entry was just learned from a peer, we want to notify this peer * if we happen to modify it. Thus let's consider at least one @@ -2075,6 +2079,9 @@ static int peer_treat_updatemsg(struct appctx *appctx, struct peer *p, int updt, */ HA_ATOMIC_STORE(&ts->seen, 1); + /* only now we can decrement the refcnt */ + HA_ATOMIC_DEC(&ts->ref_cnt); + if (wts) { /* Start over the message decoding for wts as we got a valid stksess * for write_to table, so we need to refresh the entry with supported