From: Willy Tarreau <w@1wt.eu>
Date: Tue, 6 May 2025 09:32:34 +0000 (+0200)
Subject: BUG/MEDIUM: peers: hold the refcnt until updating ts->seen
X-Git-Tag: v3.2-dev15~8
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=006a3acbde309f11190a003c4ac1c026480444e4;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: peers: hold the refcnt until updating ts->seen

In peer_treat_updatemsg(), we call stktable_touch_remote() after
releasing the write lock on the TS, asking it to decrement the
refcnt, then we update ts->seen. Unfortunately this is racy and
causes the issue that Christian reported in issue #2959.

The sequence of events is very hard to trigger manually, but what happens
is the following:

 T1.  stktable_touch_remote(table, ts, 1);
      -> at this point the entry is in the mt_list, and the refcnt is zero.

      T2.  stktable_trash_oldest() or process_table_expire()
           -> these can run, because the refcnt is now zero.
              The entry is cleanly deleted and freed.

 T1.  HA_ATOMIC_STORE(&ts->seen, 1)
      -> we dereference freed memory.

A first attempt at a fix was made by keeping the refcnt held during
all the time the entry is in the mt_list, but this is expensive as
such entries cannot be purged, causing lots of skips during
trash_oldest_data(). This managed to trigger watchdogs, and was only
hiding the real cause of the problem.

The correct approach clearly is to maintain the ref_cnt until we
touch ->seen. That's what this patch does. It does not decrement
the refcnt, while calling stktable_touch_remote(), and does it
manually after touching ->seen. With this the problem is gone.

Note that a reproducer involves the following:
  - a config with 10 stick-ctr tracking the same table with a
    random key between 10M and 100M depending on the machine.
  - the expiration should be between 10 and 20s. http_req_cnt
    is stored and shared with the peers.
  - 4 total processes with such a config on the local machine,
    each corresponding to a different peer. 3 of the peers are
    bound to half of the cores (all threads) and share the same
    threads; the last process is bound to the other half with
    its own threads.
  - injecting at full load, ~256 conn, on the shared listening
    port. After ~2x expiration time to 1 minute the lone process
    should segfault in pools code due to a corrupted by_lru list.

This problem already exists in earlier versions but the race looks
narrower. Given how difficult it is to trigger on a given machine
in its current form, it's likely that it only happens once in a
while on stable branches. The fix must be backported wherever the
code is similar, and there's no hope to reproduce it to validate
the backport.

Thanks again to Christian for his amazing help!
---

diff --git a/src/peers.c b/src/peers.c
index f13d9d393..6f22e1751 100644
--- a/src/peers.c
+++ b/src/peers.c
@@ -2067,7 +2067,11 @@ static int peer_treat_updatemsg(struct appctx *appctx, struct peer *p, int updt,
 	ts->expire = tick_add(now_ms, expire);
 
 	HA_RWLOCK_WRUNLOCK(STK_SESS_LOCK, &ts->lock);
-	stktable_touch_remote(table, ts, 1);
+
+	/* we MUST NOT dec the refcnt yet because stktable_trash_oldest() or
+	 * process_table_expire() could execute between the two next lines.
+	 */
+	stktable_touch_remote(table, ts, 0);
 
 	/* Entry was just learned from a peer, we want to notify this peer
 	 * if we happen to modify it. Thus let's consider at least one
@@ -2075,6 +2079,9 @@ static int peer_treat_updatemsg(struct appctx *appctx, struct peer *p, int updt,
 	 */
 	HA_ATOMIC_STORE(&ts->seen, 1);
 
+	/* only now we can decrement the refcnt */
+	HA_ATOMIC_DEC(&ts->ref_cnt);
+
 	if (wts) {
 		/* Start over the message decoding for wts as we got a valid stksess
 		 * for write_to table, so we need to refresh the entry with supported