In GH issue #2552, Christian Ruppert reported an increase in crashes
with recent 3.0-dev versions, always related to stick-tables and peers.
One particularity of his config is that it has a lot of peers.
While trying to reproduce, it was empirically found that firing 10 load
generators at 10 different haproxy instances, each tracking a random key
among 100k against a table of max 5k entries, on 8 threads and with a total
of 50 parallel peers, managed to reproduce the crashes in seconds, very
often in ebtree deletion or insertion code, but not exclusively.
The debugging revealed that the crashes are often caused by a parent node
being corrupted while delete/insert tries to update it regarding a recently
inserted/removed node, and that the corrupted node had always been proven
to have been deleted, then immediately freed, so it ought not to be visited
in the tree by functions enclosed between a pair of lock/unlock. As such the
only possibility was that it had experienced unexpected inserts. Also,
running with pool integrity checking would, 90% of the time, cause crashes
during allocation based on corrupted contents in the node, likely because
it was found at two places in the same tree and was still present as the parent
of a node being deleted or inserted (hence the __stksess_free and
stktable_trash_oldest callers being visible on these items).
Indeed the issue is in fact related to the test set (occasionally redundant
keys, many peers). What happens is that sometimes the same key is learned
from two different peers. When it is learned for the first time, we end up
in stktable_touch_with_exp() in the "else" branch, where the test for
existence is made before taking the lock (since commit
cfeca3a3a3
("MEDIUM: stick-table: touch updates under an upgradable read lock") that
was merged in 2.9), and from there the entry is added. But if one of the
threads manages to insert it before the other thread takes the lock, then
the second thread will try to insert this node again. And inserting an
already inserted node will corrupt the tree (note that we never switched
to enforcing a check in insertion code on this due to API history that
would break various code parts).
Here the solution is simple: it requires rechecking leaf_p after getting
the lock, to avoid touching anything if the entry has already been
inserted in the meantime.
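For illustration only, the fix boils down to a double-checked insert. Below
is a minimal standalone sketch, not the actual haproxy code: struct entry,
tree_insert(), touch_entry() and upd_lock are hypothetical placeholders
standing in for ts->upd.node.leaf_p, eb32_insert() and the table's update
lock:

    /* Minimal sketch, not haproxy code: all names are illustrative. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_rwlock_t upd_lock = PTHREAD_RWLOCK_INITIALIZER;

    struct entry {
            bool linked;   /* stands in for upd.node.leaf_p != NULL */
    };

    static void tree_insert(struct entry *e)
    {
            e->linked = true;  /* placeholder for the real insertion */
    }

    static void touch_entry(struct entry *e)
    {
            /* unlocked check: cheap fast path, but racy on its own
             * since another thread may pass it at the same time
             */
            if (!e->linked) {
                    pthread_rwlock_wrlock(&upd_lock);
                    /* re-check under the write lock: a concurrent
                     * thread may have inserted the same entry between
                     * the first check and the lock acquisition; without
                     * this second check both threads would insert the
                     * node and corrupt the tree.
                     */
                    if (!e->linked)
                            tree_insert(e);
                    pthread_rwlock_unlock(&upd_lock);
            }
    }

The real code additionally upgrades from the read lock to the write lock
before re-checking, as visible in the hunk below.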
Many thanks to Christian Ruppert for testing this and for his invaluable
help on this hard-to-trigger issue.
This fix needs to be backported to 2.9.
do_wakeup = 1;
}
else {
- /* If this entry is not in the tree */
-
+ /* Note: we land here when learning new entries from
+ * remote peers. We hold one ref_cnt so the entry
+ * cannot vanish under us, however if two peers create
+ * the same key at the exact same time, we must be
+ * careful not to perform two parallel inserts! Hence
+ * we need to first check leaf_p to know if the entry
+ * is new, then lock the tree and check the entry again
+ * (since another thread could have created it in the
+ * mean time).
+ */
if (!ts->upd.node.leaf_p) {
/* Time to upgrade the read lock to write lock if needed */
if (!use_wrlock) {
}
/* here we're write-locked */
-
- ts->seen = 0;
- ts->upd.key= (++t->update)+(2147483648U);
- eb = eb32_insert(&t->updates, &ts->upd);
- if (eb != &ts->upd) {
- eb32_delete(eb);
- eb32_insert(&t->updates, &ts->upd);
+ if (!ts->upd.node.leaf_p) {
+ ts->seen = 0;
+ ts->upd.key= (++t->update)+(2147483648U);
+ eb = eb32_insert(&t->updates, &ts->upd);
+ if (eb != &ts->upd) {
+ eb32_delete(eb);
+ eb32_insert(&t->updates, &ts->upd);
+ }
}
}
}