Netlink sequence numbers are allocated as sequential unique identifiers
from a refcount variable, which is subject to overflow. So far, an overflow
has not been considered a practical risk, as it requires 2^32 Netlink
requests.

It seems that this issue is not only theoretical. A host with thousands
of tunnels doing aggressive rekeying and/or aggressive status checking
(via vici list-sas) may trigger the overflow after a few weeks of uptime.
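
As a rough plausibility check, the sketch below computes the average request
rate needed to wrap a 32-bit counter within a given uptime. The helper name
and the 21-day figure are illustrative assumptions, not measurements: for 21
days the result is roughly 2,370 requests per second, which on a host with a
few thousand tunnels is on the order of one request per tunnel per second.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct values of a 32-bit sequence counter. */
#define SEQ_SPACE (UINT64_C(1) << 32)

/*
 * Average requests per second required to exhaust a 32-bit counter
 * within the given number of days (hypothetical helper for illustration).
 */
static inline double requests_per_second_to_wrap(unsigned days)
{
	assert(days > 0); /* avoid division by zero */
	return (double)SEQ_SPACE / ((double)days * 24 * 60 * 60);
}
```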

The consequences are rather devastating: once the refcount overflows, a
Netlink request is sent with sequence number 0. The kernel answers this
request, but the response can't be matched to it, resulting in the error
"received unknown netlink seq 0, ignored". Without Netlink timeouts, the
thread waits indefinitely for a response while holding the Netlink mutex,
bringing all threads to a halt.

So avoid zero sequence numbers at all costs. Also, start at sequence number
1 instead of the arbitrary 201, so the same range is used on start and after
an overflow.
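
The patch calls ref_get_nonzero(), whose implementation is not part of this
excerpt. A minimal sketch of the idea, assuming a 32-bit counter built on C11
atomics, could look as follows. The names seq_counter_t, seq_next() and
seq_next_nonzero() are hypothetical stand-ins, not the project's API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the refcount type; the real one may differ. */
typedef _Atomic(uint32_t) seq_counter_t;

/* Atomically increment the counter and return the new value.
 * Wraps from UINT32_MAX to 0. */
static inline uint32_t seq_next(seq_counter_t *ctr)
{
	assert(ctr != NULL);
	return (uint32_t)(atomic_fetch_add(ctr, 1) + 1u);
}

/* Like seq_next(), but never returns 0: on wrap-around, draw again. */
static inline uint32_t seq_next_nonzero(seq_counter_t *ctr)
{
	uint32_t v;

	do
	{
		v = seq_next(ctr);
	}
	while (v == 0);
	return v;
}
```

With a counter that is zero-initialized, the first value handed out is 1, and
the value following UINT32_MAX is again 1, so the same range is used on start
and after an overflow.
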
uintptr_t seq;
u_int try;
- seq = ref_get(&this->seq);
+ seq = ref_get_nonzero(&this->seq);
for (try = 0; try <= this->retries; ++try)
{
.send_ack = _netlink_send_ack,
.destroy = _destroy,
},
- .seq = 200,
.mutex = mutex_create(MUTEX_TYPE_RECURSIVE),
.socket = socket(AF_NETLINK, SOCK_RAW, protocol),
.entries = hashtable_create(hashtable_hash_ptr, hashtable_equals_ptr, 4),