The order in which the threads are released does not matter, so the
value sent to them can be written using relaxed atomic stores. This
brings a 3-5% performance gain on an 80-core ARM machine, reaching
7.25M/s, and changes nothing on x86, which keeps using strict ordering.
/* now release */
for (curr_cell = &cell; curr_cell; curr_cell = next_cell) {
next_cell = HA_ATOMIC_LOAD(&curr_cell->next);
- HA_ATOMIC_STORE(&curr_cell->next, curr_cell);
+ _HA_ATOMIC_STORE(&curr_cell->next, curr_cell);
}
/* unlock the message area */
for (curr_cell = &cell; curr_cell; curr_cell = next_cell) {
next_cell = HA_ATOMIC_LOAD(&curr_cell->next);
HA_ATOMIC_STORE(&curr_cell->to_send_self, 0);
- HA_ATOMIC_STORE(&curr_cell->next, curr_cell);
+ _HA_ATOMIC_STORE(&curr_cell->next, curr_cell);
}
}
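
For illustration only, the release loop can be sketched in standard C11,
assuming the underscore-prefixed macro corresponds to a relaxed store
(the exact macro expansions are not shown in this patch). The struct
layout and the names release_all() and release_demo() are hypothetical
and exist only for this example:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the message cell; only the link is kept. */
struct cell {
    _Atomic(struct cell *) next;
};

/* Walk the chain and release each waiter by pointing its 'next' at
 * itself. The store is relaxed because the order in which the waiters
 * observe it does not matter. Returns the number of cells released.
 */
static int release_all(struct cell *head)
{
    struct cell *curr, *next;
    int released = 0;

    for (curr = head; curr; curr = next) {
        /* read the link before overwriting it */
        next = atomic_load(&curr->next);
        atomic_store_explicit(&curr->next, curr, memory_order_relaxed);
        released++;
    }
    return released;
}

/* Build a three-cell chain a -> b -> c, release it, and check that
 * every cell now points to itself.
 */
static int release_demo(void)
{
    struct cell a, b, c;
    int n;

    atomic_init(&c.next, NULL);
    atomic_init(&b.next, &c);
    atomic_init(&a.next, &b);

    n = release_all(&a);

    assert(atomic_load(&a.next) == &a);
    assert(atomic_load(&b.next) == &b);
    assert(atomic_load(&c.next) == &c);
    return n;
}
```

On x86 an ordinary aligned store already has release semantics, so a
relaxed store is not expected to compile to anything cheaper than a
release store there, whereas on ARM a relaxed store can avoid the
ordering cost of a stronger one.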