Fix the arithmetic used to pre-fill non_empty_tgids while more than
32/64 thread groups are still left: to get the right array index, the
number of thread groups already processed has to be divided by the
number of bits in a long.
This bug was introduced by commit
7e1fed4b7a8b862bf7722117f002ee91a836beb5, but hopefully was never hit,
because it requires having at least as many thread groups as there are
bits in a long, which is impossible on 64-bit machines, as MAX_TGROUPS
is still 32.
while (i >= LONGBITS) {
- non_empty_tgids[global.nbtgroups - i] = ULONG_MAX;
+ non_empty_tgids[(global.nbtgroups - i) / LONGBITS] = ULONG_MAX;
i -= LONGBITS;
}
while (i > 0) {