Provide an s390 specific word-at-a-time implementation. Compared to the
generic variant the generated code for has_zero() is slightly
better. However find_zero() is much simpler since it reuses the result
of __fls() aka flogr() and now comes without any conditional branches,
while the generic variant has three of them.