With this, execution time for e.g. __moddi3 goes from 59 to 40 cycles in
the "fast" case or from 290 to 200 cycles in the "slow" case (when the
!TARGET_HAS_NO_HW_DIVIDE variant calls division and modulus functions
for 32-bit SImode), as exposed by gcc.c-torture/execute/arith-rand-ll.c
compiled for -march=v10.
Unfortunately, this makes only a 0.07% performance improvement "dent"
in an arith-rand-ll.c-based performance test, where all loops are also
reduced to 1/10.
The size of every affected libgcc function is reduced to less than
half and they are all now leaf functions.
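
For context, a minimal sketch of the kind of C source that ends up in
these libgcc functions: on a 32-bit target with no native 64-bit
divide, GCC typically lowers signed 64-bit "%" to a call to __moddi3.
The names mod64 and div64 are hypothetical, and that __divdi3 is among
the affected functions is an assumption; only __moddi3 is named above.

```c
/* Signed 64-bit remainder.  On a 32-bit target without a native
   64-bit divide, GCC typically emits a call to libgcc's __moddi3
   for this expression.  */
long long
mod64 (long long a, long long b)
{
  return a % b;
}

/* Signed 64-bit quotient.  Assumed (not stated in the text above) to
   go through the sibling libgcc routine __divdi3 on such targets.  */
long long
div64 (long long a, long long b)
{
  return a / b;
}
```

Compiling such a file with -S for the target and looking for the
__moddi3 call is one way to check that a given test exercises the
changed code.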