[RISC-V] Optimize clear-lowest-set-bit sequence when ctz is nearby
Daniel Barboza and I were looking at deepsjeng recently and spotted an oddity.
In particular we had a sequence like this:
addi a1,a0,-1
and a1,a0,a1
That idiom clears the lowest set bit in a0. In the initial code we looked at
there was a subsequent ctz dest,a0. In other cases I could see an earlier ctz.
And when I went digging further I even found the ctz between the addi/and.
The key is the ctz uses the same input as the addi, but a different output. So
the ctz doesn't participate in combine with the addi/and.
Tackling this in gimple isn't really feasible as the resultant code is going to
have a higher expression count. The bclr idiom looks like (and (rotate (-2)
(count) (input)) where the count comes from a ctz. SO to make it work from a
costing standpoint, we'd have to go find the ctz and use it.
Tackling in gimple->rtl expansion looked potentially feasible, but sign and
general type changes make that painful. So while I could see the add+and and
could speculatively hope there was a nearby ctz it looked less than ideal from
an implementation standpoint. It also tickled some undesirable behavior in the
combiner.
Tackling in combine isn't possible because the ctz doesn't feed the addi/and or
vice-versa.
Tackling with a peephole works sometimes, but is easily thwarted by the
scheduler putting random code in the middle of the sequences we want to adjust.
So in the end I just made a new RISC-V pass that runs after combine.
It walks each block looking for the addi+and construct. When found it then
looks a few instructions in either direction for a related ctz. When found
it'll move things around as necessary and use ctz+bclr. It depends on DCE to
eliminate the almost certainly dead addi+and insns.
I haven't benchmarked this one on design, but it's definitely picking up the
opportunities in deepsjeng and saves a few billion instructions. It has
survived a riscv64-elf and riscv32-elf build. Bootstrap on the BPI is in
flight, bootstrap on the Pioneer passes, but this code won't trigger this
because the Pioneer doesn't have ctz or bclr instructions.
Note while I would expect some of this pass to be usable on other targets, it
does make some assumptions about RTL structure that hold on RISC-V. For
example subword arithmetic has an explicit sign extension to word mode, bit
position is QImode, etc.
Comments? Other thoughts on how to resolve?
gcc/
* config.gcc (riscv*); Add riscv-bclr-lowest-set-bit.o to extra_objs.
* config/riscv/riscv-bclr-lowest-set-bit.cc: New file.
* config/riscv/riscv-passes.def: Add new pass after combine.
* config/riscv/riscv-protos.h (make_pass_bclr_lowest_set_bit): Add
prototype.
* config/riscv/t-riscv: Add rules to build riscv-bclr-lowest-set-bit.o.
gcc/testsuite
* gcc.target/riscv/bclr-lowest-set-bit-1.c: New test.