Make ix86_macro_fusion_pair_p and ix86_fuse_mov_alu_p match current CPUs
author Jan Hubicka <hubicka@ucw.cz>
Mon, 3 Mar 2025 18:12:20 +0000 (19:12 +0100)
committer Jan Hubicka <hubicka@ucw.cz>
Tue, 4 Mar 2025 15:10:34 +0000 (16:10 +0100)
commit c84be624e079cd748df93a3dc0b5168865fefee9
tree 8c7fb4c2cedcbf27e6f437b5e5a949d973b56f17
parent 173cf7c9b8c0d61bb2cb0bd3a9e3150b393ab59a

The current implementation of the fusion predicates misses some common
fusion cases on zen and more recent cores.  I added knobs for the
individual conditions we test.

 1) I split out the check for fusing an ALU instruction with a
    conditional jump when the ALU instruction has a memory operand.
    This seems to be supported by zen3+ and by tigerlake and cooperlake
    (according to Agner Fog's manual).

 2) znver4 and 5 support fusion of an ALU instruction and a conditional
    jump even if the ALU instruction has both memory and immediate
    operands.  This seems to be relatively important, enabling 25% more
    fusions on gcc bootstrap.

 3) No CPU supports fusing when the ALU instruction contains IP-relative
    memory references.  I added a separate knob so we do not forget
    about this if it gets supported later.
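
The three knobs above can be read as a small decision procedure.  The
following is a minimal sketch in C, not the code in x86-tune-sched.cc: the
struct and function names (fuse_tune, alu_insn, may_fuse_alu_branch) are
hypothetical, and the real predicate works on RTL insns and checks more
conditions than shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU tuning flags, one per knob described above.  */
struct fuse_tune
{
  bool alu_and_branch;          /* ALU + conditional jump fuses at all.  */
  bool alu_and_branch_mem;      /* ... also with a memory operand.  */
  bool alu_and_branch_mem_imm;  /* ... also with memory and immediate.  */
  bool alu_and_branch_rip_rel;  /* ... also with IP-relative memory.  */
};

/* Hypothetical summary of the ALU instruction preceding the jump.  */
struct alu_insn
{
  bool has_mem;      /* Has a memory operand.  */
  bool has_imm;      /* Has an immediate operand.  */
  bool mem_rip_rel;  /* The memory operand is IP relative.  */
};

/* Return true if the tuning flags allow fusing ALU with a following
   conditional jump.  Illustration only.  */
static bool
may_fuse_alu_branch (const struct fuse_tune *tune, const struct alu_insn *alu)
{
  /* An IP-relative operand is necessarily a memory operand.  */
  assert (!alu->mem_rip_rel || alu->has_mem);

  if (!tune->alu_and_branch)
    return false;
  if (!alu->has_mem)
    return true;
  if (!tune->alu_and_branch_mem)
    return false;
  if (alu->has_imm && !tune->alu_and_branch_mem_imm)
    return false;
  if (alu->mem_rip_rel && !tune->alu_and_branch_rip_rel)
    return false;
  return true;
}
```

With the memory and memory-plus-immediate flags set and the IP-relative
flag clear, a compare of memory against an immediate is accepted, while the
same compare on an IP-relative address is rejected.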

The patch does not solve the limitation of sched that fused pairs must be
adjacent on input and the first operation must be single-set.  Fixing
single-set is easy (I have a separate patch for this); for non-adjacent
pairs we need bigger surgery.

To verify what the CPU really does I made a simple test script.

jh@ryzen3:~> cat fuse-test.c
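/* The preprocessor lines were lost from this listing.  The conditionals
   below are reconstructed from the macros passed by fuse-test.sh and may
   differ from the original.  */
#ifdef IPRELATIVE
int b;
const int z = 0;
const int o = 1;
#endif
int
main()
{
        int a = 1000000000;
#ifndef IPRELATIVE
        int b;
        int z = 0;
        int o = 1;
#endif
        asm volatile ("\n"
".L1234:\n"
#ifdef FUSE
        /* Assumed: keeps the instruction count equal in both builds.  */
        "nop\n"
#endif
        "subl   %3, %0\n"
#ifdef CMP
        "movl %0, %1\n"
        "cmpl     %2, %1\n"
#endif
#ifdef TEST
        "movl %0, %1\n"
        "test %1, %1\n"
#endif
#ifndef FUSE
        /* Separates the flag-setting instruction from the jump.  */
        "nop\n"
#endif
        "jne    .L1234":"=a"(a),
#ifdef MEMIMM
        "=m"(b)
#else
        "=r"(b)
#endif
        :
#ifdef MEM
        "m"(z),
        "m"(o),
#else
        "i"(0),
        "i"(1),
#endif
        "0"(a)
                );
}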
jh@ryzen3:~> cat fuse-test.sh
EVENT=ex_ret_fused_instr
dotest()
{
gcc -O2  fuse-test.c $* -o fuse-cmp-imm-mem-nofuse
perf stat -e $EVENT ./fuse-cmp-imm-mem-nofuse  2>&1 | grep $EVENT
gcc -O2 fuse-test.c -DFUSE $* -o fuse-cmp-imm-mem-fuse
perf stat  -e $EVENT ./fuse-cmp-imm-mem-fuse 2>&1 | grep $EVENT
}

echo ALU with immediate
dotest
echo ALU with memory
dotest -D MEM
echo ALU with IP relative memory
dotest -D MEM -D IPRELATIVE
echo CMP with immediate
dotest -D CMP
echo CMP with memory
dotest -D CMP -D MEM
echo CMP with memory and immediate
dotest -D CMP -D MEMIMM
echo CMP with IP relative memory
dotest -D CMP -D MEM -D IPRELATIVE
echo TEST
dotest -D TEST
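
Each dotest call above prints two counts: the first from the build without
-DFUSE, the second from the build with it.  The loop in fuse-test.c runs
1000000000 iterations, so when the pair fuses the second count is larger by
about that amount.  A minimal sketch of this reading in C, where the helper
name pair_fuses and the half-of-iterations threshold are assumptions made
for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Iteration count of the loop in fuse-test.c.  */
#define ITERATIONS 1000000000LL

/* Hypothetical helper: decide from the two counter values printed by one
   dotest call whether the tested pair was fused.  The threshold of half
   the iteration count is arbitrary; it only has to separate counter noise
   (tens of thousands of events) from one fused pair per iteration.  */
static bool
pair_fuses (long long nofuse_count, long long fuse_count)
{
  assert (nofuse_count >= 0 && fuse_count >= 0);
  return fuse_count - nofuse_count > ITERATIONS / 2;
}
```

For example, the counts 20,345 and 1,000,020,278 indicate fusion, while
20,395 and 20,403 do not.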

On zen5 I get:
ALU with immediate
            20,345      ex_ret_fused_instr:u
     1,000,020,278      ex_ret_fused_instr:u
ALU with memory
            20,367      ex_ret_fused_instr:u
     1,000,020,290      ex_ret_fused_instr:u
ALU with IP relative memory
            20,395      ex_ret_fused_instr:u
            20,403      ex_ret_fused_instr:u
CMP with immediate
            20,369      ex_ret_fused_instr:u
     1,000,020,301      ex_ret_fused_instr:u
CMP with memory
            20,314      ex_ret_fused_instr:u
     1,000,020,341      ex_ret_fused_instr:u
CMP with memory and immediate
            20,372      ex_ret_fused_instr:u
     1,000,020,266      ex_ret_fused_instr:u
CMP with IP relative memory
            20,382      ex_ret_fused_instr:u
            20,369      ex_ret_fused_instr:u
TEST
            20,346      ex_ret_fused_instr:u
     1,000,020,301      ex_ret_fused_instr:u

The restriction on IP-relative memory does not seem to be documented.

On zen3/4 I get:

ALU with immediate
            20,263      ex_ret_fused_instr:u
     1,000,020,051      ex_ret_fused_instr:u
ALU with memory
            20,255      ex_ret_fused_instr:u
     1,000,020,056      ex_ret_fused_instr:u
ALU with IP relative memory
            20,253      ex_ret_fused_instr:u
            20,266      ex_ret_fused_instr:u
CMP with immediate
            20,264      ex_ret_fused_instr:u
     1,000,020,052      ex_ret_fused_instr:u
CMP with memory
            20,253      ex_ret_fused_instr:u
     1,000,019,794      ex_ret_fused_instr:u
CMP with memory and immediate
            20,260      ex_ret_fused_instr:u
            20,264      ex_ret_fused_instr:u
CMP with IP relative memory
            20,258      ex_ret_fused_instr:u
            20,256      ex_ret_fused_instr:u
TEST
            20,261      ex_ret_fused_instr:u
     1,000,020,048      ex_ret_fused_instr:u

On zen1 and zen2 I get:

ALU with immediate
            21,610      ex_ret_fus_brnch_inst:u
            21,697      ex_ret_fus_brnch_inst:u
ALU with memory
            21,479      ex_ret_fus_brnch_inst:u
            21,747      ex_ret_fus_brnch_inst:u
ALU with IP relative memory
            21,623      ex_ret_fus_brnch_inst:u
            21,684      ex_ret_fus_brnch_inst:u
CMP with immediate
            21,708      ex_ret_fus_brnch_inst:u
     1,000,021,288      ex_ret_fus_brnch_inst:u
CMP with memory
            21,689      ex_ret_fus_brnch_inst:u
     1,000,004,270      ex_ret_fus_brnch_inst:u
CMP with memory and immediate
            21,604      ex_ret_fus_brnch_inst:u
            21,671      ex_ret_fus_brnch_inst:u
CMP with IP relative memory
            21,589      ex_ret_fus_brnch_inst:u
            21,602      ex_ret_fus_brnch_inst:u
TEST
            21,600      ex_ret_fus_brnch_inst:u
     1,000,021,233      ex_ret_fus_brnch_inst:u

I tested the patch on zen3 and zen5 with spec2k17 and it seems neutral;
however, the number of fusions does go up.

Bootstrapped/regtested x86_64-linux, I plan to commit it tomorrow.

Honza

gcc/ChangeLog:

	* config/i386/i386.h (TARGET_FUSE_ALU_AND_BRANCH_MEM): New macro.
	(TARGET_FUSE_ALU_AND_BRANCH_MEM_IMM): New macro.
	(TARGET_FUSE_ALU_AND_BRANCH_RIP_RELATIVE): New macro.
	* config/i386/x86-tune-sched.cc (ix86_fuse_mov_alu_p): Support
	non-single-set.
	(ix86_macro_fusion_pair_p): Allow ALU which only clobbers;
	be more careful about immediates; check TARGET_FUSE_ALU_AND_BRANCH_MEM,
	TARGET_FUSE_ALU_AND_BRANCH_MEM_IMM, TARGET_FUSE_ALU_AND_BRANCH_RIP_RELATIVE;
	verify that we never use unsigned checks with inc/dec.
	* config/i386/x86-tune.def (X86_TUNE_FUSE_ALU_AND_BRANCH): New tune.
	(X86_TUNE_FUSE_ALU_AND_BRANCH_MEM): New tune.
	(X86_TUNE_FUSE_ALU_AND_BRANCH_MEM_IMM): New tune.
	(X86_TUNE_FUSE_ALU_AND_BRANCH_RIP_RELATIVE): New tune.
gcc/config/i386/i386.h
gcc/config/i386/x86-tune-sched.cc
gcc/config/i386/x86-tune.def