Align loops by 32 bytes to use the DSB more efficiently on x86

Hello everyone,

I wanted to discuss the loop alignment choice in X86 codegen. Currently, LLVM unconditionally aligns all loops by 16 bytes, and in some cases this does not interact well with certain processor front-end mechanisms, in particular the DSB (the decoded µop cache). The effect I'm observing has been discussed before; it is at least mentioned in these slides: https://llvm.org/devmtg/2016-11/Slides/Ansari-Code-Alignment.pdf, but it does not seem that any decision has been made on it since then.
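For context, the 16-byte default comes from the preferred loop alignment that the X86 backend sets up. Below is a simplified sketch of how this looks, based on my reading of X86ISelLowering.cpp around LLVM 11/12 (illustrative rather than a verbatim excerpt), including the experimental flag that can be used to override the value for A/B experiments:

// Simplified sketch (not a verbatim excerpt) of how the X86 backend picks
// its preferred loop alignment; names follow X86ISelLowering.cpp around
// LLVM 11/12. The option value is a log2 byte count, so the default of 4
// means "align loops by 2^4 = 16 bytes", and e.g. passing
// -x86-experimental-pref-loop-alignment=5 to llc requests 32-byte alignment.
#include "llvm/Support/Alignment.h"
#include "llvm/Support/CommandLine.h"
using namespace llvm;

static cl::opt<int> ExperimentalPrefLoopAlignment(
    "x86-experimental-pref-loop-alignment", cl::init(4),
    cl::desc("Sets the preferable loop alignment for experiments "
             "(as log2 bytes)"),
    cl::Hidden);

// ... later, in the X86TargetLowering constructor, roughly:
//   setPrefLoopAlignment(Align(1ULL << ExperimentalPrefLoopAlignment));

If I read this right, bumping that default from 4 to 5 (or making the choice smarter) is the most direct knob for the change discussed here, and the flag is also handy for reproducing the measurements below with either alignment.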

Motivation:

The motivating piece of code that demonstrates significant score swings is the following example:

define i32 @test(i32* %p, i64 %len, i32 %x) {
entry:
  br label %loop

loop:                                             ; preds = %backedge, %entry
  %iv = phi i64 [ %iv.next, %backedge ], [ %len, %entry ]
  %iv.next = add nsw i64 %iv, -1
  %cond_1 = icmp eq i64 %iv, 0
  br i1 %cond_1, label %exit, label %backedge

backedge:                                         ; preds = %loop
  %addr = getelementptr inbounds i32, i32* %p, i64 %iv.next
  %loaded = load atomic i32, i32* %addr unordered, align 4
  %cond_2 = icmp eq i32 %loaded, %x
  br i1 %cond_2, label %failure, label %loop

exit:                                             ; preds = %loop
  ret i32 -1

failure:                                          ; preds = %backedge
  unreachable
}
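
For readers who prefer source code, here is a rough C++ equivalent of this IR (my own reconstruction for illustration; the original benchmark is not this exact program):

// Rough C++ equivalent of the IR above: scan the array backwards and
// bail out if the value x is ever found.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

__attribute__((noinline))
int test(const int32_t *p, int64_t len, int32_t x) {
  for (int64_t iv = len; iv != 0; --iv) {   // %loop
    if (p[iv - 1] == x)                     // %backedge
      std::abort();                         // %failure ('unreachable' in the IR)
  }
  return -1;                                // %exit
}

int main() {
  // Search for a value that is never present, so the loop always runs to
  // completion; this makes a quick reproducer for the codegen below.
  std::vector<int32_t> data(1024, 7);
  int result = 0;
  for (int i = 0; i < 1000000; ++i)
    result = test(data.data(), static_cast<int64_t>(data.size()), 42);
  std::printf("%d\n", result);
  return 0;
}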

Basically, this code searches for an element x in an array of values. Here is the llc result for this loop with mtriple=x86_64-apple-macosx:

        .p2align  4, 0x90
LBB0_1:                                 ## %loop
                                        ## =>This Inner Loop Header: Depth=1
        subq    $1, %rax
        jb      LBB0_4
## %bb.2:                               ## %backedge
                                        ##   in Loop: Header=BB0_1 Depth=1
        cmpl    %edx, -4(%rdi,%rsi,4)
        movq    %rax, %rsi
        jne     LBB0_1

(Note: the last movq is redundant; it is inserted by LSR, likely due to a cost model bug, filed as https://bugs.llvm.org/show_bug.cgi?id=48355. Regardless of that, the overall picture stays the same.)

And here is how this loop looks at runtime on the x86-64 platform (disassembly annotated with profile sample percentages):

 97.34% ↗  0x30026d50: 83ea01        subl  $1, %edx
        │  0x30026d53: 0f820b060000  jb    1547            ; 0x30027364
  0.04% │  0x30026d59: 89d3          movl  %edx, %ebx
        │  0x30026d5b: 394c9810      cmpl  %ecx, 16(%rax,%rbx,4)
        ╰  0x30026d5f: 75ef          jne   -17             ; 0x30026d50

Important notes here:

  • The loop is aligned by 16 bytes;
  • The loop size is 17 bytes.

Depending on how the preceding code happens to be laid out on a particular machine, this loop sometimes also ends up aligned by 32 bytes. The score difference for this example is dramatic: 32 ops/ms when the loop is aligned by 16 bytes (and not by 32), versus 51 ops/ms when it is aligned by 32 bytes. This means the workload is bound by decoding, while the rest of the execution works fine (instructions per cycle grows from 2.5 to 3.7 when the loop is 32-byte aligned).
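To make the geometry concrete: assuming, as on many Intel cores of this generation (and as the slides linked above describe), that the DSB tracks code in 32-byte aligned windows, a 17-byte loop that is 16-byte but not 32-byte aligned necessarily straddles a 32-byte boundary and so occupies two windows, whereas the same loop aligned by 32 bytes fits in a single one. A tiny back-of-the-envelope check (illustrative only; the constants are the ones from the example above):

#include <cstdint>

// Number of 32-byte aligned windows touched by a chunk of code that
// starts at byte offset 'start' and is 'size' bytes long.
constexpr uint64_t windows32(uint64_t start, uint64_t size) {
  return (start + size - 1) / 32 - start / 32 + 1;
}

// The 17-byte loop above, placed at a 16-byte but not 32-byte aligned
// offset, crosses a 32-byte boundary and spans two windows...
static_assert(windows32(16, 17) == 2, "16-byte aligned: two 32-byte windows");
// ...while the same loop at a 32-byte aligned offset fits into one.
static_assert(windows32(32, 17) == 1, "32-byte aligned: one 32-byte window");

int main() { return 0; }

If the front end has to deliver µops from two windows instead of one on every iteration of such a tight loop, it is plausible that decode/delivery becomes the bottleneck, which matches the IPC numbers above.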

The alignment of this particular loop depends on how the code for the preceding IR is generated, and in our case it varies:

  • Host to host;
  • Build to build;
  • Run to run (observed at least once, might be a JIT effect?).

Here are some performance counters collected on this test:

Align 16: