eBPF: Odd optimization results with clang-5.0

Hi,

just tested the clang eBPF backend. Very cool thing :slight_smile:

But I've noticed that the codegen creates really odd things which I
cannot explain from my understanding of the ISA. My test is attached
and was compiled via:

clang -O2 -target bpf -x c -c ebpf_opimizer_problems.o -o
ebpf_opimizer_problems.o
llvm-objdump -S ebpf_opimizer_problems.o

First thing is the odd way how 8 bit loads to an uint8_t are handled
(see bug1_sec):

      0: bf 16 00 00 00 00 00 00 r6 = r1
      1: 30 00 00 00 00 00 00 00 r0 = *(u8 *)skb[0]
      2: 57 00 00 00 ff ff 00 00 r0 &= 65535
      3: 95 00 00 00 00 00 00 00 exit

I have no idea why the second instruction (line "2:") would be
required when there was just a load of 8 bit to the 64 bit register.
This isn't generated when it is stored in an uint32_t or uint64_t
variable (but also when storing it in an uint16_t). This can be seen
in ok1_sec, ok1_too_a_sec and ok1_too_b_sec.

The other problem is the chaining of tests which results in an extra
store for the temporary output register (r7, see bug2_sec):

      0: bf 16 00 00 00 00 00 00 r6 = r1
      1: b7 07 00 00 00 00 00 00 r7 = 0
      2: 30 00 00 00 00 00 00 00 r0 = *(u8 *)skb[0]
      3: 55 00 04 00 01 00 00 00 if r0 != 1 goto 4
      4: 30 00 00 00 01 00 00 00 r0 = *(u8 *)skb[1]
      5: b7 07 00 00 15 00 00 00 r7 = 21
      6: 15 00 01 00 01 00 00 00 if r0 == 1 goto 1
      7: b7 07 00 00 00 00 00 00 r7 = 0

LBB4_3:
      8: bf 70 00 00 00 00 00 00 r0 = r7
      9: 95 00 00 00 00 00 00 00 exit

For unknown reasons, the line "6:" was changed from an JNE to an JEQ.
This makes it necessary to introduce link "5:" and line "7:". It would
have been better (in my eyes) to use keep it an JNE, move line "5:"
after the comparison and drop line "7:". This would look like this
(removed the bytecode to make it easier for me):

      0: r6 = r1
      1: r7 = 0
      2: r0 = *(u8 *)skb[0]
      3: if r0 != 1 goto 3
      4: r0 = *(u8 *)skb[1]
      5: if r0 != 1 goto 1
      6: r7 = 21
LBB4_3:
      7: r0 = r7
      8: exit

Or a different version (not so easily readable because it is
asymmetric) would have been:

       0: bf 16 00 00 00 00 00 00 r6 = r1
      1: 30 00 00 00 00 00 00 00 r0 = *(u8 *)skb[0]
      2: 55 00 03 00 01 00 00 00 if r0 != 1 goto 3
      3: b7 07 00 00 15 00 00 00 r7 = 21
      4: 30 00 00 00 01 00 00 00 r0 = *(u8 *)skb[1]
      5: 15 00 01 00 01 00 00 00 if r0 == 1 goto 1

LBB4_2:
      6: b7 07 00 00 00 00 00 00 r7 = 0

LBB4_3:
      7: bf 70 00 00 00 00 00 00 r0 = r7
      8: 95 00 00 00 00 00 00 00 exit

The used version (Debian buster amd64) is:

clang version 5.0.1-2 (tags/RELEASE_501/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

ebpf_optimizer_problems.c (926 Bytes)

thank you for reporting these issues.
It does look odd. Not sure whether it's related to any recent changes.
Needs more investigation.

First thing is the odd way how 8 bit loads to an uint8_t are handled
(see bug1_sec):

I could reproduce both issues on other targets on latest LLVm trunk@321940,
for example AArch64 (need to remove asm("llvm.bpf.load.byte") from the
testcase.

For the first issue, it seems to be i8/i16 will be promoted to i32, so for
bug1_sec, the sequence is:

         t6: i32 = truncate t5
       t8: i32 = and t6, Constant:i32<255>
     t9: i64 = any_extend t8

while for ok1, it is;

         t6: i32 = truncate t5
     t9: i64 = any_extend t6

For ok1 sequence, LLVM is doing (aext (trunx x)) -> x, while for bug1_sec
sequence, LLVM is not doing combination which is correct as it doesn't
understand the return value of llvm.bpf.load.byte is zero extended to i64
so combine the bug1_sec sequence will lost the effect of and instruction.

For unknown reasons, the line "6:" was changed from an JNE to an JEQ.

LLVM is doing geneirc canonicalizations inside

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:

// If the lhs block is the next block, invert the condition so that we can
// fall through to the lhs instead of the rhs block.

First thing is the odd way how 8 bit loads to an uint8_t are handled
(see bug1_sec):

I could reproduce both issues on other targets on latest LLVm trunk@321940,
for example AArch64 (need to remove asm("llvm.bpf.load.byte") from the
testcase.

For the first issue, it seems to be i8/i16 will be promoted to i32, so for
bug1_sec, the sequence is:

     t6: i32 = truncate t5
   t8: i32 = and t6, Constant:i32&lt;255&gt;
 t9: i64 = any\_extend t8

while for ok1, it is;

     t6: i32 = truncate t5
 t9: i64 = any\_extend t6

For ok1 sequence, LLVM is doing (aext (trunx x)) -> x, while for bug1_sec
sequence, LLVM is not doing combination which is correct as it doesn't
understand the return value of llvm.bpf.load.byte is zero extended to i64
so combine the bug1_sec sequence will lost the effect of and instruction.

Thanks for investigation.

Looks like the IR before "and" operation is introduced, IR looks like
   %call = call i64 @llvm.bpf.load.byte(i8* %0, i64 0)
   %conv = trunc i64 %call to i8
   %conv1 = zext i8 %conv to i32
   ret i32 %conv1

and the "Combine redundant instructions" phase changes it to:
   %call = call i64 @llvm.bpf.load.byte(i8* %0, i64 0)
   %conv = trunc i64 %call to i32
   %conv1 = and i32 %conv, 255
   ret i32 %conv1

while for ok1, IR looks like:
   %call = call i64 @llvm.bpf.load.byte(i8* %0, i64 0)
   %conv = trunc i64 %call to i32
   ret i32 %conv

One thing we could do is to do this optimization at BPF backend during
DAG2DAG transformation, since it understands the llvm.bpf.load.byte
semantics.

For unknown reasons, the line "6:" was changed from an JNE to an JEQ.

LLVM is doing geneirc canonicalizations inside

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:

// If the lhs block is the next block, invert the condition so that we can
// fall through to the lhs instead of the rhs block.

I disabled this optimization and original condition "==" is preserved,
we still have the inefficient code:
Disassembly of section bug2_sec:
bug2:
        0: bf 16 00 00 00 00 00 00 r6 = r1
        1: b7 07 00 00 00 00 00 00 r7 = 0
        2: 30 00 00 00 00 00 00 00 r0 = *(u8 *)skb[0]
        3: 15 00 01 00 01 00 00 00 if r0 == 1 goto +1 <LBB4_1>
        4: 05 00 04 00 00 00 00 00 goto +4 <LBB4_3>

LBB4_1:
        5: 30 00 00 00 01 00 00 00 r0 = *(u8 *)skb[1]
        6: b7 07 00 00 15 00 00 00 r7 = 21
        7: 15 00 01 00 01 00 00 00 if r0 == 1 goto +1 <LBB4_3>
        8: b7 07 00 00 00 00 00 00 r7 = 0

LBB4_3:
        9: bf 70 00 00 00 00 00 00 r0 = r7
       10: 95 00 00 00 00 00 00 00 exit

Right, the insn 7 and 8 can be removed. But since the "switch" to "cond"
transformation happens in insn selection, it may be too late for
the redundant condition elimination...