Simple NEON optimization

Hi folks, me again,

So, I want to implement a simple optimization in a NEON case I've seen
these days, most as a matter of exercise, but it also simplifies (just
a bit) the code generated.

The case is simple:

uint32x2_t x, res;
res = vceq_u32(x, vcreate_u32(0));

This will generate the following code:

; zero d16
vmov.i32 d16, #0x0
movw r0, :lower16:a
movt r0, :upper16:a
vld1.32 {d17}, [r0]
; compare two registers
vceq.i32 d17, d17, d16

But, because the vector is zero, and there is a NEON instruction to
compare against an immediate zero (VCEQZ), we could combine the two
instructions:

movw r0, :lower16:a
movt r0, :upper16:a
vld1.32 {d17}, [r0]
; compare two registers
vceq.i32 d17, d17, #0

thus, saving the VMOV.

I know, it's not much, but it's a good start for me to get the hand of
writing such passes.

So, should I put this as a special case in NEON lowering or make it as
part of an optimization pass? Which classes should I look first?

Hi folks, me again,

So, I want to implement a simple optimization in a NEON case I've seen
these days, most as a matter of exercise, but it also simplifies (just
a bit) the code generated.

The case is simple:

uint32x2_t x, res;
res = vceq_u32(x, vcreate_u32(0));

This will generate the following code:

; zero d16
vmov.i32 d16, #0x0
movw r0, :lower16:a
movt r0, :upper16:a
vld1.32 {d17}, [r0]
; compare two registers
vceq.i32 d17, d17, d16

But, because the vector is zero, and there is a NEON instruction to
compare against an immediate zero (VCEQZ), we could combine the two
instructions:

movw r0, :lower16:a
movt r0, :upper16:a
vld1.32 {d17}, [r0]
; compare two registers
vceq.i32 d17, d17, #0

thus, saving the VMOV.

I know, it's not much, but it's a good start for me to get the hand of
writing such passes.

This would be a nice optimization, and it's not a bad place to get started down in the depths of llvm codegen....

So, should I put this as a special case in NEON lowering or make it as
part of an optimization pass? Which classes should I look first?

I recommend implementing this as a target-specific DAG combine optimization. We already have target-specific DAG nodes for the relevant NEON comparison operations (ARMISD::VCEQ, etc. -- see ARMISelLowering.h) as well as the vmov (ARMISD::VMOVIMM). You just need to teach the DAG combiner how to fold them together. Here's what you need to do (all of this code is in ARMISelLowering.cpp):

0. (You don't actually need to do anything, but I'm just mentioning it FYI.) For selection DAG nodes that are not target-specific, you need to inform the DAG combiner that you want to do some target-specific combining. Look for calls to setTargetDAGCombine() for examples of this. For this case, the relevant nodes are all target-specific, so the DAG combiner will call the target-specific combining hook anyway.

1. Add the ARMISD::VCEQ etc. nodes to the switch in ARMTargetLowering::PerformDAGCombine.

2. Add a function to be called for those comparisons that checks if one operand is an ARMISD::VMOVIMM node with an immediate of zero. Note for future reference that the actual operand of VMOVIMM is an encoded value that represents one of the possible vector immediates for the "one register plus a modified immediate" format. In this case it doesn't matter because the canonical encoding of a zero vector is just zero. When you find that case, use DAG.getNode() to return a new node for the compare against zero operation. The PerformShiftCombine function is a fairly simple example of what needs to be done (although it's doing a completely different combination).

3. Write a testcase and make sure it works.

Thanks for offering to work on this!

Hi Bob,

I thought so... I'll get cracked and see if I can generate some simple tests.

Thank you very much for the detailed explanation!

cheers,
--renato

Is this related to Owen's patch r118453?

Evan

Is this related to Owen's patch r118453?

Oh, yes, it is indeed. I had forgotten about that already. Renato, this ought to be working already. Maybe you have an older version?

Bingo! Just updated and it's working now.

Thanks anyway.

--renato