Question about LLVM NEON intrinsics

Hi all,

I would like to know if LLVM Neon intrinsics are designed to support only 'Legal' types for NEON units.
Using llc -march=arm -mcpu=cortex-a9 vmax4.ll -o vmax4.s on following ll code:

; ModuleID = 'vmax.ll'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32"
target triple = "armv7-none-linux-androideabi"

define void @vmaxf32(<4 x float> *%C, <4 x float>* %A, <4 x float>* %B) nounwind {
    %tmp1 = load <4 x float>* %A
    %tmp2 = load <4 x float>* %B
    %tmp3 = call <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float> %tmp1, <4 x float> %tmp2)
    store <4 x float> %tmp3, <4 x float>* %C
    ret void
}

declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone

I've got following code generated:

...
vmaxf32: @ @vmaxf32
@ BB#0:
  vld1.64 {d16, d17}, [r2]
  vld1.64 {d18, d19}, [r1]
  vmax.f32 q8, q9, q8
  vst1.64 {d16, d17}, [r0]
  bx lr
...

Now if use <16 x float> vectors instead of <4 x float>:

define void @vmaxf32(<16 x float> *%C, <16 x float>* %A, <16 x float>* %B) nounwind {
    %tmp1 = load <16 x float>* %A
    %tmp2 = load <16 x float>* %B
    %tmp3 = call <16 x float> @llvm.arm.neon.vmaxs.v16f32(<16 x float> %tmp1, <16 x float> %tmp2)
    store <16 x float> %tmp3, <16 x float>* %C
    ret void
}

declare <16 x float> @llvm.arm.neon.vmaxs.v16f32(<16 x float>, <16 x float>) nounwind readnone

llc fails with following message:

SplitVectorResult #0: 0x2258350: v16f32 = llvm.arm.neon.vmaxs 0x2258250, 0x2258050, 0x2258150 [ORD=3] [ID=0]

LLVM ERROR: Do not know how to split the result of this operator!

Is it a BUG ? If yes I'm happy to get some directions on how I can fix it. If not I would like to know how to determine valid type for a given LLVM intrinsics.

Thanks for your answers
Best Regards
Seb

I may be wrong, but I don't think there is such a load intrinsic...

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348c/BABDCGGF.html

Hi all,

I would like to know if LLVM Neon intrinsics are designed to support only 'Legal' types for NEON units.
Using llc -march=arm -mcpu=cortex-a9 vmax4.ll -o vmax4.s on following ll code:

; ModuleID = 'vmax.ll'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32"
target triple = "armv7-none-linux-androideabi"

define void @vmaxf32(<4 x float> *%C, <4 x float>* %A, <4 x float>* %B) nounwind {
    %tmp1 = load <4 x float>* %A
    %tmp2 = load <4 x float>* %B
    %tmp3 = call <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float> %tmp1, <4 x float> %tmp2)
    store <4 x float> %tmp3, <4 x float>* %C
    ret void
}

declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone

I've got following code generated:

...
vmaxf32: @ @vmaxf32
@ BB#0:
        vld1.64 {d16, d17}, [r2]
        vld1.64 {d18, d19}, [r1]
        vmax.f32 q8, q9, q8
        vst1.64 {d16, d17}, [r0]
        bx lr
...

Now if use <16 x float> vectors instead of <4 x float>:

define void @vmaxf32(<16 x float> *%C, <16 x float>* %A, <16 x float>* %B) nounwind {
    %tmp1 = load <16 x float>* %A
    %tmp2 = load <16 x float>* %B
    %tmp3 = call <16 x float> @llvm.arm.neon.vmaxs.v16f32(<16 x float> %tmp1, <16 x float> %tmp2)
    store <16 x float> %tmp3, <16 x float>* %C
    ret void
}

declare <16 x float> @llvm.arm.neon.vmaxs.v16f32(<16 x float>, <16 x float>) nounwind readnone

llc fails with following message:

SplitVectorResult #0: 0x2258350: v16f32 = llvm.arm.neon.vmaxs 0x2258250, 0x2258050, 0x2258150 [ORD=3] [ID=0]

LLVM ERROR: Do not know how to split the result of this operator!

Is it a BUG ? If yes I'm happy to get some directions on how I can fix it.

No... platform-specific intrinsics have platform-specific semantics,
including what types they're defined for. NEON doesn't have 16 x float
vectors, at least not for that sort of operation.

If not I would like to know how to determine valid type for a given LLVM intrinsics.

The ARM reference manual is probably your best bet for ARM intrinsics.

-Eli

Hello Renato,

You're pointing me at ARM intrinsics related to loads, problem that I've reported in original e-mail, is not support for vector loads, but support for 'vmaxs'. For instance, there is no vector loads of 16 floats in ARM ISA but it is legal to write in LLVM:

; ModuleID = 'vadd.ll'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32"
target triple = "armv7-none-linux-androideabi"

define void @vaddf32(<16 x float> *%C, <16 x float>* %A, <16 x float>* %B) nounwind {
    %tmp1 = load <16 x float>* %A
    %tmp2 = load <16 x float>* %B
    %tmp3 = fadd <16 x float> %tmp1, %tmp2
    store <16 x float> %tmp3, <16 x float>* %C
    ret void
}

and llc generates following code:

vaddf32: @ @vaddf32
@ BB#0:
  add r12, r1, #48
  add r3, r2, #32
  vld1.64 {d20, d21}, [r3, :128]
  add r3, r2, #48
  vld1.64 {d16, d17}, [r2, :128]
  add r2, r2, #16
  vld1.64 {d18, d19}, [r1, :128]
  vld1.64 {d26, d27}, [r12, :128]
  add r12, r1, #32
  vld1.64 {d24, d25}, [r3, :128]
  add r1, r1, #16
  vadd.f32 q11, q9, q8
  vld1.64 {d28, d29}, [r12, :128]
  vadd.f32 q9, q13, q12
  vadd.f32 q8, q14, q10
  vld1.64 {d20, d21}, [r2, :128]
  vld1.64 {d24, d25}, [r1, :128]
  add r1, r0, #48
  vadd.f32 q10, q12, q10
  vst1.64 {d22, d23}, [r0, :128]
  vst1.64 {d18, d19}, [r1, :128]
  add r1, r0, #32
  add r0, r0, #16
  vst1.64 {d16, d17}, [r1, :128]
  vst1.64 {d20, d21}, [r0, :128]
  bx lr
.Ltmp0:
  .size vaddf32, .Ltmp0-vadd32

So 'fadd' instruction operating on vector of <16 x float> is legalized (scalarized) into 4 vadd.f32 instructions. My assumption was that same process could apply to NEON LLVM intrinsics such as 'vmaxs'. It doesn't seems to be the case so I'm wondering if this is an actual bug or if LLVM intrinsics are limited to legal types for the targeted architecture.
Note that however <16 x float> loads are not supported LLVM is able to generate them as a serie of vld1.i64 instructions.
Hope this clarify my request.

Best Regards
Seb

Hi Eli,

Thanks for the answer, it clarifies the situation for me. Do you know if there is Pass in LLVM that could be adapted to 'legalize' intrinsics calls ?
Or shall I define my own intrinsics for non supported types ?

Best Regards
Seb

Oh, yes, sorry.

Still, Eli is right, you can't assume generic IR will convert to
platform-specific intrinsics automagically.

This is not a bug, but could be a feature, if you want to write a NEON
validator pass that pattern-matches generic LLVM IR operations into
the respective (semantically correct) NEON intrinsics, or at least
leave the IR operations in a state that the back-end will recognize
it.

Honestly, I prefer the approach to have the front-end writing generic
IR and having target-specific passes that will change the generic IR
to target specific, so the back-end can deal with it. But it seems
that the front-ends had to deal with that, so far, including the ones
I wrote. :confused:

Hi Renato,

I guess one solution could be to define LLVM max intrinsic and have LLVM backends generating the appropriate instructions (using SSE inst for x86, NEON for ARM etc.).

Seb

That's a grey area... As Eli said, different back-ends have different
semantics, so trying to add generic intrinsics that will be converted
to target-specific intrinsics is bound to create semantic problems
when generating IR, or worse, silently producing bad code in the end.

Besides, if we're going to transform one intrinsic into another, it's
better to leave the current IR syntax as is (vector operations) and
transform that into target-specific intrinsics directly, or bail. IR
operations' semantics are better defined than intrinsics', so you
leave less room for silent codegen faults.

Hi all,

I would like to know if LLVM Neon intrinsics are designed to support only 'Legal' types for NEON units.
Using llc -march=arm -mcpu=cortex-a9 vmax4.ll -o vmax4.s on following ll code:

; ModuleID = 'vmax.ll'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32"
target triple = "armv7-none-linux-androideabi"

define void @vmaxf32(<4 x float> *%C, <4 x float>* %A, <4 x float>* %B) nounwind {
   %tmp1 = load <4 x float>* %A
   %tmp2 = load <4 x float>* %B
   %tmp3 = call <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float> %tmp1, <4 x float> %tmp2)
   store <4 x float> %tmp3, <4 x float>* %C
   ret void
}

declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone

I've got following code generated:

...
vmaxf32: @ @vmaxf32
@ BB#0:
       vld1.64 {d16, d17}, [r2]
       vld1.64 {d18, d19}, [r1]
       vmax.f32 q8, q9, q8
       vst1.64 {d16, d17}, [r0]
       bx lr
...

Now if use <16 x float> vectors instead of <4 x float>:

define void @vmaxf32(<16 x float> *%C, <16 x float>* %A, <16 x float>* %B) nounwind {
   %tmp1 = load <16 x float>* %A
   %tmp2 = load <16 x float>* %B
   %tmp3 = call <16 x float> @llvm.arm.neon.vmaxs.v16f32(<16 x float> %tmp1, <16 x float> %tmp2)
   store <16 x float> %tmp3, <16 x float>* %C
   ret void
}

declare <16 x float> @llvm.arm.neon.vmaxs.v16f32(<16 x float>, <16 x float>) nounwind readnone

llc fails with following message:

SplitVectorResult #0: 0x2258350: v16f32 = llvm.arm.neon.vmaxs 0x2258250, 0x2258050, 0x2258150 [ORD=3] [ID=0]

LLVM ERROR: Do not know how to split the result of this operator!

Is it a BUG ? If yes I'm happy to get some directions on how I can fix it.

No... platform-specific intrinsics have platform-specific semantics,
including what types they're defined for. NEON doesn't have 16 x float
vectors, at least not for that sort of operation.

Right.

These backend intrinsics are designed for support of the functions in arm_neon.h. Any use outside of that context is "there be dragons here" territory.

Hi Eli,

Thanks for the answer, it clarifies the situation for me. Do you know if there is Pass in LLVM that could be adapted to 'legalize' intrinsics calls ?
Or shall I define my own intrinsics for non supported types ?

You should never generate these sorts of intrinsics with non-legal types. It's the job of the front end to make sure that they are only called with legal types. Yes, this is different than normal LLVM IR.

Hi Jim,

Thanks for the answer, it confirms what I first thought.
Best Regards
Seb