FP Intrinsics

Hello,

I am trying to make the FP intrinsics (abs, sin, cos, sqrt) I've added work with the X86ISelPattern, but I'm having some difficulties understanding what needs to be done. I assume I have to add new nodetypes for the FP instructions to SelectionDAGNodes.h, and make nodes for these in SelectionDAGLowering::visitCall when I find the intrinsic...

The part I don't quite understand is what to do for targets that don't have these instructions (although I'm only interested in X86 myself, I would like to see these patches in the official LLVM version as it's some work to maintain them) -- for me it would make most sense to lower the intrinsic to a call if it's not supported. However I notice that for other intrinsics (memcpy etc.) this is done in LegalizeDAG where the node is expanded to a call if it's not directly supported for the target.

To illustrate what I have been doing so far I attach the diff to this mail -- if someone could have a look at it and tell me what needs to be done to get it working I would be very grateful... (Note that if you use the X86ISelSimple it's already working...)

m.

temp.patch (11.7 KB)

Update: I have been working on this all day, and I finally got it working more or less with the pattern instruction selector... However, the generated code is not very good, and I haven't implemented the expand to calls if the target does not support these FP instructions.

As an example, in the following function the sub abs and compare compiles to 13 instructions! Also it has changed the result of the abs to be a 64 bit double when it stores it, while my intention was to get a 32 bit float.
For comparision I included the code generated by the X86ISelSimple which is only 8 instructions... it seems the compare is being generated twice in two different ways with the pattern selector?!

I also attached the diff of my current version

m.

internal void %EvaluatePoint3D65() {
EntryBlock:
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 0)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 1)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 2)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 3)
  %Value = call float %ReadVoxel( void* %_hVB59 ) ; <float> [#uses=1]
  %arg2 = load float* %T64 ; <float> [#uses=1]
  %VMCommandSubtract = sub float %Value, %arg2 ; <float> [#uses=1]
  %VMCommandAbs = call float %llvm.abs( float %VMCommandSubtract ) ; <float> [#uses=2]
  %isNonZero = setgt float %VMCommandAbs, 0.000000e+000 ; <bool> [#uses=1]
  br bool %isNonZero, label %NonZero, label %Zero

NonZero: ; preds = %EntryBlock
  call void %Shader1DLookupLinear( <4 x float>* %_ARGB56, void* %_hS60, float %VMCommandAbs, void* %_hContext3D58 )
  ret void

Zero: ; preds = %EntryBlock
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 0)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 1)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 2)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 3)
  ret void
}

Generated with ISelPattern:

17160410 sub esp,1Ch
17160413 mov dword ptr ds:[161D6240h],0
1716041D mov dword ptr ds:[161D6244h],0
17160427 mov dword ptr ds:[161D6248h],0
17160431 mov dword ptr ds:[161D624Ch],0
1716043B mov eax,76E4560h
17160440 mov dword ptr [esp],eax
17160443 call HueVMReadCommands_LLVMReadVoxel (19BB229h)
17160448 fsub dword ptr ds:[161D6280h]
1716044E fabs
17160450 fst qword ptr [esp+14h]
17160454 ftst
17160456 fstp st(0)
17160458 fnstsw ax
1716045A sahf
1716045B fldz
1716045D fchs
1716045F fld qword ptr [esp+14h]
17160463 fucomip st,st(1)
17160465 fstp st(0)
17160467 jbe 17160498
1716046D mov eax,76E4F60h
17160472 mov dword ptr [esp+0Ch],eax
17160476 fld qword ptr [esp+14h]
1716047A fstp dword ptr [esp+8]
1716047E mov eax,15900060h
17160483 mov dword ptr [esp+4],eax
17160487 mov eax,161D6240h
1716048C mov dword ptr [esp],eax
1716048F call HueVMShaderCommands_LLVMShader1DLookupLinear (19D8B76h)
17160494 add esp,1Ch
17160497 ret
17160498 mov dword ptr ds:[161D6240h],0
171604A2 mov dword ptr ds:[161D6244h],0
171604AC mov dword ptr ds:[161D6248h],0
171604B6 mov dword ptr ds:[161D624Ch],0
171604C0 add esp,1Ch
171604C3 ret

Generated with ISelSimple:

17160440 sub esp,1Ch
17160443 mov eax,16237200h
17160448 mov dword ptr [eax],0
1716044E mov eax,16237200h
17160453 mov dword ptr [eax+4],0
1716045A mov eax,16237200h
1716045F mov dword ptr [eax+8],0
17160466 mov eax,16237200h
1716046B mov dword ptr [eax+0Ch],0
17160472 mov eax,19BB229h
17160477 mov ecx,76E4560h
1716047C mov dword ptr [esp],ecx
1716047F call eax
17160481 fsub dword ptr ds:[16237240h]
17160487 fabs
17160489 fst qword ptr [esp+14h]
1716048D ftst
1716048F fstp st(0)
17160491 fnstsw ax
17160493 sahf
17160494 jbe 171604C7
1716049A mov eax,19D8B76h
1716049F mov ecx,16237200h
171604A4 mov dword ptr [esp],ecx
171604A7 mov ecx,15900060h
171604AC mov dword ptr [esp+4],ecx
171604B0 fld qword ptr [esp+14h]
171604B4 fstp dword ptr [esp+8]
171604B8 mov ecx,76E4F60h
171604BD mov dword ptr [esp+0Ch],ecx
171604C1 call eax
171604C3 add esp,1Ch
171604C6 ret
171604C7 mov eax,16237200h
171604CC mov dword ptr [eax],0
171604D2 mov eax,16237200h
171604D7 mov dword ptr [eax+4],0
171604DE mov eax,16237200h
171604E3 mov dword ptr [eax+8],0
171604EA mov eax,16237200h
171604EF mov dword ptr [eax+0Ch],0
171604F6 add esp,1Ch
171604F9 ret

temp.patch (13.6 KB)

Hello,

I am trying to make the FP intrinsics (abs, sin, cos, sqrt) I've added work with the X86ISelPattern, but I'm having some difficulties understanding what needs to be done.

Cool. Here are a couple of requests:

1. I don't think we need an "llvm.abs" intrinsic at the llvm level. This
    can be modeled as a pair of setcc/select instructions.
2. My objection to llvm.abs does not apply to the FP_ABS
    node you added: it is fine. Additionally, the target-independent
    selection dag machinery should be the one that notices the relevant
    setcc/select pairs at the llvm level to fabricate the FP_ABS node.
3. On X86 at least, sin and cos are not defined over the full numeric
    range. These instructions are useful for applications like yours, and
    situations where a flag like "-ffast-math" has been provided. Because
    of this, please name the intrinsics and nodes sin_approx and
    cos_approx. I don't think that sqrt on the X86 has this limitation,
    so its intrinsic can be named just "llvm.sqrt".
4. Don't forget a doc patch to docs/LangRef.html :slight_smile:

I assume I have to add new nodetypes for the FP instructions to SelectionDAGNodes.h, and make nodes for these in SelectionDAGLowering::visitCall when I find the intrinsic...

This looks good.

The part I don't quite understand is what to do for targets that don't have these instructions (although I'm only interested in X86 myself, I would like to see these patches in the official LLVM version as it's some work to maintain them)

Ok, sounds good.

-- for me it would make most sense to lower the intrinsic to a call if it's not supported. However I notice that for other intrinsics (memcpy etc.) this is done in LegalizeDAG where the node is expanded to a call if it's not directly supported for the target.

Yup, this is what we want to do. There are two places to implement this: LegalizeDAG for the SelectionDAG isels, and lib/CodeGen/IntrinsicLowering.cpp for other isels.

To illustrate what I have been doing so far I attach the diff to this mail -- if someone could have a look at it and tell me what needs to be done to get it working I would be very grateful... (Note that if you use the X86ISelSimple it's already working...)

One of the problems that jumps out at me in the pattern version is that you have code that looks like this:

+ case Intrinsic::sqrt:
+ setValue(&I, DAG.getNode(ISD::FP_SQRT, MVT::f64,
+ getValue(I.getOperand(1))));

In your case, you're passing in an float, which is 32-bits. You probably want the type argument for this to be whatever the input type is, not hard coded to f64.

-Chris

Update: I have been working on this all day, and I finally got it working more or less with the pattern instruction selector... However, the generated code is not very good, and I haven't implemented the expand to calls if the target does not support these FP instructions.

Oh, I see you fixed the F64 thing, sorry, disregard that part of the previous email :-/

As an example, in the following function the sub abs and compare compiles to 13 instructions! Also it has changed the result of the abs to be a 64 bit double when it stores it, while my intention was to get a 32 bit float.
For comparision I included the code generated by the X86ISelSimple which is only 8 instructions... it seems the compare is being generated twice in two different ways with the pattern selector?!

I'm not sure if the specific problem you're talking about here is the comparison-against-zero issue, or something else. Can you reduce it to a smaller test case or annotate the source with the problem you're hitting?

Thanks,

-Chris

I also attached the diff of my current version

m.

internal void %EvaluatePoint3D65() {
EntryBlock:
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 0)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 1)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 2)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 3)
  %Value = call float %ReadVoxel( void* %_hVB59 ) ; <float> [#uses=1]
  %arg2 = load float* %T64 ; <float> [#uses=1]
  %VMCommandSubtract = sub float %Value, %arg2 ; <float> [#uses=1]
  %VMCommandAbs = call float %llvm.abs( float %VMCommandSubtract ) ; <float> [#uses=2]
  %isNonZero = setgt float %VMCommandAbs, 0.000000e+000 ; <bool> [#uses=1]
  br bool %isNonZero, label %NonZero, label %Zero

NonZero: ; preds = %EntryBlock
  call void %Shader1DLookupLinear( <4 x float>* %_ARGB56, void* %_hS60, float %VMCommandAbs, void* %_hContext3D58 )
  ret void

Zero: ; preds = %EntryBlock
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 0)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 1)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 2)
  store float 0.000000e+000, float* getelementptr ([4 x float]* cast (<4 x float>* %_ARGB56 to [4 x float]*), int 0, int 3)
  ret void
}

Generated with ISelPattern:

17160410 sub esp,1Ch
17160413 mov dword ptr ds:[161D6240h],0
1716041D mov dword ptr ds:[161D6244h],0
17160427 mov dword ptr ds:[161D6248h],0
17160431 mov dword ptr ds:[161D624Ch],0
1716043B mov eax,76E4560h
17160440 mov dword ptr [esp],eax
17160443 call HueVMReadCommands_LLVMReadVoxel (19BB229h)
17160448 fsub dword ptr ds:[161D6280h]
1716044E fabs
17160450 fst qword ptr [esp+14h]
17160454 ftst
17160456 fstp st(0)
17160458 fnstsw ax
1716045A sahf
1716045B fldz
1716045D fchs
1716045F fld qword ptr [esp+14h]
17160463 fucomip st,st(1)
17160465 fstp st(0)
17160467 jbe 17160498
1716046D mov eax,76E4F60h
17160472 mov dword ptr [esp+0Ch],eax
17160476 fld qword ptr [esp+14h]
1716047A fstp dword ptr [esp+8]
1716047E mov eax,15900060h
17160483 mov dword ptr [esp+4],eax
17160487 mov eax,161D6240h
1716048C mov dword ptr [esp],eax
1716048F call HueVMShaderCommands_LLVMShader1DLookupLinear (19D8B76h)
17160494 add esp,1Ch
17160497 ret
17160498 mov dword ptr ds:[161D6240h],0
171604A2 mov dword ptr ds:[161D6244h],0
171604AC mov dword ptr ds:[161D6248h],0
171604B6 mov dword ptr ds:[161D624Ch],0
171604C0 add esp,1Ch
171604C3 ret

Generated with ISelSimple:

17160440 sub esp,1Ch
17160443 mov eax,16237200h
17160448 mov dword ptr [eax],0
1716044E mov eax,16237200h
17160453 mov dword ptr [eax+4],0
1716045A mov eax,16237200h
1716045F mov dword ptr [eax+8],0
17160466 mov eax,16237200h
1716046B mov dword ptr [eax+0Ch],0
17160472 mov eax,19BB229h
17160477 mov ecx,76E4560h
1716047C mov dword ptr [esp],ecx
1716047F call eax
17160481 fsub dword ptr ds:[16237240h]
17160487 fabs
17160489 fst qword ptr [esp+14h]
1716048D ftst
1716048F fstp st(0)
17160491 fnstsw ax
17160493 sahf
17160494 jbe 171604C7
1716049A mov eax,19D8B76h
1716049F mov ecx,16237200h
171604A4 mov dword ptr [esp],ecx
171604A7 mov ecx,15900060h
171604AC mov dword ptr [esp+4],ecx
171604B0 fld qword ptr [esp+14h]
171604B4 fstp dword ptr [esp+8]
171604B8 mov ecx,76E4F60h
171604BD mov dword ptr [esp+0Ch],ecx
171604C1 call eax
171604C3 add esp,1Ch
171604C6 ret
171604C7 mov eax,16237200h
171604CC mov dword ptr [eax],0
171604D2 mov eax,16237200h
171604D7 mov dword ptr [eax+4],0
171604DE mov eax,16237200h
171604E3 mov dword ptr [eax+8],0
171604EA mov eax,16237200h
171604EF mov dword ptr [eax+0Ch],0
171604F6 add esp,1Ch
171604F9 ret

-Chris

Chris Lattner wrote:

I am trying to make the FP intrinsics (abs, sin, cos, sqrt) I've added work with the X86ISelPattern, but I'm having some difficulties understanding what needs to be done.

Cool. Here are a couple of requests:

1. I don't think we need an "llvm.abs" intrinsic at the llvm level. This
   can be modeled as a pair of setcc/select instructions.

OK, I'm not exactly sure where and how to do this matching of setcc/select - can you give me some pointers?

2. My objection to llvm.abs does not apply to the FP_ABS
   node you added: it is fine. Additionally, the target-independent
   selection dag machinery should be the one that notices the relevant
   setcc/select pairs at the llvm level to fabricate the FP_ABS node.

OK

3. On X86 at least, sin and cos are not defined over the full numeric
   range. These instructions are useful for applications like yours, and
   situations where a flag like "-ffast-math" has been provided. Because
   of this, please name the intrinsics and nodes sin_approx and
   cos_approx. I don't think that sqrt on the X86 has this limitation,
   so its intrinsic can be named just "llvm.sqrt".

I think it makes more sense to have the intrinsics as is, but do the code generation in the X86 target different depending on some command line flag. For the pattern ISel, this simply amounts to registering the FP_SIN and FP_COS
as nodes which need to be expanded (to calls) if not fast-math is set... What do you think about this approach?

4. Don't forget a doc patch to docs/LangRef.html :slight_smile:

I assume I have to add new nodetypes for the FP instructions to SelectionDAGNodes.h, and make nodes for these in SelectionDAGLowering::visitCall when I find the intrinsic...

OK, I will.

-- for me it would make most sense to lower the intrinsic to a call if it's not supported. However I notice that for other intrinsics (memcpy etc.) this is done in LegalizeDAG where the node is expanded to a call if it's not directly supported for the target.

Yup, this is what we want to do. There are two places to implement this: LegalizeDAG for the SelectionDAG isels, and lib/CodeGen/IntrinsicLowering.cpp for other isels.

Why not do it in SelectionDAGLowering::visitCall? The way I have implemented it now, this calls TLI.hasNativeSupportForOperation to see if it (for example FP_SQRT) is a legal operation on the target, and if not it sets RenameFn to "sin" and simply goes ahead with generating the call. This is a lot less code than doing it in LegalizeDAG, and also more efficient since it's not first lowered to an instruction and then expanded to a call.

m.

> I am trying to make the FP intrinsics (abs, sin, cos, sqrt) I've added work
> with the X86ISelPattern, but I'm having some difficulties understanding what
> needs to be done.

I have some notes for Alpha below, if you are so inclined :slight_smile:

1. I don't think we need an "llvm.abs" intrinsic at the llvm level. This
    can be modeled as a pair of setcc/select instructions.

I agree.

2. My objection to llvm.abs does not apply to the FP_ABS
    node you added: it is fine. Additionally, the target-independent
    selection dag machinery should be the one that notices the relevant
    setcc/select pairs at the llvm level to fabricate the FP_ABS node.

A FP_ABS node is nice, alpha has abs (cyps F31, src, dst)

3. On X86 at least, sin and cos are not defined over the full numeric
    range. These instructions are useful for applications like yours, and
    situations where a flag like "-ffast-math" has been provided. Because
    of this, please name the intrinsics and nodes sin_approx and
    cos_approx. I don't think that sqrt on the X86 has this limitation,
    so its intrinsic can be named just "llvm.sqrt".

I think it is the same with alpha's sqrt. Though sin and cos will have
to be lowered.

> The part I don't quite understand is what to do for targets that don't have
> these instructions (although I'm only interested in X86 myself, I would like
> to see these patches in the official LLVM version as it's some work to
> maintain them)

So alpha has sqrt and abs, but not sin and cos. Once you have the
SelectionDag making the nodes for the intrinsics, let me know.

Andrew

Chris Lattner wrote:

I am trying to make the FP intrinsics (abs, sin, cos, sqrt) I've added work with the X86ISelPattern, but I'm having some difficulties understanding what needs to be done.

Cool. Here are a couple of requests:

1. I don't think we need an "llvm.abs" intrinsic at the llvm level. This
   can be modeled as a pair of setcc/select instructions.

OK, I'm not exactly sure where and how to do this matching of setcc/select - can you give me some pointers?

X = abs(Y) is the same as:

cond = setlt Y, 0
tmp = sub -0.0, Y
X = select cond, tmp, Y

3. On X86 at least, sin and cos are not defined over the full numeric
   range. These instructions are useful for applications like yours, and
   situations where a flag like "-ffast-math" has been provided. Because
   of this, please name the intrinsics and nodes sin_approx and
   cos_approx. I don't think that sqrt on the X86 has this limitation,
   so its intrinsic can be named just "llvm.sqrt".

I think it makes more sense to have the intrinsics as is, but do the code generation in the X86 target different depending on some command line flag. For the pattern ISel, this simply amounts to registering the FP_SIN and FP_COS as nodes which need to be expanded (to calls) if not fast-math is set... What do you think about this approach?

This is fine with me at the code generator level. If you choose to do this, then there is no reason to make sin/cos intrinsics. The code generator might as well just recognize a call to an external function named sin/cos and generate the FP_SIN/FP_COS node when appropriate.

Note that we still need llvm.sqrt, because llvm.sqrt (in contrast to sqrt) does not set errno.

4. Don't forget a doc patch to docs/LangRef.html :slight_smile:

I assume I have to add new nodetypes for the FP instructions to SelectionDAGNodes.h, and make nodes for these in SelectionDAGLowering::visitCall when I find the intrinsic...

OK, I will.

-- for me it would make most sense to lower the intrinsic to a call if it's not supported. However I notice that for other intrinsics (memcpy etc.) this is done in LegalizeDAG where the node is expanded to a call if it's not directly supported for the target.

Yup, this is what we want to do. There are two places to implement this: LegalizeDAG for the SelectionDAG isels, and lib/CodeGen/IntrinsicLowering.cpp for other isels.

Why not do it in SelectionDAGLowering::visitCall? The way I have implemented it now, this calls TLI.hasNativeSupportForOperation to see if it (for example FP_SQRT) is a legal operation on the target, and if not it sets RenameFn to "sin" and simply goes ahead with generating the call. This is a lot less code than doing it in LegalizeDAG, and also more efficient since it's not first lowered to an instruction and then expanded to a call.

I admit that it's not a huge difference, but the idea behind the legalize phase in general is basically to allow optimization on the unlegalized representation as well as on the legalized representation. For example, it would make sense to implement constant folding for FP_SIN nodes. If you did this, then only targets that natively support sin would perform the constant folding, targets that didn't wouldn't (unless the code to do all of the folding was duplicated in lower call).

This all makes a much bigger difference when talking about breaking up 64-bit operations for 32-bit hosts, or promoting 8-bit operations for 32-bit RISC-like hosts, but the principal still applies to sin/cos lowering I think.

-Chris