Using CallingConvLower in ARM target

Attached is a prototype patch that uses CCState to lower RET nodes in
the ARM target. Lowering CALL nodes will come later.

This patch does not handle f64 and i64 types. For these types, it
would be ideal to request the conversions below:

def RetCC_ARM_APCS : CallingConv<[
  CCIfType<[f32], CCBitConvertToType<i32>>,
  CCIfType<[f64], CCBitConvertToType<i64>>,
  CCIfType<[i64], CCExtractElements<2, i32>>,

  CCIfType<[i32], CCAssignToReg<[R0, R1]>>
]>;

The problem is that i64 handling requires splitting into 2 x i32
registers. I am not sure how to build CCExtractElements as shown. The
current organization of CCState::AnalyzeReturn does not allow for
reissuing of the RET with an altered set of operands which is the
solution used elsewhere. Can anyone suggest a better way to express
this lowering?

Deep

arm_ccstate.diff (9.95 KB)

ARMCallingConv.td (1.8 KB)

Attached is a prototype patch that uses CCState to lower RET nodes in
the ARM target. Lowering CALL nodes will come later.

This patch does not handle f64 and i64 types. For these types, it
would be ideal to request the conversions below:

i64 isn't Legal on ARM, so it should already be handled.

def RetCC_ARM_APCS : CallingConv<[
CCIfType<[f32], CCBitConvertToType<i32>>,
CCIfType<[f64], CCBitConvertToType<i64>>,
CCIfType<[i64], CCExtractElements<2, i32>>,

CCIfType<[i32], CCAssignToReg<[R0, R1]>>
]>;

The problem is that i64 handling requires splitting into 2 x i32
registers. I am not sure how to build CCExtractElements as shown. The
current organization of CCState::AnalyzeReturn does not allow for
reissuing of the RET with an altered set of operands which is the
solution used elsewhere. Can anyone suggest a better way to express
this lowering?

One problem with this approach is that since i64 isn't legal, the
bitcast would require custom C++ code in the ARM target to
handle properly. It might make sense to introduce something
like

    CCIfType<[f64], CCCustom>

where CCCustom is a new entity that tells the calling convention
code to to let the target do something not easily representable
in the tablegen minilanguage.

Dan

Hi Sandeep,

Dan's idea of adding CC custom lowering support is sound. Can you add that support?

Your patch looks good. I look forward to a more complete patch. But I do have a nitpick:

Index: include/llvm/CodeGen/CallingConvLower.h

I am thinking that this requires two changes: add a flag to
CCValAssign (take a bit from HTP) to indicate isCustom and a way to
author an arbitrary CCAction by including the source directly in the
TableGen mini-language. This latter change might want a generic change
to the TableGen language. For example, the syntax might be like:

class foo : CCCustomAction {
code <<< EOF
      ....multi-line C++ code goes here that allocates regs & mem and
sets CCValAssign::isCustom....
EOF
}

Does this seem reasonable? An alternative is for CCCustom to take a
string that names a function to be called:

CCIfType<[f64], CCCustom<"MyCustomLoweringFunc">>

the function signature for such functions will have to return two
results: if the CC processing is finished and if it the func succeeded
or failed:

typedef bool CCCustomFn(unsigned ValNo, MVT ValVT,
                        MVT LocVT, CCValAssign::LocInfo LocInfo,
                        ISD::ArgFlagsTy ArgFlags, CCState &State, bool &result);

One problem with this approach is that since i64 isn't legal, the
bitcast would require custom C++ code in the ARM target to
handle properly. It might make sense to introduce something
like

  CCIfType<[f64], CCCustom>

where CCCustom is a new entity that tells the calling convention
code to to let the target do something not easily representable
in the tablegen minilanguage.

I am thinking that this requires two changes: add a flag to
CCValAssign (take a bit from HTP) to indicate isCustom and a way to
author an arbitrary CCAction by including the source directly in the
TableGen mini-language. This latter change might want a generic change
to the TableGen language. For example, the syntax might be like:

class foo : CCCustomAction {
code <<< EOF
     ....multi-line C++ code goes here that allocates regs & mem and
sets CCValAssign::isCustom....
EOF
}

Does this seem reasonable? An alternative is for CCCustom to take a
string that names a function to be called:

CCIfType<[f64], CCCustom<"MyCustomLoweringFunc">>

the function signature for such functions will have to return two
results: if the CC processing is finished and if it the func succeeded
or failed:

I like the second solution better. It seems rather cumbersome to embed multi-line c++ code in td files.

Evan

I think I've got all the cases handled now, implementing with
CCCustom<"foo"> callbacks into C++.

This also fixes a crash when returning i128. I've also included a
small asm constraint fix that was needed to build newlib.

deep

arm_callingconv.diff (53 KB)

Thanks Sandeep. I did a quick scan, this looks really good. But I do have a question:

+/// CCCustomFn - This function assigns a location for Val, possibly updating
+/// all args to reflect changes and indicates if it handled it. It must set
+/// isCustom if it handles the arg and returns true.
+typedef bool CCCustomFn(unsigned &ValNo, MVT &ValVT,
+ MVT &LocVT, CCValAssign::LocInfo &LocInfo,
+ ISD::ArgFlagsTy &ArgFlags, CCState &State,
+ bool &result);

Is it necessary to return two bools (the second is returned by reference in 'result')? I am confused about the semantics of 'result'.

Also, a nitpick:

+ unsigned i;
+ for (i = 0; i < 4; ++i)

The convention we use is:

+ for (unsigned i = 0; i < 4; ++i)

Thanks,

Evan

Although it's not generally needed for ARM's use of CCCustom, I return
two bools to handle the four possible outcomes to keep the mechanism
flexible:

* if CCCustomFn handled the arg or not
* if CCCustomFn wants to end processing of the arg or not

I placed the "unsigned i" outside those loops because i is used after
the loop. If there's a better index search pattern, I'd be happy to
change it.

Attached is an updated patch against HEAD that has DebugLoc changes. I
also split out the ARMAsmPrinter fix into it's own patch.

deep

arm_callingconv.diff (54.1 KB)

arm_fixes.diff (590 Bytes)

Although it's not generally needed for ARM's use of CCCustom, I return
two bools to handle the four possible outcomes to keep the mechanism
flexible:

* if CCCustomFn handled the arg or not
* if CCCustomFn wants to end processing of the arg or not

+/// CCCustomFn - This function assigns a location for Val, possibly updating
+/// all args to reflect changes and indicates if it handled it. It must set
+/// isCustom if it handles the arg and returns true.
+typedef bool CCCustomFn(unsigned &ValNo, MVT &ValVT,
+ MVT &LocVT, CCValAssign::LocInfo &LocInfo,
+ ISD::ArgFlagsTy &ArgFlags, CCState &State,
+ bool &result);

Is "result" what you refer to as "isCustom" in the comments?

Sorry, I am still confused. You mean it could return true but set 'result' to false? That means it has handled the argument but it would not process any more arguments? What scenario do you envision that this will be useful? I'd rather keep it simple.

I placed the "unsigned i" outside those loops because i is used after
the loop. If there's a better index search pattern, I'd be happy to
change it.

Ok.

One more nitpick:

+/// CCCustom - calls a custom arg handling function

Please capitalize "calls" and end with a period.

Thanks,

Evan

Although it's not generally needed for ARM's use of CCCustom, I return
two bools to handle the four possible outcomes to keep the mechanism
flexible:

* if CCCustomFn handled the arg or not
* if CCCustomFn wants to end processing of the arg or not

+/// CCCustomFn - This function assigns a location for Val, possibly
updating
+/// all args to reflect changes and indicates if it handled it. It
must set
+/// isCustom if it handles the arg and returns true.
+typedef bool CCCustomFn(unsigned &ValNo, MVT &ValVT,
+ MVT &LocVT, CCValAssign::LocInfo &LocInfo,
+ ISD::ArgFlagsTy &ArgFlags, CCState &State,
+ bool &result);

Is "result" what you refer to as "isCustom" in the comments?

Sorry, I am still confused. You mean it could return true but set
'result' to false? That means it has handled the argument but it would
not process any more arguments? What scenario do you envision that
this will be useful? I'd rather keep it simple.

As you note there are three actual legitimate cases (of the four combos):

1. The CCCustomFn wants the arg handling to proceed. This might be
used akin to CCPromoteToType.
2. The CCCustomFn entirely handled the arg. This might be used akin to
CCAssignToReg.
3. The CCCustomFn tried to handle the arg, but failed.

these results are conveyed the following ways:

1. The CCCustomFn returns false, &result is not used.
2. The CCCustomFn returns true, &result is false;
3. The CCCustomFn returns true, &result is true.

I tried to keep these CCCustomFns looking like TableGen generated
code. Suggestions of how to reorganize these results are welcome. :slight_smile:
Perhaps better comments around the typedef for CCCustomFn would
suffice?

The isCustom flag is simply a means for this machinery to convey to
the TargetLowering functions to process this arg specially. It may not
always be possible for the TargetLowering functions to determine that
the arg needs special handling after all the changes made by the
CCCustomFn or CCPromoteToType and other transformations.

I placed the "unsigned i" outside those loops because i is used after
the loop. If there's a better index search pattern, I'd be happy to
change it.

Ok.

One more nitpick:

+/// CCCustom - calls a custom arg handling function

Please capitalize "calls" and end with a period.

Once we settle on the result handling changes, I'll submit an update
with this change.

Although it's not generally needed for ARM's use of CCCustom, I return
two bools to handle the four possible outcomes to keep the mechanism
flexible:

* if CCCustomFn handled the arg or not
* if CCCustomFn wants to end processing of the arg or not

+/// CCCustomFn - This function assigns a location for Val, possibly
updating
+/// all args to reflect changes and indicates if it handled it. It
must set
+/// isCustom if it handles the arg and returns true.
+typedef bool CCCustomFn(unsigned &ValNo, MVT &ValVT,
+ MVT &LocVT, CCValAssign::LocInfo &LocInfo,
+ ISD::ArgFlagsTy &ArgFlags, CCState &State,
+ bool &result);

Is "result" what you refer to as "isCustom" in the comments?

Sorry, I am still confused. You mean it could return true but set
'result' to false? That means it has handled the argument but it would
not process any more arguments? What scenario do you envision that
this will be useful? I'd rather keep it simple.

As you note there are three actual legitimate cases (of the four combos):

1. The CCCustomFn wants the arg handling to proceed. This might be
used akin to CCPromoteToType.
2. The CCCustomFn entirely handled the arg. This might be used akin to
CCAssignToReg.
3. The CCCustomFn tried to handle the arg, but failed.

these results are conveyed the following ways:

1. The CCCustomFn returns false, &result is not used.
2. The CCCustomFn returns true, &result is false;
3. The CCCustomFn returns true, &result is true.

I don't think we want to support #1. If the target want to add custom code to handle an argument, if should be responsible for outputting legal code. Is there an actual need to support #1?

Evan

ARMTargetLowering doesn't need case #1, but it seemed like you and Dan
wanted a more generic way to inject C++ code into the process so I
tried to make the mechanism a bit more general.

deep

ARMTargetLowering doesn't need case #1, but it seemed like you and Dan
wanted a more generic way to inject C++ code into the process so I
tried to make the mechanism a bit more general.

Ok. Since ARM doesn't need it and it's the only client, I'd much rather have CCCustomFn just return a single bool indicating whether it can handle the arg. Would that be ok?

Thanks,

Evan

Sure. Updated patches attached.

deep

arm_callingconv.diff (53.4 KB)

arm_fixes.diff (590 Bytes)

Sorry left a small bit of cruft in ARMCallingConv.td. A corrected
patch it attached.

deep

arm_callingconv.diff (53.1 KB)

arm_fixes.diff (590 Bytes)

Thanks.

More questions :slight_smile:

    /// Information about how the value is assigned.
- LocInfo HTP : 7;
+ LocInfo HTP : 6;

Do you know why this change is needed? Are we running out of bits?

- NeededStackSize = 4;
- break;
- case MVT::i64:
- case MVT::f64:
- if (firstGPR < 3)
- NeededGPRs = 2;
- else if (firstGPR == 3) {
- NeededGPRs = 1;
- NeededStackSize = 4;
- } else
- NeededStackSize = 8;
+ State.addLoc(CCValAssign::getCustomMem(ValNo, ValVT,
+ State.AllocateStack(4, 4),
+ MVT::i32, LocInfo));
+ return true; // we handled it

Your change isn't handling the "NeededStackSize = 8" case.

++ static const unsigned HiRegList[] = { ARM::R0, ARM::R2 };
+ static const unsigned LoRegList[] = { ARM::R1, ARM::R3 };

   /// Information about how the value is assigned.
- LocInfo HTP : 7;
+ LocInfo HTP : 6;

Do you know why this change is needed? Are we running out of bits?

HTP was't using all of these bits. I needed the hasCustom bit to come
from somewhere unless we wanted to grow this struct, so I grabbed a
bit from HTP.

- NeededStackSize = 4;
- break;
- case MVT::i64:
- case MVT::f64:
- if (firstGPR < 3)
- NeededGPRs = 2;
- else if (firstGPR == 3) {
- NeededGPRs = 1;
- NeededStackSize = 4;
- } else
- NeededStackSize = 8;
+ State.addLoc(CCValAssign::getCustomMem(ValNo, ValVT,
+ State.AllocateStack(4, 4),
+ MVT::i32, LocInfo));
+ return true; // we handled it

Your change isn't handling the "NeededStackSize = 8" case.

I believe it is. I've attached two additional test cases. The
difference is that this case isn't handled by the CCCustomFns. They
fail to allocate any regs and then handling falls through to an
CCAssignToStack in ARMCallingConv.td. This is how other targets handle
similar allocations.

++ static const unsigned HiRegList[] = { ARM::R0, ARM::R2 };
+ static const unsigned LoRegList[] = { ARM::R1, ARM::R3 };
+
+ if (unsigned Reg = State.AllocateReg(HiRegList, LoRegList, 2)) {
+ unsigned i;
+ for (i = 0; i < 2; ++i)
+ if (HiRegList[i] == Reg)
+ break;
+
+ State.addLoc(CCValAssign::getCustomReg(ValNo, ValVT, Reg,
+ MVT::i32, LocInfo));
+ State.addLoc(CCValAssign::getCustomReg(ValNo, ValVT, LoRegList[i],
+ MVT::i32, LocInfo));

Since 'i' is used after the loop, please choose a better variable name.

Actually, is the loop necessary? We know the low register is always
one after the high register. Perhaps you can use
ARMRegisterInfo::getRegisterNumbering(Reg), add one to 1. And the
lookup the register enum with a new function (something like
getRegFromRegisterNum(RegNo, ValVT)).

The patch is looking good. I need to run it through some more tests.
Unfortunately ARM target is a bit broken right now. I hope to fix it
today.

I'll submit a revised patch after we've settled on the NeededStackSize=8 issue.

deep

This time with the test cases actually attached.

deep

arm_stack64_tests.diff (1.06 KB)

  /// Information about how the value is assigned.
- LocInfo HTP : 7;
+ LocInfo HTP : 6;

Do you know why this change is needed? Are we running out of bits?

HTP was't using all of these bits. I needed the hasCustom bit to come
from somewhere unless we wanted to grow this struct, so I grabbed a
bit from HTP.

- NeededStackSize = 4;
- break;
- case MVT::i64:
- case MVT::f64:
- if (firstGPR < 3)
- NeededGPRs = 2;
- else if (firstGPR == 3) {
- NeededGPRs = 1;
- NeededStackSize = 4;
- } else
- NeededStackSize = 8;
+ State.addLoc(CCValAssign::getCustomMem(ValNo, ValVT,
+ State.AllocateStack(4, 4),
+ MVT::i32, LocInfo));
+ return true; // we handled it

Your change isn't handling the "NeededStackSize = 8" case.

I believe it is. I've attached two additional test cases. The
difference is that this case isn't handled by the CCCustomFns. They
fail to allocate any regs and then handling falls through to an
CCAssignToStack in ARMCallingConv.td. This is how other targets handle
similar allocations.

Ok.

++ static const unsigned HiRegList[] = { ARM::R0, ARM::R2 };
+ static const unsigned LoRegList[] = { ARM::R1, ARM::R3 };
+
+ if (unsigned Reg = State.AllocateReg(HiRegList, LoRegList, 2)) {
+ unsigned i;
+ for (i = 0; i < 2; ++i)
+ if (HiRegList[i] == Reg)
+ break;
+
+ State.addLoc(CCValAssign::getCustomReg(ValNo, ValVT, Reg,
+ MVT::i32, LocInfo));
+ State.addLoc(CCValAssign::getCustomReg(ValNo, ValVT, LoRegList[i],
+ MVT::i32, LocInfo));

Since 'i' is used after the loop, please choose a better variable name.

Actually, is the loop necessary? We know the low register is always
one after the high register. Perhaps you can use
ARMRegisterInfo::getRegisterNumbering(Reg), add one to 1. And the
lookup the register enum with a new function (something like
getRegFromRegisterNum(RegNo, ValVT)).

The patch is looking good. I need to run it through some more tests.
Unfortunately ARM target is a bit broken right now. I hope to fix it
today.

I'll submit a revised patch after we've settled on the NeededStackSize=8 issue.

ARM target is fairly healthy now. I'll run some tests with your patch in the next few days.

Thanks,

Evan

Sorry I haven't gotten back to you earlier. I have been busy.

I ran some MultiSource/Benchmark earlier today. Looks like there are some failures: Fhourstones-3.1, Fhourstones, McCat/08-main, MiBench/consumer-lame, Olden/Power, Olden/voronoi, mafft/pairlocalign, and sim. Are you able to test them on your end?

Evan