How to enable use of 64bit load/store for 32bit architecture

In http://reviews.llvm.org/D8713, I added the 64bit integer store (“std”) and load (“ldd”) instructions for 32bit sparc. But now I need codegen to know how to emit them, and am not sure the best way to go about teaching the backend that 64bit integers can be used natively, but only for loads and stores.

(I originally wrote an earlier draft of question in the review but it seems like it may be better on the dev list, since it’s more what to do next than what was done.)

Basically, I’d like it if C code like:

long long x,y;
void f() { y = x; }
Or equivalently, the ll code

%0 = load i64, i64* @x, align 8
store i64 %0, i64* @y, align 8

turned into just “ldd, std” instructions, as it does in GCC, rather than loading and storing the two 32bit halves of the variables separately.

To allow that, I tried adding:
addRegisterClass(MVT::i64, &SP::IntPairRegClass)
to SparcTargetLowering::SparcTargetLowering in 32bit mode.

Doing that then makes load/store work. But it causes llvm to try to use i64 operations for everything, which of course fails for all other operations, since there’s no such instruction pattern for them.

Okay, so I then try setting all the operations to “Expand” via:

for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op) {
setOperationAction(Op, MVT::i64, Expand);
}

This does not actually work properly. E.g.

%res = add nsw i64 %0, 3

gives an error:

LLVM ERROR: Cannot select: 0x43b1df0: i64 = add 0x43b1bd0, 0x43b1ce0 [ORD=3] [ID=15]

Which seems odd – it appears as if it ignored the Expand and just tried to continue onto selection for 64bit add as if it had been legal?

And a different example:
%res = shl i64 %0, 1

aborts with error:
llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:3988: void {anonymous}::SelectionDAGLegalize::ExpandNode(llvm::SDNode*): Assertion `VT.isVector() && “Unable to legalize non-vector shift”’ failed.

So, it seems to me that a whole bunch of other code will need to be added or modified, too, to teach LLVM how to expand i64 to i32 operations for everything other than load/store, if I continue down this path – despite that it can do so just fine if i64 doesn’t have a register class. This seems ugly…

Is there some other better way to notate the case that ONLY loads/stores can be done on 64bit integers, but that everything else must be done as if 64bit integers didn’t exist?

Hi James,

I’m not too familiar with Sparc’s design here, but you may find it useful to have a look at how the ARM target deals with this sort of thing. In particular, the load/store optimization pass which can form the LDRD/STRD instructions among other things.

-Jim

Hi James, Jim

If you really want this to work in selection DAG then there is a solution, but its not pretty.

First make i64 not be legal. Then, assuming the regclass you gave has some subregs, you can give load/store a custom legalisation where you change the i64 to MVT::Untyped. So something like this for ISD::STORE:

SDValue ValueToBeStored = St.getOperand(…)

auto SeqOps = {
DAG.getTargetConstant(SP::IntPairRegClassID, MVT::i32),
DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, ValueToBeStored, DAG.getConstant(0, MVT::i32)),DAG.getTargetConstant(SP ::sub0, MVT::i32),
DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, ValueToBeStored, DAG.getConstant(1, MVT::i32)),
DAG.getTargetConstant(SP ::sub1, MVT::i32)};

SDValue NewValueToBeStored = DAG.getMachineNode(TargetOpcode::REG_SEQUENCE, dl, MVT::Untyped, SeqOps);

return DAG.getStore(…, NewValueToBeStored, …)

The legalizer won’t touch Untyped operands, so once you’ve changed the store operand to this, you won’t have the legalize complain later. But this is very target specific and prone to error if you get sub regs or anything wrong.

Cheers,
Pete

Hi James, Jim

If you *really* want this to work in selection DAG then there is a solution, but its not pretty.

First make i64 not be legal. Then, assuming the regclass you gave has some subregs, you can give load/store a custom legalisation where you change the i64 to MVT::Untyped. So something like this for ISD::STORE:

SDValue ValueToBeStored = St.getOperand(…)

auto SeqOps = {
  DAG.getTargetConstant(SP::IntPairRegClassID, MVT::i32),
  DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, ValueToBeStored, DAG.getConstant(0, MVT::i32)),
  DAG.getTargetConstant(SP ::sub0, MVT::i32),
  DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, ValueToBeStored, DAG.getConstant(1, MVT::i32)),
  DAG.getTargetConstant(SP ::sub1, MVT::i32)
};

SDValue NewValueToBeStored = DAG.getMachineNode(TargetOpcode::REG_SEQUENCE, dl, MVT::Untyped, SeqOps);

return DAG.getStore(…, NewValueToBeStored, …)

The legalizer won’t touch Untyped operands, so once you’ve changed the store operand to this, you won’t have the legalize complain later. But this is very target specific and prone to error if you get sub regs or anything wrong.

Cheers,
Pete

Hi,

We have a similar issue in the R600 backend. We need to support 128-bit
(and larger) stores, but i128 is not a legal type. The way we handle
this is to make a same sized vector type (v4i32) legal and then add a
loop to the TargetLowering constructor to Expand everything except loads and
stores.

I'm not sure if the TypeLegalizer will promote i64 stores to v2i32, but
if not I think you can add a target DAGCombine for it.

I'm not sure if this is better or worse than Pete's suggestion, but it
may be worth a try.

-Tom

Hi James, Jim

If you really want this to work in selection DAG then there is a solution, but its not pretty.

First make i64 not be legal. Then, assuming the regclass you gave has some subregs, you can give load/store a custom legalisation where you change the i64 to MVT::Untyped. So something like this for ISD::STORE:

SDValue ValueToBeStored = St.getOperand(…)

auto SeqOps = {
DAG.getTargetConstant(SP::IntPairRegClassID, MVT::i32),
DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, ValueToBeStored, DAG.getConstant(0, MVT::i32)),
DAG.getTargetConstant(SP ::sub0, MVT::i32),
DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, ValueToBeStored, DAG.getConstant(1, MVT::i32)),
DAG.getTargetConstant(SP ::sub1, MVT::i32)
};

SDValue NewValueToBeStored = DAG.getMachineNode(TargetOpcode::REG_SEQUENCE, dl, MVT::Untyped, SeqOps);

return DAG.getStore(…, NewValueToBeStored, …)

The legalizer won’t touch Untyped operands, so once you’ve changed the store operand to this, you won’t have the legalize complain later. But this is very target specific and prone to error if you get sub regs or anything wrong.

Cheers,
Pete

Hi,

We have a similar issue in the R600 backend. We need to support 128-bit
(and larger) stores, but i128 is not a legal type. The way we handle
this is to make a same sized vector type (v4i32) legal and then add a
loop to the TargetLowering constructor to Expand everything except loads and
stores.

I’m not sure if the TypeLegalizer will promote i64 stores to v2i32, but
if not I think you can add a target DAGCombine for it.

I’m not sure if this is better or worse than Pete’s suggestion, but it
may be worth a try.

This solution is definitely less of a hack than mine :slight_smile:

Mine has the advantage of working with weird sized types, but if you don’t need to store something like a <3 x float> then i’d go with Tom’s solution.

Cheers,
Pete

Thanks for the help so far!

So, there were three alternatives proposed on this thread.

(1) Jim’s suggestion to copy the solution ARM uses for LDRD/STRD. This looks like it would be applicable, since ARM’s LDRD/STRD seem a basically direct analog of the Sparc’s LDD/STD instructions. Both require an even/odd register pair, and 64bit alignment of the memory for the 2x32 load/store. But that solution seemed pretty complicated, requiring writing a separate optimization pass. I thought I’d try a simpler solution first…

(2) Tom’s suggestion of declaring that the IntPair registers are of type v2i32. I actually tried implementing this. And it does seem to work…kinda.

First thing – it seems kinda hacky to me to use a vector type for a pair of integer registers. What if the architecture also had real vector ops? Wouldn’t that conflict with the use v2i32 to represent a pair of integer registers?

But anyways,

I implemented Legal actions of LOAD, STORE, EXTRACT_VECTOR_ELT, and BUILD_VECTOR (the latter two expanding to EXTRACT_SUBREG/INSERT_SUBREG patterns) on the v2i32 type in the .td file. All other operations for v2i32 are marked Expand. I made i64 LOAD/STORE Custom, and added handlers in LowerOperation and ReplaceNodeResults, which do a v2i32 load/store, and a bitcast (which turns into the build/extract operations).

So, that all seems to work fine for i64 operations now.

But, doing some testing, it seems that something within LLVM really wants to merge two i32 loads into an v2i32, even though it’s not appropriate. It makes terrible code for this example:
struct X {
int a, b;
};
struct X x, y;

void f() {
long a = x.a, b = x.b;
y.a = a; y.b = b;
}

“f()” turns into the expected 4 operations:
%0 = load i32, i32* getelementptr inbounds (%struct.X, %struct.X* @x, i32 0, i32 0), align 4, !tbaa !1
%1 = load i32, i32* getelementptr inbounds (%struct.X, %struct.X* @x, i32 0, i32 1), align 4, !tbaa !6
store i32 %0, i32* getelementptr inbounds (%struct.X, %struct.X* @y, i32 0, i32 0), align 4, !tbaa !1
store i32 %1, i32* getelementptr inbounds (%struct.X, %struct.X* @y, i32 0, i32 1), align 4, !tbaa !6

But, somewhere in the bowels after that, llvm seems to try to merge the two sequential i32 loads and stores into a v2i32 load and store. But, because “x” and “y” are not 64bit aligned, it then ends up converting that v2i32 load to occur from the stack – after using i32 load/stores to copy x.a, x.b onto the stack temporarily! Then, immediately after the v2i32 load from the stack…it starts the reverse process – a v2i32 store BACK to a different place in the stack, and i32 load/stores copying the latter stack location onto y.a and y.b.

That seems quite sub-optimal, but I’m not sure where it comes from. It looks like just having the v2i32 type assigned to the register class causes that “optimization”. The other stuff I implemented doesn’t affect it. Why is it trying to do a conversion to v2i32 when the arguments are not properly aligned? Running “llc -debug” doesn’t seem to shed any light for me.

(3) Pete’s “not pretty” solution of using Untyped operands, and REG_SEQUENCE. I’m gonna try implementing this one next. It seems like it might be the simplest of the three, and have fewer side-effects since it doesn’t introduce a new legal type at all.

Thanks for the help so far!

So, there were three alternatives proposed on this thread.

(1) Jim’s suggestion to copy the solution ARM uses for LDRD/STRD. This looks like it would be applicable, since ARM’s LDRD/STRD seem a basically direct analog of the Sparc’s LDD/STD instructions. Both require an even/odd register pair, and 64bit alignment of the memory for the 2x32 load/store. But that solution seemed pretty complicated, requiring writing a separate optimization pass. I thought I’d try a simpler solution first…

(2) Tom’s suggestion of declaring that the IntPair registers are of type v2i32. I actually tried implementing this. And it does seem to work…kinda.

First thing – it seems kinda hacky to me to use a vector type for a pair of integer registers. What if the architecture also had real vector ops? Wouldn’t that conflict with the use v2i32 to represent a pair of integer registers?

But anyways,

I implemented Legal actions of LOAD, STORE, EXTRACT_VECTOR_ELT, and BUILD_VECTOR (the latter two expanding to EXTRACT_SUBREG/INSERT_SUBREG patterns) on the v2i32 type in the .td file. All other operations for v2i32 are marked Expand. I made i64 LOAD/STORE Custom, and added handlers in LowerOperation and ReplaceNodeResults, which do a v2i32 load/store, and a bitcast (which turns into the build/extract operations).

So, that all seems to work fine for i64 operations now.

But, doing some testing, it seems that something within LLVM really wants to merge two i32 loads into an v2i32, even though it’s not appropriate. It makes terrible code for this example:
struct X {
int a, b;
};
struct X x, y;

void f() {
long a = x.a, b = x.b;
y.a = a; y.b = b;
}

“f()” turns into the expected 4 operations:
%0 = load i32, i32* getelementptr inbounds (%struct.X, %struct.X* @x, i32 0, i32 0), align 4, !tbaa !1
%1 = load i32, i32* getelementptr inbounds (%struct.X, %struct.X* @x, i32 0, i32 1), align 4, !tbaa !6
store i32 %0, i32* getelementptr inbounds (%struct.X, %struct.X* @y, i32 0, i32 0), align 4, !tbaa !1
store i32 %1, i32* getelementptr inbounds (%struct.X, %struct.X* @y, i32 0, i32 1), align 4, !tbaa !6

But, somewhere in the bowels after that, llvm seems to try to merge the two sequential i32 loads and stores into a v2i32 load and store. But, because “x” and “y” are not 64bit aligned, it then ends up converting that v2i32 load to occur from the stack – after using i32 load/stores to copy x.a, x.b onto the stack temporarily! Then, immediately after the v2i32 load from the stack…it starts the reverse process – a v2i32 store BACK to a different place in the stack, and i32 load/stores copying the latter stack location onto y.a and y.b.

That seems quite sub-optimal, but I’m not sure where it comes from. It looks like just having the v2i32 type assigned to the register class causes that “optimization”. The other stuff I implemented doesn’t affect it. Why is it trying to do a conversion to v2i32 when the arguments are not properly aligned? Running “llc -debug” doesn’t seem to shed any light for me.

Its probably DAGCombiner::MergeConsecutiveStores which is checking that the stores have equal alignment (and they do: 4), but not that the final store has enough alignment to be legal. I’d breakpoint MergeStoresOfConstantsOrVecElts and see if you get a hit there.

Cheers,
Pete

Yes, it is indeed MergeConsecutiveStores doing this. (not MergeStoresOfConstantsOrVecElts; the case of merging a sequence of stores whose data comes from a sequence of loads handled in the main function body, and that's the case in question).

I see in a bunch of other places, llvm checks 'allowsMisalignedMemoryAccesses' before creating a new larger load/store operation. I suppose that code should do so as well. I wonder if this is the only place that creates unaligned operations improperly?

I'd say, additionally, the alignment checks that MergeConsecutiveStores does do seem pretty bogus:
1) it doesn't make sense to require that the stores all have the same alignment as each-other: if the struct as a whole had an alignment of 8, the first store would have alignment 8, and the second would have alignment 4. And *that* ought to be valid, but isn't currently treated as such.
2) Later on, it then requires that the alignment of each load is the same as the alignment of the corresponding store "if (Ld->getAlignment() != St->getAlignment()) break;". But why? As long as the alignments are both acceptable, I don't see what purpose ensuring that they're identical has.

As to what it's doing storing temporarily onto the stack, that's done by "ExpandUnalignedLoad" and "ExpandUnalignedStore" -- apparently they are only able to deal with unaligned vector and floating-point ops by temporarily storing to the stack, unless there's an equivalent integer type of the same size. That seems quite suboptimal -- I'd assume it would almost always be better to split into loads of the largest integer type actually available, and build things up from there, instead of indirecting through the stack...