RFC: SDNode Flags

Right now the MemSDNode keeps a volatile bit in the SubclassData to mark
volatile memory operations.

We have some changes we'd like to push back that adds a NonTemporal flag
to MemSDNode to mark instructions where movnt (on x86) and other goodness
can happen (we'll also add the TableGen patterns to properly select movnt).

In our tree we simply added another flag to the MemSDNode constructor
and embedded it in SubclassData:

MemSDNode(unsigned Opc, DebugLoc dl, SDVTList VTs, MVT MemoryVT,
            const Value *srcValue, int SVOff,
            unsigned alignment, bool isvolatile, bool NonTemporal);

This is ugly for a variety of reasons and also doesn't scale as we
want to add more of this kind of information.

So what if we replace Volatile/NonTemporal with a single bitvector?
There's not a lot of room in SubclassData (it also holds alignment
information) so maybe this should be another member on MemSDNode.
That will increase the size of MemSDNode, so we need to consider
that.

The constructor could look something like this:

MemSDNode(unsigned Opc, DebugLoc dl, SDVTList VTs, MVT MemoryVT,
            const Value *srcValue, int SVOff,
            unsigned alignment, unsigned flags);

and called like this:

new (N) LoadSDNode(..., isVolatile|isNonTemporal);

Thoughts?

                                    -Dave

Right now the MemSDNode keeps a volatile bit in the SubclassData to mark
volatile memory operations.

We have some changes we'd like to push back that adds a NonTemporal flag
to MemSDNode to mark instructions where movnt (on x86) and other goodness
can happen (we'll also add the TableGen patterns to properly select movnt).

In our tree we simply added another flag to the MemSDNode constructor
and embedded it in SubclassData:

MemSDNode(unsigned Opc, DebugLoc dl, SDVTList VTs, MVT MemoryVT,
           const Value *srcValue, int SVOff,
           unsigned alignment, bool isvolatile, bool NonTemporal);

This is ugly for a variety of reasons and also doesn't scale as we
want to add more of this kind of information.

So what if we replace Volatile/NonTemporal with a single bitvector?
There's not a lot of room in SubclassData (it also holds alignment
information) so maybe this should be another member on MemSDNode.
That will increase the size of MemSDNode, so we need to consider
that.

LoadSDNode, which inherits from MemSDNode is the largest
SDNode. With the current SDNode allocation strategy, making it
bigger will increase the allocation needed for all nodes.

The constructor could look something like this:

MemSDNode(unsigned Opc, DebugLoc dl, SDVTList VTs, MVT MemoryVT,
           const Value *srcValue, int SVOff,
           unsigned alignment, unsigned flags);

and called like this:

new (N) LoadSDNode(..., isVolatile|isNonTemporal);

Thoughts?

This sounds reasonable. I'd suggest using names like MemRefFlags rather
than isVolatileisNonTemporal, but that's just a detail :-).

My biggest concern is how this is encoded in the SDNode. It'd be
good to avoid making MemSDNode bigger, but there are a variety of ways
that exiting bits can be made available. NodeType doesn't need all 16 of
its bits, for example, and OperandsNeedDelete could be merged with
OperandList with a PointerIntPair if needed.

How are you representing movnt in LLVM IR, with an intrinsic?

Dan

I also want a way to add target specific flag to a SDNode (which should be transferred to MachineInstr). For example, on x86 lots of opcodes have *lock* variants. Right now, these are separate instructions. I'd prefer to make it into a target specific flag that can be toggled by some sort of post-isel action routine.

One way to handle this might be to expand the use of SubclassData. There are 15 bits to play with and only the bottom 6-7 are in use (if I am reading it right). We can also reduce the width of following field NodeId (do we need 32-bit for it?) and widen SubclassData.

Also, NumOperands and NumValues can be changed to take up fewer bits. I don't think we need 16-bits for each.

So NodeType, OperandsNeedDelete, SubclassData, NodeId, NumOperands, and NumValues together are using 96 bits (assuming int is 32-bit and short is 16-bit). I think there is opportunity here to re-pack them and find quite a few bits for target specific data.

Evan

LoadSDNode, which inherits from MemSDNode is the largest
SDNode. With the current SDNode allocation strategy, making it
bigger will increase the allocation needed for all nodes.

Ok.

> new (N) LoadSDNode(..., isVolatile|isNonTemporal);
>
> Thoughts?

This sounds reasonable. I'd suggest using names like MemRefFlags rather
than isVolatileisNonTemporal, but that's just a detail :-).

There's a pipe there. :slight_smile:

My biggest concern is how this is encoded in the SDNode. It'd be
good to avoid making MemSDNode bigger, but there are a variety of ways
that exiting bits can be made available. NodeType doesn't need all 16
of
its bits, for example, and OperandsNeedDelete could be merged with
OperandList with a PointerIntPair if needed.

*shudder* Undefined behavior? No thanks.

Right now, the lower five bits of SubclassData are used to encode various
things for memory SDNodes. One of those is the volatile bit. This leaves
10 bits for alignment information, meaning we can represent alignments up to
2^1024, which seems like a bit much. :slight_smile:

What do you think about carving four more bits out of the alignment field
so we have five flag/semantic bits and can represent alignments up to 2^64?
That seems like plenty of alignment headroom.

How are you representing movnt in LLVM IR, with an intrinsic?

Yes, it's an intrinsic inserted just before the load or store of interest.
I'm not particularly happy with this solution because I'm not sure all
transformation passes will preserve the ordering of
intrinsic-followed-by-operation. Is there a better way to do this?

                                  -Dave

I also want a way to add target specific flag to a SDNode (which
should be transferred to MachineInstr). For example, on x86 lots of
opcodes have *lock* variants. Right now, these are separate
instructions. I'd prefer to make it into a target specific flag that
can be toggled by some sort of post-isel action routine.

That's a good idea.

One way to handle this might be to expand the use of SubclassData.
There are 15 bits to play with and only the bottom 6-7 are in use (if
I am reading it right). We can also reduce the width of following
field NodeId (do we need 32-bit for it?) and widen SubclassData.

Yep. See my response to Dan. I don't know how many target-specific
flags you need, but we should be able to carve out four more bits from
SubclassData without changing anything else.

Also, NumOperands and NumValues can be changed to take up fewer bits.
I don't think we need 16-bits for each.

Probably true.

So NodeType, OperandsNeedDelete, SubclassData, NodeId, NumOperands,
and NumValues together are using 96 bits (assuming int is 32-bit and
short is 16-bit). I think there is opportunity here to re-pack them
and find quite a few bits for target specific data.

I agree. If you and Dan agree, I'll take a crack at this.

                                  -Dave

So NodeType, OperandsNeedDelete, SubclassData, NodeId, NumOperands,
and NumValues together are using 96 bits (assuming int is 32-bit and
short is 16-bit). I think there is opportunity here to re-pack them
and find quite a few bits for target specific data.

I agree. If you and Dan agree, I'll take a crack at this.

That'd be great. Thanks.

Evan

LoadSDNode, which inherits from MemSDNode is the largest
SDNode. With the current SDNode allocation strategy, making it
bigger will increase the allocation needed for all nodes.

Ok.

new (N) LoadSDNode(..., isVolatile|isNonTemporal);

Thoughts?

This sounds reasonable. I'd suggest using names like MemRefFlags rather
than isVolatileisNonTemporal, but that's just a detail :-).

There's a pipe there. :slight_smile:

Oops. I blame my font ;-).

My biggest concern is how this is encoded in the SDNode. It'd be
good to avoid making MemSDNode bigger, but there are a variety of ways
that exiting bits can be made available. NodeType doesn't need all 16
of
its bits, for example, and OperandsNeedDelete could be merged with
OperandList with a PointerIntPair if needed.

*shudder* Undefined behavior? No thanks.

What undefined behavior? Is it PointerIntPair that's making you shudder?
That's implementation-defined behavior. Very different :-). And it's checked
by asserts.

Right now, the lower five bits of SubclassData are used to encode various
things for memory SDNodes. One of those is the volatile bit. This leaves
10 bits for alignment information, meaning we can represent alignments up to
2^1024, which seems like a bit much. :slight_smile:

What do you think about carving four more bits out of the alignment field
so we have five flag/semantic bits and can represent alignments up to 2^64?
That seems like plenty of alignment headroom.

Yes, that's fine too. If someone really wanted an alignment greater
than 1<<64, they would probably have other issues that would dwarf
this one.

How are you representing movnt in LLVM IR, with an intrinsic?

Yes, it's an intrinsic inserted just before the load or store of interest.
I'm not particularly happy with this solution because I'm not sure all
transformation passes will preserve the ordering of
intrinsic-followed-by-operation. Is there a better way to do this?

Crazy idea: how about an intrinsic just after the load or store of interest?
It would be a sort of opposite of prefetch, saying "i'm not going to be
using the memory at this address for a while". It could use
IntrWriteArgMem like the regular prefetch does, which is usually
sufficient to keep it from wandering off, though it is a little stronger than
is strictly necessary.

The other alternative is something like llvm.x86.sse.movnt.ps and
friends.

Dan

> *shudder* Undefined behavior? No thanks.

What undefined behavior? Is it PointerIntPair that's making you
shudder?
That's implementation-defined behavior. Very different :-). And it's
checked
by asserts.

Well, it's still non-portable either way.

> What do you think about carving four more bits out of the alignment
> field
> so we have five flag/semantic bits and can represent alignments up
> to 2^64?
> That seems like plenty of alignment headroom.

Yes, that's fine too. If someone really wanted an alignment greater
than 1<<64, they would probably have other issues that would dwarf
this one.

But see Evan's note. I can see a need for more bits if we take
target-specific needs into account. I'll work on a solution.

>> How are you representing movnt in LLVM IR, with an intrinsic?
>
> Yes, it's an intrinsic inserted just before the load or store of
> interest.
> I'm not particularly happy with this solution because I'm not sure all
> transformation passes will preserve the ordering of
> intrinsic-followed-by-operation. Is there a better way to do this?

Crazy idea: how about an intrinsic just after the load or store of
interest?
It would be a sort of opposite of prefetch, saying "i'm not going to be
using the memory at this address for a while". It could use
IntrWriteArgMem like the regular prefetch does, which is usually
sufficient to keep it from wandering off, though it is a little
stronger than
is strictly necessary.

Does before vs. after matter for IntrWriteArgMem? What does that do anyway?

The other alternative is something like llvm.x86.sse.movnt.ps and
friends.

I don't want to go that route because this kind of information is useful on
multiple targets. Better to have one way to do it. We ain't coding Perl
here. :slight_smile:

                            -Dave

*shudder* Undefined behavior? No thanks.

What undefined behavior? Is it PointerIntPair that's making you

shudder?

That's implementation-defined behavior. Very different :-). And it's

checked

by asserts.

Well, it's still non-portable either way.

LLVM does many non-portable things behind encapsulation.

What do you think about carving four more bits out of the alignment

field

so we have five flag/semantic bits and can represent alignments up

to 2^64?

That seems like plenty of alignment headroom.

Yes, that's fine too. If someone really wanted an alignment greater

than 1<<64, they would probably have other issues that would dwarf

this one.

But see Evan's note. I can see a need for more bits if we take
target-specific needs into account. I'll work on a solution.

Ok.

How are you representing movnt in LLVM IR, with an intrinsic?

Yes, it's an intrinsic inserted just before the load or store of

interest.

I'm not particularly happy with this solution because I'm not sure all

transformation passes will preserve the ordering of

intrinsic-followed-by-operation. Is there a better way to do this?

Crazy idea: how about an intrinsic just after the load or store of

interest?

It would be a sort of opposite of prefetch, saying "i'm not going to be

using the memory at this address for a while". It could use

IntrWriteArgMem like the regular prefetch does, which is usually

sufficient to keep it from wandering off, though it is a little

stronger than

is strictly necessary.

Does before vs. after matter for IntrWriteArgMem?

After just seems a little more intuitive:
   "I won't be using the memory at this address for a while."
rather than
   "I'm about to do a store to this address and after that I won't be
    using the memory there for a while."

What does that do anyway?

Passes that don't know about that specific intrinsic will treat it
as if it reads from and writes to the memory pointed to by its
argument.

The other alternative is something like llvm.x86.sse.movnt.ps and

friends.

I don't want to go that route because this kind of information is useful on
multiple targets. Better to have one way to do it. We ain't coding Perl
here. :slight_smile:

A target-independent intrinsic that would be lowered to a plain
store on targets which don't have special instructions would be possible
too. I'm not necessarily advocating this, just mentioning the possibility.

Dan