X86 sub_ss and sub_sd sub-register indexes

All,

I've been trying to simplify the way LLVM models sub-register relationships a bit, and the X86 sub_ss and sub_sd sub-register indices are getting in the way. I want to get rid of them.

These sub-registers are special: they are only mentioned here:

  let CompositeIndices = [(sub_ss), (sub_sd)] in {
  def XMM0: Register<"xmm0">, DwarfRegNum<[17, 21, 21]>;
  def XMM1: Register<"xmm1">, DwarfRegNum<[18, 22, 22]>;
  ...

This secret syntax means that the indexes are idempotent:

  getSubReg(YMM0, sub_ss) --> XMM0
  getSubReg(XMM0, sub_ss) --> XMM0

They are supposed to represent the 32-bit and 64-bit low parts of the xmm registers, but since we don't define explicit registers for those sub-registers, we are left with idempotent sub-register indexes.
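In API terms, the quirk looks like this (a sketch using getSubReg from TargetRegisterInfo; TRI stands for the X86 register info, and X86::sub_ss is the very index I want to remove):

  // Because sub_ss composes with itself, applying it to a register
  // that is already an xmm register is a no-op instead of an error.
  unsigned R = TRI->getSubReg(X86::YMM0, X86::sub_ss); // X86::XMM0
  unsigned S = TRI->getSubReg(R, X86::sub_ss);         // still X86::XMM0
  assert(R == S && "sub_ss behaves idempotently");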

We have three different register classes for the xmm registers: FR32, FR64, and VR128. The sub_ss and sub_sd indexes used to play a role in selecting the right register class, but not any longer. That is all derived from the instruction descriptions now.

As far as I can tell, all sub-register operations involving sub_ss and sub_sd can simply be replaced with COPY_TO_REGCLASS:

  def : Pat<(v4i32 (X86Movsd VR128:$src1, VR128:$src2)),
            (VMOVSDrr VR128:$src1, (EXTRACT_SUBREG (v4i32 VR128:$src2),
                                                   sub_sd))>;

Becomes:

  def : Pat<(v4i32 (X86Movsd VR128:$src1, VR128:$src2)),
            (VMOVSDrr VR128:$src1, (COPY_TO_REGCLASS VR128:$src2, FR64))>;

By eliminating these indexes, I can remove the 'CompositeIndices' syntax and TableGen's handling of loops in the sub-register graph. I can assert that every sub-register has a unique name, and that can be used to compress tables a bit more.

/jakob

Jakob Stoklund Olesen <jolesen@apple.com> writes:

> These sub-registers are special: they are only mentioned here:
>
>   let CompositeIndices = [(sub_ss), (sub_sd)] in {
>   def XMM0: Register<"xmm0">, DwarfRegNum<[17, 21, 21]>;
>   def XMM1: Register<"xmm1">, DwarfRegNum<[18, 22, 22]>;
>   ...

I'm confused. Below you note that they are used in patterns, so they
are certainly mentioned more than just in the code above.

> As far as I can tell, all sub-register operations involving sub_ss and
> sub_sd can simply be replaced with COPY_TO_REGCLASS:
>
>   def : Pat<(v4i32 (X86Movsd VR128:$src1, VR128:$src2)),
>             (VMOVSDrr VR128:$src1, (EXTRACT_SUBREG (v4i32 VR128:$src2),
>                                                    sub_sd))>;
>
> Becomes:
>
>   def : Pat<(v4i32 (X86Movsd VR128:$src1, VR128:$src2)),
>             (VMOVSDrr VR128:$src1, (COPY_TO_REGCLASS VR128:$src2, FR64))>;

A few questions:

Will COPY_TO_REGCLASS actually generate a copy instruction or can
TableGen/isel fold it away?

What happens if the result of the above pattern using COPY_TO_REGCLASS
is spilled? Will we get a 64-bit store or a 128-bit store?

                                -Dave

> Jakob Stoklund Olesen <jolesen@apple.com> writes:
>
>> As far as I can tell, all sub-register operations involving sub_ss and
>> sub_sd can simply be replaced with COPY_TO_REGCLASS:
>>
>>   def : Pat<(v4i32 (X86Movsd VR128:$src1, VR128:$src2)),
>>             (VMOVSDrr VR128:$src1, (EXTRACT_SUBREG (v4i32 VR128:$src2),
>>                                                    sub_sd))>;
>>
>> Becomes:
>>
>>   def : Pat<(v4i32 (X86Movsd VR128:$src1, VR128:$src2)),
>>             (VMOVSDrr VR128:$src1, (COPY_TO_REGCLASS VR128:$src2, FR64))>;
>
> A few questions:
>
> Will COPY_TO_REGCLASS actually generate a copy instruction or can
> TableGen/isel fold it away?

Both EXTRACT_SUBREG and COPY_TO_REGCLASS are emitted as COPY instructions by InstrEmitter, one as a sub-register copy and the other as a full register copy. Both forms are handled by the register coalescer.

It would actually be possible to have EmitCopyToRegClassNode() try to call MRI->constrainRegClass() first, just like AddRegisterOperand() does. That could avoid the copy in some cases, and you would simply get a VR128 register as the second VMOVSDrr operand. I am not proposing we do that for now. Let the register coalescer deal with that.
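For illustration, roughly what that could look like (a sketch with assumed local names, not the actual InstrEmitter code):

  // Hypothetical body for EmitCopyToRegClassNode(). COPY_TO_REGCLASS
  // carries the destination register class ID as a constant operand.
  unsigned SrcReg = getVR(Node->getOperand(0), VRBaseMap);
  unsigned DstRCIdx =
      cast<ConstantSDNode>(Node->getOperand(1))->getZExtValue();
  const TargetRegisterClass *DstRC = TRI->getRegClass(DstRCIdx);

  if (MRI->constrainRegClass(SrcReg, DstRC)) {
    // SrcReg now lives in the common sub-class (VR128 in the VMOVSDrr
    // example above), so no copy is needed; reuse it as the result.
    VRBaseMap[SDValue(Node, 0)] = SrcReg;
  } else {
    // The classes are incompatible without a copy; emit the COPY that
    // the emitter produces today.
    unsigned NewVReg = MRI->createVirtualRegister(DstRC);
    BuildMI(*MBB, InsertPos, Node->getDebugLoc(),
            TII->get(TargetOpcode::COPY), NewVReg).addReg(SrcReg);
    VRBaseMap[SDValue(Node, 0)] = NewVReg;
  }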

> What happens if the result of the above pattern using COPY_TO_REGCLASS
> is spilled? Will we get a 64-bit store or a 128-bit store?

This behavior isn't affected by the change. FR64 registers are spilled with 64-bit stores, and VR128 registers are spilled with 128-bit stores.

When the register coalescer removes a copy between VR128 and FR64 registers, it chooses the larger spill size for the result. This is the same for sub-register copies and full register copies.

The important point here is that VR128 is a sub-class of FR64, so getCommonSubClass(VR128, FR64) -> VR128. This is the Liskov substitution principle for register classes.
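Spelled out (getCommonSubClass is a TargetRegisterInfo query in current trees, though it has lived in a couple of places over the years):

  // Both classes contain the same xmm registers, but VR128 carries the
  // stricter constraint, the 128-bit spill size, so it is the common
  // sub-class of the two.
  const TargetRegisterClass *Common =
      TRI->getCommonSubClass(&X86::VR128RegClass, &X86::FR64RegClass);
  assert(Common == &X86::VR128RegClass);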

/jakob

Jakob Stoklund Olesen <jolesen@apple.com> writes:

>> What happens if the result of the above pattern using COPY_TO_REGCLASS
>> is spilled? Will we get a 64-bit store or a 128-bit store?
>
> This behavior isn't affected by the change. FR64 registers are spilled
> with 64-bit stores, and VR128 registers are spilled with 128-bit
> stores.
>
> When the register coalescer removes a copy between VR128 and FR64
> registers, it chooses the larger spill size for the result. This is
> the same for sub-register copies and full register copies.

So if I understand this correctly, a pattern like this:

  def : Pat<(f64 (vector_extract (v2f64 VR128:$src), (iPTR 0))),
            (f64 (EXTRACT_SUBREG (v2f64 VR128:$src), sub_sd))>;

will currently use a 128-bit store if it is spilled?

That's really not good.

If the 128-bit register is not ever used as a 128-bit register,
shouldn't the coalescer pick the 64- or 32-bit register?

                                   -Dave

> Jakob Stoklund Olesen <jolesen@apple.com> writes:
>
>>> What happens if the result of the above pattern using COPY_TO_REGCLASS
>>> is spilled? Will we get a 64-bit store or a 128-bit store?
>>
>> This behavior isn't affected by the change. FR64 registers are spilled
>> with 64-bit stores, and VR128 registers are spilled with 128-bit
>> stores.
>>
>> When the register coalescer removes a copy between VR128 and FR64
>> registers, it chooses the larger spill size for the result. This is
>> the same for sub-register copies and full register copies.
>
> So if I understand this correctly, a pattern like this:
>
>   def : Pat<(f64 (vector_extract (v2f64 VR128:$src), (iPTR 0))),
>             (f64 (EXTRACT_SUBREG (v2f64 VR128:$src), sub_sd))>;
>
> will currently use a 128-bit store if it is spilled?

It will if we coalesce the COPY away, yes.

None of this is dependent on our using sub-registers, though. The coalescer treats sub-register copies and full register copies equally.

> If the 128-bit register is not ever used as a 128-bit register,
> shouldn't the coalescer pick the 64- or 32-bit register?

That optimization is not currently implemented for sub-registers. For example, if you create a GR64 virtual register and only ever use the sub_32bit sub-register, it would be possible to replace the virtual register with a GR32 register. It's not impossible to do, but it doesn't come up a lot.
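The qualifying check itself would be simple enough; it is the rewrite that nobody has needed badly enough to write. A hypothetical version of the check, using the use_operands range from current MachineRegisterInfo (nothing like this is in the tree):

  // A GR64 vreg could be rewritten as GR32 only if every reader goes
  // through the sub_32bit index. (A real pass would also have to prove
  // the high 32 bits are never needed by any def or use.)
  bool OnlyLow32 = true;
  for (const MachineOperand &MO : MRI->use_operands(VReg))
    if (MO.getSubReg() != X86::sub_32bit)
      OnlyLow32 = false;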

When not using sub-registers, the optimization does exist. For example, if you have a VR128 virtual register, but all the instructions using it only require FR32, MRI->recomputeRegClass() will figure it out, and downgrade to FR32.

It gets permission to do this because X86RegisterInfo::getLargestLegalSuperClass(VR128) returns FR32.
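From a pass's point of view the whole downgrade is one call (the one-argument form of recomputeRegClass is assumed here; its signature has varied between versions):

  // Intersect the constraints from every use and def of VReg, starting
  // from getLargestLegalSuperClass of its current class; if a different
  // legal class satisfies them all, install it.
  if (MRI->recomputeRegClass(VReg)) {
    // e.g. a VR128 vreg used only by FR32 instructions is now FR32.
  }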

/jakob

Jakob Stoklund Olesen <jolesen@apple.com> writes:

>> If the 128-bit register is not ever used as a 128-bit register,
>> shouldn't the coalescer pick the 64- or 32-bit register?
>
> That optimization is not currently implemented for sub-registers. For
> example, if you create a GR64 virtual register and only ever use the
> sub_32bit sub-register, it would be possible to replace the virtual
> register with a GR32 register. It's not impossible to do, but it
> doesn't come up a lot.

It does come up a lot in vector code. Extraction of scalar values from
vectors is pretty common, especially given the limitations of SSE/AVX.
Typically we have done this using EXTRACT_SUBREG. So either we would
have to prevent coalescing to avoid a 128-bit spill, or we would always
have to use a 128-bit spill even if we never use anything but the scalar
value.

Neither option is a good one.

> When not using sub-registers, the optimization does exist. For
> example, if you have a VR128 virtual register, but all the
> instructions using it only require FR32, MRI->recomputeRegClass() will
> figure it out, and downgrade to FR32.

I don't think this optimization applies because the SSE/AVX instruction
defines a vector register but we never use the upper elements.

Would adding Fs patterns for these cases, forcing the result register to
FR64, help?

What does Fs mean anyway, "fake scalar"? :)

                                -Dave

If you feel this is important, please file a PR with a test case where it matters. It is orthogonal to the topic of this thread.

/jakob