Pseudo load and store instructions for AArch64

Hello,

I'm trying to add pseudo 64-bit load and store instructions for AArch64, which
should have latencies set to "1" while being otherwise exactly the same as
normal load and store instructions. Various assertions fire (even different
ones for the same binary, maybe something is uninitialized) and I can't
understand what's wrong. Here are the relevant pieces I added:

to AArch64InstrInfo.td:

  let isReMaterializable = 1 in {
    def FakeLoad64 : Pseudo<(outs GPR64:$Rt), (ins GPR64sp:$Rn, GPR32:$Rm, ro_Wextend64:$extend), []>;
    def FakeStore64 : Pseudo<(outs), (ins GPR64:$Rt, GPR64sp:$Rn, GPR32:$Rm, ro_Wextend64:$extend), []>;
  }

  def AArch64fakeload64 : SDNode<"AArch64ISD::FakeLoad64", SDTIntBinOp, [SDNPHasChain]>;
  def AArch64fakestore64 : SDNode<"AArch64ISD::FakeStore64", SDTIntBinOp, [SDNPHasChain]>;

to AArch64ISD in AArch64ISelLowering.h below ISD::FIRST_TARGET_MEMORY_OPCODE:

  FakeLoad64,
  FakeStore64

in AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy():

  SmallVector<SDValue, 4> Ops;
  Ops.push_back(Chain);
  Ops.push_back(DAG.getNode(ISD::ADD, dl, MVT::i64, Src,
                            DAG.getConstant(SrcOff, MVT::i64)));
  // Ops.push_back(SrcPtrInfo.getWithOffset(SrcOff));
  Ops.push_back(DAG.getConstant(0, MVT::i64));

  Loads[i] = DAG.getNode(AArch64::FakeLoad64, dl, VT, Ops);

There seems to be something wrong with pointer information inside getNode() as
llvm::MachinePointerInfo::getAddrSpace() asserts.

I can't find an example of similar instructions to start with. Are there any
similar pseudos already?

Any help would be appreciated, even if it's only someone confirming that this
should be possible and I'm just missing something.

Thanks,
Sergey

> I'm trying to add pseudo 64-bit load and store instructions for AArch64, which
> should have latencies set to "1" while being otherwise exactly the same as
> normal load and store instructions.

Hi Sergey,

Can I ask why you would need that?

> There seems to be something wrong with pointer information inside getNode() as
> llvm::MachinePointerInfo::getAddrSpace() asserts.

It looks like some switch statements along the way have specific
knowledge of the types and instruction opcodes, and they don't
recognize your new pseudos.

One way to find them is to grep for the opcodes of the instructions
yours are similar to, and then see whether you need to add your pseudos
there.

cheers,
--renato

Hi Renato,

> I'm trying to add pseudo 64-bit load and store instructions for AArch64, which
> should have latencies set to "1" while being otherwise exactly the same as
> normal load and store instructions.

> Can I ask why you would need that?

This is the only way I found to stop the Machine Instruction Scheduler from
reordering load and store instructions. I asked about this specific topic
several times before, but no one answered. The following approaches
didn't work in this case:

- different kinds of chaining;
- gluing;
- a single pseudo instruction for the load and store, since it needs a
   temporary register but such pseudos are expanded after RA.

This is needed to make the code of an inlined memcpy() more efficient.

> It looks like some switch statements along the way have specific
> knowledge of the types and instruction opcodes, and they don't
> recognize your new pseudos.
>
> One way to find them is to grep for the opcodes of the instructions
> yours are similar to, and then see whether you need to add your pseudos
> there.

Thanks, I'll try that.

Regards,
Sergey

I see. Saleem (cc'd) worked on a similar thing for ARM's movw/movt for
Windows, which also didn't like the reordering. Maybe he can help you.

cheers,
--renato

Hi Sergey,

I was thinking about this and I remember seeing a similar problem to
yours in ARM. Something like:

  ldr r1, [sp, #20]
  ldr r2, [sp, #24]
  ldr r3, [sp, #28]

being reordered to:

  ldr r2, [sp, #24]
  ldr r1, [sp, #20]
  ldr r3, [sp, #28]

and taking a big hit on performance.

The ARM back-end has the ARMLoadStoreOptimizer pass, which deals with
similar problems and generally runs at the right time for
fixing loads and stores. Maybe you could add a similar thing to
AArch64?

That would have the benefit of not polluting the TableGen files, and
it could be enabled on demand via a flag, then turned on by default
only after heavy testing.

James (cc'd) implemented the optimizer; maybe he could give some hints
on the issues in your particular case.

cheers,
--renato

> This is the only way I found to stop the Machine Instruction Scheduler from
> reordering load and store instructions.

> I see. Saleem (cc'd) worked on a similar thing for ARM's movw/movt for
> Windows, which also didn't like the reordering. Maybe he can help you.

Sorry, I've been a bit busy at work :-(.

For Windows on ARM, the movw/movt relocations need to be contiguous. In
order to accommodate that, we generate a bundle (similar to the VLIW
concept) to treat the pair as a single scheduling entity.
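
As a rough illustration of what "a single scheduling entity" means, here is
a toy C++ sketch. It is not LLVM code: the `Unit` struct, its `priority`
field and the `schedule` function are hypothetical names made up for this
example, and the stable sort is only a stand-in for a real scheduler. The
point it tries to show is that whole units get reordered while the
instructions inside a bundle stay adjacent and in order.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy model only: a scheduling unit is either one instruction or a bundle
// of instructions that must stay adjacent and in order.
struct Unit {
  int priority;                    // hypothetical scheduling priority
  std::vector<std::string> insts;  // more than one entry means a bundle
};

// Stand-in for a scheduler: reorder whole units by descending priority,
// then flatten. Instructions inside a bundle are never separated.
std::vector<std::string> schedule(std::vector<Unit> units) {
  std::stable_sort(units.begin(), units.end(),
                   [](const Unit &a, const Unit &b) {
                     return a.priority > b.priority;
                   });
  std::vector<std::string> out;
  for (const Unit &u : units)
    out.insert(out.end(), u.insts.begin(), u.insts.end());
  return out;
}
```

In this model a movw/movt pair placed in one `Unit` can move as a block
relative to other instructions, but nothing can be scheduled between the two.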

Although that approach could work, it feels like updating the
LoadStoreOptimizer to deal with the particular case may be a cleaner
approach.

I think you give me too much credit - I didn’t write that pass!

Cheers,

James


Saleem,

> we generate a bundle (similar to the VLIW concept)

Is it possible to specify that a bundle uses a couple of registers
internally, ones that are neither inputs nor outputs but only
temporaries? If the answer is no, then it won't work anyway.

> Although that approach could work, it feels like updating the
> LoadStoreOptimizer to deal with the particular case may be a cleaner
> approach.

Thanks! I hadn't noticed that the load/store optimizer runs *after*
the machine instruction scheduler, which should make it possible to
move some of the instructions there.

Regards,
Sergey

Hi Renato,

> I was thinking about this and I remember seeing a similar problem to
> yours in ARM. Something like:
>
>   ldr r1, [sp, #20]
>   ldr r2, [sp, #24]
>   ldr r3, [sp, #28]
>
> being reordered to:
>
>   ldr r2, [sp, #24]
>   ldr r1, [sp, #20]
>   ldr r3, [sp, #28]

Well, it's a bit different. What I'm trying to do is to turn

    ldp x10, x11, [x9] // load
    ldp x12, x9, [x9, #16] // load
    stp x10, x11, [x8] // store
    mov w0, wzr
    stp x12, x9, [x8, #16] // store

into

    ldp x10, x11, [x9] // load
    stp x10, x11, [x8] // store
    ldp x12, x9, [x9, #16] // load
    stp x12, x9, [x8, #16] // store
    mov w0, wzr

So "load" + "load" and "store" + "store" are already fine. What I need is
for each load to be immediately followed by its matching store, so that the
paired operations are properly interleaved and adjacent. It should result
in better performance even though the machine instruction scheduler thinks
differently.
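
To make the intended ordering concrete, here is a toy C++ sketch. It is not
LLVM code: the `Inst` struct and the `interleave` function are hypothetical
names made up for illustration, and pairing the k-th load with the k-th
store is a simplifying assumption. A real pass would operate on machine
instructions and would have to check register and memory dependencies
before moving anything.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: instructions are plain strings tagged with a kind.
enum class Kind { Load, Store, Other };

struct Inst {
  Kind kind;
  std::string text;
};

// Pair the k-th load with the k-th store and emit them adjacently.
// Unpaired loads/stores follow, then all other instructions, each group
// keeping its original relative order.
std::vector<Inst> interleave(const std::vector<Inst> &in) {
  std::vector<Inst> loads, stores, others, out;
  for (const Inst &i : in) {
    if (i.kind == Kind::Load)
      loads.push_back(i);
    else if (i.kind == Kind::Store)
      stores.push_back(i);
    else
      others.push_back(i);
  }
  std::size_t n = loads.size() < stores.size() ? loads.size() : stores.size();
  for (std::size_t k = 0; k < n; ++k) {
    out.push_back(loads[k]);
    out.push_back(stores[k]);
  }
  for (std::size_t k = n; k < loads.size(); ++k) out.push_back(loads[k]);
  for (std::size_t k = n; k < stores.size(); ++k) out.push_back(stores[k]);
  for (const Inst &i : others) out.push_back(i);
  return out;
}
```

Fed the five instructions from the first listing above, this yields the
order of the second listing, with the unrelated `mov` placed last.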

> fixing loads and stores. Maybe you could add a similar thing to
> AArch64?

AArch64LoadStoreOptimizer already exists, but I'll try to add
instruction reordering to it. I saw some code for moving instructions in
ARMLoadStoreOptimizer. Saleem suggested something similar.

> That would have the benefit of not polluting the TableGen files, and
> it could be enabled on demand via a flag, then turned on by default
> only after heavy testing.

I'd prefer that as well. Pseudo instructions were a last resort, as I
was out of options.

Thanks for your help.

Regards,
Sergey

Hum, my bad. Getting old...

--renato