Controlling order of instructions

Hi,

I'm trying to improve the performance of code generated for AArch64, as
described in this thread [0] about memcpy() inlining, and I have run into
two issues:

1) Some output sequences look like this:
     ...something...
     load
     ...something...
     store
     load
     store
   By chaining each load to the preceding store, I can turn it into:
     ...something...
     ...something...
     load
     store
     load
     store
   This can be done directly in SelectionDAG.cpp, but the change is
   target-specific and shouldn't go there. If I move it into the
   target-specific code, it never gets called, because SelectionDAG tries
   the generators in the following order:
    1. Try the generic load&store generator.
    2. Try the target-specific load&store generator.
    3. Try the generic load&store generator again and force generation.

   My question is: how can I give the target-specific generator higher
   priority in this case? Is there a flag for this, or would it be worth
   adding one?
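
To make the call order above concrete, here is a minimal self-contained
sketch of it (C++17). Everything in it is hypothetical and is not LLVM API:
the names Lowering, lowerMemcpy, makeToyLowering and preferTarget, and the
size limits, are made up for illustration. It only models why the target
generator is never reached while step 1 succeeds, and what a priority flag
would change.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical model, not LLVM API. A generator returns a description of
// the sequence it emitted, or std::nullopt if it declines.
using GenericGen =
    std::function<std::optional<std::string>(unsigned size, bool force)>;
using TargetGen = std::function<std::optional<std::string>(unsigned size)>;

struct Lowering {
    GenericGen generic;
    TargetGen target;
    bool preferTarget = false;  // hypothetical priority flag
};

// Mirrors the three-step order described above.
std::optional<std::string> lowerMemcpy(const Lowering& l, unsigned size) {
    assert(l.generic && l.target && "both generators must be set");
    if (!l.preferTarget) {
        // Step 1: generic generator, allowed to decline.
        if (auto r = l.generic(size, /*force=*/false)) return r;
    }
    // Step 2: target-specific generator.
    if (auto r = l.target(size)) return r;
    // Step 3: generic generator again, this time forced.
    return l.generic(size, /*force=*/true);
}

// Toy generators: the generic one accepts sizes up to genericLimit (or
// anything when forced); the target one accepts sizes up to 64 bytes.
Lowering makeToyLowering(unsigned genericLimit, bool preferTarget) {
    Lowering l;
    l.generic = [genericLimit](unsigned size,
                               bool force) -> std::optional<std::string> {
        if (force || size <= genericLimit) return std::string("generic");
        return std::nullopt;
    };
    l.target = [](unsigned size) -> std::optional<std::string> {
        if (size <= 64) return std::string("target");
        return std::nullopt;
    };
    l.preferTarget = preferTarget;
    return l;
}
```

In this model, lowerMemcpy(makeToyLowering(32, false), 16) is handled in
step 1 and yields "generic". The same 16-byte copy reaches the target
generator either when preferTarget is set or when the generic limit is
lowered to 0.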

2) The second issue seems to be harder. I'd like to prevent the Machine
   Instruction Scheduler from reordering
     load
     store
     load
     store
   into
     load
     load
     store
     store
   Chaining the nodes and specifying IROrder doesn't help (assuming I
   implemented it correctly). I don't see GenericScheduler following any
   particular order when picking instructions that don't depend on each
   other, so the reordering seems to be a side effect of something else.
   If there are more load&store operations (e.g. four load&store pairs),
   only the last four instructions are reordered.

   I don't see how scheduling can be controlled other than by providing a
   custom scheduler, but would that help? I don't see enough ordering
   information at this level and don't understand how the order could be
   forced.
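
One way such a reordering can arise is sketched below with a toy
single-issue list scheduler. This is not GenericScheduler and uses no LLVM
API: Instr, schedule, makeTwoPairs and the latency values are hypothetical.
The sketch assumes the reordering is latency-driven, which is only one
possible cause. Each cycle it issues the first unissued instruction, in
source order, whose operands are ready.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction (hypothetical, far simpler than LLVM's SUnit): a name,
// the number of cycles until its result is available, and the indices of
// the earlier instructions whose results it consumes.
struct Instr {
    std::string name;
    int latency;
    std::vector<std::size_t> deps;
};

// Greedy single-issue list scheduler. Each cycle it issues the first
// not-yet-issued instruction (in source order) whose operands are ready;
// if none is ready, the cycle is a stall.
std::vector<std::string> schedule(const std::vector<Instr>& prog) {
    const int kNotIssued = -1;
    std::vector<int> readyAt(prog.size(), kNotIssued);
    std::vector<std::string> order;
    for (int cycle = 0; order.size() < prog.size(); ++cycle) {
        for (std::size_t i = 0; i < prog.size(); ++i) {
            if (readyAt[i] != kNotIssued) continue;  // already issued
            assert(prog[i].latency >= 0 && "latency must be non-negative");
            bool ready = true;
            for (std::size_t d : prog[i].deps) {
                assert(d < i && "dependencies must point to earlier instructions");
                if (readyAt[d] == kNotIssued || readyAt[d] > cycle) ready = false;
            }
            if (ready) {
                readyAt[i] = cycle + prog[i].latency;
                order.push_back(prog[i].name);
                break;  // single issue: at most one instruction per cycle
            }
        }
    }
    return order;
}

// Two independent load&store pairs; each store depends only on its load.
std::vector<Instr> makeTwoPairs(int loadLatency) {
    return {
        {"load0", loadLatency, {}},
        {"store0", 1, {0}},
        {"load1", loadLatency, {}},
        {"store1", 1, {2}},
    };
}
```

With a load latency of 3 cycles, schedule(makeTwoPairs(3)) returns
load0, load1, store0, store1: the second load fills the stall while the
first store waits for its data. With a load latency of 1 the source order
load0, store0, load1, store1 is kept, because nothing stalls.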

Can somebody advise me on this? If it's documented somewhere and I missed
it, you could just give me a link.

Thanks,
Sergey

[0]: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140714/226044.html

Hi again,

I'm still looking for a way to force a particular order of instructions
that would take precedence over machine instruction scheduling. IR ordering
comes quite close and does help in getting the instructions into the right
order, but either it's not enough to turn off scheduling or I'm using it
wrong (comments in the code and the commits that introduced it are pretty
much all the documentation I've found on this subject).

I still have basically the same questions, but I'll try to rephrase them
hoping that it'll make them clearer:

1. Is it OK to block the generic load&store sequence generator by setting
    the limits for memcpy() generation to very low values? This would
    force a call to the architecture-specific callback for inlining
    memcpy().

2. Would it be better to provide a flag in TargetSelectionDAGInfo to
    alter the behaviour of the generic load&store function, or maybe to
    use the existing flags that say whether the platform supports paired
    loads&stores?

3. Is there a way to ignore latencies for particular instructions? They
    can cause undesired reordering even for instructions whose IR order
    is set explicitly.

Is there any documentation on using IROrder? Is it even supposed to
guarantee strict order? I'm asking because assigning an order can produce
quite surprising sequences, e.g.: 100->101->1->102->103.

I also found ScheduleDAGSDNodes::AddGlue(...); could it be useful in my case?

Could somebody with experience in this part of LLVM say whether I'm
digging in the right direction or whether it's all wrong?

Thanks,
Sergey