instruction scheduling issue

Hi all,

I’m trying to insert a call to the function "llvm_memory_profiling" right before each memory access. The function takes the effective address of the memory access as its single parameter.

An example is as follows: the function call at 402a99 takes a parameter that is passed in %rdi, set at 402a91. The call sits immediately before the memory access I want to monitor: the effective address used by the load at 402a9e is the same one passed to the call.

402a91: 48 8d bc 1c 48 02 00 lea 0x248(%rsp,%rbx,1),%rdi
402a98: 00
402a99: e8 82 e0 ff ff callq 400b20 llvm_memory_profiling@plt
/home/xl10/llvm/test//luleshOMP-0611.cc:1974
402a9e: f2 0f 10 84 1c 48 02 movsd 0x248(%rsp,%rbx,1),%xmm0
402aa5: 00 00
402aa7: f2 0f 11 84 24 b8 01 movsd %xmm0,0x1b8(%rsp)

However, due to instruction scheduling, the instruction following the function call is not always the desired memory access. For example, in the following code the memory access is at 402afa, not immediately after the function call at 402aed.

/home/xl10/llvm/test//luleshOMP-0611.cc:1972
402ae5: 48 8d bc 1c 08 02 00 lea 0x208(%rsp,%rbx,1),%rdi
402aec: 00
402aed: e8 2e e0 ff ff callq 400b20 llvm_memory_profiling@plt
402af2: 48 8d bc 1c c8 02 00 lea 0x2c8(%rsp,%rbx,1),%rdi
402af9: 00
/home/xl10/llvm/test//luleshOMP-0611.cc:1975
402afa: f2 0f 10 84 1c 08 02 movsd 0x208(%rsp,%rbx,1),%xmm0
402b01: 00 00

Could anyone point me to a way of solving this, for example by modifying the instruction scheduling module in LLVM so that the function call and the memory access instruction stay next to each other? Thank you very much.

Best regards,

Xu Liu

Liu,

I do not think there is a trivial way to do it. Do you really need those instructions to be adjacent, or is preserving their order enough?

Also, how much performance are you willing to sacrifice to do this? Maybe turning off scheduling altogether is an acceptable solution?

Sergei

Or insert the calls after scheduling.

-Krzysztof

Hi Sergei,

Thanks for your reply. I need to keep the two instructions together because I want the return address of the instrumented function to be the monitored memory access instruction.
How can I turn off instruction scheduling? Is there an easy way to do that?

Best regards,

Xu Liu
PhD student, Rice University
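
For readers following along, the adjacency requirement can be illustrated with a minimal sketch of the runtime side. It assumes a GCC- or Clang-compatible compiler (for `__builtin_return_address`); the record layout and the `accessLog` helper are hypothetical and not taken from the thread. If the call is emitted immediately before the access, the return address captured inside the hook is the address of the monitored instruction.

```cpp
#include <cassert>
#include <vector>

// One record per instrumented access (hypothetical layout, for illustration).
struct AccessRecord {
  const void *EffectiveAddress; // the argument the inserted call passes in
  const void *ReturnPC;         // address of the instruction after the call
};

// Hypothetical in-memory log. Not thread-safe; a real profiler for an
// OpenMP program would need per-thread buffers or locking.
inline std::vector<AccessRecord> &accessLog() {
  static std::vector<AccessRecord> Log;
  return Log;
}

// noinline keeps this a real call, so the return address is meaningful.
extern "C" __attribute__((noinline)) void
llvm_memory_profiling(const void *Addr) {
  // With the call placed right before the access, this is the PC of the
  // monitored load or store.
  const void *PC = __builtin_return_address(0);
  assert(PC != nullptr && "expected a valid return address");
  accessLog().push_back({Addr, PC});
}
```

If the scheduler moves another instruction between the call and the access, `ReturnPC` points at that other instruction instead, which is why order alone is not enough for this design.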

Quoting Sergei Larin <slarin@codeaurora.org>:

Krzysztof,

This would be ideal. How can I do the instrumentation pass after the instruction scheduling?

Xu Liu

Quoting Krzysztof Parzyszek <kparzysz@codeaurora.org>:

You could derive your own class from TargetPassConfig, and add the annotation pass in YourDerivedTargetPassConfig::addPreEmitPass. This will add your annotation pass very late, just before the final code is emitted. If you're using the X86 target, then the class and the function are already there:

lib/Target/X86/X86TargetMachine.cpp:

bool X86PassConfig::addPreEmitPass() {
   bool ShouldPrint = false;
   if (getOptLevel() != CodeGenOpt::None && getX86Subtarget().hasSSE2()) {
     addPass(createExecutionDependencyFixPass(&X86::VR128RegClass));
     ShouldPrint = true;
   }

   if (getX86Subtarget().hasAVX() && UseVZeroUpper) {
     addPass(createX86IssueVZeroUpperPass());
     ShouldPrint = true;
   }

   return ShouldPrint;
}

-Krzysztof
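
Why pass order matters here can be shown with a small toy model. This is only a sketch: `Inst`, `instrument`, `schedule` and the other names are hypothetical stand-ins, not LLVM classes or passes, and the "scheduler" simply hoists an address computation above a preceding load to mimic the reordering seen in the disassembly earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy instruction: an opcode kind plus a label. Not LLVM's MachineInstr.
enum class Kind { Lea, Load, Call };
struct Inst {
  Kind K;
  std::string Text;
};
using Block = std::vector<Inst>;

// A small block with two address computations and two loads.
Block sampleBlock() {
  return {{Kind::Lea, "lea A"},
          {Kind::Load, "load A"},
          {Kind::Lea, "lea B"},
          {Kind::Load, "load B"}};
}

// Toy instrumentation pass: insert a profiling call before every load.
Block instrument(const Block &In) {
  Block Out;
  for (const Inst &I : In) {
    if (I.K == Kind::Load)
      Out.push_back({Kind::Call, "call llvm_memory_profiling"});
    Out.push_back(I);
  }
  return Out;
}

// Toy scheduler: move a lea above the load directly in front of it.
Block schedule(Block B) {
  for (std::size_t I = 1; I < B.size(); ++I)
    if (B[I].K == Kind::Lea && B[I - 1].K == Kind::Load)
      std::swap(B[I], B[I - 1]);
  return B;
}

// True if every call is immediately followed by a load.
bool callsAdjacentToLoads(const Block &B) {
  for (std::size_t I = 0; I < B.size(); ++I)
    if (B[I].K == Kind::Call &&
        (I + 1 == B.size() || B[I + 1].K != Kind::Load))
      return false;
  return true;
}
```

In this model, `schedule(instrument(B))` ends up with a `lea` between the first call and its load, while `instrument(schedule(B))` keeps every call adjacent to its load. Adding the annotation pass in addPreEmitPass corresponds to the second ordering.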

Liu,

  This is likely a better solution for you - you do not want to mess with
the scheduler unless you really have to ;)

Sergei

If you need your pass to run before register allocation, you can use the addPreRegAlloc function in the same way. The only problem is that another scheduling pass runs after register allocation, but you should be able to disable it.

-Krzysztof

Krzysztof,

Thanks for your helpful answers.

Xu

Hello everybody,

  I have a case of suspected nondeterminism, and I would like to verify that
it is not a known issue before I dig deep into it.
It seems to happen during the PreVerifier pass ("Preliminary module
verification"). From the little I understand, a verifier pass is
not supposed to change the code (or is it?), but in the debug stream I see the
following:

Common predecessor:

*** IR Dump After Loop-Closed SSA Form Pass ***
for.body.us68: ; preds = %for.body.lr.ph.us81, %for.body.us68
  %arrayidx.us70.phi = phi i8* [ %buf.0.ph, %for.body.lr.ph.us81 ], [ %arrayidx.us70.inc, %for.body.us68 ]
  %add.ptr4.us72.phi = phi i8* [ %add.ptr4.us72.gep, %for.body.lr.ph.us81 ], [ %add.ptr4.us72.inc, %for.body.us68 ]
  %i.043.us69 = phi i32 [ 0, %for.body.lr.ph.us81 ], [ %inc.us73, %for.body.us68 ]
  ...

LV: Found a vectorizable loop (8) in core_state.i
LV: Adding RT check for range: %add.ptr4.us72.phi = phi i8* [ %add.ptr4.us72.gep, %for.body.lr.ph.us81 ], [ %add.ptr4.us72.inc, %for.body.us68 ]
LV: Adding RT check for range: %arrayidx.us70.phi = phi i8* [ %buf.0.ph, %for.body.lr.ph.us81 ], [ %arrayidx.us70.inc, %for.body.us68 ]

Then there are two possible outcomes, triggered by a change in a
completely unrelated portion of the code followed by a rebuild:

*** IR Dump After Preliminary module verification ***

First version:

for.body.us68: ; preds = %scalar.ph, %for.body.us68
  %arrayidx.us70.phi = phi i8* [ %resume.val200, %scalar.ph ], [ %arrayidx.us70.inc, %for.body.us68 ]
  %add.ptr4.us72.phi = phi i8* [ %resume.val, %scalar.ph ], [ %add.ptr4.us72.inc, %for.body.us68 ]

Second version:

for.body.us68: ; preds = %scalar.ph, %for.body.us68
  %arrayidx.us70.phi = phi i8* [ %resume.val, %scalar.ph ], [ %arrayidx.us70.inc, %for.body.us68 ]
  %add.ptr4.us72.phi = phi i8* [ %resume.val200, %scalar.ph ], [ %add.ptr4.us72.inc, %for.body.us68 ]

This difference snowballs thereafter, causing a different instruction order
and ultimately different code.

If this rings a bell for anyone, or it is a known issue, please let me know.

Thanks.

Sergei

Nadav,

  As I peel this onion, it looks like you might know something about
InnerLoopVectorizer::addRuntimeCheck.
What does it do, and could it be causing the issue I described? Could
resuming somehow (nondeterministically) switch the order of PHIs in the
original code?

Thanks a lot.

Sergei.

Hi Sergei,

"addRuntimeCheck" inserts code that checks that two or more arrays are disjoint. I looked at the code and it looks fine. We generate PHIs in the order that they appear in a vector. The values are inserted in 'canVectorizeMemory', which also looks fine. Please let me know if you think I missed something.

Thanks,
Nadav
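
As a source-level analogy of what such a runtime check decides, here is a minimal sketch. It is not the vectorizer's implementation: `rangesDisjoint` and `addBytes` are hypothetical names, and the chunked loop merely stands in for a vectorized loop body. The idea is that the fast path is taken only when the two byte ranges are proven disjoint at run time; otherwise the scalar loop runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when [A, A+N) and [B, B+N) share no byte.
bool rangesDisjoint(const std::uint8_t *A, const std::uint8_t *B,
                    std::size_t N) {
  auto AB = reinterpret_cast<std::uintptr_t>(A);
  auto BB = reinterpret_cast<std::uintptr_t>(B);
  return AB + N <= BB || BB + N <= AB;
}

// Adds Src into Dst element-wise. Returns true if the "wide" path ran.
bool addBytes(std::uint8_t *Dst, const std::uint8_t *Src, std::size_t N) {
  assert(N == 0 || (Dst != nullptr && Src != nullptr));
  if (rangesDisjoint(Dst, Src, N)) {
    // Stand-in for the vectorized loop: process four elements per step.
    std::size_t I = 0;
    for (; I + 4 <= N; I += 4)
      for (std::size_t J = 0; J < 4; ++J)
        Dst[I + J] = static_cast<std::uint8_t>(Dst[I + J] + Src[I + J]);
    // Remainder elements that do not fill a whole chunk.
    for (; I < N; ++I)
      Dst[I] = static_cast<std::uint8_t>(Dst[I] + Src[I]);
    return true;
  }
  // Possible overlap: fall back to the original element-by-element loop.
  for (std::size_t I = 0; I < N; ++I)
    Dst[I] = static_cast<std::uint8_t>(Dst[I] + Src[I]);
  return false;
}
```

The two "Adding RT check for range" lines in the debug output above correspond to the two pointers whose ranges would be compared in this way.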

Nadav,

  Thanks for the quick response. By now I am convinced that the given loop
ends up vectorized differently enough to cause bad things later on, but
I have not found the exact cause yet. To continue with my work I'll have to
simply turn off vectorization for now, but I will come back and investigate.
Again, there is some nondeterminism in the order of PHI processing somewhere.
I'll keep you posted.

Sergei
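
One general way a processing order can change between builds is iteration over a container keyed by pointer value. The thread does not establish that this is the cause here, so the following is purely an illustrative sketch with hypothetical names (`Phi`, `namesInVectorOrder`, `namesInAddressOrder`). It contrasts a vector, which is visited in insertion order, with a pointer-keyed set, which is visited in address order and therefore depends on where the objects happen to be allocated.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a PHI node; only the name matters here.
struct Phi {
  std::string Name;
};

// Visiting a vector always follows insertion order.
std::vector<std::string>
namesInVectorOrder(const std::vector<const Phi *> &V) {
  std::vector<std::string> Out;
  for (const Phi *P : V)
    Out.push_back(P->Name);
  return Out;
}

// Visiting a pointer-keyed set follows address order, which can change
// whenever an unrelated edit shifts memory allocation.
std::vector<std::string>
namesInAddressOrder(const std::vector<const Phi *> &V) {
  std::set<const Phi *> S(V.begin(), V.end());
  std::vector<std::string> Out;
  for (const Phi *P : S)
    Out.push_back(P->Name);
  return Out;
}
```

With two objects stored in an array and inserted in reverse order, the first function reports them in insertion order and the second in address order, so the two visit orders differ even though the inputs are the same.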

Is there a test case that you can share?

Unfortunately no... I would have done that already :)
Once I am back, I'll try to reproduce it with a reduced test case first and
will post it.
Thanks again.

Sergei