Dynamic Profiling - Instrumentation basic query

Hi,

I am new to LLVM, and would like to write a dynamic profiler, say, one that prints out the load addresses of all the load instructions encountered in a program.

From what I could pick up, EdgeProfiling.cpp increments a counter dynamically, which is somehow dumped into llvmprof.out by profile.pl.

Could anyone explain to me how this works? Can I instrument the code to dump the load addresses or other such information to a file?

Thanks!

Hi Silky,

Firstly: Do you really need to do this? You can't just get a memory access trace from a simulator? Or dcache access/miss rates from hardware counters?

Assuming this really is required:

I am new to LLVM, and would like to write a dynamic profiler, say, one that prints out the load addresses of all the load instructions encountered in a program.

From what I could pick up, EdgeProfiling.cpp increments a counter dynamically, which is somehow dumped into llvmprof.out by profile.pl.

profile.pl is just a wrapper. 'EdgeProfiling.cpp' inserts calls to instrumentation functions on control flow edges. libprofile_rt.so provides these functions (code is in runtime/libprofile). It is the functions in libprofile_rt.so that dump the counters to llvmprof.out.

Note: libprofile_rt.so may be named differently on some platforms.

Could anyone explain to me how this works? Can I instrument the code to dump the load addresses or other such information to a file?

Yes, you can do this, though the current profiling code does not profile load addresses at all. The quickest way to get this working is probably to:

A) Add a new pass to insert the load profiling instrumentation; EdgeProfiling.cpp should provide a good starting point to copy from. You just need to modify the IR to insert a call to some function, say 'llvm_profile_load_address', and pass this function the load address as an argument (a rough sketch follows these steps).

B) Add an implementation of 'llvm_profile_load_address' to runtime/libprofile (don't forget to add the symbol to libprofile.exports). Perhaps it just does 'fprintf(file, "%p\n", addr);'.

C) If your implementation of 'llvm_profile_load_address' requires initialisation (such as an fopen), add a call to an 'llvm_profile_load_start' function in the IR for 'main', and an 'llvm_profile_load_end' function can be registered with atexit(). EdgeProfiling.cpp does this, so you can look at that code to see how it works.
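
To make (A) and (B) concrete, here is a rough, untested sketch of what such a pass might look like. It uses the old header layout and pass style from around the time of EdgeProfiling.cpp (things moved around in later releases), and 'llvm_profile_load_address' is just the name suggested above:

#include "llvm/Pass.h"
#include "llvm/Module.h"
#include "llvm/Instructions.h"
#include <vector>
using namespace llvm;

namespace {
  struct LoadProfiler : public ModulePass {
    static char ID;
    LoadProfiler() : ModulePass(ID) {}

    virtual bool runOnModule(Module &M) {
      LLVMContext &C = M.getContext();
      // declare void @llvm_profile_load_address(i8*)
      Constant *Hook = M.getOrInsertFunction("llvm_profile_load_address",
                                             Type::getVoidTy(C),
                                             Type::getInt8PtrTy(C),
                                             (Type *)0);
      bool Changed = false;
      for (Module::iterator F = M.begin(), FE = M.end(); F != FE; ++F)
        for (Function::iterator BB = F->begin(), BE = F->end(); BB != BE; ++BB)
          for (BasicBlock::iterator I = BB->begin(), IE = BB->end(); I != IE; ++I)
            if (LoadInst *LI = dyn_cast<LoadInst>(&*I)) {
              // Cast the pointer operand to i8* and call the hook just
              // before the load executes.
              std::vector<Value*> Args(1);
              Args[0] = CastInst::CreatePointerCast(LI->getPointerOperand(),
                                                    Type::getInt8PtrTy(C),
                                                    "", LI);
              CallInst::Create(Hook, Args, "", LI);
              Changed = true;
            }
      return Changed;
    }
  };
}

char LoadProfiler::ID = 0;
static RegisterPass<LoadProfiler> X("insert-load-profiling",
                                    "Insert load address profiling calls");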

This will slow down your code a lot, maybe even by 100x. Faster implementations are of course possible, but loads are very common, so any extra work will slow things down a lot.
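
For example, one obvious improvement (again, just a hypothetical sketch using the names from above) is to have the runtime buffer the addresses and write them out in batches, instead of doing an fprintf per load:

#include <stdio.h>

#define BUF_SIZE 65536

static void *buffer[BUF_SIZE];
static unsigned used = 0;
static FILE *out;  /* opened by llvm_profile_load_start */

static void flush_buffer(void) {
  fwrite(buffer, sizeof(void *), used, out);
  used = 0;
}

void llvm_profile_load_address(void *addr) {
  buffer[used++] = addr;
  if (used == BUF_SIZE)
    flush_buffer();
}

/* llvm_profile_load_end should flush_buffer() one last time and fclose(out). */

You still pay a function call per load, but the I/O cost mostly disappears.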

Also, once all these calls to external functions have been added, the optimiser will likely be severely hindered. So, to get realistic results, the load profiling instrumentation pass should probably run as late as possible.

Regards,
Alastair.

There is code that does this for older versions of LLVM. I believe it is in the giri project in the LLVM SVN repository. I can look into more details when I get back from vacation. Swarup may also be able to provide information on the giri code.

-- John T.

Hi,

@Alastair: Thanks a bunch for explaining this so well. I was able to write a simple profiler and run it.

I need to profile the code for branches (to simulate branch mispredicts), load/store instructions (for cache hit/miss rates), and a couple of other things, and would therefore need to instrument the code.
However, I would like to know whether it is writing the output to a file that increases the execution time, or the profiling itself. I could probably store the output in a data structure instead.

Also, I have heard of Intel's Pin tool, which can provide memory trace information. Could you please explain what you meant by hardware counters for dcache miss/hit rates?

@Criswell: Thank you so much for helping me with this.
I am starting to write my own code, but having a look at the existing code would definitely help me.

Thanks and Regards,
Silky

Hi Silky,

I need to profile the code for branches (to simulate branch
mispredicts), load/store instructions (for cache hit/miss rates), and a
couple of other things, and would therefore need to instrument the code.
However, I would like to know whether it is writing the output to a file
that increases the execution time, or the profiling itself. I could
probably store the output in a data structure instead.

Also, I have heard of Intel's Pin tool, which can provide memory trace
information. Could you please explain what you meant by hardware
counters for dcache miss/hit rates?

I've also heard of Pin, but never actually used it.

Regarding the hardware counters: x86 processors count various hardware events via internal counters. I think both Intel and AMD processors can do this, but I've only tried it on Intel. The easiest way to access these on Linux is probably via the 'perf' tool [1]. (There are other options on other platforms; I think Intel VTune can use these counters as well.)

[1] https://perf.wiki.kernel.org/

The result of running 'perf' on a random command (xz -9e dictionary) is in the attached file (because my mail client was destroying the formatting). I just chose some counters which seemed to match what you mentioned; there were many more, though, and 'perf list' will show them. The only issue I can think of is that the hardware counters aren't available inside (most?) virtual machines.
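
For example, something like this (event names vary between CPUs; 'perf list' will tell you what yours supports, and './your-program' is whatever binary you want to measure):

perf stat -e branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses ./your-program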

If you need to determine the hit/miss rates, mispredict ratios, etc. for each individual load/store/branch, then I'm not sure these counters are very useful.

Regards,
Alastair.

perf.txt (1.39 KB)

Hi Alastair,
Thank you so much for the information on the tools. Actually, I need to
analyze which sections of code are prone to misses and mispredicts, and
would eventually have to instrument the code.

I was able to instrument and call an external function, but faced an issue
while passing an argument to the function. I am following EdgeProfiling.cpp
but couldn't figure out the problem. Could you please see where I am going
wrong here -

virtual bool runOnModule(Module &M)
{
    Constant *hookFunc;
    LLVMContext &context = M.getContext();
    hookFunc = M.getOrInsertFunction("cacheCounter",
                                     Type::getVoidTy(M.getContext()),
                                     llvm::Type::getInt32Ty(M.getContext()),
                                     (Type *)0);
    cacheCounter = cast<Function>(hookFunc);

    for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
    {
        for (Function::iterator BB = F->begin(), E = F->end(); BB != E; ++BB)
        {
            cacheProf::runOnBasicBlock(BB, hookFunc, context);
        }
    }

    return false;
}

virtual bool runOnBasicBlock(Function::iterator &BB, Constant *hookFunc,
                             LLVMContext &context)
{
    for (BasicBlock::iterator BI = BB->begin(), BE = BB->end(); BI != BE; ++BI)
    {
        std::vector<Value*> Args(1);
        unsigned a = 100;
        Args[0] = ConstantInt::get(Type::getInt32Ty(context), a);
        if (isa<LoadInst>(&(*BI)))
        {
            CallInst *newInst = CallInst::Create(hookFunc, Args, "", BI);
        }
    }
    return true;
}

The C code is as follows -

extern "C" void cacheCounter(unsigned a){
    std::cout<<a<<" Load instruction\n";
}

Error:
line 8: 18499 Segmentation fault (core dumped) lli out.bc

Also, the code works fine when I don't try to print out 'a'.

Thanks for your help.

Regards,
Silky

Hi Silky,

Sorry for the slow reply. You probably already fixed this, but just in case I'll reply anyway.

Comments inline below

Hi Alastair,
Thank you so much for the information on the tools. Actually, I need to
analyze which sections of code are prone to misses and mispredicts, and
would eventually have to instrument the code.

I was able to instrument and call an external function, but faced an issue
while passing an argument to the function. I am following EdgeProfiling.cpp
but couldn't figure out the problem. Could you please see where I am going
wrong here -

virtual bool runOnModule(Module &M)
{
    Constant *hookFunc;
    LLVMContext &context = M.getContext();
    hookFunc = M.getOrInsertFunction("cacheCounter",
                                     Type::getVoidTy(M.getContext()),
                                     llvm::Type::getInt32Ty(M.getContext()),
                                     (Type *)0);
    cacheCounter = cast<Function>(hookFunc);

    for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
    {
        for (Function::iterator BB = F->begin(), E = F->end(); BB != E; ++BB)
        {
            cacheProf::runOnBasicBlock(BB, hookFunc, context);
        }
    }

    return false;
}

virtual bool runOnBasicBlock(Function::iterator &BB, Constant *hookFunc,
                             LLVMContext &context)
{
    for (BasicBlock::iterator BI = BB->begin(), BE = BB->end(); BI != BE; ++BI)
    {
        std::vector<Value*> Args(1);
        unsigned a = 100;
        Args[0] = ConstantInt::get(Type::getInt32Ty(context), a);
        if (isa<LoadInst>(&(*BI)))
        {
            CallInst *newInst = CallInst::Create(hookFunc, Args, "", BI);

Did this line compile without warning? Shouldn't 'BI' be '&(*BI)' or something?

        }
    }
    return true;
}

The C code is as follows -

extern "C" void cacheCounter(unsigned a){
     std::cout<<a<<" Load instruction\n";
}

I was kind of surprised that this worked, so I tried it out, and it does! In general, though, libprofile_rt.so is written in C; if you write it in C++, it will require a C++ runtime library when the instrumented program runs.
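
For reference, a plain-C version of the same function avoids that dependency:

#include <stdio.h>

void cacheCounter(unsigned a) {
    printf("%u Load instruction\n", a);
}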

Error:
line 8: 18499 Segmentation fault (core dumped) lli out.bc

Also, the code works fine when I don't try to print out 'a'.

Thanks for your help.

Regards,
Silky

You could look at the IR for out.bc before running it; if there is something wrong, it will probably be pretty obvious there.
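
For example:

llvm-dis out.bc -o out.ll

will give you a human-readable out.ll, where the declaration of cacheCounter and the inserted calls should be easy to spot.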

Regards,
Alastair.

Hi Alastair,

You're right. I figured out that I wasn't linking the files properly.
Instead of using the llvm-link command, I used the following commands, and
it worked.

opt -load /x/ext_students/silkyar/llvm/Debug+Asserts/lib/cacheProf.so -cacheProf a.bc > out.bc
llc out.bc -o out.s
g++ -o cache.exe out.s cacheSim.o
./cache.exe

Regards,
Silky

There is code that does this for older versions of LLVM. I believe it is in the giri project in the LLVM SVN repository. I can look into more details when I get back from vacation. Swarup may also be able to provide information on the giri code.

I took a quick look, and the dynamic slicing code doesn't appear to be checked into the giri project yet, contrary to what I had originally thought.

We can, however, give you a copy of the code if you would like. That said, having looked at other emails in the thread, I'm not sure it's what you want. Our dynamic slicing code only instruments LLVM IR loads and stores; it does not instrument memory accesses caused by stack spill slots, function argument setup, etc. (these are only visible at the code generation IR level).

If instrumenting LLVM IR loads and stores suffices, and if you'd like a copy of our code, please let me know.

-- John T.

Hi John,

Thanks for getting back to me. I was able to instrument the memory
instructions. I'll reach out to you if I have issues. I really appreciate
the help.

Sincerely,
Silky

Hi John and Silky,

    I can see a copy of the 'giri' slicing project branch here: http://llvm.org/viewvc/llvm-project/giri/. Though it may be a little older, I think it will work. You can look at the code to see how we do the instrumentation.

Thanks,
Swarup.

Hi John and Silky,

     I can see a copy of the 'giri' slicing project branch here: http://llvm.org/viewvc/llvm-project/giri/. Though it may be a little older, I think it will work. You can look at the code to see how we do the instrumentation.

The giri project is supposed to contain both the static slicing code and the dynamic slicing code. It looks like all it contains at present is the static slicing code. At some point, the dynamic slicing code needs to be integrated into it.

-- John T.

Oh, OK. I didn't check the code. I think it only contains your flow tracking analysis code, doesn't it?

Our 'Giri' project was completely separate from it. Should we merge it with this or keep it as a separate project?

-Swarup.

Oh, OK. I didn't check the code. I think it only contains your flow tracking analysis code, doesn't it?

Our 'Giri' project was completely separate from it. Should we merge it with this or keep it as a separate project?

My intention was to have a project called "giri" that contained both the static slicing code that I wrote and the dynamic slicing code (currently named "giri") that we wrote. I put the static slicing code into the public project, but the dynamic slicing code is still on the TODO list.

-- John T.