Finding all pointers to functions

I need to track down all pointers anywhere in a module that could be pointing to functions (because some of the optimizations I want to do require either identifying every use of a function, or conservatively identifying when that cannot be done).

A starting point is to look at all the global variables:

  for (auto &G : M.globals())
    for (auto &V : G.operands())
      if (auto F = dyn_cast<Function>(V))

Of course, instructions can also refer to functions, both as direct calls and otherwise:

  for (auto &F : M) {
    for (auto &I : inst_range(F)) {
      for (auto &V : I.operands())
        if (auto F = dyn_cast<Function>(V))
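
Roughly, the combined walk would look something like this (an untested sketch; collectReferencedFunctions and the Refs set are just names I made up):

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  // Untested sketch: collect every Function that appears as an operand of a
  // global variable or of an instruction anywhere in the module.
  static void collectReferencedFunctions(Module &M,
                                         SmallPtrSetImpl<Function *> &Refs) {
    // Function pointers stored in global variable initializers.
    for (auto &G : M.globals())
      for (auto &V : G.operands())
        if (auto *F = dyn_cast<Function>(V))
          Refs.insert(F);

    // Function operands of instructions: direct calls and everything else.
    // (inst_range may be spelled instructions(F) depending on the tree.)
    for (auto &F : M)
      for (auto &I : inst_range(F))
        for (auto &V : I.operands())
          if (auto *Callee = dyn_cast<Function>(V))
            Refs.insert(Callee);
  }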

But there are other things as well. For example, it seems there is something called a personality function that can be a pointer to another function, so I need to add that:

if (F.hasPersonalityFn())

It seems there are other things called prefix data and prologue data, which are pointers to constants that could contain pointers to functions, so I would need to include those as well.
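
I imagine the per-function checks would look something like this (sketch only; visitConstant, Visited and Refs refer to a recursive constant walk like the one sketched a bit further down):

  // Sketch: per-function attachments that can refer to other functions.
  // visitConstant is a placeholder for a recursive walk over Constants.
  for (auto &F : M) {
    if (F.hasPersonalityFn())
      visitConstant(F.getPersonalityFn(), Visited, Refs);
    if (F.hasPrefixData())
      visitConstant(F.getPrefixData(), Visited, Refs);
    if (F.hasPrologueData())
      visitConstant(F.getPrologueData(), Visited, Refs);
  }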

Am I correct in thinking that prefix data and prologue data will not be included in the global variables list, so they do need special handling?

Could they be recursive? That is, could those constants contain pointers to other constants… which end up containing pointers to functions… such that none of the intermediate constant objects are in the global variable list?
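
If they can, I suppose the conservative answer is to walk constant operands recursively regardless; a rough, untested sketch of such a walk (names are just illustrative):

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Function.h"

  using namespace llvm;

  // Untested sketch: recursively walk a Constant (ConstantStruct,
  // ConstantArray, ConstantExpr, ...) and record any Functions it refers to.
  // The Visited set keeps shared constant graphs from being walked twice.
  static void visitConstant(Constant *C, SmallPtrSetImpl<Constant *> &Visited,
                            SmallPtrSetImpl<Function *> &Refs) {
    if (!Visited.insert(C).second)
      return;
    if (auto *F = dyn_cast<Function>(C)) {
      Refs.insert(F);
      return;
    }
    for (auto &Op : C->operands())
      if (auto *OpC = dyn_cast<Constant>(Op))
        visitConstant(OpC, Visited, Refs);
  }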

Is there anything else I’m missing?

Oh, I just came across Function::hasAddressTaken. Maybe I can just use that instead?
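
That is, something like this (sketch; MayBePointedTo is just an illustrative name):

  // Sketch: conservatively collect every function whose address is taken
  // anywhere, and treat each of them as potentially pointed to.
  SmallPtrSet<Function *, 16> MayBePointedTo;
  for (auto &F : M)
    if (F.hasAddressTaken())
      MayBePointedTo.insert(&F);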

You could conservatively assume that any function that has its address taken has a pointer to it that escapes into memory or external code.

To make things a little more accurate, you could scan the uses of any function for which hasAddressTaken() returns true and see if any of its uses escapes its function or escapes into memory or external code. I believe hasAddressTaken() returns true if the function is subjected to a cast instruction, and functions are often casted if they are used in a call that uses a different signature than the function’s declared signature.

To get anything more accurate, you’ll need to use alias analysis or points-to analysis. DSA tracks function pointers in the heap and can tell you whether the function is called from external code. However, DSA’s accuracy currently suffers if it is run after LLVM’s optimizations, and the code needs some serious TLC.

Regards,
John Criswell

You could conservatively assume that any function that has its address
taken has a pointer to it that escapes into memory or external code.

Right, that's what I'm doing to start with.

To make things a little more accurate, you could scan the uses of any
function for which hasAddressTaken() returns true and see if any of its
uses escapes its function or escapes into memory or external code. I
believe hasAddressTaken() returns true if the function is subjected to a
cast instruction, and functions are often casted if they are used in a call
that uses a different signature than the function's declared signature.

I'll look into that. It seems reasonable to guess that the major
confounding factor in many C++ programs will be references from virtual
function tables; there should be some way to optimize those specifically.

To get anything more accurate, you'll need to use alias analysis or
points-to analysis. DSA tracks function pointers in the heap and can tell
you whether the function is called from external code. However, DSA's
accuracy currently suffers if it is run after LLVM's optimizations, and the
code needs some serious TLC.

DSA presumably stands for data structure analysis. TLC = tender loving
care? Why does DSA become less accurate if run after optimization?

Hi Russell,

I need to track down all pointers anywhere in a module that could be pointing to functions (because some of the optimizations I want to do require either identifying every use of a function, or conservatively identifying when that cannot be done).

A starting point is to look at all the global variables:

  for (auto &G : M.globals())
    for (auto &V : G.operands())
      if (auto F = dyn_cast<Function>(V))

Of course, instructions can also refer to functions, both as direct calls and otherwise:

  for (auto &F : M) {
    for (auto &I : inst_range(F)) {
      for (auto &V : I.operands())
        if (auto F = dyn_cast<Function>(V))

Your code is reminiscent of the traversal logic in tools/verify-uselistorder.

There's a chance you might be able to reuse the ValueMapping class from that project.

But there are other things as well. For example, it seems there is something called a personality function that can be a pointer to another function, so I need to add that:

    if (F.hasPersonalityFn())

It seems there are other things called prefix data and prologue data, which are pointers to constants that could contain pointers to functions, so I would need to include those as well.

Am I correct in thinking that prefix data and prologue data will not be included in the global variables list, so they do need special handling?

Prefix data, prologue data, and function personalities do not need to be globals.

This will require special handling. The ValueMapping class has good skeleton code for this.

Could they be recursive? That is, could those constants contain pointers to other constants... which end up containing pointers to functions... such that none of the intermediate constant objects are in the global variable list?

I don't see how this is possible. If it is, we'd need to file a PR against verify-uselistorder.

best
vedant

DSA was built when LLVM’s optimizations maintained the type information on GEP and other instructions (DSA existed before LLVM was open-source). As such, it uses LLVM’s type information to aid in its type-inference which, in turn, gives it field sensitivity which, in turn, improves its accuracy. Over time, LLVM optimizations have come to modify the type information so that it is just simple byte-level indexing (as opposed to array-of-structure indexing). DSA hasn’t been updated to handle that well. That is why its precision is better pre-optimization than post-optimization.

Just out of curiosity, what are you trying to do? I need call graph analysis for C/C++ code with function pointers, and so I’m writing an NSF proposal to seek funding to do that (among other enhancements to my SVA infrastructure). If it’s something that would be useful to you (or other LLVM community members), it would be useful for me to know that.

Regards,
John Criswell

DSA was built when LLVM's optimizations maintained the type information on
GEP and other instructions (DSA existed before LLVM was open-source). As
such, it uses LLVM's type information to aid in its type-inference which,
in turn, gives it field sensitivity which, in turn, improves its accuracy.
Over time, LLVM optimizations have come to modify the type information so
that it is just simple byte-level indexing (as opposed to
array-of-structure indexing). DSA hasn't been updated to handle that
well. That is why its precision is better pre-optimization than
post-optimization.

Ah! I don't suppose you could point to some examples of this? E.g. a simple
test program such that one could eyeball the intermediate code before and
after optimization?
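
Say, a loop doing array-of-structure indexing, something like this? (Just a guess at the kind of test case that would show it; I haven't checked what the IR actually looks like before and after, and Pair/sumValues are made-up names.)

  // Guess at a test case: the unoptimized IR should index Pairs with
  // struct-typed GEPs, which optimization may flatten toward byte-level
  // addressing.
  struct Pair { int Key; int Value; };

  int sumValues(Pair *Pairs, int N) {
    int Sum = 0;
    for (int I = 0; I < N; ++I)
      Sum += Pairs[I].Value;
    return Sum;
  }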

Just out of curiosity, what are you trying to do? I need call graph
analysis for C/C++ code with function pointers, and so I'm writing an NSF
proposal to seek funding to do that (among other enhancements to my SVA
infrastructure). If it's something that would be useful to you (or other
LLVM community members), it would be useful for me to know that.

SVA?

I'm trying to write a superoptimizer that can optimize code based on a
high-level understanding of what it's actually doing, so yes, call graph
analysis that can deal with function pointers does seem likely to be one of
the things that will be needed.

Off the top of my head, no, I don’t have an example, but I suspect any program with an array indexing operation in a for loop will do. Sorry.

SVA is Secure Virtual Architecture. It’s my LLVM-based infrastructure for controlling operating system kernel behavior via compiler instrumentation and hardware configuration. I’ve used it to build a system that protects applications from a compromised operating system kernel, as well as to enforce memory safety and control-flow integrity on operating system kernel code. I need DSA for doing things like:

1) Creating an accurate call graph for kernel code to enforce better control-flow integrity and to test our future infrastructure for measuring the efficacy of defenses against code reuse attacks.

2) Analyzing the memory accesses of kernel modules to see if they modify kernel data structures that they should not modify (e.g., to find rootkits that modify the process list).

3) Optimizing run-time checks that protect kernel data structures, at run time, from other kernel components (useful for a number of things).

In short, strong points-to and call graph analysis enable some interesting research projects.

Nice. One thing you might want to investigate is whether building a call graph analysis off of the TBAA metadata would work. If TBAA works for lots of programs (I hear some non-conformant programs cause it problems), then using it as a springboard for analysis may be effective (as TBAA is already well maintained in the LLVM source tree).

Regards,
John Criswell

I wonder if a conservative estimate of pointed-to functions would also need
to include all functions with externally visible linkage?

I could imagine a few routes by which that allows such a function to have
its address show up in the data being handled by the Module's own code.
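
i.e. on top of the hasAddressTaken check, something like this (sketch; MayBePointedTo being the same illustrative set as before):

  // Sketch: a function that is not local to the module can be named (and
  // have its address taken) by code outside the module, so include it too.
  for (auto &F : M)
    if (!F.hasLocalLinkage() || F.hasAddressTaken())
      MayBePointedTo.insert(&F);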

This is correct. Any function that is externally visible can be called by external code that is outside the analysis’s scope; you must therefore treat it as a function that can be called by external code (in LLVM CallGraph parlance, it can be called by the external calling node).

If you’re going to do this analysis, you want to do it in libLTO after the Internalize pass has been run. The Internalize pass finds functions which are externally visible (because C linkage needs them to be in order for linking to work) and makes them internal, since we know that we will not link with any more LLVM bitcode files. You should get much better results that way.

As an FYI, DSA tracks this sort of information using the External (E) flag on the DSNode. It can determine (provided it has sufficient precision) which heap objects are accessible to external code and which are not.

Regards,
John Criswell

Everyone else has said good/important things here, but for the original problem I’m not quite sure why you were trying to search through to find the uses, rather than walking the Function’s use list?

  • David

Because that was before I realized that functions keep track of a use list.
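
For the record, walking the use list directly looks roughly like this (untested sketch using the current CallBase API; older trees spelled this CallSite, and addressMayEscape is just an illustrative name):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"

  using namespace llvm;

  // Sketch: inspect each use of F. A use that is merely the callee of a
  // call/invoke does not let the address escape by itself; any other use
  // (stored to memory, cast, passed as an argument, ...) might.
  static bool addressMayEscape(const Function &F) {
    for (const Use &U : F.uses()) {
      const auto *Call = dyn_cast<CallBase>(U.getUser());
      if (!Call || !Call->isCallee(&U))
        return true;
    }
    return false;
  }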