We are working on a project related to indirect call analysis and Control Flow Integrity. To match the target of an indirect call, we need to know the type information. After the introduction of opaque pointers (from LLVM 15.0.0), getElementType() and getPointerElementType() were removed. My question is: how can I get the type information of functions now? And how can I get the type information for load and store instructions?
I would be grateful for any suggestions and please let me know if further clarification is required.
Here is an example.
Before opaque pointers, a load instruction looked like this:
load void ()*, void ()** %4, align 8
After opaque pointers, the load instruction looks like this:
load ptr, ptr %4, align 8
Is there any suitable way to get the pointee type of “ptr” (here the loaded value has type void()*) now that opaque pointers have been introduced?
It depends on the instruction or value. In general, more specific type information is available when the instruction needs to care about the type of the memory it touches (a non-exhaustive list):
- CallBase: CB.getFunctionType()
- LoadInst: LI.getType()
- StoreInst: SI.getValueOperand()->getType() (getLoadStoreType(I) also works for both loads and stores, but not for other instructions)
- Function: F.getFunctionType()
- GlobalValue: GV.getValueType() (applies to functions, global variables, global aliases, etc.)
- GetElementPtrInst: GEP.getSourceElementType() (also getResultElementType()).
Umm, thanks. But I have tried those before and they don’t help after opaque pointer was introduced (Opaque Pointers — LLVM 17.0.0git documentation).
Now the LI->getType() would just return “ptr”. But I need the underlying type of the element (maybe void()* or other complex types).
Without any other information that the frontend has added, no. Pointee types don’t really mean anything.
This sort of analysis should be done in the frontend, where you know the language semantics and exactly what you’re trying to check.
Perhaps you can explain more about what you’re trying to do?
Now the LI->getType() would just return “ptr”. But I need the underlying type of the element (maybe void()* or other complex types).
When LI->getType() returns ptr, it means that it’s loading a pointer from memory.
As far as LLVM IR semantics are concerned, there is no meaningful distinction between pointers within the same address space, and this was true even before the transition to opaque pointers (indeed, this is one of the main reasons behind the transition to opaque pointers: the distinction between different pointer element types is primarily noise that takes some effort to maintain). If you have analysis that absolutely depends on the distinction between, say, an int* and a float*, then it’s not really possible to do that at the LLVM IR level.
For some cases, it may be possible to scavenge the pointer element types by looking through the uses of the pointer to guess what it might be. The one I wrote for LLVM-to-SPIR-V conversion (which requires typed pointer elements, since it’s essentially based on an old version of LLVM IR) is here: SPIRV-LLVM-Translator/SPIRVTypeScavenger.cpp at main · KhronosGroup/SPIRV-LLVM-Translator · GitHub. It should be noted, though, that the results still rely heavily on the fact that it’s semantically safe to treat a ptr as i8*; the scavenger just wants to remove several bitcast operations that would be implied if you systematically did that.
Thanks for your response. Basically, I am interested in the analysis of indirect calls in C programs and CFI implementation. The Clang CFI is based on type matching of callee and callsites. Given an indirect call, I want to find the probable targets by matching the function type information with the callsite. And probably I will need to perform more analysis with the types.
For example, if we have a function pointer, void(*f_ptr)(int, int), I want to find targets with type void(*)(int, int). To find the probable targets of f_ptr I will match the type information of Load/Store instruction pointers with f_ptr type.
Maybe I’m missing something, but why do you query the type of a load instead of the type of the call?
Every indirect call is either a call or invoke instruction. There is no type information associated with the pointer, but there is type information associated with the instruction because it must express how to set up the call frame.
Note that this is not the right type information to use for CFI, because a lot of the source-language type information has already been removed from it.
The front end knows both of the things that you need for type-based CFI:
- The type of any address-taken function, at the point where its address is taken and at the point where it is defined. You can emit both of these as metadata, or do the hashing of the types in the front end and just emit that hash as metadata.
- The type of the expected callee at the point where the call is emitted. You can similarly emit this as metadata if required.
Trying to extract this from IR will not work reliably before or after the opaque pointer transition.
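The matching step that this metadata would enable can be sketched in standalone C++. Everything here is an assumption made for illustration: the canonical type strings, the choice of FNV-1a as the hash, and the names typeHash, AddressTakenFunction, IndirectCallSite, and candidateTargets are hypothetical. It does not reflect how Clang's CFI encodes types.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over a canonical source-level type string such as
// "void(int,int)". A real front end would derive the canonical form
// from its AST; here the string is simply assumed to be given.
std::uint64_t typeHash(const std::string &CanonicalType) {
  assert(!CanonicalType.empty() && "expected a canonical type string");
  std::uint64_t H = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char C : CanonicalType) {
    H ^= C;
    H *= 1099511628211ULL;                    // FNV prime
  }
  return H;
}

// What the front end would record where a function's address is taken.
struct AddressTakenFunction {
  std::string Name;
  std::uint64_t Hash;
};

// What the front end would record where an indirect call is emitted.
struct IndirectCallSite {
  std::string Location;
  std::uint64_t Hash;
};

// Returns the names of all address-taken functions whose recorded type
// hash equals the hash expected at the call site, in input order.
std::vector<std::string>
candidateTargets(const IndirectCallSite &CS,
                 const std::vector<AddressTakenFunction> &Fns) {
  std::vector<std::string> Out;
  for (const AddressTakenFunction &F : Fns)
    if (F.Hash == CS.Hash)
      Out.push_back(F.Name);
  return Out;
}
```

For instance, a call site recorded with typeHash("void(int,int)") would match only the functions recorded with that same string, regardless of how the pointers look in the IR.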
Note that you will also hit problems in real-world C code. For example, both the Perl and Python interpreters rely (relied?) on the fact that functions that took up to four integer or pointer arguments had the same calling convention for variadic and non-variadic calls, and so didn’t bother to do the cast. This is UB in C, but it didn’t break anywhere (I believe it breaks with Apple’s AArch64 calling convention, so it’s hopefully fixed now, but less high-profile examples of the same idiom almost certainly exist).
For indirect calls in C, the address of the function is first loaded with a load instruction and then the call is made. I want to get the type of the function or the function pointer. That is why I mentioned the load instruction.
I got your point. So instead of trying to scavenge the type information from LLVM IR, I can hash the types and emit that as metadata. Later I can utilize the metadata for CFI usage. Right? Thanks for the clarification.