Source-level dataflow analysis with LLVM IR

Hello everyone!

I am Saheel, a PhD student at UC Davis. I have been using LLVM and Clang for about a month now, aiming to do some program analysis. I am somewhat stuck and am hoping to find some help from the experienced folks here.

I am trying to do an intra-procedural dataflow analysis along the lines of “given a variable declaration, return its various definitions, and for each definition, return its different uses”.
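To make the goal concrete, here is a minimal sketch of the kind of answer I am after, written as standalone C++ rather than against Clang or LLVM. Everything in it (`Stmt`, `defUseChains`, `exampleBody`) is a hypothetical name made up for illustration, and it assumes straight-line code with no branches or loops, so a single "currently reaching definition" is enough. A real analysis would need a proper reaching-definitions fixpoint over the control-flow graph.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy statement, not a Clang or LLVM type.
struct Stmt {
    int line;                      // 1-based source line
    std::string def;               // variable written here, "" if none
    std::vector<std::string> uses; // variables read here
};

// Straight-line code only (assumption): maps the line of each definition
// of `var` to the lines that read the value written by that definition.
std::map<int, std::vector<int>> defUseChains(const std::vector<Stmt>& body,
                                             const std::string& var) {
    std::map<int, std::vector<int>> chains;
    int reaching = 0; // line of the definition currently reaching; 0 = none yet
    for (const Stmt& s : body) {
        assert(s.line > 0 && "line numbers are 1-based");
        // Reads happen before the write in the same statement (x = x + 1).
        for (const std::string& u : s.uses)
            if (u == var && reaching != 0)
                chains[reaching].push_back(s.line);
        if (s.def == var) {
            reaching = s.line;
            chains[reaching]; // record the definition even if it has no uses
        }
    }
    return chains;
}

// Example input modelling:
//   1: int x = 1;
//   2: int y = x + 2;
//   3: x = y * 3;
//   4: x = x + 1;
//   5: return x;
std::vector<Stmt> exampleBody() {
    return {
        {1, "x", {}},
        {2, "y", {"x"}},
        {3, "x", {"y"}},
        {4, "x", {"x"}},
        {5, "", {"x"}},
    };
}
```

For `x`, the query `defUseChains(exampleBody(), "x")` pairs the definition on line 1 with the use on line 2, the definition on line 3 with the use on line 4, and the definition on line 4 with the use on line 5.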

After some reading, I thought (correctly?) that if I am able to answer the question “is a given variable definition live at a given location in the program” (liveness analysis), my task is pretty much done.

For this, I started out by writing an LLVM FunctionPass (based on the def-use section of the manual and some old LiveValues code I found online). But I soon figured out that the IR introduces a new temporary (%5, %6, etc.) whenever a program variable is used (i.e. on each load of the variable), and thus an analysis on the IR would not directly answer my question. Is this correct (which would mean I should work with Clang instead)? Or is there a way to work with LLVM IR and still answer source-level questions?
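One observation that may help frame the question: in unoptimized output (for example at -O0), Clang typically gives each local variable a stack slot created by an `alloca`, every write becomes a `store` to that slot, and every read becomes a `load` from it that yields a fresh temporary. Under that assumption, grouping loads and stores by the slot they access would recover per-variable definitions and uses. The sketch below is a toy model in standalone C++, not LLVM code: `Op`, `Inst`, `Accesses`, `groupBySlot`, and `exampleFunction` are hypothetical names. In a real pass the corresponding classes would be `AllocaInst`, `StoreInst`, and `LoadInst`, and the grouping would stop working once a pass such as mem2reg promotes the slots to SSA values.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Op { Alloca, Store, Load, Other };

// Hypothetical, simplified instruction; not LLVM's Instruction class.
struct Inst {
    Op op;
    std::string result;  // "%x" for an alloca, "%1" for a load, "" if unused
    std::string pointer; // stack slot read by a load or written by a store
};

struct Accesses {
    std::vector<int> defs; // instruction indices of stores to the slot
    std::vector<int> uses; // instruction indices of loads from the slot
};

// Groups stores (definitions) and loads (uses) by the alloca they access.
std::map<std::string, Accesses> groupBySlot(const std::vector<Inst>& fn) {
    std::map<std::string, Accesses> slots;
    for (int i = 0; i < static_cast<int>(fn.size()); ++i) {
        const Inst& in = fn[i];
        if (in.op == Op::Alloca) {
            assert(!in.result.empty() && "an alloca must name its slot");
            slots[in.result]; // one slot per source-level local (assumption)
            continue;
        }
        auto it = slots.find(in.pointer);
        if (it == slots.end())
            continue; // not an access to a known local slot
        if (in.op == Op::Store)
            it->second.defs.push_back(i);
        else if (in.op == Op::Load)
            it->second.uses.push_back(i);
    }
    return slots;
}

// Toy instruction stream modelling: int x = 1; x = x + 1; return x;
std::vector<Inst> exampleFunction() {
    return {
        {Op::Alloca, "%x", ""},   // 0: %x = alloca
        {Op::Store,  "",   "%x"}, // 1: store 1 into %x
        {Op::Load,   "%1", "%x"}, // 2: %1 = load %x
        {Op::Other,  "%2", ""},   // 3: %2 = add %1, 1
        {Op::Store,  "",   "%x"}, // 4: store %2 into %x
        {Op::Load,   "%3", "%x"}, // 5: %3 = load %x
        {Op::Other,  "",   ""},   // 6: ret %3
    };
}
```

Here the two loads produce different temporaries (`%1` and `%3`), yet both are filed under the same slot `%x`, which is the source-level variable the question is about.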

I do know that the IR can be traced back to the source, but I am not sure whether an analysis over the LLVM-generated IR-level variables can still answer questions about source-level variables.

Any pointers to existing (LLVM or Clang) analyses, or help on how to write one myself, would be greatly appreciated! :)

Saheel Godhane.