Computing unique ID of IR instructions that can be mapped back

I am writing an analysis pass on LLVM which requires to:

[1] generate unique, positive ID corresponding to each instruction
[2] the ID must survive across runs
[3] given the ID, corresponding instruction has to be mapped back

For [1], the general suggestion is to use the Value* instr_ptr associated to each instruction. The instr_ptr points to specific instruction in memory, hence unique. However, it does change at every run, thus violating [2]. I have a hacking workaround of getting the first instruction of ‘main’ method and computing the offset. The problem is, this gives me negative offset for instructions lying in lower memory region than ‘main’. Even if I circumvent the problem by adding a positive bias high enough, I have no clue how can I map the instr_ptr back to corresponding IR instruction. Can anyone suggest any elegant workaround?

When you say the ID must survive across runs, you mean that instructions that look the same must have the same ID?
Or that the instructions from the first run must have the same ID?

Because doing a unique positive ID is easy.
Things are always constructed in the order they appear in the file.
So using a single counter, walking every function in the module, and then every instruction in the function, and giving each an ID, should be unique, positive, and consistent across runs for the same file.

When you say the ID must survive across runs, you mean that instructions
that look the same must have the same ID?
Or that the instructions from the first run must have the same ID?

If I don't change the bitcode, each instruction should receive the same ID
at every run.

Because doing a unique positive ID is easy.
Things are always constructed in the order they appear in the file.
So using a single counter, walking every function in the module, and then
every instruction in the function, and giving each an ID, should be unique,
positive, and consistent across runs for the same file.

Let's say, I walk the file, increment the counter but how'll I assign an
ID to an instruction. Because, I need these IDs during an instrumentation
phase. In other words, how will the instrumentation phase know what ID has
been assigned to a particular instruction? Because, the instrumentation
knows about the instr_ptr (memory address of an instruction) which changes
at every run.

(adding back llvm-dev)

Is there any standard means to add an extra field to LLVM IR instructions?

Your description implies that you are not intending to change the on-disk format, so it’s simple:

It is a class. Change the source to add a field to it. Use it as you wish.

–paulr

(adding back llvm-dev)

Is there any standard means to add an extra field to LLVM IR instructions?

Your description implies that you are not intending to change the
on-disk format, so it's simple:

It is a class. Change the source to add a field to it. Use it as you wish.

An alternative way to do it, which would be less invasive and more upstreaming friendly, would be to implement this as an analysis pass.

In the pass, calculate a map<Instruction*,ID>, and use that as the analysis result. Then query that result for the ID where you need it. If you've written the analysis correctly, then you'll always get the same IDs for the same instructions given the same input IR.

Jon

Isn't Instruction* a pointer to the instruction in memory? If so, it'll
change across runs.

Isn’t Instruction* a pointer to the instruction in memory? If so, it’ll change across runs

Correct, but none of this is meant to be stored across runs, so the actual address doesn’t matter. You recompute the map every time your IR is loaded.

Right.
As folks have told you, the following loop will always give the same instructions in the same order given the same input IR:

for (auto &BB : F)
for (auto &I : BB)

Thank you very much to you all. I got the idea. Only problem is now to relate the ID back to a specific IR instruction in case I want to trace back which ID an instruction corresponds to. I can either write the mapping on disk or augment the bitcode format to accommodate an ID field (may be).

Why not have your analysis pass build its own map.
The map would map between the Instruction and an integer index or offset.
I.e.
int index, Value *, Node *, Mod *
Where "Value *" is the reference to the instruction itself, "Node *"
is the BB that the instruction is contained in. "Mod *" is the module
containing all the Nodes.
The index will be the same every time you build the table if the IR
has not changed.
The analysis code then uses the index to refer back to the
instruction's context.

Based on earlier replies on this thread, I plan to proceed in the exact
same way. However, it'd be more elegant (implementation-wise) to be able to
hook into the IR generation process and make the instructions carry the ID
as a metadata in the IR itself. If I implement the way you suggested (and I
am planning, too), I need to write the mapping on disk in a separate file
to be able to refer back to it later on.