Marking source locations without interfering with optimization?

I've been thinking of adding an instruction, and I'm following the
advice in the docs to consult the list before doing something rash.

What I want to do is provide a way to identify variable names and
source locations that doesn't affect the effectiveness of
optimizations. This is not the same problem as supporting debug info,
because I don't care about being able to look up unique names for
memory locations or evaluating expressions, etc... I just want to be
able to say during an optimization pass what the best guess for the
source location and variable names are for a value or instruction that
the pass is doing something interesting to.

Because I don't need to support the functionality of a debugger with
this, it is OK if that best guess contains more than one possibility,
as long as it isn't a huge number of possibilities. The idea is that
I'm producing information for a programmer who needs to know what is
going on during optimization, so I want to give them as much detail as
possible, it's OK if it isn't exact, but it is not OK if it interferes
with the optimization, because that's the whole point.

So, given those goals, it seems that just using the traditional debug
info as it is designed is not a good idea, since I want more and
fuzzier answers.

Also, unless I'm missing something, the debug info uses intrinsic
function calls, which are treated as un-analyzable, and if I tried
supplying those with actual values to link the values to the source,
then some important analyses will fail. Is that right or am I
misunderstanding the docs on intrinsics?

So, I thought one way to go would be to introduce an instruction meant
just for marking the source location of a value - it'd consume a value
and some constants marking the location - then the front end could
generate it (not by default!) where necessary to make sure a value
could be traced back to its source location. It'd either be lowered
away or it'd have to be ignored during codegen since we might still
want to know that info then, for instance, to track register spills
back to which variable spilled.

What problems can you think of with that approach? Am I asking for
trouble with passes, or would a semantically meaningless 'marker'
instruction be OK?

If you have suggestions for a better way to do this, that'd be great.
There isn't a lot of prior work I found on this, most of what I saw
was about debug info, which as I stated, is not quite what I need.

Thanks!

> I've been thinking of adding an instruction, and I'm following the
> advice in the docs to consult the list before doing something rash.

Always a good idea! :) Instead of adding an instruction, I'd suggest adding an intrinsic. You can mark intrinsics as not reading/writing to memory (see lib/Analysis/BasicAliasAnalysis.cpp for example, look for llvm.isunordered to see how it is handled).

> What I want to do is provide a way to identify variable names and
> source locations that doesn't affect the effectiveness of
> optimizations. This is not the same problem as supporting debug info,
> because I don't care about being able to look up unique names for
> memory locations or evaluating expressions, etc... I just want to be
> able to say during an optimization pass what the best guess for the
> source location and variable names are for a value or instruction that
> the pass is doing something interesting to.

Okay... this is tricky. Anything that will bind to variables will prevent modification to the variable. I would suggest something like this (C syntax for the llvm code):

int foo() {
   %A = alloca int
   llvm.myintrinsic("A", "whatever data you want")
}

> Because I don't need to support the functionality of a debugger with
> this, it is OK if that best guess contains more than one possibility,
> as long as it isn't a huge number of possibilities. The idea is that
> I'm producing information for a programmer who needs to know what is
> going on during optimization, so I want to give them as much detail as
> possible, it's OK if it isn't exact, but it is not OK if it interferes
> with the optimization, because that's the whole point.

Given the above, you can use the constant string "A" to look up things in the symbol table of the function. You will probably want to accept "A" and anything that starts with "A.".
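
To make the matching rule concrete, here is a minimal sketch in Python rather than LLVM's own C++. It assumes the function's symbol table can be modeled as a plain dict keyed by value name; `matches_marker` and `lookup` are hypothetical helper names for illustration, not LLVM APIs.

```python
def matches_marker(value_name: str, marker_name: str) -> bool:
    """True if value_name is exactly marker_name or starts with marker_name + '.'."""
    return value_name == marker_name or value_name.startswith(marker_name + ".")


def lookup(symbol_table: dict, marker_name: str) -> list:
    """Return the sorted names in the (modeled) symbol table that a marker string refers to."""
    return sorted(name for name in symbol_table if matches_marker(name, marker_name))


# Hypothetical symbol table contents: value name -> short description.
symbol_table = {
    "A": "alloca int",
    "A.1": "load",
    "AB": "alloca int",   # unrelated value that merely shares a prefix
    "tmp.1": "add",
}

candidates = lookup(symbol_table, "A")
print(candidates)
```

Requiring the "." after the prefix keeps an unrelated name such as "AB" out of the result, while still giving more than one candidate when several values derive from "A".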

> So, given those goals, it seems that just using the traditional debug
> info as it is designed is not a good idea, since I want more and
> fuzzier answers.

Makes sense.

> Also, unless I'm missing something, the debug info uses intrinsic
> function calls, which are treated as un-analyzable, and if I tried
> supplying those with actual values to link the values to the source,
> then some important analyses will fail. Is that right or am I
> misunderstanding the docs on intrinsics?

Correct.

> So, I thought one way to go would be to introduce an instruction meant
> just for marking the source location of a value - it'd consume a value
> and some constants marking the location - then the front end could
> generate it (not by default!) where necessary to make sure a value
> could be traced back to its source location. It'd either be lowered
> away or it'd have to be ignored during codegen since we might still
> want to know that info then, for instance, to track register spills
> back to which variable spilled.

I think the above will work for you: you can make it ignored or deal with it however you want using the intrinsic lowering code. Check out how other intrinsics are handled (e.g. llvm.isunordered, which is handled by the code generators, and llvm.dbg.*, which are not) for ideas.

> What problems can you think of with that approach? Am I asking for
> trouble with passes, or would a semantically meaningless 'marker'
> instruction be OK?

I'd seriously suggest using an intrinsic instead of an instruction: they are far, far easier to add. Aside from that, using the symbol table is really the only thing that will work. It is prone to obvious problems, but should work pretty well in practice.

> If you have suggestions for a better way to do this, that'd be great.
> There isn't a lot of prior work I found on this, most of what I saw
> was about debug info, which as I stated, is not quite what I need.

Hope this helps!

-Chris

Chris, thanks for the suggestions.

>> I've been thinking of adding an instruction, and I'm following the
>> advice in the docs to consult the list before doing something rash.
>
> Always a good idea! :) Instead of adding an instruction, I'd suggest
> adding an intrinsic. You can mark intrinsics as not reading/writing to
> memory (see lib/Analysis/BasicAliasAnalysis.cpp for example, look for
> llvm.isunordered to see how it is handled).

OK, I didn't know about that - thanks.

>> What I want to do is provide a way to identify variable names and
>> source locations that doesn't affect the effectiveness of
>> optimizations. This is not the same problem as supporting debug info,
>> because I don't care about being able to look up unique names for
>> memory locations or evaluating expressions, etc... I just want to be
>> able to say during an optimization pass what the best guess for the
>> source location and variable names are for a value or instruction that
>> the pass is doing something interesting to.
>
> Okay... this is tricky. Anything that will bind to variables will
> prevent modification to the variable.

I see - so if I wanted to use my earlier approach, I'd need to change every
optimization and analysis to treat the 'marker' instructions specially as
instructions that don't modify their argument, a big mess...

So it sounds like the only way to really not interfere with
optimizations is to avoid binding to the variables, which means that if
instructions are moved or copied, the markers I add won't be moved or
copied along with the instruction. I was hoping to find a scheme that'd
stay (mostly) up-to-date through modifications with minimal extra changes.

> I would suggest something like
> this (C syntax for the llvm code):
>
> int foo() {
>    %A = alloca int
>    llvm.myintrinsic("A", "whatever data you want")
> }

Just to clarify, you're suggesting that I use the LLVM value's name to
link up with the source info instead of actually binding to it - so in
a slightly more complicated example I might do this:

C code:

1: a = foo();
2: b = bar();
3: a = a + b;

llvm code:

%a = call foo()
llvm.myintrinsic("%a", "a", 1)
%b = call bar()
llvm.myintrinsic("%b", "b", 2)
%tmp.1 = add %a, %b
llvm.myintrinsic("%tmp.1", "a", 3)

> Hope this helps!

It's certainly given me lots to think about.

Thanks,
-mike

>> Okay... this is tricky. Anything that will bind to variables will
>> prevent modification to the variable.
>
> I see - so if I wanted to use my earlier approach, I'd need to change every
> optimization and analysis to treat the 'marker' instructions specially as
> instructions that don't modify their argument, a big mess...

Exactly.

> So it sounds like the only way to really not interfere with
> optimizations is to avoid
> binding to the variables, which means that if instructions are moved
> or copied, the markers I add won't be moved or copied along with the
> instruction. I was hoping to find a scheme that'd stay (mostly)
> up-to-date through modifications with minimal extra changes.

I don't really think there is a good way to do that.

>> I would suggest something like
>> this (C syntax for the llvm code):
>>
>> int foo() {
>>    %A = alloca int
>>    llvm.myintrinsic("A", "whatever data you want")
>> }
>
> Just to clarify, you're suggesting that I use the LLVM value's name to
> link up with the source info instead of actually binding to it - so in
> a slightly more complicated example I might do this:
>
> C code:
>
> 1: a = foo();
> 2: b = bar();
> 3: a = a + b;
>
> llvm code:
>
> %a = call foo()
> llvm.myintrinsic("%a", "a", 1)
> %b = call bar()
> llvm.myintrinsic("%b", "b", 2)
> %tmp.1 = add %a, %b
> llvm.myintrinsic("%tmp.1", "a", 3)

Exactly. The %'s are a figment of the asmprinter's imagination, so you wouldn't need to include them, but this is basically what I was getting at.
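
As a rough illustration of how a pass could turn those marker calls into a "best guess" report, here is a small Python sketch, not LLVM code. It assumes each marker carries (value name, source variable, source line) as in the example, and that a renamed value keeps its original name as a dotted prefix; `build_index` and `best_guess` are hypothetical names.

```python
from collections import defaultdict

# (value name, source variable, source line), taken from the example in this thread.
markers = [("%a", "a", 1), ("%b", "b", 2), ("%tmp.1", "a", 3)]


def build_index(markers):
    """Map each value name (with the printer's leading '%' dropped) to its source guesses."""
    index = defaultdict(list)
    for value_name, source_var, line in markers:
        index[value_name.lstrip("%")].append((source_var, line))
    return dict(index)


def best_guess(index, value_name):
    """Return source guesses for a value, falling back to shorter dotted prefixes."""
    name = value_name.lstrip("%")
    while name:
        if name in index:
            return list(index[name])
        if "." not in name:
            break
        # Assumption: a derived value is named like "a.2", so drop the last suffix and retry.
        name = name.rsplit(".", 1)[0]
    return []


index = build_index(markers)
print(best_guess(index, "%tmp.1"))
print(best_guess(index, "a.2"))
print(best_guess(index, "c"))
```

Because the lookup goes through names only, nothing here uses the value itself, which is the point of the scheme: the marker never appears as a user of the value it describes.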

-Chris

_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


>>> Okay... this is tricky. Anything that will bind to variables will
>>> prevent modification to the variable.
>>
>> I see - so if I wanted to use my earlier approach, I'd need to change every
>> optimization and analysis to treat the 'marker' instructions specially as
>> instructions that don't modify their argument, a big mess...
>
> Exactly.

>> So it sounds like the only way to really not interfere with
>> optimizations is to avoid
>> binding to the variables, which means that if instructions are moved
>> or copied, the markers I add won't be moved or copied along with the
>> instruction. I was hoping to find a scheme that'd stay (mostly)
>> up-to-date through modifications with minimal extra changes.
>
> I don't really think there is a good way to do that.

That's pretty much what I was afraid of.
However, it seems like making the modifications to try to keep these
intrinsic calls near the values they refer to, and to duplicate them
intelligently when copying blocks or instructions, would be easier
than what's involved with a new instruction.

Any comments on using the annotation classes for this? It only seems
to be used by codegen, but it might be appropriate for this kind of
'loose' source info.

I'll dig around some more in the code. Thanks for the help.

>>> I would suggest something like
>>> this (C syntax for the llvm code):
>>>
>>> int foo() {
>>>    %A = alloca int
>>>    llvm.myintrinsic("A", "whatever data you want")
>>> }
>>
>> Just to clarify, you're suggesting that I use the LLVM value's name to
>> link up with the source info instead of actually binding to it - so in
>> a slightly more complicated example I might do this:
>>
>> C code:
>>
>> 1: a = foo();
>> 2: b = bar();
>> 3: a = a + b;
>>
>> llvm code:
>>
>> %a = call foo()
>> llvm.myintrinsic("%a", "a", 1)
>> %b = call bar()
>> llvm.myintrinsic("%b", "b", 2)
>> %tmp.1 = add %a, %b
>> llvm.myintrinsic("%tmp.1", "a", 3)
>
> Exactly. The %'s are a figment of the asmprinter's imagination, so you
> wouldn't need to include them, but this is basically what I was getting
> at.

OK.

>>> So it sounds like the only way to really not interfere with
>>> optimizations is to avoid
>>> binding to the variables, which means that if instructions are moved
>>> or copied, the markers I add won't be moved or copied along with the
>>> instruction. I was hoping to find a scheme that'd stay (mostly)
>>> up-to-date through modifications with minimal extra changes.
>>
>> I don't really think there is a good way to do that.
>
> That's pretty much what I was afraid of.
> However, it seems like making the modifications to try to keep these
> intrinsic calls near the values they refer to, and to duplicate them
> intelligently when copying blocks or instructions, would be easier
> than what's involved with a new instruction.

Since they are intrinsics, code-copying transformations should keep them up to date: if a block is duplicated, the intrinsic calls in it will be duplicated too.

> Any comments on using the annotation classes for this? It only seems
> to be used by codegen, but it might be appropriate for this kind of
> 'loose' source info.

Can't really be used. :(

-Chris
