Question regarding correctness of debug information generated by LLC

I have a program I am developing that is exhibiting some undesirable
behavior, and I'm not entirely sure whether this constitutes a bug in
LLVM or not, so I wanted to ask.

I have managed to construct a minimized program that exhibits the same
issue I am dealing with; it can be found at

For a little background of what I am working on, I am developing a
garbage collector for a language that compiles down to LLVM IR, making
use of the garbage collection safepoints infrastructure in LLVM in
order to find references to heap objects on the stack. The language is
a functional language that relies on tail call optimizations to avoid
unbounded stack growth, so it makes use of the tailcc calling
convention (which is in turn based on the older fastcc calling
convention with -tailcallopt passed to llc).

The issue I am having is that it would appear that the debug
information generated for my program indicates a different value for
the $rsp register within the `apply_rule_6870` call frame at each of
the two different call sites of `scanStackRoots`. The value seems
correct at the first call site. However, at the second call site, the
value appears incorrect and gdb actually cannot find the `main` stack
frame when it tries to do a backtrace.

This is actually a matter of correctness for me because I use
libunwind to find the stack pointer at each call frame in order to be
able to interpret the stack map generated by the LLVM GC safepoints.
When run in the full example, the code sees an incorrect address for
one or more local variables, attempts to read memory incorrectly, and
eventually segfaults as a result.
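
To make the dependency concrete, here is a minimal sketch of why a wrong stack pointer corrupts every root lookup in a frame. It assumes the stack map records each root as an offset from the frame's stack pointer; the function name, the addresses, and the offsets are all hypothetical and are not taken from the actual collector.

```python
def root_addresses(frame_sp, root_offsets):
    """Addresses of the GC root slots in one frame, given that frame's
    stack pointer and the SP-relative offsets from its stack map record."""
    return [frame_sp + off for off in root_offsets]

true_sp = 0x7FFD00001000      # hypothetical stack pointer for the frame
offsets = [8, 24]             # hypothetical SP-relative root slots
sp_error = 48                 # arbitrary error in the unwound stack pointer

good = root_addresses(true_sp, offsets)
bad = root_addresses(true_sp + sp_error, offsets)

# Every root address moves by exactly the error in the stack pointer.
shift = [b - g for g, b in zip(good, bad)]
print(shift)
```

Under that assumption, any error in the unwound stack pointer carries over unchanged to every root address, so the collector reads whatever happens to sit at the shifted locations.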

My question is: is this a bug in LLVM? My gut feeling is yes, because
the compiler ought to generate the necessary cfi declarations in order
to be able to correctly unwind the stack. But I'm not entirely sure
because I'm not completely clear on what guarantees LLVM provides with
respect to debug information on non-standard calling conventions.

So, my question to the list is, should I report this program as a bug
in LLVM? Or do I need to find another means by which to reconstruct
the canonical frame address when looking for garbage collection roots
on the stack? And if so, what means might be available?

Thanks,

> I have a program I am developing that is exhibiting some undesirable
> behavior, and I'm not entirely sure whether this constitutes a bug in
> LLVM or not, so I wanted to ask.
>
> I have managed to construct a minimized program that exhibits the same
> issue I am dealing with; it can be found at
> debug.ll · GitHub
>
> For a little background of what I am working on, I am developing a
> garbage collector for a language that compiles down to LLVM IR, making
> use of the garbage collection safepoints infrastructure in LLVM in
> order to find references to heap objects on the stack. The language is
> a functional language that relies on tail call optimizations to avoid
> unbounded stack growth, so it makes use of the tailcc calling
> convention (which is in turn based on the older fastcc calling
> convention with -tailcallopt passed to llc).
>
> The issue I am having is that it would appear that the debug
> information generated for my program indicates a different value for
> the $rsp register within the `apply_rule_6870` call frame at each of
> the two different call sites of `scanStackRoots`. The value seems
> correct at the first call site. However, at the second call site, the
> value appears incorrect and gdb actually cannot find the `main` stack
> frame when it tries to do a backtrace.

Hi Dwight,

I looked at your gist, and the IR has no debug info annotations, which
means your front end is not generating debug info... where is the
debug info coming from? Or maybe you inadvertently generated the gist
without debug info?

> This is actually a matter of correctness for me because I use
> libunwind to find the stack pointer at each call frame in order to be
> able to interpret the stack map generated by the LLVM GC safepoints.
> When run in the full example, the code sees an incorrect address for
> one or more local variables, attempts to read memory incorrectly, and
> eventually segfaults as a result.
>
> My question is: is this a bug in LLVM? My gut feeling is yes, because
> the compiler ought to generate the necessary cfi declarations in order
> to be able to correctly unwind the stack. But I'm not entirely sure
> because I'm not completely clear on what guarantees LLVM provides with
> respect to debug information on non-standard calling conventions.

Tail calls don't seem that far out of the ordinary; but without seeing
how you're generating debug info, it's a little hard to be helpful.
It might well be a bug in what LLVM is doing, but it might not, and
we'd need more complete instructions on how to reproduce the problem
before we can answer that fundamental question.

> So, my question to the list is, should I report this program as a bug
> in LLVM?

Right now is actually not a great time to report a new bug, because we
are transitioning from Bugzilla to GitHub issues. Let's see if we can
solve this via email (of course many in the U.S. are about to be on
holiday for a few days) and if not, we're hoping to be up and running
on GitHub early next week.

Thanks,
--paulr

Hi,

Sorry for the late reply, this is my first day back after a long
holiday weekend. The gist of your reply seems to be that it's a little
hard to determine what might be going wrong because there is no debug
information in the IR I shared. There was debug information in my
original IR, it's true, but I was still able to reproduce the issue I
was encountering whether or not the debug information was present, so
it got removed during the process of minimization. The reason I
considered this normal is because LLC still generates CFI/CFA
directives in the assembly even when no debug information is present
in the IR, and my understanding was that this was the information that
the debugger used in order to unwind the stack.

If my understanding is incorrect, and the debugger actually relies on
information provided via the IR debug metadata in order to unwind the
stack, it's possible that the issue I encountered might simply be as a
result of my not having provided the correct metadata in the IR, since
the code that generated that metadata was something I wrote myself. If
that is the case, can you help me understand what metadata in the IR
might be used by the debugger in order to unwind the stack, so that I
can test some more on my end before getting back to you? I don't want
to waste your time if it turns out that I simply was providing the
wrong IR to LLC.

Thanks,
Dwight

> Sorry for the late reply, this is my first day back after a long
> holiday weekend. The gist of your reply seems to be that it's a little
> hard to determine what might be going wrong because there is no debug
> information in the IR I shared. There was debug information in my
> original IR, it's true, but I was still able to reproduce the issue I
> was encountering whether or not the debug information was present, so
> it got removed during the process of minimization. The reason I
> considered this normal is because LLC still generates CFI/CFA
> directives in the assembly even when no debug information is present
> in the IR, and my understanding was that this was the information that
> the debugger used in order to unwind the stack.

Ah, okay, I misunderstood. The CFI/CFA directives are not what I
normally think of as debug information, although of course they are
used to build the unwind tables in the .debug_frame section (or in
the .eh_frame section, if you're not producing debug info).

I suspect you are not targeting an X86-family architecture? I was
not able to persuade llc for x86_64 Ubuntu to emit the tail calls
that you describe. Could you provide a complete llc command line
(including triple) that demonstrates the problem for you?

Thanks,
--paulr

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
$ ~/llvm-project/build/bin/llc --version
LLVM (http://llvm.org/):
  LLVM version 13.0.0
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver1

  Registered Targets:
    x86 - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
$ ~/llvm-project/build/bin/llc -mtriple=x86_64-unknown-linux-gnu -O0 debug.ll

If you inspect the debug.s, you will see that all the `tail call
tailcc` calls which are actually in tail position will be compiled to
the `jmp` instruction, as expected. The other calls are either not
`tailcc` calling convention, not marked as a tail call (because in the
original unminimized source, they were not in tail position), or are
not now in tail position (because they were in the original source,
but the minimized version does not put them in that position).

> $ ~/llvm-project/build/bin/llc -mtriple=x86_64-unknown-linux-gnu -O0 debug.ll

Thanks! That is very helpful.

I see a `jmp` from sender12 to apply_rule_6300, from apply_rule_6300
to sender4, and from apply_rule_6299 to sender1. Does that match
your expectations? I want to make sure I'm looking at what you're
looking at.

> The issue I am having is that it would appear that the debug
> information generated for my program indicates a different value for
> the $rsp register within the `apply_rule_6870` call frame at each of
> the two different call sites of `scanStackRoots`. The value seems
> correct at the first call site. However, at the second call site, the
> value appears incorrect and gdb actually cannot find the `main` stack
> frame when it tries to do a backtrace.

Looking at the paths to scanStackRoots, the call sequence and
my take on where you're having the problem look like this:

main
  apply_rule_6870
    sender12
      scanStackRoots // here, gdb can find apply_rule_6870 fine
      (jmp) apply_rule_6300
        (jmp) sender4
          apply_rule_6299
            (jmp) sender1
              apply_rule_6297
                koreAllocAndCollect_p1s_blocks
                  koreCollect
                    scanStackRoots // here, there's a problem

I'm not as fluent in x86 as perhaps I should be, but I do see
one thing that makes me wonder if it's entirely correct. The
end of apply_rule_6300 looks like this:

    popq %rax
    .cfi_def_cfa_offset 8 # so far so good
    addq $48, %rsp
    jmp sender4

My guess is that the code generator is adjusting the stack
frame size because sender4 has a much shorter argument list
than apply_rule_6300, and it's possible that the .cfi directives
aren't describing that correctly. Truthfully, I don't know
enough about .cfi directives to say whether they *can* describe
that correctly.
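
The bookkeeping behind that guess can be sketched with a toy model. This is only an illustration of the hypothesis above, not a diagnosis: it tracks the rule "CFA = %rsp + offset" through the four lines quoted from apply_rule_6300 and reports any instruction reached while the last `.cfi_def_cfa_offset` no longer matches the real distance. The starting offset of 16 is inferred from the `.cfi_def_cfa_offset 8` that follows the 8-byte pop; the function name and the step encoding are made up for this sketch.

```python
def find_stale_cfa(steps, start_offset):
    """Return (instruction, recorded_offset, actual_offset) for every
    instruction reached while the recorded CFA offset is out of date."""
    actual = recorded = start_offset
    stale = []
    for step in steps:
        if step[0] == "cfi":
            recorded = step[1]      # models .cfi_def_cfa_offset N
            continue
        _, text, rsp_delta = step
        if actual != recorded:
            stale.append((text, recorded, actual))
        actual -= rsp_delta         # raising %rsp shrinks the distance to the CFA
    return stale

# The epilogue quoted above: (kind, text, bytes added to %rsp) or ("cfi", offset).
steps = [
    ("insn", "popq %rax", 8),
    ("cfi", 8),
    ("insn", "addq $48, %rsp", 48),
    ("insn", "jmp sender4", 0),
]

stale = find_stale_cfa(steps, start_offset=16)
print(stale)  # [('jmp sender4', 8, -40)]
```

In this model the recorded offset is 48 bytes off by the time the `jmp` is reached, because nothing after the `addq` updates it. Whether that matters to a real unwinder depends on how the frame handed over to sender4 is described, which the model does not attempt to capture.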

You could work around this by making sender4 not be tail-callable,
perhaps?

This is definitely worth filing a bug; unfortunately, the project
is transitioning from Bugzilla to GitHub issues, and the transition
is not complete, which means there is literally no way to file a
bug at the moment. When that opens up, though, if you could file it
(calling it "unwind info" rather than "debug info" to avoid confusion)
that would be very much appreciated! I'll leave a note to myself to
ping this thread when the transition is done.

Thanks,
--paulr

Thanks for the info! This definitely tells me what I need to know. I
will watch for when GitHub issues starts accepting bug reports and
file the bug there under "unwind info" as you requested.
As an aside, yes, those three call sites are the three call sites I
expect to see a tail call at.

Dwight

Quick ping to make sure this got filed...
Thanks,
--paulr

It got filed (https://github.com/llvm/llvm-project/issues/52758), but it looks like it’s still got the “new issue” tag. There are actually a sizeable number of issues with this tag sitting around that were created since the migration, some of which are definitely not that recent. Have you guys not sorted out a policy for assigning the appropriate tags or people to issues on GitHub yet?

Well, no one triaged the bug and assigned the tags here.

Whose responsibility is it to triage new bugs? Most people submitting bugs, myself included, aren’t going to have permission to do that, with the way it’s currently set up on GitHub.

We don’t have a formal process for that - people do it when they get to it.