Possible bug in TCO?

My compiler is generating a bunch of code including the following line:

  %57 = call fastcc i32 @aux(%1* %0, %1 %1, %1 %46, i32 0, %4 %2) ; <i32>
[#uses=1]
  ret i32 %57

The program works fine as long as this isn't a tail call. If I compile via
a .ll and insert "tail" by hand, the program segfaults. However, if I make it
a tail call and return an undef i8* or void instead of the i32 0 then it
works (the source code is returning the value () of the type unit in ML, so
it conveys no information and can be returned as anything). That makes me
think this has been a bug in LLVM rather than in my own code.

I'm using LLVM 2.6. Anyone recognise this as a bug in TCO fixed since then or
should I try to boil it down and submit it?

Hi Jon,

The last I heard about tail calls was http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-November/027355.html when you questioned why doing tail calls with no optimization caused llc to barf.

It mentions a revision number but that post is only a few days old so I hope it is fixed in the trunk.

--Sam Crow

I've come up with the following minimal repro that segfaults on my machine:

  define fastcc i32 @g({i32, {i32, i32}}) {
    %1 = extractvalue {i32, {i32, i32}} %0, 0
    %2 = extractvalue {i32, {i32, i32}} %0, 1
    %3 = extractvalue {i32, i32} %2, 0
    %4 = extractvalue {i32, i32} %2, 1
    %5 = add i32 %1, %3
    %6 = add i32 %5, %4
    ret i32 %6
  }
  
  define fastcc i32 @f({i32, {i32, i32}}) {
    %1 = tail call fastcc i32 @g({i32, {i32, i32}} %0)
    ret i32 %1
  }
  
  define i32 @main() {
    %1 = insertvalue {i32, {i32, i32}} undef, i32 1, 0
    %2 = insertvalue {i32, i32} undef, i32 2, 0
    %3 = insertvalue {i32, i32} %2, i32 3, 1
    %4 = insertvalue {i32, {i32, i32}} %1, {i32, i32} %3, 1
    %5 = call i32 @f({i32, {i32, i32}} %4)
    ret i32 %5
  }

I believe this is a bug in the handling of nested structs across tail calls.
There is no pointer manipulation or allocation in this code yet valgrind says
it dies trying to access memory location 0x1.

I'll try with the latest SVN repo to see if it works...

Jon Harrop wrote:

I've come up with the following minimal repro that segfaults on my machine:

Jon, were you able to resolve this?

FWIW, TOT is causing all kinds of weird segfaults related to tail calls
in my Pure interpreter, too (at least on x86-64). In my case these seem
to be limited to the JIT, however (batch-compiled Pure programs via
opt+llc all work fine, even with TCO), so it's probably a different
issue. When using JIT compilation, the Pure interpreter works fine with
LLVM 2.3 thru 2.6, and also with early revisions of 2.7svn, but it fails
most of my test suite with current TOT, even though the generated IR
seems to be the same as before.

Have there been any changes to the x86-64 backend of the JIT which might
break tail call elimination? I didn't see any announcements about major
changes in the JIT on the ml, so I have no idea what might be going
wrong there.

Albert

Have there been any changes to the x86-64 backend of the JIT which might
break tail call elimination? I didn't see any announcements about major
changes in the JIT on the ml, so I have no idea what might be going
wrong there.

Most probably recent changes (e.g. post-RA scheduling) just uncovered
the bugs TCO always had.

Try batch compiling with the large code model. (llc -code-model=large)
If that also causes tail calls to break, then I did something wrong in
fixing far calls in the JIT.

Nope. I just worked around the problem because HLVM was segfaulting on a
single stdlib function call (just removing "tail" fixed the segfault which
was trying to read 0x1 according to ValGrind) so I inlined it and now I have
no segfaults. I kept the full repro as a .ll though.

Arnold showed that my minimal repro was buggy: I'd confused calling
conventions.

Jeffrey Yasskin wrote:

Try batch compiling with the large code model. (llc -code-model=large)

Works fine. Anything else that I could try?

Thanks for checking. Sorry, I don't know anything else.

I haven't tested it, but the following pattern in X86Instr64bit.td
looks suspicious as it appears to attempt to support direct tailcalls
to arbitrary 64-bit immediates:

def : Pat<(X86tcret GR64:$dst, imm:$off),
          (TCRETURNri64 GR64:$dst, imm:$off)>;

Dan

Arnold pointed out to me that I was mistaken here; this offset is a
stack offset, so it's not the kind of thing I was looking for.

With the recent changes to support regular calls where the callee is not
within range for a 32-bit immediate on 64-bit targets, my suspicion was
that perhaps tailcalls needed similar fixing, but at another glance I
don't see anything obviously wrong there. It would be interesting if
someone could look at one of the segfaults in a debugger and determine
which address its trying to jump to, and compare that with the actual
address of the intended callee.

Dan

Jeffrey Yasskin wrote:

Try batch compiling with the large code model. (llc -code-model=large)
If that also causes tail calls to break, then I did something wrong in
fixing far calls in the JIT.

Jeffrey, I took a closer look at this now, and all the TCO-related
weirdness I see in the Pure interpreter is indeed related to your commit
in r88984 ("Make X86-64 in the Large model always emit 64-bit calls").
Up to and including r88983, Pure passes all checks (at least with eager
compilation, see below), with r88984 and later more than half of the
checks fail. This only happens when using dynamic compilation. As I
reported earlier, batch compilation works fine, even if the large code
model is used. OTOH, dynamic compilation is broken no matter which code
model I choose when creating the JIT.

So it seems that r88984 does break fastcc and/or tail calls in the JIT.
Maybe you don't see this in UnladenSwallow because it doesn't do tail calls?

There's also some minor breakage which isn't TCO-related (four failed
checks in the Pure interpreter) when reenabling lazy compilation with
DisableLazyCompilation(false). These seem to go all the way back to your
commit of Nick's patch (r84032 = "Keep track of stubs that are created"
fails exactly the same checks, while r84031 is fine). Those four Pure
checks all involve anonymous closures (lambdas); I still need to look at
Nick's patch to figure out what exactly is going on there.

Now I might be able to live without lazy compilation (even though it
noticably slows down some code), but it goes without saying that as a
functional language Pure definitely needs TCO. So I can only hope that
this will be fixed before the LLVM 2.7 release.

Ok, so what next? Should I submit a bug report? Reopen PR5162?

Albert

Albert Graef wrote:

Jeffrey Yasskin wrote:

Try batch compiling with the large code model. (llc -code-model=large)
If that also causes tail calls to break, then I did something wrong in
fixing far calls in the JIT.

Jeffrey, I took a closer look at this now, and all the TCO-related
weirdness I see in the Pure interpreter is indeed related to your commit
in r88984 ("Make X86-64 in the Large model always emit 64-bit calls").
Up to and including r88983, Pure passes all checks (at least with eager
compilation, see below), with r88984 and later more than half of the
checks fail. This only happens when using dynamic compilation. As I
reported earlier, batch compilation works fine, even if the large code
model is used. OTOH, dynamic compilation is broken no matter which code
model I choose when creating the JIT.

So it seems that r88984 does break fastcc and/or tail calls in the JIT.
Maybe you don't see this in UnladenSwallow because it doesn't do tail calls?

There's also some minor breakage which isn't TCO-related (four failed
checks in the Pure interpreter) when reenabling lazy compilation with
DisableLazyCompilation(false). These seem to go all the way back to your
commit of Nick's patch (r84032 = "Keep track of stubs that are created"
fails exactly the same checks, while r84031 is fine). Those four Pure
checks all involve anonymous closures (lambdas); I still need to look at
Nick's patch to figure out what exactly is going on there.

Now I might be able to live without lazy compilation (even though it
noticably slows down some code), but it goes without saying that as a
functional language Pure definitely needs TCO. So I can only hope that
this will be fixed before the LLVM 2.7 release.

Ok, so what next? Should I submit a bug report? Reopen PR5162?

Can you prepare a standalone testcase that demonstrates the problem? See unittests/ExecutionEngine/JIT/*.cpp to get your started.

Nick

Nick Lewycky wrote:

Can you prepare a standalone testcase that demonstrates the problem? See
unittests/ExecutionEngine/JIT/*.cpp to get your started.

Probably. I'd essentially have to replicate some minimal runtime
environment and IR as created by the Pure interpreter. That's doable,
but very tedious and will probably take a while.

The problems are easy to reproduce by grabbing the Pure svn source from
http://pure-lang.googlecode.com/ and running './configure && make &&
make check', though. Is that good enough?

Albert

Nick Lewycky wrote:

Can you prepare a standalone testcase that demonstrates the problem? See
unittests/ExecutionEngine/JIT/*.cpp to get your started.

Probably. I'd essentially have to replicate some minimal runtime
environment and IR as created by the Pure interpreter. That's doable,
but very tedious and will probably take a while.

It seems unlikely you'll actually need the runtime for a minimal
reduction. There's probably a particular class of tail call that's
getting messed up, or a particular arrangement of code allocations,
and we'll need to find and check in that small test case to make sure
this doesn't break again.

The problems are easy to reproduce by grabbing the Pure svn source from
http://pure-lang.googlecode.com/ and running './configure && make &&
make check', though. Is that good enough?

Not really. I need to find the IR it's compiling into bad code, and
`make check` doesn't give me the command lines so I can gdb them. It's
also going to be tricky for me, who's not familiar with your
implementation, to find which function is crashing and where it's
compiled. Either a patch to JITTest.cpp or an IR file+an lli command
line would be really helpful.

Jeffrey Yasskin wrote:

Not really. I need to find the IR it's compiling into bad code, and
`make check` doesn't give me the command lines so I can gdb them. It's
also going to be tricky for me, who's not familiar with your
implementation, to find which function is crashing and where it's
compiled. Either a patch to JITTest.cpp or an IR file+an lli command
line would be really helpful.

Ok, I'll give it a go as soon as I find some time.

At least part of this is http://llvm.org/PR5729 (thanks to Jon Harrop
for making me look at his test case.)
Your time is probably best spent reducing the lazy compilation
breakage, until we've fixed this tail-call problem.