Sporadic "RealOffset <= INT32_MAX && RealOffset >= INT32_MIN" failures with MCJIT on Windows

Hi,

We are seeing sporadic crashes since we migrated to MCJIT on Win64. The same tests pass without issues on Mac64 and Linux64. The issue is this assertion failure in RuntimeDyldELF.cpp:

RealOffset <= INT32_MAX && RealOffset >= INT32_MIN

I haven’t managed to catch the failure in Visual Studio to try and debug it. Any tips on how to make progress?

Oh, and we’re on LLVM 3.5.

Thanks.

Ram

That sounds like a PC-relative relocation failure. Usually this happens when the relocation target is more than 2 GB away from the source. Try using the large code model or tweaking the memory manager.
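To make the 2 GB limit concrete, here is a minimal sketch of the arithmetic behind that assertion for a 32-bit PC-relative relocation such as R_X86_64_PC32. The function name and parameters are hypothetical and this is not LLVM's actual code; it only assumes the usual "target + addend - place" formula and checks that the result fits in a signed 32-bit field.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of LLVM): returns true when a 32-bit
// PC-relative relocation can encode the distance to its target.
//   target: address the relocation refers to
//   addend: constant taken from the relocation entry
//   place:  address of the 4-byte field being patched
bool fitsInPCRel32(std::uint64_t target, std::int64_t addend,
                   std::uint64_t place) {
    // Unsigned arithmetic wraps modulo 2^64; reading the result back as a
    // signed value yields the signed distance on two's-complement targets.
    std::int64_t realOffset = static_cast<std::int64_t>(
        target + static_cast<std::uint64_t>(addend) - place);
    return realOffset <= std::numeric_limits<std::int32_t>::max() &&
           realOffset >= std::numeric_limits<std::int32_t>::min();
}
```

If the JITed code and the symbol it references end up in allocations that are far apart in the 64-bit address space, a check like this fails, which would match the sporadic nature of the crash.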

It turns out it’s surprisingly hard to portably allocate some memory and then allocate some more within a 2 GB offset of the first allocation in a 64-bit process. For various reasons that I don’t understand, reserving 2 GB of address space upfront and allocating from that is not workable for some MCJIT clients.
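For reference, the "reserve upfront and allocate from that" idea can be sketched roughly as below. This is only an illustration under stated assumptions: SectionArena is a hypothetical class, not an LLVM memory manager, and std::malloc stands in for whatever a real JIT would use to obtain executable pages (VirtualAlloc or mmap). As long as the capacity stays below 2 GB, any two sections handed out are within a signed 32-bit offset of each other.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical bump allocator: one upfront reservation, with every section
// carved out of it so that all sections stay close together.
class SectionArena {
public:
    explicit SectionArena(std::size_t capacity)
        : base_(static_cast<std::uint8_t*>(std::malloc(capacity))),
          capacity_(capacity), used_(0) {}
    ~SectionArena() { std::free(base_); }
    SectionArena(const SectionArena&) = delete;
    SectionArena& operator=(const SectionArena&) = delete;

    // Returns a block of `size` bytes aligned to `alignment` (which must be
    // a power of two), or nullptr when the reservation is exhausted.
    std::uint8_t* allocate(std::size_t size, std::size_t alignment) {
        if (!base_) return nullptr;
        std::uintptr_t baseAddr = reinterpret_cast<std::uintptr_t>(base_);
        std::uintptr_t start = baseAddr + used_;
        std::uintptr_t aligned =
            (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
        std::size_t offset = static_cast<std::size_t>(aligned - baseAddr);
        if (offset > capacity_ || size > capacity_ - offset) return nullptr;
        used_ = offset + size;
        return base_ + offset;
    }

private:
    std::uint8_t* base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

The obvious drawback is that the capacity has to be chosen before anything is compiled, which may be part of why this approach does not suit every client.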

So it appears that we get about half the crashes with the large code model. The rest are crashing in the same way. It could either mean that large code model still takes that crashing codepath and that the number of crashes only went down by chance, or that in one place in the flow, large code model is not matched to mean ELF::R_X86_64_PC64. I’m digging into this issue further, but any hints along the way would be appreciated.

Thanks.

Ram

This might be related to GOT relocations. I rewrote that part of RuntimeDyldELF because I was seeing this issue. Have you tried trunk?

I didn’t notice that you were running 3.5 the first time I read this. Keno’s diagnosis is very likely to be correct. You should try trunk if you’re able to.

-Lang.

This sounds pretty serious and it won’t be easy for us to upgrade - particularly not to trunk. Are there plans to take bug fixes like this into LLVM 3.5.x point releases? (Do I remember right that 3.5.x is supposed to have some kind of long term support? Where is that process documented?)

Thanks,

Dale

Hi Dale,

I don’t think that Keno’s rewrite is applicable for a bug fix release. We have, in the last year, moved to having some dot releases for our older releases, but these are definitely bug fix only and low risk as we don’t want to break anything new.

The release documentation is located here:

http://llvm.org/docs/HowToReleaseLLVM.html

for future reference. There’s no official long term support strategy past the information on that page. Previously we released every 6 months without dot releases at all, so this is a fairly new trial for us. Backporting of patches is at the discretion of the author, the code owner, and the release manager.

Keno: perfectly happy to entertain a backport of your patch if you want to do such a thing, but IIRC it was a bit more than a simple bug fix.

-eric

The commits in question are r234839 and you’ll probably also want r236341. I don’t think these are the kinds of commits that should generally be backported. Neither is really a small self-contained commit. If you’re willing, you can probably carry these patches yourself (we will be doing so on top of 3.6 until 3.7 is released), but do note that in my experience using MCJIT with the large code model does not quite work yet (it’s on my todo list to work out exactly why and fix it). Also, I believe the memory allocation scheme for MCJIT was rewritten slightly between 3.5 and trunk, so there may be additional problems I don’t know about.

That implies that 3.6 is not really usable on Windows, doesn’t it, since the legacy JIT was removed? (At least if you need the large code model.)

Correct, though this is certainly not the only issue preventing LLVM 3.6 from being usable on Windows. I think we got a few of them backported to 3.6.1, but there are a few more still remaining.

There was a crash on Windows when more than 4K is allocated on the stack at once, due to the chkstk call offset being too large. Maybe you are seeing that. It’s fixed in trunk, but not in any stable release.

Nick C.