[llvm-rtdyld] AArch64 ABI Relocation Restrictions

Our runtime dynamic loader llvm-rtdyld is not conforming to the AArch64 ABI, specifically related to code and data section allocations and relocation restrictions. The symptom of this is out-of-range relocation errors during load time. This has been observed before, see e.g. here.

My understanding of the problem is as follows. For each section in an input object, the TrivialMemoryManager mmaps a separate block of memory, so we are at the mercy of the OS memory allocator as to where exactly the requested memory ends up. I see cases where data and text sections from the same object are allocated very far apart, “by chance”, triggering relocation out-of-range errors.

llvm-rtdyld is using the MCJIT infrastructure. ORC’s JITLink may have solved some of these problems, but we are not ready yet to move to JITLink for different reasons, and JITLink may not be entirely ready. So we are looking for a fix and workaround in the TrivialMemoryManager and MCJIT, also because that is still the default for llvm-rtdyld.

Our proposal is to change the code/data section allocation strategy in the TrivialMemoryManager. Instead of allocating sections on the fly, we would like to calculate the total section sizes and allocate the memory in one go. This avoids fragmentation and works around the issue of sections being allocated too far apart. It would not remove every possibility of violating relocation restrictions, but it would be a good improvement.

There seems to be some infrastructure to do this. My proposal is that we make needsToReserveAllocationSpace return true for the TrivialMemoryManager as well, and then implement reserveAllocationSpace to get the allocation strategy described above. I can probably take inspiration from one of the other memory managers to implement this. But I welcome any thoughts or feedback on this approach (@lhames, @gmarkall).
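To make that a bit more concrete, here is a rough sketch of the kind of memory manager I have in mind. This is not the actual TrivialMemoryManager change: the class name and the carve() helper are made up, there is no error handling or bounds checking, page permissions are not applied, and the exact reserveAllocationSpace signature differs between releases (uint32_t alignments in LLVM 14, llvm::Align in newer trees).

#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
#include "llvm/Support/Alignment.h"
#include "llvm/Support/MathExtras.h"
#include "llvm/Support/Memory.h"

// Sketch only: reserve one contiguous slab up front so that the code, rodata
// and rwdata sections handed out later cannot end up gigabytes apart.
class PreallocatingMemoryManager : public llvm::RTDyldMemoryManager {
public:
  bool needsToReserveAllocationSpace() override { return true; }

  void reserveAllocationSpace(uintptr_t CodeSize, llvm::Align CodeAlign,
                              uintptr_t RODataSize, llvm::Align RODataAlign,
                              uintptr_t RWDataSize,
                              llvm::Align RWDataAlign) override {
    // RuntimeDyld tells us the totals it will ask for; grab them in one go.
    uintptr_t Total = llvm::alignTo(CodeSize, CodeAlign) +
                      llvm::alignTo(RODataSize, RODataAlign) +
                      llvm::alignTo(RWDataSize, RWDataAlign);
    std::error_code EC;
    Slab = llvm::sys::Memory::allocateMappedMemory(
        Total, /*NearBlock=*/nullptr,
        llvm::sys::Memory::MF_READ | llvm::sys::Memory::MF_WRITE, EC);
    Next = static_cast<uint8_t *>(Slab.base());
  }

  uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
                               unsigned SectionID,
                               llvm::StringRef SectionName) override {
    return carve(Size, Alignment);
  }

  uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
                               unsigned SectionID, llvm::StringRef SectionName,
                               bool IsReadOnly) override {
    return carve(Size, Alignment);
  }

  bool finalizeMemory(std::string *ErrMsg) override {
    // A real implementation would apply the proper page permissions here.
    return false; // false = no error
  }

private:
  // Hand out pieces of the reserved slab; everything stays close together.
  uint8_t *carve(uintptr_t Size, unsigned Alignment) {
    if (Alignment == 0)
      Alignment = 16; // RuntimeDyld may pass 0; pick a safe default.
    Next = reinterpret_cast<uint8_t *>(
        llvm::alignTo(reinterpret_cast<uintptr_t>(Next), Alignment));
    uint8_t *Result = Next;
    Next += Size;
    return Result;
  }

  llvm::sys::MemoryBlock Slab;
  uint8_t *Next = nullptr;
};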


For a little extra context: llvm-rtdyld provides a simple way to reproduce an issue observed in Numba (and in Julia too, as per the link in the previous post).

Although llvm-rtdyld uses the TrivialMemoryManager while Numba’s MCJIT setup uses the default SectionMemoryManager, the issue is the same: neither memory manager reserves allocation space up front. So if a solution in llvm-rtdyld can be tested with the TrivialMemoryManager, then we should be able to resolve the issue in Numba / llvmlite by implementing a memory manager based on the SectionMemoryManager that uses the same strategy, and configuring our MCJIT instance to use it.

llvm-rtdyld is not conforming to the AArch64 ABI, specifically related to code and data section allocations and relocation restrictions

This is the restriction that text and GOT segments must be within 4GB of each other (“max distance from text to GOT <4GB” in the Code Models table in https://github.com/ARM-software/abi-aa/blob/main/sysvabi64/sysvabi64.rst#7code-models)
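Concretely, the limit comes from the range of an ADRP instruction: it encodes a signed 21-bit page offset, i.e. +/- 2^20 pages of 4KB, which is +/- 4GB. The function below is a paraphrase (not a verbatim copy) of the range check RuntimeDyldELF::resolveAArch64Relocation applies to ADRP-based page relocations such as R_AARCH64_ADR_PREL_PG_HI21; it is this check that produces the "overflow check failed for relocation" assertion seen later in this thread.

#include "llvm/Support/MathExtras.h" // for llvm::isInt
#include <cstdint>

// S = symbol address, A = addend, P = place (address of the ADRP instruction).
// The byte distance between the 4KB page of S+A and the 4KB page of P must
// fit in 33 signed bits, i.e. be within +/- 4GB.
bool adrpTargetInRange(uint64_t P, uint64_t S, int64_t A) {
  uint64_t TargetPage = (S + A) & ~uint64_t(0xfff); // Page(S + A)
  uint64_t PlacePage = P & ~uint64_t(0xfff);        // Page(P)
  return llvm::isInt<33>(int64_t(TargetPage - PlacePage));
}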

I’ve implemented the suggested approach - pre-allocate everything via reserveAllocationSpace - by copying and tweaking SectionMemoryManager (here) and it seems to work. I have a little cleanup I still want to do, then I could look at contributing it to LLVM.

It can also be picked up independently since you can supply your own memory manager and it just needs to extend RTDyldMemoryManager. So no need to wait for an update to LLVM to try it out.
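For reference, wiring a custom memory manager into an MCJIT instance looks roughly like the sketch below. SectionMemoryManager is used as a stand-in; you would substitute your own RTDyldMemoryManager subclass (e.g. the tweaked copy of SectionMemoryManager mentioned above). It assumes the native target has already been initialized.

#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/MCJIT.h" // forces the MCJIT engine to be linked in
#include "llvm/ExecutionEngine/SectionMemoryManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>
#include <string>

// Build an MCJIT ExecutionEngine that uses a caller-chosen memory manager
// instead of the default. Assumes InitializeNativeTarget() and
// InitializeNativeTargetAsmPrinter() have already been called.
llvm::ExecutionEngine *createJITWithCustomMemMgr(std::unique_ptr<llvm::Module> M) {
  std::string Err;
  llvm::ExecutionEngine *EE =
      llvm::EngineBuilder(std::move(M))
          .setErrorStr(&Err)
          .setEngineKind(llvm::EngineKind::JIT)
          // Substitute your own RTDyldMemoryManager subclass here, e.g. a copy
          // of SectionMemoryManager that implements reserveAllocationSpace.
          .setMCJITMemoryManager(std::make_unique<llvm::SectionMemoryManager>())
          .create();
  if (!EE)
    llvm::errs() << "Failed to create ExecutionEngine: " << Err << "\n";
  return EE;
}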

Hi @MikaelSmith, thanks for your message and confirmation!

I would be happy to review your changes, and we can probably test them and provide some feedback.

PR up at llvm/llvm-project#71968 (Implement reserveAllocationSpace for SectionMemoryManager). Still needs tests, but this is pretty much what I’ve tested in Impala.

Many thanks for implementing this and the PR!

Does Impala ever have GOT references between linked objects? I ask because although it sounds like it works for the Impala use case, I tried porting your implementation over to llvmlite (in https://github.com/gmarkall/llvmlite/blob/aarch64memorymanager/ffi/memorymanager.cpp, on the aarch64memorymanager branch) before the PR was made and I’m still hitting a similar issue with my reproducer.

I think the issue is that reserving allocation space for an object rounds the reservation up to a multiple of the page size, which leaves some spare space once that object has been loaded. When another object is subsequently loaded, a new reservation takes place. Its code segment is too large for the leftovers from the previous reservations, so it is allocated in the latest reservation. However, the GOT is very small, especially if there are few relocations, so it fits into free space in an old reservation, which can be more than 4GB away.

For example, with the llvmlite changes above and the reproducer in https://github.com/gmarkall/numba-issue-9001 (for reproducing / debugging numba/numba#9001), the output shows:

$  python llonly.py
0
Reserving 12288 bytes
Code mem starts at 0xfffff7fb3000, size 1000
Reserving 3000 bytes
Code mem starts at 0xfffff7fb0000, size 1000
Rodata mem starts at 0xfffff7fb1000, size 1000
Rwdata mem starts at 0xfffff7fb2000, size 1000
Allocating 49c bytes for CodeMem at fffff7fb3000
Allocating 140 bytes for RODataMem at fffff7fb1000
Allocating 60 bytes for RODataMem at fffff7fb1130
Allocating 10 bytes for RWDataMem at fffff7fb2000
Allocating 28 bytes for RWDataMem at fffff7fb2008
Finalizing memory
Finalizing memory
Finalizing memory
1
Reserving 3000 bytes
Code mem starts at 0xfffff7fad000, size 1000
Rodata mem starts at 0xfffff7fae000, size 1000
Rwdata mem starts at 0xfffff7faf000, size 1000
Allocating 49c bytes for CodeMem at fffff7fb0000
Allocating 140 bytes for RODataMem at fffff7fae000
Allocating 60 bytes for RODataMem at fffff7fae130
Allocating 28 bytes for RWDataMem at fffff7fb2028
Finalizing memory
Finalizing memory
Finalizing memory
2
Reserving 3000 bytes
Code mem starts at 0xfffff75a7000, size 1000
Rodata mem starts at 0xfffff75a8000, size 1000
Rwdata mem starts at 0xfffff75a9000, size 1000
Allocating 49c bytes for CodeMem at fffff7fad000
Allocating 140 bytes for RODataMem at fffff75a8000
Allocating 60 bytes for RODataMem at fffff75a8130
Allocating 28 bytes for RWDataMem at fffff7fb2048
Finalizing memory
Finalizing memory
Finalizing memory
3
Reserving 3000 bytes
Code mem starts at 0xfffff75a4000, size 1000
Rodata mem starts at 0xfffff75a5000, size 1000
Rwdata mem starts at 0xfffff75a6000, size 1000
Allocating 49c bytes for CodeMem at fffff75a7000
Allocating 140 bytes for RODataMem at fffff75a5000
Allocating 60 bytes for RODataMem at fffff75a5130
Allocating 28 bytes for RWDataMem at fffff7fb2068
Finalizing memory
Finalizing memory
Finalizing memory
4
Reserving 3000 bytes
Code mem starts at 0xfffff75a1000, size 1000
Rodata mem starts at 0xfffff75a2000, size 1000
Rwdata mem starts at 0xfffff75a3000, size 1000
Allocating 49c bytes for CodeMem at fffff75a4000
Allocating 140 bytes for RODataMem at fffff75a2000
Allocating 60 bytes for RODataMem at fffff75a2130
Allocating 28 bytes for RWDataMem at fffff7fb2088
Finalizing memory
Finalizing memory
Finalizing memory
5
Reserving 3000 bytes
Code mem starts at 0xfffeeebdd000, size 1000
Rodata mem starts at 0xfffeeebde000, size 1000
Rwdata mem starts at 0xfffeeebdf000, size 1000
Allocating 49c bytes for CodeMem at fffff75a1000
Allocating 140 bytes for RODataMem at fffeeebde000
Allocating 60 bytes for RODataMem at fffeeebde130
Allocating 28 bytes for RWDataMem at fffff7fb20a8
Finalizing memory
Finalizing memory
Finalizing memory
6
Reserving 3000 bytes
Code mem starts at 0xfffeeebda000, size 1000
Rodata mem starts at 0xfffeeebdb000, size 1000
Rwdata mem starts at 0xfffeeebdc000, size 1000
Allocating 49c bytes for CodeMem at fffeeebdd000
Allocating 140 bytes for RODataMem at fffeeebdb000
Allocating 60 bytes for RODataMem at fffeeebdb130
Allocating 28 bytes for RWDataMem at fffff7fb20c8
python: /home/gmarkall/numbadev/llvm-project-14/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:507: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const llvm::SectionEntry&, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.

Note that each time an object is loaded, its “index” is printed out, 0-6 in the run above. For 6, we see that right before the assertion, 28 bytes for RWDataMem get allocated at 0xfffff7fb20c8:

Allocating 28 bytes for RWDataMem at fffff7fb20c8

which came from the reservation from object 0:

Rwdata mem starts at 0xfffff7fb2000, size 1000
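In the output above, the code for object 6 starts at 0xfffeeebdd000 while this RWDataMem allocation is at 0xfffff7fb20c8 - a distance of about 0x1093D50C8 bytes, i.e. just over 4GB, which is presumably what trips the isInt<33> overflow check in the assertion.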

I think one possible solution could be for finalizeMemory() to invalidate the existing reservation - perhaps by “allocating” everything left over, or otherwise removing everything from the free lists of the memory groups. That way, future allocations would always have to come from the most recent reservation, and would therefore always be close to each other.

What do you think of this analysis / suggestion?

That looks like a reasonable explanation. Fixing this in finalize seems like it should work; I’ll update my PR when I get a chance.

Impala didn’t run into this because it uses a new memory manager for each Module it generates.

Freeing any prior memory during reserve also makes sense to me, as that’s the point when you know a fresh block is needed.

I just got a chance to check this - adding:

  CodeMem.FreeMem.clear();
  RODataMem.FreeMem.clear();
  RWDataMem.FreeMem.clear();

to the LlvmliteMemoryManager::finalizeMemory() function seemed to resolve the issue with my reproducer linked above. I guess this is not the proper way to do things, but it at least shows that preventing earlier reservations from being used is a step in the right direction.

This seems like an improvement.
One strategy would be to review and commit this first, then follow up and build on top of it.

Gently pinging @lhames. I am happy reviewing this once tests have been added, but an opinion from the component owner would be good too.

A bit late to the party here, sorry – I was out on vacation for a few weeks.

ORC’s JITLink may have solved some of these problems, but we are not ready yet to move to JITLink for different reasons, and JITLink may not be entirely ready.

@sjoerdmeijer @MikaelSmith: JITLink is reasonably mature at this point: x86-64, aarch64, riscv, loongarch, and PPC64 are all well supported, and i386 and aarch32 are under development. The design of JITLink’s memory management interface also addresses these out-of-range errors, so adopting JITLink is an easy fix for this issue. I plan to start a discussion on MCJIT deprecation in a new discourse thread this week. We can still land fixes to MCJIT in the mean time, but you’ll want to plan for its eventual removal.

I think perhaps one solution to this could be that finalizeMemory() could invalidate the existing reservation - maybe by “allocating” everything left over, or otherwise somehow removing everything from the free list for the memory groups.
…
Freeing any prior memory during reserve also makes sense to me, as that’s the point when you know a fresh block is needed.

I’m happy to try this out. There’s an outside chance it’ll trigger other pathologies – if that happens we’ll need to stick with the current behavior: the goal for MCJIT at this point is stability rather than improvement.

Of the two approaches, freeing during reserve seems less intrusive – we should try that first.

Many thanks for the input, @lhames!

I think @sjoerdmeijer’s comment referred to the state of JITLink in LLVM 14, which Numba presently still depends on. I think all Numba maintainers are keen on moving to OrcJIT and JITLink once we get on to newer LLVM versions (I certainly am!).

I’m sure other Numba maintainers can also chime in on the new thread when you start it, but I think we’re happy with planning for its eventual removal.

Many thanks for being accommodating here! :slight_smile:

Many thanks - could you expand a bit on what “pathologies” means? Might these be potential performance cliffs / degradation of memory efficiency, or an expectation of outright crashes and incorrect behaviour?

The reason I ask: even if the fix is not ultimately suitable for inclusion upstream in MCJIT, I think Numba / llvmlite will need to use a memory manager adopting this strategy, because we don’t have an alternative way of avoiding crashes on AArch64.

Hi @lhames, thanks for looking at this and sharing your ideas.

I have just uploaded a work-in-progress patch here to implement the same idea in llvm-rtdyld, where the same problem can be triggered with our reproducers. The idea of the patch is to calculate the sizes of all objects and their text/data sections, and then allocate all of the required memory up front. We thought it would be good to fix llvm-rtdyld too, to keep things consistent.
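To sketch the size pre-computation part of that idea (a hypothetical helper, not the patch itself): walk all input objects before loading anything and total up their text and data sizes. For brevity, every section is rounded up to a page here and read-only data is lumped in with read-write data; the real patch would use the sections' own alignments and split RO and RW data.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/Object/ObjectFile.h"
#include "llvm/Support/MathExtras.h"
#include <cstdint>

// Hypothetical helper: total up the code and data sizes of all input objects
// before any of them is loaded, so that a single up-front reservation can
// cover everything. Sections are conservatively rounded up to a page.
struct ReservationSizes {
  uint64_t Code = 0;
  uint64_t Data = 0;
};

static ReservationSizes
computeTotalSizes(llvm::ArrayRef<const llvm::object::ObjectFile *> Objs,
                  uint64_t PageSize = 4096) {
  ReservationSizes Sizes;
  for (const llvm::object::ObjectFile *Obj : Objs) {
    for (const llvm::object::SectionRef &Sec : Obj->sections()) {
      uint64_t Size = llvm::alignTo(Sec.getSize(), PageSize);
      if (Sec.isText())
        Sizes.Code += Size;
      else if (Sec.isData() || Sec.isBSS())
        Sizes.Data += Size;
    }
  }
  return Sizes;
}

Those totals would then be handed to the memory manager’s reserveAllocationSpace before the first object is loaded.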

It’s a work in progress because I have not added tests yet, which is what I will do next. But I would appreciate any opinions on the approach.

I think @sjoerdmeijer’s comment referred to the state of JITLink in LLVM 14, which Numba presently still depends on. I think all Numba maintainers are keen on moving to OrcJIT and JITLink once we get on to newer LLVM versions (I certainly am!).

That’s great news. Out of interest, do you have a timeline in mind? Is this something that people are actively working towards, or just something that you know you want to get to in the future?

Many thanks for being accommodating here! :slight_smile:

No worries. If you’re on LLVM 14 though, is there a need to make these changes in top-of-tree? Or would landing them only in your own branch be an option?

Many thanks - could you expand a bit on what “pathologies” means? Might these be potential performance cliffs / degradation of memory efficiency, or an expectation of outright crashes and incorrect behaviour?

I think both outcomes are possible, but with low probability. Switching to a fresh block during finalize (or even reservation) could increase memory usage for some clients. On the other hand, if the program being JIT’d contains any extern hidden references, then this change could cause a crash in cases where the current scheme happened, thanks to block re-use, to place a hidden symbol definition close to a user of that definition.

If you’re only applying these changes to Numba’s memory manager then the risk seems even lower again. I’m only concerned about fallout if we land them in LLVM – we want to avoid any churn in the MCJIT APIs while people are moving over to ORC.

There has been some active work towards it: I wrote some initial OrcJIT support in llvmlite, which was then carried on a bit by Andre Masella (who is no longer working on Numba / llvmlite). After that we tried to use it in Numba but ran into some lifetime-management issues that we didn’t have time to surmount, and the work is presently paused.

We’d like to get there, but we also have to deal with moving to opaque pointers and possibly some old passmanager stuff too, alongside keeping up with changes in CPython bytecode and new NumPy versions, so it’s hard to give a specific timeline - we’d like as soon as possible, but it’s not going to be that soon.

I think it would be good to have them in the top of the tree because there seem to be other users who hit it (Impala, Julia, I came across a couple of other people describing the issue elsewhere on the web who didn’t really get too far with it), but landing them in llvmlite is definitely an option - I presently have numba/llvmlite#1009 (Fix relocation overflows by implementing preallocation in the memory manager) up for review in llvmlite.

Thanks for the additional insight - I’d guess we don’t have anything like extern hidden in Numba. I can imagine it might increase memory usage a bit, but it seems that the preallocations are getting rounded up to the page size (4K on Linux and 16K on macOS), so in the context of Numba I think the additional overhead will be small.

Understood - I think in Mikael’s PR to LLVM, the changes are opt-in only - so that would also seem to lower the risk for incorporating the fix upstream.

We’d like to get there, but we also have to deal with moving to opaque pointers and possibly some old passmanager stuff too, alongside keeping up with changes in CPython bytecode and new NumPy versions, so it’s hard to give a specific timeline - we’d like as soon as possible, but it’s not going to be that soon.

Ok. Thanks very much for the extra context!

I think it would be good to have them in the top of the tree because there seem to be other users who hit it (Impala, Julia, I came across a couple of other people describing the issue elsewhere on the web who didn’t really get too far with it),

I believe Julia has moved to ORC now, but I suspect you’re not the only ones to hit this.

Since Mikael’s PR is opt-in I don’t think there’s any problem with this landing in tree. I’ll head over to GitHub and review. :slight_smile:
