ARM64, dropping ADRP instructions, and ld.lld

Hello,

I am working in an embedded environment with somewhat restrictive memory requirements where the page alignment requirements of an ADRP instruction cannot be guaranteed.

With the ld that ships with Xcode, there is a -preload flag that causes ADRP instructions to be dropped and generates code that is 100% position-independent.

As near as I can determine, ld.lld does not have this same feature. I am wondering if I am missing something, if such a feature is being planned, or if there is an alternative I have not considered yet.

Regards,
Eric

Are you sure about that?

In the documentation for the ADRL pseudo it says:

“ADRL assembles to two instructions, an ADRP followed by ADD.”

“ADRL produces position-independent code, because the address is calculated relative to PC.”

From this, I’d expect ADRP to simply do Xd ← PC + n*4096, where n is a 20-bit number, just like AUIPC in RISC-V (also a 20-bit literal multiplied by 4096) or AUIPC in MIPS (16 bits multiplied by 65536 there).

In all cases, if you then do an add immediate with a 12-bit signed literal (16-bit for MIPS) then you’ve got a relative offset from the current PC, accurate to the byte, anywhere in a +/- 2 GB range.

The actual alignment of the PC is irrelevant. It’s not like the ADRP or AUIPC sets the low 12 bits to zero or something. It leaves them alone (as it finds them in the PC).

Hi Eric,

> I am working in an embedded environment with somewhat restrictive memory
> requirements where the page alignment requirements of an ADRP instruction
> cannot be guaranteed.

It sounds like you're relying on the linker optimization hints that
Clang emits. As you've seen they're designed to allow the linker to
convert adrp/add pairs into simpler nop/ldr sequences. If it works for
your purposes, great; but bear in mind it was designed as a
microarchitectural optimization so it's not guaranteed to trigger or
be able to remove all adrps if it does.

> As near as I can determine, ld.lld does not have this same feature. I am
> wondering if I am missing something, if such a feature is being planned,

MachO support in lld is pretty immature compared to ELF and it
certainly doesn't look like it's supported yet. I'm afraid I'm not
sure about the longer-term plans.

> or if there is an alternative I have not considered yet.

Ideally this would probably be handled by implementing proper
-mcmodel=tiny support in LLVM so that only ADR instructions are
emitted in the first place (instead of leaving you with a bunch of
NOPs). In ELF-land that probably wouldn't be too hard (there are
already relocations for it in the spec), but MachO is chronically
starved of free relocations so that might get very nasty very quickly.

Cheers.

Tim.

Afraid not. It really is (PC & ~0xfff) + n * 0x1000. So it does
require 12-bit alignment of any code section.
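
To make the difference concrete, here is a minimal Python sketch of the two address calculations. The function names are made up for illustration, and 64-bit wraparound and immediate range checks are deliberately ignored.

```python
def adrp(pc, n):
    """ADRP as described above: clear the low 12 bits of PC, then add n pages."""
    return (pc & ~0xFFF) + n * 0x1000


def pc_plus_pages(pc, n):
    """The AUIPC-style alternative: add n * 4096 to the unmodified PC."""
    return pc + n * 0x1000


for pc in (0x1000, 0x1FFC):
    print(hex(pc), hex(adrp(pc, 1)), hex(pc_plus_pages(pc, 1)))
# 0x1000 0x2000 0x2000
# 0x1ffc 0x2000 0x2ffc
```

The two results differ by exactly pc & 0xfff, so they only agree when the instruction itself sits on a 4 KiB boundary.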

Now that you mention the MIPS & RISC-V alternatives, I'm not sure why
ARM actually made that choice. It obviously saves you a handful of
transistors but I can't quite believe that's all there is to it.

Cheers.

Tim.

I don’t care about Mach-O at all. ELF is sufficient.

I can certainly submit a bug report at https://bugs.llvm.org requesting a -mcmodel=tiny feature which would cause ADR instructions to be emitted in the first place.

This is of great enough interest to me that I may be able to contribute some time to make it happen. However, since I don’t generally work this deep in the toolchain, it would likely need to be along the lines of assisting someone who knows what they are doing and can provide guidance, rather than me taking on the whole task myself.

As an educated opinion, how difficult might something like this be? minutes? hours? days? weeks? months?

Thank you for providing the explanation for how ADRP works…something I should have done myself.

With this explanation in hand, one other alternative I was looking at was using a linker script to essentially rebase the code so that the ADRP instructions address the correct location. I am not a linker script expert, so I am not sure whether such a thing is even possible or would make much sense, but it may provide a legitimate shortcut to a solution that doesn’t involve adding a feature to the toolchain.

>> "ADRL produces position-independent code, because the address is
>> calculated relative to PC."
>>
>> From this, I'd expect ADRP to simply do Xd <- PC + n*4096, where n is a
>> 20-bit number, just like AUIPC in RISC-V (also a 20-bit literal multiplied
>> by 4096) or AUIPC in MIPS (16 bits multiplied by 65536 there).

> Afraid not. It really is (PC & ~0xfff) + n * 0x1000. So it does
> require 12-bit alignment of any code section.

Wow! My mistake. Knock me down with a leaf.

> Now that you mention the MIPS & RISC-V alternatives, I'm not sure why
> ARM actually made that choice. It obviously saves you a handful of
> transistors but I can't quite believe that's all there is to it.

I'm not quite sure how passing 12 bits through an ALU unchanged uses more
transistors than inserting muxes to pass them through for some instructions
and replace them with zeros for other instructions :)

I find AArch64 inexplicable. There are some truly brilliant touches such as
the bit patterns in immediate operands for logical instructions, or the
pass-through/invert/negate/increment in the conditional select instruction,
or the bitfield move that can extract/insert/sign extend/truncate. But
there are some things that make me think the designers operated in a
complete vacuum, not aware of the brilliant bits in ARM32 or other prior
art. This is one of them. The abandonment of mixed 16/32 bit opcodes that
took Thumb2 to such dominance is another. MIPS have copied that several
times with the recently announced nanoMIPS looking pretty good (and with
16, 32 & 48 bit opcodes designed in). RISC-V was of course designed for
optional variable length 16&32 bit (and longer in future) opcodes from
almost the beginning. All of these give x86_64-beating code density without
the sequential decode nightmares.

Hello Eric,

My understanding is that the ADRP instruction isn't supposed to be
used on its own. The result of the ADRP provides a 4k aligned address,
the following instruction such as an LDR has an immediate offset that
can reach any address within the 4k page. For example to get the
address of a global variable var with -fpic in ELF:

    adrp x0, :got:var            // relocation R_AARCH64_ADR_GOT_PAGE var
    ldr x0, [x0, :got_lo12:var]  // relocation R_AARCH64_LD64_GOT_LO12_NC

The resulting code section is 4-byte aligned; I'm not sure where the
requirement for 4k-aligned sections comes from unless you are planning
to use ADRP alone? Do you need just one instruction for the purposes
of reducing code size? Another possibility if you don't care about
code-size but mustn't use ADRP is (range permitting) to have the
linker turn an ADRP to ADR and replace the following instruction with
a NOP. I think that is something you'd need to maintain downstream
though.
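
The "range permitting" condition can be sketched as follows. ADR encodes a signed 21-bit byte offset, so it reaches roughly 1 MiB either side of the instruction. The helper names are hypothetical and this is only an illustration of the arithmetic, not how any particular linker implements the rewrite.

```python
ADR_LIMIT = 1 << 20  # ADR holds a signed 21-bit byte offset: -1 MiB .. +1 MiB - 1


def adr_offset(pc, target):
    """Byte offset a single ADR at `pc` would encode, or None if out of range."""
    offset = target - pc
    return offset if -ADR_LIMIT <= offset < ADR_LIMIT else None


def run_adr(pc, offset):
    """ADR adds the offset to the unmodified PC, so no alignment is assumed."""
    return pc + offset


offset = adr_offset(0x1000, 0x1008)   # instruction linked at 0x1000
print(hex(run_adr(0x1FFC, offset)))   # image loaded 0xffc higher: 0x2004
print(adr_offset(0x0, 0x200000))      # target 2 MiB away: None
```

Because the offset is applied to the PC as-is, the result moves with the code regardless of where the image is loaded, as long as code and target move together.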

If you can use gcc then that supports -mcmodel=tiny. How long it would
take to implement it in LLVM would depend on how familiar you are
with LLVM and how much you know of the specification of -mcmodel=tiny;
on the assumption you aren't that familiar I'd guess at an order of
weeks.

Peter

> I don't care about Mach-O at all. ELF is sufficient.

Ah, there's definitely no linker-optimization hints for ELF. The
compiler doesn't even emit the data that the linker would need.

> As an educated opinion, how difficult might something like this be? minutes? hours? days? weeks? months?

Probably a few hours on the compiler side for me (~1 plumbing "tiny"
through as a valid option, ~1-2 implementing it in AArch64, + time
compiling etc). It's actually a pretty simple change to make as these
things go; thread-local storage is likely to be the trickiest bit.

That's assuming the linker can cope with the new relocations, which
looks plausible from a quick grep but not a foregone conclusion.

> With this explanation in hand, one other alternative I was looking at was
> using a linkerscript to essentially rebase the code and have ADRP
> instructions that would address the correct location as a result.

You mean provide the explicit (misaligned) address you intend to load
the binary at and get the linker to fix things up? Theoretically it
would have sufficient information, but I don't know how you'd convince
it not to align pages.

I think it's the segments that need to be 4K aligned (i.e. after
linking). Normally this isn't really an extra constraint because
you're just going to map them in with the MMU anyway, but in strange
embedded situations I could see it being a problem.

Consider the fully linked sequence:

    adrp x0, #0
    add x0, x0, #8

Starting at 0x1000 this would result in x0 == 0x1008 == pc, at 0x1ffc
it would result in x0 == 0x1008 != pc. Not good for
position-independence (or static positioning, but for different
reasons not illustrated by that example).
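
The same arithmetic can be sketched in Python: pick the immediates for an adrp/add pair at its link-time address, then run the pair at a shifted load address and check whether the result moved along with the code. All function names here are hypothetical, and the immediates are computed in the simplest possible way rather than by modelling a real linker.

```python
PAGE = 0x1000


def page(addr):
    """Address with the low 12 bits cleared."""
    return addr & ~(PAGE - 1)


def link_adrp_add(place, symbol):
    """Immediates for an adrp/add pair at `place` that targets `symbol`."""
    page_delta = (page(symbol) - page(place)) // PAGE
    lo12 = symbol & (PAGE - 1)
    return page_delta, lo12


def run_adrp_add(pc, page_delta, lo12):
    """Value left in the register when the pair executes with adrp at `pc`."""
    return page(pc) + page_delta * PAGE + lo12


def survives_move(place, symbol, shift):
    """True if the pair still finds `symbol` after everything moves by `shift`."""
    imm = link_adrp_add(place, symbol)
    return run_adrp_add(place + shift, *imm) == symbol + shift


print(survives_move(0x1000, 0x1008, 0x1000))  # whole-page shift: True
print(survives_move(0x1000, 0x1008, 0xFFC))   # code now at 0x1ffc: False
```

Only shifts that are a multiple of 4 KiB keep the pair correct, which is the alignment constraint on the loaded image.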

Cheers.

Tim.

Ok, thanks for the clarification, that makes sense.

Peter

Hello Eric,

If you do decide to investigate the linker script route, the ALIGN
builtin function might be useful. I think the simplest way is to do
something like:

    .text ALIGN(0x1000) : { *(.text) }
    .my_next_section ALIGN(0x1000) : { *(my_next_section) }

Both .text and .my_next_section would start at 4k boundaries.
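
For reference, ALIGN rounds the location counter up to the next multiple of its argument. A small Python sketch of that rounding follows; align_up is a made-up name for illustration and is not part of ld.

```python
def align_up(addr, boundary):
    """Round addr up to the next multiple of boundary (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a power of two")
    return (addr + boundary - 1) & ~(boundary - 1)


print(hex(align_up(0x1234, 0x1000)))  # 0x2000
print(hex(align_up(0x2000, 0x1000)))  # 0x2000 (already aligned)
```

Note that this only fixes the addresses chosen at link time; the image still has to be loaded at those addresses, or at a whole number of 4k pages away from them, for ADRP to compute the right result.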

Link to docs: https://sourceware.org/binutils/docs/ld/Builtin-Functions.html#Builtin-Functions

Peter

I went ahead and submitted a bug report, referencing this discussion.

For anyone who is interested and would like to comment to add useful clarifications, etc., the link to the report is:

https://bugs.llvm.org/show_bug.cgi?id=37543