[RFC][ARM] Add support for embedded position-independent code (ROPI/RWPI)

Hi,

We currently have a downstream patch (attached) which implements some new
addressing modes that enable position-independent code for small embedded
systems. Is this something that would be accepted upstream? I think the ARM
backend changes are fairly uncontroversial, but the clang changes introduce a
lot of ROPI/RWPI-specific changes in otherwise target-independent code. If the
clang changes are not acceptable, it is still possible to use just the ARM
backend changes (with a smaller clang patch for command-line options only), as
the C code which needs special lowering is rare, easy to work around and easy
for a linker to detect.

This patch (along with the corresponding ARM backend patch) adds support
for some new relocation models:
- Read-only position independence (ROPI): Code and read-only data are
  accessed PC-relative. The offsets between all code and RO data
  sections are known at static link time.
- Read-write position independence (RWPI): Read-write data is accessed
  relative to a static base register (R9). The offsets between all writable
  data sections are known at static link time.
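
As a rough illustration of the RWPI addressing model (not code from the patch), the sketch below simulates "static base plus link-time offset" in portable C. The names rw_data, static_base, rwpi_startup and bump_counter are all hypothetical; on a real target the base lives in R9 and the compiler generates the offset arithmetic itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout of the writable data; offsets are fixed at link time. */
struct rw_data {
    int counter;
    int limit;
};

/* Stand-in for the static base register (R9 under RWPI). */
static unsigned char *static_base;

/* Every RW access is "static base + link-time offset". */
static int *rw_int(size_t offset)
{
    return (int *)(void *)(static_base + offset);
}

/* Startup step: copy the initial image into whatever RAM was allocated,
   then point the static base at it. Nothing else needs fixing up. */
void rwpi_startup(void *ram, const struct rw_data *initial_image)
{
    memcpy(ram, initial_image, sizeof *initial_image);
    static_base = (unsigned char *)ram;
}

/* Ordinary code only ever sees offsets, never absolute addresses. */
int bump_counter(void)
{
    int *counter = rw_int(offsetof(struct rw_data, counter));
    int *limit = rw_int(offsetof(struct rw_data, limit));
    if (*counter < *limit)
        ++*counter;
    return *counter;
}
```

Calling rwpi_startup() again with a different RAM block gives the same code a fresh, independent copy of its data, which is the property these modes are after.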

These two modes are independent (they specify how different objects
should be addressed), so they can be used individually or together. They are
otherwise the same as the "static" relocation model, and are not compatible
with SysV-style PIC.

These modes are normally used by bare-metal systems or small real-time
operating systems. They are designed to avoid the need for a dynamic linker;
the only initialisation required is setting the static base register to an
appropriate value for RWPI code. They also minimise the size of the writable
portion of the executable, for systems with very limited RAM.

I have only added support to SelectionDAG, not FastISel, because FastISel is
currently disabled for bare-metal targets where these modes would be used.

On the clang side, the following command-line options are added:
  -fropi
  -frwpi
  -fropi-lowering
  -frwpi-lowering
The first two enable ROPI and RWPI modes, and the second two enable
lowering of static initialisers that are not compatible with ROPI/RWPI.
Most users will not need to use the second two options, as they are
turned on by default when the -fropi and -frwpi options are used. All of
these options have -fno-* equivalents.

In addition to passing the command-line options through to the backend,
clang must be changed to work around a limitation in these modes: since
there is no dynamic loader, if a variable is initialised to the address
of a global value, its initial value is not known at static link time.
For example:

  extern int i;
  int *p = &i; // Initial value unknown at static link time

SysV-style PIC solves this by having the dynamic linker fix up any
relocations on the data segment. Since these modes are trying to avoid
the need for a dynamic linker, we instead have the compiler emit code to
initialise these variables at startup time. These initialisers are
expected to be rare, so the dynamic initialisers will be smaller than
the equivalent dynamic linker plus relocation and symbol tables.
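
To make the lowering concrete, here is a hand-written approximation of what the compiler would effectively produce for the example above. It is only a sketch: it assumes the GCC/Clang constructor attribute, which on ELF targets is typically run from .init_array, and the function names init_p and read_through_p are made up. The patch does this at the IR level rather than in source.

```c
extern int i;
int i = 42;   /* defined here only so the sketch links; normally in another TU */

/* Source form:  int *p = &i;
   Lowered form: the static initialiser is dropped, so p starts out zeroed... */
int *p;

/* ...and a startup-time initialiser stores the address instead.
   __attribute__((constructor)) is a GCC/Clang extension that runs
   the function before main(). */
__attribute__((constructor))
static void init_p(void)
{
    p = &i;
}

/* By the time normal code runs, p is valid. */
int read_through_p(void)
{
    return *p;
}
```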

If a variable with an initialiser that needs lowering is declared with a
const-qualified type, we must emit it as a non-constant so that it gets
put into writable memory. I'm using the "externally_initialized" flag to
prevent the optimiser from being able to turn dynamic initialisers back
into static ones.

Making a variable non-const can cause a chain of variables to need
initialisers in RWPI-only mode. For example:

  extern int a;
  static int * const b = &a;
  static int * const * const c = &b;

Here, "c" looks like it does not need a dynamic init, because "b" is
declared const. However, "b" itself needs a dynamic init, so must be
made non-const, meaning that "c" now needs a dynamic init. My patch
handles this correctly, but there is a similar case where it does not:

  extern int a;
  static int * const b;
  static int * const * const c = &b;
  static int * const b = &a;

Due to the design of clang, the IR for "c" has already been emitted
(as a constant, with a static initialiser) when the initialiser for "b"
is parsed, making "c"'s initialiser wrong. I haven't been able to find a
good way to implement this properly, so for now I'm working around this
by enabling both the ROPI and RWPI lowering when in RWPI-only mode. This
means that "c" will be given a dynamic init, and making "b" non-constant
does not change anything.
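
For the first chain example (the one the patch handles), the lowered result is roughly equivalent to the following hand-written C. This is a sketch under the same assumption as before (GCC/Clang constructor attribute); init_b_then_c and read_a_through_c are hypothetical names. The point is that both variables lose their top-level const and are initialised in dependency order.

```c
extern int a;
int a = 5;   /* defined here only so the sketch links */

/* Source: static int * const b = &a;
   Lowered: made writable, because &a is not known at static link time. */
static int *b;

/* Source: static int * const * const c = &b;
   Lowered: b now lives in RW data, so &b is not known at link time either. */
static int *const *c;

/* One startup initialiser, filling in b before c. */
__attribute__((constructor))
static void init_b_then_c(void)
{
    b = &a;
    c = &b;
}

int read_a_through_c(void)
{
    return **c;
}
```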

I have added some new warnings for cases where an ABI mismatch between
two translation units could be caused by ROPI/RWPI. These are:
- Extern global variables with const-qualified incomplete types. These
  are assumed to be constant, but may be put in a writable section by
  the TU which defines them if they have a non-trivial constructor or
  mutable member.
- Externally-visible variables with const-qualified types, where
  initialiser lowering makes them non-const. Other translation units
  will not know that the lowering has happened, and access them as RO
  rather than RW data.

I have also prohibited using ROPI with C++ (the vtables and RTTI are
read-only data that must contain absolute pointers to other RO data),
but this can be overridden with -fallow-unsupported.

This also adds 3 new pre-defined macros for ARM targets:
  __APCS_FPIC
  __APCS_ROPI
  __APCS_RWPI

They are defined when building code with the -fpic, -fropi and -frwpi
options, respectively. __APCS_FPIC is also defined for AArch64 targets,
but the other two are not supported for AArch64. These macros are not
defined in the ACLE or any other standard; they are named to match the
macros defined by ARM Compiler 5.
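
A small sketch of how user code might test these macros. The function name apcs_mode and the strings it returns are illustrative only; on a compiler without this patch none of the macros are defined and the last branch is taken.

```c
/* Describe the addressing mode this translation unit was built with,
   using only the three macros named above. */
const char *apcs_mode(void)
{
#if defined(__APCS_ROPI) && defined(__APCS_RWPI)
    return "ropi+rwpi";
#elif defined(__APCS_ROPI)
    return "ropi";
#elif defined(__APCS_RWPI)
    return "rwpi";
#elif defined(__APCS_FPIC)
    return "fpic";
#else
    return "none of the above";
#endif
}
```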

ropi-rwpi-clang.patch (68.1 KB)

ropi-rwpi-llvm.patch (37.3 KB)

You don't need a full blown dynamic linker to handle that, just that the
linker creates output that can be appropriately referenced by the init
code. I don't think that dynamic initialisers will work correctly at
all, since you can access "i" in a separate module that doesn't know
about the initialiser at all.

Consider taking a look at how most dynamic linkers operate themselves in the
ELF world. One of the first things they do is relocate themselves by
processing their own relocation table and applying the fixups. This
doesn't involve symbol tables at all, just patching up addresses.
As such, I don't think such a transformation belongs in clang.

Joerg

What does armcc do here? It's been a while but I thought it was part
of the scatter-loading initialisation, with some kind of compressed
representation in the final linked image.

Cheers.

Tim.

armcc does the same thing as this patch: emit dynamic initialisers which get called from .init_array at startup.

Oliver

>> In addition to passing the command-line options through to the backend,
>> clang must be changed to work around a limitation in these modes: since
>> there is no dynamic loader, if a variable is initialised to the address
>> of a global value, its initial value is not known at static link time.
>> For example:
>>
>> extern int i;
>> int *p = &i; // Initial value unknown at static link time
>>
>> SysV-style PIC solves this by having the dynamic linker fix up any
>> relocations on the data segment. Since these modes are trying to avoid
>> the need for a dynamic linker, we instead have the compiler emit code to
>> initialise these variables at startup time. These initialisers are
>> expected to be rare, so the dynamic initialisers will be smaller than
>> the equivalent dynamic linker plus relocation and symbol tables.
>
> You don't need a full blown dynamic linker to handle that, just that the
> linker creates output that can be appropriately referenced by the init
> code.

That works fine for references in writable data, but not for references in read-only data, which a dynamic linker can't change. There are a few different ways this can be solved:
1) The patch I proposed, which makes const data writable and inserts dynamic initialisers.
2) Just move const data needing initialisation to an RW section in the compiler, and have a dynamic loader which adjusts it (based on tables generated by the linker). This would provide the same behaviour as option 1, but I don't think it would reduce the clang patch that much.
3) Don't make any change in the compiler. The linker can detect that const data needs dynamic relocations, and emit an error, so the user can change their code to make the data non-const. The linker can't make this change automatically, as there is already compiled code which accesses it, and making it non-const changes the way it should be addressed.

> I don't think that dynamic initialisers will work correctly at
> all, since you can access "i" in a separate module that doesn't know
> about the initialiser at all.

This is only a problem when a const initialiser has to be placed in a read-write section, as other translation units will access it incorrectly. I've added a warning when this happens.

> Consider taking a look at how most dynamic linkers operate themselves in the
> ELF world. One of the first things they do is relocate themselves by
> processing their own relocation table and applying the fixups. This
> doesn't involve symbol tables at all, just patching up addresses.

That sounds like the same thing as options 2 and 3 above, right? I think the main difference in the embedded world is that code and RO data are stored in ROM or flash which are impossible or slow to overwrite, and minimising the amount of RAM used is desirable. Also, since this isn't being used for actual dynamic linking but just for a few static initialisers, the dynamic loader would be an unnecessary increase in code size.

> As such, I don't think such a transformation belongs in clang.

Fair enough, I posted this as an RFC because it is quite different to what already exists in clang, and option 3 above shows that the backend change can still be used without the majority of the clang patch.

Oliver

>> I don't think that dynamic initialisers will work correctly at
>> all, since you can access "i" in a separate module that doesn't know
>> about the initialiser at all.
>
> This is only a problem when a const initialiser has to be placed in
> a read-write section, as other translation units will access it
> incorrectly. I've added a warning when this happens.

Depends on how exactly your initialiser code works. I had assumed you
were going with a TLS-like model of init-on-first-access, since you
didn't want to use dynamic-linker-like startup code...

>> Consider taking a look at how most dynamic linkers operate themselves in the
>> ELF world. One of the first things they do is relocate themselves by
>> processing their own relocation table and applying the fixups. This
>> doesn't involve symbol tables at all, just patching up addresses.
>
> That sounds like the same thing as options 2 and 3 above, right?
> I think the main difference in the embedded world is that code and RO
> data are stored in ROM or flash which are impossible or slow to
> overwrite, and minimising the amount of RAM used is desirable. Also,
> since this isn't being used for actual dynamic linking but just for a
> few static initialisers, the dynamic loader would be an unnecessary
> increase in code size.

Stop being hung up on the term dynamic loader. Let's take a look at the
NetBSD implementation for ARM. The entry point of RTLD is _rtld_start:

http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/libexec/ld.elf_so/arch/arm/rtld_start.S?rev=1.12

The important part here is the call to _rtld_relocate_nonplt_self, which
can be found in:

http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/libexec/ld.elf_so/arch/arm/mdreloc.c?rev=1.38

That's all the code needed for rtld to be position independent. Your
embedded case is likely to be quite similar in complexity -- no
iteration over the ELF program header, but maybe more than one
relocation type (implicitly assumed). Overall, it should be much less
than 150 bytes of code.
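
To give an idea of the shape of such a loop, here is a minimal sketch in portable C. It is not the NetBSD code: struct rel_entry, apply_relative_relocs and the table format are invented for illustration, and it handles only a single "add the load bias" relocation type, in the spirit of a relative relocation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry produced by the linker: the byte offset,
   from the start of the writable data, of a word that holds a
   link-time address. */
struct rel_entry {
    size_t offset;
};

/* Add the load bias to every word the table points at. No symbol
   lookup is involved; this only patches addresses in place. */
void apply_relative_relocs(unsigned char *data,
                           const struct rel_entry *table,
                           size_t count,
                           uintptr_t bias)
{
    for (size_t n = 0; n < count; ++n) {
        uintptr_t word;
        /* memcpy avoids alignment and aliasing assumptions. */
        memcpy(&word, data + table[n].offset, sizeof word);
        word += bias;
        memcpy(data + table[n].offset, &word, sizeof word);
    }
}
```

Startup code would call this once, before anything reads the patched words; words not listed in the table are left untouched.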

Joerg