Personally I care more about the function case.
The function case improves performance (ld defaulting to
-Bsymbolic-non-weak-functions; see
"[PATCH] gold: Add -Bsymbolic-non-weak-functions").
For the variable case (copy relocations) I care less. I just don't want GNU
folks to make the scheme too complex.
Anyway, my replies to copy relocations are below.
>
> > > > Fixing the last point is actually easy: let -fno-pic use GOT when
> > > > taking the address of a non-definition function.
> > >
> > > I'd far prefer to have an attribute to explicitly say that the address
> > > of a given symbol should always be computed indirectly (e.g. via GOT).
> > > That gives the explicit control necessary for libraries without
> > > penalizing the larger executables like clang.
> > >
> > > Joerg
> >
> > Taking the address (in code) of a non-definition function is rare,
> > rarer after optimization. At least when building clang, I cannot find
> > any penalty.
>
> I was not talking about just functions. I can't even think of a case
> where pointer equality for function pointers matters. But the case I
> care far more about is being able to avoid copy relocations for global
> variables and that's the same problem (loading the address of a symbol).
>
> Joerg
On the Clang side, `-fno-pic -fno-direct-access-external-data` uses
GOT to access a default visibility global variable today.
If all TUs use this option and assembly files do the right thing, copy
relocations can be avoided.
Most code in the wild doesn't use visibility flags and would be
penalized by that. An attribute would allow explicitly opting out of
direct access for system headers and other libraries.
OpenBSD has PIE enabled by default on most architectures since OpenBSD 5.3.
All(most?) major Linux distributions have configured their GCC with
--enable-default-pie now.
FreeBSD has switched to default PIE for 64-bit architectures this year.
Users who care about -fno-pic performance are very few now.
The static linking scheme is shifting to the static PIE model as well.
(The trend was led by OpenBSD, followed by musl in 2015, followed by
the glibc world in 2017:
https://sourceware.org/bugzilla/show_bug.cgi?id=19574)
Global variable access can hardly account for 1% of an application's run
time. Choosing between a direct variable access and an indirect access via
a prefilled GOT entry is an optimization within that 0.xx%.
extern int var;
int foo() { return var; }
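
To make the two access sequences concrete, here is a rough sketch of the same
snippet. The assembly in the comments is what I would expect on x86-64, not
output copied from a build, so treat it as an assumption. `var` is defined in
the same file only so that the example links on its own; in the case under
discussion the definition would live in a shared library.

```c
/* Declaration as it would appear in a library header (default visibility). */
extern int var;

/* -fno-pic, direct access (expected x86-64 sequence, an assumption):
 *     movl var(%rip), %eax
 *   If var lives in a shared library, the linker has to create a copy
 *   relocation so that this direct reference still resolves.
 *
 * -fno-pic -fno-direct-access-external-data (expected sequence):
 *     movq var@GOTPCREL(%rip), %rax
 *     movl (%rax), %eax
 *   The address comes from a GOT entry, so no copy relocation is needed.
 */
int foo(void) { return var; }

/* Stand-in definition so this sketch is self-contained. */
int var = 42;
```

The cost difference is one extra load from a GOT entry that the dynamic loader
has already filled in.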
I know i386 and ppc32 can take a large performance hit if we use GOT.
If we want to default -fno-pic to -fno-direct-access-external-data,
we can leave such architectures behind. I just checked: -target i386 and
-target ppc32 with -fno-direct-access-external-data do not use GOT - the
backend has not implemented the non-pic GOT scheme.
I know some folks prefer eliminating copy relocations for ABI and
security reasons.
I deliberately narrow the scope to functions because functions
are where we can improve performance.
For functions there are two cases: "unnamed" address use and "named"
address use. Kind of similar to what we have already for global
variables on whether they can be merged or not. Unnamed as in "I don't
care if it is the canonical address", so the linker is free to introduce
a PLT slot. This works fine on all architectures and without any
penalties if the binding is local. There might be some flag needed here
because the glibc implementation of the dynamic linker wants to do some
wonky fixup on the PLT, but that's a glibc specific issue and outside
the scope of LLVM. For the named address use we do care about the
canonical address and that's where the distinction of attributed vs
default assumption makes a difference: loading a pointer from the GOT vs
doing a (PC relative) address load. On i386 the former didn't have
patchable relocation support for a long time and I'm not sure it exists
nowadays, i.e. allowing the linker to relax the mov into lea.
The x86-64 mov->lea scheme is called GOTPCRELX optimization.
i386 has the `mov foo@GOT(%reg1), %reg2` => `lea foo@GOTOFF(%reg1), %reg2` optimization.
Anyway, i386 performance probably doesn't matter for anything now.
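
As a rough illustration of where the mov->lea relaxation applies, here is a
sketch of taking a function's address versus simply calling it. The names
`callee`, `take_address` and `call_it` are made up for this example, and the
assembly in the comments is what I would expect on x86-64 rather than verified
output. `callee` is defined in the same file only so that the sketch links on
its own.

```c
/* Non-definition declaration, as seen by the code taking the address. */
extern int callee(void);

typedef int (*fn_t)(void);

/* Address-significant use: the result may be compared with the address
 * taken in another component, so it must be the canonical address.
 *   Via GOT (expected):  movq callee@GOTPCREL(%rip), %rax
 *                        (relocation: R_X86_64_REX_GOTPCRELX)
 *   After the linker relaxes it, when callee is non-preemptible:
 *                        leaq callee(%rip), %rax
 */
fn_t take_address(void) { return callee; }

/* A plain call needs no canonical address; the linker is free to route
 * it through a PLT entry if callee ends up in another component. */
int call_it(void) { return callee(); }

/* Stand-in definition so this sketch is self-contained. */
int callee(void) { return 7; }
```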
It can be
even more complicated on other archs where address computations are
complicated like Sparc. The attribute infrastructure here is the same as
would be needed for global variables and those are where the more
expensive issues are. Copy relocations e.g. for a constant array can be
arbitrarily expensive and are an ABI maintenance nightmare, so finally
having a way that is cheap to avoid them would be a great step forward.
Yes, I have seen such a large constant array, perhaps from some old ffmpeg
assembly code, or something like that.
There is a minor security risk (relro data can become writable; ld.lld has
fixed the problem for the non-linker-script case).
Proposal for this would be to have an attribute to specify the "owner"
of the implementation as a string and a matching clang option to specify
a non-default owner (e.g. __attribute__((definedby("libc"))) and
-fdefining=libc) and the empty string being the default, meaning the
main binary.
How does your "definedby" scheme improve external variable access performance?
Windows/macOS/Solaris do record where the symbols are imported from,
but the information is only recorded after linking.
Object files don't record imports. This provides flexibility in reorganizing
libraries without needing to fix up the code.