-fpic ELF default: reclaim some -fno-semantic-interposition optimization opportunities?

I have a write-up about -fno-semantic-interposition in GCC and Clang.
<https://gist.github.com/MaskRay/2d4dfcfc897341163f734afb59f689c6>

Being preemptible by default is perhaps the ELF default that I like the least
(it is less an issue of ELF itself than of toolchain defaults).
Clang is in a somewhat better situation than GCC because our longstanding default -fpic behavior
diverges a lot from -fsemantic-interposition and is closer to -fno-semantic-interposition.
On non-x86 targets in particular, semantic interposition of functions likely never works in
-fno-function-sections mode.

Sending this message to get some thoughts on reclaiming some -fno-semantic-interposition optimization
opportunities for the ELF -fpic default, at least for non-x86. We could start from a
clang/CMakeLists.txt cmake variable.
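
To make the optimization opportunity concrete, here is a minimal sketch, assuming the file is compiled with -fpic into a shared object. The names `get_limit` and `below_limit` are made up for illustration.

```c
/* Exported (default visibility) definition. On ELF, a definition of
   get_limit in the executable or in an earlier-loaded shared object
   can preempt this one at run time. */
int get_limit(void) { return 10; }

/* Caller in the same translation unit. If semantic interposition is
   honored, the compiler has to keep a real call to get_limit and may
   not assume it returns 10. With -fno-semantic-interposition it may
   inline the callee, so this can become `return x < 10;`. */
int below_limit(int x) { return x < get_limit(); }
```

Comparing the assembly for `below_limit` with and without -fno-semantic-interposition shows whether a given compiler keeps the call.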

dogfood: [llvm-dev] "[CMake][ELF] Add -fno-semantic-interposition and -Bsymbolic-functions" (build llvm-project itself with -fno-semantic-interposition)

Circling back here.

The Clang default -fpic behavior is actually similar to gcc -fpic
-fno-semantic-interposition: interprocedural optimizations are
allowed.
Clang just doesn't use local aliases.
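
For readers unfamiliar with the term, a local alias can be written by hand with the GNU `alias` attribute. This is only a sketch of the idea, assuming an ELF target and GCC or Clang; the names `scale`, `scale_local` and `scale_twice` are hypothetical, and the compiler-generated aliases use different, internal names.

```c
/* Exported definition: references to `scale` by name are subject to
   symbol preemption when this lives in a shared object. */
int scale(int x) { return 3 * x; }

/* Hand-written local alias: a second, internal-linkage name for the
   same code. References through it always bind to the definition
   above and need no PLT or GOT entry. */
static int scale_local(int x) __attribute__((alias("scale")));

/* Uses the alias, so an interposed `scale` would not be observed here. */
int scale_twice(int x) { return scale_local(scale_local(x)); }
```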

It turned out that suppressing variable interposition was my misreading
of GCC's -fno-semantic-interposition documentation.
D102583 ("-fno-semantic-interposition: Don't set dso_local on GlobalVariable") made the Clang behavior mostly match GCC.

There is only one compatibility thing left: Clang -fpic
-fno-semantic-interposition uses local aliases when taking the
address of a function. This can be incompatible with -fno-pic code,
which causes canonical PLT entries.
Such a pointer equality property for functions is rarely relied on in
practice (Windows requires deliberate dllimport/dllexport; --icf=all
can break this from a different angle), so -fno-semantic-interposition
is generally fine.
Fixing the last point is actually easy: let -fno-pic use GOT when
taking the address of a non-definition function.
This is preferable on most architectures. Only i386/ppc32 (and some
other exotic arches which may not be supported by llvm at all) may
take some performance hit, but taking the address of a non-definition
function is rare and should not be a performance bottleneck.
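
The pointer equality property mentioned above can be sketched in a single file. This is an illustration only, assuming a hosted C environment where `puts` comes from libc; the helper names are hypothetical.

```c
#include <stdio.h>

typedef int (*puts_fn)(const char *);

/* Address stored in data: the slot is filled in through a relocation
   on the data section. */
static puts_fn table[] = { puts };

/* Address taken in code. With -fno-pic the compiler traditionally
   emits a direct reference; if puts is defined in a shared object,
   the linker then needs a canonical PLT entry in the executable so
   that every module agrees on a single address. */
puts_fn puts_from_code(void) { return puts; }

/* Address read back from the data slot. */
puts_fn puts_from_data(void) { return table[0]; }
```

C requires the two results to compare equal. The canonical PLT entry, or alternatively a GOT load for the address taken in code, is how the toolchain keeps that promise across modules.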

GCC bug 100593, "[ELF] -fno-pic: Use GOT to take address of an external default visibility function".
I guess it may be difficult to even get an agreed upon option from the GCC side as they
may not be fans of fixing these fundamental issues.

I'd far prefer to have an attribute to explicitly say that the address
of a given symbol should always be computed indirectly (e.g. via GOT).
That gives the explicit control necessary for libraries without
penalizing the larger executables like clang.

Joerg

Taking the address (in code) of a non-definition function is rare,
and rarer after optimization. At least when building clang, I cannot find
any penalty.

With the following patch, -fno-pic will use GOT for non-i386.
I tested a stage-2 clang. The clang executable is **byte identical** if I use
such a modified clang to build the origin/main clang.

void ext(); // non-definition declaration
void *foo() { return (void*)ext; }

diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 9b31ecdbd81a..d451ec50f53d 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1057,10 +1057,10 @@ static bool shouldAssumeDSOLocal(const CodeGenModule &CGM,
      // -fno-pic sets dso_local on a function declaration to allow direct
      // accesses when taking its address (similar to a data symbol). If the
      // function is not defined in the executable, a canonical PLT entry will be
-     // needed at link time. -fno-direct-access-external-data can avoid the
-     // canonical PLT entry. We don't generalize this condition to -fpie/-fpic as
-     // it could just cause trouble without providing perceptible benefits.
-     if (isa<llvm::Function>(GV) && !CGOpts.NoPLT && RM == llvm::Reloc::Static)
+     // needed at link time. We only do this for legacy i386 where some
+     // applications may handle R_386_PC32 but not R_386_PLT32.
+     if (TT.getArch() == llvm::Triple::x86 && isa<llvm::Function>(GV) &&
+         !CGOpts.NoPLT && RM == llvm::Reloc::Static)
        return true;
    }
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index a6582879f6f3..e2dd02c4eae5 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -51673,7 +51673,8 @@ void X86TargetLowering::LowerAsmOperandForConstraint(SDValue Op,
      if (auto *GA = dyn_cast<GlobalAddressSDNode>(Op))
        // If we require an extra load to get this address, as in PIC mode, we
        // can't accept it.
-       if (isGlobalStubReference(
+       if (getTargetMachine().getRelocationModel() != Reloc::Static &&
+           isGlobalStubReference(
                Subtarget.classifyGlobalReference(GA->getGlobal())))
          return;
      break;

I was not talking about just functions. I can't even think of a case
where pointer equality for function pointers matters. But the case I
care far more about is being able to avoid copy relocations for global
variables and that's the same problem (loading the address of a symbol).

Joerg

On the Clang side, `-fno-pic -fno-direct-access-external-data` uses
GOT to access a default visibility global variable today.
If all TUs use this option and assembly files do the right thing, copy
relocations can be avoided.

I know some folks prefer eliminating copy relocations for ABI and
security reasons.
I deliberately make the scope narrow to functions because functions
are where we can improve performance.

The x86 ABI maintainer seems to want to do this in a more complex way.
I have commented on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593

> On the Clang side, `-fno-pic -fno-direct-access-external-data` uses
> GOT to access a default visibility global variable today.
> If all TUs use this option and assembly files do the right thing, copy
> relocations can be avoided.

Most code in the wild doesn't use visibility flags and would be
penalized by that. An attribute would allow explicitly opting out
of direct access for system headers and other libraries.
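
For reference, the existing visibility annotation looks like the sketch below. The variable and function names are hypothetical, and both the declaration and the definition are placed in one file only to keep the example self-contained.

```c
/* Declaration only, as it would appear in a project header. Hidden
   visibility promises that the definition is linked into the same
   executable or shared object, so the compiler may keep using direct
   access even when default visibility variables go through the GOT. */
extern int request_count __attribute__((visibility("hidden")));

/* Increments the counter and returns the new value. */
int next_request(void) { return ++request_count; }

/* In a real project this definition would sit in another translation
   unit of the same component. It inherits the hidden visibility from
   the declaration above. */
int request_count = 0;
```

Unannotated declarations, which are the common case, would not get this treatment. That is the penalty described above.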

> I know some folks prefer eliminating copy relocations for ABI and
> security reasons.
> I deliberately make the scope narrow to functions because functions
> are where we can improve performance.

For functions there are two cases: "unnamed" address use and "named"
address use. Kind of similar to what we have already for global
variables on whether they can be merged or not. Unnamed as in "I don't
care if it is the canonical address", so the linker is free to introduce
a PLT slot. This works fine on all architectures and without any
penalties if the binding is local. There might be some flag needed here
because the glibc implementation of the dynamic linker wants to do some
wonky fixup on the PLT, but that's a glibc specific issue and outside
the scope of LLVM. For the named address use we do care about the
canonical address and that's where the distinction of attributed vs
default assumption makes a difference: loading a pointer from the GOT vs
doing a (PC relative) address load. On i386 the former didn't have
patchable relocation support for a long time and I'm not sure it exists
nowadays, i.e. allow the linker to relax the mov into lea. It can be
even more complicated on other archs where address computations are
complicated like Sparc. The attribute infrastructure here is the same as
would be needed for global variables and those are where the more
expensive issues are. Copy relocations e.g. for a constant array can be
arbitrarily expensive and are an ABI maintenance nightmare, so finally
having a way that is cheap to avoid them would be a great step forward.

Proposal for this would be to have an attribute to specify the "owner"
of the implementation as a string and a matching clang option to specify
a non-default owner (e.g. __attribute__((definedby("libc"))) and
-fdefining=libc) and the empty string being the default, meaning the
main binary.

Joerg

Personally I care more about the function case.
The function case improves performance (think of a default ld
-Bsymbolic-non-weak-functions; see
"[PATCH] gold: Add -Bsymbolic-non-weak-functions").

For the variable case (copy relocations) I care less. I just don't want GNU
folks to make the scheme too complex.

Anyway, my replies to copy relocations are below.

> > On the Clang side, `-fno-pic -fno-direct-access-external-data` uses
> > GOT to access a default visibility global variable today.
> > If all TUs use this option and assembly files do the right thing, copy
> > relocations can be avoided.

> Most code in the wild doesn't use visibility flags and would be
> penalized by that. An attribute would allow explicitly opting out
> of direct access for system headers and other libraries.

OpenBSD has PIE enabled by default on most architectures since OpenBSD 5.3.
All(most?) major Linux distributions have configured their GCC with
--enable-default-pie now.
FreeBSD has switched to default PIE for 64-bit architectures this year.
Users who care about -fno-pic performance are very few now.

The static linking scheme is shifting to the static PIE model as well.
(The trend was led by OpenBSD, followed by musl in 2015, followed by
glibc world in 2017
https://sourceware.org/bugzilla/show_bug.cgi?id=19574)

Global variable access can hardly take 1% of an application's run time. Choosing
between a direct variable access and an indirect access via a prefilled GOT entry
is an optimization within that 0.xx%.

extern int var;
int foo() { return var; }

I know i386 and ppc32 can take a large performance hit if we use GOT.
If we want to default -fno-pic to -fno-direct-access-external-data,
we can leave such architectures behind. I just checked: -target i386 and -target ppc32
with -fno-direct-access-external-data do not use GOT, because the backends have not
implemented the non-pic GOT scheme.

> > I know some folks prefer eliminating copy relocations for ABI and
> > security reasons.
> > I deliberately make the scope narrow to functions because functions
> > are where we can improve performance.

> For functions there are two cases: "unnamed" address use and "named"
> address use. Kind of similar to what we have already for global
> variables on whether they can be merged or not. Unnamed as in "I don't
> care if it is the canonical address", so the linker is free to introduce
> a PLT slot. This works fine on all architectures and without any
> penalties if the binding is local. There might be some flag needed here
> because the glibc implementation of the dynamic linker wants to do some
> wonky fixup on the PLT, but that's a glibc specific issue and outside
> the scope of LLVM. For the named address use we do care about the
> canonical address and that's where the distinction of attributed vs
> default assumption makes a difference: loading a pointer from the GOT vs
> doing a (PC relative) address load. On i386 the former didn't have
> patchable relocation support for a long time and I'm not sure it exists
> nowadays, i.e. allow the linker to relax the mov into lea.

The x86-64 mov->lea scheme is called the GOTPCRELX optimization.

i386 has the `mov foo@GOT(%reg1), %reg2` => `lea foo@GOTOFF(%reg1), %reg2` optimization.
Anyway, i386 performance probably doesn't matter for anything now.

> It can be
> even more complicated on other archs where address computations are
> complicated like Sparc. The attribute infrastructure here is the same as
> would be needed for global variables and those are where the more
> expensive issues are. Copy relocations e.g. for a constant array can be
> arbitrarily expensive and are an ABI maintenance nightmare, so finally
> having a way that is cheap to avoid them would be a great step forward.

Yes, I have seen such a large constant array, perhaps from some old ffmpeg
assembly code, or something like that.

There is a minor security risk (relro data can become writeable; ld.lld has
fixed the problem for non-linker-script case).

> Proposal for this would be to have an attribute to specify the "owner"
> of the implementation as a string and a matching clang option to specify
> a non-default owner (e.g. __attribute__((definedby("libc"))) and
> -fdefining=libc) and the empty string being the default, meaning the
> main binary.

How does your "definedby" scheme improve external variable access performance?

Windows/macOS/Solaris do record where symbols are imported from,
but the information is only recorded after linking.
Object files don't record imports. This provides flexibility in reorganizing libraries
without needing to fix up the code.