[lld] avoid emitting PLT entries for ifuncs

Hello,

We've recently started using ifuncs in the x86(_64) FreeBSD kernel.
Currently lld will emit a PLT entry for each ifunc, so ifunc calls are
more expensive than those of regular functions. In our kernel, this
overhead isn't really necessary: if lld instead emits PC-relative
relocations for each ifunc call site, where each relocation references
a symbol of type GNU_IFUNC, then during boot we can resolve each
call site and apply the relocation before mapping the kernel text
read-only. Then, ifunc calls have the same overhead as regular function
calls.
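
As a rough illustration of the boot-time step described above (not the
FreeBSD kernel's actual code), the sketch below patches one x86-64 call
site using the usual PC-relative formula S + A - P. The names PcRelReloc
and applyPcRel32 are hypothetical, and the text segment is modelled as a
plain byte vector that is still writable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified stand-in for an ELF RELA entry that targets a
// 32-bit PC-relative field (as R_X86_64_PC32 / R_X86_64_PLT32 do).
struct PcRelReloc {
  uint64_t Offset; // offset of the 32-bit field from the start of the text
  int64_t Addend;  // typically -4 for an x86-64 call instruction
};

// Patch one call site: field = S + A - P, where S (Target) is the address
// returned by the ifunc resolver and P is the run-time address of the field.
// Returns false if the displacement does not fit in 32 bits.
bool applyPcRel32(std::vector<uint8_t> &Text, uint64_t TextBase,
                  const PcRelReloc &R, uint64_t Target) {
  assert(R.Offset + 4 <= Text.size() && "relocation outside text");
  int64_t P = static_cast<int64_t>(TextBase + R.Offset);
  int64_t Value = static_cast<int64_t>(Target) + R.Addend - P;
  if (Value < INT32_MIN || Value > INT32_MAX)
    return false;
  int32_t Disp = static_cast<int32_t>(Value);
  std::memcpy(&Text[R.Offset], &Disp, sizeof(Disp));
  return true;
}
```

Once every such field has been written, the call goes straight to the
resolved implementation, which is why no PLT entry is needed.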

To implement this optimization, I wrote an lld patch to add
"-z ifunc-noplt". When this option is specified, lld does not create
PLT entries for ifuncs and instead passes the existing PC-relative
relocation through to the output file. The patch is below; I tested it
with lld 7.0 and the patch applied without modifications to the sources
in trunk.

I'm wondering if such an option would be acceptable in upstream lld, and
whether anyone had comments on my implementation. The patch is lacking
tests, and I had some questions:
- How should "-z ifunc-noplt" interact with "-z text"? Should the
  invoker be required to additionally specify "-z notext"?
- Could "-z ifunc-noplt" be subsumed by a more general mechanism which
  tells lld not to apply constant relocations and instead pass them
  through to the output file? I could imagine using such a mechanism
  to make it possible to dynamically enable retpoline at boot time.
  It could also be useful for implementing static DTrace trace points.

Thanks,
-Mark

diff --git a/ELF/Config.h b/ELF/Config.h
index 5dc7f5321..b5a3d3266 100644
--- a/ELF/Config.h
+++ b/ELF/Config.h
@@ -182,6 +182,7 @@ struct Configuration {
   bool ZCopyreloc;
   bool ZExecstack;
   bool ZHazardplt;
+  bool ZIfuncnoplt;
   bool ZInitfirst;
   bool ZKeepTextSectionPrefix;
   bool ZNodelete;
diff --git a/ELF/Driver.cpp b/ELF/Driver.cpp
index aced1edca..e7896cedf 100644
--- a/ELF/Driver.cpp
+++ b/ELF/Driver.cpp
@@ -340,7 +340,8 @@ static bool getZFlag(opt::InputArgList &Args, StringRef K1, StringRef K2,

 static bool isKnown(StringRef S) {
   return S == "combreloc" || S == "copyreloc" || S == "defs" ||
-         S == "execstack" || S == "hazardplt" || S == "initfirst" ||
+         S == "execstack" || S == "hazardplt" || S == "ifunc-noplt" ||
+         S == "initfirst" ||
          S == "keep-text-section-prefix" || S == "lazy" || S == "muldefs" ||
          S == "nocombreloc" || S == "nocopyreloc" || S == "nodelete" ||
          S == "nodlopen" || S == "noexecstack" ||
@@ -834,6 +835,7 @@ void LinkerDriver::readConfigs(opt::InputArgList &Args) {
   Config->ZCopyreloc = getZFlag(Args, "copyreloc", "nocopyreloc", true);
   Config->ZExecstack = getZFlag(Args, "execstack", "noexecstack", false);
   Config->ZHazardplt = hasZOption(Args, "hazardplt");
+  Config->ZIfuncnoplt = hasZOption(Args, "ifunc-noplt");
   Config->ZInitfirst = hasZOption(Args, "initfirst");
   Config->ZKeepTextSectionPrefix = getZFlag(
       Args, "keep-text-section-prefix", "nokeep-text-section-prefix", false);
diff --git a/ELF/Relocations.cpp b/ELF/Relocations.cpp
index 8f60aa3d2..a54d87e43 100644
--- a/ELF/Relocations.cpp
+++ b/ELF/Relocations.cpp
@@ -361,6 +361,10 @@ static bool isStaticLinkTimeConstant(RelExpr E, RelType Type, const Symbol &Sym,
           R_TLSLD_HINT>(E))
     return true;

+  // The computation involves output from the ifunc resolver.
+  if (Sym.isGnuIFunc() && Config->ZIfuncnoplt)
+    return false;

Hello Mark,

I'm not the LLD maintainer so this is just a personal opinion. If I
understand the optimisation correctly, if it is used on some program then
either the loader for the program or the program itself is responsible
for running the ifunc resolver and resolving the callsites. I think it
would have to come with a big health warning in at least the help and
documentation that platform/OS support is needed to run the program.

> - How should "-z ifunc-noplt" interact with "-z text"? Should the
>   invoker be required to additionally specify "-z notext"?

I think it could either be: -z text -z ifunc-noplt = error, with -z
ifunc-noplt implying -z notext; or -z ifunc-noplt is an error without -z
notext.

> - Could "-z ifunc-noplt" be subsumed by a more general mechanism which
>   tells lld not to apply constant relocations and instead pass them
>   through to the output file? I could imagine using such a mechanism
>   to make it possible to dynamically enable retpoline at boot time.
>   It could also be useful for implementing static DTrace trace points.

In theory on RELA platforms emit-relocs gets you pretty close; it
won't inhibit the generation of PLT or GOT entries though, but I think
it would give enough information to alter the callsites to the results
of the ifunc resolvers. I guess the problem here is where do you stop
and how portable would the solution be across different targets. For
example on Arm you would ideally only want to deal with a small subset
of the instruction relocations at run/load time. I think it is a
solvable problem but it does need some careful thought to avoid just
implementing something that works for a specific target/OS.

Peter

Yes - in our case we're using it for kernel ifuncs, and the kernel's
early reloc code handles the resolver and relocations. We'll
definitely want a cautionary note in the man page and other
documentation.

If you rewrite the PLT entry to be a plain jump whenever possible, the
difference should be pretty small. Have you considered that?

Joerg

I considered it, but don't like it as much: for each supported CPU
architecture we would need code to find the PLT entry referencing the
GOT entry being relocated, verify that the entry contains the
instruction(s) that we expect, and write the plain jump. For my
approach the kernel linker will just do the right thing for each CPU
architecture without requiring any magic. Having support in the static
linker means that the optimization is less fragile, and we do not incur
the cost of the extra jump. It may be that other projects can benefit
as well, and as I mentioned, I think there are other use-cases for
similar functionality.

> I'm not the LLD maintainer so this is just a personal opinion. If I
> understand the optimisation correctly, if it is used on some program then
> either the loader for the program or the program itself is responsible
> for running the ifunc resolver and resolving the callsites. I think it
> would have to come with a big health warning in at least the help and
> documentation that platform/OS support is needed to run the program.

That's a good point. For FreeBSD I had documented the option in the man
page, and will amend it as you suggest.

> - How should "-z ifunc-noplt" interact with "-z text"? Should the
> invoker be required to additionally specify "-z notext"?

> I think it could either be: -z text -z ifunc-noplt = error, with -z
> ifunc-noplt implying -z notext; or -z ifunc-noplt is an error without -z
> notext.

I think the latter option is preferable for such a rarely used option,
since it's more explicit.
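
A minimal sketch of that explicit policy, assuming the linker tracks
whether "-z text" or "-z notext" was given. The ZText enum and the
checkIfuncNoplt function are hypothetical names for illustration and are
not part of lld.

```cpp
#include <cassert>
#include <string>

// Hypothetical tri-state: how "-z text" / "-z notext" appeared on the
// command line.
enum class ZText { Unspecified, Text, NoText };

// Returns an empty string if the option combination is accepted, or an
// error message otherwise. Under the explicit policy, "-z ifunc-noplt"
// is only accepted together with an explicit "-z notext".
std::string checkIfuncNoplt(bool IfuncNoplt, ZText Text) {
  if (IfuncNoplt && Text != ZText::NoText)
    return "-z ifunc-noplt requires -z notext";
  return "";
}

// Example use, e.g. from a unit test.
void exampleChecks() {
  assert(checkIfuncNoplt(true, ZText::NoText).empty());
  assert(!checkIfuncNoplt(true, ZText::Text).empty());
  assert(!checkIfuncNoplt(true, ZText::Unspecified).empty());
  assert(checkIfuncNoplt(false, ZText::Text).empty());
}
```

The alternative policy (ifunc-noplt implies notext) would instead reject
only the Text case and silently accept Unspecified.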

Hi Mark,

Although I do understand your motivation to add this feature, because the proposed change works only with a specific loader, I’d explore other options before adding a new feature to the linker.

So the problem for the kernel loader is to know all locations from where the control jumps to ifunc PLT entries. Usually, once lld is done with linking, all traces of such relocations are discarded because they are no longer needed.

However, if you pass the -emit-relocs option to the linker, lld keeps all relocations that have already been resolved in an output executable. By analyzing a relocation table in a resulting executable, you could find all locations where the ifunc PLT is called. Then, you can construct a new table for your linker, embed it in the executable using objcopy or something like that, and then let the kernel loader interpret it.

Have you considered that?
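
To make the post-processing idea concrete, here is a small sketch of the
filtering step: walk a relocation table and keep only the entries whose
symbol has type STT_GNU_IFUNC. The Sym and Rela structs are simplified,
hypothetical stand-ins for the ELF structures, and reading the table
from a real executable (and embedding the result with objcopy) is left
out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// STT_GNU_IFUNC symbol type value from the GNU ELF extensions.
constexpr uint8_t kSttGnuIfunc = 10;

// Hypothetical, simplified symbol and relocation records.
struct Sym {
  uint8_t Type; // symbol type (low four bits of st_info in real ELF)
};
struct Rela {
  uint64_t Offset;   // location the relocation was applied to
  uint32_t SymIndex; // index into the symbol table
  uint32_t Type;     // relocation type, passed through unchanged
  int64_t Addend;
};

// Collect the relocations that reference ifunc symbols. The result is
// the "new table" a kernel loader could interpret at boot.
std::vector<Rela> collectIfuncRelocs(const std::vector<Rela> &Relocs,
                                     const std::vector<Sym> &Syms) {
  std::vector<Rela> Out;
  for (const Rela &R : Relocs) {
    assert(R.SymIndex < Syms.size() && "bad symbol index");
    if (Syms[R.SymIndex].Type == kSttGnuIfunc)
      Out.push_back(R);
  }
  return Out;
}
```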

I don't buy that argument. On the linker side, it is a custom hack for
something that is generally considered very bad nowadays: text
relocations. The PLT handling is already platform specific and some
platforms like SPARC do exactly this kind of PLT creation already.
In fact, with the recent popularity of non-lazy binding, it would make
sense to have wider support for it as it removes one of the
performance penalties with dynamic linking. The cost of the extra jump
should be pretty low in general.

Joerg

> However, if you pass the -emit-relocs option to the linker, lld keeps all
> relocations that have already been resolved in an output executable. By
> analyzing a relocation table in a resulting executable, you could find all
> locations where the ifunc PLT is called. Then, you can construct a new table
> for your linker, embed it in the executable using objcopy or something like
> that, and then let the kernel loader interpret it.
>
> Have you considered that?

I've thought about alternative ways to achieve the same thing,
including something like the above. My concern with that approach is
that it's rather cumbersome and can be error-prone, and introduces a
requirement for awkward multi-stage linking. In comparison Mark's
patch is a relatively tiny tweak in lld.

Despite the disadvantages I much prefer the proposed approach.

> I've thought about alternative ways to achieve the same thing,
> including something like the above. My concern with that approach is
> that it's rather cumbersome and can be error-prone, and introduces a
> requirement for awkward multi-stage linking.

A couple of years ago I did something very much like this to experiment
with hot-patched static DTrace probes in the kernel. My conclusion at
the time was that even this kind of functionality (selective filtering
of relocations as a post-processing step) was quite awkward and is much
more easily implemented in the static linker.[*] In this case, we need to
modify the loadable segment descriptions, and we need to do some
unwinding of the static linker's work when processing non-RELA
relocations in the kernel. It is all doable, but requires a significant
amount of both userland and kernel code and a custom,
FreeBSD-kernel-specific post-link step.

[*] dtrace -G basically does this already. See dt_link.c in libdtrace:
it's 2000 LOC of complicated ELF handling, and it's required numerous
tweaks over the years to deal with changing behavioural quirks of GNU
ld and lld. r313262 is an example of this:
https://svnweb.freebsd.org/changeset/base/313262
The integration of dtrace(1) into the build systems of projects that
wish to provide static probes is painful and often creates race
conditions with incremental rebuilds. Rafael and I discussed this
several times and his general sentiment was that the functionality
provided by dtrace -G can be implemented much more sanely in lld.

>
> On the linker side, it is a custom hack for
> something that is generally considered very bad nowadays: text
> relocations.

True, although the argument against .text relocations doesn't hold for
our kernel use case.

Moreover, while being a hack, it is IMHO simply the right place to
implement this logic from an architectural standpoint. The static
linker is pessimizing ifunc calls in a freestanding environment and it
has all the information required to avoid doing this, and in a robust
manner. The small size of the required lld patch supports this view.
Addressing the problem any other way involves adding much larger and
more fragile hacks elsewhere.

In the context of supporting not only ifuncs but also DTrace, what kind of features would you want to add to lld, if you had a chance to implement them in the linker instead of a post-processing tool? I wonder if we can solve both of your problems.

I will try to explain what dtrace -G does first. It is required to
support USDT, which is a dtrace feature that allows one to define static
tracepoints in a usermode application. In C, the tracepoints look like
ordinary function calls (with programmer-defined parameters); dtrace
allows one to "tap" those tracepoints by overwriting them with a
breakpoint at runtime. At build time, dtrace -G processes each
relocatable object file, and outputs an object file which must be linked
into the final output. When processing the input object files, dtrace
looks for relocations against undefined symbols prefixed by "__dtrace_";
each such relocation is a tracepoint. It overwrites the corresponding
call with nops, changes the relocation type to R_[*]_NONE, and records
the tracepoint address, along with other information, in a metadata
section named ".SUNW_dof". This metadata section is placed in the
output object file, which contains a constructor that registers the DOF
section with the kernel during application startup.
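
The per-object pass described above can be sketched roughly as follows.
This is only an illustration under simplifying assumptions, not what
dt_link.c actually does: it assumes x86, a 5-byte call instruction whose
32-bit operand is the relocated field, and a hypothetical Reloc record
that already carries the symbol name. Building the DOF section is
omitted; the function just returns the recorded call-site offsets.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

constexpr uint32_t kRelocNone = 0; // R_386_NONE and R_X86_64_NONE are both 0
constexpr uint8_t kNop = 0x90;     // x86 one-byte NOP

// Hypothetical, simplified relocation record.
struct Reloc {
  uint64_t Offset;     // offset of the call's 32-bit operand in the text
  std::string SymName; // name of the referenced symbol
  bool SymDefined;     // false for undefined symbols
  uint32_t Type;       // relocation type
};

// For each relocation against an undefined "__dtrace_" symbol: overwrite
// the call with NOPs, neutralize the relocation, and record the call site.
std::vector<uint64_t> neutralizeProbes(std::vector<uint8_t> &Text,
                                       std::vector<Reloc> &Relocs) {
  std::vector<uint64_t> Sites;
  for (Reloc &R : Relocs) {
    if (R.SymDefined || R.SymName.rfind("__dtrace_", 0) != 0)
      continue; // not a tracepoint
    assert(R.Offset >= 1 && R.Offset + 4 <= Text.size());
    uint64_t Call = R.Offset - 1; // opcode byte precedes the operand
    for (uint64_t I = 0; I < 5; ++I)
      Text[Call + I] = kNop;
    R.Type = kRelocNone;
    Sites.push_back(Call);
  }
  return Sites;
}
```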

To me this functionality seems closely related to the -zifunc-noplt
option even if the details differ. In both cases we really just
want the static linker to leave a certain set of relocations alone, and
pass them through to the output file. (In the case of dtrace, the
symbols referenced by the relocations are undefined, and in the case of
ifunc-noplt the symbols are defined.) A third item on my wishlist is
a way to dynamically disable retpolines during boot; having the set of
retpoline thunk calls available as relocations would make that
achievable in principle.

I haven't thought very much about how to extend lld to support my dtrace
use-case - the ifunc problem is simpler and more self-contained so I
started there.

Thank you very much for your explanation. That’s very helpful.

It looks like even though DTrace and what you are trying to do with your patch are conceptually similar, they are quite different in details, so I can’t think of a feature that can be used for both without depending on other post-processing tools. Maybe it is better to create each feature directly instead of creating something too generic.

I think my concern about this patch is the cost of maintenance. The feature will have a very small number of users (perhaps only FreeBSD), so the cost of maintenance is relatively high even though the feature is tiny. I think I’m fine to accept this patch if you explicitly mark this as an experimental feature that might be removed in a future release of lld if we find a better solution.

> Thank you very much for your explanation. That's very helpful.
>
> It looks like even though DTrace and what you are trying to do with your
> patch are conceptually similar, they are quite different in details, so I
> can't think of a feature that can be used for both without depending on
> other post-processing tools. Maybe it is better to create each feature
> directly instead of creating something too generic.

I do agree at this point that these features are probably too different
to be implemented in a generic way. They would all benefit from some
support in the static linker even if some amount of post-processing is
necessary in the end. I'm not sure what kind of design Rafael had in
mind for dtrace, if any.

> I think my concern about this patch is the cost of maintenance. The
> feature will have a very small number of users (perhaps only FreeBSD), so
> the cost of maintenance is relatively high even though the feature is
> tiny. I think I'm fine to accept this patch if you explicitly mark this as
> an experimental feature that might be removed in a future release of lld
> if we find a better solution.

Thanks. I'll work on writing some tests and polishing the existing
patch.