I was wondering how much overhead the virtual function calls to TargetInfo member functions add. TargetInfo handles platform-specific details, and we have target-specific subclasses of that class. The subclasses override functions defined in TargetInfo.
The TargetInfo member functions are called multiple times for each relocation, so the cost of the virtual function calls may be non-negligible. That was the motivation for the following test.
As a test, I removed all TargetInfo subclasses except the one for x86-64, moved all code from X86_64TargetInfo to TargetInfo, and removed the virtual keyword from TargetInfo.
The original LLD links itself (with debug info) in 7.499 seconds. The de-virtualized version did the same thing in 7.364 seconds, so de-virtualization improved link time by about 1.8%.
I’m just pointing out that there’s room to improve performance here; I’m not suggesting we do anything about it right now. We probably shouldn’t, because the current code is pretty straightforward. But I’d expect that we will eventually want to do something about it in the future.
I believe the relocation stuff that Rafael is currently working on will make this a non-issue (it will make relocation application much friendlier for the CPU).
However, even in the current scheme, since the target is fixed, all the indirect call sites should be monomorphic and so there shouldn’t be much branch-prediction cost (certainly nothing that would cause 1.8% performance delta for the entire link).
Notice that 1.8% is smaller than the performance variation from r263227, a very innocuous-looking change that nonetheless caused a ~2-3% slowdown for ScyllaDB (see the thread “LLD performance w.r.t. local symbols (and --build-id)”).
– Sean Silva
I believe the relocation stuff that Rafael is currently working on will
make this a non-issue (it will make relocation application much friendlier
for the CPU).
I don't think Rafael's patch would make this a non-issue. He's changing
scanRelocs to create data, which would reduce the number of calls to the
virtual functions, but it wouldn't reduce it to zero.
However, even in the current scheme, since the target is fixed, all the
indirect call sites should be monomorphic and so there shouldn't be much
branch-prediction cost (certainly nothing that would cause 1.8% performance
delta for the entire link).
Agreed. We could template the functions that call TargetInfo's member
functions over the target type to eliminate the virtual function calls.
Are you measuring with a build of LLD with LTO+PGO (+ICP+WholeDevirtualization…)?
We have compiler technologies to address these problems; I think we’d better leave it to them where they can handle it.
I didn't measure it with them, but can they devirtualize these function calls?
TargetInfo is a class that handles target-specific stuff. We create an
instance of one of its subclasses at startup and call its member functions.
For each linker invocation there is only one TargetInfo instantiated, but
the types of the TargetInfos for two linker invocations are not guaranteed
to be the same (if you are cross-linking a binary, it's different from the
default).
In order to devirtualize that, I think the compiler needs to deduce that we
create only one TargetInfo in a program and that it will never change once
it is created. Is that possible?
From what has been presented I would not conclude that virtual calls are
actually the problem (or a problem at all). A root-cause analysis is
necessary. As r263227 shows, the relocation application loop is very
sensitive to small changes.
One quick thing you may also want to try as a sanity check is inserting
nops in different places in the function. I suspect you'll find that the
performance swings (both speedups and slowdowns) from doing that are
similar in magnitude to what you are seeing. You may also want to try
editing the indirect call instruction to a direct call without otherwise
modifying the binary; if that reproduces the 1.8% speedup then it will be
convincing.
If you haven't read it, I think you would enjoy this paper:
-- Sean Silva
Honestly I was somewhat skeptical about what you wrote here, but I observed
a 0.4% *slowdown* when I used gcc to compile it, so it looks like I was
wrong. It is possible that devirtualization was effective for
clang-generated code, but it is more likely that my result came from some
performance deviation caused by some other factor.
The relocation-handling loop is a really tight loop and therefore sensitive
to small changes. How can we optimize this? Maybe PGO?
Rafael's change will fix it. That is why he is doing it in the first place.
(The idea came up when we were crunching the numbers for "LLD performance
w.r.t. local symbols (and --build-id)" and looked at r263227. I suggested
that this looked like the loop body getting long enough to exhaust the
CPU's reorder buffer while waiting on cache misses, preventing it from
seeing the memory accesses of the next iteration and thus failing to
pipeline the memory accesses across iterations. Small changes in
scheduling, instruction count, etc. will tickle this and cause large
performance swings. The solution is to make the relocation application
loop tighter, especially by separating some of the stuff that we currently
have inside one huge loop into separate loops, but also by making the
loops themselves tighter.)
-- Sean Silva
Well, as much as I'm now skeptical about my own result, I'm also skeptical
that that would really fix it, but maybe we should just wait and measure
once Rafael's patch is ready.
I should hopefully have it ready tomorrow.
Cheers,
Rafael