So I started doing performance analysis again, and we've slowed down
quite a bit. My current test is statically linking clang for Linux on
Windows. I currently care mostly about Windows performance as that's
where we run it.
Here's a rough breakdown of time usage (doesn't add up to 100% because
of rounding):
8.8% - fs::get_magic from the driver.
0.8% - Reading the files on the command line.
29% - Resolver. ~90% of this is reading objects out of archives.
This can be parallelized, and I have an outdated patch which does this.
51% - Passes. Mostly the layout pass. And in the layout pass it's
mostly due to cache misses. I've already tried parallelizing the sort,
and it doesn't help much.
9% - Writer. Most of this is prep work. The actual writing to disk
and applying relocations take very little time.
1% - Unaccounted for.
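To make the "reading objects out of archives can be parallelized" point concrete, here is a minimal sketch of one way it might look. This is not the outdated patch mentioned above and not lld's API: `Member`, `ParsedObject`, `parseMember`, and `parseAll` are hypothetical names, and the "parse" is a placeholder. It assumes members can be parsed independently and that results must come back in archive order so resolution stays deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins for an archive member and its parsed form.
struct Member { std::string name; std::string bytes; };
struct ParsedObject { std::string name; std::size_t size; };

// Placeholder for the expensive per-member work (assumed thread-safe).
ParsedObject parseMember(const Member &m) {
  return ParsedObject{m.name, m.bytes.size()};
}

// Parse all members on a small fixed pool of threads. Each worker claims
// the next unparsed index, and results are stored by index, so the output
// order matches the archive order regardless of scheduling.
std::vector<ParsedObject> parseAll(const std::vector<Member> &members,
                                   unsigned workers = 4) {
  if (workers == 0)
    workers = 1;
  std::vector<ParsedObject> out(members.size());
  std::atomic<std::size_t> next{0};
  auto work = [&] {
    for (std::size_t i = next.fetch_add(1); i < members.size();
         i = next.fetch_add(1))
      out[i] = parseMember(members[i]); // each slot written by one thread
  };
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < workers; ++t)
    pool.emplace_back(work);
  for (std::thread &t : pool)
    t.join();
  assert(next.load() >= members.size()); // every index was claimed
  return out;
}
```

Whether this helps in practice depends on how much of the 29% is parsing versus I/O, which the breakdown above doesn't separate.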
I'm going to do some work to solve the get_magic and resolver issues
with threads. I think we really need to look into how the layout pass
is handled. If the cache effects are bad enough, we may actually need
to change to a non-virtual POD based interface for atoms. Meaning that
readers fill in atom data at the start, instead of figuring it out at
runtime.
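As a rough illustration of the non-virtual POD idea, the sketch below contrasts a virtual atom interface with a plain record that a reader fills in once. All names here (`VirtualAtom`, `AtomRecord`, `sortForLayout`) and the chosen fields and sort key are assumptions for illustration, not lld's actual atom model or layout order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical virtual-style interface: every query is an indirect call,
// and atoms usually live in separate allocations, so a sort over them
// chases pointers.
struct VirtualAtom {
  virtual ~VirtualAtom() = default;
  virtual std::uint64_t ordinal() const = 0;
  virtual std::uint32_t contentType() const = 0;
};

// Hypothetical POD alternative: the reader computes these fields up front
// and stores the records contiguously.
struct AtomRecord {
  std::uint64_t ordinal;     // position in the input file
  std::uint64_t size;        // size of the atom's content in bytes
  std::uint32_t contentType; // section-like grouping key (made up)
  std::uint32_t alignLog2;   // alignment as a power of two
};
static_assert(std::is_trivially_copyable<AtomRecord>::value &&
                  std::is_standard_layout<AtomRecord>::value,
              "AtomRecord is meant to stay a plain data record");

// A layout-style sort over contiguous records: comparisons read adjacent
// memory directly, with no virtual dispatch.
void sortForLayout(std::vector<AtomRecord> &atoms) {
  std::stable_sort(atoms.begin(), atoms.end(),
                   [](const AtomRecord &a, const AtomRecord &b) {
                     if (a.contentType != b.contentType)
                       return a.contentType < b.contentType;
                     return a.ordinal < b.ordinal;
                   });
}
```

The trade-off is that readers pay to fill in every field for every atom, including fields no pass ends up asking for.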
- Michael Spencer