AutoFDO sample profiles vs. SelectInst

I am looking for advice on a problem observed with -fprofile-sample-use
for sample profiles built with the AutoFDO tool.

I took the “hmmer” benchmark out of SPEC2006. It is initially compiled as:

clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

This baseline binary runs in about 164.2 seconds as reported by “perf stat”.

We build a sample file from this program using the AutoFDO tool “create_llvm_prof”:

perf record -b hmmer nph3.hmm swiss41wa
create_llvm_prof -out hmmer.llvm …

and rebuild the binary using this profile:

clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

Now, sadly, this program runs in 231.2 seconds.

The problem is that when a short conditional block is converted to a
SelectInst, we are unable to accurately recover the branch frequencies
from samples, since there is no actual branching. When we then compile
in the presence of the sample profile, the “CodeGen Prepare” phase
examines the profile data and undoes the select conversion, with
disastrous results.
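
For illustration, the kind of source pattern involved looks roughly like this (a made-up example, not taken from hmmer):

  // Illustrative only: a short conditional block of this shape is typically
  // turned into a select (and ultimately a cmov) at -O3, so the sampled
  // binary carries no taken/not-taken information for the original branch.
  int clamp(int x, int limit) {
    int result = x;
    if (x > limit)     // short "then" block: one cheap, side-effect-free assignment
      result = limit;  // folded into: result = (x > limit) ? limit : x
    return result;
  }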

If we compile at -O0 for training and then use that profile, which now
has accurate branch weights, the optimized program runs in 149.5
seconds. Unfortunately, of course, the training run then takes 501.4
seconds.

Alternatively, if we disable the select conversion performed by
SpeculativelyExecuteBB in SimplifyCFG.cpp so that the original control
flow is visible to sampling, the training program runs in 229.7 seconds
and the optimized program runs in 151.5 seconds, so we recover
essentially all of the lost information.

Of course, both of these options are unfortunate because they alter the
workflow: it would be preferable to be able to monitor production
binaries and feed those samples back into production builds. That
suggests removing the use of profile data in the CodeGen Prepare
phase. When that change is made, and we sample the baseline -O3
binary, the resulting optimized binary runs in 158.9 seconds.

That result is at least slightly better than baseline, rather than much
worse, but we are leaving 2-3% on the table. Maybe that is a reasonable
trade-off for being able to work with only production builds.

Any advice or suggestions?
Thanks
david

+dehao.

There are two potential problems:

  1. The branch gets eliminated in the binary that is being profiled, so there is no profile data.

  2. The select instruction is lowered into a branch, but the branch profile data is not annotated back to the select instruction.

Problem 2 is something that can be improved in SampleFDO; a sketch of what that annotation looks like at the IR level follows below.
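
For concreteness, here is a minimal sketch (mine, not existing SampleFDO code) of attaching recovered weights back onto a select at the IR level; it uses the same !prof branch_weights form SimplifyCFG emits when it folds a weighted branch:

  // Sketch only: attach !prof branch_weights metadata to a SelectInst so
  // later passes (e.g. CodeGen Prepare) can see measured true/false counts.
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/MDBuilder.h"
  #include <cstdint>
  using namespace llvm;

  static void annotateSelectWeights(SelectInst *SI, uint32_t TrueWeight,
                                    uint32_t FalseWeight) {
    MDBuilder MDB(SI->getContext());
    SI->setMetadata(LLVMContext::MD_prof,
                    MDB.createBranchWeights(TrueWeight, FalseWeight));
  }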

I filed two bugs:

https://llvm.org/bugs/show_bug.cgi?id=28990

https://llvm.org/bugs/show_bug.cgi?id=28991

They appear different but may be related.

This seems like a fundamental problem for PGO. Maybe it is also responsible
for this bug: https://llvm.org/bugs/show_bug.cgi?id=27359 ?

Should we limit select optimizations in IR for a PGO-training build? Or
should there be a 'select smasher' pass later in the pipeline that turns
selects into branches for a PGO-training build? (I don't have a good
understanding of PGO, so I'm just throwing out ideas...maybe a better
question is: how do other compilers handle this?)
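
For what it is worth, a 'select smasher' along those lines could be a small IR pass that rewrites each scalar select as an explicit diamond (conditional branch plus phi) so the training binary has real branches to sample. The sketch below is illustrative only -- the pass name and structure are made up, it assumes the legacy pass manager, and a real version would be enabled only for the training build:

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/Pass.h"
  using namespace llvm;

  namespace {
  struct SelectSmasher : public FunctionPass {
    static char ID;
    SelectSmasher() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
      // Collect scalar selects first so rewriting does not invalidate the
      // block iterators.
      SmallVector<SelectInst *, 16> Worklist;
      for (BasicBlock &BB : F)
        for (Instruction &I : BB)
          if (auto *SI = dyn_cast<SelectInst>(&I))
            if (SI->getCondition()->getType()->isIntegerTy(1))
              Worklist.push_back(SI);

      for (SelectInst *SI : Worklist) {
        BasicBlock *Head = SI->getParent();
        // Everything from the select onward moves to a new "tail" block.
        BasicBlock *Tail = Head->splitBasicBlock(SI, "select.end");
        // Replace the unconditional branch left by splitBasicBlock with a
        // real conditional branch through two trivial blocks.
        Head->getTerminator()->eraseFromParent();
        LLVMContext &Ctx = F.getContext();
        BasicBlock *TrueBB = BasicBlock::Create(Ctx, "select.true", &F, Tail);
        BasicBlock *FalseBB = BasicBlock::Create(Ctx, "select.false", &F, Tail);
        BranchInst::Create(Tail, TrueBB);
        BranchInst::Create(Tail, FalseBB);
        BranchInst::Create(TrueBB, FalseBB, SI->getCondition(), Head);
        // Merge the two values with a PHI and drop the select.
        PHINode *Phi = PHINode::Create(SI->getType(), 2, "select.phi", SI);
        Phi->addIncoming(SI->getTrueValue(), TrueBB);
        Phi->addIncoming(SI->getFalseValue(), FalseBB);
        SI->replaceAllUsesWith(Phi);
        SI->eraseFromParent();
      }
      return !Worklist.empty();
    }
  };
  } // namespace

  char SelectSmasher::ID = 0;
  static RegisterPass<SelectSmasher>
      X("smash-selects", "Rewrite selects as branches for PGO training");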

I agree, this is a fundamental problem with how AutoFDO maps addresses to statements.

I have an experimental build where, rather than turning off certain optimizations, I change the DebugLoc information when we hoist instructions into a new execution context. That avoids the problem of wrong branch weights but means some branch weights are inferred rather than measured.
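
Concretely, the experiment amounts to something like the following wherever an instruction is speculated out of its conditional block (a sketch only; the helper name is made up, and the real change sits inside SimplifyCFG's hoisting code):

  #include "llvm/IR/DebugLoc.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Sketch: when an instruction is hoisted into the predecessor block,
  // clear its DebugLoc so a sample that lands on it is not attributed to
  // the conditional block it originally came from. The cost is that the
  // corresponding branch weight must then be inferred rather than measured.
  static void speculateWithNeutralLocation(Instruction *I, Instruction *InsertPt) {
    I->moveBefore(InsertPt);     // hoist into the new execution context
    I->setDebugLoc(DebugLoc());  // "unknown" location instead of a misleading one
  }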

Perhaps, rather than limiting optimizations, we could have a variant of ‘-g’ that tolerates this kind of change.

For instrumentation based PGO (IR-based), this is a known problem. I have a
solution for it and will send out patches soon. Before that, there will be
more changes in LLVM to make sure profile data associated with selectInst
is well preserved.

Thanks,

David

The problem reported by David here is for Sample FDO/PGO, not instrumentation based PGO.

David

Sounds great. Let me know if I can help without getting in your way. If
there's more like https://reviews.llvm.org/D23590 , I can try to fix them
up in parallel.

On a related note, I want to ask about profile-guided inlining. It does not
seem to exist after https://reviews.llvm.org/D16381 was reverted. Is there
a plan to bring it back independently of the new pass manager?

Profile-guided inlining was the original motivation for the test case in
https://llvm.org/bugs/show_bug.cgi?id=28964 . But I think we'll miss this
case if we fix SimplifyCFG to produce a 'select' before fixing IR-based PGO
(and making inlining work again)?

Sounds great. Let me know if I can help without getting in your way. If
there's more like https://reviews.llvm.org/D23590 , I can try to fix them
up in parallel.

Yes, there might be more missing cases -- I only did a manual audit, so
there is no guarantee it is exhaustive. Other passes may be dropping
profile data too. What we need is to introduce a verification pass (that
can be inserted after any given pass, just like IR dumping) to do sanity
checking: if any branch/selectInst/switchInst has dropped its branch
profile data (with PGO on), emit a warning. For machine instruction
passes, we need something similar. More elaborate checks can also be
added in the future to verify the integrity of the profile data -- passes
such as jump threading and switch lowering need complicated updates, and
things can go wrong there.
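
A rough sketch of what such a verification pass could look like (against the legacy pass manager; the pass name and diagnostic wording are made up, and a real version would be gated on whether profile data is actually expected):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/Pass.h"
  #include "llvm/Support/raw_ostream.h"
  using namespace llvm;

  namespace {
  // Warn about conditional branches, selects, and switches that carry no
  // !prof metadata. Intended to be inserted after any pass of interest.
  struct ProfMetadataVerifier : public FunctionPass {
    static char ID;
    ProfMetadataVerifier() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
      for (BasicBlock &BB : F)
        for (Instruction &I : BB) {
          bool NeedsWeights = false;
          if (auto *BI = dyn_cast<BranchInst>(&I))
            NeedsWeights = BI->isConditional();
          else if (isa<SelectInst>(&I) || isa<SwitchInst>(&I))
            NeedsWeights = true;
          if (NeedsWeights && !I.getMetadata(LLVMContext::MD_prof))
            errs() << "warning: missing !prof metadata in " << F.getName()
                   << ": " << I << "\n";
        }
      return false; // analysis only; the IR is not modified
    }
  };
  } // namespace

  char ProfMetadataVerifier::ID = 0;
  static RegisterPass<ProfMetadataVerifier>
      X("verify-prof-metadata",
        "Warn on branches/selects/switches without branch weights");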

If you can help with that (verification), that would be great. :-)

On a related note, I want to ask about profile-guided inlining. It does
not seem to exist after https://reviews.llvm.org/D16381 was reverted. Is
there a plan to bring it back independently of the new pass manager?

Yes, there is a plan. We will update the status early next month.

Profile-guided inlining was the original motivation for the test case in
https://llvm.org/bugs/show_bug.cgi?id=28964 . But I think we'll miss this
case if we fix SimplifyCFG to produce a 'select' before fixing IR-based PGO
(and making inlining work again)?

It should not. After the selectInst is produced, the call will be in the
block outside the original if-then-else, so its profile data/hotness
should not be affected.

David

If AutoFDO is using debug info to map instructions back to source locations, then the problem is that debug info (at least DWARF) cannot describe the situation where one instruction in effect comes from two different places. This is a long-standing problem for debugging optimized code, and I guess it would affect AutoFDO as well. We can either assign the instruction to one of the source locations (which is a lie part of the time) or we can say we don’t know where it comes from (which is kind of always true as the origin is ambiguous). I know that the debugging experience is not great the way things are now, but I don’t know whether it would be better if we started saying we don’t know where the code comes from.

There are a variety of optimizations that would have to address this: branch folding, various combines, tail merging, probably more. Right now I think they all just pick one somewhat arbitrarily.

--paulr