Minimizing the difference between `opt -OX` and `opt -passes="..."`

I observe a fairly large difference in binary size between running the pipeline printed by `opt -OX -print-pipeline-passes` through `opt -passes="..."` and running `opt -OX` directly.

I’d like to reduce the difference between `opt -passes="..."` and `opt -OX`, in particular for Oz.
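Concretely, the comparison is between invocations roughly like the following (file names are placeholders and the printed pipeline string is elided):

# Print the textual pipeline corresponding to -Oz.
opt -Oz -print-pipeline-passes -disable-output input.ll

# Re-run that printed pipeline explicitly ("dumped_Oz").
opt -passes='<pipeline string printed above>' input.ll -o dumped_Oz.bc

# Run -Oz directly ("Oz") and compare the two outputs.
opt -Oz input.ll -o Oz.bc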

I collected some data on the matter and summarized it in the following plots, where dumped_OX is `opt -passes="..."` and OX is `opt -OX`.

For Oz and Os, we can see a trend of dumped_OX being larger than OX.

Unfortunately, I’m a beginner so I need some help/pointers.

  • Do you know why the pipelines could be different? (missing pass parameters?)
  • Are there some fundamental limits in how close we could get?
  • How much effort do you think this requires?
  • Which files should I have a look at? I know a bit about PassBuilder.cpp and PassBuilderPipelines.cpp but I have no clue what the issue could be.

Thanks!

(I’ll try to reduce some cases and see where the IR starts to differ.)

(Edit: I made a mistake in the initial histograms but they should be ok now.)


I doubt there is a conceptual reason we could not match exactly. It is different if you use clang vs opt, but opt vs opt should be the same, assuming our printing of the pipeline and reading of the pipeline are extended as needed.

I am guessing, but I would check how many passes you actually execute. I could imagine that OX schedules passes on demand, e.g., as part of the inliner loop or if devirtualization happens. Such conditional execution is probably not “encoded” in the pass pipeline.

I compared the passes that are run by running `opt ... -print-changed` and grepping for `*** IR Dump After` on one file.

Pass count for -Oz:
5695
Pass count for dumped_Oz:
5698

The three additional passes in dumped_Oz come from opt always inserting a verify at the beginning and a PrintModulePass plus verify at the end. So the number of passes run is the same in this case.
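A minimal sketch of how this counting and diffing can be done (file names are illustrative; the `-print-changed` dumps go to stderr):

opt -Oz -print-changed -disable-output input.ll 2> Oz.trace
opt -passes='<dumped Oz pipeline>' -print-changed -disable-output input.ll 2> dumped_Oz.trace

# Count how many passes printed a dump header in each run.
grep -c '\*\*\* IR Dump After' Oz.trace
grep -c '\*\*\* IR Dump After' dumped_Oz.trace

# Find where the two traces start to diverge.
diff -u Oz.trace dumped_Oz.trace | less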

Looking at the diff of the two runs, the pipelines diverge after ~1000 dumps, when a speculative-execution pass does not change the IR in Oz but does in dumped_Oz.
After that, there are differences in which LICM ‘triggers’ first in the part of the pipeline that runs [...]loop-simplifycfg,licm,loop-rotate,licm[...].
Following that, there are some InstCombine, SimplifyCFG, LoopUnroll and Inliner passes that don’t change the IR in Oz but do in dumped_Oz.

I’ll try to expand the search to more files later today.

My previous test was with opt 14.0.6.
I moved to trunk, and now there’s a difference in how many passes are printed, at least for the (different) file I’m testing.

I reduced the file such that the resulting IR and the passes that run (how many, or which ones) differ between dumped_Oz and Oz.
This is the reduced file for one case.

define void @a(i64 %0, i64 %1, i64* %2, i64* %3, i64 %4, i64 %5, i32 %6) {
  %8 = alloca i64, i32 0, align 8
  %9 = alloca i64, i32 0, align 8
  store i64 0, i64* %2, align 8
  store i64 %0, i64* %2, align 8
  %10 = load i64, i64* null, align 8
  %11 = sub i64 0, 1
  %12 = load i64, i64* %2, align 8
  %13 = mul i64 1, %0
  %14 = trunc i64 %0 to i32
  call void @b(i32 0, i32 %6)
  ret void
}

define internal void @b(i32 %0, i32 %1) {
  %3 = alloca i32, i32 0, align 4
  %4 = alloca i32, i32 0, align 4
  %5 = alloca i32, i32 0, align 4
  store i32 %0, i32* %3, align 4
  store i32 %0, i32* %4, align 4
  %6 = load i32, i32* undef, align 4
  %7 = load i32, i32* %3, align 4
  %8 = icmp sgt i32 1, %0
  br i1 %8, label %9, label %14

9:                                                ; preds = %2
  %10 = load i32, i32* %4, align 4
  store i32 0, i32* %5, align 4
  br label %11

11:                                               ; preds = %9
  br label %12

12:                                               ; preds = %11
  call void @c()
  call void @b(i32 1, i32 1)
  %13 = load i32, i32* %5, align 4
  call void @b(i32 %1, i32 undef)
  br label %14

14:                                               ; preds = %12, %2
  ret void
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #0

declare void @c()

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #0

attributes #0 = { argmemonly nofree nosync nounwind willreturn }
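For reproduction, the comparison below boils down to something like this (the dumped Oz pipeline string is elided; -S emits textual IR):

opt -Oz -S reduced.ll -o reduced_Oz.ll
opt -passes='<dumped Oz pipeline>' -S reduced.ll -o reduced_dumped_Oz.ll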

Resulting IR for Oz:

define void @a(i64 %0, i64 %1, ptr nocapture writeonly %2, ptr nocapture readnone %3, i64 %4, i64 %5, i32 %6) local_unnamed_addr {
  store i64 %0, ptr %2, align 8
  tail call void @c()
  %8 = icmp slt i32 %6, 1
  br i1 %8, label %tailrecurse.i, label %b.exit

tailrecurse.i:                                    ; preds = %7, %tailrecurse.i
  tail call void @c()
  br label %tailrecurse.i, !llvm.loop !0

b.exit:                                           ; preds = %7
  ret void
}

declare void @c() local_unnamed_addr

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.peeled.count", i32 1}

Resulting IR for dumped_Oz:

define void @a(i64 %0, i64 %1, ptr nocapture writeonly %2, ptr nocapture readnone %3, i64 %4, i64 %5, i32 %6) local_unnamed_addr {
  store i64 %0, ptr %2, align 8
  %8 = icmp slt i32 %6, 1
  br label %tailrecurse.i

tailrecurse.i:                                    ; preds = %tailrecurse.i, %7
  tail call void @c()
  br i1 %8, label %tailrecurse.i, label %b.exit

b.exit:                                           ; preds = %tailrecurse.i
  ret void
}

declare void @c() local_unnamed_addr

In the diff of the two pass traces, after ~45 printed passes, a loop-rotate pass changes nothing in Oz but does change the IR in dumped_Oz.
The additional passes that are run by Oz are loop-instsimplify,loop-simplifycfg,licm<no-allowspeculation>,loop-rotate,licm<allowspeculation>. I think at this point Oz has a loop left that dumped_Oz does not. As these additional passes are part of the pipeline, I don’t think Oz runs any dynamically scheduled passes in this case, unless such passes simply don’t print anything when they are executed.

Is there a way to dump or inspect the state of opt when doing an optimization?
Or print the configurations/settings/heuristics of individual passes?
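(For context, the generic printing/debugging flags I know of show the IR around passes but not a pass’s internal configuration; a minimal sketch, assuming an assertions build for `-debug-only`:)

# Dump the IR after a specific pass, at module scope.
opt -Oz -print-after=loop-rotate -print-module-scope -disable-output input.ll

# Dump the IR after every pass (very verbose).
opt -Oz -print-after-all -disable-output input.ll

# Pass-internal debug output (needs an assertions build).
opt -Oz -debug-only=loop-rotate -disable-output input.ll

# Per-pass statistics counters.
opt -Oz -stats -disable-output input.ll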

I’ve heard of the remarks/opt-viewer.py but have not used them so far. I’ll check them next.

I had a look at opt-viewer.py but it seems only a few passes emit remarks. I implemented one or two new remarks but they weren’t really helpful.
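For anyone following along, remarks can be collected from opt roughly like this (output names are arbitrary):

# Emit optimization remarks as YAML and print matching ones on stderr.
opt -Oz -pass-remarks='.*' -pass-remarks-missed='.*' \
    -pass-remarks-output=remarks.yaml -disable-output input.ll

# Then render them with opt-viewer.py (from llvm/tools/opt-viewer).
opt-viewer.py --output-dir=remarks_html remarks.yaml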

Another reduction, in C this time, gave me the following testcase:

int a;
void b(){
  for (;a;a++){}
}
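To feed this to opt, one way to get unoptimized IR is something like the following (`-disable-O0-optnone` keeps the -O0 output from being marked optnone, which would otherwise block the pipeline):

clang -O0 -Xclang -disable-O0-optnone -S -emit-llvm test.c -o test.ll
opt -Oz -S test.ll -o test_Oz.ll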

For this case, all passes run are the same, but some change the IR in one configuration and not in the other. Again, the first difference was in loop-rotate.
I found that LoopRotate has options that are not exposed to the textual -passes interface. Exposing them fixes the difference for this testcase.
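Concretely, the printed pipeline currently contains a bare loop-rotate entry, which re-parses with the default options; with the parameters exposed, the entry could carry the Oz settings, roughly like this (the surrounding pipeline is elided and the parameter spelling is illustrative, pending review):

# Before: the printed pipeline re-parses with LoopRotate's default options.
opt -passes='...,loop-rotate,...' -S test.ll

# After exposing the parameters (illustrative spelling):
opt -passes='...,loop-rotate<no-header-duplication>,...' -S test.ll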

I’ll try to contribute the changes and search for more differences.


I created a diff here: D153437 [opt] Exposing the parameters of LoopRotate to the -passes interface
Feedback is very much appreciated :smile: