Does anyone use llvm-exegesis? Feedback wanted

In all the time I’ve used it, I’ve found the measurement step to be excruciatingly,
unbearably slow. It takes many minutes for the whole-opcode sweep,
but worse yet, it doesn’t really spend all that time well.

  • --benchmark-phase=prepare-snippet is essentially instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2…5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.

It is expected and known that the codegen is slow.
It is expected that it scales linearly or worse with the instruction count.

We can parallelize it and achieve almost an order-of-magnitude improvement.
I’ve done that in D140271 ([NFCI][llvm-exegesis] Benchmark: parallelize codegen (5x ... 8x less wallclock)),
but the code owners do not think this is worthwhile.

If you are an llvm-exegesis user, please consider commenting with your thoughts on its performance.

Roman

cc @RKSimon @oontvoo @john-brawn-arm @atanasyan @jsji

llvm-exegesis is a niche tool, so I don’t expect many people to use it.

Myself, I’ve often used llvm-exegesis to help identify problems in the x86 scheduler models, and it’s been incredibly useful. It doesn’t quite match the quality of info from the likes of instlatx64 and uops.info (mainly because there’s so little target-specific handling), but that hasn’t been a showstopper so far - although I do think we need to up our game here.

I always get annoyed when llvm-mca (a tool a lot more people use, especially with its compiler-explorer integration) gets blamed for what is really an issue with a scheduler model :frowning:

I agree that when running across the entire opcode list, the runtime is too long, but I don’t see myself doing this stage very frequently - once the PFM mappings are in place for that model, and I’ve got my latency/uops/throughput captures, my time is spent iteratively using llvm-exegesis in analysis mode as I tweak the model, which isn’t so bad.

But, I can see us having to perform this costly codegen/assembly stage more in the future as I think we will need to start measuring different (x86 specific) variants of the same instruction (different source values, different registers, different addressing modes, etc…) properly.

D140271 proposes to multi-thread the codegen stage (and then run the measurement stage single-threaded) - we don’t do multi-threading very often in LLVM tools, and I think it shows in the size of the change necessary, and it unfortunately makes the code a lot trickier to grok.

This isn’t the first time that llvm-exegesis speed has been a problem (D52866 comes to mind…), and in the past we’ve been able to identify more specific optimizations (e.g. container types / poor string handling) - I think we should be investigating that first.

Yep, which is why I’ve found the feedback along the lines of “find someone else to comment” a bit off-putting…

Yep. And removing this massive performance gap makes it less of a problem to do more exploration.

Conceptually, the only thing it does is change

for (i : snippets) {
  preprocess(i);
  measure(i);
}

into

while (!snippets.empty()) {
  tmp = snippets[0:4];
  snippets = snippets[4:];
#pragma omp parallel for
  for (i : tmp)
    preprocess(i);
  for (i : tmp)
    measure(i);
}

I understand that the intersection between the people who know multi-threading,
people who write compilers, and llvm-exegesis developers may be small,
but that change is almost entirely idiomatic boilerplate. It really isn’t complex.
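
For concreteness, here’s roughly what that batching boils down to in plain C++ (a self-contained sketch using std::async, not the actual D140271 code; Snippet, preprocess(), measure() and runBatched() are placeholder names, matching the pseudocode above):

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

struct Snippet {};                 // placeholder for one generated snippet
void preprocess(Snippet &) {}      // codegen + assembly; independent per snippet
void measure(const Snippet &) {}   // perf-counter measurement; must stay serial

void runBatched(std::vector<Snippet> &Snippets, std::size_t BatchSize = 4) {
  for (std::size_t Begin = 0; Begin < Snippets.size(); Begin += BatchSize) {
    std::size_t End = std::min(Begin + BatchSize, Snippets.size());
    // Codegen the whole batch in parallel...
    std::vector<std::future<void>> Jobs;
    for (std::size_t I = Begin; I != End; ++I)
      Jobs.push_back(std::async(std::launch::async,
                                [&Snippets, I] { preprocess(Snippets[I]); }));
    for (auto &Job : Jobs)
      Job.wait();
    // ...then measure serially, exactly as before.
    for (std::size_t I = Begin; I != End; ++I)
      measure(Snippets[I]);
  }
}

The real patch naturally has more plumbing around this, but the control flow is the one shown.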

I’m quite certain the time is not being spent in llvm-exegesis itself, but in the actual LLVM codegen.
While there may be some improvements to be had there, the fix I’m proposing is the right way to solve this.

Here’s a faithful X-Ray trace (in chrome trace format) of

./bin/llvm-exegesis --mode=uops --benchmark-phase=assemble-measured-code --num-repetitions=100000 --opcode-name=VADDPSrr

exegesis-xray-ctf-trace.json.xz.txt (813.7 KB)
(rename to exegesis-xray-ctf-trace.json.xz, un-xz it, and e.g. upload to https://ui.perfetto.dev/)
As we can see, we spend all the time in llvm::exegesis::assembleToStream() (1.2 s in the example),
and there are 3 main places where it spends time:

  • DuplicateSnippetRepetitor::repeat() (0.2 s)
  • MachineVerifier (0.2 s)
  • llvm::AsmPrinter::emitFunctionBody() (0.8 s)

I’ve found the feedback along the lines of “find someone else to comment” a bit off-putting…

The goal of that comment is to help your patch move forward: two reviewers raised the same concern. I’m trying to see whether that concern can be alleviated by more people who might have more context to support the tradeoff being made towards more complexity and more speed.

I understand that the intersection between the people who know multi-threading,
people who write compilers, and llvm-exegesis developers may be small,
but that change is almost entirely idiomatic boilerplate. It really isn’t complex.

Multithreading is fine - especially in that embarrassingly parallel case. But I’d rather read the first loop than the second one - and all reviewers seem to agree on that. The discussion is not (yet) about the code; I first wanted us to agree that the performance improvement warrants the readability cost.

Let me describe my use case and why I’m not sure the speed of the codegen is that important: I’ve mostly used llvm-exegesis for three things:

  • Finding errors in LLVM scheduling models systematically.
  • Investigating scheduling models for one specific instruction.
  • Investigating the performance of specific basic blocks (using the “snippet” mode), when simulation is not enough.

The first case requires running codegen and benchmarking once, then running analysis a lot. Analysis speed is therefore important, but codegen is done only once, so it’s fine if it takes a few minutes.

In the second and third use cases, codegen runs in a fraction of a second, so speed is unimportant.

This is why I don’t see codegen speed as being particularly critical, but I’m happy to change my mind. Can you describe the use case where you are running a large amount of codegen repeatedly?

Note that you have not explained what your workflow is,
so I cannot begin to comment on that. Now, I do recall from
previous discussions that you essentially throw it onto a bunch
of servers, and fetch the results the next day.

I think that is an extreme outlier, a luxury that I’m afraid I do not have.
So far, in all of the instances I’ve had, I’ve always run it on a single machine,
the one I was using in the first place, and then just waited until it finished,
not doing anything in parallel.

I agree that currently, a single run of several minutes (let’s call it 5 min?)
is not that bad overall. But essentially, I fundamentally do not agree with the
assumption I just made in the previous line.

Firstly, I’ve always found that the results of a single run are not good. They are noisy,
and they may be incomplete – notice all the randomness in snippet generation.
So I always perform 10 runs. And boom, we are up to 50 minutes.
And then multiply that by the 3 measurement modes (latency/uops/inverse throughput),
and we are now up to 150 minutes already… Is that too slow yet?

Now, let’s make an observation: we do very little exploration.

  • We generally don’t see what the effect of choosing different registers is
  • We don’t try different rounding modes (AVX512)
  • We don’t try different masks (AVX512)
  • etc. etc. etc. (see @RKSimon’s acknowledgement)
    … and all that will make it even slower, non-linearly.

Sure, you generally don’t run the whole-opcode analysis very often, so in principle you could wait.
But who says one must wait 10 days if the results can be computed in a day?

Fundamentally, I do not agree that the readability decrease of that patch is big enough
to be a cause for worry, and conversely, I don’t think a patch reverting it would be reasonable,
given just how much slower it’d make the tool.

I’m not sure if this answered your question, but this is my position.

Roman.

Note that you have not explained what your workflow is,
so I cannot begin to comment on that.

I’ve tried to do so in the previous comment, but let me answer your additional questions.

Now, I do recall from previous discussions that you essentially throw it onto a bunch
of servers, and fetch the results the next day.

While I’m doing that for some other parts of my work, I don’t do that for this specific use case. That is because if I’m looking at improving SKX and SKL and it takes 5 minutes to run the tool, I only need to run on 2 machines for 5 minutes; I don’t need a bunch of servers overnight.

Now, let’s make an observation: we do very little exploration.

If we start doing more exploration, say we explore 10x more snippets, then I think that optimization starts making more sense. But then let’s start by doing that exploration, and optimize afterwards if that’s needed.

Firstly, I’ve always found that the results of a single run are not good. They are noisy,

I’ve found that there is some noise across runs coming from changes in register allocation (we sometimes trigger idioms), but apart from that I’ve found measurements for a single compiled snippet to be extremely reproducible. I think the noise coming from changes in register allocation advocates for more exploration.

they may be incomplete

Where is the incompleteness coming from? The snippet crashing during measurement?

So I always perform 10 runs. And boom, we are up to 50 minutes.
And then multiply that by the 3 measurement modes (latency/uops/inverse throughput),
and we are now up to 150 minutes already… Is that too slow yet?

OK, that is a new data point. It’s not a thing that I’ve had to do, but if running a bunch of times is required then that does change everything. I tend to be convinced by this argument. @gchatelet what do you think?

… and all that will make it even slower, non-linearly.

If more exploration leads to non-linearity, then it seems that parallelization is not the way to go. It’s an easy short-term fix, but it will add code complexity and it will likely not help much in the long term.

I think we need to be smarter in the way we explore the space; this includes the randomness bit that @LebedevRI is mentioning.

I believe we can improve quite a lot through a combination of precomputation, filtering and exploration strategies.

@LebedevRI I think I still didn’t quite grasp your use case. What is your use of the tool? Is it the same as @legrosbuffle’s or @RKSimon’s (i.e., checking the scheduler model for a particular CPU)?
Do you need to run through all the opcodes? With ~16K instructions on SKX, I can see a clear win from restricting exploration to a particular class (arithmetic / vector / feature-based, etc…)

You can do both. Everything that can be parallelized should be parallelized.

I’ve already kind of started going down that road with
https://reviews.llvm.org/D60066
https://reviews.llvm.org/D74156
https://reviews.llvm.org/D139283
… and I have a follow-up to do the same as D139283 for
the normal "instruction is parallel, repeating a random one." case.

I don’t recall specific examples, but in principle it’s most notable for latency measurement,
when you need helper instructions, because another patch of mine is still not accepted:
https://reviews.llvm.org/D60401

To be clear, I’m not making this up for the sake of the argument. I really do run it 10 times.

What @arsenm said. This argument is fundamentally invalid and I will not comment on it further.

Yes.

Hm, my apologies, it would appear that I have lied.
It, of course, does not take 150 minutes, but ~300 minutes - there are two repetition modes.

I would like to acknowledge that my argument was flawed and unhelpful. My apologies for that.

I think there is frustration on both sides and I would like to take a step back and reflect on the situation.
It is clear that we all want the tool to be useful and you’ve been contributing a lot to it over the years.

As it is today, the tool is quite complicated. The exploration is ad hoc and the search is not guided; it boils down to a lot of random sampling that falls short when it comes to exploring corner cases. On top of that, we are far from the level of exploration that we all want. Ultimately, I believe we need a more principled approach and that a collection of small improvements cannot get us where we want to be.

That is essentially why I push back on anything that adds more complexity to the tool. The patch we’re talking about is not horrible in that regard but it adds up.

That said, with the last numbers you shared it is clear that the patch helps with your workflow so I agree to move forward with it. I just wanted to be clearer on where I stand.

SG, we have an agreement in principle.

I want to note that I think that discussion could have been carried out in a softer tone. As a reviewer, I think it is my duty to correctly review the code that I’m presented with. That means pushing back when there are things I do not understand or am not convinced about. This does not mean that patches do not have merit, but simply that I need more data to reach a conclusion. I think the data has been provided now.

Thanks!

I think it’s quite apparent to anyone paying attention that my ability to contribute
and my ability to communicate do not correlate. I really don’t deal well with unreasonable feedback,
or being intentionally pushed out of a comfort zone, or antagonized, etc. etc. etc.
Which is why I try to keep any kind of communication that is not absolutely required
to a bare minimum.

I think it’s quite apparent to anyone paying attention that my ability to contribute
and my ability to communicate do not correlate.

I acknowledge that and I do value your contributions, but working on an open-source project - and code review in particular - unfortunately (?) requires communication.

The point I was trying to make is that I don’t think that a reviewer asking for more context for a patch qualifies as unreasonable feedback.

I was quite obviously enumerating in general.
It would have been better if that comment had been left without a reply.

My experience is that discussions and reviews go more smoothly when

  • comments are made about the work, not the person doing the work
  • comments are taken to be about the work, not the person doing the work
  • replies are about the work, not about the person one is replying to

I say this as someone who does not always succeed in these things, but I’m aware of the above principles and they do seem to help when I remember them.
