Note that you have not explained what your workflow is,
so I cannot begin to comment on that. Now, I do recall from
previous discussions that you essentially throw it onto a bunch
of servers and fetch the results the next day.
I think that is an extreme outlier, a luxury that I’m afraid I do not have.
So far, in all of the instances I’ve had, I’ve always run it on the single
machine I was using in the first place, and then just waited until it
finished, not doing anything in parallel.
I agree that currently, a single run of several minutes (let’s call it 5 min?)
is not that bad overall. But fundamentally, I do not agree with the
assumption I just made in the previous sentence.
Firstly, I’ve always found that the results of a single run are not good.
They are noisy, and they may be incomplete – notice all the randomness in
snippet generation. So I always perform 10 runs. And boom, we are up to
50 minutes.
And then multiply that by the 3 measurement modes (latency/uops/inverse throughput),
and we are now up to 150 minutes already… Is that too slow yet?
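To make that arithmetic concrete, here is a minimal back-of-the-envelope
sketch; the 5-minute single-run figure, the 10 runs, and the 3 modes are the
assumptions stated above:

```python
# Back-of-the-envelope cost of one full measurement campaign.
# Assumptions (from above): a single run takes ~5 minutes, 10 runs are
# needed to average out the noise, and there are 3 measurement modes.
SINGLE_RUN_MIN = 5
RUNS_PER_MODE = 10
MODES = ("latency", "uops", "inverse_throughput")

per_mode_min = SINGLE_RUN_MIN * RUNS_PER_MODE   # 50 minutes
total_min = per_mode_min * len(MODES)           # 150 minutes

print(f"{per_mode_min} min per mode, "
      f"{total_min} min total ({total_min / 60:.1f} hours)")
```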
Now, let’s make an observation: we do very little exploration.
- We generally don’t see what the effect of choosing different registers is
- We don’t try different rounding modes (AVX512)
- We don’t try different masks (AVX512)
- etc etc etc (see @RKSimon’s acknowledgement)
… and all that will make it even slower, non-linearly (see the sketch below).
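To illustrate the non-linear part, here is a minimal sketch of how the
exploration axes multiply; the per-axis counts are made-up placeholders for
illustration, not actual ISA enumerations:

```python
from math import prod

# Each exploration axis multiplies the number of snippet variants to
# measure. These counts are illustrative placeholders only.
dimensions = {
    "register choices": 4,   # a handful of distinct register assignments
    "rounding modes": 5,     # AVX512 embedded-rounding variants
    "mask variants": 3,      # AVX512 masking: none / merging / zeroing
}

variants_per_opcode = prod(dimensions.values())  # 4 * 5 * 3 = 60

# Scaling the 150-minute baseline from above by the variant count:
baseline_min = 150
total_min = baseline_min * variants_per_opcode
print(f"{variants_per_opcode} variants -> ~{total_min} min "
      f"({total_min / 60 / 24:.2f} days)")
```

Even with these modest made-up counts, the 150-minute baseline balloons to
roughly a week of wall-clock time, which is why the next question is not
hypothetical.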
Sure, you generally don’t run the whole-opcode analysis very often, so in principle you could wait.
But who says one must wait 10 days if the results can be computed in a day?
Fundamentally, I do not agree that the readability decrease of that patch is big enough
to be a cause for worry, and conversely, I don’t think a revert patch would be
reasonable, given just how much slower it’d make the tool.
I’m not sure if this answered your question, but this is my position.
Roman.