X86VZeroUpper optimization question.

Hi Bruno,

I’m looking at a test case where we’re failing to insert a vzeroupper between an instruction that dirties the YMM regs and a call that uses SSE regs. No test case yet - I’m still trying to reduce it to something sane. I can see where the logic in the X86VZeroUpper optimization goes off the rails, though: the entry state for the basic block is ST_UNKNOWN, and the optimization contains the following logic:

    if (CurState == ST_DIRTY) {
      // Only insert the VZEROUPPER in case the entry state isn't unknown.
      // When unknown, only compute the information within the block to have
      // it available in the exit if possible, but don't change the block.
      if (EntryState != ST_UNKNOWN) {
        BuildMI(BB, I, dl, TII->get(X86::VZEROUPPER));
        ++NumVZU;
      }
      // After the inserted VZEROUPPER the state becomes clean again, but
      // other YMM may appear before other subsequent calls or even before
      // the end of the BB.
      CurState = ST_CLEAN;
    }

If CurState == ST_DIRTY and EntryState == ST_UNKNOWN, then some instruction in this basic block has dirtied the YMM regs. In that case, why would you want to avoid putting a vzeroupper instruction in? Is it just to avoid inserting duplicate vzerouppers when the block is revisited?

If that’s the case, then I think the problem is actually in runOnMachineFunction, which contains the comment: “Each BB state depends on all predecessors, loop over until everything converges. (Once we converge, we can implicitly mark everything that is still ST_UNKNOWN as ST_CLEAN.)” We do iterate to convergence, but afterwards we neither mark anything as clean nor make a final visit to the basic blocks whose entry states were still ST_UNKNOWN. Is that an oversight?
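
To make the expected behaviour concrete, here is a rough, self-contained sketch of the two-phase scheme that the comment seems to describe. It is a toy model and not the actual pass: Block, Op, entryState, walkBlock and findVZeroUpperPoints are hypothetical names, instructions are reduced to "touches YMM", "call" and "other", and the merge rule for predecessor states is an assumption. Phase 1 only propagates exit states until nothing changes; phase 2 treats any entry state that is still ST_UNKNOWN as clean and makes one final visit per block to record where a vzeroupper would go.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

enum State { ST_UNKNOWN, ST_CLEAN, ST_DIRTY };
enum Op { OP_OTHER, OP_YMM, OP_CALL };  // toy stand-ins for real instructions

struct Block {
  std::vector<Op> Ops;
  std::vector<std::size_t> Preds;  // indices of predecessor blocks
  State Exit = ST_UNKNOWN;
};

using InsertPoint = std::pair<std::size_t, std::size_t>;  // (block, op index)

// Assumed merge rule: dirty if any predecessor exits dirty, otherwise unknown
// if any predecessor is still unknown, otherwise clean. A block without
// predecessors (the function entry) is assumed to start clean.
static State entryState(const std::vector<Block> &Blocks, const Block &BB) {
  State S = ST_CLEAN;
  for (std::size_t P : BB.Preds) {
    if (Blocks[P].Exit == ST_DIRTY)
      return ST_DIRTY;
    if (Blocks[P].Exit == ST_UNKNOWN)
      S = ST_UNKNOWN;
  }
  return S;
}

// Walk one block starting from state S and return its exit state. When Points
// is non-null, record every call that is reached while the state is dirty.
static State walkBlock(const Block &BB, std::size_t Idx, State S,
                       std::vector<InsertPoint> *Points) {
  for (std::size_t I = 0; I < BB.Ops.size(); ++I) {
    if (BB.Ops[I] == OP_YMM) {
      S = ST_DIRTY;
    } else if (BB.Ops[I] == OP_CALL && S == ST_DIRTY) {
      if (Points)
        Points->push_back({Idx, I});
      S = ST_CLEAN;  // a vzeroupper before the call cleans the state
    }
  }
  return S;
}

std::vector<InsertPoint> findVZeroUpperPoints(std::vector<Block> &Blocks) {
  // Phase 1: propagate exit states to a fixed point without changing blocks.
  for (bool Changed = true; Changed;) {
    Changed = false;
    for (std::size_t B = 0; B < Blocks.size(); ++B) {
      State Exit =
          walkBlock(Blocks[B], B, entryState(Blocks, Blocks[B]), nullptr);
      if (Exit != Blocks[B].Exit) {
        Blocks[B].Exit = Exit;
        Changed = true;
      }
    }
  }
  // Phase 2: anything still unknown is treated as clean, then each block is
  // visited once more to collect the insertion points.
  std::vector<InsertPoint> Points;
  for (std::size_t B = 0; B < Blocks.size(); ++B) {
    State Entry = entryState(Blocks, Blocks[B]);
    if (Entry == ST_UNKNOWN)
      Entry = ST_CLEAN;
    walkBlock(Blocks[B], B, Entry, &Points);
  }
  return Points;
}
```

In this model, a block that is reached only through a cycle of blocks that never touch YMM keeps an ST_UNKNOWN entry state throughout phase 1, yet a call that follows a YMM instruction inside that block is still recorded in phase 2, which is the case the quoted logic skips.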

Cheers,
Lang.

Hi all,

Upon further investigation I can confirm that the pass does miss some important cases, and visits some instructions more than it needs to. A rewrite is in the works and will be posted soonish.

Cheers,
Lang.