Potential missed optimisation with SEH funclets

I’ve been experimenting with SEH handling in LLVM, and it seems like the unwind funclets generated by LLVM are much larger than those generated by Microsoft’s CL compiler.

I used the following code as a test:

void test() {
    MyClass x;
    externalFunction();
}

Compiling with CL, the unwind funclet that destroys ‘x’ is just two lines of asm:

lea rcx, QWORD PTR x$[rdx]
jmp ??1MyClass@@QEAA@XZ

However, when compiling with clang-cl, it sets up an entire function frame just for the destructor call:

mov qword ptr [rsp + 16], rdx

push rbp
.seh_pushreg 5
sub rsp, 32
.seh_stackalloc 32
lea rbp, [rdx + 48]
.seh_endprologue
lea rcx, [rbp - 16]
call "??1MyClass@@QEAA@XZ"
nop

add rsp, 32
pop rbp
ret

Both were compiled with “/c /O2 /MD /EHsc”

Is LLVM missing a major optimisation here?

Yes, not much effort has been applied to optimizing Windows exception handling. We were primarily concerned with making it correct, and improving it hasn’t been a priority. You can follow the code path through X86FrameLowering::emitPrologue with IsFunclet=true and see that it mechanically emits all the extra instructions mentioned above without any logic to skip such steps when not necessary.

However, while the mid-level representation we chose makes it hard to write these kinds of micro-level code quality optimizations, it allows the optimizers to do a variety of fancy things, such as heap-to-stack promotion on unique_ptr in the presence of exceptional control flow.

A quick skim of this code suggests that we are explicitly disabling frame pointer elimination for funclets in the back end, apparently because FP-elim sometimes breaks funclets. If anyone has a test case for this, it would probably help with tracking it down.

David

The main reason it is done is so that frame index resolution just works inside funclets. Otherwise, we’d have to code up some logic to use a different base register for stack object offsets inside funclets. Which, when you say it that way, seems pretty easy to implement. It’s just a matter of changing X86FrameLowering::getFrameIndexReference.
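To make the "different base register" idea concrete, here is a minimal sketch using the numbers from the clang-cl listing above: the funclet computes rbp = rdx + 48 and then addresses 'x' at [rbp - 16], which is the same address as [rdx + 32]. This is a toy model only. The names BaseReg, FrameRef, resolveFrameRef and kFpOffsetFromEstablisher are hypothetical and are not LLVM's API, and the constant 48 is specific to this one test function.

```cpp
#include <cstdint>

// Toy model, not LLVM's API: the register a stack-object offset is relative to.
enum class BaseReg { RBP, RDX };

struct FrameRef {
    BaseReg base;
    std::int64_t offset;
};

// Assumption taken from the listing in this thread: the funclet prologue
// computes rbp = rdx + 48, so the parent's frame pointer sits 48 bytes above
// the value passed to the funclet in RDX.
constexpr std::int64_t kFpOffsetFromEstablisher = 48;

// Resolve a stack object whose offset is known relative to the parent's
// frame pointer. In the parent function the reference stays RBP-relative;
// in a funclet it is rebased onto RDX, so RBP never has to be set up.
FrameRef resolveFrameRef(std::int64_t offsetFromFp, bool inFunclet) {
    if (!inFunclet)
        return {BaseReg::RBP, offsetFromFp};
    return {BaseReg::RDX, offsetFromFp + kFpOffsetFromEstablisher};
}
```

With this rebasing, resolveFrameRef(-16, true) gives an RDX-relative offset of 32, so the destructor argument could in principle be formed with a single lea off rdx, much like the CL output.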

I’d like to work on improving this, and I’ve got a few ideas thanks to your pointers. However, there’s one issue that I can’t seem to work out.

The funclets are treated as save and restore blocks for the associated function, which means that they’ll push/pop every callee-saved register that the associated function uses, even if the funclets themselves don’t use them. I tried fixing this with some custom logic in X86FrameLowering::[spill/restore]CalleeSavedRegisters, but I couldn’t find a good way to determine which registers the funclet’s blocks actually use (without iterating over each instruction).
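For reference, the per-instruction scan mentioned above might look roughly like the following. This is a self-contained toy model rather than LLVM code: Reg, Instr, regs, calleeSavedGPRs and calleeSavedClobberedBy are all hypothetical names, and an instruction is reduced to the set of registers it writes. The callee-saved set is the Win64 non-volatile GPR list (RBX, RBP, RDI, RSI, R12 to R15), with RSP left out because the prologue handles it separately.

```cpp
#include <bitset>
#include <initializer_list>
#include <vector>

// Toy register numbering for the x64 general-purpose registers.
enum Reg : unsigned {
    RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
    R8, R9, R10, R11, R12, R13, R14, R15, NumRegs
};
using RegSet = std::bitset<NumRegs>;

// Hypothetical stand-in for a machine instruction: only the registers it
// writes matter for deciding what must be saved.
struct Instr {
    RegSet defs;
};

// Build a register set from a list of registers.
RegSet regs(std::initializer_list<Reg> rs) {
    RegSet s;
    for (Reg r : rs)
        s.set(r);
    return s;
}

// Win64 non-volatile GPRs, excluding RSP.
RegSet calleeSavedGPRs() {
    return regs({RBX, RBP, RDI, RSI, R12, R13, R14, R15});
}

// Union the writes of every instruction in the funclet, then keep only the
// callee-saved registers. Anything outside the result needs no push/pop.
RegSet calleeSavedClobberedBy(const std::vector<Instr>& funclet) {
    RegSet clobbered;
    for (const Instr& I : funclet)
        clobbered |= I.defs;
    return clobbered & calleeSavedGPRs();
}
```

For a funclet like the destructor cleanup in this thread (a lea into rcx followed by a call that only clobbers volatile registers), the result is empty, which matches the intuition that no callee-saved pushes are needed there.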

Is there a better way to approach this?