LLVM emits unoptimized code

Hi Devs,
Consider the test case here:
https://godbolt.org/z/qHZzqw
At optimization level O1 or above, LLVM produces suboptimal code
because it calls __tls_get_addr inside the loops.
With optimization disabled,
it produces a single call to __tls_get_addr outside of the loop.
Is this a missed optimization in LLVM?

./Kamlesh

Looks pretty similar to the GCC-generated code. Have you benchmarked this code against a hand-crafted alternative to see whether the call to __tls_get_addr is measurably slow? (I guess there’s also the possibility that the loop runs zero times, in which case doing the call outside the loop would potentially be a pessimization.)
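
A hand-written comparison of the kind suggested here could look like the sketch below. The names (negatives, count_in_loop, count_hoisted) are hypothetical and are not taken from the linked test case. The first function refers to the thread-local variable by name inside the loop body; the second takes its address once, and only after checking that the loop will run, so a zero-trip loop never pays for the address computation. Whether the two forms actually compile to different code depends on the compiler and flags, since the optimizer is free to turn one form into the other, so the generated assembly should be checked before timing anything.

```c
#include <stddef.h>

/* Hypothetical thread-local counter; not the variable from the linked test case. */
_Thread_local long negatives;

/* Refers to the thread-local variable by name inside the loop body. */
void count_in_loop(const int *data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (data[i] < 0)
            negatives++;
    }
}

/* Takes the address once, guarded so that a zero-trip loop never computes it. */
void count_hoisted(const int *data, size_t n) {
    if (n == 0)
        return;
    long *p = &negatives;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] < 0)
            (*p)++;
    }
}
```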

It’s interesting to me that there’s a big difference between -fpie and -fpic.

https://godbolt.org/z/klX3q3

In particular, with -fpie, no call to __tls_get_addr is needed, so the underlying considerations for optimization change. This feels like the optimizer isn’t taking into account the overhead of -fpic when determining whether to hoist the address calculation out of the loop.
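
One way to look at the access model in isolation is to pin it per variable instead of letting -fpie or -fpic choose it. The sketch below assumes the tls_model variable attribute that GCC and Clang both support; the variable and function names are hypothetical. Compiling it with -fpic and comparing the assembly of the two functions should show, on a typical x86-64 ELF target, that only the global-dynamic variable is reached through __tls_get_addr. Note that initial-exec is not a general replacement in shared libraries, since it is only reliable when the library is loaded at program startup.

```c
/* Hypothetical variables, one per TLS access model. */
_Thread_local int counter_gd __attribute__((tls_model("global-dynamic")));
_Thread_local int counter_ie __attribute__((tls_model("initial-exec")));

/* Accumulates into the global-dynamic variable inside a loop. */
int bump_gd(int n) {
    for (int i = 0; i < n; ++i)
        counter_gd += i;
    return counter_gd;
}

/* Same loop against the initial-exec variable, for side-by-side comparison. */
int bump_ie(int n) {
    for (int i = 0; i < n; ++i)
        counter_ie += i;
    return counter_ie;
}
```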

Looks pretty similar to the GCC generated code

Challenge accepted => https://godbolt.org/z/8PX2La

Which challenge? Sorry, I could’ve linked to the godbolt I was looking at when I said that: https://godbolt.org/z/_07tOk, which compares GCC and Clang trunk on the code linked in the original post. It looked (and still looks) fairly similar to me. But yeah, I don’t know much beyond that.

Right, your example showed where gcc and clang were similar.

My example, https://godbolt.org/z/8PX2La, showed a case where gcc produced code that was possibly twice as fast as clang’s code.

– Jorg

It looks like CodeGenPrepare::optimizeMemoryInst is sinking the address computation into
the user's basic block.
If we disable this (-mllvm -disable-cgp), we get the same code as GCC.
See https://godbolt.org/z/bMvIsx