Disable memset synthesis

Our application is 32-bit big-endian ARM and we use -O3 with LTO.

clang turns certain zero-initializations of structures into calls to
memset, and those calls are not lowered further to move instructions.
From perf reports, it looks like it may be beneficial to disable this
optimization, since it introduces a function call to memset in
certain hot paths.
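
For concreteness, the pattern in question might look like the minimal
sketch below. The names conn_state and conn_reset are hypothetical and
not taken from the application; whether clang actually emits a call to
memset here depends on the target, the structure size, and the
optimization level.

```c
#include <stdint.h>

/* Hypothetical structure, large enough that a compiler may prefer a
   memset call over a sequence of individual stores. */
struct conn_state {
    uint32_t id;
    uint32_t flags;
    uint8_t  buf[256];
};

/* Zero-initialization written as a plain aggregate assignment.
   A compiler is free to implement this with a call to memset. */
void conn_reset(struct conn_state *c)
{
    *c = (struct conn_state){0};
}
```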

I tried passing -fno-builtin, but that doesn't seem to help my case;
the code doesn't compile with -ffreestanding. Any suggestions on what
I could try to avoid the calls to memset? It is possible to reorganize
the code to avoid this, but I am looking for a more general solution.

I find that GCC has an option -fno-tree-loop-distribute-patterns that
can be used to disable memcpy/memset synthesis. I wonder if there is
something similar in llvm/clang.
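
As far as I understand it, that GCC option concerns loops like the one
sketched below, which the optimizer may recognize as a memset pattern
and replace with a library call. The function name clear_words is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A plain zeroing loop. With pattern distribution enabled, a compiler
   may replace the whole loop with a single call to memset. */
void clear_words(uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 0;
}
```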

Thanks,
Bharathi

I have no idea what that means, but we almost certainly don't have any
option with similar semantics. Clang does not provide options to
control specific optimization passes like that.

The best advice is to file a bug report about the situation you're
seeing where a call to memset is bad for performance. There's clearly
something going wrong with Clang's heuristics and the best solution is
to fix that.

Cheers.

Tim.

Do you mean clang or the backend? The discussion elsewhere about
disabling memcpy intrinsic forming already makes me suspect that some
targets don't handle those intrinsics well enough.

Joerg

I agree with Joerg. I don’t really see much value in disabling idiom recognition. If the target isn’t lowering memset/memcpy well, it seems that fixing that would be more beneficial than being able to disable idiom recognition.

I think Sony exposes an option to disable idiom recognition in the PS4
compiler. This seems like one of those areas where users keep asking for
something and we keep insisting that what they think they want isn't
actually what they want, i.e. disabling idiom recognition blocks mid-level
canonicalization and that leads to missing optimizations and bad
performance, etc. However, the user feedback has been persistent, and in
the interests of not having to hear about it again, we might want to
consider giving users the rope they need to hang themselves. It would let
them work around real performance problems today rather than waiting for
the next version of the compiler that will lower memset/memcpy/memcmp
better.

Right, it’s a balance between those snooty compiler developers who think they always know best versus those pesky real-world code authors who think they can hand-tune their code to do better than what the compiler comes up with.

Offhand I don’t know how often our licensees use the option in production, but it surely gives them a tool that lets them do their own measurements, and only come back to us when there is something worthwhile to report.

–paulr

My concern wasn’t a philosophical one but a pragmatic one. Learning about poor choices when lowering memset is probably quite useful. A flag that just turns off idiom recognition for it may only work around the problem, and the underlying problem may still exist.
In any case, I’m not fundamentally against such a flag, but it seems like something that could:

  1. Hide a problem
  2. Get a bit unwieldy. Today it’s memset, maybe tomorrow memcpy, etc. And then does a single flag turn off all idiom recognition? A separate flag for each? Maybe groups (e.g. memory functions)? And so on.

Right. If someone can come up with a reasonable limiting principle here, that’s one thing, but we do not want to end up with endless options to enable and disable individual optimizations from individual passes.

There are always ways to work around optimizations that seem to actually pessimize code. In this case, you can probably just use volatile stores, but when there isn’t such a convenient idiomatic solution, empty inline asm blocks will almost always do the trick.

The ability to subvert the optimizer like that is really crucial, for many reasons, but most importantly because it’s usually a much better solution for the user. In the short term, users want to get on with their lives without waiting for a new toolchain drop. In the medium term, it’s really nice for workarounds to be local (i.e. written explicitly in a specific piece of code) rather than global (passing a command-line option) because local changes only affect one piece of code and can be clearly commented (with intent, purpose, and a link to the compiler bug), whereas command-line options affect everything and have a way of continually accumulating until someone comes along ten years later wondering why you’re building ten million lines of code with -fbifurcate-endosomes.
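
A minimal sketch of both local workarounds, under some assumptions: the
structure slot_table and the two function names are hypothetical, and
the __asm__ form is the GNU extension accepted by clang and GCC. Neither
variant is guaranteed to produce better code than memset; the generated
assembly is worth checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure used only for illustration. */
struct slot_table {
    uint32_t id;
    uint32_t flags;
    uint32_t slots[16];
};

/* Workaround 1: volatile stores. Each access through a volatile
   lvalue has to be performed as written, so the compiler should not
   fold these stores into a memset call.
   (A real workaround would carry a comment stating its purpose and
   linking to the compiler bug report.) */
void slot_table_clear_volatile(struct slot_table *t)
{
    volatile struct slot_table *vt = t;

    vt->id = 0;
    vt->flags = 0;
    for (size_t i = 0; i < sizeof t->slots / sizeof t->slots[0]; ++i)
        vt->slots[i] = 0;
}

/* Workaround 2: an empty inline asm statement with a "memory"
   clobber inside the loop body. The optimizer must assume the asm
   may read or write memory, which should keep it from recognizing
   the loop as a memset idiom. */
void slot_table_clear_barrier(struct slot_table *t)
{
    t->id = 0;
    t->flags = 0;
    for (size_t i = 0; i < sizeof t->slots / sizeof t->slots[0]; ++i) {
        t->slots[i] = 0;
        __asm__ volatile("" ::: "memory");
    }
}
```

Both functions leave the structure fully zeroed; they differ only in how they discourage the compiler from emitting a library call.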

John.

Since we are talking about the frontend, it is not even Loop Idiom
Recognition, but the much simpler initialisation of larger variables.

Joerg

Oh, I missed that. I have no intention of changing IRGen to hand-roll
zero-initialization loops as a performance workaround. That's not really even
disabling an optimization, that's literally asking us to add more code to IRGen
to work around a backend problem. No way.

John.