Fast scalar fsqrt tuning in x86

I recently discovered that the X86 code generation for sqrt with fast-math enabled is inconsistent for what I think should be essentially synonymous sets of function attributes. Consider the following IR:

define dso_local float @foo(float noundef %0) local_unnamed_addr #1 {
  %2 = tail call fast float @llvm.sqrt.f32(float %0)
  ret float %2

If this IR is compiled with “target-cpu”=“x86-64” and either no “tune-cpu” attribute or “tune-cpu”=“x86-64” the sqrt intrinsic gets expanded to an approximation using rsqrtss and a refinement sequence. However, if it is compiled with “tune-cpu”=“generic” the sqrtss instruction is selected.

See Compiler Explorer

I consider this to be rather unfortunate because in addition to being more accurate, I believe the sqrtss instruction is faster than the approximation sequence for Intel processors since Sandy Bridge. I understand that the situation is less clear for AMD processors, but I don’t know the specifics.

If I compile using clang with no -mtune or -march option, clang will use “tune-cpu”=“generic” so clang has the default behavior I want, but if another front end simply omits the “tune-cpu” attribute, it will get different default tuning. In fact, that’s exactly how I came to notice this. I was investigating a problem reported against a Fortran compiler and was surprised to see that it behaved differently than clang in this respect.

It seems to me that “tune-cpu”=“generic” should be equivalent to “tune-cpu”=“x86-64” or no “tune-cpu” attribute. I don’t know how complicated it would be to make them use the same code path. At the very least, I’d like to add TuningFastScalarFSQRT to the “x86-64” ProcModel. Does anyone object to this?

I’m not opposed to this specifically. Are you also going to be updating the Fortran compiler to use -tune-cpu=“generic”? Seems like there could be other issues like this.

I have recommended setting that in the Fortran compiler (not flang, BTW), but like I said above I think it would be better if we just made “tune-cpu”=“generic” the default behavior so that we don’t have to update every LLVM-based front end to avoid problems like this.

I guess it comes down to how you interpret “generic” and “x86-64”. In terms of target architecture, “x86-64” has a rather specific meaning as distinguished from “x86-64-v[2|3|4]”, but since it’s also used as the “target-cpu” when none is specified, that’s a bit problematic for tuning. The comments in say "‘generic’ is the default llc CPU, but it isn’t really. The comments for “generic” recommend using “x86-64” in LIT tests since “generic” will change over time. That was added in this change: ⚙ D118534 [X86] Introduce more common modern tunings into `generic`

My thinking is that whatever front ends get if they don’t specify an explicit tuning should change over time, and if that means LIT tests have to be updated I’m OK with that. @phoebe pointed out to me that the reason you get “x86-64” tuning by default is that if you don’t specify a “tune-cpu” that gets copied from the “target-cpu” and there’s no way to distinguish “x86-64” as an explicit target from it just being the default. I don’t like that. Maybe we need an explicit “x86-64-v1” that stays stable and an “x86-64” that uses the moving “generic” tuning.

“x86-64” is supposed to refer to the first generation of 64-bit CPUs from both Intel and AMD. And -march=“x86-64” or -mtune=“x86-64” should tune for that. At least I believe that’s what gcc does.

For 32-bit mode you get “pentium4” by default on linux with clang. This meant we couldn’t evolve the 32-bit default tuning over time. This is why I added tune-cpu=“generic” as the default in clang. What does your Fortran compile do for 32-bit mode?

Given what “x86-64” is supposed to mean, I don’t know if it makes sense to change the feature flags for it. Maybe we should change the default tuning to “generic” at X86TargetMachine::getSubtargetImpl if the “tune-cpu” isn’t present. We’d probably want to add a -mtune option to llc if we did that.

I agree we should change the feature flags. I think tuning to “generic” if “tune-cpu” isn’t specified and “target-cpu” is “x86-64” would be a reasonable compromise and would achieve what I want, but if “target-cpu” is anything else, then whatever it is should be used for “tune-cpu” I think.

I prefer to changing each front-end, because only FE can distinguish the 3 different behaviors: Compiler Explorer
Currently, when user uses option #2, they wants and gets #3. I don’t think we should change such assumption and ask user to use as #3 explicitly.
Backend can only distinguish from no “tune-cpu” to “tune-cpu”=“SomeValue”. If we change no “tune-cpu” from equaling “target-cpu” to “generic”. We need to change all FE to emit both “target-cpu” and “tune-cpu” for case #2. It doesn’t solve Andy’s problem, just switching argument passing from one method to another.

I don’t think that case #2 is important. I don’t think we have any users who want to specifically tune for old CPUs but no old CPU in particular. It makes sense to generate code that will execute correctly on older CPUs, but users will want the code optimized for the CPUs that their program will run on most often.

I should clarify what I meant in saying that case #2 is not important. There may very well be users who specify “-march=x86-64” without also passing a “-mtune” option. What I am questioning is whether they really want to tune for the x86-64-v1 architectures. My expectation is that they want to generate code that will execute correctly on those early processors, but they probably would want whatever tuning we use for “generic”.

To be explicit about what I think should happen in the absence of a “tune-cpu” attribute

  • If “tune-cpu” is present, that value should be used
  • If “target-cpu” is “x86-64” and “tune-cpu” is not present, we should tune for “generic”
  • If “target-cpu” is anything other than “x86-64” and “tune-cpu” is not present, we should tune for the architecture named in “target-cpu”

Put a patch ⚙ D129647 [X86] Use generic tuning for "x86-64" if "tune-cpu" is not specified according to Craig’s suggestion.
I had a look at supporting -mtune for llc. But it seems we need to change several interfaces to pass the new option. I’d like to leave it as is since we can anyhow control it by the function attribute.

1 Like

I agree that it’s best not to change llc at this time. I had looked at that and it seems it would require updating a lot of existing tests, some of which may depend on the current behavior to exercise the code path they intend to test.