[PATCH] Detect Haswell subarchitecture (i.e. using -march=native)

Hello,

This is my first patch on this list, however I've already submitted several trough bug tracking system. Since it probably needs some review and it's not a bug I am submitting it here.

The main intent of this patch is to detect "core-avx2" platform on Haswell i7 CPUs when running -march=native. Currently it detects it as generic x86_64.

lib/Support/Host.cpp:
* Haswell is detected for CPUID Family 6 Model 60
* Similar to Ivy and Sandy Bridge we check for AVX2 since some Haswell Pentiums are SSE4.x only
* I have marked HasAVX2 as "volatile", since otherwise it gets magically zeroed (by optimizer?) when compiling clang with latest clang build from trunk

lib/Target/X86/X86Subtarget.cpp:
* Also enabling X86::FeatureFastUAMem for Haswell

Regards,

llvm-detect-haswell.patch (1.72 KB)

Hi Adam,

* I have marked HasAVX2 as "volatile", since otherwise it gets
magically zeroed (by optimizer?) when compiling clang with latest
clang build from trunk

That's far more worrying to me than not being able to detect Haswell.
I can't reproduce the problem here at the moment: both debug and
release builds give identical assembly for Host.cpp.

I don't suppose you could post the command-line clang is using to
build Host.cpp (and perhaps the differing object files if you have
them handy)?

Cheers.

Tim.

That's far more worrying to me than not being able to detect Haswell.
I can't reproduce the problem here at the moment: both debug and
release builds give identical assembly for Host.cpp.

OK. I know the reason you cannot reproduce it, before posting the patch I've decided to check for AVX before checking AVX2, just not to cpuid AVX2 when we don't have AVX1 anyway.

So the problem exists with following predicate: (1)
  bool HasAVX2 = !GetX86CpuIDAndInfo(0x7, &EAX, &EBX, &ECX, &EDX) &&
                 (EBX & 0x20);

However it is working absolutely fine if I add "volatile": (2)
  volatile bool HasAVX2 = !GetX86CpuIDAndInfo(0x7, &EAX, &EBX, &ECX, &EDX) &&
                          (EBX & 0x20);

Or add extra check for HasAVX: (3)
  bool HasAVX2 = HasAVX && !GetX86CpuIDAndInfo(0x7, &EAX, &EBX, &ECX, &EDX) &&
                 (EBX & 0x20);

Attaching object files related to all of these cases below.

Also attaching patch that removes volatile, but leaves HasAVX check that makes the code run fine here.

My platform is Ubuntu 12.04 LTS 64-bit & i7-4470k.

llvm[1]: Compiling Host.cpp for Release build
if clang++ -I/home/ono/Documents/llvm/include -I/home/ono/Documents/llvm/lib/Support -I/tmp/llvm.q35LGwjHmR/include -I/tmp/llvm.q35LGwjHmR/lib/Support -DNDEBUG -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -O3 -fomit-frame-pointer -fvisibility-inlines-hidden -fno-exceptions -fno-rtti -fPIC -Woverloaded-virtual -Wcast-qual -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcovered-switch-default -Wno-uninitialized -Wno-missing-field-initializers -c -MMD -MP -MF "/tmp/llvm.q35LGwjHmR/lib/Support/Release/Host.d.tmp" -MT "/tmp/llvm.q35LGwjHmR/lib/Support/Release/Host.o" -MT "/tmp/llvm.q35LGwjHmR/lib/Support/Release/Host.d" /home/ono/Documents/llvm/lib/Support/Host.cpp -o /tmp/llvm.q35LGwjHmR/lib/Support/Release/Host.o ; \
          then /bin/mv -f "/tmp/llvm.q35LGwjHmR/lib/Support/Release/Host.d.tmp" "/tmp/llvm.q35LGwjHmR/lib/Support/Release/Host.d"; else /bin/rm "/tmp/llvm.q35LGwjHmR/lib/Support/Release/Host.d.tmp"; exit 1; fi

Regards,

Host-objects.tbz2 (6.22 KB)

llvm-detect-haswell-r2.patch (1.68 KB)

Hi Adam,

OK. I know the reason you cannot reproduce it, before posting
the patch I've decided to check for AVX before checking AVX2,
just not to cpuid AVX2 when we don't have AVX1 anyway.

I suspect it was also incompetence on my part. Given the differences
I'm seeing now I can't believe there'd be *no* difference in my tests
if I'd done them properly.

Anyway, thanks very much for the information. Hopefully that'll let me
track things down.

Also attaching patch that removes volatile, but leaves
HasAVX check that makes the code run fine here.

Would you mind me taking a day or so to investigate what's going on
here properly? Introducing a volatile to work around a bug in Clang
itself just seems perverse to me. (And we shouldn't let a CodeGen bug
dictate how we implement our functions either).

I promise I'll do the review of your code after that.

Cheers.

Tim.

Anyway, thanks very much for the information. Hopefully that'll let me
track things down.

Let me know if you need some more information or dumps.

Would you mind me taking a day or so to investigate what's going on
here properly? Introducing a volatile to work around a bug in Clang
itself just seems perverse to me. (And we shouldn't let a CodeGen bug
dictate how we implement our functions either).

Looking at the assembly there's something wrong with SI that is not getting saved anywhere after CPUID and 0x20 bit test before it gets overwritten by LEA.

332: mov eax,0x7
337: mov rsi,rbx
33a: cpuid
33c: xchg rsi,rbx
33f: and esi,0x20
342: shr esi,0x5
345: lea rbp,[rip+0x0] # 34c <llvm::sys::getHostCPUName()+0xbc>
34c: lea r12,[rip+0x0] # 353 <llvm::sys::getHostCPUName()+0xc3>
353: cmove rbp,r12
357: lea rdi,[rsp+0x188]
35f: lea rsi,[rip+0x0] # 366 <llvm::sys::getHostCPUName()+0xd6>

In both other cases (2) & (3) SI is saved into stack region.

I promise I'll do the review of your code after that.

Thanks.

Regards,

Pretty sure you need to check EAX>=7 from cpuid leaf 0 before calling leaf 7 and you need to use the pass ECX=0 to leaf 7. See lib/Target/X86/X86Subtarget.cpp which uses a GetX86CpuIDAndInfoEx function to pass EAX and ECX to cpuid.

I don’t think it explains your compiler bug though.

Actually there is no miscompile there as esi isn’t needed. The flags are which the cmove is using.

342: shr esi,0x5
345: lea rbp,[rip+0x0] # 34c llvm::sys::getHostCPUName()+0xbc
34c: lea r12,[rip+0x0] # 353 llvm::sys::getHostCPUName()+0xc3
353: cmove rbp,r12 ← this is dependent on the flags from the shift.

I think your real problem is that garbage went into ECX instead of 0 and caused cpuid to return 0.

I think your real problem is that garbage went into ECX instead of 0 and
caused cpuid to return 0.

Ah, that looks very likely. The value seems to come from "xorl %eax,
%eax" in both good object files, but a previous cpuid in the bad one.

Excellent work Craig, I suspect that would have taken me days to find.

Tim.

It helps that I’m the one who implemented a portion of the code in X86Subtarget.cpp.

I promise I'll do the review of your code after that.

Tim, I don’t want to push too much. But since there’s 3.4 release on the horizon, maybe you could find a moment review this patch. Especially Haswell is all there since few months.

Cheers,

Hi Adam,

+ bool HasAVX2 = HasAVX && !GetX86CpuIDAndInfo(0x7, &EAX, &EBX, &ECX, &EDX) &&
+ (EBX & 0x20);

I don't think this guarantees %ecx is 0, does it? Wasn't that the
entire reason the original code went wrong?

Cheers.

Tim.

+ bool HasAVX2 = HasAVX && !GetX86CpuIDAndInfo(0x7, &EAX, &EBX, &ECX, &EDX) &&
+ (EBX & 0x20);

I don't think this guarantees %ecx is 0, does it? Wasn't that the
entire reason the original code went wrong?

I don’t remember really, but presuming the conclusions of the discussion, seems it is fixed now. It was something about registers when using inline assembly. Anyway this works just fine on all my Haswell machines.

Regards,

I don’t remember really, but presuming the conclusions of the discussion, seems it is fixed now. It was something about registers when using inline assembly. Anyway this works just fine on all my Haswell machines.

I think that's more coincidence than anything else (something
perturbed in your host compiler's backend). If you look at
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp there's a
GetCpuIDAndInfoEx function which specifically sets %ecx to a valid
value before executing "cpuid".

The code in Host.cpp needs to do something similar.

Cheers.

Tim.

I agree with Tim, you need to implement a GetCpuIDAndInfoEx function in Host.cpp and pass the correct value to ecx. Also you need to verify that 7 is a valid leaf because an invalid leaf is defined to return the highest supported leaf on that processor. So if a processor supports say leaf 6 and not leaf 7, then an access leaf 7 will return the data from leaf 6 causing unrelated bits to be interpretted as feature flags for AVX etc.

Here we go, updated patch following your advice checking max leaf and porting cpuidex for subleaf (ECX) 0.

NOTE: I’ve set Haswell to be not only 60, but also 63, 69 & 70 model, following changes in Linux kernel & Xen. Also set 62 as Ivy Bridge EP aka E5 v3 (which I has in my workstation).

Cheers,

Here we go, updated patch following your advice checking max leaf and porting cpuidex for subleaf (ECX) 0.

Thanks for the updated version Adam, that looks good! I've committed
this as r195632 and asked Bill (the release manager) whether it can be
merged to the 3.4 branch, I think it would be a good idea too.

Cheers.

Tim