Behavior of -mcpu

When I pass -mcpu=<something> to clang, I get this:

clang-9: warning: argument unused during compilation: '-mcpu=<something>' [-Wunused-command-line-argument]

Worse, I do not get code generated for <something> but rather (I think)
for whatever the host architecture is.

Is this expected behavior? I know I can use -mtune and -march to
control things but I expected -mcpu to behave like it does in gcc, where
it implies -mtune and -march. I certainly did not expect it to be
completely ignored.

Thanks for any help!

                            -David
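
One way to catch a silently ignored flag in a build script is to promote the warning above to an error and test the compiler's exit status. The following is only a sketch: flag_is_used is a hypothetical helper name, and it assumes the compiler accepts -Werror=unused-command-line-argument (clang does; other compilers may reject it, which would also make the check fail).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the compiler does not report FLAG as unused.
# Usage: flag_is_used CC FLAG [extra compiler args...]
flag_is_used() {
    cc=$1
    flag=$2
    shift 2
    # Promote the "argument unused" warning to an error so that an
    # ignored flag makes the compile fail instead of passing silently.
    printf 'int main(void) { return 0; }\n' |
        "$cc" "$@" -Werror=unused-command-line-argument "$flag" \
            -x c -c - -o /dev/null 2>/dev/null
}
```

For example, `flag_is_used clang -mcpu=thunderx2t99 -target aarch64-unknown-linux-gnu` would be expected to succeed only where clang honors -mcpu for that triple and has the corresponding backend built in.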

Which triple are you targeting? Whether or not -mcpu does anything depends on the target.

Interesting. I am observing different behavior between -mcpu= and
-march= -mtune= with aarch64-unknown-linux-gnu:

$ clang -Ofast -S -target aarch64-unknown-linux-gnu -mcpu=thunderx2t99 -mllvm -debug-only=loop-data-prefetch ./prefetch.c
Prefetching 3 iterations ahead (loop size: 35) in partial: Loop at depth 1 containing: %for.body<header><latch><exiting>

$ clang -Ofast -S -target aarch64-unknown-linux-gnu -march=armv8 -mtune=thunderx2t99 -mllvm -debug-only=loop-data-prefetch ./prefetch.c
[nothing]

I expected to see the same behavior. Is this a bug? Just to be sure I
tried armv8.1a, armv8.2a and armv8.3a and still got nothing.

                                -David

Craig Topper <craig.topper@gmail.com> writes:

We never supported -mtune=thunderx2t99.

-mcpu=thunderx2t99
-march=armv8.1-a+lse

Although the -march flag here is somewhat redundant --
-mcpu=thunderx2t99 implies -march=armv8.1-a+lse:

%> cat t.c
#include <stdint.h>

int32_t compare_and_swap(volatile int32_t* ptr, int32_t oldval, int32_t newval)
{
  int32_t ret = *ptr;
  (void) __sync_bool_compare_and_swap(ptr, oldval, newval);
  return ret;
}

%> clang -O3 -std=c99 -mcpu=thunderx2t99 -S t.c -o t.S
%> egrep -e 'casal' -n t.S
9: casal w1, w2, [x0]
%>

I.e. you still get the LSE Atomics even without the -march=armv8.1-a+lse.
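
The egrep check above can be generalized a little so that it catches LSE atomics other than casal. This is a rough sketch, not an exhaustive matcher: has_lse_atomics is a hypothetical helper name, the mnemonic list covers the CAS, SWP and LD<op> families only (the ST<op> aliases are left out), and it assumes one instruction per line with leading whitespace, as in clang's -S output.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the assembly file contains at least one
# LSE atomic instruction (cas*, casp*, swp*, ldadd*, ldclr*, ldeor*,
# ldset*, ldsmax*, ldumax*, ldsmin*, ldumin*).
# Usage: has_lse_atomics FILE.S
has_lse_atomics() {
    grep -Eq '^[[:space:]]*(cas|casp|swp|ldadd|ldclr|ldeor|ldset|ld[su]max|ld[su]min)(a|l|al)?[bh]?[[:space:]]' "$1"
}
```

Running `has_lse_atomics t.S` on the output of the compile line above should succeed if -mcpu=thunderx2t99 enabled LSE.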

--Stefan

Stefan Teleman <stefan.teleman@gmail.com> writes:

Interesting. I am observing different behavior between -mcpu= and
-march= -mtune= with aarch64-unknown-linux-gnu:

We never supported -mtune=thunderx2t99.

-mcpu=thunderx2t99
-march=armv8.1-a+lse

This is problematic for users. If clang interprets command-line flags
differently based on subtarget, how are users to know what to pass to
clang to get, say, the best performance, or best code size, or whatever?

Users don't expect to have to set up their build systems to pass
different -mcpu/-march/-mtune flags based on what they are targeting.
They want to do something like this and be done with it:

TARGET=$(call get_triple)
ARCH=$(call get_arch)
TUNE=$(call get_tune)

CFLAGS += --target ${TARGET} -march=${ARCH} -mtune=${TUNE}

They could, but probably won't be happy to, do:

ifeq (${TARGET},aarch64-unknown-linux-gnu)
  ifeq (${TUNE},thunderx2t99)
    # -mtune unsupported
    CFLAGS += -mcpu=${TUNE}
  else ifeq (${TUNE},anothertarget)
    # -mcpu unsupported
    CFLAGS += -march=${ARCH} -mtune=${TUNE}
  else ifeq (${TUNE},thirdtarget)
    CFLAGS += -mtune=${TUNE}
  endif
else ifeq (${TARGET},x86_64-unknown-linux-gnu)
  ifeq (${TUNE},skylake-avx512)
    # -mtune unsupported
    CFLAGS += -mcpu=${TUNE}
  else ifeq (${TUNE},broadwell)
    # -mcpu unsupported
    CFLAGS += -march=${ARCH} -mtune=${TUNE}
  endif
else ifeq (${TARGET},third-target-triple)
  # -mcpu and -mtune unsupported
  CFLAGS += -march=${ARCH}
endif

It's hard enough for users to support building for multiple subtargets
without having to deal with basic compiler behavior differences per
subtarget.

Perhaps there is some capability of clang I'm not understanding. What's
the expected/canonical way to tune compilation for a subtarget?

                         -David

Yes, clang interprets -mcpu= | -march= | -mtune= differently based on
subtarget. So does GCC. This has been the case for many years.

For example, on X86_64, both clang and GCC complain if you pass -mcpu=
on compile-line. On AArch64 or SPARC, neither of them do.

AArch64 doesn't have the same diversity of sub-arch'es and
micro-arch'es and micro-tuning that X86_64 does. So, fiddling with
-mtune= and -march= on AArch64 is probably not as useful as it is on
X86_64. On AArch64, -mcpu= usually is sufficient. You will get the
most of what that particular CPU can do by default.

Hi David,

I thought that mtune just wasn’t supported for any target in clang, but I may be wrong.
FWIW, the following thread discusses what would be needed to implement support for mtune: http://lists.llvm.org/pipermail/llvm-dev/2017-October/118344.html
AFAIK, so far nobody has worked on implementing that proposal.

Thanks,

Kristof

Stefan Teleman <stefan.teleman@gmail.com> writes:

This is problematic for users. If clang interprets command-line flags
differently based on subtarget, how are users to know what to pass to
clang to get, say, the best performance, or best code size, or whatever?

Yes, clang interprets -mcpu= | -march= | -mtune= differently based on
subtarget. So does GCC. This has been the case for many years.

Ok, but is that the user experience we want?

For example, on X86_64, both clang and GCC complain if you pass -mcpu=
on compile-line. On AArch64 or SPARC, neither of them do.

I did not realize gcc has the same behavior, so thank you for clarifying
that. I understand the desire to be gcc-compatible.

AArch64 doesn't have the same diversity of sub-arch'es and
micro-arch'es and micro-tuning that X86_64 does.

Not yet.

So, fiddling with -mtune= and -march= on AArch64 is probably not as
useful as it is on X86_64. On AArch64, -mcpu= usually is
sufficient. You will get the most of what that particular CPU can do
by default.

It used to be the case that in gcc -mcpu was a shorthand for
-march/-mtune, such that common combinations (broadwell/avx2 for
example) were easy to use. It's unfortunate the gcc developers
abandoned that idea.

What is the rationale for the behavior of --target/-mcpu/-march/-mtune
in clang? It would be helpful to have a better idea why things are the
way they are.

                              -David

Kristof Beyls <Kristof.Beyls@arm.com> writes:

Hi David,

I thought that mtune just wasn’t supported for any target in clang,
but I may be wrong.
FWIW, the following thread discusses what would be needed to implement
support for mtune:
http://lists.llvm.org/pipermail/llvm-dev/2017-October/118344.html
AFAIK, so far nobody has worked on implementing that proposal.

Now I'm (more) confused. If -mtune doesn't work and -mcpu isn't
supported for x86-64 (I get a warning about an unused option) how do
users compile for a target like skylake-avx512 and have the optimizer
tune for it?

                            -David

According to gcc’s documentation, -march implies -mtune for x86, which matches what clang does without having -mtune. Their documentation also says -mcpu is a deprecated synonym for -mtune. So I’m not sure that -mcpu ever did -march and -mtune together on gcc, at least for x86.
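
Putting this together with the AArch64 behavior described earlier in the thread, a build script could pick the flag from the triple. The following is only a sketch of that rule: cpu_flags is a hypothetical helper name, and only the x86 and AArch64 cases come from this discussion; every other target is deliberately left as an error rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: print the CPU-selection flag for a triple and CPU name.
# Usage: cpu_flags TRIPLE CPU
cpu_flags() {
    triple=$1
    cpu=$2
    # The architecture is the first dash-separated component of the triple.
    case ${triple%%-*} in
        x86_64|i?86)
            # x86: -march selects the ISA and implies tuning for that CPU.
            printf '%s\n' "-march=$cpu" ;;
        aarch64*|arm64*)
            # AArch64: -mcpu selects both architecture features and tuning.
            printf '%s\n' "-mcpu=$cpu" ;;
        *)
            # Other targets are not covered in this thread.
            echo "cpu_flags: no rule for $triple" >&2
            return 1 ;;
    esac
}
```

With this, `cpu_flags x86_64-unknown-linux-gnu skylake-avx512` prints -march=skylake-avx512 and `cpu_flags aarch64-unknown-linux-gnu thunderx2t99` prints -mcpu=thunderx2t99.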

Thanks. Just to make sure I understand everything, for X86-64, using
-march will cause LLVM (via TargetTransformInfo) to use tuning
parameters appropriate to the subtarget given, whatever they may be,
along with the appropriate scheduling model, etc.?

And the same is true with -mcpu for AArch64 targets?

What about other targets? Do we know what the rules are for them?

It would be great to have this all documented. Is there any effort
around that?

                                 -David

Craig Topper <craig.topper@gmail.com> writes: