LLVM ARM VMLA instruction

Hi,

I was going through the LLVM instruction code generation for ARM. I came across VMLA instruction hazards (floating-point multiply and accumulate). I was comparing the assembly code emitted by LLVM and GCC, where I saw that GCC was happily using the VMLA instruction for floating point while LLVM never used it; instead it used a pair of VMUL and VADD instructions.

I wanted to know if there is any way in which these VMLA hazards can be ignored to make LLVM emit VMLA instructions. Is there any command line option/compiler switch/flag for doing this? I tried ‘-ffast-math’ but it didn’t work.

> I was going through Code of LLVM instruction code generation for ARM. I came
> across VMLA instruction hazards (Floating point multiply and accumulate). I
> was comparing assembly code emitted by LLVM and GCC, where i saw that GCC
> was happily using VMLA instruction for floating point while LLVM never used
> it, instead it used a pair of VMUL and VADD instruction.

It looks like Clang allows the formation by default, but you need to
be compiling for a CPU that actually supports the instruction (the key
feature is called "VFPv4"). That means one strictly newer than
cortex-a8: cortex-a7 (don't ask), cortex-a9, cortex-a12, cortex-a15 or
krait, I believe. With that I get:

$ cat tmp.c
float foo(float accum, float lhs, float rhs) {
  return accum + lhs*rhs;
}
$ clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -S -o- -O3 tmp.c
[...]
foo: @ @foo
@ BB#0: @ %entry
        vmla.f32 s0, s1, s2
        bx lr

Cheers.

Tim.

Hi Tim,

Cortex A8 and A9 use VFPv3. A7, A12 and A15 use VFPv4.

cheers,
--renato

> Cortex A8 and A9 use VFPv3. A7, A12 and A15 use VFPv4.

That's what I thought! But we do seem to generate vfma on Cortex-A9.
Wonder if that's a bug, or Cortex-A9 is "VFPv3, but chuck in vfma
too"?

Tim.

Hi Tim,

I believe that's the NEON VMLA, not the VFP one. There was a discussion in
the past about not using NEON and VFP interchangeably due to IEEE
assurances (which NEON doesn't have), but the performance gains are too
big. I think the conclusion is to only use NEON instead of VFP (when
they're semantically similar) when -unsafe-math is on.

cheers,
--renato

> I believe that's the NEON VMLA, not the VFP one.

Turns out I was misreading the assembly. I wish "vmla" and "vfma"
weren't so similar-looking.

For Suyog that means the option "-ffp-contract=fast" is needed to get
vfma when needed. Sorry about the bad information earlier.
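Concretely, combining that flag with the invocation from earlier in the thread would look something like this (same target, CPU and file name as Tim's example; this is just a reconstructed command line, not output from an actual run):

```shell
# Allow floating-point contraction across statement boundaries so the
# backend is free to form vfma for accum + lhs*rhs.
clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -ffp-contract=fast \
      -S -o- -O3 tmp.c
```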

Cheers.

Tim.

> “-ffp-contract=fast” is needed

Correct - clang is different from gcc, icc, msvc, xlc, etc. on this. Still haven’t seen any explanation for how this is better, though…

http://llvm.org/bugs/show_bug.cgi?id=17188
http://llvm.org/bugs/show_bug.cgi?id=17211

> http://llvm.org/bugs/show_bug.cgi?id=17188
> http://llvm.org/bugs/show_bug.cgi?id=17211

Ah, thanks. That makes a lot more sense now.

> Correct - clang is different than gcc, icc, msvc, xlc, etc. on this. Still
> haven't seen any explanation for how this is better though...

That would be because it follows what C tells us a compiler has to do
by default but provides overrides in either direction if you know what
you're doing.

The key point is that LLVM (currently) has no notion of statement
boundaries, so it would fuse the operations in this function:

float foo(float accum, float lhs, float rhs) {
  float product = lhs * rhs;
  return accum + product;
}

This isn't allowed even under FP_CONTRACT=on (the multiply and add do
not occur within a single expression), so LLVM can't in good
conscience enable these optimisations by default.
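For reference, the per-scope knob the C standard offers here is the FP_CONTRACT pragma; a minimal sketch (the pragma is the standard spelling and clang honours it, though support varies across compilers; the function name is made up):

```c
/* With contraction explicitly off, the product below must be rounded
 * before the add even though it sits in a single expression, so no
 * vfma/fma may be formed for it. */
#pragma STDC FP_CONTRACT OFF

float no_fuse(float accum, float lhs, float rhs) {
    return accum + lhs * rhs;
}
```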

Cheers.

Tim.

Thanks for the explanation, Tim!

gcc 4.8.1 does generate an fma for your code example for an x86 target that supports fma. I’d bet that the HW vendors’ compilers do the same, but I don’t have any of those installed at the moment to test that theory. So this is a bug in those compilers? Do you know how they justify it?

I see section 6.5 “Expressions” in the C standard, and I can see that 6.5.8 would seem to agree with you assuming that a “floating expression” is a subset of “expression”…is there any other part of the standard that you know of that I can reference?

This is made a little weirder by the fact that gcc and clang have a ‘fast’ setting for fp-contract, but the C standard that I’m looking at states that it is just an “on-off-switch”.

Just to clarify: gcc 4.8.1 generates that fma at -O2; no FP relaxation or other flags specified.

Hi all,

Thanks for the info. Few observations from my side :

LLVM :

cortex-a8 vfpv3 : no vmla or vfma instruction emitted

cortex-a8 vfpv4 : no vmla or vfma instruction emitted (This is invalid though as cortex-a8 does not have vfpv4)

cortex-a8 vfpv4 with ffp-contract=fast : vfma instruction emitted ( this seems a bug to me!! If cortex-a8 doesn’t come with vfpv4 then vfma instructions generated will be invalid )

cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)

cortex-a15 vfpv4 with ffp-contract=fast : vfma instruction emitted.

GCC :

cortex-a8 vfpv3 : vmla instruction emitted

cortex-a15 vfpv4 : vfma instruction emitted

I agree with the point that NEON and VFP instructions shouldn’t be used interchangeably.

However, if gcc emits the vmla (NEON) instruction with cortex-a8, then shouldn’t LLVM also emit the vmla (NEON) instruction? Can someone please clarify this point? The performance gain with the vmla instruction is huge. Somewhere I read that LLVM prefers precision accuracy over performance. Is this true, and is that why LLVM is not emitting vmla instructions for cortex-a8?

> cortex-a8 vfpv4 with ffp-contract=fast : vfma instruction emitted ( this
> seems a bug to me!! If cortex-a8 doesn't come with vfpv4 then vfma
> instructions generated will be invalid )

If I'm understanding correctly, you've specifically told it this
Cortex-A8 *does* come with vfpv4. Those kinds of odd combinations can
be useful sometimes (if only for tests), so I'm not sure policing them
is a good idea.

> cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)

I get a VFP vmla here rather than a NEON one (clang -target
armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
you seeing something different?

> However, if gcc emits vmla (NEON) instruction with cortex-a8 then shouldn't
> LLVM also emit vmla (NEON) instruction?

It appears we've decided in the past that vmla just isn't worth it on
Cortex-A8. There's this comment in the source:

// Some processors have FP multiply-accumulate instructions that don't
// play nicely with other VFP / NEON instructions, and it's generally better
// to just not use them.

Sufficient benchmarking evidence could overturn that decision, but I
assume the people who added it in the first place didn't do so on a
whim.

> The performance gain with vmla instruction is huge.

Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
in odd ways, and that was a very primitive core, so it's almost
certainly not going to be just as good as a vmul (in fact, if I'm
reading correctly, it takes pretty much exactly the same time as
separate vmul and vadd instructions: 10 cycles vs 2 * 5).

Cheers.

Tim.

Hi,

One more addition to above observation :

LLVM :

cortex-a15 + vfpv4-d16 + ffast-math option WITHOUT ffp-contract=fast option also emits vfma instruction.

Hi Tim,

As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this.

My version of the ARM architecture reference manual (v7 A & R) lists
versions requiring NEON and versions requiring VFP. (Section
A8.8.337). Split in just the way you'd expect (SIMD variants need
NEON).

> It may seem that total number of cycles are more or less same for single vmla
> and vmul+vadd. However, when vmul+vadd combination is used instead of vmla,
> then intermediate results will be generated which needs to be stored in memory
> for future access.

Well, it increases register pressure slightly I suppose, but there's
no need to store anything to memory unless that gets critical.

> Correct me if I am wrong on this, but my observations till date have shown this.

Perhaps. Actual data is needed, I think, if you seriously want to
change this behaviour in LLVM. The test-suite might be a good place to
start, though it'll give an incomplete picture without the externals
(SPEC & other things).

Of course, if we're just speculating we can carry on.

Cheers.

Tim.

> As per Renato comment above, vmla instruction is NEON instruction while
> vfma is VFP instruction. Correct me if i am wrong on this.

My version of the ARM architecture reference manual (v7 A & R) lists
versions requiring NEON and versions requiring VFP. (Section
A8.8.337). Split in just the way you'd expect (SIMD variants need
NEON).

I will check on this part.

> It may seem that total number of cycles are more or less same for single vmla
> and vmul+vadd. However, when vmul+vadd combination is used instead of vmla,
> then intermediate results will be generated which needs to be stored in memory
> for future access.

Well, it increases register pressure slightly I suppose, but there's
no need to store anything to memory unless that gets critical.

> Correct me if i am wrong on this, but my observation till date have
> shown this.

Perhaps. Actual data is needed, I think, if you seriously want to
change this behaviour in LLVM. The test-suite might be a good place to
start, though it'll give an incomplete picture without the externals
(SPEC & other things).

Of course, if we're just speculating we can carry on.

I wasn't speculating. Let's take an example of a 3*3 simple matrix
multiplication (no loops, all multiplications and additions are hard coded -
basically all the operations are expanded,
e.g. Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0], and
so on for all 9 elements of the result).

If I compile the above code with "clang -O3 -mcpu=cortex-a8 -mfpu=vfpv3-d16"
(only 16 floating point registers present on my ARM, hence specifying
vfpv3-d16), there are 27 vmul, 18 vadd, 23 store and 30 load ops in total.
If the same is compiled with gcc with the same options, there are 9 vmul, 18 vmla, 9
store and 20 load ops. So it's clear that extra load/store ops get added
with clang as it is not emitting the vmla instruction. Won't this lead to
performance degradation?
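The kernel being described can be sketched as follows (a reconstruction of the fully unrolled multiply, not the actual test source; the function name is made up). Each result element is a chain of multiplies and adds, so every A[i][k]*B[k][j] + ... step is a candidate for vmla contraction:

```c
/* Fully unrolled 3x3 single-precision matrix multiply: R = A * B.
 * No loops; each statement is an explicit multiply-add chain. */
void matmul3x3(const float A[3][3], const float B[3][3], float R[3][3]) {
    R[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0];
    R[0][1] = A[0][0]*B[0][1] + A[0][1]*B[1][1] + A[0][2]*B[2][1];
    R[0][2] = A[0][0]*B[0][2] + A[0][1]*B[1][2] + A[0][2]*B[2][2];
    R[1][0] = A[1][0]*B[0][0] + A[1][1]*B[1][0] + A[1][2]*B[2][0];
    R[1][1] = A[1][0]*B[0][1] + A[1][1]*B[1][1] + A[1][2]*B[2][1];
    R[1][2] = A[1][0]*B[0][2] + A[1][1]*B[1][2] + A[1][2]*B[2][2];
    R[2][0] = A[2][0]*B[0][0] + A[2][1]*B[1][0] + A[2][2]*B[2][0];
    R[2][1] = A[2][0]*B[0][1] + A[2][1]*B[1][1] + A[2][2]*B[2][1];
    R[2][2] = A[2][0]*B[0][2] + A[2][1]*B[1][2] + A[2][2]*B[2][2];
}
```

With contraction enabled, each of the 18 adds above can fold into the preceding multiply, which matches the 9 vmul + 18 vmla count reported for gcc.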

I would also like to know about accuracy with vmla and pair of vmul and
vadd ops.

My suggestion would be to actually run the code that clang produces vs the code that gcc produces on
some actual hardware and see if there is a performance difference and if it
is significant. Direct experimentation is often quicker than trying
to figure out how some code ought to perform. (In almost every experiment
I've performed on trying optimizations, the actual performance on hardware
has been different from the expectations I had before running the code.)
Granted, testing doesn't always show benefits, in that sometimes
microbenchmarks are so simple the compiler can hide the deficiencies of
inefficient code that it can't in more complex real-world code, but it's
still a good first thing to try.

Cheers,
Dave

VMLA.F can be either NEON or VFP on A series and the encoding will
determine which will be used. In assembly files, the difference is mainly
the type vs. the registers used.

The problem we were trying to avoid a long time ago was well researched by
Evan Cheng, and it showed that there is a pipeline stall between two
sequential VMLAs (possibly due to the need to re-use some registers), and
this made code much slower than a sequence of VMLA+VMUL+VADD.

Also, please note that, as far as cycle counts go, according to the A9
manual one VFP VMLA takes almost as long as a pair of VMUL+VADD to produce
its result, so a sequence of VMUL+VADD might be faster, in some contexts
or on some cores, than half the number of VMLAs.

As Tim and David said, and I agree, without hard data, anything we say might
be used against us. ;)

cheers,
--renato

Sorry folks, I didn't specify the actual test case and results in detail
previously. The details are as follows :

Test case name :
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
This is a 4x4 matrix multiplication; we can make small changes to make it a
3x3 matrix multiplication to keep things simple to understand.

clang version : trunk version (latest as of today 19 Dec 2013)
GCC version : 4.5 (I checked with 4.8 as well)

flags passed to both gcc and clang : -march=armv7-a -mfloat-abi=softfp
-mfpu=vfpv3-d16 -mcpu=cortex-a8
Optimization level used : O3

No vmla instruction emitted by clang but GCC happily emits it.

This was tested on real hardware. Time taken for a 4x4 matrix
multiplication:

clang : ~14 secs
gcc : ~9 secs

Time taken for a 3x3 matrix multiplication:

clang : ~6.5 secs
gcc : ~5 secs

when flag -mcpu=cortex-a8 is changed to -mcpu=cortex-a15, clang emits vmla
instructions (gcc emits by default)

Time for 4x4 matrix multiplication :

clang : ~8.5 secs
GCC : ~9secs

Time for 3x3 matrix multiplication :

clang : ~3.8 secs
GCC : ~5 secs

Please let me know if I am missing something. (The -ffast-math option doesn't
help in this case.) On examining the assembly code for the various scenarios
above, I concluded what I have stated above regarding more load/store ops.
Also, as stated by Renato - "there is a pipeline stall between two
sequential VMLAs (possibly due to the need of re-use of some registers) and
this made code much slower than a sequence of VMLA+VMUL+VADD" - when I use
-mcpu=cortex-a15 as an option, clang emits vmla instructions back to
back (sequential). Is there something different about cortex-a15 regarding
pipeline stalls, such that we are ignoring back to back vmla hazards?

> Test case name :
> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
> This is a 4x4 matrix multiplication, we can make small changes to make it a
> 3x3 matrix multiplication for making things simple to understand .

This is one very specific case. How does that behave in all other cases?
Normally, every big improvement comes with a cost, and if you only look at
the benchmark you're tuning for, you'll never see it. It may be that the
cost is small and that we decide to pay the price, but not until we know
what the cost is.

> This was tested on real hardware. Time taken for a 4x4 matrix
> multiplication:

What hardware? A7? A8? A9? A15?

> Also, as stated by Renato - "there is a pipeline stall between two
> sequential VMLAs (possibly due to the need of re-use of some registers) and
> this made code much slower than a sequence of VMLA+VMUL+VADD" , when i use
> -mcpu=cortex-a15 as option, clang emits vmla instructions back to
> back(sequential) . Is there something different with cortex-a15 regarding
> pipeline stalls, that we are ignoring back to back vmla hazards?

A8 and A15 are quite different beasts. I haven't read about this hazard in
the A15 manual, so I suspect that they have fixed whatever was causing the
stall.

cheers,
--renato