[AArch64] Address computation folding


I was looking at some AArch64 benchmarks and noticed some simple cases
where address computations are being folded into the addressing mode,
and I was curious as to why. In particular, consider the following
simple example:

  void f2(unsigned long *x, unsigned long c) {
    x[c] *= 2;
  }

This generates:

  lsl x8, x1, #3
  ldr x9, [x0, x8]
  lsl x9, x9, #1
  str x9, [x0, x8]

Given the two uses of the address computation I was expecting this:

  add x8, x0, x1, lsl #3
  ldr x9, [x8]
  lsl x9, x9, #1
  str x9, [x8]

From reading 'SelectAddrModeXRO', the computation is getting folded if
the add node is *only* used with memory-related operations?

Why wouldn't it consider the number of uses in any operation? The
"expected" code is easy to get by checking the number of uses. This
may be desirable on some micro-architectures depending on the cost of
the various loads and stores.
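
To make the two policies concrete, here is a minimal sketch in C. It is
a toy model only: the `use_kind` enum and both function names are
hypothetical and are not taken from the LLVM sources. The first function
models the behaviour I think I am seeing (fold whenever every user is a
memory operation); the second models the use-count check I am proposing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kinds of nodes that use an address. */
enum use_kind { USE_LOAD, USE_STORE, USE_OTHER };

/* Policy as I read it: fold when no user is a non-memory operation. */
bool fold_if_only_memory_uses(const enum use_kind *uses, size_t n)
{
    assert(uses != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (uses[i] == USE_OTHER)
            return false;
    return true;
}

/* Proposed alternative: fold only when the address has a single use. */
bool fold_if_single_use(const enum use_kind *uses, size_t n)
{
    (void)uses; /* only the number of uses matters here */
    return n == 1;
}
```

For f2 the address has one load and one store as users, so the first
policy folds it and the second does not.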

-- Meador

As you say, very microarchitecture-dependent. The code produced is
probably optimal for Cyclone ("[x0, x8]" is no more expensive than
"[x8]" and the "lsl" is slightly cheaper than the complicated "add").
If I'm reading the Cortex-A57 optimisation guide correctly, the same
reasoning applies there too.




Indeed, the complex add is more expensive on all Cortex cores I know of.

However, there is an important point here: the code sequence we
generate requires two registers live instead of one. In loops with high
register pressure, we're probably losing performance.
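
As a rough illustration of the liveness difference, the two sequences
from the original mail can be written out in C. This is only a sketch:
the names `f2_folded` and `f2_unfolded` are made up, the instruction
comments assume an LP64 target where `unsigned long` is 8 bytes, and a
compiler is free to emit something different for either function.

```c
/* Folded form: the base pointer and the scaled index are two separate
   values, and both are needed by the load and again by the store. */
void f2_folded(unsigned long *x, unsigned long c)
{
    char *base = (char *)x;
    unsigned long off = c * sizeof *x;                 /* lsl x8, x1, #3   */
    unsigned long v = *(unsigned long *)(base + off);  /* ldr x9, [x0, x8] */
    v <<= 1;                                           /* lsl x9, x9, #1   */
    *(unsigned long *)(base + off) = v;                /* str x9, [x0, x8] */
}

/* Unfolded form: after the add, only the single pointer p has to stay
   live until the store. */
void f2_unfolded(unsigned long *x, unsigned long c)
{
    unsigned long *p = x + c;   /* add x8, x0, x1, lsl #3 */
    unsigned long v = *p;       /* ldr x9, [x8]           */
    v <<= 1;                    /* lsl x9, x9, #1         */
    *p = v;                     /* str x9, [x8]           */
}
```

Both functions compute the same result; the difference is only in how
many address values must be kept around between the load and the store.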


If you have a patch I would be interested in experimenting with it.


Yeah, my reading is the same. For Cortex-A57 it looks like the same
number of u-ops and latency either way (since LDR [x1, x2] is free).

-- Meador

Hi Chad,

The attached is what I was experimenting with to produce the code
snippet in my original mail. I really only tested it with the LLVM
test suite (with an obvious failure in arm64-addr-mode-folding.ll) and
some toy examples.



aarch64-no-addr-fold-more-than-one-use.patch (1.54 KB)