Is this the correct behavior of getelementptr on i192* for opt + llc -march=aarch64?

Hi all,
llc alone and opt + llc generate different aarch64 asm code for the following LLVM code.

Is it intended behavior?
I expected (A) because I cast %p from i192* to i64*.
Is that information dropped by opt so that 8 bytes of padding are inserted, or did I write bad code?

% cat a.ll
define void @store0_to_p4(i192* %p)
{
  %p1 = bitcast i192* %p to i64*
  %p2 = getelementptr i64, i64* %p1, i64 3
  %p3 = getelementptr i64, i64* %p2, i64 1
  store i64 0, i64* %p3
  ret void
}

% llc-3.8 a.ll -O3 -o - -march=aarch64
store0_to_p4:
  str xzr, [x0, #32] ; (A)
  ret

% opt-3.8 -O3 a.ll -o - | llc-3.8 -O3 -o - -march=aarch64
store0_to_p4:
  str xzr, [x0, #40] ; (B)
  ret

Yours,
Shigeo

Can you provide the full repro?

Also what is the IR output of opt -O3?

Hi Shigeo,

llc alone and opt + llc generate different aarch64 asm code for the following LLVM code.

This looks like it's because the IR doesn't contain a datalayout
declaration, which affects how i192 is interpreted (particularly
sizeof(i192) for GEP purposes).

Is it intended behavior?

The difference will disappear if you provide a correct datalayout; incorrect ones
are unsupported in any configuration.
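
To make the sizeof(i192) point concrete, here is a rough arithmetic sketch. The size that GEP steps by is the type's store size (24 bytes for i192) rounded up to its ABI alignment. The alloc_size helper and the two alignment values (8 and 16 bytes) are illustrative assumptions, not values read from any particular datalayout.

```shell
# Allocation size = store size rounded up to the ABI alignment (both in bytes).
# alloc_size is a hypothetical helper used only for this illustration.
alloc_size() {
  store=$1
  align=$2
  echo $(( (store + align - 1) / align * align ))
}

# i192 stores 24 bytes; the alignments below are assumed for illustration.
alloc_size 24 8    # prints 24
alloc_size 24 16   # prints 32
```

So stepping over one i192 element can mean 24 or 32 bytes, depending on the alignment the datalayout implies.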

Cheers.

Tim.

Hi Tim,

That’s why I asked for the full repro, by the way; I thought you were showing only part of the test case.

The datalayout is *required* if you want to perform any transformation on the IR; otherwise you may get surprises like this one at codegen time.
You should get it from the TargetMachine after you initialize a Target, and set it on the module from the beginning.

If you just want to play with some IR, you can look it up in the source code, or in the test directory:

$ git grep datalayout test/CodeGen/AArch64/
test/CodeGen/AArch64/GlobalISel/arm64-callingconv.ll:target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
test/CodeGen/AArch64/GlobalISel/arm64-fallback.ll:target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
test/CodeGen/AArch64/GlobalISel/arm64-instructionselect.mir: target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
test/CodeGen/AArch64/GlobalISel/arm64-irtranslator.ll:target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
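
For quick experiments, a datalayout string found this way can be put at the top of an IR file. The sketch below is only one possible way to do that; prepend_datalayout is a hypothetical helper name, and the layout string you pass must match the target you actually compile for.

```shell
# Hypothetical helper: print a "target datalayout" line followed by the
# contents of the given IR file. It does not check that the layout is
# correct for your target.
prepend_datalayout() {
  layout=$1
  file=$2
  printf 'target datalayout = "%s"\n' "$layout"
  cat "$file"
}

# Example use (assumes a.ll exists and has no datalayout line yet):
#   prepend_datalayout "e-m:o-i64:64-i128:128-n32:64-S128" a.ll > a_layout.ll
```

A front-end that builds the module itself should still take the datalayout from the TargetMachine rather than from a copied string.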

Hi Mehdi,

If you just want to play with some IR, you can look it up in the source code, or in the test directory:

$ git grep datalayout test/CodeGen/AArch64/
test/CodeGen/AArch64/GlobalISel/arm64-callingconv.ll:target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"

Thank you for the advice.
I verified it by adding 'target datalayout="e-m:o-i64:64-i128:128-n32:64-S128:i192:192"' at the top of a.ll.

% opt-3.7 -O3 a.ll -o - | llc-3.7 -O3 -o - -march=aarch64
store0_to_p4:
    str xzr, [x0, #32]
    ret

Yours,
Shigeo

Is your default target aarch64? Otherwise opt may be assuming a different
target which might explain the difference.

-Tom

Hi Tom,

Is your default target aarch64? Otherwise opt may be assuming a different
target which might explain the difference.

No, my targets are x86-64, x86, arm, aarch64, ..., so I'll avoid using i192* and datalayout.

Yours,
Shigeo

There is nothing specific to i192. You will likely run into issues by not specifying the right datalayout.

The optimizations will always run with a datalayout: if you don’t specify one, a default one is used, which can cause problems on some targets (like you saw on arm).
For instance, the optimizer will assume a pointer size and optimize based on this.

I believe Tom's point was about the line:

% opt-3.8 -O3 a.ll -o - | llc-3.8 -O3 -o - -march=aarch64

If your host is x86_64, then the first call to opt will assume x86_64
unless you have a triple in the IR (which I believe you didn't).

You can override with:

% opt-3.8 -march=aarch64 -O3 a.ll -o - | llc-3.8 -O3 -o - -march=aarch64

Or make sure your IR always has a triple and a datalayout.

I'm not sure it would have made any difference on the i192* case, but
it will have noticeable impact on more complicated (and more target
specific) IR, so you should be careful.
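
One way to catch a missing triple or datalayout before running opt is a simple text check on the .ll file. This is only a sketch: has_target_info is a hypothetical helper, and it assumes textual IR where these declarations start at the beginning of a line.

```shell
# Hypothetical helper: succeed only if the textual IR file declares both a
# target triple and a target datalayout. It does not validate their values.
has_target_info() {
  grep -q '^target triple' "$1" && grep -q '^target datalayout' "$1"
}

# Example use (assumes a.ll exists):
#   has_target_info a.ll || echo "a.ll: missing triple or datalayout" >&2
```

The check only tells you the lines are present; whether they match the target you pass to llc is still up to you.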

Also, don't assume that OPT+LLC == LLC, as you'll be running more of
the same passes on the first case, which can, in rare cases, have an
impact (for better or worse) on the code generated.

I recommend you keep the passes to a minimum. Opt is a debug tool, not
an optimiser.

To generate target code, use llc directly, which will (should) have
the same effect without command line flag duplication. Better still,
use Clang, or make sure your own front-end uses the middle and back
ends in a consistent way, and use it instead of llc.

cheers,
--renato

Hi Mehdi,

No, my targets are x86-64, x86, arm, aarch64, ..., so I'll avoid using i192* and datalayout.

There is nothing specific to i192. You will likely run into issues by not specifying the right datalayout.

The optimizations will always run with a datalayout: if you don’t specify one, a default one is used, which can cause problems on some targets (like you saw on arm).
For instance, the optimizer will assume a pointer size and optimize based on this.

I wrote the code without i192* as follows, and I got what I wanted.
I'll rewrite the other code like this.

; load 192-bit data from %r2
define i192 @load192(i64* %r2)
{
%r3 = load i64, i64* %r2
%r4 = zext i64 %r3 to i128
%r6 = getelementptr i64, i64* %r2, i32 1
%r7 = load i64, i64* %r6
%r8 = zext i64 %r7 to i128
%r9 = shl i128 %r8, 64
%r10 = or i128 %r4, %r9
%r11 = zext i128 %r10 to i192
%r13 = getelementptr i64, i64* %r2, i32 2
%r14 = load i64, i64* %r13
%r15 = zext i64 %r14 to i192
%r16 = shl i192 %r15, 128
%r17 = or i192 %r11, %r16
ret i192 %r17
}

; struct i192_t {
;   uint64_t v[3];
; };
; void add(i192_t *y, const i192_t* x)
; {
;   *y = x[0] + x[1]; // pseudo code
; }
define void @add(i64* noalias %r1, i64* noalias %r2)
{
%r3 = call i192 @load192(i64* %r2)
%r5 = getelementptr i64, i64* %r2, i32 3
%r6 = call i192 @load192(i64* %r5)
%r7 = add i192 %r3, %r6
%r9 = getelementptr i64, i64* %r1, i32 0
%r10 = trunc i192 %r7 to i64
store i64 %r10, i64* %r9
%r11 = lshr i192 %r7, 64
%r13 = getelementptr i64, i64* %r1, i32 1
%r14 = trunc i192 %r11 to i64
store i64 %r14, i64* %r13
%r15 = lshr i192 %r11, 64
%r17 = getelementptr i64, i64* %r1, i32 2
%r18 = trunc i192 %r15 to i64
store i64 %r18, i64* %r17
ret void
}

% opt-3.8 -O3 a.ll -o - | llc-3.8 -O3 -o - -march=x86-64
add:
        movq 16(%rsi), %rax
        movq 24(%rsi), %rcx
        movq 32(%rsi), %rdx
        addq (%rsi), %rcx
        adcq 8(%rsi), %rdx
        adcq 40(%rsi), %rax
        movq %rcx, (%rdi)
        movq %rdx, 8(%rdi)
        movq %rax, 16(%rdi)
        retq

% opt-3.8 -O3 a.ll -o - | llc-3.8 -O3 -o - -march=aarch64
add:
        ldp x8, x9, [x1]
        ldp x10, x11, [x1, #24]
        ldr x12, [x1, #16]
        ldr x13, [x1, #40]
        adds x8, x10, x8
        adcs x9, x11, x9
        stp x8, x9, [x0]
        adcs x8, x13, x12
        str x8, [x0, #16]
        ret

Yours,
Shigeo

And what we’re trying to tell you is that this may “fix” *this* particular case, but that does not make it a correct solution.

Hi Renato,

I believe Tom's point was about the line:

% opt-3.8 -O3 a.ll -o - | llc-3.8 -O3 -o - -march=aarch64

If your host is x86_64, then the first call to opt will assume x86_64
unless you have a triple in the IR (which I believe you didn't).

Thank you, I missed that, but I always run opt and llc on each host and do not mix them.
I'm sorry, I should have written % opt -O3 a.ll -o - | llc -O3 -o - (on x86-64 / aarch64).

% opt-3.8 -O3 a.ll -o - | llc-3.8 -O3 -o - -march=aarch64

Also, don't assume that OPT+LLC == LLC, as you'll be running more of
the same passes on the first case, which can, in rare cases, have an
impact (for better or worse) on the code generated.

I recommend you keep the passes to a minimum. Opt is a debug tool, not
an optimiser.

I see, but I want load192() from the previous mail to be inlined, and
llc alone does not inline it even when the alwaysinline attribute is added.

Yours,
Shigeo

Hi Mehdi,

I wrote the code without i192* as follows, and I got what I wanted.

And what we’re trying to tell you is that this may “fix” *this* particular case, but that does not make it a correct solution.

I do not understand your advice yet; I'll read your comments again from the beginning.
Thank you.

Thank you, I missed that, but I always run opt and llc on each host and do not mix them.
I'm sorry, I should have written % opt -O3 a.ll -o - | llc -O3 -o - (on x86-64 / aarch64).

Yes, that works if it's all native.

I see, but I want load192() from the previous mail to be inlined, and
llc alone does not inline it even when the alwaysinline attribute is added.

Right, and that shows my argument very well: opt+llc != llc alone.

This time, it worked for you and you got the result you wanted. Next
time, it may work against you and you'll be in the situation where you
don't know if you run opt or not beforehand, depending on the case, or
how many times you'll run opt.

I recommend you identify why opt is making a difference and submit a bug report.

In theory, opt will run the same passes in the same order as llc
(give or take a few things), so this falls into two scenarios:

1. Some pass *after* inlining is reducing the threshold of the
function you want to inline, so it only inlines on the second pass.

To see if this is the case, try to run opt twice on the IR (opt | opt)
and see if the function is inlined the second time. If it is, then the
"fix" would be working around the heuristics, or fiddling with your
function to understand and correct the problem.

Using --print-after-all will give you an idea which pass is
responsible for the simplification (hint: check for the state of IR
just before the inlining phase on both runs, then trace back "who did
it" when it worked).

2. Opt and llc are not running the same passes in the same order.

With the --print-after-all results from the investigation above, you
can try to re-order the passes via opt and see whether, when you run them
in the same order as llc, you get only the out-of-line function.

If this is the case, then changing llc to re-order the passes (and be
like opt) could be an easy fix.
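
The "run opt twice" experiment from scenario 1 has the shape of piping a command into a second copy of itself. A minimal sketch of that shape follows; run_twice is a hypothetical helper, and the sed call only stands in for the real tool so the example is self-contained.

```shell
# Hypothetical helper: run the same command twice in a pipeline, feeding the
# output of the first run into the second run.
run_twice() {
  "$@" | "$@"
}

# Stand-in demonstration: each sed run replaces the first remaining 'a'.
printf 'aaa\n' | run_twice sed 's/a/b/'   # prints bba
```

For the case discussed here this might look like `run_twice opt-3.8 -O3 < a.ll | llc-3.8 -O3 -o - -march=aarch64`, assuming opt reads the module from standard input and writes to standard output when no file names are given. Comparing that result with a single opt run shows whether the second run is what triggers the inlining.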

But always remember: both opt and llc are *debug* tools. If you build
a compiler with LLVM you should use the middle and back end classes
inside your front-end driver.

Emitting IR and using opt/llc is only a way to bootstrap your
front-end, and not meant to be embedded in a final product.

As an example, if we change llc to be like opt here, all the other
tools that rely on llc's behaviour will change unexpectedly, and this
may break or generate worse code for them. You don't want to be in
that situation.

But it would be good if both opt and llc behave in similar ways, so
that you can bootstrap products and run more reliable tests with them.

cheers,
--renato

I don’t believe there is any bug: opt is the *optimizer* and llc the *code generator*.

The behavior observed is absolutely expected.