dmb ishld in AArch64

Hi,

I got an optimization problem (O1, O2) regarding memory barrier “dmb ishld”

I find in the test/CodeGen/AArch64/intrinsics-memory-barrier.ll , it’s stated that memory access around DMB should not be reordered, but when compiling the Linux kernel, I found load/store in

static inline void hlist_add_before_rcu(struct hlist_node *n,
                    struct hlist_node *next)
{
    n->pprev = next->pprev;
    n->next = next;
    rcu_assign_pointer(hlist_pprev_rcu(n), n);
    next->pprev = &n->next;
}

can reordered, and causes kernel crash.

f94006a8 ldr x8, [x21,#8]
f9000275 str x21, [x19]
d5033abf dmb ishst
f9400669 ldr x9, [x19,#8]
f9000668 str x8, [x19,#8] <==== reordered str
f9000133 str x19, [x9]
f90006b3 str x19, [x21,#8]

It should be:

f94006a8 ldr x8, [x21,#8]
f9000668 str x8, [x19,#8]
f9000275 str x21, [x19]
d5033abf dmb ishst
f9400669 ldr x9, [x19,#8]
f9000133 str x19, [x9]
f90006b3 str x19, [x21,#8]

I guess it’s because "dmb ishst" is not checked/tested? Any quick way to fix this?

Thanks,
Chengyu

What is dmb ishst? hehe .. But tackling these problems one-by-one do
not seem the quickest way to test our idea; try -O0?

static inline void hlist_add_before_rcu(struct hlist_node *n,
                    struct hlist_node *next)
{
    n->pprev = next->pprev;
    n->next = next;

     barrier();

    rcu_assign_pointer(hlist_pprev_rcu(n), n);

     barrier();

Sorry, should be "dmb ishst"

Hi Chengyu,

I guess it’s because "dmb ishst" is not checked/tested? Any quick way to fix this?

That barrier should be treated the same as any other dmb by LLVM. What
I'm most confused about is why it's there in the first place. From
what I can tell, the rcu_assign_pointer function should map to a
native store-release operation (stlr).

Also, isn't the reordering wrong just from a data-dependency point of
view? It looks like it moves the hlist_pprev_rcu(n) access before the
n->pprev assignment. Could be because of an aliasing violation, or it
could be LLVM.

Do you have preprocessed source or LLVM IR handy? (You can get a .i
file from "clang -save-temps" for example).

Cheers.

Tim.

Hi Tim,

That barrier should be treated the same as any other dmb by LLVM. What
I'm most confused about is why it's there in the first place. From
what I can tell, the rcu_assign_pointer function should map to a
native store-release operation (stlr).

I guess it's because we disabled the integrated-asm, so it won't complain when compiling the kernel.

Also, isn't the reordering wrong just from a data-dependency point of
view? It looks like it moves the hlist_pprev_rcu(n) access before the
n->pprev assignment. Could be because of an aliasing violation, or it
could be LLVM.

The problem is explained below, the reordering causes accessing to uninitialized data.

f94006a8 ldr x8, [x21,#8]
f9000275 str x21, [x19]
d5033abf dmb ishst
f9400669 ldr x9, [x19,#8] <==== uninitialized data
f9000668 str x8, [x19,#8] <==== initialization
f9000133 str x19, [x9] <==== accessing uninitialized address
f90006b3 str x19, [x21,#8]

Do you have preprocessed source or LLVM IR handy? (You can get a .i
file from "clang -save-temps" for example).

The IR looks OK to me, the order is correct. I think the optimization is on machine code. The code is part of the insert_leaf_info function in net/ipv4/fib_trie.c

  %37 = getelementptr inbounds %struct.leaf_info* %new, i64 0, i32 0, !dbg !10245
  %38 = getelementptr inbounds %struct.leaf_info* %li.0.lcssa, i64 0, i32 0, !dbg !10245
  tail call void @llvm.dbg.value(metadata !{%struct.hlist_node* %37}, i64 0, metadata !10246, metadata !6594), !dbg !10247
  tail call void @llvm.dbg.value(metadata !{%struct.hlist_node* %38}, i64 0, metadata !10248, metadata !6594), !dbg !10249
  %39 = getelementptr inbounds %struct.leaf_info* %li.0.lcssa, i64 0, i32 0, i32 1, !dbg !10250
  %40 = load %struct.hlist_node*** %39, align 8, !dbg !10250
  %41 = getelementptr inbounds %struct.leaf_info* %new, i64 0, i32 0, i32 1, !dbg !10250
  store %struct.hlist_node** %40, %struct.hlist_node*** %41, align 8, !dbg !10250
  %42 = getelementptr inbounds %struct.leaf_info* %new, i64 0, i32 0, i32 0, !dbg !10251
  store %struct.hlist_node* %38, %struct.hlist_node** %42, align 8, !dbg !10251
  tail call void asm sideeffect "dmb ishst", "~{memory}"() #5, !dbg !10252, !srcloc !10255
  %43 = load %struct.hlist_node*** %41, align 8, !dbg !10252
  store %struct.hlist_node* %37, %struct.hlist_node** %43, align 8, !dbg !10252
  store %struct.hlist_node** %42, %struct.hlist_node*** %39, align 8, !dbg !10256
  br label %hlist_add_head_rcu.exit

Thanks,
Chengyu

I guess it's because we disabled the integrated-asm, so it won't complain when compiling the kernel.

No, it's something different. The inline assembly (for whatever
reason) says to use a "dmb ishst" rather than an "stlr". I don't
actually think the reason is that important, it's just surprising
because I'd expect stlr to be more efficient.

The problem is explained below, the reordering causes accessing to uninitialized data.

Exactly, I think the dmb is pretty much irrelevant here.

The IR looks OK to me, the order is correct. I think the optimization is on machine code. The code is part of the insert_leaf_info function in net/ipv4/fib_trie.c

Thanks for the IR snippet (I agree, it does look like the order is
reasonable there), but it's not really enough to debug the problem. We
need an actual Module we can compile and examine to have a hope of
finding out what's going on.

Cheers.

Tim.

I guess it's because we disabled the integrated-asm, so it won't complain when compiling the kernel.

No, it's something different. The inline assembly (for whatever
reason) says to use a "dmb ishst" rather than an "stlr". I don't
actually think the reason is that important, it's just surprising
because I'd expect stlr to be more efficient.

I see .... guess I'm using an old version of Linux kernel ... from the Android branch ...

Thanks for the IR snippet (I agree, it does look like the order is
reasonable there), but it's not really enough to debug the problem. We
need an actual Module we can compile and examine to have a hope of
finding out what's going on.

You mean the source code I'm compiling?

Thanks,
Chengyu

Thanks for the IR snippet (I agree, it does look like the order is
reasonable there), but it's not really enough to debug the problem. We
need an actual Module we can compile and examine to have a hope of
finding out what's going on.

You mean the source code I'm compiling?

The whole .ll file you pulled those lines from looks like it would be
most useful (assuming we're right that the IR is sound). If that
fails, we can move further back to the C source in some form or
another.

Cheers.

Tim.

Should I just copy and paste the content of the .ll file?

Thanks,
Chengyu

Well, an attachment would probably be a bit easier to deal with, but I
can cope with copy/paste.

Cheers.

Tim.

I thought you said the problem occurred at O1 and O2? Does it happen
at O0 as well? We need source that actually produces the problem to
work out what's going on.

Cheers.

Tim.

I'm not sure, I was never able to compile the whole kernel with -O0, too many errors. Plus, the problem seems to be within machine code generation. I tried opt -O1 and -O2, the generated .ll file does not diff much for the target function (insert_leaf_info).

Thanks,
Chengyu

I'm not seeing anything resembling the code you originally posted by
compiling this (either before or after opt -On). I think it's probably
best if we try with preprocessed source.

What you need to do is run the clang compile command (the optimised
one that fails), but add the "-save-temps" option. It should produce a
"fib_trie.i" file in the directory you're running from. Could you
upload that and the exact command line you're using?

Also, I should have asked earlier, but what version of LLVM are you
using? The bug may already have been fixed on trunk.

Cheers.

Tim.

Please find it attached. The command I'm using is:

Thanks Chengyu. The code is very different from what you started the
thread with, and seems correct to me:

        // x19 == n, x21 == next
        ldr x8, [x21, #8]
        stp x21, x8, [x19]
        stlr x19, [x8]
        str x19, [x21, #8]

Can you confirm that this has the same bug when you try to use it?

Cheers.

Tim.

It's looking closer now, but I still can't trigger the bug I'm afraid.
Even disabling the load/store optimiser and fiddling with CPUs (to try
different schedulers) just gives me what I'd expect:

        ldr x8, [x21, #8]
        str x21, [x19]
        str x8, [x19, #8]
        dmb ishst
        ldr x8, [x19, #8]
        str x19, [x8]

I'm not aware of any big changes in this area recently. You don't have
any local LLVM patches or anything that could be causing the problem?
What revision of trunk did you get it to fail on? I could try checking
that one out.

Cheers.

Tim.

I'm using r223407. Switching to a clean built on r223853, it still gives me:

    ldr x8, [x21,#8]
    stp x21, x8, [x19]
    dmb ishst
    ldr x8, [x19,#8]
    str x19, [x8]
    str x19, [x21,#8]

Thanks,
Chengyu

I'm using r223407.

Just trying out this build now. Hopefully I'll be able to see the
error too there.

Switching to a clean built on r223853, it still gives me:

    ldr x8, [x21,#8]
    stp x21, x8, [x19]
    dmb ishst
    ldr x8, [x19,#8]
    str x19, [x8]
    str x19, [x21,#8]

Isn't that correct though? The problematic "str" has been folded into
the "stp" instruction, so "x19 + 8" is written before it gets read.

Cheers.

Tim.

Switching to a clean built on r223853, it still gives me:

   ldr x8, [x21,#8]
   stp x21, x8, [x19]
   dmb ishst
   ldr x8, [x19,#8]
   str x19, [x8]
   str x19, [x21,#8]

Isn't that correct though? The problematic "str" has been folded into
the "stp" instruction, so "x19 + 8" is written before it gets read.

Indeed. I didn't pay too much attention to the str -> stp change. Unfortunately, the trunk build crashes when compiling the kernel. Hope it will be fixed tomorrow, else I need to find a revision that gives me this fix.

Thank you so much!
Chengyu

Indeed. I didn't pay too much attention to the str -> stp change.

Oh good. I can reproduce it now, on the build you were using. Still
only with Clang rather than llc, but that's just a minor
inconvenience. It should still be solvable.

Unfortunately, the trunk build crashes when compiling the kernel. Hope it will be fixed tomorrow, else I need to find a revision that gives me this fix.

Ah, that's not so good. Still, probably better than the silently wrong
code you had to debug.

Cheers.

Tim.

I think this was an intentional fix, rather than just some random
change concealing the issue (the improvement bisected down to Tom's
r223717, which fits both by description and location). So I think we
can probably stop worrying about this one.

Cheers.

Tim.