Some strange i64 behavior with arm 32bit. (Raspberry Pi)

Hi,

I have this function:

define private ghccc void @c4JC_info(i32*, i32*, i32*, i32, i32, i32, i32, i32) prefix { i32, i32, i32 } { i32 add (i32 sub (i32 ptrtoint ({ i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }* @S4J7_srt to i32), i32 ptrtoint (void (i32*, i32*, i32*, i32, i32, i32, i32, i32)* @c4JC_info to i32)), i32 16), i32 0, i32 196638 } {
  %9 = add i32 %3, 3
  %10 = inttoptr i32 %9 to i64*
  %11 = load i64, i64* %10, align 8
  call void @debug(i64 -1)
  call void @debug(i64 1)
  call void @debug(i64 4294967296)
  call void @debug(i64 4294967297)
  call void @debug(i64 4294967298)
  [...]

  ret void
}

where @debug is:

@.str = private unnamed_addr constant [6 x i8] c"%lld\0A\00", align 1
declare i32 @printf(i8*, ...)

define void @debug(i64 %val) {
  %cret = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([6 x i8], [6 x i8]* @.str, i32 0, i32 0), i64 %val)
  ret void
}

When I run it the output is:

-1
4294967296 ; expected this to be          1 = [0x0,0x0,0x0,0x0
           ;                                  ,0x0,0x0,0x0,0x1]
1          ; expected this to be 4294967296 = [0x0,0x0,0x0,0x1
           ;                                  ,0x0,0x0,0x0,0x0]
4294967297 ; this is correct due to symmetry
8589934593 ; expected this to be 4294967298 = [0x0,0x0,0x0,0x1
           ;                                  ,0x0,0x0,0x0,0x2]

As such it looks like my words are in reverse somehow.
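
As a sanity check on that reading, here is a minimal C sketch (not taken from the failing module; `swap_words` is a hypothetical helper name) that exchanges the two 32-bit halves of a 64-bit value. Applied to the five constants passed to @debug, it reproduces the values printed above.

```c
#include <stdint.h>

/* Hypothetical helper: exchange the upper and lower 32-bit words of a
 * 64-bit value, as would happen if the two halves were picked up in the
 * wrong order somewhere between the call and the printf. */
static int64_t swap_words(int64_t v)
{
    uint64_t u = (uint64_t)v;
    return (int64_t)((u << 32) | (u >> 32));
}

/* swap_words(-1)          == -1          (all bits set, unchanged)
 * swap_words(1)           == 4294967296
 * swap_words(4294967296)  == 1
 * swap_words(4294967297)  == 4294967297  (both words are 1)
 * swap_words(4294967298)  == 8589934593  (0x1_00000002 -> 0x2_00000001)
 */
```

This only shows that the output is consistent with swapped words; it says nothing about where the swap happens.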

However, if I try to build something similar and boil it down to just a simple C main function:

define i32 @main(i32, i8**) {
  call void @debug(i64 -1)
  call void @debug(i64 1)
  call void @debug(i64 4294967296)
  call void @debug(i64 4294967297)
  call void @debug(i64 4294967298)
  ret i32 0

}

it produces the correct result

-1
1
4294967296
4294967297
4294967298

If someone could offer some hint, where to look further for debugging this, I'd very much appreciate the advice!
I'm a bit lost right now how to figure out why I end up getting swapped words.

Thank you in advance!

Cheers,
Moritz

Hi Moritz,

> If someone could offer some hint, where to look further for debugging this, I'd very much appreciate the advice!
> I'm a bit lost right now how to figure out why I end up getting swapped words.

If one file was compiled for big-endian ARM and the other for
little-endian that could be the result. I'm not aware of any other
possible cause and from local tests I don't think the "ghccc" alone
explains the difference.

So maybe some glitch in how GHC was configured on your system? What's
the triple at the top of the GHC module and the module containing the
definition of @debug?

Cheers.

Tim.

Hi Tim,
thanks for the swift response!

@debug is defined in the same module, which makes this all the more confusing.

The target information from the working example is:
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "armv6kz--linux-gnueabihf"

from the ghc produced module:
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "arm-unknown-linux-gnueabihf"

However, there is one more thing I can think of: ARM does allow mixed-endian
operation, I believe. Since the code from the GHC-produced module is called
from outside of the module, could the endianness be set there prior to
entering the function?
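
One way to test that hypothesis would be a small runtime probe called from inside the affected function. The following is only a sketch in C; `is_little_endian` is a hypothetical name, and it checks the byte order of data accesses at the point where it runs, nothing more.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical probe: returns 1 if the low-order byte of a 32-bit value
 * is stored first in memory (little-endian data accesses), 0 otherwise. */
static int is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    /* Copy only the first byte of the in-memory representation. */
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

If this returned 1 both in the working `main` and inside the ghccc function, a change of endianness on entry could be ruled out.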

The working module contains the main directly and is not called from a main
function in a different module.

I've also tried defining a regular C function with the same code and calling
it from within the ghccc function, with the same (incorrect) results.

Any further ideas I could explore?

Cheers,
Moritz

Alright, so after some more debugging (injecting print statements at the LLVM IR level),
I came across the following:

GHC has the following code for the bridge from C into STG and back: `StgRun`, which is defined
in https://github.com/ghc/ghc/blob/master/rts/StgCRun.c; the resulting LLVM IR ends up being:

; Function Attrs: nounwind
define hidden %struct.StgRegTable* @StgRun(i8* ()* ()*, %struct.StgRegTable*) local_unnamed_addr #0 {
  
  %3 = tail call %struct.StgRegTable* asm sideeffect "stmfd sp!, {r4-r11, ip, lr}\0A\09vstmdb sp!, {d8-d11}\0A\09sub sp, sp, $3\0A\09mov r4, $2\0A\09bx $1\0A\09.globl StgReturn\0A\09.type StgReturn, %function\0AStgReturn:\0A\09add sp, sp, $3\0A\09mov $0, r7\0A\09vldmia sp!, {d8-d11}\0A\09ldmfd sp!, {r4-r11, ip, lr}\0A\09", "=r,r,r,i,~{r4},~{r5},~{r6},~{r7},~{r8},~{r9},~{r10},~{r12},~{lr}"(i8* ()* ()* %0, %struct.StgRegTable* %1, i32 8192) #1, !srcloc !3

  ret %struct.StgRegTable* %3
}

For better readability, the assembly reads:

  stmfd sp!, {r4-r11, ip, lr}
  vstmdb sp!, {d8-d11}
  sub sp, sp, $3
  mov r4, $2
  bx $1
.globl StgReturn
.type StgReturn, %function
StgReturn:
  add sp, sp, $3
  mov $0, r7
  vldmia sp!, {d8-d11}
  ldmfd sp!, {r4-r11, ip, lr}

This results in the following assembly being emitted (for armv-unknown-linux-gnueabihf):

00000074 <StgRun>:
  74:   e92d4ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, lr}
  78:   e28db01c        add     fp, sp, #28, 0
  7c:   e92d5ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
  80:   ed2d8b08        vpush   {d8-d11}
  84:   e24dda02        sub     sp, sp, #8192   ; 0x2000
  88:   e1a04001        mov     r4, r1
  8c:   e12fff10        bx      r0

00000090 <StgReturn>:
  90:   e28dda02        add     sp, sp, #8192   ; 0x2000
  94:   e1a00007        mov     r0, r7
  98:   ecbd8b08        vpop    {d8-d11}
  9c:   e8bd5ff0        pop     {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
  a0:   e8bd8ff0        pop     {r4, r5, r6, r7, r8, r9, sl, fp, pc}

While adding extra printf statements, I found that with a `printf` call after the inline assembly and before
the `ret`, the generated code looks slightly different:

00000074 <StgRun>:
  74:   e92d4ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, lr}
  78:   e28db01c        add     fp, sp, #28, 0
  7c:   e24dd004        sub     sp, sp, #4, 0
  80:   e92d5ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
  84:   ed2d8b08        vpush   {d8-d11}
  88:   e24dda02        sub     sp, sp, #8192   ; 0x2000
  8c:   e1a04001        mov     r4, r1
  90:   e12fff10        bx      r0

00000094 <StgReturn>:
  94:   e28dda02        add     sp, sp, #8192   ; 0x2000
  98:   e1a00007        mov     r0, r7
  9c:   ecbd8b08        vpop    {d8-d11}
  a0:   e8bd5ff0        pop     {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
  a4:   e58d0000        str     r0, [sp]
  a8:   e3a00002        mov     r0, #2, 0
  ac:   ebfffffe        bl      44 <.LdebugEnd>
  b0:   e59d0000        ldr     r0, [sp]
  b4:   e24bd01c        sub     sp, fp, #28, 0
  b8:   e8bd8ff0        pop     {r4, r5, r6, r7, r8, r9, sl, fp, pc}

and we can see that an additional `sp = sp - 4` was added.

With the log statement in StgRun, the subsequent log statements work so far.

Now I wonder
  a) could I write this logic in llvm ir directly,
     without having to resort to assembly?
  b) could I force llvm to emit 32 instead of 28 somehow, to make sure
     my sp is 8-byte aligned?
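
To make the arithmetic behind b) concrete, here is a small C sketch that adds up the stack adjustments from the two listings above. It assumes sp was 8-byte aligned on entry to StgRun (the ARM procedure call standard requires that at public interfaces, but treat it as an assumption here); the function names are made up for illustration.

```c
#include <stdint.h>

/* Total number of bytes sp has moved down between entry to StgRun and
 * the `bx r0`, following the disassembly:
 *   push of 9 registers     (r4-r9, sl, fp, lr)     ->   36 bytes
 *   optional extra padding  (0 or 4, see listings)
 *   push of 10 registers    (r4-r9, sl, fp, ip, lr) ->   40 bytes
 *   vpush {d8-d11}          (4 double registers)    ->   32 bytes
 *   sub sp, sp, #8192                               -> 8192 bytes
 */
static uint32_t bytes_below_entry(uint32_t extra_pad)
{
    return 9u * 4u + extra_pad + 10u * 4u + 4u * 8u + 8192u;
}

/* Assuming sp was 8-byte aligned on entry, it is still 8-byte aligned at
 * the `bx r0` only if the total adjustment is a multiple of 8. */
static int sp_is_8_aligned(uint32_t extra_pad)
{
    return bytes_below_entry(extra_pad) % 8u == 0u;
}

/* bytes_below_entry(0) == 8300, 8300 % 8 == 4  -> misaligned (first listing)
 * bytes_below_entry(4) == 8304, 8304 % 8 == 0  -> aligned    (second listing)
 */
```

Under that assumption, the first listing enters the STG code with sp off by 4, and the extra `sub sp, sp, #4` in the second listing happens to restore 8-byte alignment.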

Of course I'm happy to take any other ideas as well.

Cheers,
Moritz

Ok...

after some more digging it turned out that the underlying issue was a bug in my
code generator. For the record I'll just note down the issue.

My code generator generated /unpacked/ structs for simplicity, and because
I thought, incorrectly, that we (GHC) generated GEP accessors. We don't! GHC
computes absolute offsets into those structs. An /unpacked/ struct
(e.g. { i32, i64 }) does not guarantee that the i64 is at offset +4, since there
might be padding, so generating them is futile. All I needed to change was to
generate packed instead of unpacked structs.
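
The same effect can be seen from C. The sketch below is an analogy, not the code generator's output: the struct and function names are hypothetical, and `__attribute__((packed))` is a GCC/Clang extension. With the data layout quoted earlier in the thread (`i64:64`), i64 has 8-byte ABI alignment, so in the unpacked case the 64-bit field lands at offset 8 rather than 4.

```c
#include <stddef.h>
#include <stdint.h>

/* C counterpart of the unpacked LLVM struct { i32, i64 }: the compiler
 * may insert padding after `tag` so that `payload` is naturally aligned. */
struct unpacked_pair {
    int32_t tag;
    int64_t payload;
};

/* C counterpart of the packed LLVM struct <{ i32, i64 }>: no padding is
 * inserted, so `payload` directly follows `tag`. */
struct __attribute__((packed)) packed_pair {
    int32_t tag;
    int64_t payload;
};

/* Offset of the 64-bit field in each variant. */
static size_t unpacked_payload_offset(void)
{
    return offsetof(struct unpacked_pair, payload); /* 8 where int64_t is 8-byte aligned */
}

static size_t packed_payload_offset(void)
{
    return offsetof(struct packed_pair, payload);   /* always 4 */
}
```

Code that computes a hard-coded offset of +4 and then loads through it only matches the packed variant; against the unpacked layout it reads half padding and half payload.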

However, I still believe that the code gen for the C to STG bridge should add a
`sub sp, sp, 4` line to the inline assembly *if* it emits the `vstmdb sp!, {d8-d11}`
part, to ensure that the stack is 8-byte aligned.

Thank you.

Cheers,
Moritz