ARM Jump table pcrelative relaxation in clang / llc

Hi,

I have written a PassManager (IR) pass that seriously increases the size of the original IR code.
As a result it seems that the generated machine code is incorrect (as of LLVM 3.5). The AsmPrinter generates the following instruction:
adr r2, .LJTI4_0_0
When it goes through the MC streamer, I get a “fatal error: error in backend: out of range pc-relative fixup”.
Apparently, the fixup does not fit in the 12 bits we have available. I would have expected clang to relax this instruction in that case. Using the -mrelax-all flag does not help.
Is there a way in PassManager::runOnFunction to anticipate this, so that I can generate IR that will fit when converted to machine code?
Strangely enough, this does not happen when using llc to generate the code from the bc file; I get the object file.
The target is armv5e-none-linux-androideabi (I used -mtriple with llc).
I have seen a similar thread from 2012, "Questions on MachineFunctionPass and relaxation of pcrel calls (ARM/thumb2)". Even though there have been improvements since then, I am concerned about the difference in behavior between the two tools.

Thanks for your help.

Eric

Hi Eric,

> As a result it seems that the generated machine code is incorrect (as of
> LLVM 3.5): The AsmPrinter generates the following instruction :
> adr r2, .LJTI4_0_0
> when going through the MC streamer, I get a "fatal error: error in backend:
> out of range pc-relative fixup" .

We've fairly recently fixed a bug that looks very similar (r238680,
which was well after 3.6)

> Apparently, the fixup does not hold in the 12 bits we have available. I
> would have expected clang to perform relaxation on this instruction on that
> particular case.

Agreed, whatever's going on it's a bug in the ARM backend.

> Is there a way in the PassManager::runOnFunction to anticipate that so that
> I can generate a IR code that would fit when converted to machine code?

Not as far as I'm aware; since the bug is further back there's no
reason to try and provide such information to earlier passes. The
backend is expected to cope with whatever you give it.

> Strangely enough, this is not happening when using llc to generate the code
> from the bc file, I get the object file.

That's weird. Even with "-filetype=obj" (the bug only occurs when
directly writing an object file)? Not that it really affects anything,
getting the same backend options with llc can be a bit tricky.

> Even though there have been
> improvements since then, I am concerned with the difference of behavior of
> the two tools.

The most common one I find perturbing output is "-mcpu" (even with a
triple), but really there are so many options front-ends can twiddle
that you just have to know what it's doing and copy that.

Cheers.

Tim.

Hi Tim,
Thank you for your answer.

> We’ve fairly recently fixed a bug that looks very similar (r238680,
> which was well after 3.6)

If I wanted to backport that to 3.5, where should I look? Where in the ARM backend is the decision to relax an instruction taken?

> That’s weird. Even with “-filetype=obj” (the bug only occurs when
> directly writing an object file)? Not that it really affects anything,
> getting the same backend options with llc can be a bit tricky.

This passes even with -filetype=obj. The transformations I apply are in the optimizer, so I must build the new bc to create the object file.

Thanks for your help

Eric

>> We've fairly recently fixed a bug that looks very similar (r238680,
>> which was well after 3.6)
>
> If I wanted to backport that to 3.5, where should I look? Where in the
> ARM backend is the decision to relax an instruction taken?

Hi Eric,

First, I'd check whether Tim's fix works for you. If you can't forward
port your pass to trunk, try to backport Tim's patch into your tree.

> This passes even with -filetype=obj. The transformations I apply are in
> the optimizer, so I must build the new bc to create the object file.

This is good news: it means that the problem is probably not in the
asm/obj emitters. Differences in behaviour between llc and clang
are normally due to target description issues, as Tim mentioned.

I'd encourage you to check on llc's object file and see how the jump
table is being lowered. It's possible that the lack of a few flags
clang passes to the back-end made that instruction not be selected
during ISel.

Essentially, "clang -target armv5t" is *not* the same as "llc -mtriple armv5t".

I'm guessing you're hitting the same bug Tim found earlier...

cheers,
--renato

It is certainly helping - Thanks Renato.

I have created a small ll file to reproduce the problem.
I used the intrinsic function llvm.arm.space to introduce space between the beginning of the code and the jump table.
If the first argument of llvm.arm.space is higher than INT_MAX (2147483647), then the bug is hit. Lower or equal to that value, it passes. It looks like a precision issue. Does this sound familiar to someone?

; ModuleID = 'test.c'
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "armv5e-none-linux-androideabi"

declare i32 @llvm.arm.space(i32, i32)

; Function Attrs: nounwind
define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  %a = alloca i32, align 4
  store i32 0, i32* %retval
  store i32 0, i32* %a, align 4
  %0 = load i32* %a, align 4
  call i32 @llvm.arm.space(i32 2147483647, i32 undef)
  switch i32 %0, label %sw.default [
    i32 0, label %sw.bb
    i32 1, label %sw.bb1
    i32 2, label %sw.bb2
    i32 3, label %sw.bb3
  ]

sw.bb:                                            ; preds = %entry
  store i32 1, i32* %retval
  br label %return

sw.bb1:                                           ; preds = %entry
  store i32 2, i32* %retval
  br label %return

sw.bb2:                                           ; preds = %entry
  store i32 3, i32* %retval
  br label %return

sw.bb3:                                           ; preds = %entry
  store i32 4, i32* %retval
  br label %return

sw.default:                                       ; preds = %entry
  br label %sw.epilog

sw.epilog:                                        ; preds = %sw.default
  store i32 0, i32* %retval
  br label %return

return:                                           ; preds = %sw.epilog, %sw.bb3, %sw.bb2, %sw.bb1, %sw.bb
  %2 = load i32* %retval
  ret i32 %2
}

; Function Attrs: nounwind
declare i32 @rand() #0

attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="true" }
attributes #1 = { nounwind }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 1, !"min_enum_size", i32 4}
!2 = !{i32 1, !"PIC Level", i32 1}
!3 = !{!"clang version 3.7.0 (trunk 229364)"}

It does look like the value in @llvm.arm.space is interpreted
incorrectly if it's bigger than INT_MAX, but that's well outside its
intended range and could inevitably be used to break ConstantIslands
(the longest ARM immediate branch is 26-bits; no indivisible entity in
.text can be bigger than that). It's probably an unrelated issue.

Also, I know I said the backend should accept any size code, but 2GB
is definitely going to trigger more edge cases than average.
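
To make the two numbers above concrete, here is a rough sketch (plain C++, not LLVM code) of what happens when a 32-bit pattern is read as a signed i32, and of how a requested gap compares with the forward reach of an ARM-mode immediate branch. The helper names are made up for illustration, and whether the backend actually reads the @llvm.arm.space operand this way is an assumption, not something confirmed in this thread.

```cpp
#include <cstdint>

// Value of a 32-bit pattern when read as a signed two's-complement i32.
// Computed in 64-bit arithmetic so the result is well defined everywhere.
int64_t asSignedI32(uint32_t bits) {
    return bits >= 0x80000000u
               ? static_cast<int64_t>(bits) - (int64_t{1} << 32)
               : static_cast<int64_t>(bits);
}

// An ARM-mode B/BL holds a signed 24-bit word offset, i.e. a signed
// 26-bit byte offset, so the farthest forward displacement is 2^25 - 4.
constexpr int64_t kArmBranchMaxForward = (int64_t{1} << 25) - 4;

// Hypothetical check: is the requested gap non-negative once read as a
// signed i32, and small enough for one immediate branch to jump over it?
bool spaceFitsArmBranchRange(uint32_t sizeBits) {
    int64_t size = asSignedI32(sizeBits);
    return size >= 0 && size <= kArmBranchMaxForward;
}
```

With these definitions, 2147483647 is still positive as a signed value but is far beyond the roughly 32MB branch reach, and anything from 2147483648 upwards reads as negative.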

Cheers.

Tim.

Hi,
I have kept working on this and found the following (as of LLVM 3.5):

  1. In the function MCObjectStreamer::EmitInstruction there is a check for the instruction being relaxable or not:

if (!Assembler.getBackend().mayNeedRelaxation(Inst)) {
EmitInstToData(Inst, STI);
return;
}

At this stage, the instruction has already been selected as ARM::ADR.
The call to mayNeedRelaxation() resolves to ARMAsmBackend::getRelaxedOpcode().
There is no handling in there for ARM::ADR. I added the following line:
case ARM::ADR: return ARM::t2ADR;

As a result, if relaxation is enabled or bundling is enabled, then the instruction is relaxed, and compilation to an object file passes.
I am not familiar enough with this to understand why there is a condition for entering the relaxation step: I had to manually set Assembler.setRelaxAll(true) to get into this step.

  2. It seems that fast instruction selection is enabled by default (even when using -O3). The problem does not appear when fast instruction selection is not used (again via a hack in the code), although the same ADR instruction is selected, since the offset to apply to the fixup is then small enough.

I am not sure I am on the right track, but as far as I understand:
1) ARM::ADR is not handled by relaxation
2) Relaxation happens under some condition in the ObjectStreamer that I don’t fully understand

What do you think?
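
For readers following along, the dispatch in MCObjectStreamer::EmitInstruction can be modelled roughly as below. This is a toy sketch with stand-in types, not LLVM code. The first branch is the check quoted above and the second matches the "relaxation enabled or bundling enabled" condition. The third branch (emit into a separate fragment so that layout can relax it later) is an assumption based on how the MC layer is generally understood to work, not something shown in this thread.

```cpp
// Hypothetical stand-in for the assembler state consulted by the streamer.
struct ToyAssemblerState {
    bool relaxAll = false;            // models Assembler.getRelaxAll()
    bool inBundleLockedGroup = false; // models bundling enabled + locked group
};

// The three ways an instruction can leave the streamer in this model.
enum class EmitPath {
    Data,              // emitted as-is into a data fragment
    RelaxedData,       // relaxed immediately, then emitted as data
    RelaxableFragment  // kept in its own fragment; layout may relax it later
};

EmitPath chooseEmitPath(bool mayNeedRelaxation, const ToyAssemblerState &st) {
    // No relaxed form is known for this opcode: nothing more to decide.
    if (!mayNeedRelaxation)
        return EmitPath::Data;
    // Relax eagerly only when explicitly requested or inside a bundle.
    if (st.relaxAll || st.inBundleLockedGroup)
        return EmitPath::RelaxedData;
    // Otherwise defer the decision to the layout phase (assumed behaviour).
    return EmitPath::RelaxableFragment;
}
```

In this model an unpatched ARM::ADR always takes the Data path, because there is no relaxed opcode for it. With the extra case added, it only reaches RelaxedData once relaxAll is set, which is consistent with having to call Assembler.setRelaxAll(true) by hand.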

Thanks,

Eric

> There is no processing in there for ARM::ADR. I added the following line:
>         case ARM::ADR: return ARM::t2ADR;
> As a result, if relaxation is enabled or bundling is enabled then the instruction is relaxed.

Unfortunately, that's not going to work at runtime, for a couple of reasons:

1. An ARM::ADR instruction is ARM-mode, but ARM::t2ADR is Thumb-mode.
They can't be mixed in the same function (to a first approximation).
It'll be interpreted as an entirely different ARM instruction if a CPU
ever sees it.
2. Even if it did what you were hoping, it only staves off the issue:
t2ADR has a limited range too, it's just longer than ADR.
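
The range limits in the second point can be sketched as follows. This is a standalone approximation built from the architectural encodings, not the check LLVM itself performs, and the function names are hypothetical. It assumes ARM-mode ADR is an ADD/SUB from PC with a "modified immediate" (an 8-bit value rotated right by an even amount), and that Thumb-2 ADR.W adds or subtracts a plain 12-bit immediate from the word-aligned PC.

```cpp
#include <cstdint>

// True if v is an ARM-mode modified immediate: an 8-bit value rotated
// right by an even amount (0, 2, ..., 30).
bool isArmModifiedImm(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        // Rotating left by rot undoes a right rotation by rot.
        uint32_t undone = (rot == 0) ? v : ((v << rot) | (v >> (32 - rot)));
        if (undone <= 0xFFu)
            return true;
    }
    return false;
}

// ARM mode: PC reads as the instruction address plus 8, and the distance
// (in either direction) must be a modified immediate.
bool armAdrReaches(int64_t adrAddr, int64_t target) {
    int64_t delta = target - (adrAddr + 8);
    int64_t mag = delta < 0 ? -delta : delta;
    if (mag > int64_t{0xFFFFFFFF})
        return false;
    return isArmModifiedImm(static_cast<uint32_t>(mag));
}

// Thumb-2 ADR.W: PC reads as the instruction address plus 4, aligned down
// to a word boundary, and the offset is a 12-bit add or subtract.
bool thumb2AdrReaches(int64_t adrAddr, int64_t target) {
    int64_t base = (adrAddr + 4) & ~int64_t{3};
    int64_t delta = target - base;
    return delta >= -4095 && delta <= 4095;
}
```

Under this model a word-aligned jump table is always reachable from an ARM-mode ADR only while the distance stays at or below 1020 bytes; beyond that it depends on the exact bit pattern of the offset. The Thumb-2 form reaches any target within 4095 bytes and nothing further, so neither gives a guarantee once the code between the ADR and the table can grow arbitrarily.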

> I am not sure I am on the right track, but as far as I understand:
> 1) ARM::ADR is not handled by relaxation
> 2) Relaxation happens under some condition in the ObjectStreamer that I don't fully understand

As suggested by the second problem above, relaxation is not the
correct approach. There is no instruction that we can guarantee will
reach the jump table. There are two plausible ways to fix it (that I
could think of):

1. Enhance ARMConstantIslands.cpp to move the jump table in range if
needed (this is what we did on trunk, see r238680).
2. Fuse the ADR to the jump-table with a pseudo-instruction when
they're first created and expand them much later. This is uglier, but
might be a simpler way to do it.

Of course, the real solution is the usual recommendation to track
trunk wherever possible. Getting stuck on 3.5 is a recipe for ongoing
pain.

Cheers.

Tim.