Clearing the BSS section

Hi,

I am writing a function that clears the BSS section on a Cortex-M4 embedded system.

The LLVM (version 3.7.0rc3) code I wrote is :
;------------
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv7em-none--eabi"

@__bss_start = external global i32
@__bss_end = external global i32

define void @clearBSS () nounwind {
entry:
  br label %bssLoopTest

bssLoopTest:
  %p = phi i32* [@__bss_start, %entry], [%p.next, %bssLoop]
  %completed = icmp eq i32* %p, @__bss_end
  br i1 %completed, label %clearCompleted, label %bssLoop

bssLoop:
  store i32 0, i32* %p, align 4
  %p.next = getelementptr inbounds i32, i32* %p, i32 1
  br label %bssLoopTest

clearCompleted:
  ret void
}
;------------

This code runs. But when I optimize it with :
  opt -disable-simplify-libcalls -Os -S source.ll -o optimized.ll

I get the following code for the @clearBSS function :
;------------
define void @clearBSS() nounwind {
entry:
  br label %bssLoop

bssLoop: ; preds = %entry, %bssLoop
  %p1 = phi i32* [ @__bss_start, %entry ], [ %p.next, %bssLoop ]
  store i32 0, i32* %p1, align 4
  %p.next = getelementptr inbounds i32, i32* %p1, i32 1
  %completed = icmp eq i32* %p.next, @__bss_end
  br i1 %completed, label %clearCompleted, label %bssLoop

clearCompleted: ; preds = %bssLoop
  ret void
}
;------------
The optimizer has transformed the while loop into a repeat-until loop.

I think it assumes the two variables @__bss_start and @__bss_end are distinct. But they are resolved at link time, and they are the same if the BSS section is empty : in this case, the optimized function fails, because the loop body runs once before the first comparison.

Is there a way to prevent the optimizer from assuming the two variables are distinct ? Or what is the proper way to deal with link-time values ?

Thanks,

Pierre Molinaro

Hi,

I am writing a function that clears the BSS section on a Cortex-M4 embedded system.

I assume that, for some reason, the operating system is not demand-paging in zeroed memory. Is that correct?

I think it assumes the two variables @__bss_start and @__bss_end are distinct. But they are resolved at link time, and they are the same if the BSS section is empty : in this case, the optimized function fails.

Is there a way to prevent the optimizer from assuming the two variables are distinct ? Or what is the proper way to deal with link-time values ?

Have you tried using the memset intrinsic? You could cast bss_start and bss_end to integers, subtract them to find the length, and then use memset to zero the memory. I would think memset should work if the length is zero.

Regards,

John Criswell
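
The memset approach suggested above could be sketched roughly as follows. This is an illustration only, written in the typed-pointer syntax of the LLVM 3.7 era; the five-operand intrinsic signature with an explicit alignment operand belongs to older releases, and newer LLVM versions use a different form. The symbol and function names are reused from the original post. Depending on the target, the code generator may lower the intrinsic to a call into a runtime library instead of inline code.

;------------
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv7em-none--eabi"

@__bss_start = external global i32
@__bss_end = external global i32

; LLVM 3.7-era signature: dest, value, length, alignment, isvolatile
declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i32, i1)

define void @clearBSS() nounwind {
entry:
  ; Compute the section length in bytes from the two link-time addresses
  %start = ptrtoint i32* @__bss_start to i32
  %end = ptrtoint i32* @__bss_end to i32
  %len = sub i32 %end, %start
  ; Zero %len bytes starting at @__bss_start (assumed 4-byte aligned)
  %dst = bitcast i32* @__bss_start to i8*
  call void @llvm.memset.p0i8.i32(i8* %dst, i8 0, i32 %len, i32 4, i1 false)
  ret void
}
;------------

If both symbols resolve to the same address, %len is zero and the memset writes nothing.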

Make one of them weak.

Joerg

I had thought of using the memset intrinsic; unfortunately, I did not succeed in cross-compiling compiler-rt on my Mac.

Regards,

Pierre Molinaro

I don’t think you need compiler-rt to use the memset intrinsic. I think the code generator will generate efficient inline code for it (though I’m not certain). In any event, Joerg’s suggestion of making one external weak sounds a lot easier. :)

Regards,

John Criswell

You can use this:

@__bss_start = extern_weak externally_initialized global i32
@__bss_end = extern_weak externally_initialized global i32

-Krzysztof

Without the -disable-simplify-libcalls option, opt generates a call to the memset intrinsic, and llc in turn generates a call to __aeabi_memset that remains unresolved at link time. That is why I think that compiler-rt is needed here.

I have changed the global declarations to :
;------------
@__bss_start = external global i32
@__bss_end = weak global i32 0
;------------

But no change : the optimized code is still a repeat until.

I have also tried, without success :
;------------
@__bss_start = extern_weak global i32
@__bss_end = extern_weak global i32
;------------

Regards,

Pierre Molinaro

The declaration suggested by Krzysztof is the solution :
;------------
@__bss_start = extern_weak externally_initialized global i32
@__bss_end = extern_weak externally_initialized global i32
;------------

Now, the optimized code generated by opt is :
;------------
define void @clearBSS() nounwind {
entry:
  br i1 icmp eq (i32* @__bss_start, i32* @__bss_end), label %clearCompleted, label %bssLoop.preheader

bssLoop.preheader: ; preds = %entry
  br label %bssLoop

bssLoop: ; preds = %bssLoop.preheader, %bssLoop
  %p1 = phi i32* [ %p.next, %bssLoop ], [ @__bss_start, %bssLoop.preheader ]
  store i32 0, i32* %p1, align 4
  %p.next = getelementptr inbounds i32, i32* %p1, i32 1
  %completed = icmp eq i32* %p.next, @__bss_end
  br i1 %completed, label %clearCompleted.loopexit, label %bssLoop

clearCompleted.loopexit: ; preds = %bssLoop
  br label %clearCompleted

clearCompleted: ; preds = %clearCompleted.loopexit, %entry
  ret void
}
;------------

Thank you for the time you spent helping me solve my problem.

Pierre Molinaro

It'd be better to make them both zero-sized, like an array of i8 with zero
elements. This idiom has come up before, and that's our recommended
solution. We've tweaked the optimizers to ensure that zero-sized objects
are not assumed to be distinct.
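
A minimal sketch of what the zero-sized declarations might look like, again in LLVM 3.7-era typed-pointer syntax. The bitcasts to i32* and the plain getelementptr (without inbounds, since the pointer steps outside a zero-sized object) are assumptions made for this illustration and are not taken from the thread; the loop is otherwise the original one.

;------------
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv7em-none--eabi"

; Zero-sized objects: per the advice above, the optimizer should not
; assume that they live at distinct addresses
@__bss_start = external global [0 x i8]
@__bss_end = external global [0 x i8]

define void @clearBSS() nounwind {
entry:
  ; View both symbols as word pointers (assumes a 4-byte aligned section)
  %start = bitcast [0 x i8]* @__bss_start to i32*
  %end = bitcast [0 x i8]* @__bss_end to i32*
  br label %bssLoopTest

bssLoopTest:
  %p = phi i32* [ %start, %entry ], [ %p.next, %bssLoop ]
  ; Test before the first store, so an empty section writes nothing
  %completed = icmp eq i32* %p, %end
  br i1 %completed, label %clearCompleted, label %bssLoop

bssLoop:
  store i32 0, i32* %p, align 4
  %p.next = getelementptr i32, i32* %p, i32 1
  br label %bssLoopTest

clearCompleted:
  ret void
}
;------------

When checking the output of opt, the thing to look for is an initial comparison of the two symbols that guards the loop, as in the extern_weak result shown earlier in the thread.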