How to track down a kernel miscompilation?

Hi,

I am trying to build the Linux kernel with LLVM.
'ARCH=um' appears to work; now I am trying to get 'ARCH=x86' to work.

So far it seems there is something wrong with the boot vga code (it
finds no video modes), the acpi code, and the serial console code.

I am now trying to compile drivers/ with llvm-gcc and the rest with
gcc-4.2 (I have a wrapper script). I am compiling to native code, not
.bc files.
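
A wrapper of the kind mentioned above could look roughly like the following. This is only a sketch: the names `LLVM_DIRS` and `pick_compiler` are made up for illustration, and it assumes the build passes source paths relative to the top of the kernel tree.

```python
import os
import sys

# Hypothetical layout: directories (relative to the kernel tree) whose
# sources go through llvm-gcc; everything else is built with gcc-4.2.
LLVM_DIRS = ("drivers/",)


def pick_compiler(argv, llvm_dirs=LLVM_DIRS):
    """Return the compiler name to use for one compiler command line."""
    sources = [a for a in argv if a.endswith(".c")]
    for src in sources:
        # Normalize "./drivers/foo.c" to "drivers/foo.c" before matching.
        path = os.path.normpath(src).replace(os.sep, "/")
        if any(path.startswith(d) for d in llvm_dirs):
            return "llvm-gcc"
    return "gcc-4.2"


def run_wrapper(args=None):
    """Replace this process with the chosen compiler, passing flags through."""
    args = sys.argv[1:] if args is None else args
    compiler = pick_compiler(args)
    os.execvp(compiler, [compiler] + args)
```

Installed as the compiler (for example via `make CC=...`) with `run_wrapper()` called at the end of the script, narrowing `LLVM_DIRS` to a single subdirectory or file is then a one-line change.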

When compiling drivers/serial/serial_core.c with gcc-4.2 it boots fine,
but when compiled with llvm-gcc, I get a kernel panic.

How should I proceed further to find where the miscompilation is?
Can bugpoint help me here? (it takes ~10 seconds to compile and boot a
failing kernel with kvm/qemu)

This is 2.6.28-rc6-tip, and LLVM SVN r59914.

[ 0.000000] Detected 2832.689 MHz processor.
[ 0.004000] Console: colour VGA+ 80x25
[ 0.004000] BUG: unable to handle kernel NULL pointer dereference at
00000000000000e8
[ 0.004000] IP: [<ffffffff802b3680>] kmem_cache_alloc+0x30/0xa0
[ 0.004000] PGD 0
[ 0.004000] Thread overran stack, or stack corrupted
[ 0.004000] Oops: 0000 [#1] PREEMPT SMP
[ 0.004000] last sysfs file:
[ 0.004000] CPU 0
[ 0.004000] Modules linked in:
[ 0.004000] Pid: 0, comm: swapper Not tainted 2.6.28-rc6-tip #165
[ 0.004000] RIP: 0010:[<ffffffff802b3680>] [<ffffffff802b3680>]
kmem_cache_alloc+0x30/0xa0
[ 0.004000] RSP: 0018:ffffffff8075dd48 EFLAGS: 00010086
[ 0.004000] RAX: 0000000000000000 RBX: ffffffff806e2320 RCX:
ffffffff80213546
[ 0.004000] RDX: 0000000000002580 RSI: 00000000000000d0 RDI:
0000000000000000
[ 0.004000] RBP: ffffffff8075dd68 R08: 0000000000000008 R09:
000000000000006e
[ 0.004000] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000286
[ 0.004000] R13: 00000000000000d0 R14: ffffffff80847500 R15:
0000000000000000
[ 0.004000] FS: 0000000000000000(0000) GS:ffffffff80757ec0(0000)
knlGS:0000000000000000
[ 0.004000] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 0.004000] CR2: 00000000000000e8 CR3: 0000000000201000 CR4:
00000000000006a0
[ 0.004000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 0.004000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 0.004000] Process swapper (pid: 0, threadinfo ffffffff8075c000,
task ffffffff806e2320)
[ 0.004000] Stack:
[ 0.004000] ffffffff806e2320 ffffffff806e2320 0000000000000000
ffffffff80847500
[ 0.004000] ffffffff8075dd88 ffffffff80213546 ffffffff806e2320
ffffffff8075c000
[ 0.004000] ffffffff8075dda8 ffffffff8020d6b9 0000000000000001
0000000000000000
[ 0.004000] Call Trace:
[ 0.004000] [<ffffffff80213546>] init_fpu+0x106/0x130
[ 0.004000] [<ffffffff8020d6b9>] math_state_restore+0x89/0xc0
[ 0.004000] [<ffffffff80588739>] do_device_not_available+0x9/0x10
[ 0.004000] [<ffffffff8020d2a5>] device_not_available+0x15/0x20
[ 0.004000] [<ffffffff8045a5df>] ? uart_set_options+0xf/0xf0
[ 0.004000] [<ffffffff8045ca5f>] ? uart_parse_options+0x2f/0x90
[ 0.004000] [<ffffffff807896a8>] serial8250_console_setup+0xa8/0xc0
[ 0.004000] [<ffffffff8023801e>] register_console+0x28e/0x2f0
[ 0.004000] [<ffffffff807899c5>] serial8250_console_init+0x155/0x160
[ 0.004000] [<ffffffff80788542>] console_init+0x32/0x50
[ 0.004000] [<ffffffff80765d30>] start_kernel+0x230/0x3e0
[ 0.004000] [<ffffffff8076527c>] x86_64_start_reservations+0x7c/0xc0
[ 0.004000] [<ffffffff807653bb>] x86_64_start_kernel+0xcb/0xf0
[ 0.004000] Code: 83 ec 20 4c 89 6c 24 10 48 89 1c 24 41 89 f5 4c 89
64 24 08 4c 89 74 24 18 48 8b 4d 08 9c 41 5c fa 65 8b 04 25 24 00 00 00
48 98 <48> 8b 94 c7 e8 00 00 00 48 8b 1a 44 8b 72 18 48 85 db 74 44 8b
[ 0.004000] RIP [<ffffffff802b3680>] kmem_cache_alloc+0x30/0xa0
[ 0.004000] RSP <ffffffff8075dd48>
[ 0.004000] CR2: 00000000000000e8
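
The oops can be read a little further by hand. The sketch below decodes only the one faulting instruction (the bytes starting at the `<48>` marker in the "Code:" line), assuming the standard x86-64 ModRM/SIB encoding, and recomputes the address it touched from the register dump. The helper name `effective_address` is made up for this example.

```python
# Register values copied from the oops above.
regs = {"rax": 0x0, "rdi": 0x0}

# Faulting bytes from the "Code:" line, starting at the <48> marker.
insn = bytes.fromhex("488b94c7e8000000")


def effective_address(insn, regs):
    """Decode one form only: REX.W mov r64, [base + index*scale + disp32]."""
    assert insn[0] == 0x48 and insn[1] == 0x8B  # REX.W prefix, MOV r64, r/m64
    modrm, sib = insn[2], insn[3]
    # mod == 10b means a 32-bit displacement; rm == 100b means a SIB byte follows.
    assert modrm >> 6 == 0b10 and modrm & 7 == 0b100
    names = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
    scale = 1 << (sib >> 6)
    index = names[(sib >> 3) & 7]
    base = names[sib & 7]
    disp = int.from_bytes(insn[4:8], "little", signed=True)
    address = regs[base] + regs[index] * scale + disp
    return address, (base, index, scale, disp)


address, parts = effective_address(insn, regs)
print(hex(address), parts)
```

The instruction is a load from `0xe8(%rdi,%rax,8)`, and with RDI and RAX both zero the address is 0xe8, which matches CR2. Since RDI holds the first argument under the x86-64 calling convention, this suggests kmem_cache_alloc was called with a NULL cache pointer, rather than the fault being inside the allocator itself.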

Best regards,
--Edwin

To get x86 to work, you need to compile /arch/x86/boot and
/arch/x86/boot/compressed with gcc. The rest can be compiled with
llvm-gcc. The only adjustment you need to make is to
/arch/x86/kernel/signal_32.c to fix the stack offset in sys_sigreturn.

Also, you must compile with -O0: instcombine is causing the crashes
you are seeing. I know mem2reg, sccp, simplifycfg, and dce result in
a working kernel. (My build script does llvm-gcc -> opt with custom
passes for my project -> llc.) You also must compile arch/x86/lib with
-O2 as it has inline asm that the fast regalloc can't handle.
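
As a rough illustration of the llvm-gcc -> opt -> llc flow described above, the following sketch builds the three command lines for one source file using the passes reported to work. The function name and the output file naming are assumptions made for this example, the kernel's own compiler flags are omitted, and depending on the LLVM revision the tools may need additional options.

```python
def build_commands(src, opt_passes=("mem2reg", "sccp", "simplifycfg", "dce")):
    """Return the llvm-gcc, opt and llc command lines for one C file."""
    stem = src[:-2] if src.endswith(".c") else src
    bc, opt_bc, asm = stem + ".bc", stem + ".opt.bc", stem + ".s"
    return [
        # Front end only: emit unoptimized bitcode.
        ["llvm-gcc", "-O0", "-emit-llvm", "-c", src, "-o", bc],
        # Run an explicit, hand-picked pass list instead of -O2.
        ["opt"] + ["-" + p for p in opt_passes] + [bc, "-o", opt_bc],
        # Lower the transformed bitcode to native assembly.
        ["llc", opt_bc, "-o", asm],
    ]


for cmd in build_commands("drivers/serial/serial_core.c"):
    print(" ".join(cmd))
```

Keeping the pass list as data makes it easy to add a suspect pass such as instcombine for a single file and see whether the kernel still boots.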

Last time I tried to specifically find the transform pass that caused
a crash, I had to find the specific file that was affected by the
transform and compare the pre and post transformed files to see what
was wrong. At the time (~ 1 year ago) it was almost always
instcombine and almost always volatile related. Now bugpoint has some
ways to run the code by some external agent which I haven't
investigated yet.
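
Finding the affected file can be automated with a plain binary search. This is a minimal sketch, assuming exactly one culprit file and a deterministic failure; `kernel_breaks` is a hypothetical callback that builds the given files with the suspect compiler or pass (and everything else with the known-good setup), boots the result, and reports whether it crashed.

```python
def find_bad_file(files, kernel_breaks):
    """Binary-search for the single file whose suspect build breaks the kernel.

    kernel_breaks(subset) must return True if the kernel fails when only
    the files in `subset` are built with the suspect compiler or pass.
    """
    candidates = list(files)
    if not candidates:
        raise ValueError("need at least one candidate file")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if kernel_breaks(first_half):
            candidates = first_half
        else:
            # Single-culprit assumption: it must be in the other half.
            candidates = candidates[mid:]
    return candidates[0]
```

With a build-and-boot cycle of about 10 seconds, a few hundred files narrow down to one in under ten iterations. If several files are miscompiled at once, the single-culprit assumption no longer holds and a delta-debugging style search would be needed instead.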

Oh and if you are trying to construct an entire kernel bitcode file,
the following rule will help:

cmd_ll_as = echo "module asm \"" > $@ ; $(ASMCC) $(a_flags) -c \
        -o - -S $< | perl -npe s/\"/\\\\22/g >> $@ ; echo "\"" >> $@
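
For readers who find the quoting in that rule hard to follow: after make and the shell have processed the escapes, the perl expression replaces every double quote in the preprocessed assembly with `\22`, and the two echo commands wrap the result in `module asm "` and `"`. The same transformation, written as a small sketch (the function name is made up for this example):

```python
def wrap_module_asm(asm_text):
    """Wrap assembly text in an LLVM 'module asm' block, as the rule does."""
    # Inside an LLVM string a double quote is written as the escape \22,
    # so every '"' in the assembly text has to be replaced.
    escaped = asm_text.replace('"', "\\22")
    # echo "module asm \"" and echo "\"" each add a trailing newline.
    return 'module asm "\n' + escaped + '"\n'


print(wrap_module_asm('.section .rodata\n.ascii "hi"\n'), end="")
```

Like the original rule, this escapes only double quotes and leaves everything else in the assembly text untouched.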

Andrew

> [...] was wrong. At the time (~ 1 year ago) it was almost always
> instcombine and almost always volatile related. Now bugpoint has some
> ways to run the code by some external agent which I haven't
> investigated yet.

People have fixed a ton of bugs (volatile and otherwise) in instcombine since then. I will be very interested to hear what pass ends up being the problem here, and what exactly the code is that triggers the bug. Please post these when the results are known.

John Regehr

For people who want to try hacking on the linux kernel with llvm, here
is a head start:

http://llvm.org/~alenhar2/k.tbz

This is a 2.6.27.5 kernel with a .config file for qemu/kvm with virtio
devices (I also think it will work with the default devices). The
build process uses llvm-gcc -> opt -> llc so you can add your own
(per-file) passes to the build process or debug specific passes.

Directories arch/x86/boot and arch/x86/boot/compressed specify NOLLVM
in their makefile, which causes the make system to use gcc to build the
directory. Options to opt are controlled by OPT_OPTIONS in the top
level makefile.

Some caveats and notes:
* This has only been tested in kvm and qemu, and mostly only with
virtio net and block devices. Other configurations are untested.
* The kernel is compiled at -O0 with some optimizations in the opt
step. This is because...
* instcombine breaks the kernel.

Andrew

Yes, the rash of volatile bug fixes made getting 2.6 working much much
easier than getting 2.4 originally was. Unfortunately I am in the
middle of prepping a paper so I can't track down the current
instcombine bug, but I've posted a building and working kernel should
other people want to play with llvm compiled linux kernels.

Andrew

One other note I forgot...

sys_sigreturn (arch/x86/kernel/signal_32.c) has a magic fudge factor
to fix up stack layout differences between llvm and gcc. If you get
signal handling errors when you hit userspace, look at the frame
addresses printed out and adjust the fudge factor to make the frame
addresses match. (The code automatically enables the debug output
after the first signal frame address error.) The fudge factor depends
on the optimizations and register allocator used, but otherwise
seems stable for a given configuration.

Andrew

Sorry, one last note:

The latest version I tested is revision 59302 of llvm and revision
59286 of llvm-gcc.

Andrew

Hey that is great to hear that the volatile stuff is helping someone.

Just broadly speaking do you know if the instcombine bug involves pointer code vs. scalar? I ask because intensive random testing has not found the bug that you are seeing. That says that either (1) the bug lies in a part of the program space we don't explore or (2) it does, but we haven't run the tests for long enough. The former sounds most likely to me. If I knew more about what kind of code evoked the problem we could give high priority to adding the appropriate extensions to our testcase generator.

John

> Hey that is great to hear that the volatile stuff is helping someone.

Yes, quite a bit. For the 2.4 port we did, we had to throw in asm
no-ops with "clobbers memory" to enforce volatile. None of that was
necessary this time.

> Just broadly speaking do you know if the instcombine bug involves pointer
> code vs. scalar? I ask because intensive random testing has not found the
> bug that you are seeing. That says that either (1) the bug lies in a part
> of the program space we don't explore or (2) it does, but we haven't run
> the tests for long enough. The former sounds most likely to me. If I
> knew more about what kind of code evoked the problem we could give high
> priority to adding the appropriate extensions to our testcase generator.

I don't know. If Edwin doesn't beat me to it (and he probably will),
it will get tracked down to the file or files.

Andrew