LLVM on OpenBSD

Sometimes you get a clean build of llvm, sometimes you don't and instead
get a bus error.

Nonreproducible behaviour in a batch application is usually a sign of
hardware problems.

> In the linux-community, people say that bus errors are almost
> always because of faulty hardware, e.g. problem with DRAM
> timing, overheated CPU, power-supply that cannot provide enough
> power during current surges, things like that.

That is one reason a bus error might occur, but my more common
understanding of a bus error is data not properly aligned with the byte
boundaries and/or out of range memory at the physical level.

Bus errors are usually the result of pointers getting corrupted.
That may be due to a bug, or due to hardware problems.

The machine I am building on is my workstation which I use 9-4.30
mon-fri. I run all manner of apps without any problems, so if it were
bad hardware it would have shown itself by now surely.

Not really. gcc produces a different kind of load than most
applications.

As a test I got another developer to try on a different machine and he
has the same problem.

It is possible that both machines have faulty hardware, though it
reduces the probability considerably.

In another test he also tried a more aggressive
malloc.conf (a mechanism which causes malloc to do all sorts of
randomisation and page filling to test for memory based bugs) and a
completely different error was encountered:

SelectionDAG.cpp:2602: warning: converting of negative value
`-1' to `long long

If you get irreproducible bus errors, that means random rare pointer
corruption, and pointer corruption can cause almost arbitrary fault
behaviour.
So if you change the environment, pointer corruption will change the
fault behaviour, regardless of whether the corruption is due to hardware
or software.

Also we found that without specifying --enable-optimized, the
optimisations were still present:

-O3 -fomit-frame-pointer -Woverloaded-virtual -pedantic
-Wall -W -Wwrite-strings -Wno-long-long -Wunused -Wno-unused-parameter
-O3

:¬(

Can't comment on that one.

Try writing a script that populates an empty directory and does the
build. That way, you can guarantee identical environments (modulo
machine load and filesystem storage layout, but that should not
influence what batch programs like LLVM do).
If you get the same bus errors after deleting the directory and starting
over, it's probably a software problem. If the bus errors stay random,
that would increase the probability of a hardware problem.
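
A minimal sketch of that idea in POSIX sh. The names `repeat_in_fresh_dir` and `classify_runs` are made up for illustration, and the build command is whatever you pass in; nothing here is specific to LLVM's build system. It assumes a `mktemp` that supports `-d`.

```sh
#!/bin/sh
# Hypothetical helpers, not part of LLVM or OpenBSD.

# repeat_in_fresh_dir RUNS CMD [ARGS...]
# Runs CMD RUNS times, each time inside a newly created empty directory,
# and prints the exit status of each run on its own line.
repeat_in_fresh_dir() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        dir=$(mktemp -d) || return 1
        status=0
        ( cd "$dir" && "$@" ) >/dev/null 2>&1 || status=$?
        rm -rf "$dir"
        echo "$status"
        i=$((i + 1))
    done
}

# classify_runs
# Reads one exit status per line on stdin and reports whether every run
# ended with the same status. An identical status is only a hint; the
# logs still need checking to see whether the failure is at the same point.
classify_runs() {
    distinct=$(sort -u | wc -l | tr -d ' ')
    if [ "$distinct" -le 1 ]; then
        echo "consistent"
    else
        echo "varying"
    fi
}
```

It could be used as `repeat_in_fresh_dir 5 sh -c 'tar xzf /path/to/llvm.tar.gz && cd llvm-* && ./configure && gmake' | classify_runs`, where the tarball path is a placeholder.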

Regards,
Jo

Hi,

Holger Schurig wrote:
>> With 3.3.5 my first test took 5 times to produce a non "bus
>> error" build. There were no 'make cleans' in between.
>>
>> What is going on?
>
> You mean you used your bsd-ports-provided gcc to compile LLVM and
> you've got 4 times a bus-error during the build? In this case,
> it cannot be a LLVM problem.

Ok, to clarify,

I have tried the OpenBSD provided gcc-3.3.5 (which is considered the
least buggy version of gcc) and also with gcc-4.2 from ports.

Sometimes you get a clean build of llvm, sometimes you don't and instead
get a bus error.

If I understand right, the problem is that you are unable to build LLVM
because your system gcc (and another gcc you tried) tends to crash during
the build?

On several different systems.

> In the linux-community, people say that bus errors are almost
> always because of faulty hardware, e.g. problem with DRAM
> timing, overheated CPU, power-supply that cannot provide enough
> power during current surges, things like that.

That is one reason a bus error might occur, but my more common
understanding of a bus error is data not properly aligned with the byte
boundaries and/or out of range memory at the physical level.

The machine I am building on is my workstation which I use 9-4.30
mon-fri. I run all manner of apps without any problems, so if it were
bad hardware it would have shown itself by now surely.

gcc is however notorious for exposing bad memory problems.

The build also stops at exactly the same point in several different
*virtual* machines.
(the assert() in utils/TableGen/CodeGenDAGPatterns.cpp line 932)

Please stop repeating the "bad memory" mantra, that hasn't been true
for years; it is much more likely to be a bug in gcc.

As a test I got another developer to try on a different machine and he
has the same problem. In another test he also tried a more aggressive
malloc.conf (a mechanism which causes malloc to do all sorts of
randomisation and page filling to test for memory based bugs) and a
completely different error was encountered:

SelectionDAG.cpp:2602: warning: converting of negative value
`-1' to `long long

If I understand right, tweaking your system malloc caused the system
gcc to behave differently when compiling LLVM?

Sorry about that, but I wasn't very clear when passing on some error
messages to Edd, just pointing out some sloppy coding.

Changing the malloc options had no effect on the build.

Also we found that without specifying --enable-optimized, the
optimisations were still present:

-O3 -fomit-frame-pointer -Woverloaded-virtual -pedantic
-Wall -W -Wwrite-strings -Wno-long-long -Wunused -Wno-unused-parameter
-O3

--enable-optimized is not about whether or not compiler optimizations
are performed when building LLVM, it is about whether the built version
of LLVM performs internal checks when run.

Are you sure?
Makefile.rules specifically includes/excludes $(OPTIMIZE_OPTION) based
on whether ENABLE_OPTIMIZED ==1.

The reason they are always included (on OpenBSD anyway) is this:

ENABLE_OPTIMIZED is always 1 because there are some shell-script
syntax problems in configure script which possibly don't show up on
all shells.

It uses "${foo+set}" instead of "${foo:+set}" - see man sh(1) on almost any OS.
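
For anyone without sh(1) to hand, the difference between the two forms can be seen directly. This is a generic sketch using the same placeholder name `foo`; it is not the actual code from LLVM's configure script.

```sh
#!/bin/sh
# ${foo+set}  expands to "set" if foo is set at all, even to the empty string.
# ${foo:+set} expands to "set" only if foo is set AND non-empty.

unset foo
unset_plus=${foo+set}
unset_colon=${foo:+set}

foo=
empty_plus=${foo+set}
empty_colon=${foo:+set}

foo=yes
value_plus=${foo+set}
value_colon=${foo:+set}

printf 'unset: plus=[%s] colon=[%s]\n' "$unset_plus" "$unset_colon"
printf 'empty: plus=[%s] colon=[%s]\n' "$empty_plus" "$empty_colon"
printf 'value: plus=[%s] colon=[%s]\n' "$value_plus" "$value_colon"
```

The two forms only disagree in the middle case, where the variable is set but empty: the form without the colon still yields "set".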

Edd:
I'm fairly sure this is a bug in our gcc.

After fixing the syntax errors in the configure script and doing a
build with optimizations turned off it stops at the same point.

I've tested it several times in vmware and virtualbox so far, I can
try qemu if you like, but it isn't hardware related.

Perhaps refactoring the function
TreePatternNode::ApplyTypeConstraints() into several smaller functions
would help.

Regards,
Andrew Dalgleish

Sometimes you get a clean build of llvm, sometimes you don't and instead
get a bus error.

gcc makes an excellent systems test. Try this: while :; do make bootstrap && make clean; done with the FSF top-of-tree gcc. Let it run for 2 weeks. If it ever built once, it should never fail to build. If it does, I'd install a good Linux distribution on the same hardware and try again; if it still fails, look to replace the hardware. If Linux works, I'd look to replace the OS. If it failed every time in the exact same spot in the exact same way, try the last FSF release of gcc.

If it fails deterministically, that could be a gcc bug (or very bad hardware). If it fails non-deterministically, you're most likely looking at bad hardware.
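
Note that the one-liner above keeps looping even after a failed bootstrap, so a failure in the middle of the night is easy to miss. A bounded variant that stops at the first failure might look like the sketch below; `soak` is a hypothetical name, and the command to run is supplied by the caller.

```sh
#!/bin/sh
# soak MAX CMD [ARGS...]
# Hypothetical helper: reruns CMD up to MAX times, stopping at the first
# failure and reporting which run failed.
soak() {
    max=$1
    shift
    n=0
    while [ "$n" -lt "$max" ]; do
        n=$((n + 1))
        if ! "$@" >/dev/null 2>&1; then
            echo "failed on run $n"
            return 1
        fi
    done
    echo "passed $n runs"
}
```

For the gcc case that would be something like `soak 500 sh -c 'make bootstrap && make clean'`, with the run count picked to fill however long you can leave the machine alone.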

That is one reason a bus error might occur, but my more common
understanding of a bus error is data not properly aligned with the byte
boundaries and/or out of range memory at the physical level.

Absent a bad version of gcc and a bad OS, the usual culprit is bad hardware.

The machine I am building on is my workstation which I use 9-4.30
mon-fri. I run all manner of apps without any problems, so if it were
bad hardware it would have shown itself by now surely.

No. I've seen machines that work flawlessly, pass all manner of memory tests including 24+ hours of standalone memtest86, right up to the point you ask them to bootstrap gcc, then they fail, 100% of the time. Find someone with good hardware, same OS. See if the testcase that fails for you fails for them. Try to use the same gcc binaries for the test. If it passes for someone else, again, probably bad hardware.

If you live in California, I'd ask if you bought your memory at Fry's? They test all their memory to ensure it is bad, unless you buy the name-brand memory. :(

The build also stops at exactly the same point in several different
*virtual* machines.
(the assert() in utils/TableGen/CodeGenDAGPatterns.cpp line 932)

*That* makes a software problem much more probable.

Please stop repeating the "bad memory" mantra,

Bus errors with almost-random behaviour are typical of marginal
hardware.
No mantra here, just sticking with the explanation that's the most
probable one matching a specific description.

A VM abort doesn't fully exclude bad hardware.
To convince the last sceptic, you could run the same VM on two different
machines (make them as different as possible, i.e. different build
years, different vendors, different models of main board / RAM).
That would exclude the possibility that the VM always maps its virtual
RAM to the same faulty real RAM cell (or some other unknown faulty
behaviour). It would certainly exclude all hardware failures that I
could imagine :)

Regards,
Jo

Hi guys,

Edd Barrett wrote:

Hi there,

I am a student considering a compiler design based dissertation with
llvm. I am having problems building llvm on OpenBSD-current. I hope to
make a port of llvm for OpenBSD once I have figured out how to build
it.

We still have not had any luck building llvm. Since last time, we have rebuilt gcc with -O0 in case of gcc optimisation bugs, to no avail.

I believe Andrew stress-tested our gcc-3.3.5 by building itself many times over. What were the results, Andrew? FWIW gcc-4.2 shows the same error as 3.3.5 (see below).

Here is a backtrace (courtesy of Andrew) of the tblgen run, which aborts about 50% of the time when it feels like it:

#1 0x04e521a3 in abort () at /usr/src/lib/libc/stdlib/abort.c:68
#2 0x04df29d7 in __assert2 (file=0x3c0018e1 "CodeGenDAGPatterns.cpp", line=934, func=0x3c0020a6 "ApplyTypeConstraints", failedexpr=0x3c002400 "getOperator()->isSubClassOf(\"SDNodeXForm\") && \"Unknown node type!\"") at /usr/src/lib/libc/gen/assert.c:52
#3 0x1c0d0bcd in llvm::TreePatternNode::ApplyTypeConstraints (this=0x7dfc0700, TP=@0x7c0fb6c0, NotRegisters=false) at CodeGenDAGPatterns.cpp:934
#4 0x1c0cfb6b in llvm::TreePatternNode::ApplyTypeConstraints (this=0x7dfc0740, TP=@0x7c0fb6c0, NotRegisters=false) at CodeGenDAGPatterns.cpp:829
#5 0x1c0cfb6b in llvm::TreePatternNode::ApplyTypeConstraints (this=0x7c0fb7c0, TP=@0x7c0fb6c0, NotRegisters=false) at CodeGenDAGPatterns.cpp:829
#6 0x1c0d407c in llvm::TreePattern::InferAllTypes (this=0x7c0fb6c0) at CodeGenDAGPatterns.cpp:1182
#7 0x1c0d66ac in llvm::CodeGenDAGPatterns::ParsePatternFragments (this=0xcfbd1330) at CodeGenDAGPatterns.cpp:1368
#8 0x1c0d47b9 in CodeGenDAGPatterns (this=0xcfbd1330, R=@0x3c04465c) at CodeGenDAGPatterns.cpp:1225
#9 0x1c174c65 in DAGISelEmitter (this=0xcfbd1328, R=@0x3c04465c) at DAGISelEmitter.h:30
#10 0x1c173e40 in main (argc=11, argv=0xcfbd15a4) at TableGen.cpp:178

(I believe Andrew added a line to insert a breakpoint for gdb in the above backtrace)

gmake[3]: Entering directory `/usr/ports/devel/llvm/w-llvm-2.3/llvm-2.3/lib/Target/ARM'
llvm[3]: Building ARM.td register information header with tblgen
llvm[3]: Building ARM.td register names with tblgen
llvm[3]: Building ARM.td register info implementation with tblgen
llvm[3]: Building ARM.td instruction names with tblgen
llvm[3]: Building ARM.td instruction information with tblgen
assertion "getOperator()->isSubClassOf("SDNodeXForm") && "Unknown node type!"" failed: file "CodeGenDAGPatterns.cpp", line 934, function "ApplyTypeConstraints"
gmake[3]: *** [/usr/ports/devel/llvm/w-llvm-2.3/llvm-2.3/lib/Target/ARM/Debug/ARMGenInstrInfo.inc.tmp] Abort trap (core dumped)

Does this shed any light on the situation?

Thanks

As a further bit of info, the way to trigger this bug is to cd into
"lib/Target/ARM" and do "make clean && make". Around 12 files are then
built; at any given point, there's a high likelihood that one of these fails
with the assertion error Edd mentioned. Every so often all 12 compile in one
go; repeat and rinse, and the bug will trigger sooner rather than later.
Although nothing is certain, my guess (based on my own experience porting
the Converge language to various platforms) would be that this bug is quite
possibly cross platform, but is more likely to be triggered by OpenBSD's
malloc (which has a few tricks, such as allocating pages randomly across the
entire VM space, which tend to highlight such things).
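
To put a number on "sooner rather than later", and to check whether it is always the same assertion, a loop along these lines could be run from lib/Target/ARM. `tally_assertions` is a hypothetical helper; it assumes a failing run prints a line starting with `assertion`, as in Edd's log.

```sh
#!/bin/sh
# tally_assertions RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD RUNS times, keeps the first assertion
# message from each failing run, and prints each distinct message
# prefixed by how many runs produced it.
tally_assertions() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Runs that print no assertion line contribute nothing.
        "$@" 2>&1 | grep '^assertion ' | head -n 1 || true
        i=$((i + 1))
    done | sort | uniq -c
}
```

For example, `tally_assertions 20 sh -c 'make clean && make'`. A single output line would mean every failing run hit the same assertion; several different lines would point towards something more random.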

If all the files compile successfully then the simple examples (Fibonacci
etc.) included in LLVM seem to work fine. Most of the test suite seems to
pass too, although I haven't yet fully understood why some tests fail (I
don't think it's related to the above bug).

I am happy to give an account on one of my OpenBSD boxes to any LLVM
developer(s) who would like to look into this problem.

Laurie