[This is a rather long mail because I feel the topic is extremely
important. I've never before experienced a C or C++ compiler that silently
outputs crash-burn-and-die instructions and I think I've used more than 10
of the sort.]
Hmm, I am mostly thinking of this in terms of an LLVM IR generator that
does not have the benefit of an expertly written front-end that can add
LLVM IR is not a place to put language-level diagnostics, and the users of
a compiler frontend are only interested in language-level diagnostics.
I just tried this command:
clang -c -S -O2 -fsanitize=undefined test.ll
And it didn't change anything. The ud2 instructions are still there and
there are no checks. And on Windows, this seems to yield nothing but a
mouse cursor that blinks once and then the program exits as if nothing had
happened. This is possibly caused by the fact that I always operate with
Windows Error Reporting disabled.
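UBSan works at the C/C++ language level. It doesn't work at the IR level:
clang's frontend inserts the checks while lowering C/C++ source, so
-fsanitize=undefined has nothing to instrument when the input is a .ll file.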
I am in the process of reading Regehr's and Lattner's articles on the
undefined behavior of C, C++, and Objective-C. I just don't understand why
a tool has to sort of work against you when it can easily work for you.
To a compiler writer, UB means "I don't have to handle that case". It's not
"I have to detect this case and emit a ud2". In many cases (most?), the
problem of deciding whether the program actually has undefined behavior
would require solving the halting problem. That is why UBSan is a runtime
Is there any way at all that I can be informed of the appearance of "ud2"
(and similar on other platforms) code in my program by the compiler or an
associated tool? Okay, I could do two compilations, one with -S and then
grep for "ud2", and then a second without -S, but that seems both
non-portable and slow. I tried the above command with -Wall and got no
diagnostic whatsoever even though my hand-crafted program appears to be
mostly useless junk, which can be seen by the fact that two "ud2"
instructions are output, in different places, so that the program literally
has no chance of running to completion no matter what input it gets (the
first "ud2" appears right after the frame pointer has been set up in
The program could have set up a SIGILL handler in a static initializer,
caught SIGILL upon executing the ud2, and then patched its code to
replace that instruction with a nop and continued.
Yes, that's crazy, but notice that in order to reason that "the program
literally has no chance of running to completion", you had to *assume* that
the scenario I just described would not happen. That sort of
assumption-making is basically the same thing that the optimizer does
w.r.t. UB.
A silently emitted and later executed "ud2" would make any program go
astray and chances are that the programmer incorrectly believes that
everything is almost okay, if for no other reason than that the compiler
has not told him otherwise. And, yes, I do know about module tests and so
on, but they are not perfect either. The "ud2" instruction sort of reduces
the user of LLVM IR to a user of an interpreted language - the program must
be checked as if it were written in Python or PHP, because you never know
when a "ud2" instruction might be emitted, not as if it were statically
checked by an advanced compiler. Not exactly what he or she had in mind
when adopting LLVM in the first place, I suspect.
It's not LLVM's job to ensure that your program has "correct semantics".
That is the language frontend's job; it's not the job of "an advanced
compiler": it's the most basic responsibility of a language frontend. All
LLVM knows is the correct semantics for its own IR, not your language.
I am perfectly aware that *I* am the person to ultimately blame for "ud2"
instructions in my code, but as a non-omniscient entity, I'd like to be
told when my tool discovers something that I have missed. That's the main
reason I prefer statically checked languages - the assurance that I have
been told about the little errors and that I can focus my attention on the
big things when testing. As far as I can tell, the "ud2" invalidates all
sensible assumptions regarding the static checks of the compiler. You
might get a few "ud2"s or not. Sort of like rolling a die and hoping for
the best. The above is also part of why I am personally not very fond of C
and C++, to such an extent that I stay away from coding in these
languages whenever I can. But is LLVM IR C/C++ specific?
The optimizer is only obligated to preserve a set of defined behaviors. If
you stay within those defined behaviors, your program will function
correctly, if not, then all bets are off. It is a bug in your frontend (not
your users' code) if your language claims to be "safe" (i.e. has no
undefined behaviors) and yet the program runs into LLVM IR level UB.
At the very least, I think that an early stage of the bitcode processor
ought to issue a warning if any undefined behavior, whatsoever, is
detected, so that people can learn to code differently and rest assured
that all is not as bad as it could be.
What you're missing is that detecting undefined behavior is basically
asking "will the program do X", and in general that question is
undecidable: answering it for every program would amount to solving the
halting problem. No static analysis can answer it in all cases, so you have
to add runtime checks if you want to detect it reliably. Detecting it
statically is therefore not an issue of "detect or don't detect it", but
rather one of "how many heuristics can you tack on for detecting certain
limited cases of it".
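
As a rough illustration of one such limited heuristic (the "grep for ud2"
idea from earlier in the thread), here is a minimal Python sketch. It
assumes x86 assembly text of the kind clang -S produces, with one
instruction per line; the name find_traps and the sample listing are made
up for illustration. It only reports that a ud2 was emitted, and says
nothing about whether that instruction is reachable.

```python
import re

# Matches a line whose instruction mnemonic is ud2 (x86 only).
# Assumption: one instruction per line, comments start with '#'.
TRAP_RE = re.compile(r"^\s*ud2\b")
LABEL_RE = re.compile(r"^([A-Za-z_.$][\w.$]*):")


def find_traps(asm_text):
    """Return (line_number, preceding_label) for each ud2 in asm_text."""
    current_label = None
    hits = []
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop any trailing comment
        m = LABEL_RE.match(code)
        if m:
            current_label = m.group(1)
            continue
        if TRAP_RE.match(code):
            hits.append((lineno, current_label))
    return hits


# Hand-written sample listing, not real compiler output.
SAMPLE = """\
main:
\tpushq\t%rbp
\tmovq\t%rsp, %rbp
\tud2
.LBB0_1:
\tretq
"""

if __name__ == "__main__":
    for lineno, label in find_traps(SAMPLE):
        print(f"ud2 at line {lineno} (after label {label})")
```

A scan like this is a textual check on one target's output, so it is
exactly the kind of tacked-on heuristic described above, not a substitute
for runtime checks.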
It may be my years with Ada, but I definitely think you want to tell
people about any and all undefined things they do in their source code.
The sooner, the better!
The only way to get rid of undefined behavior is to define the result of
all constructs in a language. For the reasons I mentioned above, that will
entail adding runtime checks in a Turing-complete language (i.e. any
interesting one). LLVM needs to be able to efficiently compile languages
like C/C++ without runtime checks, so it has to support the notion of
undefined behavior. UB is basically an escape hatch that simplifies static
reasoning about the runtime behavior of a program; without it, those
languages cannot be competitively optimized.
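
To make "define the result, pay with a runtime check" concrete, here is a
minimal Python sketch of what a safe frontend would conceptually wrap
around a 32-bit signed division. LLVM's sdiv is undefined for division by
zero and for INT_MIN / -1. The names checked_sdiv32 and DivisionError are
hypothetical, and Python merely stands in for the code a frontend would
generate.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


class DivisionError(Exception):
    """Hypothetical language-level error raised instead of hitting UB."""


def checked_sdiv32(a, b):
    """Model of a 32-bit signed division with every case defined."""
    for v in (a, b):
        if not INT32_MIN <= v <= INT32_MAX:
            raise ValueError("operand does not fit in 32 bits")
    # The two checks a frontend would emit before an unchecked sdiv:
    if b == 0:
        raise DivisionError("division by zero")
    if a == INT32_MIN and b == -1:
        raise DivisionError("overflow: INT32_MIN / -1")
    # sdiv rounds toward zero, unlike Python's floor division.
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q
```

A C-like language skips both branches and simply leaves those two cases
undefined, which is the trade described above.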
-- Sean Silva