LLVM Language Reference Strictness

Hello,

I'd like to write a program that performs static analysis of code at the LLVM assembly/bitcode level, and to do so I plan on extensively referencing the language reference. As I hope to eventually use this tool as part of a security analysis of untrusted code, I need to be rather strict in my interpretation of the document. As such, I have some questions about how the implementers interpret the document (each question assumes we're considering a single fixed release version):

1. Is http://www.llvm.org/releases/<version>/docs/LangRef.html the most authoritative reference for a given version aside from the source code itself?

2. Are target-specific behaviors documented for each supported target?

3. Does undefined behavior semantically invalidate the entire program or is its unpredictable effect limited in scope somehow?

4. Are any behaviors undefined by virtue of not being specified in the reference, or are all scenarios that lead to undefined behavior explicitly identified as such?

5. Are there any language features with non-performance-related semantic import (e.g., annotations, instructions, intrinsic functions, types, etc.) that are not specified by the reference but are nevertheless implemented in the build system?

6. Are all deviations from the reference, no matter how minor, considered bugs (either in the implementation or the spec)? If not, what deviations are considered acceptable? If so, is it expected that all such discovered and possibly corrected deviations will have associated bug reports, or might some be corrected in the development repository without documentation of the issue outside of a commit message? In other words, if I'm working with, say, llvm 2.9 and want to find all deviations known to upstream, can I just browse bug reports or will I have to go through commit logs as well?

These are the questions I have for now, but I may have more as I go along. Is this the appropriate place to ask this kind of thing?

Thanks,
Shea Levy

Hello,

I'd like to write a program that performs static analysis of code at the
LLVM assembly/bitcode level, and to do so I plan on extensively
referencing the language reference. As I hope to eventually use this
tool as part of a security analysis of untrusted code, I need to be
rather strict in my interpretation of the document. As such, I have some
questions about how the implementers interpret the document (each
question assumes we're considering a single fixed release version):

1. Is http://www.llvm.org/releases/<version>/docs/LangRef.html the most
authoritative reference for a given version aside from the source code
itself?

Yes.

2. Are target-specific behaviors documented for each supported target?

When anything has target-specific behavior, that fact should be
documented. Beyond that, if you have a question about what some
construct is supposed to do, please ask.

3. Does undefined behavior semantically invalidate the entire program or
is its unpredictable effect limited in scope somehow?

There is no limit to the scope of undefined behavior.
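For concreteness, a minimal IR sketch of one such case (LangRef specifies
that division by zero with udiv or sdiv is undefined): once the udiv below
executes with %d equal to zero, the undefinedness is not confined to that
instruction, or even to @f.

    define i32 @f(i32 %n, i32 %d) {
    entry:
      %q = udiv i32 %n, %d    ; undefined behavior whenever %d is 0
      ret i32 %q
    }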

4. Are any behaviors undefined by virtue of not being specified in the
reference, or are all scenarios that lead to undefined behavior
explicitly identified as such?

We really want to explicitly identify them all in the reference; if
you have a question about some specific case, please ask.

5. Are there any language features with non-performance-related semantic
import (e.g., annotations, instructions, intrinsic functions, types, etc.)
that are not specified by the reference but are nevertheless implemented
in the build system?

You should be able to analyze the semantics of IR accurately based
purely on information encoded into the IR. Every instruction, type,
attribute etc. should be documented in LangRef. Platform-specific
intrinsics are not documented, but can generally be treated like a
call to an external function.
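For instance (a sketch only; @llvm.x86.sse2.pause is merely an example of a
target-specific intrinsic, and whether it exists depends on your LLVM version
and configured targets), such an intrinsic shows up in the IR as an ordinary
declaration plus call site, which a conservative analysis can model like any
other call to an unknown external function:

    ; target-specific intrinsic: declared and called like an external function
    declare void @llvm.x86.sse2.pause()

    define void @spin_hint() {
    entry:
      call void @llvm.x86.sse2.pause()
      ret void
    }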

6. Are all deviations from the reference, no matter how minor,
considered bugs (either in the implementation or the spec)? If not, what
deviations are considered acceptable?

If the reference doesn't describe the implementation accurately, we
consider it a bug. Granted, some bugs are relatively low-priority.

If so, is it expected that all
such discovered and possibly corrected deviations will have associated
bug reports, or might some be corrected in the development repository
without documentation of the issue outside of a commit message? In other
words, if I'm working with, say, llvm 2.9 and want to find all
deviations known to upstream, can I just browse bug reports or will I
have to go through commit logs as well?

LLVM Bugzilla doesn't contain an entry for every bug; to find every
fix, you'll have to go through commit logs. Not sure what you're
trying to do here, though.

These are the questions I have for now, but I may have more as I go
along. Is this the appropriate place to ask this kind of thing?

Yes.

-Eli

2. Are target-specific behaviors documented for each supported target?

When anything has target-specific behavior, that fact should be
documented. Beyond that, if you have a question about what some
construct is supposed to do, please ask.

What I meant was: for a given target-specific behavior, is there anywhere I can look to see what the behavior specifically is for, say, i686-pc-linux, like you are supposed to be able to for implementation-defined behaviors in C?

5. Are there any language features with non-performance-related semantic
import (e.g., annotations, instructions, intrinsic functions, types, etc.)
that are not specified by the reference but are nevertheless implemented
in the build system?

You should be able to analyze the semantics of IR accurately based
purely on information encoded into the IR. Every instruction, type,
attribute etc. should be documented in LangRef. Platform-specific
intrinsics are not documented, but can generally be treated like a
call to an external function.

Platform-specific intrinsics are not documented anywhere, or just not in the language reference?

If so, is it expected that all
such discovered and possibly corrected deviations will have associated
bug reports, or might some be corrected in the development repository
without documentation of the issue outside of a commit message? In other
words, if I'm working with, say, llvm 2.9 and want to find all
deviations known to upstream, can I just browse bug reports or will I
have to go through commit logs as well?

LLVM Bugzilla doesn't contain an entry for every bug; to find every
fix, you'll have to go through commit logs. Not sure what you're
trying to do here, though.

Some more detail on my project: I'm mostly doing this so I can get introduced to the field of static analysis, learn what its big problems are and what's just impossible with it, etc. To that end, however, I've decided to try to implement a set of checks that might actually be useful, to me at least. In particular, I want to see how many of the run-time checks made in hardware when a CPU is in user mode and memory is segmented can be proven to be unnecessary at compile time. The (probably impossible) end goals of this project would be a) that every program which passes its checks would be as safe to run in kernel mode with full memory access as it would be in user mode and b) that a not-insignificant subset of well-written programs passes its checks. If I ever reach the point that I'm actually using this thing to run untrusted code in kernel mode, I'll want to know about as many deviations from the spec as possible to know if they might affect the reasoning my program uses.
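To make that concrete, here is a minimal sketch (in the typed-pointer IR
syntax of the 2.9 era; newer releases spell getelementptr differently) of the
kind of access such a checker could discharge statically: the array is a
fixed-size stack allocation and the index is a constant within bounds, so no
hardware bounds or segment check is needed for the store.

    define void @provably_in_bounds() {
    entry:
      %buf = alloca [4 x i32]                                       ; fixed-size stack array
      %slot = getelementptr inbounds [4 x i32]* %buf, i32 0, i32 2  ; constant, in-bounds index
      store i32 7, i32* %slot
      ret void
    }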

Thanks for the help,
Shea Levy

That would be a very useful thing to have for embedded systems. Some,
such as uClinux, run ports of "safe" operating systems with the safety
stripped out, whereas others, like Texas Instruments' DSP/BIOS, run
entirely as a single operating system kernel.

2. Are target-specific behaviors documented for each supported target?

When anything has target-specific behavior, that fact should be
documented. Beyond that, if you have a question about what some
construct is supposed to do, please ask.

What I meant was: for a given target-specific behavior, is there
anywhere I can look to see what the behavior specifically is for, say,
i686-pc-linux, like you are supposed to be able to for
implementation-defined behaviors in C?

For the level of specificity you're looking for, just the source code itself. The LLVM IR language documentation is not, and isn't intended to be, a true language standard document in the same way that the C or C++ standards are. For any given case, check the docs first, and if your question isn't answered there, check the source code of the target(s) you're interested in.
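As a small illustration of that split (the datalayout string below is
abbreviated and purely illustrative; take the real one for i686-pc-linux from
what clang or llc emits for that triple): a few target-specific facts are
recorded directly in the module as the triple and datalayout, while everything
else lives only in the target's source code.

    target triple = "i686-pc-linux-gnu"
    target datalayout = "e-p:32:32:32-i32:32:32-i64:32:64-f64:32:64"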

Regards,
  Jim

For the level of specificity you're looking for, just the source code itself. The LLVM IR language documentation is not, and isn't intended to be, a true language standard document in the same way that the C or C++ standards are. For any given case, check the docs first, and if your question isn't answered there, check the source code of the target(s) you're interested in.

And once you've understood, submit a doc patch explaining it :-)

Ciao, Duncan.

You may want to read the early SAFECode papers (http://sva.cs.illinois.edu/pubs.html). The paper "Memory Safety without Run-time Checks or Garbage Collection" (http://llvm.org/pubs/2003-05-05-LCTES03-CodeSafety.html) and "Ensuring Code Safety Without Runtime Checks for Real-Time Control Systems" (http://llvm.org/pubs/2002-08-08-CASES02-ControlC.html) are particularly relevant and, I believe, would allow you to run user code in kernel space safely without resorting to run-time checks.

You might also want to read the SVA paper from Usenix Security 2009 (http://llvm.org/pubs/2009-08-12-UsenixSecurity-SafeSVAOS.html) and the HyperSafe paper (http://www.csc.ncsu.edu/faculty/jiang/pubs/OAKLAND10.pdf) to get an idea of other memory safety concerns beyond your standard compiler loads and stores. Although these papers describe memory safety issues for OS kernel/hypervisor code, these issues also affect user-space code (e.g., threading libraries, mmap(), etc.).

As an FYI, SAFECode later evolved into a system that could support general C programs by using a combination of static analysis, an optional memory-region transform, and run-time checks; that is the system available today (http://sva.cs.illinois.edu). The code for those older systems should still be in the safecode SVN repository, though, so I think you could rebuild the original system which rejected type-unsafe programs if you wanted to do so.

On a final note, as long as you have the whole program to analyze, using something like SAFECode in its modern form should permit you to run user-space code in the kernel as long as you're willing to accept having run-time checks. That said, there is still a fair amount of work to make sure that the memory safety is airtight (potentially enough to warrant a research paper). The two issues that come to mind off-hand are:

1) There's an issue with using the points-to analysis (DSA) on C++ programs and C programs that mimic vtables; the points-to analysis cannot always tell when it has analyzed the complete program, and that can cause SAFECode's checks to lose completeness.

2) There's still some work left for the run-time checks and static analysis. For example, some of the C standard library still needs run-time checks (or to be processed with SAFECode). Special checks are needed on calls to mmap(). Inline assembly needs to be handled somehow. Of course, you can choose to write a static analysis that detects use of those features and rejects the program; see the sketch below for what they look like at the IR level.
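A minimal IR sketch of those two constructs (typed-pointer syntax; the mmap
prototype assumes a 32-bit target and is illustrative only):

    declare i8* @mmap(i8*, i32, i32, i32, i32, i32)

    define void @hard_cases(i8* %hint) {
    entry:
      call void asm sideeffect "nop", ""()                                   ; inline assembly
      %p = call i8* @mmap(i8* %hint, i32 4096, i32 3, i32 34, i32 -1, i32 0) ; call to mmap
      ret void
    }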

If you're interested in chatting further on this topic, please feel free to email either me or the svadev@cs.illinois.edu mailing list.

-- John T.

I'm in the process of writing a formal spec for LLVM IR.

I have a lot of the grammar done and a tool for checking the grammar for completeness
and generating cross references and such. I'm using a nice extended regular expression form of BNF.

My intent is to open source it at google code when it's done but if other people want to help me with this project I could do that now.

About 80% of it is done.

I have documented it mostly from reading the source code.

Beyond just using the grammar to document things, I have some tools in mind for later that specify various optimizations as grammatical transformations, which are then translated into C++ code for LLVM.

I think also that I could replace the ad hoc parser in LLVM with something better once we have a clean grammar for it, using a parser generator tool.

My little tool can be expanded into a parser generator tool. I've written regular-expression YACC-type equivalents and can do any of those types. I probably will do a special one for this project because YACC is a dinosaur and some of the newer ones are not exactly what I want either.

I would like to see the many ad hoc parsers in LLVM get replaced by ones generated from grammars.

Reed

However, there is a ton of stuff that’s not explicitly identified today.

For example, consider a call to a function address bitcasted to a type incompatible with the type of the function. Most of us around here intuitively
know this gets undefined behavior because we know how to think like a C
compiler. But LangRef doesn’t discuss this. It doesn’t even have a concept
of “compatible” types with which to discuss it.
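For concreteness, a minimal sketch of such a call (typed-pointer IR syntax of
that era); the verifier accepts it, yet LangRef never says what it means:

    declare i32 @callee(i32)

    define i32 @caller() {
    entry:
      %fp = bitcast i32 (i32)* @callee to i32 ()*   ; cast to an incompatible function type
      %r = call i32 %fp()                           ; call through the mismatched type
      ret i32 %r
    }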

What should the rules be? If we look through LLVM’s source code, we
find that the inliner has code for smoothing over caller/callee mismatches.
However, we can’t translate this logic into LangRef because it does things
that are impossible to do for non-inlined calls in most backends. If we
dig through every backend, we could come up with a minimal set of
functionality that could be broadly supported. However, this set would be
too minimal for clang, for example, which regularly bitcasts objc_msgSend
in ways that it knows will work, but only for non-obvious reasons.

You could spend weeks researching all the nuances of just this problem.
In practice, LLVM just doesn’t worry about it. Problems like this tend to be
edge cases that don’t cause trouble for most people most of the time.
However, you can find them all over the place if you go looking.

Dan

Reed,

I’m not planning to use Bison or YACC or Antlr. I stated that in my email.

Don’t assume that I’m an idiot, although it’s a possibility. I’ve been doing compilers off and on for 35 years now. Ideas in computer science are like fashion in clothes; if you wait around long enough, everything comes back into fashion. All the ebb and flow of how people do things in compilers has already come and gone many times.

Right now it’s pretty easy to crash TableGen if you have table errors, for example. That is not exactly reasonable diagnostics.

It’s very hard to write parsers by hand, even for simple languages, without having lots of problems.
This is compounded over time as things change.

Hopefully what I do will meet with acceptance. I’ll take that chance.

Reed

I don’t mean to imply anything about your skill set, or about the implementation of your grammar or generator (which I know nothing about). I just thought it was important to make sure you’re aware that you may be trying to sell the concept to an audience that has rejected similar things in the past.

–Owen

Reed, as I’m sure you know and can appreciate… the cause of most of these problems is not the parser itself, it’s the semantic analysis and other actions that are fired by the parser. I don’t see how a machine generated parser helps solve the important problems here.

-Chris