GSoC and SAFECode

Hi, everyone.

I'm a senior at Swarthmore College and would love to work with LLVM this
summer. I'm interested in systems languages and security, and I'll start
a PhD on these topics this fall. I also do a good deal of open source
development and auditing with OpenBSD and a variety of other projects.

I spent last year's GSoC doing security auditing for Pidgin/libpurple.
GSoC seems like a great way to spend this summer as well. I'm
particularly interested in the SAFECode project and LLVM's general
security and auditing features. I haven't worked a ton with LLVM in the
past, but I have a mostly finished strnlen(3) optimization I've been
meaning to resubmit:

https://marc.info/?l=llvm-commits&m=145485679322322&w=2

I also worked with Martin Natano to port the integer overflow checker to
OpenBSD and build a working kernel and userland with it. We now have a
patch that integrates it into the full system build through libc, and
we fixed a number of bugs in the process.

Because of my background in auditing, I like to think I have an
intuition for which compiler and scanner features developers will find
useful and usable. I also have a good understanding of the more
theoretical aspects of language and compiler design, and I'm very
familiar with the ANSI C and POSIX specs.

With regard to potential projects, I'd like to rewrite the SAFECode
static array bounds check pass and add check optimizations (both to
remove statically unnecessary checks and improve the generated code for
remaining ones). In the process, I'd refactor and simplify what already
exists, fixing bugs as I encounter them. New checks at the libc API
level could also be interesting, if they're within the scope of the
project.
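As an illustration of what removing a statically unnecessary check can mean, here is a minimal hand-written sketch in C. It is not SAFECode's instrumentation or output; bounds_check, sum_checked and sum_hoisted are hypothetical names invented for this example. Because the loop index only ever takes values in [0, n), one comparison of n against the array length before the loop covers every access that the per-iteration check would have covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical run-time check of the kind a bounds-checking pass
 * might insert before each array access. */
static void bounds_check(size_t index, size_t length)
{
    if (index >= length)
        abort();
}

/* Naive instrumentation: one check per access. */
long sum_checked(const int *a, size_t length, size_t n)
{
    long sum = 0;

    assert(a != NULL);
    for (size_t i = 0; i < n; i++) {
        bounds_check(i, length);
        sum += a[i];
    }
    return sum;
}

/* Optimized form: i stays in [0, n), so checking n <= length once
 * before the loop makes every per-iteration check redundant. */
long sum_hoisted(const int *a, size_t length, size_t n)
{
    long sum = 0;

    assert(a != NULL);
    if (n > length)
        abort();
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

The two functions differ only in when they abort on a bad n: the hoisted version fails before touching the array, which is acceptable here because the loop body has no other side effects.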

This work would almost certainly lead to an OpenBSD port. John would
probably be helpful on both counts, as he's integrated SAFECode with
FreeBSD. I'm also competent with the OpenBSD port system's bulk build
infrastructure, so I'm confident that I could test SAFECode on a handful
of important projects or even en masse.

I'm open to other project ideas as well. If anyone else is mentoring a
project that seems like a good fit for me, please share.

Thanks for your time,
Michael McConville

Dear Michael,

If you're interested in SAFECode, the first step is to get SAFECode working with a newer version of LLVM. A Master's student did some work on this last summer with LLVM 3.7 but didn't finish. It would now need to be updated to LLVM 3.8 (though I suppose a completed LLVM 3.7 port would be fine with me).

After that, there are some interesting projects on which to work. One would be static array bounds checking. That could be interesting, but it doesn't really address my immediate research needs. Right now, I'm more interested in getting the Baggy Bounds with Accurate Checking (BBAC) feature enabled so that we can use it in research. For example, we could try to get faster enforcement of memory safety on operating system kernels, examine the use of combined safe/unsafe languages for OS kernels (without letting C code violate the safety provided by the safe language), and enforce dynamic security policies on kernel modules (to thwart rootkits).

If you're interested in security projects on the kernel, you could enhance the KCoFI prototype to use a more accurate control-flow graph or to use code pointer integrity, or you could write optimizations for the software-fault isolation instrumentation (which would improve both KCoFI and Virtual Ghost, if you are familiar with those papers of mine).

Do any of these projects sound interesting to you?

Regards,

John Criswell

Because of my background in auditing, I like to think I have an
intuition for which compiler and scanner features developers will find
useful and usable. I also have a good understanding of the more
theoretical aspects of language and compiler design, and I'm very
familiar with the ANSI C and POSIX specs.

Can you clarify what you mean by "theoretical aspects of language and compiler design"? Does that mean that you understand Kam/Ullman (i.e., classical) data-flow analysis and SSA-based compiler analysis algorithms?

John Criswell wrote:

If you're interested in SAFECode, the first step is to get SAFECode
working with a newer version of LLVM. A Master's student did some
work on this last summer with LLVM 3.7 but didn't finish. It would
now need to be updated to LLVM 3.8 (though I suppose a completed LLVM
3.7 port would be fine with me).

After that, there are some interesting projects on which to work. One
would be static array bounds checking. That could be interesting, but
it doesn't really address my immediate research needs. Right now, I'm
more interested in getting the Baggy Bounds with Accurate Checking
(BBAC) feature enabled so that we can use it in research. For
example, we could try to get faster enforcement of memory safety on
operating system kernels, examine the use of combined safe/unsafe
languages for OS kernels (without letting C code violate the safety
provided by the safe language), and enforce dynamic security policies
on kernel modules (to thwart rootkits).

If you're interested in security projects on the kernel, you could
enhance the KCoFI prototype to use a more accurate control-flow graph
or to use code pointer integrity, or you could write optimizations for
the software-fault isolation instrumentation (which would improve both
KCoFI and Virtual Ghost, if you are familiar with those papers of
mine).

Do any of these projects sound interesting to you?

Yeah, definitely. Porting to LLVM 3.8 or finishing the 3.7 port would be
a good way to get more familiar with LLVM internals.

BBAC looks very interesting. I, like you (according to the BBAC paper's
intro), am a little frustrated by the fact that these sorts of checkers
still aren't used in standard software builds, so I find optimizing for
performance and simplicity particularly interesting. Also, this is an
anecdote, but have you considered writing pseudo-random data to the
padding area and using its checksum as a canary? Alternately, you could
even just use the first few bytes of the padding directly. We recently
added optional canaries to OpenBSD and it's been useful in finding bugs.
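The simpler variant suggested above (using the first few bytes of the padding directly) could be sketched roughly as follows. This is an assumption-laden illustration, not OpenBSD's malloc or SAFECode code: canary_alloc, canary_intact, CANARY_LEN and the fixed canary_secret value are all hypothetical, and a real allocator would pick the secret randomly at startup. The allocation is rounded up to a power of two, as in baggy bounds, and the canary is written immediately after the requested size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CANARY_LEN sizeof(uint32_t)

/* Placeholder value; a real allocator would randomize this per process. */
static uint32_t canary_secret = 0x5afec0deu;

/* Round n up to the next power of two; returns 0 on overflow. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n) {
        if (p > SIZE_MAX / 2)
            return 0;
        p <<= 1;
    }
    return p;
}

/* Allocate a power-of-two slot and place the canary in the padding,
 * directly after the last byte the caller asked for. */
void *canary_alloc(size_t size)
{
    unsigned char *p;
    size_t slot;

    if (size > SIZE_MAX - CANARY_LEN)
        return NULL;
    slot = round_up_pow2(size + CANARY_LEN);
    if (slot == 0 || (p = malloc(slot)) == NULL)
        return NULL;
    memcpy(p + size, &canary_secret, CANARY_LEN);
    return p;
}

/* Returns 1 if the canary after the object is unmodified, else 0. */
int canary_intact(const void *obj, size_t size)
{
    assert(obj != NULL);
    return memcmp((const unsigned char *)obj + size,
        &canary_secret, CANARY_LEN) == 0;
}
```

A caller would typically run canary_intact just before freeing the object, so a linear overflow past the requested size is reported at free time.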

I'll have to read more about the kernel projects before I can comment.

Thanks,
Michael

Hi Michael,

With regard to potential projects, I'd like to rewrite the SAFECode
static array bounds check pass and add check optimizations (both to
remove statically unnecessary checks and improve the generated code for
remaining ones).

I'm part of a team building a production-quality Java JIT compiler using LLVM, and
we're very interested in making upstream LLVM work better with array bounds
checks and null checks, since IR from a Java frontend tends to have a lot of
these. This involves better symbolic analysis (e.g. in ScalarEvolution and
LazyValueInfo); code-generation cleverness (e.g. "implicit null checks" [1]); and
code duplication/specialization (e.g. the somewhat misleadingly named "inductive
range check elimination" [2]). You can find some more detail on our work at
[3].

While I cannot mentor a project specifically aimed at SAFECode, I'm very willing
and able to mentor a project aimed at making LLVM better at generally handling
array bounds checks and null checks. Ideally this will involve improving
existing optimization passes in LLVM, and in ways that will work well with both
Java-style array bounds / null checks and SAFECode style integrity checks. We
can also consider building a dedicated range-analysis based bound check elision
pass if that makes more sense.
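To make the "code duplication/specialization" idea concrete, here is a hand-written C sketch of splitting a loop's iteration space so that the main loop needs no range check. It only illustrates the general technique and is not the output of any LLVM pass; sum_with_checks, sum_split and the failed out-parameter (standing in for a Java-style exception exit) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: a range check on every iteration. Sets *failed and
 * stops at the first out-of-range index. */
long sum_with_checks(const int *a, size_t len, size_t n, int *failed)
{
    long sum = 0;

    assert(a != NULL && failed != NULL);
    *failed = 0;
    for (size_t i = 0; i < n; i++) {
        if (i >= len) {
            *failed = 1;
            break;
        }
        sum += a[i];
    }
    return sum;
}

/* Split version: the main loop covers [0, min(n, len)), where the
 * check provably passes, so it is dropped there. Any remaining
 * iterations run the original checked body in a post-loop. */
long sum_split(const int *a, size_t len, size_t n, int *failed)
{
    long sum = 0;
    size_t safe_end = n < len ? n : len;
    size_t i = 0;

    assert(a != NULL && failed != NULL);
    *failed = 0;
    for (; i < safe_end; i++)      /* main loop: no check needed */
        sum += a[i];
    for (; i < n; i++) {           /* post-loop: checked as before */
        if (i >= len) {
            *failed = 1;
            break;
        }
        sum += a[i];
    }
    return sum;
}
```

In the common case where n <= len, the post-loop never executes and the hot path carries no per-iteration check.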

Is this something you're interested in?

-- Sanjoy

[1]:

[2]:

[3]: https://www.youtube.com/watch?v=3G2Rg6GBXqA

Michael McConville wrote:

BBAC looks very interesting. I, like you (according to the BBAC paper's
intro), am a little frustrated by the fact that these sorts of checkers
still aren't used in standard software builds, so I find optimizing for
performance and simplicity particularly interesting. Also, this is an
anecdote, but have you considered writing pseudo-random data to the
padding area and using its checksum as a canary?

No, I have not considered canaries, and I'd be very wary of doing so. Canaries are (IMHO) a hack; Stephen Checkoway has his students defeat stack canaries as a homework assignment. I'd need to see a strong argument that a heap object canary would not be defeated easily.

I'm more interested in storing information like the following in the padding:

o The exact length of the memory object (BBAC)
o The points-to set(s) to which the memory object belongs (useful for finding casting errors, dangling pointer errors, and bugs in the compiler's points-to analysis)
o Policy information on which part of a program can modify which fields in the object (useful for restricting the behavior of kernel modules within a monolithic kernel)
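The first item can be sketched in C as follows. The layout is an assumption made for illustration only and is not claimed to match SAFECode's BBAC implementation: the allocation is rounded up to a power-of-two slot, and the exact object length is stored in the last sizeof(size_t) bytes of the padding. struct bb_obj and the bb_* functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle: base pointer plus power-of-two slot size. */
struct bb_obj {
    unsigned char *base;
    size_t slot;
};

/* Round n up to the next power of two; returns 0 on overflow. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n) {
        if (p > SIZE_MAX / 2)
            return 0;
        p <<= 1;
    }
    return p;
}

/* Allocate a slot and record the exact length at the end of the
 * padding. Returns 0 on success, -1 on failure. */
int bb_alloc(struct bb_obj *o, size_t size)
{
    assert(o != NULL);
    if (size > SIZE_MAX - sizeof(size_t))
        return -1;
    o->slot = round_up_pow2(size + sizeof(size_t));
    if (o->slot == 0 || (o->base = malloc(o->slot)) == NULL)
        return -1;
    memcpy(o->base + o->slot - sizeof(size_t), &size, sizeof(size));
    return 0;
}

/* Read the exact length back out of the padding. */
size_t bb_exact_length(const struct bb_obj *o)
{
    size_t len;

    assert(o != NULL);
    memcpy(&len, o->base + o->slot - sizeof(size_t), sizeof(len));
    return len;
}

/* Baggy check: is the offset inside the padded slot? */
int bb_in_slot(const struct bb_obj *o, size_t off)
{
    return off < o->slot;
}

/* Accurate check: is the offset inside the object itself? */
int bb_in_object(const struct bb_obj *o, size_t off)
{
    return off < bb_exact_length(o);
}
```

For a 10-byte object, offset 10 passes the baggy check but fails the accurate one, which is the gap that storing the exact length closes.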

I'm rather hoping that there's a research paper within the last two projects.

  Alternately, you could
even just use the first few bytes of the padding directly. We recently
added optional canaries to OpenBSD and it's been useful in finding bugs.

Bug finding and online protection make very different tradeoffs that I won't get into right now due to lack of time. If you're interested, we could probably meet up at a conference sometime (or discuss it if your GSoC proposal is accepted :) ).

Regards,

John Criswell