DataFlowSanitizer design discussion


I am starting a thread to discuss the design of DataFlowSanitizer,
a compiler instrumentation based analysis tool which I am hoping to
bring into LLVM. As a starting point, I have included the current
version of the design document below. Comments are appreciated.


DataFlowSanitizer Design Document

Could you maybe give some example use cases?

Also, “sanitizer” may not be the best name for this, since it doesn’t really sanitize anything.

– Sean Silva

While dfsan as proposed isn’t an error-checking tool, the goal is to build domain-specific error-checking tools with it, which is pretty sanitizer-like.

The big question is basically how much utility the LLVM community thinks there is in having a taint analysis framework available upstream. I imagine there are many researchers in program analysis out there who would love to have some standard, easy-to-use taint framework for native code. This is the kind of thing that’s trivial to implement for Java but so far intractable for native code.

Could you maybe give some example use cases?

A use case I am interested in is to take a large application and use
this instrumentation as a tool to help monitor how data flows from its
inputs (sources) to its outputs (sinks). This has applications from
a privacy/security perspective in that one can audit how a sensitive
data item is used within a program and ensure it isn't exiting the
program anywhere it shouldn't be.

An ASPLOS paper from a few years ago discusses this problem and a
solution based on dynamic binary instrumentation using QEMU:

Among other things, I hope to address a number of deficiencies of
the tool described by that paper, in terms of efficiency (the other
sanitizer tools have shown that compiler-based instrumentation can be
much more efficient than binary instrumentation), and also in terms
of accuracy (unlike the system described in that paper, we track data
accurately through join points using union labels).
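To make "union labels" concrete, here is a toy model in C of how a label could represent the union of two other labels at a join point (for example, `c = a + b` would give `c` the union of the labels of `a` and `b`). This is only a sketch of the idea. The names `label_t`, `create_label`, `union_labels` and `has_label` and the fixed-size table are made up for illustration and are not the runtime's actual interface or data layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model, not the DFSan runtime. Label 0 means "unlabelled". */
typedef uint16_t label_t;

enum { MAX_LABELS = 1024 };

/* A base label has l1 == l2 == 0; a union label names its two operands. */
struct label_info { label_t l1, l2; };

static struct label_info label_table[MAX_LABELS];
static label_t last_label = 0;

/* Allocate a fresh base label. */
label_t create_label(void) {
  assert(last_label + 1 < MAX_LABELS);
  ++last_label;
  label_table[last_label].l1 = 0;
  label_table[last_label].l2 = 0;
  return last_label;
}

/* Does `label` include `elem`, either directly or through nested unions? */
int has_label(label_t label, label_t elem) {
  if (label == elem) return 1;
  if (label == 0) return 0;
  const struct label_info *info = &label_table[label];
  if (info->l1 == 0 && info->l2 == 0) return 0; /* base label, no operands */
  return has_label(info->l1, elem) || has_label(info->l2, elem);
}

/* Label for a value computed from two inputs labelled `a` and `b`. */
label_t union_labels(label_t a, label_t b) {
  if (a == 0) return b;
  if (b == 0) return a;
  if (a == b) return a;
  if (has_label(a, b)) return a; /* a already covers b */
  if (has_label(b, a)) return b; /* b already covers a */
  /* Reuse an existing union of the same two labels if there is one. */
  for (label_t i = 1; i <= last_label; ++i) {
    const struct label_info *info = &label_table[i];
    if ((info->l1 == a && info->l2 == b) || (info->l1 == b && info->l2 == a))
      return i;
  }
  assert(last_label + 1 < MAX_LABELS);
  ++last_label;
  label_table[last_label].l1 = a;
  label_table[last_label].l2 = b;
  return last_label;
}
```

With two base labels `a` and `b`, `union_labels(a, b)` yields a third label for which `has_label` reports both `a` and `b`, so neither source is lost at the join.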

There are other applications outside of security. For example,
one could use this instrumentation pass (or a variant of it) to tag
opposite-endian integers in memory, and check that no opposite-endian
integer is loaded or otherwise used directly without first going
through a conversion.
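As a rough illustration of the endianness idea, the following C sketch carries the tag next to the value explicitly instead of in shadow memory, so it only models the check and says nothing about how the instrumentation pass would implement it. All names here (`tagged_u32`, `from_wire`, `to_native`, `use_value`, `violation_count`) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model: the tag travels with the value instead of living in
   shadow memory as it would under real instrumentation. */
enum byte_order_tag { TAG_NATIVE = 0, TAG_OPPOSITE = 1 };

struct tagged_u32 {
  uint32_t bits;
  enum byte_order_tag tag;
};

static unsigned violations = 0;

static uint32_t bswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
         ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Source: an integer read from an opposite-endian format gets tagged. */
struct tagged_u32 from_wire(uint32_t raw) {
  struct tagged_u32 v = { raw, TAG_OPPOSITE };
  return v;
}

/* Conversion: swaps the bytes and clears the tag. */
struct tagged_u32 to_native(struct tagged_u32 v) {
  if (v.tag == TAG_OPPOSITE) {
    v.bits = bswap32(v.bits);
    v.tag = TAG_NATIVE;
  }
  return v;
}

/* Sink: any direct use of a still-tagged value is recorded as a violation. */
uint32_t use_value(struct tagged_u32 v) {
  if (v.tag != TAG_NATIVE) ++violations;
  return v.bits;
}

unsigned violation_count(void) { return violations; }
```

Calling `use_value(from_wire(x))` bumps the violation counter, while `use_value(to_native(from_wire(x)))` does not.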

Also, "sanitizer" may not be the best name for this, since it doesn't
really sanitize anything.

As Reid mentioned, a goal is to build sanitizer-like tools on top of
this instrumentation. Not only that, but one of the things that an
application can do is turn on its own sources and sinks in response
to the instrumentation being enabled (via the __has_feature macro).
So really, -fsanitize=dataflow would be the flag that turns on
data-flow sanitization for an application designed for it. And should
the component of the compiler that allows this data-flow sanitization
be named any differently?
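For what it's worth, an application-side guard could look roughly like the sketch below. The feature name `dataflow_sanitizer` is the one I would expect Clang to expose, but treat it as an assumption as far as this thread is concerned. The `app_*` functions are purely hypothetical application hooks, and a real build would label the relevant buffer where the counter is incremented.

```c
#include <stddef.h>

/* __has_feature is a Clang extension; fall back to 0 on other compilers. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* Assumption: the feature is named dataflow_sanitizer. */
#if __has_feature(dataflow_sanitizer)
#define APP_DATAFLOW_ENABLED 1
#else
#define APP_DATAFLOW_ENABLED 0
#endif

static size_t sources_armed = 0;

/* Reports whether this build was compiled with the instrumentation. */
int app_dataflow_enabled(void) { return APP_DATAFLOW_ENABLED; }

/* Hypothetical hook: the application arms one of its own sources, but only
   in an instrumented build. Here it just counts the request. */
void app_arm_source(const char *name) {
  (void)name;
#if APP_DATAFLOW_ENABLED
  ++sources_armed; /* an instrumented build would attach a label here */
#endif
}

size_t app_sources_armed(void) { return sources_armed; }
```

In an uninstrumented build `app_arm_source` compiles to a no-op, so the application pays nothing for carrying its source and sink definitions around.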


It is interesting. I can see some use cases for such a tool. To me, a source-level implementation
is not as accurate as binary translation. For instance, it is hard to check the taint of return addresses,
since there is no concept of a return instruction at the source level. The stack does not appear until later.

This tool isn't for stack protection; there are other tools for that.
In general the tool isn't currently focused on defending against
adversaries -- it would be trivial to write a program that accesses
shadow memory directly in order to produce incorrect results, not
to mention "tag scrubbers" which use control flow to remove tags
(see section 6 of the ASPLOS paper).

Excellent point. I agree with your reasoning.

-- Sean Silva

Any further comments on the below? I've updated the design document
to add a use case at Kostya's request, but I'd appreciate any further
review of the design.


DataFlowSanitizer Design Document


If there are no further comments on the design below I intend to commit
my DFSan patches in a week.


I think it would be good to get Kostya's explicit sign-off on this before
committing it, as he has been directing and overseeing the sanitizer work
as a whole over the past year. CC-ing him directly to see if he can take
time to look through this. I think he's back from vacation at this point.

I’ve seen the following patches, and I am OK with them:
(clang driver, design doc) – LGTM
(compiler-rt) – LGTM (synchronization part also reviewed by dvyukov@)
(llvm) – LGTM-ed by eugenis@


Well, on many architectures there is no concept of a return instruction at the ISA level either :)

That is true. I was referring to the program counter at the ISA level. C/C++ abstractions do not expose that. I guess it is not the intended use case for DFSan. ;]