DataFlowSanitizer design discussion

pcc · June 13, 2013, 10:00pm

Hi,

I am starting a thread to discuss the design of DataFlowSanitizer,
a compiler-instrumentation-based analysis tool which I am hoping to
bring into LLVM. As a starting point, I have included the current
version of the design document below. Comments are appreciated.

Thanks,
Peter

DataFlowSanitizer Design Document

Sean_Silva · June 13, 2013, 10:13pm

Could you maybe give some example use cases?

Also, “sanitizer” may not be the best name for this, since it doesn’t really sanitize anything.

– Sean Silva

rnk · June 13, 2013, 10:50pm

While dfsan as proposed isn’t an error checking tool, the goal is to build domain-specific error checking tools with it, which is pretty sanitizer-like.

The big question is basically how much utility the LLVM community thinks there is in having a taint analysis framework available upstream. I imagine there are many researchers in program analysis out there who would love to have some standard, easy-to-use taint framework for native code. This is the kind of thing that’s trivial to implement for Java but so far intractable for native code.

pcc · June 14, 2013, 5:43pm

> Could you maybe give some example use cases?

A use case I am interested in is to take a large application and use
this instrumentation as a tool to help monitor how data flows from its
inputs (sources) to its outputs (sinks). This has applications from
a privacy/security perspective in that one can audit how a sensitive
data item is used within a program and ensure it isn't exiting the
program anywhere it shouldn't be.

An ASPLOS paper from a few years ago discusses this problem and a
solution based on dynamic binary instrumentation using QEMU:

Among other things, I hope to address a number of deficiencies of
the tool described by that paper, in terms of efficiency (the other
sanitizer tools have shown that compiler-based instrumentation can be
much more efficient than binary instrumentation), and also in terms
of accuracy (unlike the system described in that paper, we track data
accurately through join points using union labels).
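To make the idea of union labels at join points concrete, here is a minimal toy model in plain C. It is only a sketch and does not use the DFSan runtime: the names `label_t`, `tracked_int`, `from_source`, `tracked_add` and `derived_from` are hypothetical, and representing a label as a bitmask is an assumption made for illustration, not necessarily the representation the design document uses.

```c
#include <stdint.h>

/* Toy model (assumption): each base label is one bit, and a union label
   is the bitwise OR of its members. */
typedef uint16_t label_t;

typedef struct {
    int     value;
    label_t label;   /* shadow: which sources influenced `value` */
} tracked_int;

/* Introduce a value from a source carrying the given label. */
static tracked_int from_source(int value, label_t label) {
    tracked_int t = { value, label };
    return t;
}

/* A join point: the result depends on both operands, so its label is
   the union of theirs rather than either one alone. */
static tracked_int tracked_add(tracked_int a, tracked_int b) {
    tracked_int r = { a.value + b.value, (label_t)(a.label | b.label) };
    return r;
}

/* Sink-side check: nonzero if the value was derived from `label`. */
static int derived_from(tracked_int t, label_t label) {
    return (t.label & label) != 0;
}
```

In this model, adding a value labelled with bit 0 to a value labelled with bit 1 gives a result for which `derived_from` is true for both labels, so a sink can tell that both sources contributed.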

There are other applications outside of security. For example,
one could use this instrumentation pass (or a variant of it) to tag
opposite-endian integers in memory, and check that no opposite-endian
integer is loaded or otherwise used directly without first going
through a conversion.
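The endianness check described above could look roughly like the following toy model. This is a sketch under stated assumptions, not what the instrumentation pass does: the tag is stored next to the value instead of in shadow memory, and `tagged_u32`, `tag_foreign`, `to_native` and `checked_use` are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    uint32_t bits;
    int      foreign_endian;  /* toy "label": 1 if bytes are in opposite order */
} tagged_u32;

/* Reverse the byte order of a 32-bit word. */
static uint32_t swap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Tag a raw word (for example, one read from the wire) as opposite-endian. */
static tagged_u32 tag_foreign(uint32_t raw) {
    tagged_u32 v = { raw, 1 };
    return v;
}

/* The conversion is the only operation that clears the tag. */
static tagged_u32 to_native(tagged_u32 v) {
    if (v.foreign_endian) {
        v.bits = swap32(v.bits);
        v.foreign_endian = 0;
    }
    return v;
}

/* Checked use: returns 0 and writes *out on success, or -1 if the value
   is still tagged, i.e. it was used without going through a conversion. */
static int checked_use(tagged_u32 v, uint32_t *out) {
    if (v.foreign_endian) {
        return -1;
    }
    *out = v.bits;
    return 0;
}
```

Here a direct `checked_use(tag_foreign(x), &out)` is rejected, while `checked_use(to_native(tag_foreign(x)), &out)` succeeds and yields the byte-swapped word.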

> Also, "sanitizer" may not be the best name for this, since it doesn't
> really sanitize anything.

As Reid mentioned, a goal is to build sanitizer-like tools on top of
this instrumentation. Not only that, but one of the things that an
application can do is turn on its own sources and sinks in response
to the instrumentation being enabled (via the __has_feature macro).
So really, -fsanitize=dataflow would be the flag that turns on
data-flow sanitization for an application designed for it. And should
the component of the compiler that allows this data-flow sanitization
be named any differently?
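As a rough sketch of how an application might turn on its own sources and sinks only when the instrumentation is enabled: the feature name `dataflow_sanitizer` below is the one I believe Clang uses, but the message above only says "the __has_feature macro", so treat it as an assumption. `APP_TRACKS_DATAFLOW` and `app_sources_and_sinks_enabled` are hypothetical names.

```c
/* Compilers that lack __has_feature treat every feature as absent. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* Assumed feature name; expected to be true only when the program is
   built with -fsanitize=dataflow. */
#if __has_feature(dataflow_sanitizer)
#define APP_TRACKS_DATAFLOW 1
#else
#define APP_TRACKS_DATAFLOW 0
#endif

/* Hypothetical application hook: code that labels inputs (sources) and
   checks outputs (sinks) would be guarded by this. */
static int app_sources_and_sinks_enabled(void) {
    return APP_TRACKS_DATAFLOW;
}
```

In an ordinary build the hook returns 0 and the labelling code can be compiled out entirely, so the application pays nothing when the instrumentation is off.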

Thanks,

Bin_Tzeng · June 14, 2013, 8:23pm

It is interesting. I can see some use cases for such a tool. To me, a source-level implementation is not as accurate as binary translation. For instance, it is hard to check the taint of return addresses, since there is no concept of return instructions at the source level. The stack does not appear until later.

pcc · June 14, 2013, 8:48pm

This tool isn't for stack protection; there are other tools for that.
In general the tool isn't currently focused on defending against
adversaries -- it would be trivial to write a program that accesses
shadow memory directly in order to produce incorrect results, not
to mention "tag scrubbers" which use control flow to remove tags
(see section 6 of the ASPLOS paper).
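For readers unfamiliar with the term, a "tag scrubber" can be illustrated with a toy model like the one below. It is a sketch, not DFSan code: `tracked_byte`, `copy_direct` and `copy_via_control_flow` are hypothetical names, and the model assumes a tracker that follows data dependencies only.

```c
#include <stdint.h>

typedef struct {
    uint8_t value;
    int     tainted;   /* toy label: 1 if derived from sensitive input */
} tracked_byte;

/* Ordinary data-flow copy: the label travels with the value. */
static tracked_byte copy_direct(tracked_byte in) {
    return in;
}

/* "Tag scrubber": rebuilds the value through branches only. Every store
   writes a constant, and constants carry no label, so a tracker that
   follows only data dependencies sees the output as untainted. */
static tracked_byte copy_via_control_flow(tracked_byte in) {
    tracked_byte out = { 0, 0 };
    for (int bit = 0; bit < 8; bit++) {
        if (in.value & (1u << bit)) {
            out.value |= (uint8_t)(1u << bit);
        }
    }
    return out;
}
```

Both functions return the same byte, but only `copy_direct` preserves the label, which is why the tool is not meant as a defence against code that is actively trying to evade it.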

Sean_Silva · June 14, 2013, 11:10pm

Excellent point. I agree with your reasoning.

-- Sean Silva

pcc · June 26, 2013, 1:13am

Any further comments on the below? I've updated the design document
to add a use case at Kostya's request, but I'd appreciate any further
review of the design.

Thanks,
Peter

DataFlowSanitizer Design Document

pcc · August 7, 2013, 12:55am

Hi,

If there are no further comments on the design below I intend to commit
my DFSan patches in a week.

Thanks,
Peter

Chandler_Carruth · August 7, 2013, 1:00am

I think it would be good to get Kostya's explicit sign-off on this before
committing it, as he has been directing and overseeing the sanitizer work
as a whole over the past year. CC-ing him directly to see if he can take
time to look through this. I think he's back from vacation at this point.

Kostya_Serebryany · August 7, 2013, 11:20am

I’ve seen the following patches, and I am ok with them:
http://llvm-reviews.chandlerc.com/D966 (clang driver, design doc) – LGTM

https://codereview.googleplex.com/204001 (then http://llvm-reviews.chandlerc.com/D967) (compiler-rt) – LGTM (synchronization part also reviewed by dvyukov@)
http://llvm-reviews.chandlerc.com/D965 (llvm) – LGTM-ed by eugenis@

–kcc

Konstantin_Tokarev · August 7, 2013, 3:01pm

Well, on many architectures there is no concept of a return instruction at the ISA level either.

Bin_Tzeng · August 7, 2013, 6:17pm

That is true. I was referring to the program counter at the ISA level. C/C++ abstractions do not expose that. It is not the intended use case for DFSan, I guess. ;]
