Making MSAN Easier to Use: Providing a Sanitized Libc++

Sanitizers such as MSAN require the entire program to be instrumented; anything less leads to plenty of false positives. Unfortunately this can be difficult to achieve, especially for the C and C++ standard libraries. To work around this, the sanitizers provide interceptors for common C functions, but the same solution doesn’t work as well for the C++ STL. Instead, users are forced to manually build and link a custom sanitized libc++. This is a huge PITA and I would like to improve the situation, not just for MSAN but for all sanitizers. I’m working on a proposal to change this. The basis of my proposal is:

Clang should install/provide multiple sanitized versions of Libc++ and a mechanism to easily link them, as if they were a Compiler-RT runtime.

The goal of this proposal is:

(1) Greatly reduce the number of false positives caused by using an un-sanitized STL.
(2) Allow sanitizers to catch user bugs that occur within the STL library, not just its headers.

The basic steps I would like to take to achieve this are:

(1) Teach the compiler-rt CMake how to build and install each sanitized libc++ version alongside its other runtimes.
(2) Add options to the Clang driver to support linking/using these libraries.

I think this proposal is likely to be contentious, so I would like to focus on its details first. Once I have some feedback on these details I’ll put together a formal proposal, including a plan for implementing it.
The details I would like input on are:

(A) What kind and how many sanitized versions of libc++ should we provide?

From: "Eric Fiselier via cfe-dev" <cfe-dev@lists.llvm.org>
To: "clang developer list" <cfe-dev@lists.llvm.org>, "Chandler
Carruth" <chandlerc@gmail.com>, "Kostya Serebryany"
<kcc@google.com>, "Evgenii Stepanov" <eugenis@google.com>
Sent: Sunday, August 14, 2016 5:05:57 PM
Subject: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized
Libc++

Sanitizers such as MSAN require the entire program to be
instrumented, anything less leads to plenty of false positives.
Unfortunately this can be difficult to achieve, especially for the C
and C++ standard libraries. To work around this the sanitizers
provide interceptors for common C functions, but the same solution
doesn't work as well for the C++ STL. Instead users are forced to
manually build and link a custom sanitized libc++. This is a huge
PITA and I would like to improve the situation, not just for MSAN
but all sanitizers.

I've not thought deeply about the deployment model here, but this is certainly an important problem. Thanks for working on this. We need to figure out a way of automatically providing users with a sanitized STL in a straightforward manner. I'd prefer that they automatically get the appropriately-instrumented runtime, by default, just by providing the -fsanitize=... flag. The same issue comes up for other runtimes, such as the OpenMP runtime library.

-Hal

(A) What kind and how many sanitized versions of libc++ should we provide?
---------------------------------------------------------------------------------------------------------------

I think the minimum set would be Address (which includes Leak), Memory
(With origin tracking?), Thread, and Undefined.
Once we get into combinations of sanitizers things get more complicated.
What other sanitizer combinations should we provide?

(B) How should we handle UBSAN?
---------------------------------------------------

UBSAN is really just a collection of sanitizers and providing sanitized
versions of libc++ for every possible configuration is out of the question.
Instead we should figure out what subset of UBSAN checks we want to
enable in sanitized libc++ versions. I suspect we want to disable the
following checks.

* -fsanitize=vptr
* -fsanitize=function
* -fsanitize=float-divide-by-zero

Additionally, UBSAN can be combined with every other sanitizer group (i.e.
Address, Memory, Thread).
Do we want to provide a combination of UBSAN on/off for every group, or
can we simply provide an over-sanitized version with UBSAN on?
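
To make the trade-off in (B) easier to see, here is a rough Python sketch (not part of the proposal) that enumerates the libc++ builds each choice would require. The variant names and the notion of a "build matrix" are assumptions made for illustration; -fsanitize= and -fno-sanitize= are existing Clang flags, and the disabled checks are the three listed above.

```python
# Sketch only: variant names are hypothetical, not an agreed naming scheme.
GROUPS = ("address", "memory", "thread")
DISABLED_UBSAN_CHECKS = ("vptr", "function", "float-divide-by-zero")


def ubsan_flags():
    """UBSan flags with the checks suggested for removal turned off."""
    return [
        "-fsanitize=undefined",
        "-fno-sanitize=" + ",".join(DISABLED_UBSAN_CHECKS),
    ]


def build_matrix(ubsan_everywhere):
    """Map each libc++ variant name to the flags it would be built with.

    ubsan_everywhere=True models the "over-sanitized" option: every group
    is built with UBSan on. False models separate UBSan on/off builds.
    """
    matrix = {"undefined": ubsan_flags()}
    for group in GROUPS:
        base = ["-fsanitize=" + group]
        if ubsan_everywhere:
            matrix[group] = base + ubsan_flags()
        else:
            matrix[group] = base
            matrix[group + "+undefined"] = base + ubsan_flags()
    return matrix


if __name__ == "__main__":
    for everywhere in (True, False):
        matrix = build_matrix(everywhere)
        print("UBSan everywhere:", everywhere, "->", len(matrix), "builds")
        for name, flags in matrix.items():
            print("  ", name, " ".join(flags))
```

Under these assumptions the over-sanitized option needs four builds, while separate on/off builds need seven.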

(C) How should the Clang driver expose the sanitized libraries to the users?
-------------------------------------------------------------------------------------------------------------

I would like to propose the driver option '-fsanitize-stdlib' and
'-fsanitize-stdlib=<sanitizer>'.
The first version deduces the best sanitized version to use, the second
allows it to be explicitly specified.

A couple of other options are:

* -fsanitize=foo: Implicitly turn on a sanitized STL. Clang deduces
which version.
* -stdlib=libc++-<sanitizer>: Explicitly turn on and choose a sanitized STL.
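
As a rough illustration of how the first form might "deduce the best sanitized version", here is a small Python model of the driver logic. It is only a sketch: '-fsanitize-stdlib' is the option proposed above and does not exist in Clang, and the deduction rule (prefer Address/Memory/Thread, fall back to Undefined) is an assumption, not something the proposal specifies.

```python
# Sketch only: models a proposed, non-existent driver option.
EXCLUSIVE_GROUPS = ("address", "memory", "thread")


def enabled_sanitizers(args):
    """Collect names from -fsanitize=a,b and drop those in -fno-sanitize=."""
    enabled = []
    for arg in args:
        if arg.startswith("-fsanitize="):
            for name in arg.split("=", 1)[1].split(","):
                if name and name not in enabled:
                    enabled.append(name)
        elif arg.startswith("-fno-sanitize="):
            for name in arg.split("=", 1)[1].split(","):
                if name in enabled:
                    enabled.remove(name)
    return enabled


def deduce_stdlib_variant(args):
    """Return the sanitized libc++ variant to link, or None for the plain one."""
    requested = False
    explicit = None
    for arg in args:
        if arg == "-fsanitize-stdlib":
            requested = True
        elif arg.startswith("-fsanitize-stdlib="):
            requested = True
            explicit = arg.split("=", 1)[1]
    if not requested:
        return None
    if explicit:
        return explicit  # the user named the variant explicitly
    enabled = enabled_sanitizers(args)
    for group in EXCLUSIVE_GROUPS:
        if group in enabled:
            return group
    return "undefined" if "undefined" in enabled else None


if __name__ == "__main__":
    print(deduce_stdlib_variant(["-fsanitize=memory", "-fsanitize-stdlib"]))
```

The implicit alternative ("-fsanitize=foo turns it on") would correspond to treating the option as always requested.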

(D) Should sanitized libc++ versions override libc++.so?
-------------------------------------------------------------------------------------------

For example, what happens when a program links to both a sanitized and
non-sanitized libc++ version?
Does the sanitized version replace the non-sanitized version, or should
both versions be loaded into the program?

Essentially I'm asking if the sanitized versions of libc++ should have
the "soname" libc++ so they can replace the non-sanitized version, or if
they should have a different "soname" so the linker treats them as a
separate library.

I haven't looked into the consequences of either approach in depth, but
any input is appreciated.

In a sense, these are /just/ multilibs, so my inclination would be to make all the sonames the same, and just stick them in appropriately named subfolders relative to their normal location.

Jon

From: "Jonathan Roelofs via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Eric Fiselier" <eric@efcs.ca>, "clang developer list" <cfe-dev@lists.llvm.org>, "Chandler Carruth"
<chandlerc@gmail.com>, "Kostya Serebryany" <kcc@google.com>, "Evgenii Stepanov" <eugenis@google.com>
Sent: Sunday, August 14, 2016 7:07:00 PM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

In a sense, these are /just/ multilibs, so my inclination would be to
make all the soname's the same, and just stick them in appropriately
named subfolders relative to their normal location.

I'm not sure that's true; there's no property of the environment that determines which library path you need. As a practical matter, I can't set $PLATFORM and/or $LIB in my rpath and have ld.so do the right thing in this context. Moreover, it is really a property of how you compiled, so I think using an alternate library name is natural.

-Hal

> As a practical matter, I can’t set $PLATFORM and/or $LIB in my rpath and have ld.so do the right thing in this context.

Can’t Clang compile the sanitized executable with a special RPATH pointing to the correct libc++ folder?

> Moreover, it is really a property of how you compiled, so I think using an alternate library name is natural.

Using alternative library names will likely cause problems if a non-sanitized libc++ is also present: since both libraries provide the exact same symbols, it’s possible that symbols in the non-sanitized libc++ will replace the sanitized versions.

Eric,

thanks for bringing this up! This is indeed one of the biggest issues
for sanitizer adoption right now.

I think that the same-soname approach is correct, mainly because the
sanitized library is just a version of the same library. Loading both
versions in one process would usually be an error.

RPATH does not work because it only affects immediate dependencies.
The following would refer to two different versions of libc++:
Executable (with asan) -> library A (without asan) -> libc++
        \
         -> libc++
I think even in this case, if the two libc++'s have the same soname,
only one will be loaded. Linux does breadth-first search, so it should
end up with the direct dependency of the main executable, which is
good.
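
The behaviour described above can be sketched with a toy Python model. This is a deliberate simplification based on the description in this message, not the real ld.so algorithm: each object simply lists its direct dependencies as (soname, resolved path) pairs, and all paths are made up for the example.

```python
from collections import deque


def load_order(root, objects):
    """Breadth-first load of dependencies, keyed by soname.

    objects maps an object's path to its direct dependencies, each given
    as (soname, path it would resolve to from that object).
    Returns {soname: path actually loaded}.
    """
    loaded = {}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for soname, path in objects.get(current, []):
            if soname in loaded:
                continue  # an object with this soname is already loaded
            loaded[soname] = path
            queue.append(path)
    return loaded


# Same soname: the executable's direct dependency wins.
SAME_SONAME = {
    "app": [
        ("libA.so", "/usr/lib/libA.so"),
        ("libc++.so.1", "/opt/llvm/lib/asan/libc++.so.1"),
    ],
    "/usr/lib/libA.so": [("libc++.so.1", "/usr/lib/libc++.so.1")],
}

# Different sonames: both copies of libc++ end up in the process.
DIFFERENT_SONAME = {
    "app": [
        ("libA.so", "/usr/lib/libA.so"),
        ("libc++-asan.so.1", "/opt/llvm/lib/asan/libc++-asan.so.1"),
    ],
    "/usr/lib/libA.so": [("libc++.so.1", "/usr/lib/libc++.so.1")],
}

if __name__ == "__main__":
    print(load_order("app", SAME_SONAME))
    print(load_order("app", DIFFERENT_SONAME))
```

In this model the same-soname case loads only the sanitized copy, while distinct sonames load both.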

Another problem is what happens when the program is installed/copied
somewhere, and the toolchain build directory is gone. We would need
help from the dynamic loader.

We have something like this set up on Android for ASan, see
https://source.android.com/devices/tech/debug/asan.html#sanitize_target
The dynamic loader adds directories to the default library search path
when it loads an instrumented executable. The directory with the ASan
libraries is added at the start of the list. I think this is similar
to how multilib works.

On Android we use the linker name itself (PT_INTERP field) to identify
ASan executables. It would probably be better to use a .note section
or even something else.

From: "Evgenii Stepanov" <eugenis@google.com>
To: "Eric Fiselier" <eric@efcs.ca>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "Jonathan Roelofs" <jonathan@codesourcery.com>, "clang developer list"
<cfe-dev@lists.llvm.org>, "Chandler Carruth" <chandlerc@gmail.com>, "Kostya Serebryany" <kcc@google.com>
Sent: Monday, August 15, 2016 12:46:39 AM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

> Eric,
>
> thanks for bringing this up! This is indeed one of the biggest issues
> for sanitizer adoption right now.
>
> I think that the same-soname approach is correct, mainly because the
> sanitized library is just a version of the same library.

Given that, for the case of msan at least, some of the symbols have, effectively, additional semantic requirements, it seems appropriate to use a different name somewhere (i.e. the library name, symbol names).

> Loading both
> versions in one process would usually be an error.

While it would be undesirable to have multiple versions of libc++ in a single process, this generally works regardless. Most of the global state is in libcxxabi (or whatever ABI library is being used), and while having multiple versions of std::cout (etc.) can certainly be observable, it seems rare for this to come up in practice.

> RPATH does not work because it only affects immediate dependencies.
> The following would refer to two different versions of libc++:
> Executable (with asan) -> library A (without asan) -> libc++
>         \
>          -> libc++
> I think even in this case, if the two libc++'s have the same soname,
> only one will be loaded. Linux does breadth-first search, so it
> should end up with the direct dependency of the main executable,
> which is good.

Yes, I believe that Linux's loader does the right thing in this case. If you have the executable without asan, and the library with asan, then we should devise a scheme that does not silently break.

> Another problem is what happens when the program is installed/copied
> somewhere, and the toolchain build directory is gone. We would need
> help from the dynamic loader.

This, IMHO, is a key problem with the rpath approach.

> We have something like this set up on Android for ASan, see
> https://source.android.com/devices/tech/debug/asan.html#sanitize_target
> The dynamic loader adds directories to the default library search
> path when it loads an instrumented executable. The directory with the
> ASan libraries is added at the start of the list. I think this is
> similar to how multilib works.
>
> On Android we use the linker name itself (PT_INTERP field) to
> identify ASan executables. It would probably be better to use a .note
> section or even something else.

Interesting. I don't understand what you're proposing here, however.

-Hal

From: "Hal Finkel via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Evgenii Stepanov" <eugenis@google.com>
Cc: "Jonathan Roelofs" <jonathan@codesourcery.com>, "clang developer list" <cfe-dev@lists.llvm.org>
Sent: Monday, August 15, 2016 1:42:47 AM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

While it would be undesirable to have multiple versions of libc++ in
a single process, this generally works regardless. Most of the
global state is in libcxxabi (or whatever ABI library is being
used), and while having multiple versions of std::cout (etc.) can
certainly be observable, it seems rare for this to come up in
practice.

I'll add, however, that there are other libraries for which having multiple copies in the same process more easily becomes a usability problem; the OpenMP runtime library is a good example.

-Hal

>> Another problem is what happens when the program is installed/copied
>> somewhere, and the toolchain build directory is gone. We would need
>> help from the dynamic loader.
>
> This, IMHO, is a key problem with the rpath approach.

I don't think this will be an issue. Assuming the user has a system libc++
installed, the program should simply fall back to that unsanitized version,
since it won't be able to find the rpath. I don't see anything more we
could do.

One way to support this case would be to provide additional static versions
of libc++, since statically linked executables don't depend on the
toolchain build directory.

Did you have other fallback behavior in mind?
How does compiler-rt handle this problem with shared sanitizer runtimes?

Never mind, I forgot you provided a link to the Android implementation
details.
I'll take a look for similar possibilities on Linux.

>> In a sense, these are /just/ multilibs, so my inclination would be
>> to make all the soname's the same, and just stick them in
>> appropriately named subfolders relative to their normal location.
>
> I'm not sure that's true; there's no property of the environment that
> determines which library path you need. As a practical matter, I
> can't set $PLATFORM and/or $LIB in my rpath and have ld.so do the
> right thing in this context. Moreover, it is really a property of how
> you compiled, so I think using an alternate library name is natural.

Multilibs solve exactly the problem of "it's a property of how you compiled". The thing that's subtly different here is that the usual thing that people do with multilibs is to provide ABI incompatible versions of the same library (which are made incompatible via compiler flags, -msoft-float, for example), whereas these libraries just so happen to be ABI compatible with their non-instrumented variants.

I'm not sure I understand what you're saying about $PLATFORM and $LIB, but I /think/ it's a red herring: the compiler takes care of adding in the multilib suffixes where appropriate, so shouldn't the answer to "which library do I stick in the rpath?" include said suffix (when compiled with Eric's proposed flag)?

Jon

What are these suffixes and where are they added?

Note that right now if I build with -stdlib=libc++ (and libc++ is part
of the LLVM checkout), I don't get any RPATH. So the binary is linked
against the libc++.so in the toolchain build directory, but it would
not find it at runtime without some extra help. This is the price you
pay for running out of a temporary location, and we should probably
keep it like this for sanitizer builds, too, i.e. put the sanitized
libc++ in lib/msan and let the user set their own RPATH.

The other part of the problem is how to install sanitized libc++
system-wide and have apps use it. That's where we need the loader
support, and I think it should follow the multilib design as closely
as possible.

From: "Evgenii Stepanov" <eugenis@google.com>
To: "Eric Fiselier" <eric@efcs.ca>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "Jonathan Roelofs" <jonathan@codesourcery.com>, "clang developer list"
<cfe-dev@lists.llvm.org>, "Chandler Carruth" <chandlerc@gmail.com>, "Kostya Serebryany" <kcc@google.com>
Sent: Monday, August 15, 2016 12:46:39 AM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

Eric,

thanks for bringing this up! This is indeed one of the biggest issues
for sanitizer adoption right now.

I think that the same-soname approach is correct, mainly because the
sanitized library is just a version of the same library.

Given that, for the case of msan at least, some of the symbols have, effectively, additional semantic requirements, it seems appropriate to use a different name somewhere (i.e. the library name, symbol names).

That would mean enforcing that every library in a process is built
with MSan. This is the path DFSan takes by mangling symbol names
during instrumentation. I think MSan does not need to be as strict -
it would make it harder, not easier to use.

Loading both versions in one process would usually be an error.

While it would be undesirable to have multiple versions of libc++ in a single process, this generally works regardless. Most of the global state is in libcxxabi (or whatever ABI library is being used), and while having multiple versions of std::cout (etc.) can certainly be observable, it seems rare for this to come up in practice.

Good point. I was thinking about a general solution that can be
applied to a larger set of system libraries. Libc++ is the most
visible source of MSan false positives, but the problem is definitely
not limited to it. In general, it's hard to say if a random library is
ok to be loaded twice, and it feels like it is not ok in the majority
of cases.

From: "Jonathan Roelofs via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Eric Fiselier" <eric@efcs.ca>, "clang developer list" <cfe-dev@lists.llvm.org>, "Chandler Carruth"
<chandlerc@gmail.com>, "Kostya Serebryany" <kcc@google.com>, "Evgenii Stepanov" <eugenis@google.com>
Sent: Sunday, August 14, 2016 7:07:00 PM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

What are these suffixes and where are they added?

To be clear: the suffixes aren't something that exist yet, but rather they're something I'm proposing.

Strawman:

    flag(s)                      suffix
    -------                      ------
    -fsanitize=address           /asan
    -fsanitize=address,memory    /asan/msan
Then with `-fsanitize=address`:

    /usr/lib/libc++.so

becomes:

    /usr/lib/asan/libc++.so

And with `-fsanitize=memory`, you get:

    /usr/lib/asan/msan/libc++.so

because an msan'd but not asan'd build of the library was not supplied by the vendor (for whatever hypothetical reason). Then the validation problem of having an exponential number of combinations to test becomes the vendor's problem: they can ship as many or as few of the flavors of the libraries as they want.

Here you'd have some notion of "satisfies the constraints the user asked for" (which is usually "is ABI compatible with" as far as normal multilib stuff goes) and another to rank the choices and break ties when all else is the same.
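
A minimal sketch of that two-step selection, written in Python purely for illustration. Nothing like this exists in the Clang driver today as far as this thread establishes; `pick_flavor` and the `vendor` layout are hypothetical, and the ranking rule (fewest extra sanitizers, then alphabetical suffix) is one possible tie-breaker, not one the thread settled on.

```python
def pick_flavor(requested, available):
    """Return the best-matching suffix for the requested sanitizers, or None.

    requested: set of sanitizer names the user asked for.
    available: dict mapping a directory suffix to the set of sanitizers
               that flavor of the library was built with.
    """
    # Constraint: a flavor qualifies only if it covers everything requested.
    candidates = [
        (suffix, built_with)
        for suffix, built_with in available.items()
        if requested <= built_with
    ]
    if not candidates:
        return None
    # Ranking: prefer the fewest extra sanitizers, then break ties by suffix
    # so the result is deterministic.
    suffix, _ = min(candidates, key=lambda c: (len(c[1] - requested), c[0]))
    return suffix


# Hypothetical vendor layout mirroring the strawman table above.
vendor = {
    "": set(),
    "/asan": {"address"},
    "/asan/msan": {"address", "memory"},
}

asan_dir = pick_flavor({"address"}, vendor)  # exact match is preferred
msan_dir = pick_flavor({"memory"}, vendor)   # falls back to the asan+msan build
tsan_dir = pick_flavor({"thread"}, vendor)   # nothing qualifies
```

With this layout, a request for only `memory` resolves to `/asan/msan`, matching the example above where no msan-only build was shipped.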

Note that right now if I build with -stdlib=libc++ (and libc++ is part
of llvm checkout), I don't get any RPATH. So the binary is linked
against the libc++.so in the toolchain build directory, but it would
not find it at runtime without some extra help. This is the price you
pay for running out of temp location, and we should probably keep it
like this for sanitizer builds, too, i.e. put the sanitized libc++ in
lib/msan and let the user set their own RPATH.

Yeah, that's my inclination also. We could of course provide some flag to support querying the compiler for what the sanitizer lib suffix is (or re-use/hijack the existing one for normal multilibs). That'd allow build scripts to append the suffix in a principled way.
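
As a rough illustration of what "append the suffix in a principled way" could look like in a build script, here is a small Python helper. It assumes the suffix has already been obtained somehow (the query flag itself does not exist yet, so it is not shown), and both `rpath_flags` and the `/opt/llvm/lib` directory are hypothetical.

```python
import posixpath


def rpath_flags(lib_dir, suffix):
    """Build linker flags that point a binary at a suffixed library directory.

    lib_dir: base library directory, e.g. the toolchain's lib directory.
    suffix:  sanitizer suffix such as "/msan"; empty for the plain library.
    """
    # Strip the leading slash so join() does not treat the suffix as absolute.
    target = posixpath.join(lib_dir, suffix.lstrip("/"))
    target = posixpath.normpath(target)
    # -L lets the static linker find the library; -Wl,-rpath records where
    # the dynamic loader should look at run time.
    return ["-L" + target, "-Wl,-rpath," + target]


# Hypothetical toolchain location.
flags = rpath_flags("/opt/llvm/lib", "/msan")
plain_flags = rpath_flags("/opt/llvm/lib", "")
```

The build script would then append `flags` to its link line, so the user-chosen RPATH stays in sync with whichever flavor the compiler selected.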

The other part of the problem is how to install sanitized libc++
system-wide and have apps use it. That's where we need the loader
support, and I think it should follow the multilib design as close as
possible.

An idea for this: assuming they're all ABI compatible, stick them in their suffixed folders as appropriate, but add a symlink from the no suffix location to whatever one you want to be used system-wide.
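
The symlink idea can be sketched as follows. This is only an illustration under the stated assumption that all flavors are ABI compatible; `select_system_wide` is a hypothetical helper, and the demonstration runs in a throwaway temporary directory instead of a real system library path.

```python
import os
import tempfile


def select_system_wide(lib_root, suffix, name="libc++.so"):
    """Point lib_root/name at the copy stored under lib_root/suffix."""
    link = os.path.join(lib_root, name)
    # A relative target keeps the link valid if lib_root is relocated.
    target = os.path.join(suffix.strip("/"), name)
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(target, link)
    return link


# Demonstrate on a temporary directory, not on a real /usr/lib.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "asan"))
    with open(os.path.join(root, "asan", "libc++.so"), "w") as f:
        f.write("placeholder")
    link = select_system_wide(root, "/asan")
    resolved = os.path.realpath(link)
    expected = os.path.realpath(os.path.join(root, "asan", "libc++.so"))
    points_at_asan = resolved == expected
```

Switching the system-wide flavor then amounts to re-running the helper with a different suffix.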

Jon

From: "Jonathan Roelofs" <jonathan@codesourcery.com>
To: "Evgenii Stepanov" <eugenis@google.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "Eric Fiselier" <eric@efcs.ca>, "clang developer list" <cfe-dev@lists.llvm.org>,
"Chandler Carruth" <chandlerc@gmail.com>, "Kostya Serebryany" <kcc@google.com>
Sent: Monday, August 15, 2016 1:37:11 PM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++


This kind of scheme sounds great, but is this something we can implement on our own, or something that requires changes to the dynamic loader (e.g. glibc's ld.so)?

-Hal


Isn't it entirely up to what the user sticks in the rpath of the binaries that they build?

Jon


It is my understanding that rpath only really helps with executables. If I want to build a dynamic library and sanitize it, without rebuilding my executable, then an rpath won't help.

Also, it would be awfully nice if -fsanitize=address were the only flag necessary to add to the build to make everything work. Requiring users to add another flag isn't particularly kind.

If we "only" need to provide sanitizers for libc++, and not libc++abi, then I would be more of a fan of providing a different .so name, along with a lot of version tagging. If version tagging isn't used, then having multiple libc++ versions in the same process would cause all sorts of interposition problems. I don't know how widely available version tagging is in practice though. GNU ld has it (see "Version Script" in "Using LD, the GNU linker").

From: "Jonathan Roelofs" <jonathan@codesourcery.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Eric Fiselier" <eric@efcs.ca>, "clang developer list" <cfe-dev@lists.llvm.org>, "Chandler Carruth"
<chandlerc@gmail.com>, "Kostya Serebryany" <kcc@google.com>, "Evgenii Stepanov" <eugenis@google.com>
Sent: Monday, August 15, 2016 9:24:17 AM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

>> From: "Jonathan Roelofs via cfe-dev" <cfe-dev@lists.llvm.org>
>> To: "Eric Fiselier" <eric@efcs.ca>, "clang developer list" <cfe-dev@lists.llvm.org>, "Chandler Carruth" <chandlerc@gmail.com>, "Kostya Serebryany" <kcc@google.com>, "Evgenii Stepanov" <eugenis@google.com>
>> Sent: Sunday, August 14, 2016 7:07:00 PM
>> Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++
>>
>>> Sanitizers such as MSAN require the entire program to be
>>> instrumented, anything less leads to plenty of false positives.
>>> Unfortunately this can be difficult to achieve, especially for
>>> the C and C++ standard libraries. To work around this the
>>> sanitizers provide interceptors for common C functions, but the
>>> same solution doesn't work as well for the C++ STL. Instead users
>>> are forced to manually build and link a custom sanitized libc++.
>>> This is a huge PITA and I would like to improve the situation,
>>> not just for MSAN but all sanitizers. I'm working on a proposal
>>> to change this. The basis of my proposal is:
>>>
>>> Clang should install/provide multiple sanitized versions of
>>> Libc++ and a mechanism to easily link them, as if they were a
>>> Compiler-RT runtime.
>>>
>>> The goal of this proposal is:
>>>
>>> (1) Greatly reduce the number of false positives caused by using
>>>     an un-sanitized STL.
>>> (2) Allow sanitizers to catch user bugs that occur within the STL
>>>     library, not just its headers.
>>>
>>> The basic steps I would like to take to achieve this are:
>>>
>>> (1) Teach the compiler-rt CMake how to build and install each
>>>     sanitized libc++ version alongside its other runtimes.
>>> (2) Add options to the Clang driver to support linking/using these
>>>     libraries.
>>>
>>> I think this proposal is likely to be contentious, so I would
>>> like to focus on the details of it. Once I have some feedback on
>>> these details I'll put together a formal proposal, including a
>>> plan for implementing it.
>>>
>>> The details I would like input on are:
>>>
>>> (A) What kind and how many sanitized versions of libc++ should
>>> we provide?
>>> ----------------------------------------------------------------
>>>
>>> I think the minimum set would be Address (which includes Leak),
>>> Memory (With origin tracking?), Thread, and Undefined. Once we
>>> get into combinations of sanitizers things get more complicated.
>>> What other sanitizer combinations should we provide?
>>>
>>> (B) How should we handle UBSAN?
>>> ---------------------------------------------------
>>>
>>> UBSAN is really just a collection of sanitizers and providing
>>> sanitized versions of libc++ for every possible configuration is
>>> out of the question. Instead we should figure out what subset of
>>> UBSAN checks we want to enable in sanitized libc++ versions. I
>>> suspect we want to disable the following checks.
>>>
>>> * -fsanitize=vptr
>>> * -fsanitize=function
>>> * -fsanitize=float-divide-by-zero
>>>
>>> Additionally UBSAN can be combined with every other sanitizer
>>> group (ie Address, Memory, Thread). Do we want to provide a
>>> combination of UBSAN on/off for every group, or can we simply
>>> provide an over-sanitized version with UBSAN on?
>>>
>>> (C) How should the Clang driver expose the sanitized libraries
>>> to the users?
>>> ----------------------------------------------------------------
>>>
>>> I would like to propose the driver option '-fsanitize-stdlib' and
>>> '-fsanitize-stdlib=<sanitizer>'. The first version deduces the
>>> best sanitized version to use, the second allows it to be
>>> explicitly specified.
>>>
>>> A couple of other options are:
>>>
>>> * -fsanitize=foo: Implicitly turn on a sanitized STL. Clang
>>>   deduces which version.
>>> * -stdlib=libc++-<sanitizer>: Explicitly turn on and choose a
>>>   sanitized STL.
>>>
>>> (D) Should sanitized libc++ versions override libc++.so?
>>> ----------------------------------------------------------------
>>>
>>> For example, what happens when a program links to both a sanitized
>>> and non-sanitized libc++ version? Does the sanitized version
>>> replace the non-sanitized version, or should both versions be
>>> loaded into the program?
>>>
>>> Essentially I'm asking if the sanitized versions of libc++
>>> should have the "soname" libc++ so they can replace non-sanitized
>>> version, or if they should have a different "soname" so the
>>> linker treats them as a separate library.
>>>
>>> I haven't looked into the consequences of either approach in
>>> depth, but any input is appreciated.
>>
>> In a sense, these are /just/ multilibs, so my inclination would be
>> to make all the soname's the same, and just stick them in
>> appropriately named subfolders relative to their normal location.
>
> I'm not sure that's true; there's no property of the environment
> that determines which library path you need. As a practical matter,
> I can't set $PLATFORM and/or $LIB in my rpath and have ld.so do the
> right thing in this context. Moreover, it is really a property of
> how you compiled, so I think using an alternate library name is
> natural.

Multilibs solve exactly the problem of "it's a property of how you
compiled". The thing that's subtly different here is that the usual
thing that people do with multilibs is to provide ABI incompatible
versions of the same library (which are made incompatible via
compiler flags, -msoft-float, for example), whereas these libraries
just so happen to be ABI compatible with their non-instrumented
variants.

I'm not sure I understand what you're saying about $PLATFORM and
$LIB, but I /think/ it's a red herring: the compiler takes care of
adding in the multilib suffixes where appropriate, so shouldn't the
answer to "which library do I stick in the rpath?" include said
suffix (when compiled with Eric's proposed flag)?

I'm not sure what color herring it is ;) -- I'm trying to understand the system you're proposing:

1. User A compiles/installs Clang/LLVM/libc++ on system A in /local/clang, and so we get a /local/clang/lib/libc++.so and a /local/clang/lib/msan/libc++.so. User A compiles a program, foo, with msan enabled, and foo gets an rpath of /local/clang/lib/msan. User A also compiles another program, prod, without any sanitizers, and prod gets an rpath of /local/clang/lib.

2. User B compiles/installs Clang/LLVM/libc++ on system B in /soft/clang, and so we get a /soft/clang/lib/libc++.so and a /soft/clang/lib/msan/libc++.so. User A sends User B the executables foo and prod. Those executables have rpaths with /local/clang/..., but those don't help User B. User B has an environment with LD_LIBRARY_PATH=/soft/clang/lib so that the executables compiled by User A will run.

3. User B has no good option, because if LD_LIBRARY_PATH is set to /soft/clang/lib, then prod will behave as expected (i.e. not be sanitized), but foo will not. If LD_LIBRARY_PATH is set to /soft/clang/lib/msan, then foo will be sanitized as expected, but prod will run slower than usual.

4. User B compiles programs to send to User A. User A then sets LD_LIBRARY_PATH to /local/clang/lib, and has the same problem as User B. Moreover, if User A links with -Wl,--enable-new-dtags (the recommended default on many systems), the linker emits DT_RUNPATH instead of, or in addition to, DT_RPATH (the effect is the same), and the rpath scheme won't even work for User A on User A's own executables, because LD_LIBRARY_PATH overrides DT_RUNPATH.
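
The four cases above can be replayed with a small model of the search order documented in the ld.so(8) man page for glibc: DT_RPATH (only if no DT_RUNPATH is present), then LD_LIBRARY_PATH, then DT_RUNPATH, then the default directories. This is a simplified sketch, not the loader's actual algorithm: it ignores /etc/ld.so.cache and secure-execution mode, and the `search_dirs` and `resolve` helpers are hypothetical names.

```python
def search_dirs(rpath=(), runpath=(), ld_library_path=(),
                defaults=("/lib", "/usr/lib")):
    """Return directories in the (simplified) order ld.so consults them."""
    dirs = []
    if not runpath:
        # DT_RPATH is honoured only when the binary has no DT_RUNPATH.
        dirs.extend(rpath)
    dirs.extend(ld_library_path)   # the environment comes before DT_RUNPATH
    dirs.extend(runpath)
    dirs.extend(defaults)
    return dirs


def resolve(name, installed, **search):
    """Return the first path in the search order that exists on the system."""
    for directory in search_dirs(**search):
        candidate = directory + "/" + name
        if candidate in installed:
            return candidate
    return None


# Files present on each system, taken from cases 1 and 2.
SYSTEM_A = {"/local/clang/lib/libc++.so", "/local/clang/lib/msan/libc++.so"}
SYSTEM_B = {"/soft/clang/lib/libc++.so", "/soft/clang/lib/msan/libc++.so"}

# Case 3: foo (built on A with msan) running on B picks up the plain libc++.
foo_on_b = resolve("libc++.so", SYSTEM_B,
                   rpath=["/local/clang/lib/msan"],
                   ld_library_path=["/soft/clang/lib"])

# Case 4: on A itself, DT_RUNPATH loses to LD_LIBRARY_PATH...
foo_on_a_runpath = resolve("libc++.so", SYSTEM_A,
                           runpath=["/local/clang/lib/msan"],
                           ld_library_path=["/local/clang/lib"])

# ...whereas a plain DT_RPATH would have won.
foo_on_a_rpath = resolve("libc++.so", SYSTEM_A,
                         rpath=["/local/clang/lib/msan"],
                         ld_library_path=["/local/clang/lib"])

print(foo_on_b, foo_on_a_runpath, foo_on_a_rpath)
```

In this model only the DT_RPATH build on system A ends up with the sanitized library; both other runs silently load the unsanitized one.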

There are a few things, other than pure directory paths, that can appear in, or otherwise affect, LD_LIBRARY_PATH and DT_RPATH/DT_RUNPATH, but I don't think any of them help us here:

1. Pseudo variables $ORIGIN, $LIB and $PLATFORM - These are expanded by ld.so based on properties of the current execution environment (e.g. whether you're loading a 32-bit or 64-bit executable, the hardware architecture).

2. Hardware-capability strings - There is a fixed set of hardware capabilities, such as sse, sse2, altivec, etc., that are appended to the directory name to form alternate search paths.

3. The multilib suffix. This, AFAIK, is baked into the dynamic loader. The path to the loader itself has the multilib suffix, and that's specified in PT_INTERP.

Unfortunately, I don't think that any of these help us.
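
As a rough illustration of why the pseudo variables don't help, here is a sketch of the substitution ld.so performs on an rpath entry. The `expand_dst` helper is a hypothetical name and the expansion is simplified to plain string replacement. The point is only that every input describes the execution environment, and none of them records how the binary was compiled.

```python
def expand_dst(entry, origin, lib, platform):
    """Expand $ORIGIN, $LIB and $PLATFORM (and the ${...} forms) in an entry.

    origin   -- directory containing the program or shared object
    lib      -- "lib" or "lib64", depending on the architecture
    platform -- processor type string, for example "x86_64"
    """
    values = {"ORIGIN": origin, "LIB": lib, "PLATFORM": platform}
    for token, value in values.items():
        entry = entry.replace("${" + token + "}", value)
        entry = entry.replace("$" + token, value)
    return entry


# foo (msan) and prod (no sanitizer) installed side by side see the same
# environment, so the same rpath entry expands to the same directory for both.
env = {"origin": "/opt/app/bin", "lib": "lib64", "platform": "x86_64"}
path_for_foo = expand_dst("$ORIGIN/../$LIB", **env)
path_for_prod = expand_dst("$ORIGIN/../$LIB", **env)
print(path_for_foo, path_for_prod)
```

Since there is no token that could expand to something like "msan", the choice between the sanitized and unsanitized directory would have to be written into the rpath literally.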

-Hal

Also, it would be awfully nice if -fsanitize=address were the only flag necessary to add to the build to make everything work. Requiring users to add another flag isn’t particularly kind.

It may not be particularly kind, but replacing your STL is something we might want consent to do. Especially if it has all these complexities.

From: "Eric Fiselier via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Ben Craig" <ben.craig@codeaurora.org>
Cc: "clang developer list" <cfe-dev@lists.llvm.org>
Sent: Monday, August 15, 2016 2:51:38 PM
Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

> Also, it would be awfully nice if -fsanitize=address were the only
> flag necessary to add to the build to make everything work.
> Requiring users to add another flag isn't particularly kind.

It may not be particularly kind, but replacing your STL is something
we might want consent to do. Especially if it has all these
complexities.

I don't disagree. If we can't come up with a system that 'just works', then we should ask for consent (or at least provide an opt-out).

-Hal