Handling of unsafe functions

Hello,

We have identified functions in LLVM sources using a static code analyzer which are marked as a “security vulnerability”[1][2]. There has been work already done to address some of them for Linux (e.g. snprintf). We are attempting to solve this issue in a comprehensive fashion across all platforms. Most of the functions identified are for manipulating strings. Memcpy is the most commonly used of all these unsecure methods. The following table lists all these functions are their recommended secure alternatives.

Recommended alternatives:

Functions Windows Unix/Mac OS

Memcpy memcpy_s -

Sprint sprintf_s snprintf

Sscanf scanf_s -

_alloca _malloca -

Strcat strcat_s strlcat

Strcpy strcpy_s strlcpy

Strtok strtok_s -

The proposal is to add secure versions of these functions. These functions will be implemented in LLVM Support module and be used by all other LLVM modules. The interface of these methods will be platform independent while their implementation will be platform specific (like the Mutex class in Support module). In cases where the platform does not support the functionality natively, we are writing an implementation of these functions. For example, in the case of memcpy the secure function will look like llvm::memcpy_secure.

Some secure functions require additional data that needs to be passed (like buffer sizes). That information has to be added in all places of invocation. In some cases, this requires an extra size_t argument to be passed through. Hence, this change would not just be a one to one function refactoring. The attached patch helps illustrate how an instance of memcpy would be modified.

Is this proposal of interest to the LLVM community? Can you also comment if the approach specified is good to address this issue?

References:

[1] http://msdn.microsoft.com/en-us/library/ms235384(v=vs.80).aspx

[2] Documentation Archive

llvm_secure_function.patch (6.2 KB)

I generally disagree with the approach.

Generally char* strings aren't recommended for use in LLVM and this
kind of string manipulation in LLVM shouldn't be done with the
primitive C library functions. The Programmer's Manual gives the
preferred types to use for strings [1] and all of them keep track of
length. There are also safe routines for creating and formatting
strings, such as raw_ostream which is used pervasively in LLVM.

The example routine in your patch probably should just use
raw_string_ostream or raw_svector_ostream, instead of relying on
C-style string routines. That way, the correctness is enforced by the
compiler, instead of manually laboring over these things (like
checking the return code, which your patch doesn't do...).

In other words, there are completely safe alternatives for these
functions for almost all cases.

One particular use case that usually pertains to memcpy though is when
performance is of significant concern and hence the author "knows what
they are doing" and aren't willing to sacrifice performance calling
into some "secure" version when they have other assurances that the
target buffer has sufficient space. The performance difference can be
significant, since usually memcpy will be turned into a compiler
builtin that the compiler recognizes and optimizes specially, whereas
with the suggested approach, a regular call into a "llvm::*_secure"
wrapper which then calls into the OS-provided general-purpose "secure"
version will happen.

I think that it would be useful if you used the output of your static
analyzer to provide a list of the places where C-style string
manipulation is being done, so that these places can be migrated to
using modern, safe LLVM interfaces for these operations.

[1] http://llvm.org/docs/ProgrammersManual.html#ds_string

--Sean Silva

Martinez, Javier E wrote:

Hello,

We have identified functions in LLVM sources using a static code
analyzer which are marked as a “security vulnerability”[1][2]. There has
been work already done to address some of them for Linux (e.g.
snprintf). We are attempting to solve this issue in a comprehensive
fashion across all platforms. Most of the functions identified are for
manipulating strings. Memcpy is the most commonly used of all these
unsecure methods. The following table lists all these functions are
their recommended secure alternatives.

Recommended alternatives:

Functions Windows Unix/Mac OS

Memcpy memcpy_s -

Sprint sprintf_s snprintf

Sscanf scanf_s -

_alloca _malloca -

Strcat strcat_s strlcat

Strcpy strcpy_s strlcpy

Strtok strtok_s -

The proposal is to add secure versions of these functions. These
functions will be implemented in LLVM Support module and be used by all
other LLVM modules. The interface of these methods will be platform
independent while their implementation will be platform specific (like
the Mutex class in Support module). In cases where the platform does not
support the functionality natively, we are writing an implementation of
these functions. For example, in the case of memcpy the secure function
will look like llvm::memcpy_secure.

Some secure functions require additional data that needs to be passed
(like buffer sizes). That information has to be added in all places of
invocation. In some cases, this requires an extra size_t argument to be
passed through. Hence, this change would not just be a one to one
function refactoring. The attached patch helps illustrate how an
instance of memcpy would be modified.

Is this proposal of interest to the LLVM community? Can you also comment
if the approach specified is good to address this issue?

Personally, I'm not particularly interested in blanket replacement of memcpy with memcpy_s in the hopes that it might close a security hole. I am very interested in fixing any actual bugs. If it's easier to fix real bugs by aggressively using this additional layer, then that may well be the way to go, but before I agree to that, I've got a ton of questions to answer first.

What's the current error rate? How often are we seeing bugs in llvm that would be fixed if only we were calling "secure" functions?

What's the impact of calling the secure function? On Release builds and on Debug builds? On size and performance?

Why not rely on platforms to secure these functions? For instance, Linux and Darwin both have FORTIFY_SOURCE, and I'm too ignorant of Windows to know what the equivalent is there. What about existing tools like valgrind or ASAN?

What happens if memcpy_secure does detect an insecure memcpy? It's considered very rude for LLVM to terminate on the spot since it's often used as a library, so how do we handle the error? By calling llvm::report_fatal_error and hoping we don't recurse? What if it's a debug build and we'd like to see where the code went wrong?

How do you plan to enforce that the insecure functions aren't called?

Nick

Hi Nick,

Thanks for taking the time to review the proposal. I'd like to stress out that the purpose of the changes in the proposal are not to improve performance or to fix existing bugs. The purpose is to catch instances where a buffer overrun would happen with specific function calls. With the appropriate input the buffer overrun could lead to undefined behavior or a crash.

What's the current error rate? How often are we seeing bugs in llvm that would be fixed if only we were calling "secure" functions?
[JM] Zero, the motivation is not to fix exiting bugs but potential hidden ones.

What's the impact of calling the secure function? On Release builds and on Debug builds? On size and performance?
[JM] I haven't compared the performance of the secure functions and performance is not mentioned much on the pages I've looked at. However I found at [1] a performance comparison between strcpy and strcpy_s. There appears to be a performance penalty for the use of strcpy_s. In a real case it's very likely that the penalty is hidden by bottlenecks in other places. If the performance degradation is apparent then the proposal can be modified. For example, the *_secure version could do only parameter validation and size checks.

Why not rely on platforms to secure these functions? For instance, Linux and Darwin both have FORTIFY_SOURCE, and I'm too ignorant of Windows to know what the equivalent is there. What about existing tools like valgrind or ASAN?
[JM] I'm not too familiar with Linux specifics and the tools you mentioned but from what I could gather FORTIFY_SOURCE doesn't cover all buffer overflow cases. The cases it covers are exactly the ones I described above, where the size is known at compile time. I don't know of a FORTIFY_SOURCE equivalent for Windows. The proposed solution would work for both Windows and Linux and cover more general cases of buffer overflows.

What happens if memcpy_secure does detect an insecure memcpy? It's considered very rude for LLVM to terminate on the spot since it's often used as a library, so how do we handle the error? By calling llvm::report_fatal_error and hoping we don't recurse? What if it's a debug build and we'd like to see where the code went wrong?
[JM] We use LLVM inside a DLL so I completely sympathize with you about terminating execution on the spot. Although the patch doesn't address this I think calling llvm::report_fatal_error() if the secure functions fail is a good idea because instead of crashing now LLVM can exit gracefully or if an error handler is available then a controlled situation. In Windows the secure functions allow the use of an error handler to be called if parameter validation fails. The custom error handler can dirently call report_fata_error().

How do you plan to enforce that the insecure functions aren't called?
[JM] How about modifying the LLVM programmer's manual should to add a section about the use of secure functions? I can provide a blurb for it.

[1] What is the fastest string copy function in C++ sprintf, strcpy, strncpy, strcpy_s ~ Code Ketchup

Thanks,
Javier

Hi Sean,

Thanks for the valued feedback. I agree with you that the containers available in LLVM are preferable to char buffers but I want to point out that the proposal doesn't add any new uses of char buffer and merely works with existing ones. Changing existing uses of char buffers to other objects is beyond the scope of this proposal. It makes more sense to do that when changes to code that uses string manipulation functions are made as it could incur in larger design changes.

I'm unsure of the performance impact of using the secure functions and how to balance it with the benefit of improving the code quality. If the proposal gets support I can gather performance data to make the determination of whether there is a performance hit and if it's acceptable. Hoping that authors "know what they are doing" is not enough. If that were the case there wouldn't be bugs to fix and code to review.

I don't have the output of the static analyzer at hand but will provide it on a follow up email.

Thanks,
Javier

Hi Javier,

Thanks for the valued feedback. I agree with you that the containers available in LLVM are preferable to char buffers but I want to point out that the proposal doesn't add any new uses of char buffer and merely works with existing ones. Changing existing uses of char buffers to other objects is beyond the scope of this proposal.

I agree with Sean that it would be far better to get rid of uses of unsafe
functions by using higher level (and safe) abstractions instead. Can you
please provide a list of places using unsafe functions so that people here
can try to eliminate them.

Ciao, Duncan.

  It makes more sense to do that when changes to code that uses string manipulation functions are made as it could incur in larger design changes.

I am strongly opposed to using *_s functions. The issue is that they
are no more "secure" than original functions. One can still pass the
destination buffer length incorrectly, especially if it is not known
at compile time and should be computed.

I agree with Sean that we should move the code using C strings to LLVM
safe data types.

And one more thing: it is interesting that the "unsafe"
APFloat::convertToHexString (from your patch) is not used anywhere.

Dmitri

We have identified functions in LLVM sources using a static code analyzer
which are marked as a “security vulnerability”[1][2]. There has been work
already done to address some of them for Linux (e.g. snprintf). We are
attempting to solve this issue in a comprehensive fashion across all
platforms. Most of the functions identified are for manipulating strings.
Memcpy is the most commonly used of all these unsecure methods. The
following table lists all these functions are their recommended secure
alternatives.

I am strongly opposed to using *_s functions. The issue is that they
are no more "secure" than original functions. One can still pass the
destination buffer length incorrectly, especially if it is not known
at compile time and should be computed.

I agree with Sean that we should move the code using C strings to LLVM
safe data types.

I agree.

And one more thing: it is interesting that the "unsafe"
APFloat::convertToHexString (from your patch) is not used anywhere.

Zap it! Oh wait, is it used by Clang or something else?

-Chris

Clang doesn’t use it, but LLDB does, with an arbitrary-seeming 256-character buffer. Perhaps we should change it to take an llvm::SmallVectorImpl, rather than relying on a character buffer “which must be of sufficient size”.

From the responses it’s pretty clear that the preference is to avoid using C string functions altogether. I’ve attached at list of calls in Clang/LLVM. The EASY/MEDIUM/DIFFICULT tag is an estimate of the effort to replace the call based on the location of the source buffer. If there are no objections I’ll prepare a patch that replaces the string manipulation functions an appropriate string object.

The proposal comments have largely centered on the string functions. Do people feel the same way about memcpy_s? What about those of you building LLVM on Windows with Visual Studio?

I know both functions can be used the wrong way but at least the “secure” version makes one think about the value passed for destination size. Very likely most of the 1691 uses of memcpy in Clang/LLVM are correct but with such a high number of uses there are likely a few that are not. I’m willing to plow through all the calls to check the parameters while making the change to the memcpy_secure version from the proposal.

Thanks,

Javier

CStringFunctionsUse.txt (3.94 KB)

Makes sense to me,

-Chris

Thank you for the list. Fixed the Clang non-platform specific bits in
r164554. Didn't dare to fix other occurrences in Clang because they
are #ifdefed for Windows and I don't have a Windows machine to test.

Dmitri

I’ll prepare a patch that replaces the string manipulation functions an
appropriate string object.

Please break the patch up into focused chunks, one per logical change.
We try to keep all LLVM development as incremental as possible [1]. I
recommend fixing a single logical occurrence (such as fixing
APFloat::convertToHexString()) and then mailing the patch to
llvm-commits. It's likely that the feedback you get from the review of
the first patch will help inform future patches and simplify later
patch reviews.

[1] http://llvm.org/docs/DeveloperPolicy.html#incremental-development

I know both functions can be used the wrong way but at least the “secure”
version makes one think about the value passed for destination size. Very
likely most of the 1691 uses of memcpy in Clang/LLVM are correct but with
such a high number of uses there are likely a few that are not. I’m willing
to plow through all the calls to check the parameters while making the
change to the memcpy_secure version from the proposal.

My inclination is that it will be redundant. For example, consider
this usage, grabbed randomly from the source code:

      char *Buf = static_cast<char *>(Allocate(Directory.size()));
      memcpy(Buf, Directory.data(), Directory.size());

it's not that clear to me that just adding an extra argument of
Directory.size() buys anything. If anything, it seems like it
introduces _another_ place to make an error.

I went through a couple other random uses of memcpy() in the LLVM tree
and none of them seemed like they would benefit.

[...] I’m willing
to plow through all the calls to check the parameters while making the
change to the memcpy_secure version from the proposal.

By all means, please take a look at them. However, it might be overall
more beneficial to go through and report "sketchy" uses, where the
copy is not obviously correct, then discuss on the list how to ensure
the code is safe. Once all of the uses are audited, you can then just
write a little script using something like git's "pickaxe"
functionality (see `-S` option to git-diff(1), and other commands that
take "diff" options) to send you an email every time a new use of
memcpy gets introduced into the codebase. For example, you could have
a daily cron job that runs `git log -p -S'memcpy('
origin/master@{yesterday}..origin/master` and if there is any output
then it mails that output to you.

-- Sean Silva

Sean,

Thanks for the suggestion on the patch submission. I started with the first patch to address a couple of calls to C string functions in Errno.cpp. If that goes well I'll prepare a patch for each function occurrence.

I went through a couple other random uses of memcpy() in the LLVM tree and none of them seemed like they would benefit.

I understand that not all uses of memcpy will clearly benefit from a replacement to the proposed memcpy_secure. For people using LLVM under Windows using Visual Studio there's an additional benefit; LLVM can be compiled without deprecating the functions marked as unsecure. When this is a requirement the alternative is to make the changes locally which becomes a problem when upgrading LLVM versions. I still think that abstracting the implementation of memcpy facilitates the work for users in Windows without having a significant impacting users in other platforms.

Thanks,
Javier

...

Please fill bug reports for your tool. memcpy operates on explicitly
bounded objects, unlikely e.g. strcat/strcpy. Marking them as deprecated
is just as buggy. From the rest of your list, strtok has some issues,
but it is generally safe to use too. The replacements are not an
improvement at all. First time I saw the annex K (?) from C11, I was
thinking like "Who pushed this crap into the standard, Microsoft?".

Joerg

Is memcmp_s (or a variant thereof) a win in practice? It covers the case pretty well where you try to copy a dynamically sized buffer to the start of a statically sized one. I don't want to say that it doesn't happen, but it seems to be rather rare, and I expect most calls to memcpy_s would use the same expression for both sizes.

Florian,

You have a valid point, the majority of the times memcpy is used correctly and the destination buffer size is redundant information. I think we also agree that there are cases where the correct use of memcpy is not clear cut and the 4-parameter version adds value.

I've modified the original proposal to strike the middle of the road. Instead of only providing a 4-parameter version of memcpy_secure now a 3-parameter version also exists. The latter maps to memcpy or memcpy_s depending on what's available as determined by the CMake scripts. The 4-parameter version replaces those calls to memcpy where safe use can't be determined. All other calls will be replaced to the 3-parameter version.

This solution avoids passing redundant information, covers the situation where additional checking is of value and is platform independent (except for the implementation of memcpy_secure). The attached patch demonstrates the idea.

Thanks,
Javier

memcpy_patch.patch (5.21 KB)

Sorry, but I think this change only makes understanding of the code more difficult.