createCXString reads one past end byte

Hello,

Nakamura-san has drawn my attention to the following valgrind failure:

http://lab.llvm.org:8011/builders/clang-x86_64-linux-vg/builds/643/steps/check-all/logs/Clang%20%3A%3A%20Index___comment-to-html-xml-conversion.cpp

The issue is that createCXString has an optimization to check if the
original string is nul-terminated. (See CXString.cpp:46) If the
original StringRef is nul-terminated, then it does not duplicate the
string. This is a good optimization in general, but it requires
reading one past end byte of the string. In this case, this byte just
happens to be uninitialized, because there was no need to initialize
it to create a StringRef that was passed to createCXString.

But this access can, in fact, trap, if it happens to be on the next
page that is unallocated.

Another possible case is when that byte happens to be nul during
createCXString, but is modified afterwards. When the libclang user
code reads the string, it will read garbage at the end.

The proper fix, I think, is to remove this optimization. Only this
way we can guarantee correctness. But I brought up this discussion
because doing so can have a big performance impact. (And somehow we
got away with reading one past end for a long time...)

The immediate fix for this failure is easy and it just proves that the
real issue is inside createCXString:

Index: lib/AST/CommentLexer.cpp

Hello,

Nakamura-san has drawn my attention to the following valgrind failure:

http://lab.llvm.org:8011/builders/clang-x86_64-linux-vg/builds/643/steps/check-all/logs/Clang%20%3A%3A%20Index___comment-to-html-xml-conversion.cpp

The issue is that createCXString has an optimization to check if the
original string is nul-terminated. (See CXString.cpp:46) If the
original StringRef is nul-terminated, then it does not duplicate the
string. This is a good optimization in general, but it requires
reading one past end byte of the string. In this case, this byte just
happens to be uninitialized, because there was no need to initialize
it to create a StringRef that was passed to createCXString.

But this access can, in fact, trap, if it happens to be on the next
page that is unallocated.

Another possible case is when that byte happens to be nul during
createCXString, but is modified afterwards. When the libclang user
code reads the string, it will read garbage at the end.

The proper fix, I think, is to remove this optimization.

I, for one, agree.

Only this
way we can guarantee correctness. But I brought up this discussion
because doing so can have a big performance impact. (And somehow we
got away with reading one past end for a long time...)

The immediate fix for this failure is easy and it just proves that the
real issue is inside createCXString:

Index: lib/AST/CommentLexer.cpp

--- lib/AST/CommentLexer.cpp (revision 172983)
+++ lib/AST/CommentLexer.cpp (working copy)
@@ -53,6 +53,7 @@
   }

   char *Resolved = Allocator.Allocate<char>(UNI_MAX_UTF8_BYTES_PER_CODE_POINT);
+ memset(Resolved, 0, UNI_MAX_UTF8_BYTES_PER_CODE_POINT);
   char *ResolvedPtr = Resolved;
   if (ConvertCodePointToUTF8(CodePoint, ResolvedPtr))

Wouldn't it be more efficient to put: *ResolvedPtr = '\0' here, rather
than memsetting the whole array above?

(moot point, since I agree we should just remove such an optimization
entirely because it's unreliable/undefined)

Hello,

Nakamura-san has drawn my attention to the following valgrind failure:

http://lab.llvm.org:8011/builders/clang-x86_64-linux-vg/builds/643/steps/check-all/logs/Clang%20%3A%3A%20Index___comment-to-html-xml-conversion.cpp

The issue is that createCXString has an optimization to check if the
original string is nul-terminated. (See CXString.cpp:46) If the
original StringRef is nul-terminated, then it does not duplicate the
string. This is a good optimization in general, but it requires
reading one past end byte of the string. In this case, this byte just
happens to be uninitialized, because there was no need to initialize
it to create a StringRef that was passed to createCXString.

But this access can, in fact, trap, if it happens to be on the next
page that is unallocated.

Another possible case is when that byte happens to be nul during
createCXString, but is modified afterwards. When the libclang user
code reads the string, it will read garbage at the end.

Yes, this is bad.

The proper fix, I think, is to remove this optimization. Only this
way we can guarantee correctness. But I brought up this discussion
because doing so can have a big performance impact. (And somehow we
got away with reading one past end for a long time...)

We could change how CXString works; instead of eagerly malloc'ing in case of a StringRef, have it stored in a "StringRef-kind" form and malloc when clang_getCString is called.
But also add a new function like clang_getStringWithLength, or something, for callers that can use that to avoid the extra malloc.

For now, I have marked this test as xfail in r173121.

Please remove it when resolved.

...Takumi

Hello Argyrios,

We could change how CXString works; instead of eagerly malloc'ing in case of a StringRef, have it stored in a "StringRef-kind" form and malloc when clang_getCString is called.

This will add an extra backing storage mode to CXString, let's call
that CXS_UnmanagedWithLength. In this mode we will store the pointer
in 'data', and length will be stored in the upper bits of
'private_flags'. But who will own the memory returned by
clang_getCString? We can not change the original CXString from
CXS_UnmanagedWithLength to CXS_Malloc because CXString is a value
type. Or did I misunderstand something?

Dmitri

Hello Argyrios,

We could change how CXString works; instead of eagerly malloc'ing in case of a StringRef, have it stored in a "StringRef-kind" form and malloc when clang_getCString is called.

This will add an extra backing storage mode to CXString, let's call
that CXS_UnmanagedWithLength. In this mode we will store the pointer
in 'data', and length will be stored in the upper bits of
'private_flags'. But who will own the memory returned by
clang_getCString? We can not change the original CXString from
CXS_UnmanagedWithLength to CXS_Malloc because CXString is a value
type. Or did I misunderstand something?

You are right, it is so easy to get a copy of it and trigger a memory leak unintentionally.

I'll do some measurements to see if it makes much difference to malloc on every StringRef (in the meantime any suggestions on how to avoid the malloc would be greatly appreciated).

-Argyrios

We can add an indirection layer. Let's add a mode CXS_Lazy. 'data'
would point to a struct:

struct {
  const char *Str;
  unsigned Length:31;
  unsigned IsNulTerminated:1;
};

Then CXString would remain a value type, and all copies would point to
the same underlying data.

Dmitri

Hello Argyrios,

We could change how CXString works; instead of eagerly malloc'ing in case of a StringRef, have it stored in a "StringRef-kind" form and malloc when clang_getCString is called.

This will add an extra backing storage mode to CXString, let's call
that CXS_UnmanagedWithLength. In this mode we will store the pointer
in 'data', and length will be stored in the upper bits of
'private_flags'. But who will own the memory returned by
clang_getCString? We can not change the original CXString from
CXS_UnmanagedWithLength to CXS_Malloc because CXString is a value
type. Or did I misunderstand something?

You are right, it is so easy to get a copy of it and trigger a memory leak unintentionally.

I'll do some measurements to see if it makes much difference to malloc on every StringRef (in the meantime any suggestions on how to avoid the malloc would be greatly appreciated).

We can add an indirection layer. Let's add a mode CXS_Lazy. 'data'
would point to a struct:

struct {
const char *Str;
unsigned Length:31;
unsigned IsNulTerminated:1;
};

Then CXString would remain a value type, and all copies would point to
the same underlying data.

I like it! We can also enhance the mechanism a bit to be able to use a free list of those so that we avoid malloc'ing them for the normal use pattern of having a CXString as a temporary object to get at the underlying string, what do you think ?

A good malloc implementation should already do that. Anyway, we do
have ArrayRecycler, so we can use it here. But I don't see a good
place to put an instance of ArrayRecycler. We can not make it static
because of thread safety.

Dmitri

Hello Argyrios,

We could change how CXString works; instead of eagerly malloc'ing in case of a StringRef, have it stored in a "StringRef-kind" form and malloc when clang_getCString is called.

This will add an extra backing storage mode to CXString, let's call
that CXS_UnmanagedWithLength. In this mode we will store the pointer
in 'data', and length will be stored in the upper bits of
'private_flags'. But who will own the memory returned by
clang_getCString? We can not change the original CXString from
CXS_UnmanagedWithLength to CXS_Malloc because CXString is a value
type. Or did I misunderstand something?

You are right, it is so easy to get a copy of it and trigger a memory leak unintentionally.

I'll do some measurements to see if it makes much difference to malloc on every StringRef (in the meantime any suggestions on how to avoid the malloc would be greatly appreciated).

We can add an indirection layer. Let's add a mode CXS_Lazy. 'data'
would point to a struct:

struct {
const char *Str;
unsigned Length:31;
unsigned IsNulTerminated:1;
};

Then CXString would remain a value type, and all copies would point to
the same underlying data.

I like it! We can also enhance the mechanism a bit to be able to use a free list of those so that we avoid malloc'ing them for the normal use pattern of having a CXString as a temporary object to get at the underlying string, what do you think ?

A good malloc implementation should already do that. Anyway, we do
have ArrayRecycler, so we can use it here. But I don't see a good
place to put an instance of ArrayRecycler. We can not make it static
because of thread safety.

How about putting the instance (or whoever owns it) in the indirection struct ?