RFC: proper string handling in CXString

I opened this issue: https://bugs.llvm.org/show_bug.cgi?id=35896

MSAN tests fail because CXString is trying to reuse string data from
LLVM::StringRef. However the latter doesn't guarantee that strings
are NUL-terminated, but CXStrings need to be. Calling
`createRef(llvm::StringRef)` then selectively reuses (if there's a
zero terminator) or copies (otherwise) the bytes referenced in
StringRef.

To do this, CXString needs to possibly look past the end of the string
allocation, e.g. `if (str[str.size()] != 0)`. If the allocation of
the string bytes ends at &str[str.size()-1], then MSAN fails because
of the out-of-bounds read of those data.

So, I was hoping to get some feedback or ideas on how to fix it. I'm
trying to make LLVM and Clang MSAN-clean, as I'm trying to debug other
issues and preexisting problems are getting in the way, and I figure
that fixing these issues, however minor some of them are, are a good
thing anyway. :slight_smile:

Some ideas:

(1) Always copy StringRef data. Let `createRef(StringRef)` simply
jump to `createDup(StringRef)` instead, or codemod the former away.
Safe but involves more "memcpy"ing.
(2) Let CXString have only data and size fields, without worrying
about internal representation. Then when using `clang_getCString`, do
the allocation + string copy + extra NUL byte. Still does a "memcpy",
but only on strings the caller truly needs as C strings.
(3) My least favorite option, selectively ignore this particular MSAN
violation in some blacklist, and get on with life.

Other ideas?

Thanks!

--steveo

I opened this issue: https://bugs.llvm.org/show_bug.cgi?id=35896

MSAN tests fail because CXString is trying to reuse string data from
LLVM::StringRef. However the latter doesn't guarantee that strings
are NUL-terminated, but CXStrings need to be. Calling
`createRef(llvm::StringRef)` then selectively reuses (if there's a
zero terminator) or copies (otherwise) the bytes referenced in
StringRef.

To do this, CXString needs to possibly look past the end of the string
allocation, e.g. `if (str[str.size()] != 0)`. If the allocation of
the string bytes ends at &str[str.size()-1], then MSAN fails because
of the out-of-bounds read of those data.

So, I was hoping to get some feedback or ideas on how to fix it. I'm
trying to make LLVM and Clang MSAN-clean, as I'm trying to debug other
issues and preexisting problems are getting in the way, and I figure
that fixing these issues, however minor some of them are, are a good
thing anyway. :slight_smile:

Some ideas:

(1) Always copy StringRef data. Let `createRef(StringRef)` simply
jump to `createDup(StringRef)` instead, or codemod the former away.
Safe but involves more "memcpy"ing.
(2) Let CXString have only data and size fields, without worrying
about internal representation. Then when using `clang_getCString`, do
the allocation + string copy + extra NUL byte. Still does a "memcpy",
but only on strings the caller truly needs as C strings.

Of these options, I like (2) the most, provided there's some way to access the contents of a CXString without calling clang_getCString() on it (for performance-sensitive clients).

(3) My least favorite option, selectively ignore this particular MSAN
violation in some blacklist, and get on with life.

IMO this is a serious issue. E.g it seems plausible that a thread could store a non-zero value into the null-terminator byte after the CXString is constructed.

best,
vedant