raw c strings in lldb

A patch is up which fixes a number of issues with strings not being null terminated in LLDB. The issues and error-proneness of using raw c strings is well documented and understood, so I would like to propose banning them in LLDB for new code [1].

I realize that’s a tall order, and I don’t expect all the occurrences of them to go away overnight, but issues related to strings have been popping up over and over again, and it would be nice if we could deal with more interesting problems.

LLVM has a very robust string manipulation library, and while it’s not perfect, it would have prevented almost all of these issues from ever happening in the first place. Here is a quick summary of common C string types / operations and their corresponding LLVM equivalents:

Types:
char stack_buffer[n]; // Use llvm::SmallString.
const char* foo; // Use llvm::StringRef foo.

Passing arguments;
void foo(char *, int len); // Use void foo(llvm::SmallVectorImpl &buf);
void foo(const char * foo); // Use void foo(llvm::StringRef foo);

Operations:
strcpy, strncpy, etc // Use member functions of SmallString / StringRef
sprintf, snprintf, etc // Use llvm::raw_ostream

We can fix existing occurrences gradually / as time permits, but I would prefer to see a trend towards new code using llvm’s string classes and formatting library, and when you are touching a piece of code that is messing with raw c-strings, that’s a perfect time to change that code to migrate to LLVM strings.

[1] There are a couple of places where we can’t ban them entirely. This relates to the public API (i.e. can’t change an existing signature), and va_args off the top of my head. The va_args stuff though, if generally useful, if stuff we can sink into LLVM, there’s nothing debuggery about formatting strings.

Thoughts? Suggestions? Comments?

I’d also point out that if you hate writing “llvm::” or using decls everywhere, you can import StringRef into the lldb namespace. Clang does this in include/clang/Basic/LLVM.h for StringRef and other common classes.

Also:

Find all “using namespace llvm;”, Subfolders, Find Results 1, “d:\src\llvm”, “*.cpp”

Matching lines: 1418 Matching files: 1392 Total files searched: 5811

So “using namespace llvm” is certainly not without precedent either.

Ahh, I just noticed you mentioned that in your message. Sorry for the noise (and the additional noise from this response :-/)

A patch is up which fixes a number of issues with strings not being null terminated in LLDB. The issues and error-proneness of using raw c strings is well documented and understood, so I would like to propose banning them in LLDB for new code [1].

I realize that's a tall order, and I don't expect all the occurrences of them to go away overnight, but issues related to strings have been popping up over and over again, and it would be nice if we could deal with more interesting problems.

LLVM has a very robust string manipulation library, and while it's not perfect, it would have prevented almost all of these issues from ever happening in the first place. Here is a quick summary of common C string types / operations and their corresponding LLVM equivalents:

Types:
char stack_buffer[n]; // Use llvm::SmallString<n>.

That would be fine.

const char* foo; // Use llvm::StringRef foo.

This may be fine as long as there is absolutely _no_ string manipulation on "foo" (like "strcat(...) etc. Use std::string for anything that requires storage. And we don't want to start using any code that does "let me assign an llvm::StringRef() with a ConstString("hello").GetCString() so I don't have to worry about ownership. Adding string to the constant string pool should be reserved for things that actually need it (any const names like namespaces, variable names, type names, paths, etc) and also for when we return "const char *" through the public API.

This also means that we can't trust that "foo" is NULL terminated (llvm::StringRef objects have a data pointer and a length and aren't required to be terminated) and it means we must call "foo.str().c_str()" any time we require NULL termination. So I actually don't like llvm::StringRef as a replacement "const char *"...

Passing arguments;
void foo(char *, int len); // Use void foo(llvm::SmallVectorImpl<char> &buf);

I would prefer to use std::string here. Clearer and everyone knows how to use it. SmallVectors are nice, but they have the overhead of storing more data on the stack. Clang currently can create quite large stack frames when you have some code that has large switch statements where each switch statement has local variables. If those switch statements how SmallVector<char, PATH_MAX>, then you now have a case where if this code is called recursively we end up blowing the stack. We have seen this a lot in the DWARF parser, so I don't want to encourage using SmallVector for strings, since many places we use strings are for paths and other larger buffers. Any current std::string implementation can store up to 24 characters in the std::string without needing to do any allocations on the heap, so that essentially gets us a SmallVector<char, 24>. So I vote std::string.

void foo(const char * foo); // Use void foo(llvm::StringRef foo);

Again, I don't like this because you must call "foo.str().c_str()" in order to guarantee NULL termination.

Operations:
strcpy, strncpy, etc // Use member functions of SmallString / StringRef
sprintf, snprintf, etc // Use llvm::raw_ostream

We have StringStream, people should be using that. raw_ostream doesn't support all printf style formats. So all sprintf and snprintf function calls should switch to using lldb_private::StreamString.

We can fix existing occurrences gradually / as time permits, but I would prefer to see a trend towards new code using llvm's string classes and formatting library, and when you are touching a piece of code that is messing with raw c-strings, that's a perfect time to change that code to migrate to LLVM strings.

[1] There are a couple of places where we can't ban them entirely. This relates to the public API (i.e. can't change an existing signature), and va_args off the top of my head. The va_args stuff though, if generally useful, if stuff we can sink into LLVM, there's nothing debuggery about formatting strings.

Thoughts? Suggestions? Comments?

I don't agree for my reasons mentioned above.

My 2 cents is:
1 - keep using "const char *" since they can guarantee NULL termination
2 - Use "std::string" for functions that have (char *, len)
3 - Switch all sprintf and snprintf over to use StreamString

A patch is up which fixes a number of issues with strings not being null terminated in LLDB. The issues and error-proneness of using raw c strings is well documented and understood, so I would like to propose banning them in LLDB for new code [1].

I realize that’s a tall order, and I don’t expect all the occurrences of them to go away overnight, but issues related to strings have been popping up over and over again, and it would be nice if we could deal with more interesting problems.

LLVM has a very robust string manipulation library, and while it’s not perfect, it would have prevented almost all of these issues from ever happening in the first place. Here is a quick summary of common C string types / operations and their corresponding LLVM equivalents:

Types:
char stack_buffer[n]; // Use llvm::SmallString.

That would be fine.

const char* foo; // Use llvm::StringRef foo.

This may be fine as long as there is absolutely no string manipulation on “foo” (like "strcat(…) etc. Use std::string for anything that requires storage. And we don’t want to start using any code that does "let me assign an llvm::StringRef() with a ConstString(“hello”).GetCString() so I don’t have to worry about ownership. Adding string to the constant string pool should be reserved for things that actually need it (any const names like namespaces, variable names, type names, paths, etc) and also for when we return “const char *” through the public API.

This also means that we can’t trust that “foo” is NULL terminated (llvm::StringRef objects have a data pointer and a length and aren’t required to be terminated) and it means we must call “foo.str().c_str()” any time we require NULL termination. So I actually don’t like llvm::StringRef as a replacement “const char *”…

We can’t assume that const chars are null terminated either. They’re just pointers to raw buffers, which you may or may not have null terminated. So we have the same issue. By replacing all uses of const char with StringRef(), we end up with exactly the same semantics. null terminated strings are null terminated, non null terminated strings aren’t.

Passing arguments;
void foo(char *, int len); // Use void foo(llvm::SmallVectorImpl &buf);

I would prefer to use std::string here. Clearer and everyone knows how to use it. SmallVectors are nice, but they have the overhead of storing more data on the stack. Clang currently can create quite large stack frames when you have some code that has large switch statements where each switch statement has local variables. If those switch statements how SmallVector<char, PATH_MAX>, then you now have a case where if this code is called recursively we end up blowing the stack. We have seen this a lot in the DWARF parser, so I don’t want to encourage using SmallVector for strings, since many places we use strings are for paths and other larger buffers. Any current std::string implementation can store up to 24 characters in the std::string without needing to do any allocations on the heap, so that essentially gets us a SmallVector<char, 24>. So I vote std::string.

passing SmallVectorImpl<> is intended as a replacement for the case where you have this:

void foo(char *buf);
void bar() {
char buffer[n];
foo(buffer);
}

The equivalent LLVM code is this:

void foo(llvm::SmallVectorImpl &buf);
void bar() {
llvm::SmallString buffer;
foo(buffer);
}

Exactly the same amount of stack space. It’s not intended to be a general replacement for std::string, which stores its data entirely on the heap. I do think that many places in LLDB that currently use stack buffers could be re-written just as well using heap buffers, and for that I agree std::string is fine.

void foo(const char * foo); // Use void foo(llvm::StringRef foo);

Again, I don’t like this because you must call “foo.str().c_str()” in order to guarantee NULL termination.

As mentioned previously, in the const char * version you have to guarantee that it was initialized with a null terminated string in the first place. The burden is the same either way. Either we’re already making the assumption in the callee, in which case we can continue to make it with the StringRef(), or we aren’t making the assumption in the callee, in which case we can continue not making the assumption but have the added benefit of having the length field passed to us.

Operations:
strcpy, strncpy, etc // Use member functions of SmallString / StringRef
sprintf, snprintf, etc // Use llvm::raw_ostream

We have StringStream, people should be using that. raw_ostream doesn’t support all printf style formats. So all sprintf and snprintf function calls should switch to using lldb_private::StreamString.

I actually disagree here. Printf() has plenty of problems, such as non-portable and confusing arguments, lack of type safety, possibility of buffer overruns. I would actually prefer to use raw_ostream instead of StringStream.

What if llvm::SmallString<> had Format methods that delegated to snprintf etc? Then everything would just work, and it would integrate nicely with the rest of my suggested points such as converting things to StringRef, SmallVectorImpl, etc.

>
> A patch is up which fixes a number of issues with strings not being null terminated in LLDB. The issues and error-proneness of using raw c strings is well documented and understood, so I would like to propose banning them in LLDB for new code [1].
>
> I realize that's a tall order, and I don't expect all the occurrences of them to go away overnight, but issues related to strings have been popping up over and over again, and it would be nice if we could deal with more interesting problems.
>
> LLVM has a very robust string manipulation library, and while it's not perfect, it would have prevented almost all of these issues from ever happening in the first place. Here is a quick summary of common C string types / operations and their corresponding LLVM equivalents:
>
> Types:
> char stack_buffer[n]; // Use llvm::SmallString<n>.

That would be fine.

> const char* foo; // Use llvm::StringRef foo.

This may be fine as long as there is absolutely _no_ string manipulation on "foo" (like "strcat(...) etc. Use std::string for anything that requires storage. And we don't want to start using any code that does "let me assign an llvm::StringRef() with a ConstString("hello").GetCString() so I don't have to worry about ownership. Adding string to the constant string pool should be reserved for things that actually need it (any const names like namespaces, variable names, type names, paths, etc) and also for when we return "const char *" through the public API.

This also means that we can't trust that "foo" is NULL terminated (llvm::StringRef objects have a data pointer and a length and aren't required to be terminated) and it means we must call "foo.str().c_str()" any time we require NULL termination. So I actually don't like llvm::StringRef as a replacement "const char *"...

We can't assume that const char*s are null terminated either.

Yes we can, that is was is assumed by const char * right now for every call we use it with.

They're just pointers to raw buffers, which you may or may not have null terminated. So we have the same issue.

No we don't, it isn't part of the the way you normally use c-strings. llvm::StringRef doesn't need to NULL terminate as it is part of its design.

By replacing all uses of const char* with StringRef(), we end up with exactly the same semantics. null terminated strings are null terminated, non null terminated strings aren't.

But that isn't what is currently meant by _any_ of the functions that take a "const char *" with no length parameter, so it means that any switch to using llvm::StringRef requires we use ".str().c_str()" which I would like to avoid.

> Passing arguments;
> void foo(char *, int len); // Use void foo(llvm::SmallVectorImpl<char> &buf);

I would prefer to use std::string here. Clearer and everyone knows how to use it. SmallVectors are nice, but they have the overhead of storing more data on the stack. Clang currently can create quite large stack frames when you have some code that has large switch statements where each switch statement has local variables. If those switch statements how SmallVector<char, PATH_MAX>, then you now have a case where if this code is called recursively we end up blowing the stack. We have seen this a lot in the DWARF parser, so I don't want to encourage using SmallVector for strings, since many places we use strings are for paths and other larger buffers. Any current std::string implementation can store up to 24 characters in the std::string without needing to do any allocations on the heap, so that essentially gets us a

SmallVector<char, 24>. So I vote std::string.

passing SmallVectorImpl<> is intended as a replacement for the case where you have this:

void foo(char *buf);
void bar() {
    char buffer[n];
    foo(buffer);
}

The equivalent LLVM code is this:

void foo(llvm::SmallVectorImpl<char> &buf);
void bar() {
    llvm::SmallString<n> buffer;
    foo(buffer);
}

Exactly the same amount of stack space. It's not intended to be a general replacement for std::string, which stores its data entirely on the heap.

std::string doesn't always store content on the heap. It stores the first ~20 bytes in the std::string itself (at least for any STL implementation that is worth anything). So I really don't see the need to start using a new string class when std::string is just about all we need and it will take up less space on the stack for reasons I mentioned above with switch statements (3 pointers at most).

I do think that many places in LLDB that currently use stack buffers could be re-written just as well using heap buffers, and for that I agree std::string is fine.

Agreed.

> void foo(const char * foo); // Use void foo(llvm::StringRef foo);

Again, I don't like this because you must call "foo.str().c_str()" in order to guarantee NULL termination.

As mentioned previously, in the const char * version you have to guarantee that it was initialized with a null terminated string in the first place. The burden is the same either way. Either we're already making the assumption in the callee, in which case we can continue to make it with the StringRef(), or we aren't making the assumption in the callee, in which case we can continue not making the assumption but have the added benefit of having the length field passed to us.

I disagree. Passing "const char*" assumes NULL termination unless a length field is also supplied. Passing llvm::StringRef as an object that is known to not NULL terminate, then we have can't assume termination. So I don't agree that passing llvm::StringRef makes anything better.

>
> Operations:
> strcpy, strncpy, etc // Use member functions of SmallString / StringRef
> sprintf, snprintf, etc // Use llvm::raw_ostream

We have StringStream, people should be using that. raw_ostream doesn't support all printf style formats. So all sprintf and snprintf function calls should switch to using lldb_private::StreamString.

I actually disagree here. Printf() has plenty of problems, such as non-portable and confusing arguments,

I agree on this one, but we are getting around this with using PRI macros.

lack of type safety,

most of this is taken care of when you have the printf format warnings on your class.

possibility of buffer overruns.

There are no buffer overruns in the Stream::Printf(). Check the implementation if you need to.

I would actually prefer to use raw_ostream instead of StringStream.

I don't. I have written tools using streaming style stuff before (like std::cout) and also written using printf style stuff and I find the printf style stuff more readable and maintainable. I also don't like streams that maintain state like the std::ostream does. Not sure if raw_ostream does any state management, but when they do it causes bugs later for code that doesn't always explicitly set the format, width, precision and much more:

std::ostream &os;
os << std::hex << hex_num;
dump(os);
os << hex_num2;

little did we know the stream was modified by "dump()" to set the format to decimal so now you must always use:

os << std::hex << hex_num2;

(repeat for number width, precision and many other factors).

I do also want to stress that I really do want to use llvm::StringRef where it makes sense, just not as a general replacement for "const char *". A lot of new code I have added recently, like the parsing code in FormatEntity.cpp, uses a lot of llvm::StringRef objects because it makes sense in this case. So when you are wanting to pull apart a string to get substrings, split an string, etc, from an existing longer C string, it makes a lot of sense to switch over to using llvm::StringRef. But again, not a wholesale all "const char *" args get replaced with llvm::StringRef.

Greg

A patch is up which fixes a number of issues with strings not being null terminated in LLDB. The issues and error-proneness of using raw c strings is well documented and understood, so I would like to propose banning them in LLDB for new code [1].

I realize that’s a tall order, and I don’t expect all the occurrences of them to go away overnight, but issues related to strings have been popping up over and over again, and it would be nice if we could deal with more interesting problems.

LLVM has a very robust string manipulation library, and while it’s not perfect, it would have prevented almost all of these issues from ever happening in the first place. Here is a quick summary of common C string types / operations and their corresponding LLVM equivalents:

Types:
char stack_buffer[n]; // Use llvm::SmallString.

That would be fine.

const char* foo; // Use llvm::StringRef foo.

This may be fine as long as there is absolutely no string manipulation on “foo” (like "strcat(…) etc. Use std::string for anything that requires storage. And we don’t want to start using any code that does "let me assign an llvm::StringRef() with a ConstString(“hello”).GetCString() so I don’t have to worry about ownership. Adding string to the constant string pool should be reserved for things that actually need it (any const names like namespaces, variable names, type names, paths, etc) and also for when we return “const char *” through the public API.

This also means that we can’t trust that “foo” is NULL terminated (llvm::StringRef objects have a data pointer and a length and aren’t required to be terminated) and it means we must call “foo.str().c_str()” any time we require NULL termination. So I actually don’t like llvm::StringRef as a replacement “const char *”…

We can’t assume that const char*s are null terminated either.

Yes we can, that is was is assumed by const char * right now for every call we use it with.

Right, but my point is that if we change a const char * to a StringRef(), and that const char* was assumed to be null terminated, then the StringRef is also assumed to be null terminated. You can just call StringRef::data() on it.

But that isn’t what is currently meant by any of the functions that take a “const char *” with no length parameter, so it means that any switch to using llvm::StringRef requires we use “.str().c_str()” which I would like to avoid.

I don’t see why this is necessary. Just because StringRef supports non-null terminated strings doesn’t mean we have to use non null terminated strings with it. declaring a SmallString<> and passing by SmallVectorImpl& supports null terminating strings. You push_back(0) and then pop_back() before returning. This is no different from what we’re already doing by setting foo[n] = ‘\0’ except that it’s safer since there’s no chance of a buffer overrun.

We are already taking care to make sure all of our const char*s are properly null terminated. Therefore when they are passed to functions that accept a StringRef, they will continue to be null terminated.

Operations:
strcpy, strncpy, etc // Use member functions of SmallString / StringRef
sprintf, snprintf, etc // Use llvm::raw_ostream

We have StringStream, people should be using that. raw_ostream doesn’t support all printf style formats. So all sprintf and snprintf function calls should switch to using lldb_private::StreamString.

I actually disagree here. Printf() has plenty of problems, such as non-portable and confusing arguments,

I agree on this one, but we are getting around this with using PRI macros.

lack of type safety,

most of this is taken care of when you have the printf format warnings on your class.

possibility of buffer overruns.

There are no buffer overruns in the Stream::Printf(). Check the implementation if you need to.

I would actually prefer to use raw_ostream instead of StringStream.

I don’t. I have written tools using streaming style stuff before (like std::cout) and also written using printf style stuff and I find the printf style stuff more readable and maintainable. I also don’t like streams that maintain state like the std::ostream does. Not sure if raw_ostream does any state management, but when they do it causes bugs later for code that doesn’t always explicitly set the format, width, precision and much more:

std::ostream &os;
os << std::hex << hex_num;
dump(os);
os << hex_num2;

little did we know the stream was modified by “dump()” to set the format to decimal so now you must always use:

os << std::hex << hex_num2;

(repeat for number width, precision and many other factors).

I agree, that’s not a pleasant experience. but as an aside, llvm::raw_ostream does not do this.

Also, if we move away from stack allocated buffers towards std::string, then the problem is also solved for us, because std::string::c_str() always returns a null terminated string, so creating a StringRef out of it is fine. StringRef is really just exactly what we already have, but with member functions on it to automatically perform common operations.

raw_ostream is designed to carry as little state as possible and is essentially a subset of lldb::Stream now (it also does printf-style formatting via llvm::format). At some point I pondered implementing StringStream on top of raw_ostream to improve interoperability but it required a lot of work for very little gain; the interfaces were subtly different. For example Stream prints numbers in hexadecimal when using operator<< with an unsigned integer, raw_ostream always uses decimal.

Unless someone wants to take the job of making raw_ostream and StringStream more compatible I'd recommend using Stream instead of llvm::raw_ostream in LLDB. That's what most of LLDB's code is using already and what its interfaces are built on. If all you do is pasting together strings and numbers both classes can be used in almost the same way, naming differences aside.

- Ben

Just to be clear here, the types StringRef and const char* carry the exact same amount of information regarding whether or not they’re null terminated: none. It is up to the person who creates the underlying string to ensure this. This is what i mean by “you can’t assume a const char* is null terminated either”. We enforce this at the creation sites of the backing strings (or at least try to), so we can continue to enforce this. In other words, we don’t need to do anything special. It should just work.

If someone wants to write a function that accepts a string which need not be null terminated, it should be clearly documented as such, and all other functions should be assumed to require null terminated StringRefs. But again, if we’re using SmallString, std::string, or any other form of “safe” string (i.e. No snprintf, strcat, etc) this should always be the case.

Please keep in mind that we need to be able to handle Unicode strings
and filenames, so please don't do anything that makes the handling of
Unicode strings worse. I.e. please don't add any (more) code which
assumes a string is NULL-terminated.

Thanks,
-Dawn

My impression is that the current state of LLDB with Unicode filenames and paths is pretty abysmal. I assumed that nothing would work at all in the current codebase. Are you saying that you are using LLDB to support Unicode today, or just that – as a general matter - supporting Unicode strings is something we should do someday, so we shouldn’t make our future work more difficult?

yes, pretty please? :slight_smile:

Well those single byte characters will be UTF8 encoded so I don’t think null in the string will happen legally.

We currently assume UTF8 and those strings are NULL terminated.