Correct use of StringRef and Twine

dblaikie · July 18, 2011, 7:38pm

I’m attempting to do some amount of mass migration from const std::string& (& const char*) to StringRef. In doing so I stumbled across the fact that while StringRef has no op+ to speak of (well, it has += and I added a manual StringRef+std::string and std::string+StringRef for now - though I’m thinking perhaps they can be skipped in favor of Twine operations), Twine does provide a variety of handy op+ implementations that get picked up mostly for free.

A few questions from this:

When should a StringRef parameter type be preferred over a const Twine& parameter type?
Would it be OK to implement an implicit conversion from Twine to std::string rather than using the explicit str() function? (when switching to StringRef many expressions became Twine concatenation when they were std::string concatenation - this isn’t a drop-in replacement due to the absence of std::string conversion from Twine (which should be a perf win, I’d think - delaying concatenation into a single operation instead of (((a + b) + c) + d)), so I’ve had to wrap various concatenation expressions in (…).str())
What would happen if Twine gave back a pointer/reference to some internal storage? In the common/correct use case (taking a Twine as a const ref, or using a Twine entirely in a temporary concatenation expression) wouldn’t the Twine’s internal storage live long enough to allow the caller to use that buffer within the life of the statement (as in, say, o << (aStringRef + “foo”))?

This last one is probably the bigger question & relates to the other two. It seems like Twine is nice for cases where deferred concatenation is required & might be able to be avoided altogether - if the majority case is that the concatenation ends up being thrown away, you win. But as you approach the 100% usage (not throwing away ever) are you losing perf in the simple non-concatenative case (all those callers who just pass in a straight string without any concatenation - the Twine then returns that string by value when str() is called, right?)?

At the moment, if the lifetime works out & Twine can return a reference to a member (this would mean unrealized Twines would have a bunch of empty/spare std::strings lying around, but I expect those could be optimized away fairly reliably), that Twine would be strictly superior to StringRef & could be preferred in all cases, potentially even replacing/simplifying StringRef substantially.

Thoughts?

David

Chris_Lattner · July 19, 2011, 5:07pm

I'm attempting to do some amount of mass migration from const std::string& (& const char*) to StringRef.

Great!

In doing so I stumbled across the fact that while StringRef has no op+ to speak of (well, it has += and I added a manual StringRef+std::string and std::string+StringRef for now - though I'm thinking perhaps they can be skipped in favor of Twine operations), Twine does provide a variety of handy op+ implementations that get picked up mostly for free.

Yeah, I don't think we want to support + on StringRef without going through Twine.

A few questions from this:

* When should a StringRef parameter type be preferred over a const Twine& parameter type?

Generally, here are the guidelines for return values:
1. Use std::string when the value is computed, or there are other lifetime issues.
2. Use StringRef otherwise.

And for arguments, generally always use Twine as the default, it allows construction of complex things, and is still efficient when passed the equiv of a StringRef (with the toStringRef method). The only annoying thing about it is that the API to do this requires a temporary SmallVector to scribble in, which makes it more difficult to use.

It seems that there should be a good way to fix this API problem.

* Would it be OK to implement an implicit conversion from Twine to std::string rather than using the explicit str() function? (when switching to StringRef many expressions became Twine concatenation when they were std::string concatenation - this isn't a drop-in replacement due to the absence of std::string conversion from Twine (which should be a perf win, I'd think - delaying concatenation into a single operation instead of (((a + b) + c) + d)), so I've had to wrap various concatenation expressions in (...).str())

I'd prefer not. I'd rather convert the things that want an std::string to take a Twine.

* What would happen if Twine gave back a pointer/reference to some internal storage? In the common/correct use case (taking a Twine as a const ref, or using a Twine entirely in a temporary concatenation expression) wouldn't the Twine's internal storage live long enough to allow the caller to use that buffer within the life of the statement (as in, say, o << (aStringRef + "foo"))?

This is really dangerous. I'd much rather extend raw_ostream to take twines.

This last one is probably the bigger question & relates to the other two. It seems like Twine is nice for cases where deferred concatenation is required & might be able to be avoided altogether - if the majority case is that the concatenation ends up being thrown away, you win. But as you approach the 100% usage (not throwing away ever) are you losing perf in the simple non-concatenative case (all those callers who just pass in a straight string without any concatenation - the Twine then returns that string by value when str() is called, right?)?

The major win is actually when you have clients that don't *want* an std::string. std::string is really slow in most operations (it does atomic operations and COW with common implementations, not to mention heap allocations, though libc++ is much much better about both of these). Twine is optimized to avoid creating the temporary std::string at all.

At the moment, if the lifetime works out & Twine can return a reference to a member (this would mean unrealized Twines would have a bunch of empty/spare std::strings lying around, but I expect those could be optimized away fairly reliably), that Twine would be strictly superior to StringRef & could be preferred in all cases, potentially even replacing/simplifying StringRef substantially.

The toStringRef(x) method is what you want. It is really fast and does no copy if the twine *just* contains a C string, std::string or StringRef, and in the concat cases it does no memory allocation in the common case where the SmallVector is big enough for the result.

-Chris

dblaikie · July 19, 2011, 5:24pm

And for arguments, generally always use Twine as the default, it allows construction of complex things, and is still efficient when passed the equiv of a StringRef (with the toStringRef method). The only annoying thing about it is that the API to do this requires a temporary SmallVector to scribble in, which makes it more difficult to use.

Yes, I noticed this - which was one of my concerns about migrating
lots of stuff to Twine: that efficient Twine-sinks would be tricky
and/or verbose, so I was wondering how to make Twine-sinks easier to
write (hence the ref return I mentioned below)

> * Would it be OK to implement an implicit conversion from Twine to std::string rather than using the explicit str() function? (when switching to StringRef many expressions became Twine concatenation when they were std::string concatenation - this isn't a drop-in replacement due to the absence of std::string conversion from Twine (which should be a perf win, I'd think - delaying concatenation into a single operation instead of (((a + b) + c) + d)), so I've had to wrap various concatenation expressions in (...).str())

I'd prefer not. I'd rather convert the things that want an std::string to take a Twine.

Then it's a question of what sort of final destinations/sinks there
are & how many, given the slight complexity of writing an efficient
Twine sink.

> * What would happen if Twine gave back a pointer/reference to some internal storage? In the common/correct use case (taking a Twine as a const ref, or using a Twine entirely in a temporary concatenation expression) wouldn't the Twine's internal storage live long enough to allow the caller to use that buffer within the life of the statement (as in, say, o << (aStringRef + "foo"))?

This is really dangerous.

Is it much/any more dangerous than Twine's existing functionality,
though? Twine's already referring to temporaries & will break in fun
ways if used outside that cliche.

I'd much rather extend raw_ostream to take twines.

Certainly that covers this case, but I'm not sure how many different
sinks there are that would need the nuanced efficient Twine handling
code. Perhaps it's not many - though some easy way to do "stdStr =
twine" would be great since quite often we want to take a string & put
it in a member, etc. Would it be worth having "assign(std::string&,
const Twine&)" or similar - or is going through Twine::str() &
expecting the compiler to optimize away the temporary std::string
acceptable? (I assume not, or Twine wouldn't have all those fancy
functions for getting a StringRef & providing a buffer, etc)

The major win is actually when you have clients that don't *want* an std::string.

Ah, right, that too.

std::string is really slow in most operations (it does atomic operations and COW with common implementations,

Hmm, which implementations still use COW, I wonder? I think there's
something in C++0x that makes it impossible to implement std::string
as COW.

The toStringRef(x) method is what you want. It is really fast and does no copy if the twine *just* contains a C string, std::string or StringRef, and in the concat cases it does no memory allocation in the common case where the SmallVector is big enough for the result.

curiosity question: how much more efficient (vague question, I know)
is the StringRef + SmallVector than a good (eg: libc++) std::string
implementation? I know, for example, that Visual C++ 2010's
std::string does perform the small string optimization which I guess
is what SmallVector is doing.

- David

FlyLanguage · July 19, 2011, 5:37pm

curiosity question: how much more efficient (vague question, I know)
is the StringRef + SmallVector than a good (eg: libc++) std::string
implementation? I know, for example, that Visual C++ 2010's
std::string does perform the small string optimization which I guess
is what SmallVector is doing.

- David

Problem is, there are REALLY bad std::string implementations out there in widely used c++ runtimes.

StringRef/Twine is efficient everywhere.

dblaikie · July 19, 2011, 5:44pm

Problem is, there are REALLY bad std::string implementations out there in
widely used c++ runtimes.

No doubt - though widely used to build llvm/clang? Perhaps, I suppose,
I'm not sure just how portable llvm code is.

StringRef/Twine is efficient everywhere.

Yep - I'm just pedantic about having a tidy codebase, and duplicate
code to workaround bad implementations when clang/llvm itself is a
good one & can selfhost seems a bit of a pity.

(which raises another side question on a comment I saw Chris make a
few weeks ago: that llvm/clang (or perhaps the comment was limited to
llvm only?) wouldn't be migrating to C++0x in the near/foreseeable
future. Given the ability to selfhost, I was wondering why that would
be? (My best guess is that clang/llvm can't bootstrap itself yet, so
it still needs to remain compilable with something that can be
bootstrapped))

FlyLanguage · July 19, 2011, 6:05pm

Perhaps, I suppose, I'm not sure just how portable llvm code is.

Very portable AFAICT. I'm even going to port it to my hobbyist OS once C++ rtti is totally gone in the llvm code base.

dblaikie · July 19, 2011, 6:14pm

Perhaps, I suppose, I'm not sure just how portable llvm code is.

Very portable AFAICT. I'm even going to port it to my hobbyist OS once C++
rtti is totally gone in the llvm code base.

Sorry, I meant source portable to different compilers (ie: which
compilers does llvm compile on & how many have really poor std::string
implementations).

- David

FlyLanguage · July 19, 2011, 6:39pm

Yes, that's what I thought you meant. Not *many* have poor std::string implementations, but they exist and are in use.

FlyLanguage · July 19, 2011, 6:50pm

I should elaborate... the problem with STL, and std::string in particular, is that you have to consider each implementation's performance characteristics. For stuff like LLVM, where performance is really critical, you have to code against a particular implementation.

Some string implementations does simple COW/refcounting, others do SSO, some do LSP, some do other optimizations. And there are all sorts of combinations, and future implementations are likely to change on certain platforms.

It's nasty.

Chris_Lattner · July 20, 2011, 2:01am

GCC's libstdc++ std::string implementation (as of GCC 4.2, but I doubt they've fixed it recently due to ABI concerns) is COW (and thus does atomic accesses) with no short-string optimization. It's brutally slow.

-Chris

Chris_Lattner · July 20, 2011, 2:02am

Clang certainly self hosts on unix based platforms. I have no idea if it self hosts on windows though, and relying on a cross build to windows seems like a bad idea to me

That said, I'd really like to see (e.g.) the clang static analyzer get split out into a clang plugin, and then *it* could use c++'0x features.

-Chris

dblaikie · July 21, 2011, 7:00am

And for arguments, generally always use Twine as the default, it allows construction of complex things, and is still efficient when passed the equiv of a StringRef (with the toStringRef method). The only annoying thing about it is that the API to do this requires a temporary SmallVector to scribble in, which makes it more difficult to use.

It seems that there should be a good way to fix this API problem.

This is the problem I'm still trying to figure out. It seems that
while Twine is efficient to pass, it's not easy to use efficiently & I
don't think it'd be appropriate (correct me if I'm wrong) to go adding
in temporary SmallVectors all over the place to try to consume Twine
parameters in any code that needs to handle string contents after
migrating the argument types to Twines.

One place I experimented with fixing tonight (after trying broader
goals - like changing all StringRef args in clang only, say) was to
add a Twine(char) ctor to enable llvm::Triple's ctor to take Twines
rather than StringRefs, and then do Twine op+ to build the Data
member.

The problem I see with this is that the current implementation of
Triple's ctor is still more efficient than the simple Twine version:

: Data(x)
{
Data += '-';
Data += x;
Data += '-';
Data += 'z';
}

(essentially), as opposed to:

: Data((x + '-' + y + '-' + z).str())

Which requires an extra string copy of the final value in all cases...
actually, now that I think about it, since it's returning into a ctor,
this might be nicely optimized - so perhaps this is the "right way" to
write this code. [diff attached]

So then here's another example (can't find the exact piece of code I
had been working on, but taking
tools/clang/lib/Basic/Targets.cpp::getCPUDefineSuffix as an example
anyway - just something using a StringSwitch, really). If this
function were to take a const Twine& and pass it along to
StringSwitch, ultimately StringSwitch (being the sink for the Twine -
the code needing to read the individual character elements) would be
responsible for allocating a temporary SmallVectorImpl, passing that
in, getting a StringRef & then using that for its work.

Another example of a string sink is, say, tools/llvmc/src/Hooks.cpp. I
found this while looking for uses of "+= '-'" to use the char support
for Twine I'd just added. But I can't upgrade the Arg argument from
const std::string& to const Twine& easily since it needs to actually
manipulate the string - find, access elements, and substr it.

Should this work just be done case by case? If so, I don't think I'll
end up with much Twine usage, probably just a lot of StringRef usage &
lots of str() calls.
If Twine were able to be a more drop-in replacement (by providing a
StringRef in a more convenient manner - or, ideally, subsuming
StringRef's functionality entirely I think (& simply allocating a
backing buffer when necessary - the suggesting you mentioned was
dangerous, though I'm still tossing it around as something to try))
it'd be more practical to use it as the go-to argument for anything
that needs a string.

Looking forward to hearing anyone's thoughts on this.

- David

twine_triple.diff (3.13 KB)

dblaikie · July 21, 2011, 7:30am

[diff attached]

Updated diff with test fix. (since this broke a test (printing chars
as numerical values, rather than characters) it's possible this change
is a bad idea & it could break the product code itself. Though
strangely I wasn't able to do character concatenation without my
change, so I have a sneaking suspicion that while the test passed, it
didn't actually expose this case to the common Twine use cases.
Perhaps only explicitly invoking the Twine ctor would've got the
char-as-number behavior previously)

twine_triple.diff (3.64 KB)

Chris_Lattner · July 22, 2011, 6:02pm

The dangerous part of this is that characters are integers, so "foo" + 'x' is very likely to cause serious problems. This is the reason that the integer versions of the twine ctor are marked 'explicit'. I'm ok with the Twine class changes in this patch if the ctor is marked 'explicit'. You should also probably add a ctor for signed/unsigned char as well (which reuse the existing CharKind enum).

I'll respond to Triple specific issues in response to your previous email. Thanks for pushing this forward David!

-Chris

Chris_Lattner · July 22, 2011, 8:42pm

And for arguments, generally always use Twine as the default, it allows construction of complex things, and is still efficient when passed the equiv of a StringRef (with the toStringRef method). The only annoying thing about it is that the API to do this requires a temporary SmallVector to scribble in, which makes it more difficult to use.

It seems that there should be a good way to fix this API problem.

This is the problem I'm still trying to figure out. It seems that
while Twine is efficient to pass, it's not easy to use efficiently & I
don't think it'd be appropriate (correct me if I'm wrong) to go adding
in temporary SmallVectors all over the place to try to consume Twine
parameters in any code that needs to handle string contents after
migrating the argument types to Twines.

Are you concerned about the stack size of the smallvector's, or the amount of code that needs to be written?

One place I experimented with fixing tonight (after trying broader
goals - like changing all StringRef args in clang only, say) was to
add a Twine(char) ctor to enable llvm::Triple's ctor to take Twines
rather than StringRefs, and then do Twine op+ to build the Data
member.

Nice.

The problem I see with this is that the current implementation of
Triple's ctor is still more efficient than the simple Twine version:

: Data(x)
{
Data += '-';
Data += x;
Data += '-';
Data += 'z';
}

(essentially), as opposed to:

: Data((x + '-' + y + '-' + z).str())

Which requires an extra string copy of the final value in all cases...
actually, now that I think about it, since it's returning into a ctor,
this might be nicely optimized - so perhaps this is the "right way" to
write this code. [diff attached]

The former is relying on how append works on std::string. Beyond the number of bytes copied, the former could cause (depending on the implementation of std::string) a reallocation for every append.

Well written, the Twine version could just do a resize/reserve once, then fill in the bytes. I don't know if Twine does that yet though.

So then here's another example (can't find the exact piece of code I
had been working on, but taking
tools/clang/lib/Basic/Targets.cpp::getCPUDefineSuffix as an example
anyway - just something using a StringSwitch, really). If this
function were to take a const Twine& and pass it along to
StringSwitch, ultimately StringSwitch (being the sink for the Twine -
the code needing to read the individual character elements) would be
responsible for allocating a temporary SmallVectorImpl, passing that
in, getting a StringRef & then using that for its work.

Right. Something like this could work:

foo(const Twine &T) {
...
TwineTmpBuffer Tmp;
StringSwitch(T.toString(Tmp)).....

Which doesn't seem too horrible, just needs a typedef of smallvector to TwineTmpBuffer.

Another example of a string sink is, say, tools/llvmc/src/Hooks.cpp. I
found this while looking for uses of "+= '-'" to use the char support
for Twine I'd just added. But I can't upgrade the Arg argument from
const std::string& to const Twine& easily since it needs to actually
manipulate the string - find, access elements, and substr it.

We should make ".str()" an efficient way to get a mutable std::string out.

Should this work just be done case by case? If so, I don't think I'll
end up with much Twine usage, probably just a lot of StringRef usage &
lots of str() calls.

Yes, I think it should be done carefully, one API at a time. I think that would make sense. Sticking with StringRef where it is possible is simple and good.

-Chris

Chris_Lattner · July 22, 2011, 8:43pm

All GCC's still do this AFAIK, which covers every popular free unix system that I'm aware of and the Mac.

-Chris

dblaikie · July 22, 2011, 9:59pm

The dangerous part of this is that characters are integers, so "foo" + 'x' is very likely to cause serious problems.

std::string already provides such overloads though, doesn't it? So the
code isn't any safer from accidental "foo" + 'x' expressions that
don't include Twine/StringRef/std::string than it was before. But if
the argument is that std::string's interface was poorly
designed/unsafe & we can do better/safer, I'm OK with making the ctor
explicit as you've suggested.

You should also probably add a ctor for signed/unsigned char as well (which reuse the existing CharKind enum).

Hmm - would it be safe to cast those signed/unsigned chars to straight
char? (is it guaranteed that the signed & unsigned values with the
same representation map to the same glyph?)

As a side note on Twine's design: Is there a particular reason it uses
void*s rather than unions? and chars rather than enums?

(sorry if I'm asking lots of "why is this like this" questions all
over the code base - I just don't want to assume that it's intentional
and replicate a pattern elsewhere that I don't understand only to find
it's unintentional "not fixed yet" sort of stuff. I suppose at the
very least it'll be a chance to add in some explanatory comments if I
do find things that are by design but weren't clear to me)

Thanks,
- David

dblaikie · July 22, 2011, 10:11pm

Are you concerned about the stack size of the smallvector's, or the amount of code that needs to be written?

Mostly the code that needs to be written (my suggestion of having
Twines be able to create mutable buffers internally would increase
stack size by even more than just one SmallVector per Twine sink, so I
haven't really been thinking stack space efficiency so much as
authoring efficiency). It would be nice if authors didn't have to
guess at how their callers might construct strings & could always
provide Twine argument types without making a tradeoff between
implementation simplicity/rapidity & runtime efficiency when consuming
such strings.

The former is relying on how append works on std::string. Beyond the number of bytes copied, the former could cause (depending on the implementation of std::string) a reallocation for every

append.

Ah, indeed - I can see how an implementation of std::string might not
want to do the same kind of growth mechanisms as is more common in,
say, std::vector, whereas Twine knows it's got to put a bunch of
things into the buffer so it can be designed with that in mind.

Well written, the Twine version could just do a resize/reserve once, then fill in the bytes. I don't know if Twine does that yet though.

Yep - something I'll keep in mind to take a look at.

Right. Something like this could work:

foo(const Twine &T) {
...
TwineTmpBuffer Tmp;
StringSwitch(T.toString(Tmp)).....

Which doesn't seem too horrible, just needs a typedef of smallvector to TwineTmpBuffer.

In a few choice places, maybe, but as the default way to pass string
parameters I think that'd be a hard sell as a general practice.

Another example of a string sink is, say, tools/llvmc/src/Hooks.cpp. I
found this while looking for uses of "+= '-'" to use the char support
for Twine I'd just added. But I can't upgrade the Arg argument from
const std::string& to const Twine& easily since it needs to actually
manipulate the string - find, access elements, and substr it.

We should make ".str()" an efficient way to get a mutable std::string out.

In the case of Hooks.cpp I could do a similar transformation as the
one you suggested above, using the SmallVectorImpl - but my point is
that doing that to every string sink seems like it'd be a little
annoying. If you think it's suitable/the right thing I can do that &
if we find a terse/tidier/more streamlined way to handle string sinks
later we can always update this usage.

Should this work just be done case by case? If so, I don't think I'll
end up with much Twine usage, probably just a lot of StringRef usage &
lots of str() calls.

Yes, I think it should be done carefully, one API at a time. I think that would make sense. Sticking with StringRef where it is possible is simple and good.

Hmm, this is one bit I'm not sure about. As I tried to explain above,
it seems problematic to have to choose your argument type on the basis
of how you think callers might use your API. From the perspective of a
caller, a Twine argument is at least as expressive as a StringRef
(since all StringRefs can be Twined implicitly), but it takes that
extra step to write the implementation.

Perhaps I'm aiming for some kind of purist/perfectionist argument that
isn't necessary or practical, but I hope I've been clear in explaining
my uncertainty/issue here.

Thanks,
- David

Chris_Lattner · July 23, 2011, 12:29am

The dangerous part of this is that characters are integers, so "foo" + 'x' is very likely to cause serious problems.

std::string already provides such overloads though, doesn't it? So the
code isn't any safer from accidental "foo" + 'x' expressions that
don't include Twine/StringRef/std::string than it was before. But if
the argument is that std::string's interface was poorly
designed/unsafe & we can do better/safer, I'm OK with making the ctor
explicit as you've suggested.

Yes, exactly. I'm just saying that I think the additional clarity of:
"foo" + Twine('x')

is worth the inconvenience.

You should also probably add a ctor for signed/unsigned char as well (which reuse the existing CharKind enum).

Hmm - would it be safe to cast those signed/unsigned chars to straight
char? (is it guaranteed that the signed & unsigned values with the
same representation map to the same glyph?)

Yes. I consider 'signed vs unsigned char vs char' to be a blight on the C type system. Just casting to char internally would be fine.

As a side note on Twine's design: Is there a particular reason it uses
void*s rather than unions?

I'm not sure what you're proposing specifically.

and chars rather than enums?

char vs enum is because of visual studio compatibility and because enums often are stored as 32-bit values instead of 8-bit values.

(sorry if I'm asking lots of "why is this like this" questions all
over the code base - I just don't want to assume that it's intentional
and replicate a pattern elsewhere that I don't understand only to find
it's unintentional "not fixed yet" sort of stuff. I suppose at the
very least it'll be a chance to add in some explanatory comments if I
do find things that are by design but weren't clear to me)

No problem at all, happy to help answer the questions. Forcing a reexamination of past decisions is not a bad thing at all in this case

Right. Something like this could work:

foo(const Twine &T) {
...
TwineTmpBuffer Tmp;
StringSwitch(T.toString(Tmp)).....

Which doesn't seem too horrible, just needs a typedef of smallvector to TwineTmpBuffer.

In a few choice places, maybe, but as the default way to pass string
parameters I think that'd be a hard sell as a general practice.

Yes, from a general API design perspective, I hate having to force a choice between "convenience to implementor of an API to just take a StringRef" vs "convenience to client of API for it to take a Twine". It really stinks.

I was chatting with Howard Hinnant about this and he suggested replacing Twine with a template metaprogramming expression-template system. I haven't thought through all the details, but perhaps this would allow us to get the best of both worlds?

Hmm, this is one bit I'm not sure about. As I tried to explain above,
it seems problematic to have to choose your argument type on the basis
of how you think callers might use your API. From the perspective of a
caller, a Twine argument is at least as expressive as a StringRef
(since all StringRefs can be Twined implicitly), but it takes that
extra step to write the implementation.

Perhaps I'm aiming for some kind of purist/perfectionist argument that
isn't necessary or practical, but I hope I've been clear in explaining
my uncertainty/issue here.

Yes, I'm deeply unhappy about this aspect of our string api's.

-Chris

dblaikie · July 24, 2011, 7:09am

Yes, exactly. I'm just saying that I think the additional clarity of:
"foo" + Twine('x')

is worth the inconvenience.

Ok, attached a modified version of my patch with an Twine(char),
Twine(unsigned char), and Twine(signed char). All three are explicit &
have included test cases.

Speaking of which - is there any LLVM dev policy or preference around
C-style casts versus C++-style casts? Twine is littered with the
former but I generally use the latter for safety/clarity of intend,
etc. (I'm not going to try to fix everything in Twine just now if
we're considering alternative implementations like template
metaprogramming anyway)

As a side note on Twine's design: Is there a particular reason it uses
void*s rather than unions?

I'm not sure what you're proposing specifically.

I've attached an version of my fix that includes the union use I was
referring to (sorry for being vague in the previous email). Though if
we're considering a more explicitly structured solution with
templates/subtypes/etc we could skip the whole mess of enums &
unions/casts.

and chars rather than enums?

char vs enum is because of visual studio compatibility and because enums often are stored as 32-bit values instead of 8-bit values.

Curious - seems VS >2k3 (
http://msdn.microsoft.com/en-us/library/2dzy4k6e(v=VS.80).aspx ) has
supported specifying enum backing type (so you can specify it as char:
"enum foo : char { ... }") if that was the feature you were wanting to
use but avoiding.

Yes, from a general API design perspective, I hate having to force a choice between "convenience to implementor of an API to just take a StringRef" vs "convenience to client of API for it to take a Twine". It really stinks.

Ok - glad we're on the same page there, wasn't quite sure whether we
were talking past one another in previous emails.

I have a couple of ideas about how to lower the developer cost of
using Twine ubiquitously. I'll try to describe them in some detail.

I was chatting with Howard Hinnant about this and he suggested replacing Twine with a template metaprogramming expression-template system. I haven't thought through all the details, but perhaps this would allow us to get the best of both worlds?

I think I'd prefer a template metaprogramming solution to Twine just
from a purist perspective compared to all the
casting/reinterpreting/enums (even with the union) - that sort of
stuff makes me a bit uncomfortable, but I realize it's well contained
enough that it's not too risky (& I know some of the other tricks LLVM
uses like low bit twiddling in pointers, etc, so this seems relatively
sane in the grand scheme of things). But I'm not sure it'll
fundamentally change how we might be able to solve this caller
convenience issue.

So here are my ideas:

1) the easy solution: create a StringRef subclass (or new type with a
similarly expressive API) that wraps a buffer that it may or may not
use. Give it a ctor that takes a Twine. The use case then goes from
this:

SmallVectorImpl<char> v;
StringRef s = theTwine.GetStringRef(v);

to this:

TwineString s(theTwine);

(& the ctor looks like: TwineString(const Twine& t) { *this =
t.GetStringRef(this->buffer); })
So any Twine sink now has to pay one line of code to get a usable
string out of a Twine. Cheap to implement, easy to fix existing use
cases (privatize GetStringRef(SmallVectorImpl<char>) and make
TwineString a friend - fix everything that fails to compile).

This keeps things simple & seems to be "good enough" to me, but we
could perhaps do better (at the very least, again, if we did do
better, we could go back & remove TwineString & again fix all the
places that fail to compile with whatever new hotness we invent)

Hmm - new thought: what would happen if my argument type was simply
TwineString? If I knew my function needed access to the values (like
the Host.cpp example) but didn't have any buffer of its own that it
needed to use, that seems OK. This new thought perhaps renders
suggestion (2) uninteresting. Functions that know they always need the
string contents take a TwineString. Functions that might be able to
skip realizing the string entirely - or have their own buffer (eg: a
member std::string) to populate - take a Twine & explicitly construct
a TwineString only in the path that needs the value. That seems
optimal to me.

1.1) We probably want a better way to get a Twine into a std::string -
while the change to Triple's ctor works because returning a
std::string from a function into a ctor can do fancy in-place
construction - in any other case like "SetFoo(const Twine& t) {
this->s = t; }" we'd want something better. Perhaps simply
t.AssignTo(s);

2) This one isn't quite as fleshed out, but see if the general outline
makes sense: Introduce an extra type that's Twine-esque, but is only
the top level (so it's implicitly constructible from a const Twine&),
rather than all the branches of a Twine. If a function wants to accept
a Twine that it needs to manipulate, it can mark its argument as
TwineString2 (for want of a better name, just using 2 to distinguish
from the (1) example). TwineString2 has a buffer in it, but it starts
uninitialized. Any attempt to retrieve/inspect values causes the
string to be initialized. [downside: all string operations on a
TwineString2 incur an "if not initialized" check]

Twines could be constructed from TwineString2s if the value needed to
be passed on to another function. (& still take advantage of any
initialized string buffer it might have, rather than having to
remanifest from its child nodes repeatedly)

One fancy enhancement to this is that TwineString2 could partially
manifest - eg: if you want ts[2] it could walk its children (dropping
its references to children & updating the buffer with the contents at
each step) until it found the 2nd character - then stop, with a
partially expanded tree/buffer. Though there's probably nothing
stopping us implementing such smarts in the (1) solution eventually
either.

Thanks,
- David

twine_explicit_char.patch (4.26 KB)

twine_explicit_char_union.patch (14.9 KB)

Topic		Replies	Views
Twine vs StringRef as function argument LLVM Project	13	635	October 21, 2025
RFC: Constructing StringRefs at compile time LLVM Dev List Archives	42	697	January 9, 2017
raw c strings in lldb LLDB	17	340	February 25, 2015
How does Twine work? LLVM Dev List Archives	2	252	April 29, 2019
Is it time to allow StringRef to be constructed from nullptr? LLVM Dev List Archives	15	412	September 28, 2016

Correct use of StringRef and Twine

Related topics