RFC: Constructing StringRefs at compile time

Hi all,

There is a desire to be able to create constexpr StringRefs to avoid
static initializers for global tables of/containing StringRefs.

Creating constexpr StringRefs isn't trivial as strlen isn't portably
constexpr and std::char_traits<char>::length is only constexpr in
C++17.
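
To make the goal concrete, here is a minimal sketch of a table that needs no static initializer. `MiniRef` and `Keywords` are hypothetical names standing in for StringRef and a real table. The lengths are written out by hand here, which is the tedium the proposals below try to remove.

```cpp
#include <cstddef>

// Hypothetical stand-in for llvm::StringRef: a pointer plus a length.
struct MiniRef {
  const char *Data;
  std::size_t Length;
  constexpr MiniRef(const char *D, std::size_t L) : Data(D), Length(L) {}
};

// The constructor is constexpr and every argument is a constant, so the
// table is constant-initialized and no constructor runs at program startup.
constexpr MiniRef Keywords[] = {{"if", 2}, {"else", 4}, {"while", 5}};

static_assert(Keywords[1].Length == 4, "length is available at compile time");
```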

Alp Toker tried to create constexpr StringRefs for string literals by
subclassing StringRef:
https://reviews.llvm.org/rL200187
This is a verbose change where needed at string literal call sites.

Mehdi AMINI tried to add a constexpr constructor for string literals
by making the constructor from const char * explicit:
https://reviews.llvm.org/D25639
This is a verbose change at every non-literal call site.
This only works with assignment syntax.

I've suggested using a user-defined literal:
https://reviews.llvm.org/D26332
This is a small change where needed at string literal call sites.
C++17 adds a UDL for std::string_view, so it's not an unusual idea.
There is resistance to using a UDL as they can introduce a surprising
and novel syntax for calling functions.
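
For readers who haven't used one, a user-defined literal along these lines might look like the sketch below. `MiniRef` and the `_sr` suffix are hypothetical and not necessarily what D26332 uses. The point is that the compiler passes the literal's length to the operator, so no strlen is needed.

```cpp
#include <cstddef>

// Hypothetical stand-in for llvm::StringRef.
struct MiniRef {
  const char *Data;
  std::size_t Length;
  constexpr MiniRef(const char *D, std::size_t L) : Data(D), Length(L) {}
};

// The compiler supplies Len (excluding the terminating NUL) for the literal.
constexpr MiniRef operator""_sr(const char *Str, std::size_t Len) {
  return MiniRef(Str, Len);
}

constexpr MiniRef Hello = "hello"_sr;
static_assert(Hello.Length == 5, "length comes from the compiler");

// An embedded NUL is counted like any other character.
static_assert("a\0b"_sr.Length == 3, "embedded NUL is preserved");
```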

Comments?

Other options?

Thanks,

From: "Malcolm Parsons via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Thursday, November 24, 2016 8:59:25 AM
Subject: [llvm-dev] RFC: Constructing StringRefs at compile time

Hi all,

There is a desire to be able to create constexpr StringRefs to avoid
static initializers for global tables of/containing StringRefs.

Creating constexpr StringRefs isn't trivial as strlen isn't portably
constexpr and std::char_traits<char>::length is only constexpr in
C++17.

Why don't we just create our own traits class that has a constexpr length, and then we can switch over to the standard one when we switch to C++17?

-Hal
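
A minimal sketch of what such a traits class could look like, assuming a hypothetical name `LengthTraits`. It forwards to the standard class when `__cplusplus` reports C++17 and otherwise falls back to a C++11-style recursive function. Note that MSVC only reports an up-to-date `__cplusplus` with /Zc:__cplusplus, so this check is an assumption that would need adjusting there.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper; not an existing LLVM class.
struct LengthTraits {
#if __cplusplus >= 201703L
  // C++17: std::char_traits<char>::length is constexpr.
  static constexpr std::size_t length(const char *S) {
    return std::char_traits<char>::length(S);
  }
#else
  // C++11: a constexpr function body is limited to a single return statement.
  static constexpr std::size_t length(const char *S) {
    return *S ? 1 + length(S + 1) : 0;
  }
#endif
};

static_assert(LengthTraits::length("StringRef") == 9, "computed at compile time");
```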

GCC and Clang treat __builtin_strlen as constexpr.
MSVC 2015 doesn't support C++14 extended constexpr. I don't know how
well it optimises a recursive strlen.

This works as an optimisation for GCC and Clang, and doesn't make
things worse for MSVC:

     /// Construct a string ref from a cstring.
     LLVM_ATTRIBUTE_ALWAYS_INLINE
+#if __has_builtin(__builtin_strlen)
+ /*implicit*/ constexpr StringRef(const char *Str)
+ : Data(Str), Length(Str ? __builtin_strlen(Str) : 0) {}
+#else
     /*implicit*/ StringRef(const char *Str)
         : Data(Str), Length(Str ? ::strlen(Str) : 0) {}
+#endif
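
One portability detail: `__has_builtin` is not predefined by every compiler, so the `#if` above presumably relies on a fallback definition being visible. The sketch below shows the same pattern on a hypothetical `MiniRef` type with such a fallback. It is an illustration under those assumptions, not the actual StringRef code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Compilers without __has_builtin get a fallback that always answers "no".
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical stand-in for llvm::StringRef.
struct MiniRef {
  const char *Data;
  std::size_t Length;
#if __has_builtin(__builtin_strlen)
  // GCC and Clang can evaluate __builtin_strlen in constant expressions.
  constexpr MiniRef(const char *Str)
      : Data(Str), Length(Str ? __builtin_strlen(Str) : 0) {}
#else
  // Everyone else keeps the ordinary run-time strlen.
  MiniRef(const char *Str) : Data(Str), Length(Str ? std::strlen(Str) : 0) {}
#endif
};

// Run-time checks that hold on either branch.
inline void checkMiniRef() {
  assert(MiniRef("abc").Length == 3);
  assert(MiniRef(nullptr).Length == 0);
}
```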

What about going for

template<unsigned N>
constexpr StringRef(const char (&Str)[N])

and avoiding strlen entirely for string literals?

What about going for

template<unsigned N>
constexpr StringRef(const char (&Str)[N])

and avoiding strlen entirely for string literals?

You’d at least want an assert in there (that N - 1 == strlen(Str)) in case a StringRef is ever constructed from a non-const char buffer that’s only partially filled.

But if we can write this in such a way that it performs well on good implementations - that seems sufficient. If getting good performance out of the compiler means bootstrapping - that’s pretty much the status quo already, as I understand it.

So I wouldn’t personally worry too much about performance degradation when built with MSVC - if, when building a stage 2 on Windows (building Clang with an MSVC-built Clang) you do end up with a compiler with the desired performance characteristics - then that’s probably sufficient.

The only reason I didn’t go with this solution was that an MSVC-built clang would take a long time to start up if StringRefs are present in global tables.

We currently allow an explicit null char in the middle of a literal.

That’s why I had the constructor you mentioned but not using N for the length here: https://reviews.llvm.org/D25639#497ba4c0

However this does not help with a construct like:

char Arr[32];
fill(Arr);
StringRef S(Arr);
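
A small sketch of why the array constructor is risky for buffers, using a hypothetical `LitRef` type that takes its length from the array size. For a literal the result is what you'd expect. For a partially filled buffer the length is the capacity minus one rather than the string length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for StringRef with only the array constructor.
struct LitRef {
  const char *Data;
  std::size_t Length;
  template <std::size_t N>
  constexpr LitRef(const char (&Str)[N]) : Data(Str), Length(N - 1) {}
};

// For a plain literal, N - 1 is the expected length.
static_assert(LitRef("abc").Length == 3, "literal length comes from the array size");

// For a partially filled buffer, N - 1 is the capacity, not the string length.
inline std::size_t lengthOfPartialBuffer() {
  char Arr[32] = {};          // zero-initialized, so the text stays terminated
  std::memcpy(Arr, "abc", 3); // only three characters are in use
  LitRef S(Arr);              // binds to the same constructor as a literal
  assert(std::strlen(S.Data) == 3);
  return S.Length;            // 31, which is not what the caller wanted
}
```

An embedded NUL in a literal is handled correctly by this constructor (the array size includes it), which is the case a strlen-based constructor would get wrong.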

We could probably write and use our own constexpr strlen for MSVC, but we’d lose out on the CRT’s optimized implementation of strlen.

It's not as good at tail recursion optimisation as Clang, but it does
handle this:

constexpr size_t llvm_strlen(const char* s, size_t l = 0) {
  return *s ? llvm_strlen(s + 1, l + 1) : l;
}

Is stack usage in unoptimised builds an issue?
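
If stack usage does turn out to matter, one possible answer is a loop-based version where C++14 relaxed constexpr is available, keeping the recursive form only as a C++11 fallback. This is a sketch with a hypothetical name `sketchStrlen`. It does not help MSVC 2015, which lacks extended constexpr.

```cpp
#include <cstddef>

#if __cplusplus >= 201402L
// C++14 relaxed constexpr: a plain loop, so no recursion at run time.
constexpr std::size_t sketchStrlen(const char *S) {
  std::size_t Len = 0;
  while (S[Len] != '\0')
    ++Len;
  return Len;
}
#else
// C++11 fallback: a single return statement, hence the recursion.
constexpr std::size_t sketchStrlen(const char *S) {
  return *S ? 1 + sketchStrlen(S + 1) : 0;
}
#endif

static_assert(sketchStrlen("") == 0, "empty string");
static_assert(sketchStrlen("constexpr") == 9, "nine characters");
```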

So I wouldn’t personally worry too much about performance degradation when built with MSVC - if, when building a stage 2 on Windows (building Clang with an MSVC-built Clang) you do end up with a compiler with the desired performance characteristics - then that’s probably sufficient.

Hold on there—we deliver an MSVC-built Clang to our licensees, and I would really rather not pessimize it.

–paulr

OK - good to know. (not sure we’re talking about pessimizing it - just not adding a new/possible optimization, to be clear)

Just out of curiosity - are there particular reasons you prefer or need to ship an MSVC built version, rather than a bootstrapped Clang?

Jumping in on Paul’s post, but we work on the same product so I can give at least one answer here, which is debugging, including post-mortem debugging of minidumps. We keep the PDBs from our build server so we can ship an executable without any embedded debug info but can still get a decent(ish) debugging experience with symbols and watch window values from minidumps.

Nothing would please me more than to switch to shipping a selfhost (subject to quite a thorough comparison/evaluation of all the factors) so I’m watching the PDB/codeview work with interest. :)

-Greg

OK - good to know. (not sure we’re talking about pessimizing it - just not adding a new/possible optimization, to be clear)

Okay, glad to hear it. I admit I wasn’t following the thread all that closely.

Just out of curiosity - are there particular reasons you prefer or need to ship an MSVC built version, rather than a bootstrapped Clang?

We experiment with a bootstrapped Clang from time to time. The benefit has never been clearly worth the additional cost of internally supporting a Windows-target Clang. (Which is non-trivial; yes it’s still Clang, but it’s a different target OS, different object-file format, different debug-info format, etc.)

–paulr

This does not seem that clear to me. The motivation seems to be the ability to create global tables of StringRefs, which we don’t do right now because, without constexpr, they need static initializers.
Moving forward it would mean making clang a lot slower when built with MSVC if we were going this route.

OK - good to know. (not sure we’re talking about pessimizing it - just not adding a new/possible optimization, to be clear)

Okay, glad to hear it. I admit I wasn’t following the thread all that closely.

Just out of curiosity - are there particular reasons you prefer or need to ship an MSVC built version, rather than a bootstrapped Clang?

We experiment with a bootstrapped Clang from time to time. The benefit has never been clearly worth the additional cost of internally supporting a Windows-target Clang. (Which is non-trivial; yes it’s still Clang, but it’s a different target OS, different object-file format, different debug-info format, etc.)

So if opportunities came up that provided greater benefit to a self-host, then you’d have a motivation to switch… - not sure that should motivate the rest of the project to work hard to make MSVC performance important.

(but this is just me - I doubt my opinion changes the way others will behave all that much)

- Dave

OK - good to know. (not sure we’re talking about pessimizing it - just not adding a new/possible optimization, to be clear)

This does not seem that clear to me. The motivation seems to be the ability to create global tables of StringRefs, which we don’t do right now because, without constexpr, they need static initializers.
Moving forward it would mean making clang a lot slower when built with MSVC if we were going this route.

Ah, fair - perhaps I misunderstood/misrepresented, apologies. Figured this was just an attempt to reduce global initializers in arrays we already have. Any pointers on where the motivation is described/discussed?

- Dave

This thread started with: "There is a desire to be able to create constexpr StringRefs to avoid static initializers for global tables of/containing StringRefs."

I don’t have more information, but maybe Malcolm can elaborate?

The fact that the templatized constructor falls down because of the possibility of initializing StringRef with a stack-allocated char array kills that idea in my mind.

I feel like the only two reasonable solutions are

  1. Allow a UDL for this case, document that this is an exception and that UDLs are still not permitted anywhere else, and require (by policy, since I don’t know of a way to have the compiler force it) that this UDL be used only in global constructors. One idea to help “enforce” this policy would be to give the UDL a ridiculously convoluted name, like string_ref_literal, so that one would have to write "foo"_string_ref_literal, and then provide a macro like #define LITERAL(x) x##_string_ref_literal, so that the user writes StringRef s[] = { LITERAL("a"), LITERAL("b") }; I'm not sure if that's better or worse than StringRef s[] = { "a"_sr, "b"_sr };, but at least it’s greppable this way.

  2. Don’t allow global tables of StringRefs.
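
A sketch of option 1, assuming a hypothetical `MiniRef` in place of StringRef. The macro has to paste tokens with `##`, because a string literal followed by a separate identifier is not a user-defined literal.

```cpp
#include <cstddef>

// Hypothetical stand-in for llvm::StringRef.
struct MiniRef {
  const char *Data;
  std::size_t Length;
  constexpr MiniRef(const char *D, std::size_t L) : Data(D), Length(L) {}
};

// Deliberately long suffix so that direct use is unattractive.
constexpr MiniRef operator""_string_ref_literal(const char *Str,
                                                std::size_t Len) {
  return MiniRef(Str, Len);
}

// Pastes the suffix onto the literal, forming a single UDL token.
#define LITERAL(x) x##_string_ref_literal

constexpr MiniRef Table[] = {LITERAL("a"), LITERAL("bc")};
static_assert(Table[1].Length == 2, "length supplied by the compiler");
```

Searching for `LITERAL(` then finds every use, which is the greppability argument made above.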

What about this?

diff --git a/include/llvm/ADT/StringRef.h b/include/llvm/ADT/StringRef.h
index d8e0732..5b8503a 100644
--- a/include/llvm/ADT/StringRef.h
+++ b/include/llvm/ADT/StringRef.h
@@ -84,10 +84,10 @@ namespace llvm {

     /// Construct a string ref from a pointer and length.
     LLVM_ATTRIBUTE_ALWAYS_INLINE
-    /*implicit*/ StringRef(const char *data, size_t length)
+    /*implicit*/ constexpr StringRef(const char *data, size_t length)
       : Data(data), Length(length) {
-        assert((data || length == 0) &&
-        "StringRef cannot be built from a NULL argument with non-null length");
+        //assert((data || length == 0) &&
+        //"StringRef cannot be built from a NULL argument with non-null length");
       }

     /// Construct a string ref from an std::string.
@@ -839,6 +839,24 @@ namespace llvm {
     /// @}
   };

+  class StringLiteral {
+  public:
+    template<size_t N>
+    constexpr StringLiteral(const char(&Str)[N])
+    : Str(Str), Length(N) {

I prefer constexpr llvm_strlen() over StringLiteral because it doesn't
require code changes outside StringRef - all StringRefs constructed
from a literal can benefit. But there are concerns about MSVC.
I prefer StringLiteral over UDL because the type requires code
changes, but the values don't.
I prefer StringLiteral over explicit StringRef constructor because it's safer.
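
To illustrate "the type requires code changes, but the values don't", here is a rough sketch of the StringLiteral idea. `RefSketch` and `LiteralSketch` are hypothetical stand-ins, and using N - 1 for the length (dropping the terminating NUL) is my assumption rather than something taken from the patch above.

```cpp
#include <cstddef>

// Hypothetical stand-in for llvm::StringRef.
class RefSketch {
  const char *Data;
  std::size_t Length;

public:
  constexpr RefSketch(const char *D, std::size_t L) : Data(D), Length(L) {}
  constexpr std::size_t size() const { return Length; }
};

// Hypothetical literal-only type; the length comes from the array size.
class LiteralSketch {
  const char *Str;
  std::size_t Length;

public:
  template <std::size_t N>
  constexpr LiteralSketch(const char (&S)[N]) : Str(S), Length(N - 1) {}

  // Converts to the reference type wherever one is expected.
  constexpr operator RefSketch() const { return RefSketch(Str, Length); }
};

// Only the element type of the table changes; the values stay plain literals.
constexpr LiteralSketch Names[] = {"alpha", "beta"};
static_assert(RefSketch(Names[0]).size() == 5, "no strlen at run time");
```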