I’m planning to gradually prepare llvm::StringRef for an eventual migration to std::string_view, a C++17 feature.
Motivation
Make it easier to integrate the LLVM libraries with other projects.
Make the LLVM source code more friendly to newcomers, who might know about std::string_view but not StringRef.
Maintain less code within the LLVM codebase.
Plan
Make StringRef publicly derive from std::string_view. (#113752): The public inheritance from std::string_view allows us to:
drop features that are identical between StringRef and std::string_view (e.g. empty, size, data, begin, and end),
retain custom features, including edit_distance and trim, and
retain the ability to implicitly convert StringRef to std::string_ref .
Gradually remove methods from StringRef.StringRef contains several functions (e.g. edit_distance) that are unlikely to become part of std::string_view. We can make them non-member functions one at a time (possibly while retaining member functions in a deprecated form).
Wait until C++23 is available for our codebase: StringRef::{starts_with,ends_with,contains} are quite popular in our codebase, but std::string_view::{starts_with,ends_with} aren’t available until C++20, and std::string_view::contains isn’t available until C++23. For this reason, I think we should migrate StringRef to std::string_view after C++23 becomes available in our codebase.
Declare using StringRef = std::string_view;. Once StringRef becomes an “empty shell” doing nothing but deriving from std::string_view, we can make StringRef an alias of std::string_view. We can mix-n-match them safely between callers and callees, between our projects and downstream projects, etc.
Replace StringRef. with std::string_view in our codebase.
Additional considerations
Wait for using StringRef = std::string_view;
One thing we can do better than the migration of llvm::Optional to std::optional is to hold off the migration to StringRef in the codebase until StringRef becomes an alias of std::string_view with:
using StringRef = std::string_view;
Different external projects interface with the LLVM libraries. Changing one use of StringRef to std::string_view at a time before StringRef becomes an alias of std::string_view would make it harder for other projects to keep up with our project.
StringRef::find
StringRef::find has some performance optimizations. We need to somehow retain performance characteristics.
StringRef::substr
StringRef::substr is more lenient than std::string_view. Given:
S.substr(Pos, Count)
StringRef::substr permits Pos > S.size(), whereas that’s not allowed in std::string_view. We need to fix usage in our codebase before we disallow Pos > S.size().
I’m not sure this is well motivated.
StringRef doesn’t have the char_traits baggage of string_view so it should lead to smaller symbol names / compile times.
Beside, both class should be constructible from the same things.
I think we should make sure that StringRef offers a superset of the string_view features. But I’m not concerned at all by the maintenance cost.
In fact, maintaining StringRef is much easier than trying to migrate away from it (there are 13K uses of it)
I also think that we should let the RFC run its course before starting to land patches. Otherwise what’s the point of making an rfc to begin with?
Thanks for looking at this - I’m similarly torn on the pros/cons of moving over to string_view, although my gut says we should always try to use the C++ standard more than we do.
Does the loss of forward declarations of StringRef impact us much?
How easily can we quickly measure the differences in code size, compile time (of llvm/clang) and performance (with llvm/clang)?
How much diff is there between build with clang/msvc/gcc?
We’re likely to have a very similar discussion regarding ArrayRef vs std::span in the future as well, although I don’t know how much we should consider these migrations as both/neither.
According to our coding standard, I believe llvm::StringRef is the facility that needs to be motivated in the presence of std::string_view, and not the other way round.
Is there any evidence that those have non-trivial costs? Because I’m not sure they do.
Sure, llvm::StringRef is there to stay for foreseeable future due to sheer amount of usages if nothing else. But I don’t think we need to impede the transition to std::string_view because of this if there are people willing to put effort into that.
My gut tells me we should try to use the C++ standard library less since it’s so bad for compile time. However I do realize this is not the popular view
+1, even simplistic numbers comparing basic use of StringRef vs. string_view would be good. Maybe it’s no big deal. Maybe it is.
That measures LLVM compiling things, not the compile time of LLVM itself right?
I think 0.03% file size increase is a very small price to pay for not having to maintain StringRef anymore, especially in view of faster compile time (at least for clang).
I’m more worried about the find and substr considerations.
Perhaps a small set of functions that perform the same optimizations over the string’s data? To be used only in cases where the standard find algorithm is really impacting performance, not as a general rule.
This sounds like a security issue. I’d say we need to get rid of those regardless.
Seems like the usual case of deduplicating/removing maintenance/removing friction between our APIs and other APIs when C++ gets a vocab type that subsumes our own.
We’ve done it with optional before, and I think it’s suitable to do it with string_view.
Maybe documenting any known divergence - but, yeah, if the divergence is worthwhile I expect we’d write the extra functionality as a non-member function (I guess any divergence that can’t be handled that way would be particularly noteworthy/worth discussing potentially up-front)
But even if we get most of the way down the migration and find something that can’t be migrated or moved to a non-member and we stop there - I expect the rest of the work won’t be harmful/will be marginally helpful - making the APIs as similar as possible will reduce the friction between the APIs, even if they must both exist. (eg: the slice/substr change - using the same names as the standard API seems like a good thing to me)