An extension of libcxx

Hello libc++-dev,

In the discussion of https://reviews.llvm.org/D55517, I mentioned that we are attempting a vendor variant of libcxx that uses _VSTD differently. Eric pointed out that I should have started here, so we could talk about design goals. He’s right, I’m sorry.

Not one to bury the lede, I’d like to talk about a CUDA C++ standard library.

The ultimate goal of something like that should be that most things in C++, unless they are bolted too tightly onto the operating system, can be passed between CPU and GPU and used on both. There’s no fundamental reason why a big chunk of C++ doesn’t already work like this today, at least on contemporary HPC-friendly GPUs. The reason we don’t have much is that it’s a huge pile of work, and everyone has managed to avoid doing it so far.

One exploration vehicle was shown at CppCon in September (by me, see: YouTube, and https://github.com/ogiroux/freestanding) and then we made but failed to present a more detailed poster at the LLVM dev meeting in October. And now we’re here. :blush:

After making a few exploration vehicles (2 overall, 4 for ), we now think we’ll create version 1 this way:

I’m very excited to have NVIDIA collaborate on libc++. It’s worth supporting your weirdo macro hack as a transitional tool.

I’m especially interested in working on freestanding in clang / libc++, bringing the good parts of it from the current C++ standard, and working with you and others on the Committee to make C++23 freestanding actually nice (Ben Craig has been working on http://wg21.link/P0829R3). I hope that we can experiment on what’s “nice” in clang / libc++ in the next few months.
One design constraint around freestanding: I want to make sure that clang can keep supporting other STL implementations.

I’d like to understand if we can have a different ABI for freestanding, given that it’s not supported in libc++ today. This might be an opportunity to fix some mistakes.

On the “freestanding” macro, clang does the following today:
  if (LangOpts.Freestanding)
    Builder.defineMacro("__STDC_HOSTED__", "0");
  else
    Builder.defineMacro("__STDC_HOSTED__");
Beyond that, the headers in clang’s lib/Headers also check __STDC_HOSTED__, which might interfere with freestanding.
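
As a rough illustration of how library code could branch on that macro, here is a minimal sketch. The names kHosted and environment_name are hypothetical and are not part of clang or libc++. The only thing taken from the snippet above is that __STDC_HOSTED__ is 0 under -ffreestanding and 1 otherwise.

```cpp
// Sketch only: kHosted and environment_name are made-up names for illustration.
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__ == 0
constexpr bool kHosted = false;  // built with -ffreestanding
#else
constexpr bool kHosted = true;   // default hosted build
#endif

// A library could use the flag to pick which features to expose.
constexpr const char* environment_name() {
    return kHosted ? "hosted" : "freestanding";
}
```

Since the flag is a compile-time constant, the same header could guard whole declarations with the preprocessor check instead of a runtime branch.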

Good header hygiene indeed seems necessary, especially for <algorithm>. Louis mentioned that he was interested in looking into this.
Louis did a survey and found the following:

Freestanding in the current C++20 draft requires the following headers:

    <ciso646>
    <cstddef>
    <cfloat>
    <limits>
    <climits>
    <cstdint>
    <cstdlib>
    <new>
    <typeinfo>
    <exception>
    <initializer_list>
    <cstdarg>
    <type_traits>
    <atomic>

Of those headers, I think the following are easy to provide with minimal changes to libc++ and without having to ship a libc++ shared object (or compiler-rt), and they use the following parts of the C Standard Library:

    <ciso646>: nothing
    <cstddef>: stddef.h
    <cfloat> : float.h
    <limits> : stddef.h
    <climits>: limits.h
    <cstdint>: stdint.h
    <cstdlib>: stdlib.h
    <initializer_list>: stddef.h
    <cstdarg>: stdarg.h
    <type_traits>: stddef.h

As a result, I think the following are low-hanging fruit that do not require any runtime support AFAICT:

    <ciso646>
    <cstddef>
    <cfloat>
    <limits>
    <climits>
    <cstdint>
    <initializer_list>
    <type_traits>
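
To make “no runtime support” concrete, here is a small sketch that uses only headers from the list above and is evaluated entirely at compile time. The helper saturating_add is a hypothetical name invented for this example, not something libc++ provides.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: addition that clamps at the maximum instead of wrapping.
// Everything here is header-only and constexpr, so nothing links against a
// libc++ shared object.
template <class T>
constexpr T saturating_add(T a, T b) {
    static_assert(std::is_unsigned<T>::value, "unsigned integer types only");
    return (b > std::numeric_limits<T>::max() - a)
               ? std::numeric_limits<T>::max()
               : static_cast<T>(a + b);
}

static_assert(saturating_add<std::uint8_t>(200, 100) == 255, "clamps at max");
static_assert(saturating_add<std::uint32_t>(1u, 2u) == 3u, "ordinary sum");
```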

Other things we might be able to throw in with minimal effort:

    <bit>
    <ratio>
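
For instance, <ratio> is pure compile-time arithmetic, which is why it looks cheap to include. A minimal sketch (the alias name half is made up for the example):

```cpp
#include <ratio>

// 1/3 + 1/6 = 1/2, computed and reduced by the compiler.
using half = std::ratio_add<std::ratio<1, 3>, std::ratio<1, 6>>;

static_assert(half::num == 1 && half::den == 2, "reduced at compile time");
```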

Other things that we SHOULD be able to have, but that would require refactoring in libc++ (and most of them are not part of the current freestanding):

    <tuple>
    <pair>
    most if not all of <functional>
    most of <algorithm>
    <span>
    <array>
    <string_view>
    lock-free parts of <atomic>

Thanks!

The only part of what you propose at the bottom that I’ll take exception to is the proposed initial exclusion and later subsetting of <atomic>. It’s in the Freestanding subset now and will probably stay there in the future; we just polled this in SG1 ~weeks ago and we don’t want to subset it. Rather, we can eliminate the library dependency by having the lock-byte strategy in libcxx itself for non-lock-free cases, achieving the desired goal of being dependency-free without having to upset SG1.
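
To sketch what a lock-byte strategy could look like: the class below guards a value that is too large for lock-free hardware atomics with a single std::atomic_flag (typically one byte, and guaranteed lock-free), so no external runtime library is needed. This is an illustration under stated assumptions, not libc++’s actual implementation. lock_byte_atomic, Triple, and lock_byte_demo are hypothetical names, and a real version would also need compare-exchange and a spin strategy suited to GPUs.

```cpp
#include <atomic>
#include <cassert>  // used only by the usage check at the bottom

// Hypothetical sketch of a lock-byte fallback for non-lock-free types.
template <class T>
class lock_byte_atomic {
    mutable std::atomic_flag lock_ = ATOMIC_FLAG_INIT;  // the "lock byte"
    T value_;

    void acquire() const noexcept {
        // Spin until this thread is the one that sets the flag.
        while (lock_.test_and_set(std::memory_order_acquire)) {
        }
    }
    void release() const noexcept { lock_.clear(std::memory_order_release); }

public:
    explicit lock_byte_atomic(T v) noexcept : value_(v) {}
    lock_byte_atomic(const lock_byte_atomic&) = delete;
    lock_byte_atomic& operator=(const lock_byte_atomic&) = delete;

    T load() const noexcept {
        acquire();
        T copy = value_;
        release();
        return copy;
    }
    void store(T v) noexcept {
        acquire();
        value_ = v;
        release();
    }
    T exchange(T v) noexcept {
        acquire();
        T old = value_;
        value_ = v;
        release();
        return old;
    }
};

// Usage check: a three-word payload, which is usually not lock-free.
struct Triple { long a, b, c; };

inline void lock_byte_demo() {
    lock_byte_atomic<Triple> shared(Triple{1, 2, 3});
    shared.store(Triple{4, 5, 6});
    Triple old = shared.exchange(Triple{7, 8, 9});
    assert(old.a == 4 && shared.load().c == 9);
}
```

The point of the design is that the only primitive required is a lock-free flag, which the lock-free part of <atomic> already provides.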

We’re good with the rest. What you wrote closely resembles what we proposed to our people.

Olivier

<string_view>

These are annoying, because they throw exceptions from <stdexcept>, and those exceptions have constructors which take a std::string. You could omit the throwing methods (as I have done in P0829), or you could patch over it by making those calls terminate instead.
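
A minimal sketch of the “terminate instead” option, assuming a made-up view type (mini_view is hypothetical and is not std::string_view). The bounds-checked accessor calls std::terminate from <exception>, which is on the freestanding list, rather than throwing std::out_of_range:

```cpp
#include <cstddef>
#include <exception>  // std::terminate

// Hypothetical minimal view type, for illustration only.
struct mini_view {
    const char* data;
    std::size_t size;

    // Like at(), but an out-of-range index terminates instead of throwing,
    // so no exception type (and no std::string) is pulled in.
    constexpr char at(std::size_t i) const {
        if (i >= size) std::terminate();
        return data[i];
    }
};

constexpr mini_view abc{"abc", 3};
static_assert(abc.at(1) == 'b', "in-range access still works at compile time");
```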

I think you mean <utility>. You should be able to get all of <utility>. I’m unsure on how much internal header shuffling is required.

most if not all of <functional>

I exclude std::function and the string searchers in P0829, as they allocate on the heap. You may have different priorities here.

most of <algorithm>

You probably want most of as well. The places I avoided were the execution policy overloads and the algorithms that allocate temporary buffers (stable_sort, stable_partition, inplace_merge).

The “quick” rundown of what is in my paper can be found by searching for “Technical Specifications” in https://wg21.link/P0829, then stopping when you get to “Notable Omissions”. That is “merely” 3 printed pages, but a lot of it is quick to scroll through.

I suspect that NVIDIA is fine with heap allocations, and probably really wants floating point operations. My preference is to layer that on to my proposal, but I’m not the one doing the libcxx work right now :blush:

+ some more people who care about CUDA support but might not be subscribed to this list.
