Extension for creating objects via memcpy

TLDR: we’re looking for a well-defined way to create objects with vptrs via memcpy.

We have a common code pattern for creating new message objects in the protobuf parser:

// Message is almost trivial, except that it has virtual methods.
class Message {
 public:
  virtual ~Message();
  // Allocates memory from the arena and calls placement new.
  // Declared const because create() invokes it on a const default instance.
  virtual Message* New(Arena* arena) const;

  intptr_t meta_data_;
};

Message* create(const Message* default_instance, Arena* arena) {
  return default_instance->New(arena);
}

This code creates a new instance of Message from the default_instance via the virtual function New. It works and has well-defined behavior, but it pays for a virtual call on every creation. In many cases the overhead of the virtual call outweighs its benefit, especially since for most protobuf messages these functions do very little work.
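For readers who want something they can compile, here is a minimal sketch of the baseline pattern. The Arena below is a toy stand-in (it simply owns raw blocks and never runs destructors), and Ping is a hypothetical generated message type; neither reflects the real protobuf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Toy arena (an assumption for illustration): hands out raw blocks and
// frees them all at once, without running any destructors.
class Arena {
 public:
  Arena() = default;
  Arena(const Arena&) = delete;
  Arena& operator=(const Arena&) = delete;
  ~Arena() {
    for (void* block : blocks_) ::operator delete(block);
  }
  void* allocate(std::size_t size) {
    blocks_.push_back(::operator new(size));
    return blocks_.back();
  }

 private:
  std::vector<void*> blocks_;
};

class Message {
 public:
  virtual ~Message() = default;
  // Virtual factory: each derived type placement-news itself in the arena.
  virtual Message* New(Arena* arena) const {
    return new (arena->allocate(sizeof(Message))) Message();
  }
  std::intptr_t meta_data_ = 0;
};

// Hypothetical generated message type.
class Ping : public Message {
 public:
  Message* New(Arena* arena) const override {
    return new (arena->allocate(sizeof(Ping))) Ping();
  }
};

// One virtual call per created message.
Message* create(const Message* default_instance, Arena* arena) {
  return default_instance->New(arena);
}
```

Every call to create goes through the vtable of default_instance, which is the cost the rest of this post tries to avoid.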

A faster approach is to utilize memcpy, copying the bytes from the default instance to a buffer and then materializing a new object from the buffer:

Message* create(const Message* default_instance, Arena* arena) {
  unsigned size = default_instance->size(); // including vptr
  char* buffer = arena->allocate(size);
  // bitwise copy of object representation.
  memcpy(buffer, default_instance, size); // UB1: memcpy on a non-trivially copyable object
  return reinterpret_cast<Message*>(buffer); // UB2: Message object lifetime is not started
}

This implementation is significantly faster, as observed in the parsing benchmarks, with a reduction of ~10% CPU time and ~8% CPU cycles. Parsing is a critical workload for us.

However this code has undefined behavior:

  • Undefined Behavior 1 (UB1): according to C++ [basic.types.general] p2: “For any object (other than a potentially-overlapping subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes ([intro.memory]) making up the object can be copied into an array of char, unsigned char, or std::byte ([cstddef.syn]).” Message is not trivially copyable, so using memcpy to copy its bytes is not guaranteed to work.

  • Undefined Behavior 2 (UB2): according to C++ [intro.object] p1: “An object is created by a definition, by a new-expression (expr.new), by an operation that implicitly creates objects (see below), when implicitly changing the active member of a union, or when a temporary object is created (conv.rval, class.temporary).” The returned object is not created in any of these ways; in particular, its lifetime has never been started.
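The first point can be checked at compile time with standard type traits. The sketch below uses two illustrative types (PlainHeader and VirtualMessage are made-up names) and a small helper variable, memcpy_is_defined_v, which is just an alias for the standard trait and not an existing facility.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative types, not taken from protobuf.
struct PlainHeader {
  std::intptr_t meta_data_;
};

class VirtualMessage {
 public:
  virtual ~VirtualMessage() = default;
  std::intptr_t meta_data_ = 0;
};

// Hypothetical helper: the standard only sanctions byte-wise copies of
// trivially copyable types, so this is simply the standard trait.
template <class T>
inline constexpr bool memcpy_is_defined_v = std::is_trivially_copyable_v<T>;

static_assert(std::is_trivially_copyable_v<PlainHeader>);
// A virtual destructor is never trivial, so the class is not trivially
// copyable and memcpy on it falls outside the quoted guarantee.
static_assert(std::is_polymorphic_v<VirtualMessage>);
static_assert(!std::is_trivially_destructible_v<VirtualMessage>);
static_assert(!std::is_trivially_copyable_v<VirtualMessage>);
```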

Given the significant performance improvement, we’d like to apply these changes in production. However, our main concern is this undefined behavior, even though the code works with the current clang toolchain. Some ideas:

Regarding UB1, I don’t have a good answer. The best I can say is that for this particular case it should work in practice (maybe it is safe to assume it always works?).

“Making More Objects Contiguous” (P1945R0) introduces a new contiguous-layout type category that covers class types with virtual functions. This would make the memcpy well-defined, but the proposal doesn’t seem to have had any updates.

Regarding UB2, we can provide a builtin like __builtin_start_object_lifetime in clang to bless it. The builtin is similar to C++23 std::start_lifetime_as, but permits starting the lifetime of types that are not implicit-lifetime types. (The builtin implementation is not complicated given the way clang currently lowers code to LLVM IR: for the most part it is a no-op, but we need to make sure that the optimizations enabled by -fstrict-vtable-pointers and -fstrict-aliasing, as well as the sanitizers, respect it.)

(We also considered restructuring Message to eliminate virtual methods and instead implement a hand-rolled vtable mechanism, making it a trivially copyable type. That would be a complete rewrite requiring considerable effort, so it is not a great option.)
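To make the hand-rolled vtable alternative concrete, here is a rough sketch of what such a type could look like. All names (FlatMessage, MessageOps, kPingOps, clone_flat) are hypothetical. Because FlatMessage is a trivially copyable aggregate, memcpy into suitably aligned storage implicitly creates the object under the C++20 rules, so both UB1 and UB2 go away.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <new>
#include <type_traits>

struct FlatMessage;

// Hand-rolled "vtable": a plain table of function pointers.
struct MessageOps {
  const char* (*type_name)(const FlatMessage&);
};

// No virtual functions, so the type stays trivially copyable.
struct FlatMessage {
  const MessageOps* ops;
  std::intptr_t meta_data;
};

static_assert(std::is_trivially_copyable_v<FlatMessage>);
static_assert(!std::is_polymorphic_v<FlatMessage>);

inline const char* ping_name(const FlatMessage&) { return "Ping"; }
inline constexpr MessageOps kPingOps{&ping_name};

// Precondition: buffer holds at least sizeof(FlatMessage) bytes and is
// aligned for FlatMessage.
inline FlatMessage* clone_flat(const FlatMessage& src, void* buffer) {
  // Well-defined: the type is trivially copyable, and memcpy implicitly
  // creates an object of implicit-lifetime type in the destination.
  std::memcpy(buffer, &src, sizeof(FlatMessage));
  return std::launder(reinterpret_cast<FlatMessage*>(buffer));
}
```

The price is that every "virtual" call site has to be rewritten as copy->ops->type_name(*copy), which is where the rewrite effort comes from.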

Any thoughts and suggestions are very welcome.

P0593R6: Implicit creation of objects for low-level object manipulation is the most important section of the paper introducing implicit-lifetime types.

In practice, the compiler doesn’t know when implicit lifetimes begin/end, and doesn’t want to break the object’s contracts, so it needs to know:

  • it’s valid to create the object by filling in bits but not running any code (thus the requirement for trivial default/copy constructor). For classes this has a structural part (fields should not need construction) and a semantic part (no tricky class invariants). The former we can check and the latter the user could guarantee somehow.
  • it’s OK to end the object lifetime without running a destructor. A trivial destructor satisfies this, but so does “the user must run the destructor”. That contract is hard to specify if lifetimes are truly implicit, but here they’re explicit via __builtin_start_object_lifetime so user code can be required to call Obj.~Type(). ISTM the rules about reusing storage/lifetimes etc here can mirror placement-new.
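The placement-new contract that the second point wants to mirror looks roughly like this in today's C++. Tracked, start_in and end_in are illustrative names; the idea is only that the start of the lifetime is an explicit call, so the matching explicit destructor call can be made the user's responsibility.

```cpp
#include <cassert>
#include <new>

// Illustrative type with a non-trivial (virtual) destructor; the counter
// makes lifetime starts and ends observable.
struct Tracked {
  static inline int live = 0;
  Tracked() { ++live; }
  virtual ~Tracked() { --live; }
};

// Explicitly starts a lifetime in caller-provided storage.
inline Tracked* start_in(void* storage) { return new (storage) Tracked(); }

// Explicitly ends the lifetime; the storage may then be reused.
inline void end_in(Tracked* obj) { obj->~Tracked(); }
```

A builtin that starts a lifetime without running a constructor would presumably replace only the start_in step; the end_in obligation would stay with the caller.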

Defining memcpy on non-trivially-copyable types is tricky. Maybe it even inhibits (future) optimizations. We could define __builtin_copy_object_data or whatever that provides stronger guarantees. But another option is to only provide a higher-level primitive like __clone_object(void*, size_t, char*) which copies the data and starts the lifetime, and avoid specifying exactly what memcpy needs to do on such types.

Without LLVM support for start_lifetime_as, TBAA can and will in fact alias objects it should not.
I think there are a lot of use cases for a __builtin_start_object_lifetime in clang, and maybe we could have more lax requirements vis-à-vis destruction (but it seems like a very dangerous feature).
However, this does need LLVM support (my understanding is that we would need some sort of TBAA fence intrinsic), which is why the front end does not support these builtins yet (and why we cannot implement std::start_lifetime_as).

Some work was attempted in D147020, “[AA] Add a tbaa-fence intrinsic”. (Note that I work on clang, so I have a very shallow understanding of the LLVM implications.)

I agree that implementing this correctly needs more LLVM IR support than we currently have. But we need that same support to implement C++ (or indeed LLVM IR) semantics correctly in general – for example, LLVM will currently delete memcpys that have !tbaa.struct metadata, and DSE will discard stores that change TBAA but not in-memory bits, both of which can presumably lead to the same kinds of TBAA miscompile we’re concerned about here.

So I don’t think this should be a barrier to adding this builtin – this builtin would be broken by TBAA in the same way that the rest of Clang’s lowering is, and we seem to broadly be getting away with that. We do need to fix these lost-TBAA issues, but I don’t think coupling those fixes to the addition of this builtin is necessary.

If we want to be cautious, it might be OK to produce a warning if the builtin is used without -fno-strict-aliasing, but I don’t think we should expect TBAA issues to show up here any more than they do for the existing cases. (I mean, sure, this builtin does support an implicit in-place bitcast, which is the essence of the case that LLVM miscompiles, but the intended common use case doesn’t actually bit-cast anything.)

For the record: this is being reviewed in llvm/llvm-project pull request #82776, “[clang] Add __builtin_start_object_lifetime builtin”, by hokein.
Currently it takes the simplest approach of not doing anything new for TBAA and not emitting any warnings.

The environment we’re looking to deploy this in uses -fno-strict-aliasing. So in practical terms, even if using this feature slightly widens a TBAA hole that already exists, we’d still like to have it.
I do hope we can investigate turning on strict-aliasing this year, and contribute to needed infra (IR support for optimizations, and type-sanitizer to find violations on the user side).

The patch implements a __start_lifetime builtin in clang, which is useful for the implementation of C++23’s std::start_lifetime_as. However, the issue of memcpy undefined behavior still persists.

There are ongoing efforts to introduce a new type category known as trivially relocatable type in C++. I think it could solve our problem.

Conceptually, an object of a trivially relocatable type can be relocated by copying the bytes of its object representation with memmove/memcpy, starting the lifetime of the dst object (without running a constructor) and ending the lifetime of the src object (without running a destructor).
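As a rough illustration of what relocation means, here is a sketch that is valid in current C++. Since there is no standard trivially-relocatable trait yet, it uses trivially copyable types as a conservative stand-in for the fast path and falls back to move-construct plus destroy otherwise. relocate_at and relocate_string_demo are made-up names.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <new>
#include <string>
#include <type_traits>
#include <utility>

// Relocates *src into raw, suitably aligned storage at dst. Afterwards the
// source must be treated as dead: no destructor may be run on it again.
template <class T>
T* relocate_at(T* src, void* dst) {
  if constexpr (std::is_trivially_copyable_v<T> &&
                std::is_trivially_copy_constructible_v<T>) {
    // Stand-in for "trivially relocatable": bytes are copied, no
    // constructor or destructor runs.
    std::memcpy(dst, src, sizeof(T));
    return std::launder(reinterpret_cast<T*>(dst));
  } else {
    // General case: one move construction, one destruction.
    T* out = ::new (dst) T(std::move(*src));
    std::destroy_at(src);
    return out;
  }
}

// Usage sketch: relocate a std::string between two raw buffers.
inline std::string relocate_string_demo() {
  alignas(std::string) unsigned char a[sizeof(std::string)];
  alignas(std::string) unsigned char b[sizeof(std::string)];
  std::string* s = ::new (a) std::string("payload");
  std::string* moved = relocate_at(s, b);  // s is dead from here on
  std::string result = *moved;
  std::destroy_at(moved);  // exactly one destructor call for the live object
  return result;
}
```

In both branches exactly one object is alive after the call, which is the property that distinguishes relocation from cloning.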

This aligns closely with the requirements of the object-creation-via-memcpy scenario (with one exception: ending the src object’s lifetime is not needed).
So one solution is to leverage the concept of trivially relocatable types, for example by having the compiler provide a memcpy/memmove guarantee for trivially relocatable types (in addition to trivially copyable types), or by introducing a new function like T* clone(T* src, void* buffer) for “relocating” objects without ending the source object’s lifetime (similar to what Sam suggested in comment #2).

However, trivially relocatable types are still going through standardization (targeting C++26), with two ongoing proposals:

  1. P2786R4: Trivial Relocatability For C++26
  2. P1144R10: std::is_trivially_relocatable

They aim to achieve the same goal with different options (differences are stated in P2814R1). Notably, the primary distinction for our case is the treatment of virtual methods. While P2786R4 doesn’t impose restrictions on virtual methods, P1144R10 does.

Clang has preliminary support for trivially relocatable types through the __is_trivially_relocatable builtin, and types annotated with the trivial_abi attribute are considered trivially relocatable. The trivial_abi attribute has the same restriction on types with virtual methods.
Would it make sense to extend the trivially relocatable types (or even trivial_abi) to cover virtual types? I think the protobuf is a real motivating use case.

(We had discussions separately, and I’m writing some summaries here).

While it is technically possible to leverage trivially relocatable types, performing a trivial clone on a trivially relocatable type is not conceptually correct. Trivial relocations are about move semantics (only one live object, one destructor call) whereas trivial clones are about copy semantics (two live objects, two destructor calls). If an object owns resources that need to be deallocated, then a trivial clone will lead to a double-free of the owned resource. For example, a mutex can be trivially relocatable but is definitely not trivially cloneable: creating a new mutex via bitwise copy should not be permitted.
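The difference can be shown without invoking any undefined behavior by modelling ownership with a plain handle and an explicit release function. Everything here (Handle, release, release_count, bitwise_copy) is a toy model invented for illustration, with release standing in for a destructor.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Toy resource table: release_count[i] records how often resource i was
// released. A count above 1 models a double-free.
inline int release_count[4] = {0, 0, 0, 0};

// Trivially copyable handle that "owns" a resource by id.
struct Handle {
  int resource_id;
};
static_assert(std::is_trivially_copyable_v<Handle>);

// Stands in for the destructor of an owning type.
inline void release(const Handle& h) { ++release_count[h.resource_id]; }

// Bitwise copy of the handle's object representation.
inline Handle bitwise_copy(const Handle& src) {
  Handle dst;
  std::memcpy(&dst, &src, sizeof(Handle));
  return dst;
}
```

If the bitwise copy is treated as a relocation, only the destination is released and the resource is released once. If it is treated as a clone, both handles are live, both get released, and the same resource is released twice.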

Per the standard, bitwise copy is only well-defined for objects that are trivially copyable. In our case, we know that our types are safe to copy bitwise, but they cannot be made trivially copyable or implicit-lifetime. Trivially relocatable types are in a similar position.
Both require the same fundamental building block from the compiler: a well-defined bitwise copy for objects that are not trivially copyable.

One idea is to provide a __is_bitwise_copyable builtin in clang. This builtin establishes a contract: for any bitwise-copyable type, it is safe to perform a bitwise memory copy. In other words, we extend the memcpy guarantee to cover a wider range of types.

Then we can use this builtin to implement a trivially_relocatable or trivially_clone function, e.g.:

Foo* clone(const Foo* src, char* buffer, int size) {
  if constexpr (__is_bitwise_copyable(Foo)) {
    // Bitwise copy into buffer; this implicitly creates the object there.
    __builtin_memcpy(buffer, src, size);
    return std::launder(reinterpret_cast<Foo*>(buffer));
  } else {
    // Fall back to placement new with the copy constructor.
    return new (buffer) Foo(*src);
  }
}

In the builtin implementation, we can start with types that are known to be safe for bitwise copying, such as scalar types and classes whose non-static data members are all bitwise-copyable, and make adjustments as necessary in the future.
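Until such a builtin exists, the same shape can be tried out in portable C++ by substituting a standard trait for it. In the sketch below, is_bitwise_copyable_v is a stand-in defined in terms of std::is_trivially_copyable, which is narrower than the proposed builtin (it rejects types with virtual functions), and clone_into, Point and Named are made-up names.

```cpp
#include <cassert>
#include <cstring>
#include <new>
#include <string>
#include <type_traits>

// Stand-in for the proposed __is_bitwise_copyable(T). This is an
// assumption for illustration and is narrower than the proposal.
template <class T>
inline constexpr bool is_bitwise_copyable_v =
    std::is_trivially_copyable_v<T> &&
    std::is_trivially_copy_constructible_v<T>;

// Precondition: buffer holds at least sizeof(T) bytes, aligned for T.
template <class T>
T* clone_into(const T* src, void* buffer) {
  if constexpr (is_bitwise_copyable_v<T>) {
    // Bitwise copy; memcpy implicitly creates the object in buffer.
    std::memcpy(buffer, src, sizeof(T));
    return std::launder(reinterpret_cast<T*>(buffer));
  } else {
    // Not bitwise-copyable: use the copy constructor.
    return ::new (buffer) T(*src);
  }
}

struct Point {  // scalar members only: takes the memcpy path
  int x;
  int y;
};
struct Named {  // owns heap memory: takes the copy-constructor path
  std::string name;
};

static_assert(is_bitwise_copyable_v<Point>);
static_assert(!is_bitwise_copyable_v<Named>);
```

Using else keeps the fallback out of the instantiation when the memcpy path is taken, so the function also compiles for types that are bitwise-copyable but not copy-constructible.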