[frontend-dev][beginner] Allocation of structures

Hi all,

I’m pretty new to the list, and to LLVM in general, so please excuse my extreme newbiesness.

I’m trying to figure out what would be the appropriate way to implement move semantics.
I’ve been trying to dump the IR produced by clang with some basic C++ snippet, but I’m afraid it didn’t help me much.

Here’s the example I’ve been playing with (in C++):

struct S {
S() noexcept: x(new int) {}
S(S&& other) {
x = other.x
other.x = nullptr;
}
~S() {
delete x;
}
};

S f1() {
auto s = S();
return s;
}

S f2() {
auto s = S();
return std::move(s);
}

This of course produces a lot of LLVM code (with -O0), but I think I may have figured out most of what’s what. In particular, I’ve been able to identify the IR code for f1 and f2, but to my surprise, neither of those return a value. Both take a pointer to S as parameter, which in turn gets passed to the constructor of S, and return void.

This leaves me with two main questions:

  • First, is the use of a pointer to S as parameter a specificity of clang, or generally the way to go? I’ve seen in the language reference that one could return a struct with a simple ret instruction, so I’m surprised not to see it for the version that doesn’t use move semantics.
  • Second, would I use a non-void ret instruction to return the result of an alloca, when would the latter be destroyed? Would that involve a copy from the runtime stack of the callee to that of the caller?

Thank you very much for your time and your answer,

Best,

Dimitri Racordon
CUI, Université de Genève
7, route de Drize, CH-1227 Carouge - Switzerland
Phone: +41 22 379 01 24

See https://en.wikipedia.org/wiki/Return_value_optimization

Hi all,

I’m pretty new to the list, and to LLVM in general, so please excuse my
extreme newbiesness.

I’m trying to figure out what would be the appropriate way to implement
move semantics.
I’ve been trying to dump the IR produced by clang with some basic C++
snippet, but I’m afraid it didn’t help me much.

Move semantics in C++ are just a mechanism to use overload resolution to
select a different overload (see
http://en.cppreference.com/w/cpp/language/move_constructor). For example,
if you think about how to map your example C++ code down to equivalent C,
you'll see that in that process you have fully resolved the "move
semantics". At the LLVM level (or the C level), there are no "move
semantics". std::move is basically just a cast that creates a `S&&` which
will then select the right overload.

Here’s the example I’ve been playing with (in C++):

struct S {
  S() noexcept: x(new int) {}
  S(S&& other) {
    x = other.x
    other.x = nullptr;
  }
  ~S() {
    delete x;
  }
};

S f1() {
  auto s = S();
  return s;
}

S f2() {
  auto s = S();
  return std::move(s);
}

This of course produces a lot of LLVM code (with -O0), but I think I may
have figured out most of what’s what. In particular, I’ve been able to
identify the IR code for `f1` and `f2`, but to my surprise, neither of
those return a value. Both take a pointer to `S` as parameter, which in
turn gets passed to the constructor of `S`, and return void.

This leaves me with two main questions:

   - First, is the use of a pointer to S as parameter a specificity of
   clang, or generally the way to go? I’ve seen in the language reference that
   one could return a struct with a simple ret instruction, so I’m surprised
   not to see it for the version that doesn’t use move semantics.

This a language ABI question. There's really nothing LLVM-specific about

it. I would recommend thinking about it in terms of how to map C++ to C. As
Davide pointed out, RVO is one reason to choose this particular lowering.
These decisions are made inside Clang (not LLVM) and are mandated by the
ABI (all compilers must implement them the same way for code to be able to
be linked together and work). If you want all the gory details, the itanium
C++ ABI is documented at https://itanium-cxx-abi.github.io/cxx-abi/abi.html
(this is the C++ ABI used on basically all platforms except MSVC; there's a
historical connection to itanium but nothing specific to that processor
about it).

In particular, the description of how to lower return values is
https://itanium-cxx-abi.github.io/cxx-abi/abi.html#return-value

Note that the C++ ABI is phrased in terms of the underlying C ABI (which is
processor-specific (and generally not OS-specific unless you care about
Windows vs non-Windows)), so familiarity with the C ABI is useful too;
documents for the C ABI of different processors can be found in
http://llvm.org/docs/CompilerWriterInfo.html e.g. the "X86 and X86-64 SysV
psABI". The C ABI may seem overwhelming (lots of processor details, corner
cases, etc.), but the basic gist is that there's a list of registers and
each argument is assigned in turn from that list (if the register list is
exhausted, the rest are passed through memory). This is easy as long as
everything (return value and argument types) are int, long, pointer, or
some other primitive type that fits in a register. Struct types are
decomposed into multiple primitive types according to certain rules, until
everything is in terms of primitive types.

e.g. if you understand the examples in https://godbolt.org/g/qSdzHj then
you pretty much have a basic understanding of the C parameter passing ABI.
The register lists for x86-64 are rdi, rsi, ... for arguments and rax, rdx,
... for return values . (Also look at the corresponding LLVM IR).

Very few people actually know the precise rules in detail. However,
basically all LLVM developers (or more generally toolchain developers:
compiler, linker, debugger, etc.) will know the basic rules (like the
examples I linked above) and will know how to look up specifics in the
psABI document as needed (and probably 90% of the time just looking at
Clang's output on an example will answer a particular question; for
example, I forgot that it as rdx as the second register for return values;
all I remembered was that there were two return registers and the first was
rax).

If you're interested, these are the notes I took when I was initially
learning about this:
https://github.com/chisophugis/x64-Forth/blob/master/abi.txt
(woah this takes me back)
That Forth implementation doesn't actually call any external libraries, so
the only external ABI it cares about is the Linux syscall ABI, which I
recorded in this comment for my own memory
https://github.com/chisophugis/x64-Forth/blob/master/compile4.asm#L13
(yes, this was before I learned to use version control...)
The internal "ABI" used by this small Forth implementation is described in
https://github.com/chisophugis/x64-Forth/blob/master/compile4.asm#L6
(forth is a very low-level language so it needs its own processor-specific
ABI; most languages essentially just piggy-back on the C ABI for
processor-specific stuff)

LLVM handles all of this C ABI stuff for you (and there's other aspects of
the C ABI like stack alignment/layout, TLS access, relocations, etc.).
There's quite a bit of essential complexity, but as long as you stick to
the simple cases where the mapping from C is trivial, it's easy to
understand. There's a pretty deep rabbit hole though if you start getting
into complicated cases, but it's all just a relatively simple extension of
the cases that map trivially to C. For the most part, you're only
interaction with the rabbit hole will be needing to debug when things go
wrong, which will require understanding the basic concepts (which can be
understood via simple C examples) and then drill down as needed (e.g.
looking at what clang does, looking at the standard docs, etc.). This is in
some sense simple even for complex cases in that it doesn't require any
complex insight to understand harder cases; you're just verifying
assumptions against a list of rules.

That may sound scary, but as long as the IR your frontend generates is
internally consistent at the LLVM IR level it will be ABI compatible with
itself (modulo bugs in LLVM), so you can basically ignore it. You'll have
to have some familiarity with the C ABI in order to e.g. call an external
function like malloc in libc, but again, as long as the parameter types are
"simple" then the mapping between C and LLVM IR is very simple; you'll need
to have a basic understanding though in order to verify this though and
debug any think-o's.

   - Second, would I use a non-void ret instruction to return the result
   of an alloca, when would the latter be destroyed? Would that involve a copy
   from the runtime stack of the callee to that of the caller?

The result of an alloca is a pointer to the stack frame of the current

function, so returning it doesn't make sense (like returning the address of
a local variable in C). Again, this problem isn't really related to LLVM
per se and the easiest way to think about it is in terms of how to lower to
C.
Decisions about object destruction are up to the frontend. At the C or LLVM
level they just turn into function calls or whatever you use to implement
the semantics that the frontend requires. In other words, if you want an
object to be destroyed, it's up to your frontend to emit code to perform
the destruction wherever the destruction is expected to occur.

-- Sean Silva

Thanks for this comprehensive answer!
I’m slowly but surely to understand better about all this.

That may sound scary,

Yes it does! That said, I’m not sure need to care that much about the C++ ABI, as I don’t plan on linking my IR with C++ directly. It is very interesting to see how it works though, as I’ll probably need to solve the same kind of problems for my own frontend language.

The result of an alloca is a pointer to the stack frame of the current function, so returning it doesn’t make sense (like returning the address of a local variable in C). Again, this problem isn’t really related to LLVM per se and the easiest way to think about it is in terms of how to lower to C.
Decisions about object destruction are up to the frontend. At the C or LLVM level they just turn into function calls or whatever you use to implement the semantics that the frontend requires. In other words, if you want an object to be destroyed, it’s up to your frontend to emit code to perform the destruction wherever the destruction is expected to occur.

I understand my second question was probably very unclear. When I said “destroyed”, I wasn’t talking about calls to C++ (or any other frontend language) destructors, but more about what is happening to the memory that was stack allocated. Let’s consider the following IR:

define i32 @f() {
entry:
%0 = alloca i32
store i32 123, i32* %0
%1 = load i32, i32* %0
ret i32 %1
}

define i32 @main() {
entry:
%0 = alloca i32
%1 = call i32 @f()
store i32 %1, i32* %0
%2 = load i32, i32* %0
ret i32 %2
}

As I understand it, f will allocate a new 32 bit memory space on its stack frame, and stores the value 123 in it. But what I’m not sure about is what’s happening to that alloca when f returns. If the alloca is in the frame of f, then the value stored at that location should be deallocated (which is what I meant by “destroyed”) when f returns, and so I don’t understand what makes the store i32 %1, i32* %0 successfully storing 123 in the alloca of the main function. If I had to make a guess, I would say the value pointed by %0 in f is copied to the alloca pointed by %0 in main (which is what I meant by “copy from the runtime of the callee to that of the caller”). With primitive types such as i32, this probably doesn’t matter, since they would fit in a CPU register, but what if %0 was an alloca for a very large struct? I guess that’s what RVO is all about optimising, but I’d like to confirm that assumption.

Once again, I apologise for my poor knowledge of these kind of low-level semantics.

Best,

Dimitri Racordon
CUI, Université de Genève
7, route de Drize, CH-1227 Carouge - Switzerland
Phone: +41 22 379 01 24

Sorry for this incredibly long response. It’s just that probably about 90% of all beginner questions seem to be related to unfamiliarity with the same set of basic concepts (which, to be honest, are very poorly known outside the compiler/toolchain community, so that unfamiliarity is totally expected). By providing a more elaborated answer I’m trying to write something that hopefully helps many other people understand these concepts better.

Broadly speaking, it may be useful to follow this approach when facing a question while learning about LLVM:

  1. If you are starting from C++ code and trying to understand how it is lowered, run clang++ -emit-llvm -O0 and create a C file that produces essentially the same LLVM IR when passed to clang -emit-llvm -O0. This should always be possible, except perhaps things related to exception handling and some things you probably don’t care about like how to put a function in a comdat or give it certain linkage. If you can’t figure out how to do that, the answer is probably:
    1.a. in the C++ ABI https://itanium-cxx-abi.github.io/cxx-abi/abi.html (home page: https://itanium-cxx-abi.github.io/cxx-abi/) or
    1.b. it is a general C++ language question which you can probably find information about in http://en.cppreference.com/w/ (or the language standard if you want to look at it).

Those are fairly dense reference material, but you should at least browse around them a bit (and do web searches, especially for general C++ language questions) to refine your question.

  1. If your question wasn’t related to something C++ specific, then you now have a C file and want to know something about how it is lowered to LLVM IR. To make sure you have a basic understanding of LLVM IR concepts and the relationship between C and LLVM IR, I would recommend watching this video: https://www.youtube.com/watch?v=haQ2cijhvhE

  2. It’s likely that you started with a single question at the beginning and now have broken it down into multiple more specific questions that are much more likely to be answered if you ask them on llvm-dev (or cfe-dev if the question is very narrowly C++ specific, though for general C++ language questions StackOverflow might be better).

Another useful resource might be to skim these two videos, which talk about C++ coroutines (and their implementation in LLVM), which are a feature that cuts across the C++, C, and LLVM IR levels (and a little bit about machine level in the second video):
https://www.youtube.com/watch?v=8C8NnE1Dg4A
https://www.youtube.com/watch?v=Ztr8QvMhqmQ
You don’t need to actually understand what the videos are talking about in detail, but it is useful to see how Gor talks about the same thing to two very different audiences. The first video is to an audience at CppCon which is mostly C++ language users (and people familiar with C-like aspects of C++ programming), but not necessarily familiar with compilers. The second video is to an audience of LLVM developers, and so it doesn’t worry about mentioning lots of LLVM-specific details or using compiler/toolchain terminology and concepts. By comparing the two videos, it should give some insight about which things are easier to think about at different levels of analysis, which things work similarly and can be explained at multiple levels of analysis, and which things only matter at a particular level of analysis. For example, the high-level description of the coroutine splitting process can be done pretty much the same at LLVM IR level or at the C level (with some intrinsics). You’ll also notice that even when talking to an audience of LLVM developers, many things in the presentation are still described at the C level.
Coroutines are also interesting in that part of the language-level semantic lowering (the coroutine splitting) happens inside the optimizer at LLVM IR level instead of in the frontend for various reasons. Having an optimization pass that is semantically required is unusual, but otherwise this is conceptually similar to the “alloca trick” I talk about below that Clang uses for local variables (SROA / mem2reg are not semantically required for correctness, but they are needed for any reasonable performance, so the situation is still pretty analogous).

The alloca is part of f's stack frame, so when f returns it is gone. That memory can never be accessed again (obviously at the machine level it is still there and will probably be overwritten by a future function call, but as far as LLVM IR semantics (or C) it cannot be accessed again). To phrase this in terms of C, you are essentially asking about what happens to the memory pointed to by p in the function bar after it returns (and I think you already know the answer at the C level; the answer is the same at the LLVM IR level).

If the alloca is in the frame of f, then the value stored at that location should be deallocated (which is what I meant by “destroyed”) when f returns, and so I don’t understand what makes the store i32 %1, i32* %0 successfully storing 123 in the alloca of the main function.

The store inside f does not affect the alloca inside the main function. The main function takes the value returned by the call and stores the value itself. The return value of f is an i32 value that you can think of as held in a register at the hardware level and returned by value. In the callee (@f) the value 123 happens to have been stored to and loaded back from an alloca, but that doesn’t matter after it returns (and indeed, the alloca’s memory is deallocated when @f returns). Also, you’ll notice that the optimizer will happily reduce @f to just ret i32 123 or something like that and delete the load, store, and alloca.

I’m guessing that part of this confusion is that you were seeing Clang produce a ton of alloca’s. You may wonder why clang generates all these alloca’s (and you may have noticed the optimizer will just fold them away). This is one place where the semantics of C and LLVM IR differ substantially, so thinking in terms of C won’t lead to the right conclusion.

The fundamental abstraction in C of data is essentially that every variable is assumed to be an object in memory (more precisely, this is what the term “lvalue” in C refers to; if you want to get super technical about this and related concepts in C/C++ a reference is http://en.cppreference.com/w/cpp/language/value_category). For example, you can take the address of any variable.

For example, these two functions are identical

int foo() {

int x = 123;
return x;
}

int bar() {
int x = 123;
int *p = &x;
return *p;
}

Also, in C/C++, variables are mutable. For example, this function is also identical to the previous functions:

int baz() {
int x = 123;
x = 42;
x = 123;
return x;
}

However, in LLVM IR, when you see something like %2 = add i32 %1, %0 or %2 = load i32, i32* %0, the thing designated by %2 is very different. You cannot take the address of %2 and you cannot mutate the thing designated by %2. You can think of %2 as a name for a node in a data flow graph (in fact, %2 is represented in the LLVM libraries as an llvm::Value; %2 is a name for an llvm::Value). There’s more about this perhaps subtle point from 10:40 to 11:40 in Chandler’s video https://www.youtube.com/watch?v=haQ2cijhvhE, though you will probably find the first 20 minutes or so very relevant to this conversation (and I would recommend watching the whole video).

(Sometimes it is also useful to think of LLVM’s values as “SSA registers”, i.e. registers that can only be assigned in one place. Don’t think about them as general mutable storage locations though)

So Clang has this problem: how do I accurately translate the semantics of code written in C/C++ (which semantically involves operations on objects in memory) to LLVM IR, which revolves around nodes in a data flow graph (whose SSA values have no notion of semantically being in a “storage location”; they are just names for llvm::Value’s that the LLVM libraries operate on, which are (roughly) nodes in a data flow / control flow graph).

The answer is that Clang models C/C++ semantics of local variables by explicitly using alloca to create temporary in-memory storage locations. Then the correspondence between the generated LLVM IR and the C/C++ semantics is easy to get right. This requires inserting a lot of loads and stores from allocas just to access the values (e.g. needing two loads, an add, and a store to do c = a + b), but it is worth it because it simplifies Clang’s lowering from C/C++ to LLVM IR.

You may be thinking: in foo above, at the C/C++ level it is easy to see that x could be kept in a register instead of a memory location on the stack, and it is only assigned once, so it could be directly lowered into an LLVM IR value instead of kept in memory in an alloca. It turns out that it is actually very difficult to do this analysis in Clang, at least in any reasonable generality. It turns out that there are standard compiler algorithms (generally known as “SSA construction”) that can “easily” and quickly (e.g. linear time) do the analysis and transformation needed to optimally (or close to optimally) get rid of all the allocas. The discovery of an efficient algorithm to do this was actually a revolution in compilers precisely because it allows an SSA representation like LLVM IR to be efficiently constructed from input languages with mutable storage locations (especially in the presence of control flow, which requires deciding where to insert a special type of instruction called a “phi node” to model when a local variable is mutated along multiple control flow paths).
The mem2reg pass (and it’s more sophisticated cousin, sroa) are the passes in LLVM that do this.

You are correct in the sense that @f loads the value (%1 = load i32, i32* %0), then returns it (ret i32 %1), then control returns to main and %1 = call i32 @f() finishes executing, then @main stores the value %1 into %0 store i32 %1, i32* %0.
There is no magic here connected to the alloca’s; you’re explicitly seeing every operation done on those memory locations (which are distinct memory locations). The only thing that could perhaps be considered “magic” are the details of how the LLVM backends generate machine code so that the value passed to the ret becomes available to the caller. But the semantics at the LLVM IR level are straightforward: the value of the call instruction (%1 in main) is the same as the value passed to the ret in the callee (%1 in f).

With primitive types such as i32, this probably doesn’t matter, since they would fit in a CPU register, but what if %0 was an alloca for a very large struct? I guess that’s what RVO is all about optimising, but I’d like to confirm that assumption.

Are you asking about ABI details or LLVM IR semantics?

“RVO” doesn’t enter the picture here because we are at LLVM IR level. There is no RVO at the LLVM IR level because RVO refers to a concept that exists only at the C++ language level (in the abstract; it’s not part of the ABI). In fact, if you look on the wikipedia page for RVO, you’ll see that RVO can change a well-defined program’s console output (i.e. the number of times it prints A copy was made.). RVO is a specific case where the C++ language standard (the semantics of the language in the abstract; this has nothing to do with the ABI) permit the compiler to change the output of a well-defined program, and the set of situations where this transformation is allowed is defined in terms of the abstract language semantics, which are only available to the C++ frontend (which has complete language-level semantic information). This is a case where thinking about LLVM IR in terms of C code is accurate: if you are lowering C++ to C, then as part of that conversion RVO is performed, and there is no concept of RVO at the C level.

Semantically at the LLVM IR level a struct return type is returned by value (i.e. copied) exactly like an i32. %1 in @main represents a value with the same bit pattern and size as %1 in @f; this is true regardless of the size of %1 in @f. The alloca inside main is irrelevant to this. The alloca only matters because the value of %1 inside @main is stored into the memory it points to by an explicit store instruction, but that is just a store like any other store (it sets the bits in memory at the address pointed at by %0 to the bits of the value of %1; the memory address happens to be an alloca that is allocated in the current stack frame).

If you are wondering about the low-level details of how %1 in @f ends up in %1 in @main when the call happens at runtime, then that’s a bit more complicated and depends on the processor-specific ABI. At the processor ABI level, a struct return type of an LLVM IR function is returned by value in registers (just like in C). Like I mentioned, there’s a list of designated return registers (on x86-64 there are two such registers), with specially allocated stack memory to act as extra “registers” beyond the two designated hardware registers (if necessary). This specially allocated stack memory is not explicitly visible at the LLVM IR level and comes into existence in the LLVM backend as LLVM IR is translated into a form closer to machine level.

The exact ABI details of a function call at the LLVM IR level has quite a rabbit hole that you probably don’t care about. E.g. there is sret, byval, and inalloca (http://llvm.org/docs/LangRef.html#parameter-attributes) or calling conventions (http://llvm.org/docs/LangRef.html#calling-conventions) which control very low-level details of the mapping to the processor-specific ABI.

Another related thing (specifically related to C++):

The fact that the processor specific registers (or special stack memory representing extra “registers”) is not explicitly visible at the LLVM IR level is not an issue when lowering C where all structs are POD (Plain Old Data). That is why I said that you can think about this particular issue entirely in terms of lowering C++ to C. In C++, a struct can define constructors/destructors that allow it to maintain complex internal invariants. In particular, it can maintain internal invariants that depend on it being in memory, so the C++ language ABI cannot say that such structs are passed “by value” in registers. For example,

struct Foo {
Foo(const Foo &foo) {
this->x = foo.x;
this->p = &this->x;
}
int x;
int *p;
};

As part of a simple function like the following, the struct Foo that is returned “by value”, but if you think about it in terms of lowering C++ to C (or LLVM IR), it actually involves a call to a copy constructor:

Foo getFoo(Foo &f) {
return f;
}

(look at the generated LLVM IR: https://godbolt.org/g/VbwwJr)

This struct Foo cannot be returned by value “in registers” because it maintains an internal invariant that depends on being in memory (p points at x). Due to this language-level semantic detail, what would p be set to if this struct were returned in registers? (and it does fit in registers on x86-64 actually)

Because this is a language ABI requirement, all compilers must implement it identically in order to produce machine code that can be linked together. Therefore, it cannot depend on “how smart” each compiler’s optimizer is and is defined at the C++ language level. For example, you don’t want compiler’s disagreeing about whether getFoo will return in registers or in memory depending on whether they optimize this branch (which eliminates Foo’s dependence on its own address, therefore allowing it to in theory be returned in registers):

struct Foo {
Foo(const Foo &foo) {
this->x = foo.x;
if ((foo.x ^ foo.x) != 0) // Always false.
this->p = &this->x;
else
this->p = nullptr;
}
int x;
int *p;
};

Thus, the rules frequently require things to be passed via explicit pointer out-parameters even if they in theory could be passed in registers. For example, the requirement preventing Foo from being passed in registers in https://itanium-cxx-abi.github.io/cxx-abi/abi.html#return-value applies to structs that are “non-trivial for the purposes of calls” which basically means “did you manually write a copy/move constructor or destructor for this class”.

As an interesting factoid, this prevents std::unique_ptr from being passed in registers, even though it is just a single pointer and doesn’t depend on its own address, so could in theory be passed in registers. So converting an old-style:

Qux *createQux();

to a modern-style:

std::unique_ptr createQux();

is actually a slight (probably negligible) pessimization, because the ABI requirement prevents the latter from being returned in registers. The optimizer is actually smart enough to optimize this and instead pass the std::unique_ptr in registers, but the fact that it must “agree with other compilers” about it means it has its hands tied and has to follow the ABI. However, if you compile with link-time optimization, the compiler can verify that it is seeing every possible use of createQux and change them all at once to use registers instead. If createQux is static, the compiler can also verify that it is seeing every possible user and perform this optimization. Also, if createQux is inlined, the cost will obviously go away.

The ABI actually prevents the optimizer from passing the std::unique_ptr in registers, simply because all compilers (or different versions of the same compiler, or assembly language programmers) must agree on whether it is passed in registers or not for their code to be compatible. So when people say that std::unique_ptr has “zero overhead” you can now tell them that they are technically wrong (they are right in practice though, because except for ABI limitations compilers optimize std::unique_ptr extremely well).

As you can see, this is one case where the lowest-level hardware realities (like passing things in registers or not) require consideration at the language semantic level. This is one aspect that makes ABI’s have a very deep rabbit hole and lots of details, because the lowest-level details can have high-level effects. They also want to be high-performance, but they sometimes must sacrifice that for compatibility or to avoid being infeasibly complex.

– Sean Silva