Sorry for this incredibly long response. It’s just that probably about 90% of all beginner questions seem to be related to unfamiliarity with the same set of basic concepts (which, to be honest, are very poorly known outside the compiler/toolchain community, so that unfamiliarity is totally expected). By providing a more elaborate answer, I’m trying to write something that hopefully helps many other people understand these concepts better.
Broadly speaking, it may be useful to follow this approach when facing a question while learning about LLVM:
- If you are starting from C++ code and trying to understand how it is lowered, run `clang++ -emit-llvm -O0` and create a C file that produces essentially the same LLVM IR when passed to `clang -emit-llvm -O0` (see the small sketch after this list). This should always be possible, except perhaps for things related to exception handling and some things you probably don’t care about, like how to put a function in a comdat or give it a certain linkage. If you can’t figure out how to do that, the answer is probably:
1.a. in the C++ ABI https://itanium-cxx-abi.github.io/cxx-abi/abi.html (home page: https://itanium-cxx-abi.github.io/cxx-abi/) or
1.b. it is a general C++ language question which you can probably find information about in http://en.cppreference.com/w/ (or the language standard if you want to look at it).
Those are fairly dense reference materials, but you should at least browse around in them a bit (and do web searches, especially for general C++ language questions) to refine your question.
- If your question wasn’t related to something C++ specific, then you now have a C file and want to know something about how it is lowered to LLVM IR. To make sure you have a basic understanding of LLVM IR concepts and the relationship between C and LLVM IR, I would recommend watching this video: https://www.youtube.com/watch?v=haQ2cijhvhE
- It’s likely that you started with a single question at the beginning and have now broken it down into multiple more specific questions, which are much more likely to be answered if you ask them on llvm-dev (or cfe-dev if the question is very narrowly C++ specific, though for general C++ language questions StackOverflow might be better).
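To make the first step concrete, here is a tiny hand-written sketch of the C++-to-C technique (the function and names are invented, and the exact IR will vary by Clang version and flags):

```llvm
; C++ source:  int addone(int &x) { return x + 1; }
; C source:    int addone(int *x) { return *x + 1; }
; Modulo the C++ name mangling (_Z6addoneRi), both produce essentially
; this IR: the C++ reference lowers to a plain pointer parameter.
define i32 @addone(i32* %x) {
entry:
  %v = load i32, i32* %x        ; dereference the pointer/reference
  %add = add nsw i32 %v, 1
  ret i32 %add
}
```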
Another useful resource is to skim these two videos, which talk about C++ coroutines (and their implementation in LLVM), a feature that cuts across the C++, C, and LLVM IR levels (and, in the second video, a little bit of the machine level):
https://www.youtube.com/watch?v=8C8NnE1Dg4A
https://www.youtube.com/watch?v=Ztr8QvMhqmQ
You don’t need to actually understand what the videos are talking about in detail, but it is useful to see how Gor talks about the same thing to two very different audiences. The first video is for an audience at CppCon, mostly C++ language users (and people familiar with C-like aspects of C++ programming) who are not necessarily familiar with compilers. The second video is for an audience of LLVM developers, and so it freely mentions LLVM-specific details and uses compiler/toolchain terminology and concepts. Comparing the two videos should give you some insight into which things are easier to think about at different levels of analysis, which things work similarly and can be explained at multiple levels of analysis, and which things only matter at a particular level of analysis. For example, the high-level description of the coroutine splitting process can be done pretty much the same at the LLVM IR level or at the C level (with some intrinsics). You’ll also notice that even when talking to an audience of LLVM developers, many things in the presentation are still described at the C level.
Coroutines are also interesting in that part of the language-level semantic lowering (the coroutine splitting) happens inside the optimizer at LLVM IR level instead of in the frontend for various reasons. Having an optimization pass that is semantically required is unusual, but otherwise this is conceptually similar to the “alloca trick” I talk about below that Clang uses for local variables (SROA / mem2reg are not semantically required for correctness, but they are needed for any reasonable performance, so the situation is still pretty analogous).
The alloca is part of `f`’s stack frame, so when `f` returns it is gone. That memory can never be accessed again (obviously at the machine level it is still there and will probably be overwritten by a future function call, but as far as LLVM IR semantics (or C) are concerned, it cannot be accessed again). To phrase this in terms of C, you are essentially asking what happens to the memory pointed to by `p` in the function `bar` after it returns (and I think you already know the answer at the C level; the answer is the same at the LLVM IR level).
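To illustrate the point, here is a hypothetical function (not from the code under discussion) that lets a pointer to an alloca escape:

```llvm
; Returning the address of an alloca is legal IR, but the memory it
; names is deallocated the moment @escape returns, so any later load
; through the returned pointer is undefined behavior.
define i32* @escape() {
entry:
  %x = alloca i32               ; lives in @escape's stack frame
  store i32 123, i32* %x
  ret i32* %x                   ; dangling once the frame is gone
}
```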
> If the alloca is in the frame of f, then the value stored at that location should be deallocated (which is what I meant by “destroyed”) when f returns, and so I don’t understand what makes the `store i32 %1, i32* %0` successfully store 123 in the alloca of the main function.
The store inside `f` does not affect the alloca inside the main function. The main function takes the value returned by the call and stores the value itself. The return value of `f` is an i32 value that you can think of as held in a register at the hardware level and returned by value. In the callee (`@f`) the value 123 happens to have been stored to and loaded back from an alloca, but that doesn’t matter after it returns (and indeed, the alloca’s memory is deallocated when `@f` returns). Also, you’ll notice that the optimizer will happily reduce `@f` to just `ret i32 123` or something like that and delete the load, store, and alloca.
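For reference, here is roughly what the `@f` under discussion looks like (hand-written; Clang’s exact value numbering will differ):

```llvm
define i32 @f() {
entry:
  %x = alloca i32               ; slot in @f's stack frame
  store i32 123, i32* %x        ; write 123 into the slot
  %v = load i32, i32* %x        ; read it back out
  ret i32 %v                    ; return the i32 value itself
}
; After optimization, the alloca, store, and load fold away, leaving
; just `ret i32 123`.
```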
I’m guessing that part of this confusion is that you were seeing Clang produce a ton of alloca’s and wondering why it generates all of them (you may also have noticed that the optimizer will just fold them away). This is one place where the semantics of C and LLVM IR differ substantially, so thinking in terms of C won’t lead to the right conclusion.
The fundamental abstraction of data in C is essentially that every variable is assumed to be an object in memory (more precisely, this is what the term “lvalue” in C refers to; if you want to get super technical about this and related concepts in C/C++, a reference is http://en.cppreference.com/w/cpp/language/value_category). For example, you can take the address of any variable.
For example, these two functions are identical:

```c
int foo() {
  int x = 123;
  return x;
}

int bar() {
  int x = 123;
  int *p = &x;
  return *p;
}
```
Also, in C/C++, variables are mutable. For example, this function is also identical to the previous functions:
```c
int baz() {
  int x = 123;
  x = 42;
  x = 123;
  return x;
}
```
However, in LLVM IR, when you see something like `%2 = add i32 %1, %0` or `%2 = load i32, i32* %0`, the thing designated by `%2` is very different. You cannot take the address of `%2` and you cannot mutate the thing designated by `%2`. You can think of `%2` as a name for a node in a data flow graph (in fact, `%2` is represented in the LLVM libraries as an llvm::Value; `%2` is a name for an llvm::Value). There’s more about this perhaps subtle point from 10:40 to 11:40 in Chandler’s video https://www.youtube.com/watch?v=haQ2cijhvhE, though you will probably find the first 20 minutes or so very relevant to this conversation (and I would recommend watching the whole video).
(Sometimes it is also useful to think of LLVM’s values as “SSA registers”, i.e. registers that can only be assigned in one place. Don’t think about them as general mutable storage locations though)
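A tiny hand-written illustration of the single-assignment rule (the function and names are invented):

```llvm
define i32 @twice_plus(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ; Defining %sum a second time would be rejected by the IR verifier;
  ; to "update" a value, you define a new one instead:
  %sum2 = add i32 %sum, %a      ; a new name, not a mutation of %sum
  ret i32 %sum2
}
```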
So Clang has this problem: how do I accurately translate the semantics of code written in C/C++ (which semantically involves operations on objects in memory) to LLVM IR, which revolves around nodes in a data flow graph? The SSA values have no notion of semantically being in a “storage location”; they are just names for llvm::Value’s that the LLVM libraries operate on, which are (roughly) nodes in a data flow / control flow graph.
The answer is that Clang models the C/C++ semantics of local variables by explicitly using alloca to create temporary in-memory storage locations. Then the correspondence between the generated LLVM IR and the C/C++ semantics is easy to get right. This requires inserting a lot of loads from and stores to allocas just to access the values (e.g. needing two loads, an add, and a store to do `c = a + b`), but it is worth it because it simplifies Clang’s lowering from C/C++ to LLVM IR.
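Here is a sketch of that style of lowering for `c = a + b` (hand-written in the spirit of Clang’s -O0 output; names are invented for readability):

```llvm
define void @assign(i32 %a, i32 %b) {
entry:
  %a.addr = alloca i32
  %b.addr = alloca i32
  %c.addr = alloca i32
  store i32 %a, i32* %a.addr    ; every local lives in memory...
  store i32 %b, i32* %b.addr
  %0 = load i32, i32* %a.addr   ; ...so `c = a + b` needs two loads,
  %1 = load i32, i32* %b.addr
  %add = add nsw i32 %0, %1     ; an add,
  store i32 %add, i32* %c.addr  ; and a store.
  ret void
}
```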
You may be thinking: in `foo` above, at the C/C++ level it is easy to see that `x` could be kept in a register instead of a memory location on the stack, and it is only assigned once, so it could be directly lowered into an LLVM IR value instead of kept in memory in an alloca. It turns out that it is actually very difficult to do this analysis in Clang, at least in any reasonable generality. However, there are standard compiler algorithms (generally known as “SSA construction”) that can “easily” and quickly (e.g. in linear time) do the analysis and transformation needed to optimally (or close to optimally) get rid of all the allocas. The discovery of an efficient algorithm to do this was actually a revolution in compilers, precisely because it allows an SSA representation like LLVM IR to be efficiently constructed from input languages with mutable storage locations (especially in the presence of control flow, which requires deciding where to insert a special type of instruction called a “phi node” to model a local variable that is mutated along multiple control flow paths).
The `mem2reg` pass (and its more sophisticated cousin, `sroa`) are the passes in LLVM that do this.
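To see SSA construction in action, consider a local that is assigned along two control flow paths. Here is a hand-written sketch (value names invented) of what mem2reg produces for something like `int min(int a, int b) { int x; if (a < b) x = a; else x = b; return x; }`:

```llvm
define i32 @min(i32 %a, i32 %b) {
entry:
  %cmp = icmp slt i32 %a, %b
  br i1 %cmp, label %then, label %else
then:
  br label %end
else:
  br label %end
end:
  ; The two assignments to x meet here; the phi node selects the
  ; incoming value based on which predecessor block ran.
  %x = phi i32 [ %a, %then ], [ %b, %else ]
  ret i32 %x
}
```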
You are correct in the sense that @f loads the value (`%1 = load i32, i32* %0`), then returns it (`ret i32 %1`), then control returns to main and `%1 = call i32 @f()` finishes executing, and then @main stores the value %1 into %0 (`store i32 %1, i32* %0`).
There is no magic here connected to the alloca’s; you’re explicitly seeing every operation done on those memory locations (which are distinct memory locations). The only thing that could perhaps be considered “magic” is the details of how the LLVM backends generate machine code so that the value passed to the `ret` becomes available to the caller. But the semantics at the LLVM IR level are straightforward: the value of the call instruction (%1 in main) is the same as the value passed to the `ret` in the callee (%1 in f).
> With primitive types such as i32, this probably doesn’t matter, since they would fit in a CPU register, but what if %0 was an alloca for a very large struct? I guess that’s what RVO is all about optimising, but I’d like to confirm that assumption.
Are you asking about ABI details or LLVM IR semantics?
“RVO” doesn’t enter the picture here because we are at the LLVM IR level. There is no RVO at the LLVM IR level because RVO refers to a concept that exists only at the C++ language level (in the abstract; it’s not part of the ABI). In fact, if you look at the Wikipedia page for RVO, you’ll see that RVO can change a well-defined program’s console output (i.e. the number of times it prints `A copy was made.`). RVO is a specific case where the C++ language standard (the semantics of the language in the abstract; this has nothing to do with the ABI) permits the compiler to change the output of a well-defined program, and the set of situations where this transformation is allowed is defined in terms of the abstract language semantics, which are only available to the C++ frontend (which has complete language-level semantic information). This is a case where thinking about LLVM IR in terms of C code is accurate: if you are lowering C++ to C, then RVO is performed as part of that conversion, and there is no concept of RVO at the C level.
Semantically, at the LLVM IR level a struct return type is returned by value (i.e. copied) exactly like an i32. `%1` in `@main` represents a value with the same bit pattern and size as `%1` in `@f`; this is true regardless of the size of `%1` in `@f`. The alloca inside main is irrelevant to this. The alloca only matters because the value of `%1` inside `@main` is stored into the memory it points to by an explicit store instruction, but that is just a store like any other store (it sets the bits in memory at the address pointed at by %0 to the bits of the value of %1; the memory address happens to be an alloca that is allocated in the current stack frame).
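A minimal sketch of that by-value semantics for an aggregate (types and names invented):

```llvm
%pair = type { i32, i32 }

define %pair @makePair() {
entry:
  ret %pair { i32 1, i32 2 }    ; the whole aggregate is the return value
}

define i32 @usePair() {
entry:
  %p = call %pair @makePair()   ; %p is the value itself, copied out
  %first = extractvalue %pair %p, 0
  ret i32 %first
}
```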
If you are wondering about the low-level details of how `%1` in `@f` ends up in `%1` in `@main` when the call happens at runtime, then that’s a bit more complicated and depends on the processor-specific ABI. At the processor ABI level, a struct return type of an LLVM IR function is returned by value in registers (just like in C). Like I mentioned, there’s a list of designated return registers (on x86-64 there are two such registers), with specially allocated stack memory acting as extra “registers” beyond the two designated hardware registers (if necessary). This specially allocated stack memory is not explicitly visible at the LLVM IR level and comes into existence in the LLVM backend as LLVM IR is translated into a form closer to the machine level.
The exact ABI details of a function call at the LLVM IR level are quite a rabbit hole that you probably don’t care about. E.g. there are the `sret`, `byval`, and `inalloca` parameter attributes (http://llvm.org/docs/LangRef.html#parameter-attributes) and calling conventions (http://llvm.org/docs/LangRef.html#calling-conventions), which control very low-level details of the mapping to the processor-specific ABI.
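For a taste, here is what an `sret` lowering can look like (a hand-written sketch using the typed-pointer syntax from this discussion; `%Big` is an invented type): the caller allocates memory for the return value and passes a hidden pointer to it.

```llvm
%Big = type { [64 x i32] }

; Logically "%Big @makeBig()", but lowered so the callee constructs
; its return value through the hidden out-parameter:
declare void @makeBig(%Big* sret)
```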
Another related thing (specifically related to C++):
The fact that the processor-specific registers (or the special stack memory representing extra “registers”) are not explicitly visible at the LLVM IR level is not an issue when lowering C, where all structs are POD (Plain Old Data). That is why I said that you can think about this particular issue entirely in terms of lowering C++ to C. In C++, a struct can define constructors/destructors that allow it to maintain complex internal invariants. In particular, it can maintain internal invariants that depend on it being in memory, so the C++ language ABI cannot say that such structs are passed “by value” in registers. For example,
```cpp
struct Foo {
  Foo(const Foo &foo) {
    this->x = foo.x;
    this->p = &this->x;
  }
  int x;
  int *p;
};
```
In a simple function like the following, the struct Foo is returned “by value”, but if you think about it in terms of lowering C++ to C (or LLVM IR), returning it actually involves a call to the copy constructor:
```cpp
Foo getFoo(Foo &f) {
  return f;
}
```
(look at the generated LLVM IR: https://godbolt.org/g/VbwwJr)
This struct `Foo` cannot be returned by value “in registers” because it maintains an internal invariant that depends on being in memory (`p` points at `x`). Due to this language-level semantic detail, what would `p` be set to if this struct were returned in registers? (And it does fit in registers on x86-64, actually.)
Because this is a language ABI requirement, all compilers must implement it identically in order to produce machine code that can be linked together. Therefore, it cannot depend on “how smart” each compiler’s optimizer is, and it is defined at the C++ language level. For example, you don’t want compilers disagreeing about whether getFoo will return in registers or in memory depending on whether they optimize this branch (which eliminates Foo’s dependence on its own address, therefore in theory allowing it to be returned in registers):
```cpp
struct Foo {
  Foo(const Foo &foo) {
    this->x = foo.x;
    if ((foo.x ^ foo.x) != 0) // Always false.
      this->p = &this->x;
    else
      this->p = nullptr;
  }
  int x;
  int *p;
};
```
Thus, the rules frequently require things to be passed via explicit pointer out-parameters even if they in theory could be passed in registers. For example, the requirement preventing Foo from being passed in registers in https://itanium-cxx-abi.github.io/cxx-abi/abi.html#return-value applies to structs that are “non-trivial for the purposes of calls” which basically means “did you manually write a copy/move constructor or destructor for this class”.
As an interesting factoid, this prevents std::unique_ptr from being passed in registers, even though it is just a single pointer that doesn’t depend on its own address and so could, in theory, be passed in registers. So converting an old-style:
```cpp
Qux *createQux();
```
to a modern-style:
```cpp
std::unique_ptr<Qux> createQux();
```
is actually a slight (probably negligible) pessimization, because the ABI requirement prevents the latter from being returned in registers. The optimizer would be smart enough to pass the std::unique_ptr in registers, but the fact that it must “agree with other compilers” means its hands are tied and it has to follow the ABI. However, if you compile with link-time optimization, the compiler can verify that it is seeing every possible use of createQux and change them all at once to use registers instead. If createQux is `static`, the compiler can also verify that it is seeing every possible user and perform this optimization. Also, if createQux is inlined, the cost will obviously go away.
The ABI actually prevents the optimizer from passing the std::unique_ptr in registers, simply because all compilers (or different versions of the same compiler, or assembly language programmers) must agree on whether it is passed in registers or not for their code to be compatible. So when people say that std::unique_ptr has “zero overhead” you can now tell them that they are technically wrong (they are right in practice though, because except for ABI limitations compilers optimize std::unique_ptr extremely well).
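A sketch of the difference in IR terms (types and names invented; compare the actual IR on godbolt if you’re curious):

```llvm
%Qux = type opaque
%unique_ptr = type { %Qux* }    ; in memory, still just one pointer

; Qux *createQux(); — returned by value in a register:
declare %Qux* @createQuxRaw()

; std::unique_ptr<Qux> createQux(); — "non-trivial for the purposes
; of calls", so the ABI forces a hidden sret out-parameter:
declare void @createQuxUnique(%unique_ptr* sret)
```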
As you can see, this is one case where the lowest-level hardware realities (like passing things in registers or not) require consideration at the language semantic level. This is one aspect that gives ABIs a very deep rabbit hole of details, because the lowest-level details can have high-level effects. ABIs also want to be high-performance, but they sometimes must sacrifice performance for compatibility, or to avoid being infeasibly complex.
– Sean Silva