"Mapping High-Level Constructs to LLVM IR" Github URL

Hi,

It will probably take a few weeks or a month before the “Mapping High-Level Constructs to LLVM IR” document is ready for prime time. Until then, you can review and study it at this URL:

https://github.com/archfrog/llvm-doc/blob/master/MappingHighLevelConstructsToLLVMIR.rst

Please note that I do not recommend a formal review of the document for another week or two. But feel free to send me any feedback, comments, and criticism that you may have.

Once the document has been finalized and committed to LLVM, I’ll delete the repository at GitHub - or, perhaps even better, simply make a small page that refers to the official copy in LLVM.

– Mikael

For the section on lambdas: there’s a better way than using llvm.frameaddress. Move all the locals used in the lambda into a struct (and make the containing function use that struct’s fields instead of regular locals), then pass the address of the struct to the lambda as a “this”. This is how Objective-C, for example, passes things to a block (although Objective-C also does reference counting so that the block can still be used after the original function has returned).
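A rough sketch of that struct-based approach, written in C rather than LLVM IR so it is easy to try out. All names here (`counter_env`, `add_step`, `enclosing`) are made up for illustration, and this is only one possible shape, not what any particular frontend emits:

```c
/* Hypothetical environment struct: every local that the lambda captures
   lives here instead of being an ordinary local of the enclosing function. */
struct counter_env {
    int total;
    int step;
};

/* The lambda body, lowered to a plain function that receives the address
   of the environment as an explicit first argument (its "this"). */
static void add_step(struct counter_env *env, int times)
{
    env->total += env->step * times;
}

/* The enclosing function reads and writes the captured variables through
   the struct, so it and the lambda share the same storage. */
static int enclosing(void)
{
    struct counter_env env = { .total = 0, .step = 5 };

    add_step(&env, 2);  /* call the "lambda": total becomes 10 */
    env.step = 1;       /* a change made by the enclosing function... */
    add_step(&env, 3);  /* ...is seen by the lambda: total becomes 13 */

    return env.total;
}
```

Because the environment sits on the stack of `enclosing`, this version is only valid while that function is still running; a closure that outlives its creator would need the struct to be heap-allocated, which is where the reference counting mentioned above comes in.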

Just briefly looking over the document, I really like the content.

I’m now starting to see a really good “fit” for this document: a “guide for language frontend implementers” illustrating basic techniques along with a discussion of implementation decisions regarding the lowering of certain constructs. I don’t think that we currently have any documentation targeted at language frontend writers; I would really like to see this document evolve into that.

E.g., what are the different ways to do lambdas and what are their tradeoffs regarding optimizability, etc.; or a discussion of the various function attributes (e.g. noalias) which are vital for getting the best performance out of the optimizers (many languages semantically don’t have aliasing issues (or fewer than C/C++ at least!), and this needs to be communicated to the optimizer; currently we don’t have documentation offering guidance in this regard).

Most of the documentation about the IR is aimed at people writing optimization passes. Catering to language frontends seems like it would be a really good thing to do in a systematic fashion.

– Sean Silva

I’m glad you see the potential of this document. It is very important that everybody joins in and adds their pennies so that the document eventually reflects the real experience of people who have actually tried and studied these things, and who are familiar with LLVM IR from using and implementing it for a long time.

I sort of hope that this document will one day cover almost all LLVM IR instructions and attributes so that the Language Reference can be used as a reference (like the title of it suggests) and this document can be used as a primer/guide/tutorial/introduction to LLVM IR. I imagine people will start out with this document and then gradually become familiar with the Language Reference.

– Mikael

It would really help if you tracked down various frontend implementers and
asked them for feedback. I can at least think of Rust, Haskell (GHC), and
Rubinius.

-- Sean Silva

Great idea! It is now on my to-do list.

– Mikael

Hey Mikael,
cool, this is really exactly what I need right now. It will speed up my learning curve immensely.
Thanks a bunch.

Daniel

Just want to comment that I strongly approve of your intended goal. Depending on how my time shakes out over the next few weeks, I may even take some time to write up my own experiences. I particularly like how you have chosen to lay out various alternative implementations rather than choosing "one true implementation".

A few areas you haven't covered and might want to consider:
- How to enable debug information? (Line and Function, Variable)
- How to interface with a garbage collector? (link to existing docs)
- How to express a custom calling convention? (link to existing docs)
- Representing constructors, destructors, finalization
- How to examine the stack at runtime? How to modify it? (i.e. reflection, interjection)
- Representing subtyping checks (with full alias info), TBAA, struct-path TBAA
- How to exploit inlining (external, vs within LLVM)?
- How to express array bounds checks for best optimization?
- How to express null pointer checks?
- How to express domain specific optimizations? (i.e. lock elision, or matrix math simplification) (link to existing docs)
- How to optimize call dispatch or field access in dynamic languages? (ref new patchpoint intrinsics for inline call caching and field access caching)

Out of curiosity, what do you see as the intended scope of this document? I see there being three main categories of languages: pure static compilation, "managed languages" (i.e. compile to bytecode + runtime system), and pure dynamic languages*. I could see different features and focus for a document geared at each of these camps. For example, the first two are going to be interested in static optimization, whereas the last two are going to be interested in speculative optimization. What are your thoughts on this?

* Note: Let's not get too caught up in the categorization. It's not really important where exactly the lines are drawn.

Philip

Hi Philip,

Thanks for your great list of ideas for the document!

I don’t really have a scope for the document beyond this: if it is something that requires mapping from high-level constructs to LLVM IR, I think it should go into the document. I started out using C++ examples because many people know C++. I am personally mostly an advocate of statically checked languages, but I don’t see that as a reason not to include information of relevance to non-statically checked languages. As for garbage collection, well, more and more languages are making use of that, so I think it is highly relevant to the document. I think all of the issues you have mentioned belong in this document, although I am not sure I’ll be the best person to write all of it. I sort of hope that everybody will add their pieces so that we get a huge document that addresses nearly all the needs of somebody who is going to start on or be using LLVM IR.

You are all more than welcome to branch the document at GitHub, add your corrections and even entire chapters, and I’ll be happy to merge it all back to a single document.

– Mikael

I have a significant chunk of notes from my experience with garbage collector integration with LLVM that I’d be happy to contribute to this effort.

The original notes live here: https://code.google.com/p/epoch-language/wiki/GarbageCollectionScheme

I imagine it would be preferred if that document was formatted and edited to match the existing efforts; I’ll try and start converting it over in the next several days and submit a pull req when I’ve got a finished chapter.

I also have some miscellaneous notes from my other work on the Epoch language that might be worth including, I’ll have to comb through them and see what I have that’s not already covered in this writeup.

– Mike

Hi Mike,

That sounds awesome! I have no experience with implementing garbage collection at all, so I was wondering if I should do a call for writers. I am already overloaded by the prospect of having to learn and document zero-cost exception handling. As I said to Dirkjan at one point: I am probably not the best person to write this document, except for the fact that I have all the time in the world.

If writing up the notes in reST is too much trouble, you can always use whichever format you prefer and hand it over to me. I don’t mind an occasional manual reformatting task to cool down my brain a bit ;-) This goes for anybody who reads this message and does not like the idea of having to learn reST just to submit notes or chapters to the final document. Submissions in any of these formats are accepted: Word, PostScript, PDF, plain ASCII, or reST. Other formats are fine too, as long as I can find a way to read or print them on a Windows box.

I meant to say this earlier on: Anybody with sufficient knowledge and/or experience is more than welcome to submit one or more chapters to be included in the final text. The basic idea is to document all those trivial and difficult things that go into making a front-end using LLVM. The scope of the document is strict LLVM IR, though, so there will be no mention of the C++ or C API.

– Mikael

This looks really awesome! Great idea starting this, and thank you for pushing it forward. Some thoughts:

  • In “local variables”, it would be great to talk about how the “alloca trick” avoids forcing your frontend to build SSA. You could even include an example.

  • In the “constants” section, it is probably best to say that “constants that allocate memory” are just global variables in LLVM IR, marked with the ‘constant’ keyword. It would also be great to mention constant exprs here, since they are a point of confusion (and you introduce them in sizeof).

  • Having something that talks about lowering C-style unions to LLVM IR would be great :-)

  • A nice new top-level section would be “interoperating with a runtime library”, pointing out that not everything needs to be emitted as inline LLVM IR: a frontend can either just call into a runtime library, or it can even emit a call to a runtime library (whose body is also available as IR) and then have the optimizer inline it if run.
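To make the runtime-library point concrete, here is a small sketch in C rather than IR. It assumes a made-up language whose array indexing is bounds-checked; `rt_array` and `rt_array_get` are hypothetical names for a runtime type and helper, and `first_plus_last` stands in for code a frontend might generate. Nothing here is an existing LLVM or runtime API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical array representation chosen by the frontend. */
struct rt_array {
    size_t length;
    const int *data;
};

/* Hypothetical runtime helper. It is written once in the runtime library
   instead of being re-emitted inline at every indexing site. Marking it
   static inline keeps the body visible to the optimizer, which plays the
   role of "the body is also available as IR". */
static inline int rt_array_get(const struct rt_array *arr, size_t index)
{
    if (index >= arr->length)
        abort();  /* out-of-bounds access: this sketch simply traps */
    return arr->data[index];
}

/* What the frontend might generate for the source expression
   xs[0] + xs[length - 1]: plain calls into the runtime helper.
   For an empty array, length - 1 wraps around and the check traps. */
static int first_plus_last(const struct rt_array *xs)
{
    return rt_array_get(xs, 0) + rt_array_get(xs, xs->length - 1);
}
```

The frontend stays simple because it only emits calls; whether the check ends up inlined at each call site is left to the optimizer.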

Are you interested in just building it in llvm.org/docs? Unless your workflow is better on github, it seems easier to do it on llvm.org - it would make it easier for other people to chip in.

-Chris

Hi Chris,

Thanks for the supporting words! I’m pushing the document both for egoistic motives (like so many others, I’ll learn a ton from this document) and for altruistic motives - the easier it is to implement a new language, the more interesting and well-thought-out languages we will see in the future. And I see it as my purpose, as a mostly black-box user of LLVM, to enhance the experience for newcomers so that they don’t turn away and waste time on other projects just because it all seems rather overwhelming at first.

I couldn’t recall having heard of the “alloca trick”, but a Google search revealed that it is described in the Kaleidoscope sample. I will be more than happy to include it - that’s precisely what the document is also for: teaching people all the things that cannot easily be said in a Language Reference. In a way, the name is already becoming poorly chosen, because I am beginning to see a User’s Guide forming on the horizon. And that would go really well with the Language Reference; most products have both.

I’ve added all of your suggestions to my to-do list, which I’ll write into the document later today, so that none of the suggestions get lost. Yes, the unions I thought about at some point but forgot about them again. I also feel that there needs to be good documentation of GEP and extractvalue - when to use one and when to use the other. In fact, the whole structure/union aspect seems mostly overlooked because I got too preoccupied with the class stuff.

I am not at all opposed to working directly from llvm.org/docs, the only thing is that I do a lot of small commits with an occasional large commit here and there, and I wouldn’t want to provoke a review whenever I change a single line here or there. The reason I use GitHub is that it provides a nifty, colorized page (https://github.com/archfrog/llvm-doc/blob/master/MappingHighLevelConstructsToLLVMIR.rst) that people (including myself) can view without going through the trouble of installing and running Sphinx locally. And also, it allows people to submit reviews by forking, creating an issue, or attaching a comment (all three of which have already been in use). I think it is better that I do it in GitHub for the time being as I tend to make many small, stupid mistakes that I usually discover quite quickly and then fix. Then when I feel I’ve got something interesting to show people, I can submit a commit and everybody can join in the review.

– Mikael

Hi Chris,

I just had the joy of reading Regehr’s and your articles on undefined behavior in C and C++. Thank you for those. They sort of put into words things that I have felt for a decade or so - that C and C++ are too unsafe to be used anywhere but in the most “masochistic” environments (sorry, you guys who love C++ out there - please don’t get offended just because I personally happen to have grown very tired of the C/C++ universe long ago; I once lived and breathed C and C++ like you do now!).

Anyways, I just wanted to suggest something that I have been doing for a few decades. And that is to increase the reliability of the output program by using at least four different compilers. Not three times GNU and one time MSVC, but four different vendors’ compilers. I believe LLVM already does this, but I also know of many projects that make the “fatal” mistake of tying themselves to a single compiler, like MSVC, and then lean on whatever behavior it happens to produce in undefined cases. Obviously, such projects are in for a rough ride the first time they get ported to another compiler.

Like I always say: Windows didn’t become stable until the time that Microsoft was forced, by business plans, into porting Windows to four different architectures. And the same goes with many software projects that only use a single compiler on a single platform - they are, in your words, mines waiting to explode.

I’m confident that you know all this well, but you write “There is No Reliable Way to Determine if a Large Codebase Contains Undefined Behavior”. True indeed, but if you use 4+ compilers, perhaps combined with building on both big-endian and little-endian platforms, and do both debug and release builds, you get much closer to the magic point where you can actually sleep soundly knowing that your code is mostly okay :-)

– Mikael

TBQH I'm pretty set on this being a guide for language frontends, rather
than a general "user's guide" for the IR. The IR has at least two very
different classes of users: optimization writers (which are mostly
transforming IR) and language frontend writers (which are mostly creating
IR). Almost everything in this document is geared at language frontend
writers (or more generally "people generating IR"), rather than
optimization writers (we already have pretty good docs for them).

GEP is for forming addresses, and extractvalue/insertvalue is for
extracting/inserting fields from aggregate-typed SSA values.
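As a loose analogy in C (not actual IR, and not a claim about how any given C compiler lowers these functions), the distinction looks roughly like this. `point` and the three helpers are made-up names for illustration:

```c
struct point {
    int x;
    int y;
};

/* GEP-like: pure address arithmetic. Nothing is loaded or stored here;
   the result is just a pointer to the field inside existing memory. */
static int *address_of_y(struct point *p)
{
    return &p->y;
}

/* extractvalue-like: the whole aggregate is a value (no address is
   involved conceptually), and one field is pulled out of it. */
static int y_of_value(struct point p)
{
    return p.y;
}

/* insertvalue-like: produce a new aggregate value with one field
   replaced. The caller's original value is left untouched. */
static struct point with_y(struct point p, int y)
{
    p.y = y;
    return p;
}
```

So a frontend that keeps its structs in memory (for example in allocas) would mostly be forming addresses and then loading or storing through them, while extractvalue and insertvalue only come into play when a struct is carried around as a single SSA value, such as a multi-value function result.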

If it's easier for your workflow to iterate on github, that's fine,
although eventually we will want to move it into docs/. It definitely has a
bit of a "grab bag" feel; as the content solidifies, I'd like to see a
better organization.

There's some content though that might be easier to develop in-tree, like
how to hint the optimizers to get maximum performance. Especially alias
analysis (both TBAA, and the various function/parameter attributes) and
alignment, as most non-C languages can provide very strong aliasing and
alignment guarantees.

-- Sean Silva

On the alloca trick: it would be worth mentioning that this is the trick that Clang itself uses. There can be some tradeoffs though - particularly if your language involves anything which looks to LLVM like an assignment but isn’t one semantically (for example, object relocation, metadata updates, lazy computations, etc.). Long term, you probably want to teach LLVM about your semantics anyway, but as a short-term measure, doing SSA conversion yourself can be useful to increase out-of-the-gate performance.

Agreed. Adding a description of the variant type (type-safe union) would also be useful.

A couple of things to add on interoperating with a runtime library:

1) Your runtime function might also be represented as a custom LLVM intrinsic. This allows selective lowering with custom passes. (By default, you’d lower the intrinsic into a runtime call.)

2) Getting function attributes correct on your runtime calls can be key for performance (e.g. if this is a pure call, CSE should exploit that). Custom calling conventions (e.g. fewer caller-saved registers) can also be useful.

Philip

You’re absolutely right, I didn’t think it through: This is and remains an LLVM IR guide for front-end writers, not a user’s guide as such. That would be an entirely different project.

– Mikael