request for tutorial

When I registered for dev conference, there was a field asking what I was particularly interested in learning. I didn’t fill it out then , but it occurs to me now that I’d really enjoy a tutorial on how to develop a new back end.

I spent some time recently reviewing existing material (documentation and code) and not making a lot of progress. Indeed, under some time pressure, I’m writing my own back end.

A chance to work through the material with some guidance would be greatly appreciated.

Thanks,
Preston

(Sorry about the wall of text, it ended up as a brain dump of a bunch of backend-related documentation that I know about/have bookmarked in the past. Hopefully there’s something useful in there.)

If you haven’t stumbled across them already, these might be helpful:

http://llvm.org/devmtg/2009-10/Korobeynikov_BackendTutorial.pdf
http://jonathan2251.github.io/lbd/

http://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1/

http://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm/

Another good source of documentation is simply to look at the commits that introduce new backends, e.g. the initial commits for SystemZ and AArch64 come to mind. If you use git, something like this might be handy:

git log -p -n1 $(git log --pretty=‘format:%H’ lib/Target/AArch64/ | tail -n1) >initial.patch

Generally the most consistent source of good documentation is the devmeetings, so definitely browse what’s available at <http://llvm.org/devmtg/> (even quite old ones can still be really useful, such as Anton’s slides). Also, as I’m sure you’ve heard, a lot of that backend stuff is slated for change (getting rid of SelectionDAG, global-isel, separating the MI layer, etc.), although the process will probably take a long time.

There’s also a lot of backend-related knowledge encoded (trapped?) inside of the TableGen backends, and AFAIK the only way to really puzzle out the deep details there is to look at the source code of the TableGen backends in utils/TableGen/ and correlating that with the generated .inc files (look in your build dir) and where those .inc files are included inside lib/Target/$ARCH/; see utils/TableGen/TableGenBackends.h for an overview of what’s there. There is some rough documentation about them in <http://llvm.org/docs/WritingAnLLVMBackend.html> and <http://llvm.org/docs/CodeGenerator.html>.

The actual TableGen “language” is documented fairly poorly but actually has real dedicated documentation available (<http://llvm.org/docs/TableGenFundamentals.html>, <http://llvm.org/docs/TableGen/LangRef.html>). There’s also tests/TableGen/ which might be useful. To learn about the in-code data structures that are produced by TableGen and manipulated by the TableGen backends, you basically need to look at include/llvm/TableGen/Record.h and lib/TableGen. Beware: the TableGen code is not quite up to par with the rest of the codebase. Also, don’t assume that there’s some coherent “underlying structure” to the TableGen language; it’s mostly a product of long periods of neglect punctuated by quick hacks when a developer gets frustrated by clunky existing ways to express something (or inability to express something).

The only documentation (besids the code itself) we have about the MI layer is the short section <http://llvm.org/docs/CodeGenerator.html#machine-code-description-classes>. You can get an idea of the scope of the MI layer from the plans laid out in <http://thread.gmane.org/gmane.comp.compilers.llvm.devel/65434>. Other than that you’ll probably have to go digging in the source or ask on the mailing lists.

There’s a blog post about MC <http://blog.llvm.org/2010/04/intro-to-llvm-mc-project.html> and an old dev meeting talk by Daniel Dunba, along with a section here <http://llvm.org/docs/CodeGenerator.html#the-mc-layer>r. There’s also a quite thorough tutorial specifically about implementing an integrated assembler <http://www.embecosm.com/appnotes/ean10/ean10-howto-llvmas-1.0.html>. Other than that, I don’t know very much material to recommend about MC.

If you gain any insight about a specific thing in the course of your backend project, please feel free to share that knowledge by documenting <http://llvm.org/docs/SphinxQuickstartTemplate.html>.

– Sean Silva

Hi Preston,

I don’t have any extra documentation to offer, but I’d be glad to talk through the topic with you at the dev conference and/or elsewhere. I’ve written several backends, so have a reasonable feel for what’s involved.

For example, I strongly suggest writing the integrated assembler and disassembler first, then adding compiler code generation support to that. It’s much easier to adjust assembler-centric definitions to play nicely with code generators than the other way around in my experience, and you end up with .td files that more accurately represent the architecture rather than just a compiler-centric view of that architecture.

-Jim

He did this again last year at the EuroLLVM:

http://llvm.org/devmtg/2012-04-12/Slides/Workshops/Anton_Korobeynikov.pdf

They're similar, but different. I assume the newer one is better, but you
can look at both, shouldn't take too long.

cheers,
--renato

http://llvm.org/devmtg/2012-04-12/Slides/Workshops/Anton_Korobeynikov.pdf

They're similar, but different. I assume the newer one is better, but you
can look at both, shouldn't take too long.

Yeah, right. The EuroLLVM one was a "refreshed" version of the first
one. For example, there was no MC in 2009 and so on.

The backend stub is available at https://github.com/asl/llvm-openrisc
but given the LLVM development speed it's heavily outdated. Maybe it
will make sense to look into new backends like SystemZ.

It would be nice if a stub backend could be committed to the tree and so it would stay up to date with changes...

David

That is a good idea.

The caveat is that people needing quick changes to their backends (for
fixing buildbots, for instance) would have the chance to break the dummy
back-end and slow down bug-fixing, or end up disabling some tests on the
dummy backend just to fix the problem. I can't predict how much of it will
actually impact day-to-day development, but it'd be crucial to have some
code owner that would keep it tidy and relevant.

Anton,

I've forked your repository in GitHub and I'll give it a try to make it
work on the new LLVM, but if you have some time, feel free to do it, too. :wink:

cheers,
--renato

It would be nice if a stub backend could be committed to the tree and so it would stay up to date with changes...

I asked for this several times and the conclusion was "no, what for,
we have plenty of other backends in the tree which can be thought as
examples". IMHO, it will be nice thing to have though...

I think that it would be fine as long as we have two conditions:
* Fixes on other back-ends that break the dummy tests are allowed to
disable them
* Code owners will fix the broken tests on a timely basis

If most/all tests end up disabled because they all break, then it'd be good
to remove from the tree, since the code owners are not doing their job
right.

Would also be good to get bugs created for each test other people disable,
so we could prioritize the changes.

I don't know much about that back-end, but I'd really wanted to learn
building one from scratch, so I could help you on the bug-fixing and adding
lots of comments to help guide folks through the changes, but I can't be
the sole owner, as I know nothing about it. :wink:

cheers,
--renato

I asked for this several times and the conclusion was "no, what for,
we have plenty of other backends in the tree which can be thought as
examples". IMHO, it will be nice thing to have though...

I think that it would be fine as long as we have two conditions:
* Fixes on other back-ends that break the dummy tests are allowed to
disable them
* Code owners will fix the broken tests on a timely basis

(Devil's advocate, but half serious):

It may be a good idea to require that the dummy backend is kept up to date,
as a way to keep a pulse on changes to the interfaces between the
target-independent and target-dependent parts of LLVM. A slightly idealized
way to put it is that the dummy backend would be a "living definition" of
the boundary between the target-dependent and target-independent parts of
LLVM. By virtue of being just a stub, complexity related to being a "real
backend" could be removed, so that when you look at a file inside the stub
backend, you know that there isn't extraneous junk that some other target
wouldn't have: everything there is essentially an example of something that
every target has in common.

Just as having a reduced testcase is much easier to debug compared to the
original unreduced code it originates from, so too a stub backend could be
useful to gain understanding of what is happening in a real backend.

If we included a script that duplicates the stub into a new directory and
renames things appropriately (and probably also clang-format's it along the
way, since name lengths may have changed), that would radically alter how
easy it is to teach/learn about the backends too, for example, with a "real
backend", it's just not going to be realistic to have a learning material
like the following:
    1. Explain calling convention handling, and how XXXCallingConv.td works
    2. Have an exercise "implement a calling conv with these properties:
..."
... since with a real backend, you don't want to be messing around and
inserting random new calling conventions/instructions/etc., but with a stub
(or even better, a clone of the stub), it is natural to say "suppose that
the target were different in this
entirely-hypothetical-but-pedagogically-useful respect; modify the code to
handle this". Those kinds of hands-on exercises are fundamental to gaining
confidence with how code works.

So I think the argument can be made that even though a stub backend might
represent an additional maintenance burden, what you get in return is that
it becomes a lot easier to attract developers to work on the backend
(either existing LLVM developers that didn't do backend stuff, or new
contributors), which then share the maintenance burden. Realistically, even
if the stub backend contributes to attracting just *1* new contributor to
backend development, who goes on to work full-time on LLVM, then it will
have paid for itself many times over in maintenance cost.

-- Sean Silva

These things have a tendency to bitrot. When I started looking at gcc
it had an example frontend (treelang). The problem was that it was
updated just enough to keep it building. It didn't follow the best
practices that the other frontends had been moved too. In the end it
was easier to read the fortran FE code than treelang's.

Cheers,
Rafael

--- end quoted text ---

Hi,
Based on my experience, the best path forward is for folks to use the real
backends as examples of design. What is needed is a tightly focused backend design
guide available and properly maintained. The design guide should only touch on
the essential issues and concepts to help someone get started. The aspiring
backend designer must take responsibility for traversing the learning curve.
And in this context, one can envision the limited nature of the documented
design guide, where the goal is to point folks into the key objects in the
code with overview commentary concerning major dependencies, etc. Within
this limited scope, it shouldn't be too much of a burden to maintain the
design guide document, while providing motivated developers the bridge they
need to get started without wasting a lot of time.

enjoy,
Karen

Thanks all, for all the pointers. I’d read some of them, but others are new.
Jim, I’d certainly appreciate the chance to get together at the dev conference.
I think y’all’s ideas of a stub back end sound good, though I can appreciate the maintenance difficulties.
Karen’s idea for a new design guide is also good, though I think the idea of looking at other back ends for examples of good design is perhaps wrong.

A lot of my difficulty in reading other examples is that it’s not clear what matters and what doesn’t. It’s what I hope to get by sitting next to someone and asking questions. Some of this could be addressed in a guide. I’d start with a chapter on planning. When writing a new back end, we wonder:

  • what sort of architecture are we dealing with? 3-address, 2-address, 1-address, 0-address (stack machine)?
  • what do we want to build? Do we need an assembler?, a disassembler?, a JIT?, an instruction scheduler?, a register allocator?
  • Can we talk about all these things in an orthogonal fashion? With examples, or perhaps pointer to specific targets that illustrate different things?

Another approach, a mix between a dummy back end and Karen's proposal, is
to not only write documents on how things piece together, but also add
comments to the code on how important this file/function is, how it fits
with the rest, and how generic/specific it is.

Documentations get outdated more often than comments in the code, and LLVM
is particularly good at generic comments, but not so much at teaching how
to use the code.

Patches adding comments are also a good way to learn how something works.
You send a comment, people say how wrong that is, and in the end, you learn
by teaching others via your comments.

cheers,
--renato

This is especially true in the back end. For example, MCInstrDesc[1] completely lacks any class-level doc comment saying how you would map from an instruction to the description. It turns out that each back end exposes an array of MCInstrDesc objects, indexed by opcode (in the MCInstr sense, not in the target instruction set sense), but the only way to discover this is by grepping the codebase.

We're generally quite good at adding comments on individual methods (or at making the method names self-explanatory), but not at documenting classes. I'd encourage anyone who has added a class to LLVM to revisit it and see if they can improve the documentation.

Another example is object::ObjectFile. You can cast<> or dyn_cast<> these down to object::COFFObjectFiles or object::MachOObjectFiles easily, but object::ELFObjectFile is a template class and there's no documentation of what the template parameter should be, of it if's possible to infer it from the object::ObjectFile somehow.

David
Who has spent the last week writing a tool that uses the MC layer and found the lack of documentation frustrating

[1] http://llvm.org/doxygen/classllvm_1_1MCInstrDesc.html

This strikes me as poor design, helped by the fact that there was no
expected API or documentation on how to use it, so special implementations
will differ and become incompatible with each other. Per se, it's not a
problem, but when you change the interface, (or when trying to use
polymorphism), things can go wrong in mysterious ways.

A few lines of comments saying what's expected from the derived classes
would have prompted folks to work around the template problems (or found a
different solution) before it became an issue.

cheers,
--renato

I think y’all underestimate how important documentation can be. There are, after all, documents out there that purport to be guides to writing a back end for LLVM. I know of 2 other experienced & motivated compiler writers who read the available documentation, wrote some code, foundered, gave up, and wrote their own back end from scratch. So there’s three of us that I know about, and I don’t get around much. A friend in the newspaper business said they had a rule of thumb that said approximately: “One letter from a reader implies another hundred readers out there who had the same opinion.”

Preston

Have you read through http://llvm.org/docs/CodeGenerator.html ? It’s a reasonably good document, but IME people tend to ignore it.

I don’t really find it in the documentation so far, but I think the instruction selection stuff is based on a BURG-like pattern matcher. It seems like this is the key intellectual challenge for the not-completely RISC machines. Lots of discussion seems warranted, but I find none at all. Somewhere inside the x86 target must be a lot of stuff on how to handle a 2-address machine. I’ve done an x86 CG once before, and all of the fun was in getting the right set of patterns to handle the fancy addressing modes, the condition codes, and the 2-address instructions. I’m working on a 1-address machine now (an accumulator machine, with a set of register and an accumulator, and instructions of the form ACC += Register) and the pattern matcher to do everything optimally is delightful.

SelectionDAG is actually not a BURG optimal tiler; it’s a simple maximal munch pattern-matcher. The output of the instruction selector is still in SSA form, which mandates that, no matter the address form of your machine, you’ll need to model operations as been three-operand, with a tying constraint that will be satisfied later by the register allocator. I don’t know that anyone has been particularly successful at targeting accumulator architectures with LLVM, but you might look at how X86 handles integer multiplies for some inspiration.

what do we want to build? Do we need an assembler?, a disassembler?, a JIT?, an instruction scheduler?, a register allocator?

The minimal amount of stuff you need to write is described in http://llvm.org/docs/WritingAnLLVMBackend.html#basic-steps

Generally speaking, it amounts to defining your register classes and instructions, and implementing enough callbacks for instruction selection and register allocation (e.g., how to materialize a copy).

–Owen

I think y'all underestimate how important documentation can be.

I'm a strong proponent for good documentation, and whenever I get a solid
understanding of a specific part of LLVM I will usually write documentation
for it. (I'm pretty clueless about backend stuff, which is why I haven't
produced any documentation there!)

I think it's largely an issue with LLVM's culture. If somebody commits
something without tests, they will rapidly receive code review feedback
asking for tests; if someone adds a diagnostic to clang that has a really
bad user experience, they will be flagged for it and asked to improve it.

On the other hand, if somebody commits something without documentation (or
with poor documentation), usually I'm the only one that will flag them for
it and ask them to improve it.

There are, after all, documents out there that purport to be guides to
writing a back end for LLVM. I know of 2 other experienced & motivated
compiler writers who read the available documentation, wrote some code,
foundered, gave up, and wrote their own back end from scratch. So there's
three of us that I know about, and I don't get around much.

This makes me kind of depressed. I had always surmised that most of the
reason that I had a hard time grokking the backend code was due to simply
not having any formal compiler education (e.g. never taken a course or read
a book about it), so last weekend I read Cooper & Torczon's "Engineering a
Compiler" in an attempt to gain a better understanding of the subject. If
you're having a hard time grokking the LLVM backend stuff (and you're the
most cited author (excluding self-citations) in EaC), then I'm not sure
what my prospects are...

Don’t be depressed! I expect my difficulties, and those of the other folks I mentioned, are due to impatience. I know what’s supposed to be in a back end and how it’s generally supposed to work, so I read real fast, looking for clues to help me meld my understanding with that of the folks who wrote it. After a certain amount of floundering, I think: “I could write my own, and get it done quicker and understand it better.” It’s like what we all do for all sorts of poorly documented code.

It’s one of the reasons I like references to papers (e.g., this register allocator is based on that paper, with some mods). I read that sentence and instantly I know a lot about what’s going on and can look at the list of mods and know even more. Without the refs, I just have an annoying puzzle, and all my book learning is no help.

Preston