[Proposal] Annotated assembly output

The following is a brief proposal for annotated assembly (and disassembly) output. Kevin Enderby and I have been discussing this a bit and are interested in getting broader feedback from interested folks.

    LLVM Rich Assembly Output

LLVM's (dis)assembly output is currently very raw. Consumers have limited ability to introspect the instructions' textual representation or to reformat it for a more user-friendly display. A lot of the actual instruction semantics are contained in the MCInstrDesc for the opcode, but that's not sufficient to reference into individual portions of the instruction text. Clients like disassemblers, list file generators, and pretty-printers need more than the raw instructions and the ability to print them.

The intent is for the vast majority of the new functionality to not require new APIs, but to be in the assembly text itself via markup annotations. The markup is simple enough in syntax to be robust even in the case of version mismatches between consumers and producers. That is, the syntax generally does not carry semantics beyond "this text has an annotation," so consumers can simply ignore annotations they do not understand or do not care about.

** Instruction Annotations

Annotated assembly display will supply contextual markup to help clients more efficiently implement things like pretty-printers. Most markup will be target-independent, so clients can effectively provide good display without any target-specific knowledge.

Annotated assembly goes through the normal instruction printer, but optionally includes contextual tags on portions of the instruction string. An annotation is any '<' '>' delimited section of text(1).

annotation: '<' tag-name tag-modifier-list ':' annotated-text '>'
tag-name: identifier
tag-modifier-list: comma delimited identifier list

The tag name is an identifier which gives the type of the annotation. For the first pass, this will be very simple, with memory references, registers, and immediates having the tag names "mem", "reg", and "imm", respectively.

The tag modifier list is typically additional target-specific context, such as register class.

Clients should accept and ignore any tag names or tag modifiers they do not understand, allowing the annotations to grow in richness without breaking older clients.

For example, a possible annotation of an ARM load of a stack-relative location might be annotated as:

    ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>

1: For assembly dialects in which '<' and/or '>' are legal tokens, a literal token is escaped by following immediately with a repeat of the character. For example, a literal '<' character is output as '<<' in an annotated assembly string.
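As a rough illustration of the grammar and the escaping rule in footnote 1, here is a minimal Python sketch of a client-side parser. The function name parse_annotated and the (tag, modifiers, start, end) tuple layout are made up for this example and are not part of the proposal. It assumes well-formed input, that the tag name and the modifier list are separated by whitespace (as in the ARM example), and that a doubled '<' or '>' is always an escape. Under that last assumption two annotation closers can never appear back to back; the proposal does not say how that case would be resolved.

```python
def parse_annotated(text):
    """Return (plain_text, annotations) for one annotated instruction string.

    Each annotation is (tag, modifiers, start, end), where start/end are
    offsets into plain_text. Hypothetical helper, not an LLVM API.
    """
    plain = []        # output characters with the markup removed
    annotations = []  # completed annotations, innermost first
    open_stack = []   # (tag, modifiers, start) for annotations still open
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "<>" and text[i + 1:i + 2] == ch:
            # Doubled character: an escaped literal '<' or '>'.
            plain.append(ch)
            i += 2
        elif ch == "<":
            # Header runs up to the first ':' and holds "tag-name modifiers".
            colon = text.index(":", i)
            header = text[i + 1:colon].split(None, 1)
            tag = header[0]
            modifiers = []
            if len(header) > 1:
                modifiers = [m.strip() for m in header[1].split(",")]
            open_stack.append((tag, modifiers, len(plain)))
            i = colon + 1
        elif ch == ">":
            # Close the most recently opened annotation.
            tag, modifiers, start = open_stack.pop()
            annotations.append((tag, modifiers, start, len(plain)))
            i += 1
        else:
            plain.append(ch)
            i += 1
    return "".join(plain), annotations


if __name__ == "__main__":
    example = "ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>"
    text, spans = parse_annotated(example)
    print(text)
    for tag, modifiers, start, end in spans:
        print(tag, modifiers, repr(text[start:end]))
```

For the ARM example above, the plain text comes out as "ldr r0, [sp, #4]", and the "mem" annotation covers "[sp, #4]" with the "reg" and "imm" annotations nested inside it. A client that does not recognize a tag or modifier can simply skip that tuple.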

** C API Details

Some intended consumers of this information use the C API, so a new C API function, "LLVMDisasmInstructionAnnotated", will be added to the disassembler to disassemble an instruction with annotations.

How is the client supposed to make use of this markup information? At
first glance it seems like client code will just devolve into a pile
of regex insanity. Why not use an existing standardized markup, like
XML (not that I'm that fond of XML)?

At a higher level, why not expose an API for iterating over
(potentially annotated) tokens which can be programmatically
inspected. So what you expose to clients is an AnnotatedAsmTok. Given
an AnnotatedAsmTok, they can call "getAnnotation()", or
"getRawText()". A textual representation which can be read into this
form might be useful, but we should provide the parser.

I guess what I think needs a bit more explanation is why you chose to
go the "markup" route, instead of a normal programmatic API. Maybe you
could also include a couple use cases that capture your "vision" for
this functionality, and maybe a tiny bit of sample code doing
something interesting with a very rough initial interface (if it seems
more natural, since you're talking about a C API, you can just assume
bindings and write the example in your scripting language of choice).

-- Sean Silva

Hi Sean,

Thanks for the feedback! Exactly the sort of discussion I was hoping to get started.

How is the client supposed to make use of this markup information?

Target-independent introspection of the assembly. A simple example is color-coded output in a GUI disassembly display: all registers show up in one color, all memory references in another, immediates in yet another, and other such simple things. More interestingly, the client could use the markup to simplify implementation of mouse-over introspection of register values without having to know anything about the assembly syntax. The only target hook required would be "get the value of the register named 'foo'", since identifying the register names in the asm string is handled by the markup. Or, getting a bit fancier, the client could visualize assembly data flow, with def-use chains for a register marked with arrows, again likely triggered via mouse-over of a register name. The key bit here is that this is doable without the client having any knowledge of the target assembly syntax itself.

At first glance it seems like client code will just devolve into a pile
of regex insanity. Why not use an existing standardized markup, like
XML (not that I'm that fond of XML)?

Plain regex would be a very bad way to handle this. Client code should be very simple, just looking for the '<' characters to find annotations. A parser to recognize the markup and ignore it all should be almost trivial.
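To give a feel for how small such a client could be, here is a Python sketch that drops all markup and keeps only the instruction text. strip_markup is a hypothetical name used for illustration; the sketch assumes well-formed input and the doubled-character escaping described in footnote 1 of the proposal.

```python
def strip_markup(text):
    """Remove every annotation from an annotated assembly string.

    Hypothetical helper: keeps the annotated text, drops the
    '<tag modifiers:' openers and the '>' closers.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "<>" and text[i + 1:i + 2] == ch:
            out.append(ch)                 # escaped literal '<' or '>'
            i += 2
        elif ch == "<":
            i = text.index(":", i) + 1     # skip the tag name and modifiers
        elif ch == ">":
            i += 1                         # skip the closing delimiter
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_markup("ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>"))
```

The scan never looks at what the tag names mean, so it keeps working if new tags or modifiers are added later.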

XML is basically just massive overkill for this. The idea is a lightweight annotation system that a client can easily strip off while paying attention to the bits and pieces it cares about.

At a higher level, why not expose an API for iterating over
(potentially annotated) tokens which can be programmatically
inspected. So what you expose to clients is an AnnotatedAsmTok. Given
an AnnotatedAsmTok, they can call "getAnnotation()", or
"getRawText()". A textual representation which can be read into this
form might be useful, but we should provide the parser.

We could. It's just outside the scope of what we're looking to do in the initial implementation. Note that it does get a bit more complicated, since we're not just annotating tokens but regions of text, and the annotations can be (and often will be) nested.

I guess what I think needs a bit more explanation is why you chose to
go the "markup" route, instead of a normal programmatic API.

To keep the surface area of the C API as minimal as possible and robust against changes in what's marked up and how. Consider the interface in EnhancedDisassembly.h, for an example of what we specifically want to avoid (and obsolete).

Maybe you
could also include a couple use cases that capture your "vision" for
this functionality, and maybe a tiny bit of sample code doing
something interesting with a very rough initial interface (if it seems
more natural, since you're talking about a C API, you can just assume
bindings and write the example in your scripting language of choice).

Does the description up above sufficiently answer this? FWIW, one of the bits of example "how do I use this?" code I want as part of the project is a pretty-printed disassembly. Specifically, llvm-objdump will produce annotated disassembly and there will be a standalone tool that will take that text as input and use the markup to produce a pretty-printed output (as HTML, ANSI color codes or whatever).

A quick real-world example of where this can get used is colorized disassembly in LLDB without LLDB having to re-implement an assembly parser to do it.
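A sketch of what the colorizing side might look like, in Python. The name colorize and the tag-to-color table are assumptions made up for this example (the proposal only names the tags "reg", "imm", and "mem"); the escape handling follows footnote 1 and the input is assumed to be well-formed.

```python
# Assumed color choices; only the tag names come from the proposal.
ANSI = {"reg": "\033[32m", "imm": "\033[33m", "mem": "\033[36m"}
RESET = "\033[0m"


def colorize(text):
    """Turn an annotated assembly string into ANSI-colored text."""
    out = []
    stack = []  # color codes of the annotations that are currently open
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "<>" and text[i + 1:i + 2] == ch:
            out.append(ch)                    # escaped literal '<' or '>'
            i += 2
        elif ch == "<":
            colon = text.index(":", i)
            tag = text[i + 1:colon].split(None, 1)[0]
            stack.append(ANSI.get(tag, ""))   # unknown tags get no color
            out.append(stack[-1])
            i = colon + 1
        elif ch == ">":
            stack.pop()
            out.append(RESET)
            out.append("".join(stack))        # restore enclosing colors
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(colorize("ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>"))
```

Nothing in the loop depends on ARM syntax; the same code would color any target's output as long as the producer emits the same tags.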

-Jim

Hi Jim, thanks for the response. That pretty much clears up my primary
concern. +1 for keeping the C API small/stable/robust :)

Having multiple hand-implemented parsers accepting the output, I think
it would be wise to have an official "conformance suite" for the
syntax so that external implementors can sleep more soundly with their
implementation; if I were implementing a parser for it, having such a
"conformance suite" would certainly help me feel better. The syntax is
pretty simple, so the whole suite can probably fit in one file.

-- Sean Silva

Hi Jim, thanks for the response. That pretty much clears up my primary
concern. +1 for keeping the C API small/stable/robust :)

Having multiple hand-implemented parsers accepting the output, I think
it would be wise to have an official "conformance suite" for the
syntax so that external implementors can sleep more soundly with their
implementation; if I were implementing a parser for it, having such a
"conformance suite" would certainly help me feel better. The syntax is
pretty simple, so the whole suite can probably fit in one file.

That's an excellent suggestion. We'll see what we can do.

Thanks again for the feedback!

-Jim

Another question: What kind of documentation are you planning to
produce for this feature?

-- Sean Silva

Hi Sean,

Thanks for the feedback! Exactly the sort of discussion I was hoping to get started.

How is the client supposed to make use of this markup information?

I've implemented a binary representation for arbitrarily nested structured data, which I call "storyboard data". The files usually end with ".sbd" and are part of my v3c-storyboard project on SourceForge: http://sourceforge.net/projects/v3c-storyboard/.

The text equivalent is a storyboard text file, which, again by convention, ends with ".sbt".

Here's an example:

program
( name("hello-world")
, contents
     ( puts
         ( class(function-prototype)
         , returns(int)
         , parameters
             ( str(type(pointer(const(char))))
             )
         )
     , main
         ( class(function)
         , returns(int)
         , parameters
             ( argc(type(int))
             , argv(type(array(pointer(char))))
             )
         , body
             ( puts("Hello, world!")
             , return(0)
             )
         )
     )
)

The current version, which you can download, can walk the sbd created from this data and output "C" code for compilation.
Version 0.2.0-05 (which I'll release in a week or two) has a "hello-world3-test" program that uses an LLVM Module to dump IR that's assembled into a hello world program and run, looking for that "Hello, world!" output for the test to pass.

The 0.2.0-05 "hello-world4-test" program interprets the sbd. The C++ program "calls" the "main" function defined in the sbd and the implementation uses libffi to create closures for functions defined by the sbd and call interfaces for "real" functions, like "puts".

The sbd API can be called from C and C++.

All the demos/tests use C++ as it saves typing, but as you may suspect from looking at the project's other parts, I plan to implement a graphical user interface (GUI) to interact with sbds.

That GUI will at least in part be implemented by sbds, once the interpreter can interact with Qt's C++ classes, which I'm about to start.

Like LLVM I've got one global repository for type information and functions.

The plan is to have multiple domains, where a domain could, for example, implement a domain editor that allows the user to create, edit and execute other domains.

These are all building blocks to address the fundamental problem of software development - text.

Intel are working on a "Kinect"-like camera that can be used to scan hand gestures in real time.

I want to use that (or something like it) to develop software.

Oh, and it would be really neat if sbd or sbt was an output option for the LLVM tools (sorry, I'm a bit of an LLVM newbie).

New attributes could be added as the need arises, and parsers could use the bits they understand and leave the rest.

Obviously if the representation changes, e.g. from name(type) to type(name) (no, I don't plan to do that), then everyone has to play catch-up, but that's nothing new to software development.

I'm undecided as to whether sbds should have a globally unique id (GUID) as a version identifier, to allow for multiple formats/layouts.

Regards,
Philip Ashmore

Hi Jim, thanks for the response. That pretty much clears up my primary
concern. +1 for keeping the C API small/stable/robust :)

Having multiple hand-implemented parsers accepting the output, I think
it would be wise to have an official "conformance suite" for the
syntax so that external implementors can sleep more soundly with their
implementation; if I were implementing a parser for it, having such a
"conformance suite" would certainly help me feel better. The syntax is
pretty simple, so the whole suite can probably fit in one file.

That's an excellent suggestion. We'll see what we can do.

I keep thinking that if we're stuck with parsing, we might as well be structured and regular, like JSON or YAML. That thought then leads to suggesting that we adopt Nick's YAML I/O system.

I would very much like to eventually add annotations like we use in our legacy tools that provide a summary of performance info like latency and flags indicating microcoded instructions, etc. This would be simplified with a structured output.

Alex