RFC: Log Symbolizer

Overview

Producing usable logs often requires referring to the addresses at which events occurred, for backtraces, sanitizer reports, and the like. Since logs are generally intended to be human readable, it’s important that addresses be referenced symbolically.

Symbolizing requires a considerable amount of information: the symbol table of the binary, where shared libraries were mapped, and the symbol tables of those libraries. Typically, the binary itself symbolizes the logs, possibly in concert with the underlying OS. Accordingly, all this information needs to be available at runtime.

For size-constrained environments, it’s desirable to strip out as much information as possible from the runtime environment. Ideally, this could be done without producing unreadable logs.

Instead of symbolizing, a binary can defer the process by embedding markup in its logs. The markup provides necessary contextual information, like memory layout, as well as presentation context, like whether an address is a line of a backtrace. Later, a symbolizing filter can process the logs and replace the markup with human-readable symbols, formatted appropriately.
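To make the idea concrete, here is a minimal sketch of how a filter might split a log line into plain text and markup elements. It is written in Python for brevity, not as a proposal for the LLVM implementation. The `{{{tag:field:...}}}` element shape is an assumption based on the Fuchsia format, and `Element` and `split_line` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Union

# Assumed element shape: {{{tag}}} or {{{tag:field1:field2...}}}.
ELEMENT = re.compile(r"\{\{\{([a-z_]+)((?::[^:{}]*)*)\}\}\}")


class Element(NamedTuple):
    """One markup element: its tag plus any colon-separated fields."""
    tag: str
    fields: List[str]


def split_line(line: str) -> List[Union[str, Element]]:
    """Split a log line into literal text runs and markup elements, in order."""
    parts: List[Union[str, Element]] = []
    pos = 0
    for m in ELEMENT.finditer(line):
        if m.start() > pos:
            parts.append(line[pos:m.start()])  # text before the element
        # group(2) is "" or ":a:b:c"; drop the empty piece before the first colon.
        fields = m.group(2).split(":")[1:]
        parts.append(Element(m.group(1), fields))
        pos = m.end()
    if pos < len(line):
        parts.append(line[pos:])  # trailing text after the last element
    return parts
```

A later stage could then walk the resulting list, leaving the text runs alone and replacing each element according to its tag.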

Decoupling symbolization from execution makes the process more flexible. For instance, symbolization could be done lazily on-demand or in batch. The resulting output could be plain text, colored terminal output, or rich HTML with links to hosted sources. For embedded development, the symbolizing filter would typically run on considerably more powerful hardware, and the debug binaries could be hosted remotely via debuginfod.

Markup-based log symbolization has seen extensive use in Fuchsia, but it makes no onerous assumptions about the host or target platform. The approach should generalize well to the platforms currently supported by LLVM.

Proposal

We propose incorporating the symbolizer markup format currently used by Fuchsia into LLVM.

A simple symbolizing filter should be added: llvm-logsym. This would replace the markup in logs from stdin with human-readable symbol information and output the resulting text to stdout.

A symbolization markup parsing library should be added to LLVM’s symbolization library. llvm-logsym should use this library internally. The library would allow more exotic symbolizing filters to be written in conjunction with the existing symbolizer library, depending on specific user needs.

A markup reference document should be added to LLVM based on the contents of the existing specification.

Implementation Notes

llvm-logsym should share a number of options with llvm-symbolizer:

  • --basenames
  • --debuginfod
  • --demangle
  • --dwp
  • --fallback-debug-path
  • --functions
  • --inlining
  • --relativenames
  • --dia
  • --default-arch
  • --dsym-hint
  • The opposing variants of any of these flags

The options not shared with llvm-symbolizer are those that control output in ways that conflict with the markup specifiers and those that name particular objects to symbolize.

The common options should be extracted into a library for constructing a Symbolizer configuration from flags. This library should then be used in both llvm-symbolizer and llvm-logsym.
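As a rough illustration of that factoring, the sketch below registers a few of the shared flags in one place and turns the parsed result into a configuration object that either tool could consume. It uses Python's argparse purely for illustration. `SymbolizerConfig`, `add_common_options`, `config_from_args`, and `make_parser` are hypothetical names, only a subset of the flags is shown, and the defaults are assumptions.

```python
import argparse
from dataclasses import dataclass


@dataclass
class SymbolizerConfig:
    """Hypothetical stand-in for the Symbolizer configuration."""
    demangle: bool
    inlining: bool
    basenames: bool
    default_arch: str


def add_common_options(parser: argparse.ArgumentParser) -> None:
    """Register the shared flags; defaults here are assumptions."""
    # BooleanOptionalAction (Python 3.9+) also generates the --no-... variant.
    parser.add_argument("--demangle", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--inlining", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--basenames", action="store_true")
    parser.add_argument("--default-arch", default="")


def config_from_args(args: argparse.Namespace) -> SymbolizerConfig:
    """Build the configuration from parsed flags."""
    return SymbolizerConfig(args.demangle, args.inlining, args.basenames, args.default_arch)


def make_parser(prog: str) -> argparse.ArgumentParser:
    """Each tool builds its own parser and pulls in the common options."""
    parser = argparse.ArgumentParser(prog=prog)
    add_common_options(parser)
    return parser
```

Each tool would add its own tool-specific options on top of the parser returned by `make_parser`, so the shared flags keep a single definition.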

Markup tags that llvm-logsym does not understand should be passed through unchanged so that later passes can handle them. When unhandled markup is present, contextual markup should also be passed through unchanged; otherwise a later pass could lack the context needed to interpret the unhandled markup.
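One possible reading of this pass-through rule is sketched below in Python. Several things here are assumptions rather than part of the proposal: the `{{{tag:field:...}}}` element shape, the set of contextual tags, the decision to scan the whole batch of lines for unhandled tags before rewriting, and the `DEMANGLED` table, which stands in for a real demangler.

```python
import re
from typing import Callable, Dict, Iterable, List

# Assumed element shape: {{{tag}}} or {{{tag:field1:field2...}}}.
ELEMENT = re.compile(r"\{\{\{([a-z_]+)((?::[^:{}]*)*)\}\}\}")

# Hypothetical lookup table standing in for a real demangler.
DEMANGLED = {"_Z3foov": "foo()"}


def render_symbol(fields: List[str]) -> str:
    """Replace a mangled name with its demangled form when known."""
    name = fields[0] if fields else ""
    return DEMANGLED.get(name, name)


# Tags this filter knows how to render.
HANDLERS: Dict[str, Callable[[List[str]], str]] = {"symbol": render_symbol}
# Tags assumed to carry context (memory layout and the like) rather than output.
CONTEXTUAL = {"module", "mmap", "reset"}


def filter_lines(lines: Iterable[str]) -> List[str]:
    """Rewrite handled elements; pass unknown ones through unchanged."""
    lines = list(lines)
    tags = {m.group(1) for line in lines for m in ELEMENT.finditer(line)}
    has_unhandled = any(t not in HANDLERS and t not in CONTEXTUAL for t in tags)

    def replace(m: "re.Match[str]") -> str:
        tag = m.group(1)
        fields = m.group(2).split(":")[1:]
        if tag in HANDLERS:
            return HANDLERS[tag](fields)
        if tag in CONTEXTUAL and not has_unhandled:
            return ""  # context fully consumed by this pass
        return m.group(0)  # unknown tag, or context a later pass may need

    return [ELEMENT.sub(replace, line) for line in lines]
```

In this sketch the contextual elements are dropped only when every tag in the batch is understood; a streaming filter would need a different policy, since it cannot know in advance whether unhandled markup will appear later.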

The markup parser should be simple enough to be a single class in the existing Symbolize library.

Reference

Markup Reference

I don’t have any particular opinion about the general functionality, but I did want to challenge the idea of making yet another LLVM executable when there’s such considerable overlap in its interface with llvm-symbolizer (bearing in mind that, on Windows at least, each executable is relatively huge due to the lack of proper symlinks). My thought was that you could add a new option to llvm-symbolizer that takes a log file as input and sends it down the log symbolization path instead of the current behaviour path. Addresses would be read from the file rather than from stdin or the command line, for example, and other output would be echoed as-is (much as is already the case anyway). I believe everything else would work more or less the same as in your proposal. Errors could be emitted when existing llvm-symbolizer options that aren’t compatible with log processing are used.

This alternative definitely came up in earlier discussions. I had a slight preference for separating out the binary for cleanliness’ sake, but opinions seem to be much stronger for keeping this in llvm-symbolizer due to the size issue. Making this a mode of llvm-symbolizer sounds good to me.

I think this would be a good addition to the llvm-project. Not got a strong opinion on where it goes though.

From reading the markup, it seems like the build-id is used to find the ELF file. There may need to be some guidance on how to use this for a deeply embedded project that dumps the firmware to a binary from the ELF file. The Interrupt article “GNU Build IDs for Firmware” has some thoughts on how to do this.

I’ve uploaded a stack of three patches, D124686, D124798, and D126980, to add a parser for the markup language and a minimal --filter mode to llvm-symbolizer that just handles {{{symbol}}}.

Would any of the interested parties have the bandwidth for code reviews along these lines? I’ll be building out quite a few such patches in the near future: mostly greenfield code, using the existing Symbolizer libraries.

I’ll be out of office this week, but will aim to take a look when I get back the following week. Please don’t wait for me, though, if someone else is able to approve before then.