LLVM on bare-metal

Hello!

Q1
Are there any resources or examples on embedding LLVM into an ARM-based bare-metal application? Searching in this area only turns up information on how to use LLVM to target bare metal, when what I actually want is to compile LLVM itself for linking against a bare-metal application.

Q2
Are there any memory usage benchmarks for LLVM across the common tasks (especially loading bytecode, doing the optimization passes and finally emitting machine code)? My target (embedded) system has only 1GB of RAM.

Background:
I'm about to embark on an effort to integrate LLVM into my bare-metal application (for the AM335x, a Cortex-A8, as found on the BeagleBone Black). The application area is sound synthesis, and the reason for embedding LLVM is to allow users to develop their own "plugins" on the desktop (using a live-coding approach) and then load them (as LLVM bytecode) on the embedded device. LLVM would be responsible for generating (optimized, and especially NEON-vectorized) machine code directly on the embedded device, and it would take care of the relocation and run-time linking duties. This last task is very important because the RTOS (Texas Instruments' SYS/BIOS) that I'm using does not have any dynamic linking facilities.

Sharing code in the form of LLVM bytecode also seems to sidestep the complex task of setting up a cross-compiling toolchain, which is something that I would prefer not to force my users to do. In fact, my goal is to have a live-coding environment provided as a desktop application (which might also embed Clang as well as LLVM) that allows the user to rapidly and playfully build their sound synthesis idea (in simple C/C++ at first, Faust later maybe) and then save the algorithm as bytecode to be copied over to the AM335x-based device.

Thank you in advance for any help or pointers to resources that you can provide!

Kind regards
Brian

Hi Brian,

I'm afraid I can't answer your actual questions, but do have a couple
of comments on the background...

LLVM would be responsible for generating
(optimized, and especially vectorized for NEON) machine code directly on
the embedded device and it would take care of the relocation and
run-time linking duties.

That's a much smaller task than what you'd get from embedding all of
LLVM. lldb is probably an example of a program with a similar problem
to yours, and it gets by with just a pretty small "debugserver" stub
on the device. It does all CodeGen and even prelinking on the host
side, and then transfers binary data across.

The concept is called "remote JIT" in the LLVM codebase if you want to
research it more.

I think the main advantage you'd get from embedding LLVM itself over a
scheme like that would be a certain resilience to updating the RTOS on
the device (it would cope with a function sliding around in memory
even if the host is no longer available to recompile), but I bet there
are simpler ways to do that. The API surface you need to control is
probably pretty small.

Sharing code in the form of LLVM bytecode
also seems to sidestep the complex task of setting up a cross-compiling
toolchain which is something that I would prefer not to have to force my
users to do.

If you can produce bitcode on the host, you can produce an ARM binary
without forcing the users to install extra stuff. The work involved
would be pretty comparable to what you'd have to do on the RTOS side
anyway (you're unlikely to be running GNU ld against system libraries
on the RTOS), and made slightly easier by the host being more of a
"normal" LLVM environment.

Cheers.

Tim.

Hi Tim.

Thank you for taking the time to comment on the background!

I will definitely study lldb and remote JIT for ideas. I worry that I will not be able to pre-link on the host side because the host cannot(?) know the final memory layout of code on the client side, especially when there are multiple plugins being loaded in different combinations on the host and client. Is that an unfounded worry?

I suppose it is also possible to share relocatable machine code (ELF?) and only use a client-side embedded LLVM for linking duties? Does that simplify things appreciably? I was under the impression that if I can compile and embed the LLVM linker, then embedding LLVM's codegen libraries would not be much extra work. Then I could allow users to use Faust (or any other frontend) to generate bytecode in addition to my "live coding" desktop application. So many variables to consider... :slight_smile:

Kind regards
Brian Clarkson

Orthogonal Devices
Tokyo, Japan
www.orthogonaldevices.com

Hello!

Q1
Are there any resources or examples on embedding LLVM into an ARM-based
bare-metal application? Searching in this area only turns up
information on how to use LLVM to target bare-metal when I want to
compile LLVM for linking against a bare-metal application.

I'm not aware of any examples, unfortunately. I suspect that this could
be quite challenging depending on how rich an environment your RTOS
offers. It is possible that LLVM depends on POSIX (or POSIX-like) OS
calls for things like mmap and other file abstractions. I've not
looked at this in any detail; it may be possible to strip these out
with the right configuration options (for example, thread support can
be disabled). One possible approach would be to build LLVM for a Linux
target and look at the dependencies. That might give you an idea of
what you are up against.
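
To make that suggestion concrete, a configure-and-inspect session might look something like the sketch below. The CMake options shown are real LLVM build options, but the exact set worth disabling varies between LLVM versions, and the grep pattern is just an example of the kind of OS dependency to look for.

```shell
# Hypothetical minimal configuration -- check the CMake docs for your
# LLVM release, as available options change between versions.
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=MinSizeRel \
  -DLLVM_TARGETS_TO_BUILD=ARM \
  -DLLVM_ENABLE_THREADS=OFF \
  -DLLVM_ENABLE_ZLIB=OFF \
  -DLLVM_ENABLE_LIBXML2=OFF \
  -DLLVM_ENABLE_TERMINFO=OFF
ninja LLVMCore LLVMOrcJIT

# Inspect which OS/libc symbols the resulting archives still reference:
nm --undefined-only lib/libLLVMSupport.a | sort -u | grep -E 'mmap|pthread_|open'
```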

Q2
Are there any memory usage benchmarks for LLVM across the common tasks
(especially loading bytecode, doing the optimization passes and finally
emitting machine code)? My target (embedded) system has only 1GB of RAM.

I don't have anything specific, unfortunately. It was possible, at
least a couple of years ago, for Clang to compile Clang on a Raspberry
Pi with 1GB of RAM. I'm assuming your plugins will be smaller than the
IR generated by the largest Clang C++ file, though my Raspberry Pi
wasn't doing anything else but compiling Clang.

Hi Tim.

Thank you for taking the time to comment on the background!

I will definitely study lldb and remote JIT for ideas. I worry that I
will not be able to pre-link on the host side because the host cannot(?)
know the final memory layout of code on the client side, especially when
there are multiple plugins being loaded in different combinations on the
host and client. Is that an unfounded worry?

I suppose it is also possible to share relocatable machine code (ELF?)
and only use client-side embedded LLVM for linking duties? Does that
simplify things appreciably? I was under the impression that if I can
compile and embed the LLVM linker then embedding LLVM's codegen
libraries would not be much extra work. Then I can allow users to use
Faust (or any other frontend) to generate bytecode in addition to my
"live coding" desktop application. So many variables to consider... :slight_smile:

It is possible to build position-independent code on the host and
run it on the device without needing the full complexity of a SysV
dynamic linker. As you say, there are many different options depending
on how much your plugins need to communicate with the main program, or
with each other, and how sophisticated a plugin loader you are
comfortable writing. There is probably much more information available
online about how to do that than about embedding LLVM.

One possible approach is to build your plugins on the host as some
kind of position-independent ELF executable. Your program on the
device could extract the loadable parts of the ELF, copy them to
memory, resolve any fixups (relocations, in ELF terms) and branch to
the entry point. In general ELF isn't compact enough for embedded
systems, and it is common to post-process it into some more easily
processed form first.
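
As a rough illustration of the "extract the loadable parts" step, here is a minimal sketch in C of walking 32-bit ELF program headers and copying PT_LOAD segments into a buffer at the load base. The struct layouts follow the ELF32 format, but a real loader would also validate the identification bytes, machine type, alignment and permissions, none of which is shown here.

```c
#include <stdint.h>
#include <string.h>

/* Minimal 32-bit ELF structures -- just enough to walk program headers.
   (A real loader would validate e_ident, e_machine, alignment, etc.) */
typedef struct {
    uint8_t  e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version, e_entry, e_phoff, e_shoff, e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx;
} Elf32_Ehdr;

typedef struct {
    uint32_t p_type, p_offset, p_vaddr, p_paddr;
    uint32_t p_filesz, p_memsz, p_flags, p_align;
} Elf32_Phdr;

#define PT_LOAD 1

/* Copy every PT_LOAD segment of `image` into `dest`, placing each at
   dest + p_vaddr (i.e. loading the whole PIE at base address `dest`).
   Returns the entry point within dest, or NULL if a segment won't fit. */
static void *load_elf_image(const uint8_t *image, uint8_t *dest, size_t dest_size)
{
    const Elf32_Ehdr *eh = (const Elf32_Ehdr *)image;
    const Elf32_Phdr *ph = (const Elf32_Phdr *)(image + eh->e_phoff);

    for (uint16_t i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;
        if (ph[i].p_vaddr + ph[i].p_memsz > dest_size)
            return 0;                        /* segment would not fit */
        memcpy(dest + ph[i].p_vaddr, image + ph[i].p_offset, ph[i].p_filesz);
        /* zero-fill the .bss part (p_memsz beyond p_filesz) */
        memset(dest + ph[i].p_vaddr + ph[i].p_filesz, 0,
               ph[i].p_memsz - ph[i].p_filesz);
    }
    return dest + eh->e_entry;
}
```

On bare metal, `dest` would be a region carved out of the RTOS heap; relocations would still need to be applied afterwards before branching to the entry point.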

Would you say that embedding the LLVM linker is a practical way to get the required dynamic linking capabilities on the bare-metal side?

Orthogonal Devices
Tokyo, Japan
www.orthogonaldevices.com

Would you say that embedding the LLVM linker is a practical way to get
the required dynamic linking capabilities on the bare-metal side?

If I've understood you correctly: probably not. The LLVM linker (LLD)
is a static linker; it doesn't have any image-loading functionality.
It also isn't really suited to running on top of an RTOS, in the same
way that Clang isn't. There is also a tool called llvm-link, but that
links multiple bitcode files into a single bitcode file, which I'm
guessing isn't what you want either.

I think there is a dynamic linker in one of the JITs, but I can't
remember where it is off the top of my head.

If I get some time this afternoon I'll try and find some links on
either how to write a simple dynamic loader or some examples.

Peter

Yes, I'm definitely referring to the linking functionality in the JIT part of LLVM (ORC?), not the ld replacement which I agree is way too much. As far as I can tell from spec'ing this out, user code (i.e. plugins) will be exporting symbols as well as importing them. In fact, Cling is where I was planning on doing my first source code deep-dive because it seems to cover all the bases in terms of functionality.

So I guess I can tentatively identify my list of functional requirements as:
- load relocatable (but highly optimized) machine code
- relocate the machine code
- export symbols from the loaded machine code (available exports are not known at compile-time)
- import symbols into the loaded machine code (required imports are not known at compile-time)
- finally, actually execute functions exported from the loaded machine code
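
The export/import requirements in that list don't necessarily need LLVM at all; a minimal sketch of what such a plugin ABI could look like is below. All names here are hypothetical (this is not an existing API): each plugin ships a table of exported (name, address) pairs plus a table of import slots, and the host fills the slots in by name lookup at load time.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical plugin ABI: a plugin exports (name, address) pairs and
   declares imports as (name, slot) pairs for the host to fill in
   before the plugin's entry point is called. */
typedef struct { const char *name; void *addr; } sym_export;
typedef struct { const char *name; void **slot; } sym_import;

#define MAX_SYMS 256

static sym_export registry[MAX_SYMS];
static size_t registry_count = 0;

/* Add a plugin's exports to the global symbol registry. */
static int register_exports(const sym_export *exports, size_t n)
{
    if (registry_count + n > MAX_SYMS)
        return -1;
    for (size_t i = 0; i < n; i++)
        registry[registry_count++] = exports[i];
    return 0;
}

/* Resolve a plugin's imports against everything registered so far.
   Returns the number of unresolved symbols (0 on success). */
static size_t resolve_imports(const sym_import *imports, size_t n)
{
    size_t unresolved = 0;
    for (size_t i = 0; i < n; i++) {
        void *found = NULL;
        for (size_t j = 0; j < registry_count; j++)
            if (strcmp(registry[j].name, imports[i].name) == 0)
                found = registry[j].addr;
        if (found)
            *imports[i].slot = found;
        else
            unresolved++;
    }
    return unresolved;
}
```

A linear scan is fine for a handful of plugins; a hash table would be the obvious upgrade if the symbol counts grow.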

I latched on to LLVM because I nearly lost my mind trying to read the Linux source code for libdl.

Kind regards
Brian Clarkson

Orthogonal Devices
Tokyo, Japan
www.orthogonaldevices.com

Hi Peter

Thank you for your helpful comments, especially on the RPi. Since my use case is a lot simpler than compiling all of Clang, I can hopefully take your experience as a good sign.

The RTOS that TI provides for the AM335x actually has a pretty complete POSIX layer and other standard libraries. However, I am working without any virtual memory subsystem, so no mmap. That said, I was under the impression that LLVM (ORC specifically) should be able to relocate code to any memory location, so the lack of mmap shouldn't be a problem?

Kind regards
Brian

Orthogonal Devices
Tokyo, Japan
www.orthogonaldevices.com

Hi Peter

Thank you for your helpful comments, especially on the RPi. Since my
use case is a lot simpler than compiling all of Clang, I can hopefully
take your experience as a good sign.

The RTOS that TI provides for the AM335x actually has a pretty complete
POSIX layer and other standard libraries. However, I am working without
any virtual memory subsystem, so no mmap. That said, I was under the
impression that LLVM (ORC specifically) should be able to relocate code
to any memory location, so the lack of mmap shouldn't be a problem?

Apologies, I don't know a lot about ORC; most of my knowledge is on the
static-linker side. I don't think mmap is a hard requirement, just that
a lot of the code may have been written assuming it was present.
Hopefully there are other people on the list with more experience of
JITs who can help.

Thinking about the requirements in your earlier mail:

- load relocatable (but highly optimized) machine code
- relocate the machine code
- export symbols from the loaded machine code (available exports are not known at compile-time)
- import symbols into the loaded machine code (required imports are not known at compile-time)
- finally, actually execute functions exported from the loaded machine code

It sounds like you would need some kind of dynamic loader to handle
the symbol resolution and perform relocation. If the communication is
just Kernel (for want of a better name for the main program) to
Module, and not Module to Module, then something like a PIE executable
for each module, with its symbols exported via --export-dynamic, would
work. This would leave only a small number of relocation types that you
would need to handle, the majority being R_ARM_RELATIVE, which is just
the displacement from the static link address (usually 0), and
R_ARM_ABS32 for those requiring the address of a symbol. The major
restriction of PIE is that there is a fixed offset between code and
data.

As I understand it, the Linux kernel uses something like ld -r for a
relocatable link, which essentially combines many relocatable objects
into a single one and loads that. That means a lot of awkward-to-handle
relocations, especially in Thumb, could be exposed.

Apologies, I couldn't easily find many examples in open-source projects
or guides on how to write a dynamic linker. I have had some experience
with ARM's proprietary linker, which had several dynamic linking models
for more bare-metal systems
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0242a/index.html),
though I'm guessing you would prefer to stick to open-source
components.

Peter

With the LLVM ORC JIT you actually don't need to embed the JIT linker in the remote process. ORC supports an RPC mechanism that allows the JIT linker running on the host to query the remote process for the relocated symbol addresses and perform the linking host-side. It works great and allows the JIT target process to be tiny. The lli ChildTarget tool inside LLVM does exactly what I'm describing and requires a minimal amount of LLVM code.
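
To make the division of labour concrete: in such a scheme the device-side stub only needs to service a handful of requests, which is why it can stay tiny. The message format below is entirely invented for illustration (ORC's real wire protocol is richer), but it captures the shape of what the stub has to do. Note that branching into freshly written bytes works on bare metal; on a hosted OS the buffer would need to be mapped executable first.

```c
#include <stdint.h>
#include <string.h>

/* Toy sketch of a remote-JIT target stub: the host does all codegen
   and linking, and the device only answers simple memory requests.
   All message types and layouts here are invented for illustration. */
enum msg_type { MSG_ALLOC, MSG_WRITE, MSG_CALL };

typedef struct {
    enum msg_type type;
    uint32_t addr;        /* offset into the stub's code pool */
    uint32_t size;
    const uint8_t *data;  /* payload for MSG_WRITE */
} msg;

#define POOL_SIZE 4096
static uint8_t pool[POOL_SIZE];   /* bare metal: plain RAM is executable */
static uint32_t pool_top = 0;

/* Handle one request from the host; returns an address/result or -1. */
static int32_t handle_msg(const msg *m)
{
    switch (m->type) {
    case MSG_ALLOC:                 /* host reserves target memory */
        if (pool_top + m->size > POOL_SIZE)
            return -1;
        pool_top += m->size;
        return (int32_t)(pool_top - m->size);
    case MSG_WRITE:                 /* host sends fully linked bytes */
        if (m->addr + m->size > POOL_SIZE)
            return -1;
        memcpy(pool + m->addr, m->data, m->size);
        return 0;
    case MSG_CALL: {                /* host asks to run a function */
        int32_t (*fn)(void) = (int32_t (*)(void))(void *)(pool + m->addr);
        return fn();
    }
    }
    return -1;
}
```

Everything hard (parsing, optimizing, relocating) stays on the host; the stub is little more than memcpy plus a jump, which matches the "tiny target process" Chris describes.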

-Chris