Is it time to start upstreaming the CHERI support to LLVM?

The CHERI fork of LLVM has been developed out of tree for about 10 years now, occasionally upstreaming bits that are generally useful. CHERI provides a capability model on top of virtual memory such that every memory operation (load, store, instruction fetch) must be authorised by a capability. This is either explicit (as an operand of the instruction) or implicit (for legacy operations, there is a default capability in a special register).
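
As a rough illustration of what "authorised by a capability" means, here is a minimal software sketch in C. It is a toy model only: the `toy_cap` struct, its field layout, the permission bits, and the function names are all hypothetical and do not reflect the real (compressed) CHERI capability encoding or any Morello instruction semantics. It just shows the kind of check (valid tag, permissions, bounds) that is applied to each access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission bits, invented for this sketch. */
enum {
    TOY_PERM_LOAD    = 1u << 0,
    TOY_PERM_STORE   = 1u << 1,
    TOY_PERM_EXECUTE = 1u << 2
};

/* Toy capability: bounds, a cursor, permissions, and a validity tag. */
struct toy_cap {
    uint64_t base;    /* lowest address the capability authorises */
    uint64_t length;  /* number of bytes authorised, starting at base */
    uint64_t address; /* current address (cursor) within or near the bounds */
    uint32_t perms;   /* bitwise OR of TOY_PERM_* */
    bool     tag;     /* false means the capability is invalid */
};

/* Build a valid toy capability whose cursor starts at base. */
static struct toy_cap toy_cap_make(uint64_t base, uint64_t length, uint32_t perms) {
    assert(length <= UINT64_MAX - base); /* bounds must not wrap around */
    struct toy_cap cap = { base, length, base, perms, true };
    return cap;
}

/* True if an access of `size` bytes at cap->address needing `perm` is allowed. */
static bool toy_cap_authorises(const struct toy_cap *cap, uint64_t size, uint32_t perm) {
    if (!cap->tag)
        return false;                      /* invalid capability */
    if ((cap->perms & perm) != perm)
        return false;                      /* missing permission */
    if (cap->address < cap->base)
        return false;                      /* cursor below the lower bound */
    uint64_t offset = cap->address - cap->base;
    /* Overflow-safe form of: offset + size <= length */
    return offset <= cap->length && size <= cap->length - offset;
}
```

In this model an explicit operand corresponds to passing a particular `toy_cap` to the check, and the implicit case corresponds to always checking against one default capability.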

The CHERI MIPS and RISC-V prototypes are research artefacts and, as such, their instruction sets evolve very rapidly. This has meant that there’s been limited value in upstreaming LLVM changes because any given LLVM release would be unlikely to be able to target this week’s version of the ISA. A couple of weeks ago, Arm shipped the first set of Morello boards (see our somewhat longer blog on the subject), built around a modified Neoverse N1 with CHERI extensions. Morello is a ‘superset architecture’, a limited-run prototype intended to explore which subset of the possible CHERI features makes sense for a real CHERI extension to AArch64 (and, more broadly, whether CHERI is a useful feature to add to an ISA).

Arm explicitly describes Morello as a one-off with no backwards compatibility guarantees and does not commit to adding CHERI support to AArch64 in the future. Nevertheless, the fact that Morello exists in silicon means that the Morello ISA is now a stable target that can act as an example in-tree CHERI back end until either production CHERI silicon arrives or the CHERI experiment is deemed to have failed, at which point it can be either superseded or removed. Emulators for Morello are available, including Arm’s fixed virtual platform and qemu, and in the next few months we expect there to be enough Morello machines available for a self-hosting buildbot.

CHERI requires that the entire compiler pipeline, from the front end to code generation, maintain the distinction between pointers and integers. We lower C’s [u]intptr_t to an LLVM pointer type and use explicit intrinsics to extract the address, which provides a simpler model for tracking pointer provenance than the existing model, and so I believe that a lot of the target-independent code will probably end up being more generally useful (particularly for anyone wanting to target GC’d environments, where maintaining this distinction is equally critical). CHERI provides byte-granularity memory safety and so requires optimisers to refrain from doing ‘safe’ out-of-bounds reads.
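
To make the pointer/integer distinction concrete, here is a small C sketch of aligning a pointer. The helper names `address_of` and `align_up` are hypothetical. The CHERI branch assumes the CHERI Clang builtin `__builtin_cheri_address_get` and the `__CHERI_PURE_CAPABILITY__` macro; on any other compiler the code falls back to an ordinary cast. The point is that the alignment arithmetic is done on a plain integer address, while the result is still derived from the original pointer via `uintptr_t`, so on a CHERI target it would keep its capability metadata.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extract just the integer address of a pointer. */
static size_t address_of(const void *p) {
#if defined(__CHERI_PURE_CAPABILITY__)
    /* Assumed CHERI Clang builtin: returns the address, dropping the metadata. */
    return (size_t)__builtin_cheri_address_get(p);
#else
    return (size_t)(uintptr_t)p;
#endif
}

/* Round p up to a power-of-two alignment. */
static void *align_up(void *p, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t u = (uintptr_t)p;                    /* still pointer-like on CHERI */
    size_t misalign = address_of(p) & (align - 1); /* pure integer arithmetic */
    if (misalign != 0)
        u += align - misalign;                     /* result is derived from p */
    return (void *)u;
}
```

Writing the adjustment as an addition to the `uintptr_t` value, rather than rebuilding a pointer from a bare integer, is what lets a compiler that tracks provenance see which pointer the result came from.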

Having an in-tree target and tests would make it harder to accidentally break these guarantees. It probably makes sense to wait until after flipping the switch for opaque pointers before merging all of the CHERI diffs because a lot of them are ensuring that the address space is preserved across pointer bitcasts (especially in the simplify libcalls infrastructure), which would all go away with opaque pointers. It would be great to get feedback on both whether folks agree that this is the right time to start the upstreaming effort and, if so, what the right process should be. The diffs are not huge but they are quite invasive, making small changes across the codebase.

It makes sense to fix LLVM’s pointer handling. It’s not broken just for CHERI; it’s broken for all platforms.
There are patches around for !ptr_provenance metadata and related things (D68484, [PATCH 01/27] [noalias] LangRef: noalias intrinsics and ptr_provenance documentation). They are related, and we should have a single proposal that fixes all use cases. We really can’t afford to have two incompatible things trying to go in at the same time.

It’s on my todo list to review the correctness of the !ptr_provenance model, but it would be nice to have more pairs of eyes so we converge on a single, useful, and correct design.

@davidchisnall I’d like to encourage this. Embecosm has very recently been asked about providing rustc support for Morello boards (which of course will need LLVM). We’re likely in the position of being able to provide a lot of the leg work for an upstreaming effort.

Having spent countless hours over the past few years trying to upstream CHERI changes and dealing with merge conflicts, I’m very much in favour of upstreaming as much as possible :)

Regarding Rust support: I recently tried building rustc against CHERI LLVM instead of the bundled one and that worked just fine. I think the bigger problem to solve in order to compile Rust code for a CHERI platform is Rust’s usize. Currently Rust uses usize for both size_t and uintptr_t (see “Support index size != pointer width”, rust-lang/rust issue #65473, and “[Pre-RFC] usize is not size_t” on Rust Internals), so CHERI/Morello support would require adding something like a uptr that is mapped to an addrspace(200) pointer rather than i128.
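
The same split can be sketched in C, where the two roles already have separate types. This is only an illustration with hypothetical helper names (`byte_offset`, `round_trip`, `UINTPTR_WIDER_THAN_SIZE`): index and size arithmetic uses `size_t`, while anything that must carry a whole pointer uses `uintptr_t`. On a conventional 64-bit target the two types have the same width; the assumption here is that on a 128-bit capability target `uintptr_t` is wider, which is why a single type cannot play both roles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 when a pointer-sized integer is wider than a size, 0 otherwise. */
enum { UINTPTR_WIDER_THAN_SIZE = sizeof(uintptr_t) > sizeof(size_t) };

/* Index arithmetic only needs to cover object sizes, so size_t is enough. */
static size_t byte_offset(size_t index, size_t elem_size) {
    assert(elem_size == 0 || index <= SIZE_MAX / elem_size); /* no overflow */
    return index * elem_size;
}

/* Storing a pointer in an integer and getting it back needs uintptr_t. */
static void *round_trip(void *p) {
    uintptr_t u = (uintptr_t)p;
    return (void *)u;
}
```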

I think this is a good discussion to have, and upstreaming is likely to be a good idea. To my mind, the main considerations for functionality contributed to LLVM are about the benefit to end users, the cost of having the functionality upstream, and if there are people willing to support it. This tends not to be something we discuss explicitly for officially supported architectural extensions for in-tree targets as the answer tends to be obvious, but I think it’s worth being more explicit here (and I’m anticipating a similar discussion when/if people attempt to upstream custom RISC-V extensions). “Cost” is a difficult topic of course - but in CHERI’s case there’s certainly an argument it would be a net benefit due to forcing improvements in maintaining the distinction between pointers and integers.

Not all of it is relevant, but I like Clang’s criteria for including new extensions as a starting point.

Do you have any very rough order of magnitude estimates of what is needed to get CHERI upstream in LLVM? e.g. engineer months / number of patches / lines of code.

Thanks @asb, that’s a great framing for the question. Around half of the total changes are in tests:

$ git diff --stat cheri/latest-merge cheri/master -- */test | tail -1
 1477 files changed, 115634 insertions(+), 2969 deletions(-)
$ git diff --stat cheri/latest-merge cheri/master | tail -1
 2608 files changed, 225960 insertions(+), 9798 deletions(-)

That said, 110KLoC of changes in non-test code is still a non-trivial amount. That includes 6-8KLoC in each of the MIPS and RISC-V back ends, and we would not want to upstream either of those yet. Fortunately, 62KLoC of this is a git subrepo of the CHERI compressed capability encoding library (including tests), which is included just to get some defines in a fairly short (200LoC) header file, and is used only in bits that wouldn’t want upstreaming. That brings the total down to about 40KLoC across LLVM, Clang, LLDB, libc++, compiler-rt, and so on. Some of this is in architecture-specific bits of the code outside the LLVM targets, so the total being proposed for upstreaming would be less.

Going through those questions in turn:

  1. Evidence of a significant user community: This is based on a number of factors, including an existing user community, the perceived likelihood that users would adopt such a feature if it were available, and any secondary effects that come from, e.g., a library adopting the feature and providing benefits to its users.

I believe about 1,000 Morello systems will be made available to academic and industrial participants in the Digital Security by Design programme, possibly more via a DARPA programme in the US. Having everything upstream provides a clearer place for collaboration and makes it much easier to support the second CHERI architecture to try to upstream. It should also help reduce regressions introduced upstream from integers-are-pointers assumptions creeping in. This, in turn, should help with some virtual targets that have previously struggled with LLVM.

We anticipate that the Morello programme will lead to at least one out of AArch64, RISC-V, or x86 developing a production CHERI implementation, which would be able to build directly on top of the architecture-agnostic bits in LLVM.

  2. A specific need to reside within the Clang tree: There are some extensions that would be better expressed as a separate tool, and should remain as separate tools even if they end up being hosted as part of the LLVM umbrella project.

The changes are invasive and cannot live in a separate tool; they must be in (a fork of) LLVM, Clang, LLDB, lld, and so on.

  3. A specification: The specification must be sufficient to understand the design of the feature as well as interpret the meaning of specific examples. The specification should be detailed enough that another compiler vendor could implement the feature.

We have a CHERI C/C++ Programming Guide that describes the semantics of CHERI C. CHERI defines a set of extensions in terms of compiler-provided builtins and restricts the possible implementation space of some undefined and implementation-defined behaviour. This is being used by the Linaro / Arm team for GCC support.

  4. Representation within the appropriate governing organization: For extensions to a language governed by a standards committee (C, C++, OpenCL), the extension itself must have an active proposal and proponent within that committee and have a reasonable chance of acceptance. Clang should drive the standard, not diverge from it. This criterion does not apply to all extensions, since some extensions fall outside of the realm of the standards bodies.

I’m not sure that this one applies yet. I hope that the existing code will serve as a basis for defining CHERI C/C++ behaviour in the relevant standards, but it’s probably too early to start working on this. I would expect that we would work on the LLVM parts to ensure that they conform to any C/C++ CHERI standards.

  5. A long-term support plan: increasingly large or complex extensions to Clang need matching commitments to supporting them over time, including improving their implementation and specification as Clang evolves. The capacity of the contributor to make that commitment is as important as the commitment itself.

I think that there are sufficient interested parties (the DSbD includes Microsoft, Google, and Amazon, for example, plus a lot of smaller companies and universities) that there won’t be a shortage of folks working on it.

  6. A high-quality implementation: The implementation must fit well into Clang’s architecture, follow LLVM’s coding conventions, and meet Clang’s quality standards, including diagnostics and complete AST representations. This is particularly important for language extensions, because users will learn how those extensions work through the behavior of the compiler.

I believe that this is the case for most of the code and the rest would be improved as part of the upstreaming process.

  7. A test suite: Extensive testing is crucial to ensure that the language extension is not broken by ongoing maintenance in Clang. The test suite should be complete enough that another compiler vendor could conceivably validate their implementation of the feature against it.

See above: the current tree includes a lot more tests than non-test changes.

As per the comment by @jeremybennett Embecosm has been asked to get involved in this very effort, and to help push for the upstreaming of all this hard work, and provide additional changes where necessary. In the end we’d like to see CHERI support - specifically for Morello - in upstream LLVM, and eventually the Rust compiler.

I’ve actually recently been taking a look at the CHERI changes in the fork, and trying my best to consolidate and rebase those 10 years’ worth of modifications on top of upstream - which is a somewhat nontrivial process! After that it might be a little clearer how those changes can be split up for eventual upstreaming.

I will probably be making a separate post elsewhere soon, but we were thinking that to help coordinate this we could try and have some kind of monthly community call - similar to the setup that the GCC Rust project has.

Fantastic news, thanks @lewis-revill. Feel free to schedule another call with me to go through any of the changes that you’re unsure about. @arichardson and @jrtc27 are also responsible for large chunks and I’m sure they’d be happy to explain any bits that you need.