n-bit bytes for clang/llvm

Back in 2009 there was some discussion of the practicality of supporting char sizes greater than 8-bit:


with the consensus seemingly being “quite doable, please get a good patch and submit”.

However the current code appears (to my neophyte eyes) to be explicitly 8-bit, e.g. one instance called out in the mail thread remains:

/// isString - This method returns true if this is an array of i8.
bool ConstantDataSequential::isString() const {
  return isa<ArrayType>(getType()) && getElementType()->isIntegerTy(8);
}
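For concreteness, here is a minimal sketch of what parameterizing that hard-coded 8 on a target byte width might look like. The `Type` struct and `isStringLike` below are hypothetical stand-ins, not LLVM's actual API:

```cpp
#include <cassert>

// Hypothetical stand-ins for LLVM's type system -- NOT the real API --
// just to illustrate testing against a target byte width instead of a
// hard-coded 8.
struct Type {
    bool IsArray;          // models isa<ArrayType>(getType())
    unsigned ElementBits;  // models the element's integer bit width
};

// ByteWidth would come from the target description: 8 on mainstream
// targets, 16 on the DSP discussed in this thread.
inline bool isStringLike(const Type &T, unsigned ByteWidth) {
    return T.IsArray && T.ElementBits == ByteWidth;
}
```

With this shape, `isStringLike(Type{true, 16}, 16)` would hold on a 16-bit-byte target, whereas the upstream `isIntegerTy(8)` check never could.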

I didn’t find anything related beyond this mail thread, such as a discussion of a patch, but of course I might be searching too narrowly - perhaps someone here can recall whether it went any further, whether insurmountable barriers exist, etc.?

Thanks for whatever advice & thread necromancy you can offer,

As written in the threads shown below, we maintain LLVM patches for 16-bit bytes.


/Patrik Hägglund

It’s definitely doable, but I’d be worried about the maintenance burden. Beyond contributing the initial patches, I’d want to see a maintenance commitment and relatively comprehensive tests that can be run upstream.

For example, if there were an i24 MVT, how would I test my target independent SDAG change that operates on all integer values? Currently, our answer to that question is “find a backend that uses it and test that”. Without such a backend, it’s hard for us to promise that this support will continue to work.

It's definitely doable, but I'd be worried about the maintenance burden.

Yes, that is a problem.

We are currently not allowed to reveal our target (which has 16-bit bytes, and registers with non-power-of-two bit widths) fully, and therefore are not able to submit it upstream. One idea we have toyed with is to create a simple "dummy" version of our target, just to be able to complement patches with tests. For 16-bit byte support we may also pick some existing simple architecture, such as DCPU-16 or TI C54x. One other idea is just to have the changes on a branch upstream.

/Patrik Hägglund

I agree with the sentiment: without a usable backend, bit-rot will surely ensue. I guess ideally the patches would accompany a real backend relying upon them and a target environment executing them, e.g. a simulator environment for the DSP so access to the real hardware isn’t required.

FWIW my original curiosity stems from wondering whether a C/C++ compiler can even be created for a non-mainstream architecture such as Knuth’s original MIX (http://en.wikipedia.org/wiki/MIX) - given such features as a byte size that doesn’t divide the memory word evenly (6-bit bytes in a 31-bit word of sign plus five bytes), a char pointer effectively requires padding: the p[5] case would land at bit 30. But if a MIX backend were created then at least execution environments exist for it - however I guess a very real question for the clang/llvm community would be whether the additional complexity of supporting such a “toy environment” is warranted, especially since MIX was superseded by MMIX (which has a gcc backend).
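To make the MIX addressing wrinkle concrete, here is a small illustrative sketch (names are mine, not from any real backend) of splitting a flat char index into a word address and a byte slot when five 6-bit bytes pack into a 31-bit word (sign bit plus 30 value bits):

```cpp
#include <cassert>
#include <utility>

// MIX-style packing: five 6-bit bytes per word (bits 0..29), with bit 30
// as the sign. A naive "bit offset = index * 6" scheme would place a
// sixth byte at bit 30, colliding with the sign -- hence the effective
// padding mentioned above. Illustrative names only.
constexpr unsigned BytesPerWord = 5;

// Returns {word address, byte slot within the word (0..4)}.
inline std::pair<unsigned, unsigned> splitCharIndex(unsigned FlatIndex) {
    return {FlatIndex / BytesPerWord, FlatIndex % BytesPerWord};
}
```

So `splitCharIndex(5)` wraps to slot 0 of the next word rather than starting a sixth byte at bit 30 of the current one.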

From: "Tyro Software" <softwaretyro@gmail.com>
To: "Reid Kleckner" <rnk@google.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, March 11, 2015 3:50:45 AM
Subject: Re: [LLVMdev] n-bit bytes for clang/llvm

I agree with the sentiment: without a usable backend bit-rot will
surely ensue. I guess ideally the patches would accompany a real
backend relying upon them and a target environment executing them,
e.g. a simulator environment for the DSP so access to the real
hardware isn't required.

Generally speaking, we'd like to have a mock target for testing purposes (see this thread: http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083085.html). It could be used for this purpose.


As an alternative to fixing the “char == 8 bits” presumption, would using non-uniform pointer types have been another possible approach? E.g. keep char as 8 bits but have char* encode both the word address and the byte location within it (i.e. one extra bit in this 16-bit case). Of course this is only a less intrusive (to LLVM) approach if LLVM readily supports such pointers, which may be close to asking “could 8086 small/large/huge pointers be implemented?”

One obvious drawback to such an approach is that dereferencing char* becomes relatively expensive, though for the sort of code being predominantly run on a DSP that might be acceptable.
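As a rough illustration of both the encoding and the dereference cost, here is a sketch of such a fat char* for a machine with 16-bit words and 8-bit chars. The encoding and names are purely hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative fat char*: word address plus one low bit selecting the
// low or high byte of a 16-bit word. Not any real ABI's encoding.
struct FatCharPtr {
    uint32_t Bits;  // (word address << 1) | byte selector

    static FatCharPtr make(uint32_t WordAddr, unsigned ByteSel) {
        return {(WordAddr << 1) | (ByteSel & 1u)};
    }
    uint32_t wordAddr() const { return Bits >> 1; }
    unsigned byteSel() const { return Bits & 1u; }
};

// Dereference: fetch the whole word, then shift and mask to extract one
// byte -- the extra work behind the performance concern noted above.
inline uint8_t load(const uint16_t *Mem, FatCharPtr P) {
    uint16_t Word = Mem[P.wordAddr()];
    return P.byteSel() ? (Word >> 8) & 0xFF : Word & 0xFF;
}
```

With memory laid out low-byte-first within each word, e.g. `Mem[0] == 0xBBAA`, `load(Mem, FatCharPtr::make(0, 0))` yields `0xAA` and the selector bit 1 yields `0xBB`.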

We're using multiple address spaces to describe two pointer representations for CHERI: AS0 is a 64-bit pointer that's represented as an integer, AS200 is a capability (256-bit fat pointer with base, length, permissions, enforced in hardware). We had to fix a few things where LLVM assumes that pointers are integers, but the different size pointers in different address spaces part works very well. The biggest weakness is in TableGen / SelectionDAG, where you can't write patterns on iPTR that depend on a specific AS (actually, you can't really write patterns on iPTR at all, as LLVM tries to lower iPTR to some integer type first, even when this doesn't make any sense [e.g. on an architecture with separate address and integer registers]).

Having AS0 be a byte pointer, which the back end would lower to two words, and some target-specific AS be a word pointer would likely work quite well.


Hi Tyro,

You seem to suggest that one way to avoid the problem of implementing 16-bit bytes at the C level would be to use 8-bit bytes instead. :wink: If the target has 16 bits as the addressable unit of storage, then byte pointers would be implemented as “fat pointers”. Yes, this would probably be possible. But we haven’t tried it, both because of backward compatibility concerns (for legacy C code), and because of the performance implications that you point out. (Our target also has 16-bit registers dedicated for pointers.)

/Patrik Hägglund

Thanks - that’s a really helpful steer.

So if I’m understanding correctly, the CHERI address spaces are equivalent as regards actual memory addresses, with the “fatness” being the type, access, etc metadata? (somehow I’d formed the impression that LLVM address spaces needed to be disjoint)


Hi Patrik

Indeed I am hoping to avoid the n-bitian-fork approach (laziness more than anything; the pain of keeping patches moving forwards with the clang/llvm mainstream). And luckily, for a toy architecture, legacy code compatibility is less of a concern, at least until I sleepwalk into the “port Linux” state…


We're slightly abusing the address space mechanism, but there's no requirement that they be disjoint - if they were then there would be no need for an address space cast instruction. For us, whether they are disjoint is a run-time property: non-capability loads and stores are relative to a global base capability, which may be the entire virtual address space or may be quite restricted.