thoughts about n-bit bytes for clang/llvm

Hello experts,

I am new to Clang. I would like to support a system on chip where the smallest accessible data type is 16 bits. In other words, sizeof(char) == 1 byte == 16 bits. My understanding is that C/C++ only requires that a byte be at least 8 bits and that sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long).
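As a small illustration of those guarantees, here is a sketch that checks them at compile time and derives bit widths from CHAR_BIT instead of a hard-coded 8. The helper name bitWidth is made up for this example and is not part of Clang.

```cpp
#include <climits>

// Guarantees quoted above, checked at compile time. They hold on any
// conforming implementation, whatever the byte width happens to be.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(char) <= sizeof(short) &&
              sizeof(short) <= sizeof(int) &&
              sizeof(int) <= sizeof(long) &&
              sizeof(long) <= sizeof(long long),
              "integer types are ordered by size");

// Hypothetical helper: width of T in bits, computed from CHAR_BIT
// rather than assuming 8 bits per byte.
template <typename T>
constexpr unsigned bitWidth() {
  return static_cast<unsigned>(sizeof(T) * CHAR_BIT);
}
```

On a target with 16-bit bytes, bitWidth<char>() would be 16 while sizeof(char) stays 1.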

In clang/TargetInfo.h:

unsigned getBoolWidth(bool isWide = false) const { return 8; } // FIXME
unsigned getBoolAlign(bool isWide = false) const { return 8; } // FIXME

unsigned getCharWidth() const { return 8; } // FIXME
unsigned getCharAlign() const { return 8; } // FIXME
unsigned getShortWidth() const { return 16; } // FIXME
unsigned getShortAlign() const { return 16; } // FIXME

These are easy enough to fix by making them configurable, the same way IntWidth and IntAlign are.
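A minimal sketch of what "configurable" could look like, assuming the widths become data members that a target overrides in its constructor. SketchTargetInfo and Sketch16BitTarget are hypothetical stand-ins written for this example; they are not Clang's actual TargetInfo.

```cpp
// Hypothetical stand-in for a TargetInfo-like class. The defaults match
// the hard-coded values quoted above; a target may override them.
class SketchTargetInfo {
protected:
  unsigned BoolWidth = 8, BoolAlign = 8;
  unsigned CharWidth = 8, CharAlign = 8;
  unsigned ShortWidth = 16, ShortAlign = 16;

public:
  virtual ~SketchTargetInfo() = default;

  unsigned getBoolWidth(bool /*isWide*/ = false) const { return BoolWidth; }
  unsigned getBoolAlign(bool /*isWide*/ = false) const { return BoolAlign; }
  unsigned getCharWidth() const { return CharWidth; }
  unsigned getCharAlign() const { return CharAlign; }
  unsigned getShortWidth() const { return ShortWidth; }
  unsigned getShortAlign() const { return ShortAlign; }
};

// Hypothetical target whose smallest addressable unit is 16 bits.
class Sketch16BitTarget : public SketchTargetInfo {
public:
  Sketch16BitTarget() {
    BoolWidth = BoolAlign = 16;
    CharWidth = CharAlign = 16;
    ShortWidth = ShortAlign = 16;
  }
};
```

Existing 8-bit targets keep the defaults, so only the unusual target has to say anything.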

I am aware of two consequences of this change.

The first is in preprocessor initialization. InitPreprocessor defines __INT8_TYPE__, __INT16_TYPE__, __INT32_TYPE__, and sometimes __INT64_TYPE__. It only defines __INT64_TYPE__ if long long is 64 bits wide, which seems odd to me.

// 16-bit targets doesn't necessarily have a 64-bit type.
if (TI.getLongLongWidth() == 64)
  DefineType("__INT64_TYPE__", TI.getInt64Type(), Buf);

In my case, __INT8_TYPE__ and __INT64_TYPE__ don't exist, so it doesn't really make sense to define them.

I think a better way of generating these definitions would be the following (pseudo-code; it doesn't actually compile):

// Define types for char, short, int, long, long long

DefineType("__INT" + TI.getCharWidth() + "_TYPE__", TI.getCharWidth());

if (TI.getShortWidth() > TI.getCharWidth())
  DefineType("__INT" + TI.getShortWidth() + "_TYPE__", TI.getShortWidth());

if (TI.getIntWidth() > TI.getShortWidth())
  DefineType("__INT" + TI.getIntWidth() + "_TYPE__", TI.getIntWidth());

if (TI.getLongWidth() > TI.getIntWidth())
  DefineType("__INT" + TI.getLongWidth() + "_TYPE__", TI.getLongWidth());

if (TI.getLongLongWidth() > TI.getLongWidth())
  DefineType("__INT" + TI.getLongLongWidth() + "_TYPE__", TI.getLongLongWidth());

This would result in the creation of __INT8_TYPE__, __INT16_TYPE__, __INT32_TYPE__, and __INT64_TYPE__ for most platforms. For my platform it would only create __INT16_TYPE__ and __INT32_TYPE__. It would also work for wacky 9-bit machines, where an 8-bit integer type doesn't make much sense, and for architectures where long long is 128 bits.
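The scheme above can be sketched as a small self-contained function that only computes the macro names. SketchWidths and fixedWidthTypeMacros are hypothetical names for this example, and the __INT<N>_TYPE__ spelling is assumed; the real code would call DefineType with the matching integer type. The sketch compares each width against the last one emitted, which matches the pairwise comparisons as long as the widths never decrease.

```cpp
#include <string>
#include <vector>

// Hypothetical bundle of the five integer widths, in bits.
struct SketchWidths {
  unsigned Char, Short, Int, Long, LongLong;
};

// Returns one macro name per distinct width, walking the types from
// char up to long long and skipping a type that is no wider than the
// previous one.
std::vector<std::string> fixedWidthTypeMacros(const SketchWidths &W) {
  const unsigned Widths[] = {W.Char, W.Short, W.Int, W.Long, W.LongLong};
  std::vector<std::string> Names;
  unsigned Prev = 0;  // widest width emitted so far
  for (unsigned Width : Widths) {
    if (Width > Prev) {
      Names.push_back("__INT" + std::to_string(Width) + "_TYPE__");
      Prev = Width;
    }
  }
  return Names;
}
```

With widths 8/16/32/32/64 this yields four names; with 16/16/16/32/32 it yields only the 16-bit and 32-bit ones.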

The other place I am aware of (thanks to a useful assertion) where it makes a difference is in Lex/LiteralSupport.cpp, in the char literal parser. I am still wrapping my head around this, but I think fixing it for an arbitrary size is doable. (As a new person, I need to figure out a good way to test it.)
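For illustration only, here is one way a multi-character constant could be folded with a configurable char width: shift the running value left by the target's char width and OR in the next character, truncated to that width. This follows the GCC-style convention for multi-character constants and is an assumption, not Clang's actual parser code; foldCharConstant is a hypothetical helper.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: folds the characters of a character constant
// into one integer. CharWidth is the target's char width in bits and is
// assumed to be in the range 1..63. Constants wider than 64 bits simply
// lose their high bits in this sketch.
std::uint64_t foldCharConstant(const std::string &Chars, unsigned CharWidth) {
  const std::uint64_t Mask = (std::uint64_t{1} << CharWidth) - 1;
  std::uint64_t Value = 0;
  for (unsigned char C : Chars) {
    // Make room for one target char, then add the new character.
    Value = (Value << CharWidth) | (static_cast<std::uint64_t>(C) & Mask);
  }
  return Value;
}
```

Passing the width in as a parameter is the whole point: the same loop serves 8-bit, 16-bit, and 24-bit chars.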

Do these changes seem reasonable to pursue? What other things are broken in Clang and LLVM by changing the assumption about 8-bit bytes?

Your advice is appreciated,

It would be a long battle; there are all sorts of places that would need to be changed. Most of them you'd discover after tripping over them.

That is not the answer I wanted to hear! :frowning: But thank you for your honest opinion. I really want to use LLVM so I'll let the stumbling begin and see where it takes me.

If any places should come to mind, please let me know.

Thank you,

On the bright side, the changes should tend to be very easy to make and very easy to spot. :slight_smile:

I think the reason might be that, internally, SizeOfPointee is passed around as 1 in some cases to handle the GNU void* and function pointer arithmetic extensions. At least that is what the code seems to say. See ExprConstant.cpp:357. I am guessing there would be lots of breakage if you set SizeOfPointee to, say, 0 here. Of course there are others on this list who know much better than I. I will just be happy when I can get void* to point to an i16 (*sigh* Still working on it.)

Anyway, I'm having fun and learning a little bit each day.


It doesn't really matter what void* translates to in the IR;
nothing should attempt to load from a void* or do arithmetic with it
anyway. The use of i8* is purely tradition; I think you could change
CodeGenTypes::ConvertNewType to use an arbitrary type without any
trouble. And ExprConstant.cpp is at the AST level, so it isn't
really relevant here beyond the fact that we support void* arithmetic;
we handle that explicitly in ScalarExprEmitter::EmitAdd and friends.


Thank you Eli! I really appreciate deep insights like that. I now understand that it doesn't really matter for void, but I went ahead and hacked up my version to look better on my platform.

case BuiltinType::Void:
case BuiltinType::ObjCId:
case BuiltinType::ObjCClass:
// LLVM void type can only be used as the result of a function call. Just
// map to the same as char.
return llvm::IntegerType::get(getLLVMContext(), 8);


return llvm::IntegerType::get(getLLVMContext(),

For my platform that works out nicely to be i16.


Ray, would you like a hand with this? I'm just getting started on a back
end for a 24-bit word-addressable processor, i.e. 24-bit bytes, so I
share your interest in getting n-bit bytes working. If there's anything
I can do to help out, I'm ready and willing.


Hi Ken,

Great! Good to know that there are other people that could use this functionality. I would be glad to work with you. Maybe I can send you the diffs of what I have so far.


Yes. Please do. And maybe copy this mailing list, too. Perhaps the local
experts will have some comments on your approach.


I am trying to gauge how much interest there is in supporting non-8-bit byte targets.

Other than myself, Ken Dyck of ON Semiconductor has a 24-bit machine he is trying to support. We have been working jointly on this, but we are also both new to Clang and LLVM. Although neither of our processors is mainstream, Ken points out that the Texas Instruments C5000 series is also a 16-bit architecture. I played around with TI's development environment today, and sure enough, it is perfectly valid to do:

        char foo = 32000;
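To make the arithmetic explicit, here is a sketch of the range check behind that assignment, assuming plain char is signed and two's complement. fitsInSigned is a hypothetical helper, not something from Clang or the TI tools.

```cpp
// Hypothetical helper: true when Value is representable in a signed
// two's-complement integer of Width bits. Width is assumed to be in the
// range 1..63.
bool fitsInSigned(long long Value, unsigned Width) {
  const long long Max = (1LL << (Width - 1)) - 1;  // e.g. 32767 for 16 bits
  const long long Min = -Max - 1;                  // e.g. -32768 for 16 bits
  return Value >= Min && Value <= Max;
}
```

32000 fits in a 16-bit signed char (maximum 32767) but not in an 8-bit one (maximum 127).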

It would be fairly easy to go through the LLVM/Clang code and change 8s to 16s, fork it, and be done with it. But I hoped we could fix it in a more general way that would be useful to others. As Mike Stump suggested would be the case, most of the places have been easy and straightforward to fix as we stumbled over them one by one.

But some cases are not so easy (for me). For example, in lib/VMCore/Constants.cpp there are several places (such as isString()) that make implicit assumptions about the byte size being 8 bits. I don't see a way of getting the target information in play there. (If you know a way, please let me know!! :wink:) I am guessing, however, that a design change might be required to get the TargetData info. Any ideas? Add another argument? LLVMContext/LLVMContextImpl does not seem to provide what I need.

Assuming all of this gets done, there is the important question about how it gets tested. I have a target, but it probably is not of much interest to anyone else. What would people think of having a dummy test target just for this purpose until C5000 or something else with non-8-bit-bytes becomes available?

Or, is this topic mostly uninteresting to people? (In that case the private fork is looking better and better.) The problem with the general fix is that it cuts through a big swath of the code base (lots of little changes), so it probably involves most of the code owners. I think it is useful functionality, but I may just be blinded by my own narrow set of requirements.

As always, thanks for your help and your time,


Hi Ray,

At DiBcom, our application-specific processor uses 16-bit bytes, and it would definitely be of interest to us to have support for n-bit bytes.

By using alignment restrictions and adjusting the address computations in a few target-specific places, we have been able to make it work in our own specific case. But this is not clean, and most probably not portable to most other targets.

Best regards,

Hi Ray,

I am trying to gauge how much interest there is in supporting non-8-bit byte targets.

in LLVM talk, I guess what you are saying is that i8 is not a legal
type for your target?



I am trying to gauge how much interest there is in supporting non-8-bit byte targets.

Not much would be my guess. I'd recommend incrementally just sending in the patches. Start simple (fix a few places) and then work up to larger and larger patches.

For example:

   unsigned getCharWidth() const { return 8; } // FIXME

is a great place to start for clang.

It would be fairly easy to go through the LLVM/Clang code and change
8s to 16s, fork it, and be done with it.

Ah, but it hopefully isn't too much more work to change it to Target.getCharWidth() instead...
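As a rough illustration of the difference, here is a sketch of a bits-to-chars conversion that asks the target for its char width instead of dividing by a literal 8. SketchTarget and sizeInChars are hypothetical names for this example; only the getCharWidth() accessor name comes from the code quoted earlier in the thread.

```cpp
// Hypothetical minimal target description with a configurable char width.
struct SketchTarget {
  unsigned CharWidth;  // bits per char on this target
  unsigned getCharWidth() const { return CharWidth; }
};

// Number of chars needed to hold WidthInBits bits, rounded up.
// A hard-coded version would be (WidthInBits + 7) / 8.
unsigned sizeInChars(unsigned WidthInBits, const SketchTarget &Target) {
  const unsigned CharWidth = Target.getCharWidth();
  return (WidthInBits + CharWidth - 1) / CharWidth;
}
```

A 32-bit int is 4 chars on an 8-bit-byte target, 2 chars on a 16-bit-byte target, and the code is the same for both.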

Assuming all of this gets done, there is the important question about
how it gets tested.

I wouldn't worry about testing. In the end, you'll have your target and you'll test it. For others, just don't change 8 to 16 and expect people not to pitchfork you. :slight_smile: