Byte width specification in Data Layout string

We have a target with a 32-bit minimum addressing unit, i.e. with a 32-bit byte.

We’ve managed to get it working with LLVM, which does not generally support non-8-bit bytes. The changes are not complete — we keep finding new places that need to be fixed — but in general, it works.

Most of the changes are straightforward with the notable exception of the DataLayout class.

We extend the data layout string specification, allowing the user to set the byte width via a ‘b’ specifier, with a default value of 8.

The issue is with parsing the data layout string. The string specifies type alignments in bits, while they are stored in bytes (see LayoutAlignElem / PointerAlignElem). (Note that storing them in bytes seems natural, because data cannot be aligned to less than a byte.) The conversion to bytes happens when parsing the string (DataLayout::parseSpecifier), so in order for the computed alignments to be correct the byte size must be known at this point.

One possible solution to the issue is to make the ‘b’ specifier positional, i.e. force the user to specify the byte width before anything else (or, strictly speaking, before any alignment specifiers). That way, the parsing process can validate the specified alignments against the byte size and store them in byte terms. This is the way we’ve done it locally.
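To make the positional approach concrete, here is a toy sketch of such a parser — not LLVM’s actual `DataLayout::parseSpecifier`; the `Layout` struct and the `a<bits>` alignment specifier are invented purely for illustration. The point is that once ‘b’ is required to come first, every later alignment can be validated against the byte width and stored in byte terms immediately:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy layout: byte width in bits, plus alignments already converted to bytes.
struct Layout {
  unsigned ByteWidth = 8;               // default byte width, in bits
  std::vector<unsigned> AlignsInBytes;  // alignments stored in byte terms
};

// Parse a dash-separated spec like "b32-a32-a64". The 'b' specifier is
// positional: it must precede every alignment specifier, so the bit->byte
// conversion below is always done with the correct byte width.
Layout parseLayout(const std::string &Spec) {
  Layout L;
  bool SeenAlign = false;
  std::stringstream SS(Spec);
  std::string Tok;
  while (std::getline(SS, Tok, '-')) {
    if (Tok.empty())
      continue;
    if (Tok[0] == 'b') {
      assert(!SeenAlign && "'b' must come before any alignment specifier");
      L.ByteWidth = std::stoul(Tok.substr(1));
    } else if (Tok[0] == 'a') {  // toy alignment specifier, value in bits
      unsigned Bits = std::stoul(Tok.substr(1));
      assert(Bits % L.ByteWidth == 0 && "alignment is not a byte multiple");
      L.AlignsInBytes.push_back(Bits / L.ByteWidth);
      SeenAlign = true;
    }
  }
  return L;
}
```

With this rule, `"b32-a32-a64"` yields a 32-bit byte and alignments of 1 and 2 bytes, while `"a32-b32"` is rejected.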

Ideally, the data layout string would have specified alignments in bytes from the start, but that is impossible to change now.

Are there other ways? I’m sure the issue was discussed many times before, maybe someone can remember the outcome?

Having poked around in the DataLayout code recently, your approach doesn’t seem unreasonable to me… I recently went the other direction with getPointerSize() to store in bits to support non-power-of-2 pointers. This is a place where it would be nice not to preclude support for some old architectures with weird word sizes (e.g. 18 bits or something). I’m curious how you handle the standard LLVM memory operations that are essentially aligned to the C concept of 8-bit byte addressable memory? Do you add your own memory operations with different semantics?

Ordinary loads / stores work just fine unless the pointer operand points to a type that is not a multiple of a byte in size. This should never happen when translating C/C++ code because you can’t address the middle of a byte, and the front-end / middle-end passes should preserve these semantics.
In reality, the front-end tends to use the i8* type in place of void*, which complicates things a bit. In most cases, simply replacing getInt8PtrTy with getIntNPtrTy(DL.getByteWidth()) is enough to overcome the issue.

I see! I guess it’s simpler than I thought… Does sizeof return the number of bytes or the number of octets of a type in C? - Stack Overflow

I don’t think parsing the datalayout string, specifically, has been discussed.

I suspect that if we have non-8-bit-byte support, the next thing we’re going to get asked for is support for specifying different byte sizes for different address spaces. It’s probably simpler to just put off the bit->byte conversion until something requests it.
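A minimal sketch of what deferring the conversion could look like — again hypothetical, not LLVM’s actual classes; `LazyLayout` and its per-address-space byte-width map are invented for illustration. Alignments stay in bits as parsed, and the bit->byte division happens only at query time, when the byte width for the requested address space is known:

```cpp
#include <cassert>
#include <map>

// Hypothetical layout that stores alignments in bits (as they appear in
// the data layout string) and converts to bytes lazily, per address space.
struct LazyLayout {
  unsigned AlignInBits = 32;               // stored exactly as parsed
  std::map<unsigned, unsigned> ByteWidth;  // address space -> byte width (bits)

  unsigned getAlignInBytes(unsigned AS) const {
    // Fall back to the conventional 8-bit byte for unlisted address spaces.
    unsigned BW = ByteWidth.count(AS) ? ByteWidth.at(AS) : 8;
    assert(AlignInBits % BW == 0 && "alignment is not a byte multiple");
    return AlignInBits / BW;
  }
};
```

Because nothing is divided at parse time, the same stored value can answer queries for an 8-bit-byte address space and a 32-bit-byte one.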

More generally, I assume you’ve seen the previous discussions? (RFC: On removing magic numbers assuming 8-bit bytes - #39 by preames and RFC: On non 8-bit bytes and the target for it - #39 by clattner)

An architectural minimum addressing unit of 32 bits does not preclude using 8-bit bytes, it just requires your backend to use masked loads/stores to access subwords. See early Alpha CPUs as a real-world example of this being done, as well as MIPS and RISC-V, where this is done for 8-bit and 16-bit atomics. Real-world C code isn’t going to be happy with CHAR_BIT not being 8, short not being 16 bits, and (u)int8/16_t not existing.
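For readers unfamiliar with the technique: the masked read-modify-write such a backend would emit can be sketched in C as below. This is an illustration, not generated code; `store_byte`/`load_byte` and the little-endian lane layout are assumptions made for the example. Memory here is word-addressed, so an 8-bit byte is an extracted lane of a 32-bit word:

```c
#include <stdint.h>

/* Emulate an 8-bit byte store on a machine whose minimum addressing
 * unit is 32 bits: load the containing word, clear the target lane
 * with a mask, OR in the new value, and store the word back. */
static void store_byte(uint32_t *mem, unsigned byte_index, uint8_t value) {
    unsigned word  = byte_index / 4;        /* containing 32-bit word     */
    unsigned shift = (byte_index % 4) * 8;  /* lane offset, little-endian */
    uint32_t mask  = (uint32_t)0xFF << shift;
    mem[word] = (mem[word] & ~mask) | ((uint32_t)value << shift);
}

/* The matching load: shift the lane down and truncate to 8 bits. */
static uint8_t load_byte(const uint32_t *mem, unsigned byte_index) {
    unsigned word  = byte_index / 4;
    unsigned shift = (byte_index % 4) * 8;
    return (uint8_t)(mem[word] >> shift);
}
```

Every byte store becomes a load/mask/store sequence (or an LL/SC loop when atomicity matters, as in the MIPS and RISC-V subword atomics mentioned above), which is exactly the code-quality cost being discussed.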

A byte is the minimum addressing unit by definition. A target with 32-bit bytes can of course pretend to be an 8-bit-byte target, and in some cases this will even be beneficial, as you noted in the last sentence. But in other cases, like accessing an individual character of a string, it would lead to poorly generated code. Imagine having to use C bitfields everywhere you need to declare a char or uint16_t.
Architectures with an unusual byte width are usually embedded ones, so real-world C code should be out of the question. However, this complicates porting OpenCL, for example.

Interesting thought, didn’t think about it. It indeed looks more like a property of an address space.
But on the other hand, wouldn’t they ask for supporting specifying different alignments for types as well?

Thanks for the links. I suppose that, according to point (c) from the first link, the patch will never see the light of day.