[RFC] Proper low level abstractions to LLVM (Bits, Bytes, Addresses and Masks)

3 years ago, I argued that we should add an Align type to LLVM so that its simpler to refactor and reason about (examples in the original announcement).

Over the years, I’ve implemented it and converted most of the code base. It can be improved but at least, I am now much more confident when I change code related to alignment.

Today I’m still confused by the lack of proper abstractions. For instance, it is unclear whether getScalarSizeInBits() / 8 is correct or if it should be (getScalarSizeInBits() + 7) / 8

The logic for converting between bits and bytes is spread out all over the code base.


JF Bastien was also in favor of more abstractions. So I’ve started to tinker with godbolt and crafted a set of types to capture the semantic and basic operations.

If you read until here I have a few questions:

  • How do you feel about it?
  • Any red flags? Something missing? Or any suggestions to improve the design?
  • Are you willing to help?

If this gets some traction, my next move will be to put the API to the test by refactoring a few files and check what’s missing. I’ll also work on a migration plan.

Let me know what you think,
– Guillaume

3 Likes

Capture this into getScalarSizeInBytes() so nobody has to think about it?

1 Like

I have a feeling that this is an area of subtle bugs and inconsistencies. Abstractions sound like the right approach. Align solved similar problems.

Do you have a Phabricator diff?

clang has an existing abstraction for “bytes”: clang::CharUnits. How does that compare to your proposal?

I drew inspiration from it but these are different concepts. The differences and similarities are nicely presented in this stackoverflow post.

So if both char and byte can be used to represent memory addresses only byte is guaranteed to be 8 bit. This is encoded in the proposal with a way to navigate between the two types. I’ve also added convenient methods like getCoveringByte() which returns the number of bytes needed to store a certain number of bits.

My proposal still lacks the ability to use them like regular numbers but this is easily done.

I think my implementation can be trimmed down though, Offset might be unnecessary as well as the separate Log2 struct. Align is definitely needed. Address might need to also embed the address space. I’m unsure about this yet.

Currently my implementation is also more permissive, Bytes and Bits can be implicitly constructed where CharUnits requires a call to its static fromQuantity function.

Another important difference is that CharUnits is signed where the proposal quantities are unsigned. I think it makes more sense to have them signed except for Address and Align.

Not yet but working on it. Will post when I have something to show.

@pogo59 Well, that was just an example, the problem is pervasive throughout the code base.

Do you need one large diff or can split it into different concepts? Addresses sound independent of Bytes.

This will have to be incremental anyways because converting the codebase main APIs will take time.
I can definitely start with Bytes/Bits first and add Address later on as we see fit.

I’ve removed the Offset and Log2 and added arithmetic operations to Bytes/Bits. I’ll start to play with it in a patch and see how this goes.

I hope we can make sure that bit sizes that are not a multiple of 8 are well supported. I’m working on a target that can do arithmetic with any bit size, and, in particular, can load and store to memory in various bit sizes (like 2 bits, 27 bits, etc).

Thx for letting me know. The proposal is not about transforming everything to Bytes but to enforce the semantic where it matters. So if most of the APIs deal with Bits it’s totally fine by me. Currently TypeSize internally deals with Bits but as I’ve shown in my first post, part of the codebase divides this quantity by 8 without a clear intent (is this to compute an Address or get a store size?). The proposal here is to convey the semantic in the type system, that’s it.

For example, for the purpose of your work, if you end up having to touch some of the code that divides by 8, I think my proposal will help decide what’s the best course of action as the intent will be clearer. I’m happy to discuss your use cases in more detail if you want.

This isn’t quite right: TypeSize can be either in bits or in bytes. getTypeStoreSize() returns a size in bytes, getTypeStoreSizeInBits() returns a size in bits. These APIs usually come in pairs.

It’s not completely clear how your proposal handles this – is there going to be a TypeSize<Bits> and TypeSize<Bytes>?

I’m not really sure if this proposal is solving the right problem. Bits vs bytes is just a question of units, while the important semantic questions are in the distinction between type size, store size and alloc size. From the screnshot in your first post, the right way to avoid that / 8 might be to query the store size instead of the type size, or similar.

Ok I dug a bit more in the code and it doesn’t work as I thought it would. This seems broken to me - but maybe there are subtleties that I don’t understand yet.

In fact this is exactly my point, as a developer that doesn’t know some part of the codebase you cannot rely on the compiler to make sure that you’re doing the right thing, TypeSize might be bits or bytes depending on the context so you have to rely on documentation and naming - this is far less effective than types.

The AlignTo function in TypeSize.h illustrates my point. This looks really unsafe to me.

I don’t know yet, I thought it was always dealing with bits. So either this (TypeSize<Bits>/TypeSize<Bytes>) or have it use Bits internally and let users decide what to do when they need Bytes.

I disagree, units are important, they convey semantics (e.g. the AlignTo function I was talking earlier, or this line from ThreadSanitizer.cpp).
From my point of view, both need to be addressed, because if you have abstractions to represent TypeSize, StoreSize and AllocSize you still need to know whether they are in bits or bytes simply because you will need to interact with other concepts (Alignment, Addresses, Offsets).

I’ve started a POC patch, feedback is welcome.