RFC: On removing magic numbers assuming 8-bit bytes

A. This RFC outlines a proposal regarding non-8-bit-byte support that
      got positive reception at a Round Table at EuroLLVM19. The general
      topic has been brought up several times before and one good overview
      can be found in a FOSDEM 2017 presentation by Jones and Cook:
https://archive.fosdem.org/2017/schedule/event/llvm_16_bit/

In a nutshell, the proposal is for the llvm community to
allow/encourage interested parties to gradually remove "magic numbers",
i.e. assumptions about the size of bytes, from the codebase. An
overview, the rationale and some example refactorings follow.

Overview:

LLVM currently assumes 8-bit bytes, while there exist a few out-of-tree
llvm targets that use bytes of other sizes, including our (Ericsson's)
proprietary target. The main issues are the magic number 8, the "/8"
and "*8" expressions scattered all over the place, and the use of i8
pointers.
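
To make the kind of change concrete, here is a minimal sketch of the refactoring pattern, where the literal 8 is replaced by a named byte-size query (the names here are hypothetical, not existing LLVM API; in-tree, the value would come from the DataLayout):

```cpp
#include <cassert>

// Hypothetical byte-size query; in-tree, this would come from the
// DataLayout rather than being a compile-time constant.
constexpr unsigned BitsPerByte = 8;

// Before: unsigned SizeInBytes = SizeInBits / 8;  // magic number
// After: the conversion goes through the named query.
unsigned bitsToBytes(unsigned SizeInBits) {
  assert(SizeInBits % BitsPerByte == 0 && "size not byte-aligned");
  return SizeInBits / BitsPerByte;
}
```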

There's considerable agreement that the use of magic numbers is not
good coding style, and removing these ones would be particularly
beneficial, even though the effort would not be complete and no in-tree
target with tests exists to guarantee that the gains are maintained.

Ericsson is willing to drive this effort. During EuroLLVM19, there
seemed to be sufficient positive interest from other companies for us
to expect help with reviewing patch sets. Ericsson has been performing
nightly integration of this backend against top-of-tree for years,
continuously catching and fixing new 8-bit-byte assumptions. Thus we're
able to commit to doing similar upstream fixes for the long haul in a
no-drama way.

Rationale:

Benefits of moving toward a byte-size agnostic llvm include:
* Fewer magic numbers in the codebase.
* A reduced effort to maintain out-of-tree targets with non-8-bit bytes
as contributors follow the established patterns. (One company has told
us that they created but eventually gave up on a 16-bit byte target due
to too-high integration burden.)
* A reduction in duplicate efforts as some of the adaptation work would
happen in-tree rather than in several out-of-tree targets.
* For up-and-coming targets with non-8-bit bytes, time to market
using llvm would be far shorter.
* A higher probability of LLVM being the compiler of choice for such
targets.
* Eventually, as the patch set required to make llvm fully byte size
agnostic becomes small enough, the effort to provide a mock in-tree
target with some other byte size should be surmountable.

As a con, one could see a burden on the in-tree community to maintain
whatever gains have been made. However, the onus should be on
interested parties to mend any bit-rot. If anything, having fewer
magic numbers and such should make the code easier to understand. The
permission to go ahead would be under the condition that significant
added complexities are avoided. Another con would be added compilation
time, e.g. in cases where the byte size becomes a run-time variable
rather than a constant. However, this cost seems negligible in
practice.

Refactoring examples:
https://reviews.llvm.org/D61432

Best Regards,
Jesper

I’m not a fan of C and C++ supporting anything but 8 bits per byte. Realistically, C and C++ on such targets are different languages from 8-bit-per-byte C and C++, and therefore code isn’t portable from one to the other. I intend to propose that C++23 support only 8 bits per byte, ditto C. I’m therefore not a fan of teaching clang about this.

Separately, teaching LLVM about unusual-sized bytes seems fine to me, if the maintenance burden is low enough and the targets are supported in-tree and are maintained. I agree that you can’t just plop in a target without support, so it makes sense to first clean things up and then land a target. However, I don’t think a mock target makes sense. I’d much rather see a real target.

Are we only talking about powers-of-two here, or “anything goes”? What restrictions are you proposing to impose?

I’m really not convinced by this “magic number” argument. 8 really isn’t that bad to see.

From: llvm-dev [mailto:llvm-dev-bounces@lists.llvm.org] On Behalf Of JF
Bastien via llvm-dev

I’m not a fan of C and C++ supporting anything but 8 bits per byte.
Realistically, C and C++ on such targets are different languages from 8-
bit-per-byte C and C++, and therefore code isn’t portable from one to the
other.

Having done it, I promise you that it's reasonable to write portable C
targeting both 7-bit and 8-bit 'char'. It was too long ago to remember
anything in detail, but the brain cells still remaining from that era
believe it was pretty clean.

I intend to propose that C++23 support only 8 bits per byte, ditto
C. I’m therefore not a fan of teaching clang about this.

My impression is that non-8-bit-byte machines are (these days) generally
small and likely for embedded or other special purposes, so a proposal
to stop trying to squeeze the bloated monster that C++ has become onto
those is probably fine. C, on the other hand, appears to be the language
of choice for all sorts of weird things, and that's less likely to fly.
(Just sayin'. I'm not on either committee and have no vested interest.)
--paulr

From: llvm-dev [mailto:llvm-dev-bounces@lists.llvm.org] On Behalf Of JF
Bastien via llvm-dev

I’m not a fan of C and C++ supporting anything but 8 bits per byte.
Realistically, C and C++ on such targets are different languages from 8-
bit-per-byte C and C++, and therefore code isn’t portable from one to the
other.

Having done it, I promise you that it's reasonable to write portable C
targeting both 7-bit and 8-bit 'char'. It was too long ago to remember
anything in detail, but the brain cells still remaining from that era
believe it was pretty clean.

I agree it’s *possible*, same way I’ve seen some correct uses of volatile, and the same way I’m sure some code supported non-two’s-complement machines correctly. What I’m saying is that most code simply isn’t, often in subtle ways. The code you wrote is therefore a subset of C which is incompatible with C at large.

I don't think it was really that severe of a "subset" and would
dispute that a "subset" is inherently "incompatible" but that
is straying too far from the topic at hand to tolerate.
--paulr

Hi Jesper,

thank you for working on this. My company (Codasip) would definitely be interested in having this feature upstream. I think that this is actually important for a surprisingly large number of people who currently have to maintain their changes downstream. I have a couple of questions and comments:

  1. Do you plan on supporting truly arbitrary values as the byte size or are there in fact going to be limitations (e.g. the value has to be a multiple of 8 and lower or equal to 64)? I recall that we had a customer asking about 36-bit bytes.
  2. If you define a byte to be e.g. 16 bits wide, does it mean that “char” is also 16 bits wide? If yes, then how do you define types like int8_t from stdint.h?
  3. Have you thought about the possibility to support different byte sizes for data and code?
  4. I realize that this is a separate issue but fully supporting non-8-bit bytes requires also changes to other parts of a typical toolchain, namely linker (ld/lld) and debugger (gdb/lldb). Do you maintain out-of-tree changes in this area as well?

Thank you,
Pavel

JF Bastien via llvm-dev <llvm-dev@lists.llvm.org> writes:

I’m not a fan of C and C++ supporting anything but 8 bits per
byte. Realistically, C and C++ on such targets are different languages
from 8-bit-per-byte C and C++, and therefore code isn’t portable from
one to the other. I intend to propose that C++23 support only 8 bits
per byte, ditto C. I’m therefore not a fan of teaching clang about
this.

It depends on what you mean by "byte." Is "byte" a unit of addressing
or a unit of data?

We have worked with machines that had word-level (32-bit) addressing.
8-bit data works just fine on such machines, you simply treat them like
bitfields. Types like sint8_t are reasonable on such machines. Types
like sint8_t * can also be reasonable.
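
A rough sketch of what this looks like, simulated on a host compiler: an 8-bit load on a 32-bit word-addressed machine becomes a word load plus shift and mask, much like a bitfield access (the indexing scheme and little-endian layout within the word are assumptions for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Emulate an 8-bit load on a machine that can only address 32-bit
// words: load the containing word, then shift and mask out the byte.
// Little-endian byte order within the word is assumed here.
uint8_t loadByte(const uint32_t *Mem, unsigned ByteIndex) {
  uint32_t Word = Mem[ByteIndex / 4];            // word holding the byte
  return (Word >> ((ByteIndex % 4) * 8)) & 0xFF; // extract like a bitfield
}
```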

I think it would be pretty bad to restrict either language to 8-bit
bytes.

                            -David

Hi Jesper,

My company (CML Microsystems) would definitely be interested in having this feature upstream too.

We currently maintain an out of tree backend that has a minimum addressable size of 16 bits and this is implemented using the method outlined by Jones and Cook of Embecosm that you refer to in the RFC.

Our implementation is slightly different from the one you’ve proposed in that we used the concept of bitPerChar and only support multiples of 8 bits for that char width.

I would be happy to help out with the work in any way I can.

Regards
Sean Kilmurray


I’m not a fan of C and C++ supporting anything but 8 bits per byte.
Realistically, C and C++ on such targets are different languages from
8-bit-per-byte C and C++, and therefore code isn’t portable from one
to the other. I intend to propose that C++23 support only 8 bits per
byte, ditto C. I’m therefore not a fan of teaching clang about this.

On portability, the same is true for byte order and more. Also, the
standard is what it is currently and the non-8-bit byte targets do
exist. However, we don't suggest clang changes for now.

Separately, teaching LLVM about unusual-sized bytes seems fine to me,
if the maintenance burden is low enough and the targets are supported
in-tree and are maintained. I agree that you can’t just plop in a
target without support, so it makes sense to first clean things up
and then land a target. However, I don’t think a mock target makes
sense. I’d much rather see a real target.

I'd also much rather see a real target. Hopefully, the cleanup will
make it more likely to happen.

Are we only talking about powers-of-two here, or “anything goes”?
What restrictions are you proposing to impose?

We're proposing "anything goes" with a lower limit of 8, as that's what
the standard says, and as we've talked to people with non-power-of-two
architectures. Also, we feel there's no major disadvantage to going for
that. Yes, we can't use masks and shifts the same way, but we feel that
won't have a big impact. (However, our target has 16-bit bytes, so if
the community would rather see powers of two, we could live with that.)
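
For illustration, with a non-power-of-two byte size the usual ">> 3" and "& 7" idioms fall back to plain division and modulo (a sketch; in practice the byte-size parameter would come from the data layout):

```cpp
#include <cassert>

// With a run-time byte size that need not be a power of two, the
// power-of-two idioms generalize to division and modulo.
unsigned byteIndex(unsigned BitOffset, unsigned BitsPerByte) {
  return BitOffset / BitsPerByte; // was: BitOffset >> 3
}
unsigned bitWithinByte(unsigned BitOffset, unsigned BitsPerByte) {
  return BitOffset % BitsPerByte; // was: BitOffset & 7
}
```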

I’m really not convinced by this “magic number” argument. 8 really
isn’t that bad to see.

Though the meaning isn't always clear, i.e. whether it's handling
bytes or octets. And perhaps it doesn't have to be, as long as you have
an 8-bit-byte architecture, but when you start to clean up for another
architecture, it becomes a pain and is not always obvious. Especially
not when you're mucking around with Dwarf. Also, there's "& 7" and
">> 3" as well, for instance. Not that bad either (as magic numbers
often aren't, in context), but if you grep for them, you often have to
look a bit extra to see whether it's e.g. a flag, a byte or an octet.

Regards,
Jesper


Hi Jesper,

We (ASIP Designer team, Synopsys) are also interested in a cleaner approach without those magic constants.

Instead of 'n-bit bytes', we tend to talk about 'word addressing' and 'addressable unit size'.

We support C/C++ for various architectures that our customers create with our tools.
Some of those have multiple memories where the addressable unit size depends on the memory (address space).

NOTE: the addressable unit size can be any value (like 10 or 41 or 2048)

We also keep the bits in 'char' separate from the addressable unit size. As such, an 8 bit char can
have a 'storage size' of 24 bits in one memory and 64 bits in another.
(but a char can also be 24 bits, or 10 or something else).

I think that a cleaner abstraction in datalayout is in general useful.

We use:
   unsigned getAddressableUnitSize(unsigned AddressSpace=0);

which, for the main llvm, could be implemented as:
   unsigned getAddressableUnitSize(unsigned AS=0) { return 8; }
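
A self-contained sketch of how such a per-address-space query could look (the interface follows the suggestion above; the example sizes per address space are invented for illustration):

```cpp
#include <cassert>
#include <map>

// Mock of the suggested hook: addressable-unit size looked up per
// address space, defaulting to an 8-bit byte. Sizes are examples only.
struct MockDataLayout {
  std::map<unsigned, unsigned> UnitSize{{0, 8}, {1, 16}, {2, 2048}};
  unsigned getAddressableUnitSize(unsigned AS = 0) const {
    auto It = UnitSize.find(AS);
    return It != UnitSize.end() ? It->second : 8;
  }
};
```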

Greetings,

Jeroen Dobbelaere

Hi Jesper,

thank you for working on this. My company (Codasip) would definitely
be interested in having this feature upstream. I think that this is
actually important for a surprisingly large number of people who
currently have to maintain their changes downstream. I have a couple
of questions and comments:

1. Do you plan on supporting truly arbitrary values as the byte size
or are there in fact going to be limitations (e.g. the value has to
be a multiple of 8 and lower or equal to 64)? I recall that we had a
customer asking about 36-bit bytes.

We plan on supporting arbitrary sizes with a lower limit of 8, not
necessarily power-of-two or multiples of 8. I have to admit that I
haven't thought very much about what the upper limit might be. We might
leave it up to other interested parties to explore that and if we
receive suggestions on how to generalize also in that respect, we'll
certainly consider them.

2. If you define a byte to be e.g. 16 bits wide, does it mean that
"char" is also 16 bits wide? If yes, then how do you define types
like int8_t from stdint.h?

Yes, char is the same. The int8_t type is optional according to the
standard and we don't define it for our OOT target. The int_least8_t is
required, but we just define it to be byte sized.
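
To model the width bookkeeping behind this answer (a host-side sketch for illustration; on a real 16-bit-byte target these widths are fixed by the target headers), int_leastN_t ends up as the smallest whole-byte width holding at least N bits:

```cpp
#include <cassert>

// Model of int_leastN_t sizing on a target with CharBit bits per byte:
// widen in whole bytes until at least N bits fit. On a 16-bit-byte
// target, int_least8_t is therefore one 16-bit byte, and int8_t
// (optional per the standard) is simply not provided.
struct TargetTypes {
  unsigned CharBit; // bits per byte on the target
  unsigned leastWidth(unsigned N) const {
    unsigned W = CharBit;
    while (W < N)
      W += CharBit;
    return W;
  }
};
```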

3. Have you thought about the possibility to support different byte
sizes for data and code?

Not really, but I saw that Jeroen Dobbelaere just suggested supporting
memory spaces with different byte sizes.

4. I realize that this is a separate issue but fully supporting non-
8-bit bytes requires also changes to other parts of a typical
toolchain, namely linker (ld/lld) and debugger (gdb/lldb). Do you
maintain out-of-tree changes in this area as well?

That's true, we do. I've also seen some community interest in those
areas, e.g. from Embecosm:
https://www.embecosm.com/2018/02/26/how-much-does-a-compiler-cost/

and from within Ericsson:
https://www.youtube.com/watch?v=HAqtEZmci70

Thanks,
Jesper

Hi Jeroen,

Thanks, these are interesting differences. The CHAR_BIT and byte
relation is established in the C standard and I would prefer the byte
terminology. It means the same thing as addressable unit but is a bit
shorter and probably more widely known.

Do you suggest any in-tree changes at some call sites using the
suggested AddressSpace parameter, or would we rely on the default value
always? I could definitely include this, but if the parameter is never
used, perhaps we can leave it out-of-tree for now?

Best Regards,
Jesper


> From: llvm-dev <llvm-dev-bounces@lists.llvm.org> On Behalf Of
> Jesper
> Antonsson via llvm-dev
> Sent: Thursday, May 2, 2019 14:21
> To: llvm-dev@lists.llvm.org
> Subject: [llvm-dev] RFC: On removing magic numbers assuming 8-bit
> bytes
>

Hi Jesper,

Thanks, these are interesting differences. The CHAR_BIT and byte
relation is established in the C standard and I would prefer the byte
terminology. It means the same thing as addressable unit but is a bit
shorter and probably more widely known.

Looking purely from a c/c++ language viewpoint, this makes sense.
We settled on using 'addressable unit size', but any abstraction will already be helpful.

Do you suggest any in-tree changes at some call sites using the
suggested AdressSpace parameter or would we rely on the default value
always? I could definitely include this, but if the parameter is never
used, perhaps we can leave it out-of-tree for now?

Whenever optimizations are done on store/load/pointers, the address space should be taken into account.
For us it is fine to manage that part out-of-tree.

Greetings,

Jeroen Dobbelaere

Jeroen Dobbelaere via llvm-dev <llvm-dev@lists.llvm.org> writes:

Hi Jesper,

Thanks, these are interesting differences. The CHAR_BIT and byte
relation is established in the C standard and I would prefer the byte
terminology. It means the same thing as addressable unit but is a bit
shorter and probably more widely known.

Looking purely from a c/c++ language viewpoint, this makes sense. We
settled on using 'addressable unit size', but any abstraction will
already be helpful.

Given that f18 has just been accepted as an LLVM project, we probably
shouldn't be using C/C++ or any specific language terminology in LLVM.
It seems useful to distinguish "addressable unit size" from "data size"
and talk about things using such abstract terminology.

                            -David

Jeroen Dobbelaere via llvm-dev <llvm-dev@lists.llvm.org> writes:

Hi Jesper,

Thanks, these are interesting differences. The CHAR_BIT and byte
relation is established in the C standard and I would prefer the byte
terminology. It means the same thing as addressable unit but is a bit
shorter and probably more widely known.

Looking purely from a c/c++ language viewpoint, this makes sense. We
settled on using 'addressable unit size', but any abstraction will
already be helpful.

Given that f18 has just been accepted as an LLVM project, we probably
shouldn't be using C/C++ or any specific language terminology in LLVM.

Regardless of f18, we shouldn't anyway, unless it's terminology we independently define in our language reference. LLVM has supported many different language frontends for a long time :-)

-Hal

It seems useful to distinguish "addressable unit size" from "data size"
and talk about things using such abstract terminology.

                            -David

I agree, addressable unit size is probably a better abstraction.
However, in the lib/CodeGen directory alone, there are some 785 uses of
the word "byte" and a significant fraction of the code that we want to
modify is using the byte terminology today. An example of unmodified
code from my showcase patch set:

    assert(!(Shift & 0x7) == 0 &&
           "Shifts not aligned on Bytes are not supported.");
    uint64_t Offset = Shift / 8;
    unsigned TySizeInBytes = Origin->getValueSizeInBits(0) / 8;
    assert(!(Origin->getValueSizeInBits(0) & 0x7) == 0 &&
           "The size of the original loaded type is not a "
           "multiple of a byte.");

How would you prefer we handle this? If we only remove the magic
numbers using getAddressableUnitSize() instead of getBitsPerByte() we'd
get some mixed terminology. If the community is ok with that, we're
happy to do this. If we would go for changing the terminology overall,
then the work and the patch sizes would grow considerably.
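
For concreteness, the quoted snippet reworked against a queried unit size might look roughly like this (the free-standing parameters stand in for a DataLayout accessor whose name is still under discussion; this is a sketch, not existing LLVM API):

```cpp
#include <cassert>

// The earlier snippet with the literal 8 replaced by a queried byte
// size. "UnitSize" stands in for something like
// DL.getAddressableUnitSize(); names here are assumptions.
struct Split {
  unsigned Offset;        // shift expressed in addressable units
  unsigned TySizeInUnits; // value size in addressable units
};

Split splitShift(unsigned Shift, unsigned ValueSizeInBits,
                 unsigned UnitSize) {
  assert(Shift % UnitSize == 0 &&
         "Shifts not aligned on addressable units are not supported.");
  assert(ValueSizeInBits % UnitSize == 0 &&
         "The size of the original loaded type is not a multiple "
         "of an addressable unit.");
  return {Shift / UnitSize, ValueSizeInBits / UnitSize};
}
```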

BR,
Jesper

Although the above is mentioning bytes, looking at the “/ 8” and “& 0x7” makes it look like the author meant octets and not bytes.
Bytes can be any size of bits. Octets are only ever 8 bits.

Although the above is mentioning bytes, looking at the "/ 8" and "& 0x7" makes it look like the author meant octets and not bytes.
Bytes can be any size of bits.

I don't think you'll have much luck trying to make that stick for a
general audience, or even a general compiler-writer audience. Byte is
far too strongly associated with 8 bits these days.

Octets are only ever 8 bits.

You might be able to convert all uses of byte to octet and abandon
byte entirely, but at that point why bother? It feels like a change
just for the sake of pedantry.

I like the "addressable unit" name, though it's a bit long (AddrUnit
seems OK). It at least signals to a reader that there might be
something weird going on. Getting someone writing new code to think in
those terms is a different matter, of course, but I don't think any of
the changes under discussion really help there.

BTW, is there an open source backend (in a fork, I assume) that does
this? So that we can get some kind of idea of the real scope of the
changes needed.

Cheers.

Tim.

Although the above is mentioning bytes, looking at the "/ 8" and "& 0x7" makes it look like the author meant octets and not bytes.
Bytes can be any size of bits.

I don't think you'll have much luck trying to make that stick for a
general audience, or even a general compiler-writer audience. Byte is
far too strongly associated with 8 bits these days.

+1 Please don't try; insisting on a distinction will confuse many new
contributors.

Octets are only ever 8 bits.

You might be able to convert all uses of byte to octet and abandon
byte entirely, but at that point why bother? It feels like a change
just for the sake of pedantry.

I like the "addressable unit" name, though it's a bit long (AddrUnit
seems OK). It at least signals to a reader that there might be
something weird going on. Getting someone writing new code to think in
those terms is a different matter, of course, but I don't think any of
the changes under discussion really help there.

BTW, is there an open source backend (in a fork, I assume) that does
this? So that we can get some kind of idea of the real scope of the
changes needed.

Strongly agreed.

My personal take is that this is an invasive enough change, with enough
likely ongoing maintenance fallout, to require substantial justification
before the work is undertaken upstream. An open source backend proposed
for inclusion upstream would be one part of that. Active contribution
from the sponsors in other areas would also be a key factor.

  4. I realize that this is a separate issue but fully supporting non-
    8-bit bytes requires also changes to other parts of a typical
    toolchain, namely linker (ld/lld) and debugger (gdb/lldb). Do you
    maintain out-of-tree changes in this area as well?

That’s true, we do. I’ve also seen some community interest in those
areas, e.g. from Embecosm:
https://www.embecosm.com/2018/02/26/how-much-does-a-compiler-cost/

and from within Ericsson:
https://www.youtube.com/watch?v=HAqtEZmci70

What are you using for the executable file format for machines whose byte size is not 8? Looks like the ELF spec assumes that a byte is 8 bits long.