When There Are No Bytes

In the past, we struggled with LLVM for our targets, which do not support byte addressed memory. A topic that appears once in a while. Some patches are on the net.

Here, I want to present our approach to support these architectures without patching LLVM.

Comments are welcome.

When There Are No Bytes

Most common architectures have the same memory model: each byte is
addressable
. But there are some special architectures for special
purposes, where addressing each byte is not required, eg on vector
processing architectures. The compiler infrastructure LLVM supports
the first set of architectures: the byte addressed architectures.

Once in a while the question appears to patch LLVM to support an
instance of the second set of architectures. This could be bit
addressed
memory or word addressed memory. We call this target
addressed
architectures. There is a set of patches for an old
version of LLVM, but these are incomplete and have not been upgraded
for newer versions of LLVM. In the past we applied these patches and
added new ones, but got stuck on LLVM version 3.4.2, while LLVM
version 7.0.0 was released.

In the following we present our approach: we perform target dependent
transformations without patching. This approach was first applied
to LLVM 7.0.0 as a feasibility study for a bit addressed
architecture and then implemented for a word addressed architecture.
Later we upgraded to LLVM 14.0.6 without problems.

The Idea

Our approach is to stay with LLVM’s byte address mode and make all
target address transformations explicitly. When we get in touch
with the architecture or its environment we transform the byte
address
into a target address.

We touch the architecture or its environment when

  1. we load values from memory or store values to memory
  2. we access target address labels provided by the assembler
  3. function calling:
    • we call system functions or legacy code, that might need a
      target addresses
    • we are called from system functions or legacy code, that passes
      target addresses

In this approach it does not matter if the target address addresses 1
bit or 64-bits words; we implemented both.

Transforming Memory Access

In the first transformation step, we transform LLVM memory accesses,
which are implemented by ISD::LOAD and ISD::STORE
operations. There are other operations, but for the sake of
simplicity we stay with these two operations. These operations have a
byte addressed pointer which has to be transformed into target
addressed
pointer. Therefore, we add 3 target operations:

  1. Targetaddress
  2. convertByteaddressToTargetaddress
  3. convertTargetaddressToByteaddress

The first operation is a simple marker, that a transformation was done
and avoids endless recursion. The other two operations are the
explicit transformations. We keep them explicit to make optimizations
on these operations simpler.

During the legalize pass we replace the memory operations:

if (TargetISD::TargetAddress != Load.getBasePtr().getOpcode()) then
 load(ptr) -> load(Targetaddress(convertByteaddressToTargetaddress(ptr)))

and

if (TargetISD::TargetAddress != Store.getBasePtr().getOpcode()) then
 store(..,ptr,..) -> store(..,Targetaddress(convertByteaddressToTargetaddress(ptr)),..)

In a pre-ISel pass we remove the operation Targetaddress again:

Targetaddress(ptr) -> ptr

When we view a DAG before the isel pass, then the graph should look
quite familiar and ready for instruction selection.

Transforming Assembler Interaction

In the second transformation step, we transform all target addresses
provided by the assembler into byte addresses as needed by LLVM. In
general, these addresses are given in form of labels that were
generated by the compiler in assembler code, eg GlobalAddress,
FrameIndex, ConstantPool and the like to address data.

GlobalAddress -> Targetaddress(convertByteaddressToTargetaddress(GlobalAddress))

Instruction Selection

For instruction selection we have to implement instructions for the
operations convertByteaddressToTargetaddress and
convertTargetaddressToByteaddress. We keep them abstract throughout
the instruction selection pass as long as possible to have the
possibility to optimize these operations with their semantics and not
their implementation. In the code generator table-gen file we
specify the operations and add two pseudo instructions:

// the operations
def ConvertByteaddressToTargetaddress:
   SDNode< "TargetISD::convertByteAddressToTargetAddress", SDTUnaryOp>;
def ConvertTargetaddressToByteaddress:
   SDNode< "TargetISD::convertTargetAddressToByteAddress", SDTUnaryOp>;

// the pseudo instructions
let usesCustomInserter = 1 in {
 def ConvertTargetaddressToByteaddress_rr:
     Pseudo< (outs Regs:$dst), (ins Regs:$src),
     		 "$dst := convertTargetaddressToByteaddress($src)",
		 [(set Regs:$dst,
		       (ConvertTargetaddressToByteaddress Regs:$src))]>;

 def ConvertByteaddressToTargetaddress_rr:
     Pseudo< (outs Regs:$dst), (ins Regs:$src),
     		 "$dst := convertByteaddressToTargetaddress($src)",
		 [(set Regs:$dst,
		       (ConvertByteaddressToTargetaddress Regs:$src))]>;
}

We implement the function EmitInstrWithCustomInserter in the target
lowering by replacing these instruction by the instructions that
implement these abstract operation; in most cases shift
instructions.

MachineBasicBlock*
MyTargetLowering::EmitInstrWithCustomInserter(
   MachineInstr &MI, MachineBasicBlock *BB) const {
 switch (MI.getOpcode()) {
 default:
   break;
 case Target::ConvertByteaddressToTargetaddress_rr:
   return emitConvertByteaddressToTargetaddress(MI, BB);
 case Target::ConvertTargetaddressToByteaddress_rr:
   return emitConvertTargetaddressToByteaddress(MI, BB);
}

At this point it is possible to compile and run self-contained
applications that do not use external code.

Interaction With The System

We have to use existing target code, eg provided as libraries or even
hand-written assembler code. The functions in this code base will
probably use pointers as target addresses and not bytes addresses.
The most simpliest and most elegant solution is to use attributes to
annotate function prototypes and their parameters and results.

In the front-end (eg Clang) we add two attributes:

  1. byteaddress
  2. targetaddress

and handle these attributes by passing them to functions and their
parameters in the LLVM IR. We do use the attribute byteaddress
explicitly in functions, which are always inlined and loose their
attributes, to remind us that another attribute and its handling in
the function might lead to illegal code.

Additionally, while we edit the front-end we add two target builtins
to make address conversions in code explicit, again:

  1. __builtin_target_convert_byteaddresstotargetaddress
  2. __builtin_target_convert_targetaddresstobyteaddress

In the front-end we transform them to intrinsics in LLVM IR.

The function prototype for an potentially hand-written external
function could be:

__attribute__((targetaddress)) int *
 targetFooBar(int, __attribute__((targetaddress)) int*);

In a final transformation step during lowering, we lower the new
intrinsics to their corresponding operations. Then, we use the
function and parameter attributes

  1. to return pointer results (hint: ::LowerReturn) as
    • byte address or
    • target address pointers depending on their attribute
  2. to pass parameter pointers (hint: ::LowerCall) as
    • byte address or
    • target address pointers depending on their attribute
  3. to receive argument pointers (hint: ::LowerFormalArguments) and
    transform them to bytes address pointers!

Optimizations

We haven’t mentioned code optimizations, yet. There are quite some. We
give here just some pointers.

Combining

convertByteaddressToTargetaddress(convertTargetaddressToByteaddress(ptr)) -> ptr

convertTargetaddressToByteaddress(convertByteaddressToTargetaddress(ptr)) -> ptr

Instructions Matching

"MV $dst, #($src1 <shift_op> <shift_imm>)",
[(set Regs:$dst, (ConvertTargetaddressToByteaddress tglobaladdr:$src1))]

Conclusion

We presented an approach to build back-ends in LLVM for targets, which
do not follow the principal of a byte addressed memory, without
patching LLVM. We keep all transformations on a high level and
transparent. We implemented two out-of-source back-ends following
this approach successfully. Additionally, we upgraded to newer
versions of LLVM without going through patching hell.

5 Likes

I’ve worked with word-addressed machines in the past. One point I’m curious about is how you might handle byte insertion/extraction within a word. Apart from that it seems like a very smooth approach.

I’d love to see this as a tech talk at an LLVM conference!

That’s a good point.

On the scalar data path:
Users should not use char or short by convention; though there is some support. Data values of all types are stored on a 64bit boundary. This is not satisfying, but acceptable for our applications. Has not happened yet, but it is an open topic. In the worst case the users can solve this by using the vector data path. Do you have examples that I can’t think of?

On the vector data path:

In method VectorCombine::scalarizeLoadExtract() we can disable replacing the extraction of an element from a vector in memory by a scalar load from the elements address in memory by setting the instruction costs defined in TargetTransformInfo (TTI) properly; we can make scalar instructions quite expensive due to the VLIW feature of the target.
In method VectorCombine::foldSingleElementStore() no costs from TTI are used, so we have to exit the method early (by a source code patch). The costs of the transformation should be checked here too, as it is done in VectorCombine::scalarizeLoadExtract().

Back in my word-addressed-machine days, code still wanted to manipulate character strings. That’s what humans understand, after all! This particular target (PDP-10) had “byte pointers” that let you address arbitrary subsets of a word. A byte pointer was an address plus bitsize/bitoffset info, and there were instructions to load/store through a byte pointer, and increment a byte pointer to allow stepping through the bytes in a word. I guess in your scheme a byte pointer would be another kind of target address? that was not the same size as a normal address.

Maybe in your target things don’t work that way, but if you’re looking at generic changes to LLVM that would allow various non-byte-addressed targets to succeed with “just” target-specific code, I think you’d need to allow for a case like that.

1 Like

@borisboesler
Thanks for writing this up.

I’m glad this approach works for your targets. I also hapen to have such a target, but we’re currently taking the “straighforward” approach, i.e. fixing LLVM codebase. This causes much pain when upgrading, and we discover new bugs occasionally. So I’m happy to consider any alternatives.

I have some concerns about the proposed approach.

One is already raised by @pogo59 - there should be a way of handling strings, or, more generally, any valid C source should compile. “Users should not use char or short by convention” sounds like a blocker to me. It should be possible to use the language and the standard library to full extent.

Second, how do you deal with the fact that some value bits of a pointer are now lost? E.g. if I have a pointer value 0xFF0000FF (“target” address), what value will it have when converting it to a “byte” address? Do you use some kind of fat pointers, or the bits that were shifted out are just lost, limiting the size of addressable memory?

I had one more point about performance of string functions (e.g. strcmp), but you have already made it clear that chars are not supported.

I’m also curious about interoperability with hand-written assembly code. IIUC the assembly code generated by the compiler will be incompatible with the hand-written sources. Is that right?

It would be nice if you could provide some examples of C code and the corresponding machine (pseudo-)code, highlighting the key points.

@pogo59, @s-barannikov Thanks for your input.

About sub-word types like char:

When we patched the code base, we decided (as a starting point) that every value is stored in its own word. This means in our case that all types, even type char, are 64 bits aligned. We can define an array of char and work on it. We do this in tests, eg “Sieve Of Eratosthenes”. But it uses 8 times more memory than required. And that’s the main reason why it is not encouraged to use it.

But for the sake of memory consumption and debugging, we store constant strings with 8 char per 64 bits word!

It depends on the architecture’s features how to access sub-word elements. But I think byte addresses are needed to emulate byte accesses in software.

@s-barannikov How do you handle the type char if you have word addresses only?

Future: Without patching we have full byte addresses and we can lower scalar code to vector code, where we can use the word part of an address to access the memory word, and the lower byte address part to identify the element in the vector word and to extract it.

Please keep in mind, that our architecture is not intended for general purpose. The core has only 4096 instruction words. We focus on video applications.

About pointers and their sizes:

The architecture has special address registers which have exactly the bits to address all words in data memory, but not more. Additionally, the instructions on these registers are limited to MV and ADDI. Therefore byte addresses are stored and modified in general purpose registers and written to address registers when the word address is a pointer operand in LOAD or STORE.

About interoperability:

As I mentioned in Interaction With The System we use attributes in function prototypes and definitions to specify if a pointer is a byte or word pointer (default is byte).

We write something OpenCL-ish code. The function

void store_int_at_byteaddress(__local volatile int *vmpcl_result,
                              __local int const *argv) {
  *vmpcl_result = *argv >> 24;
}

is compiled to the initial SelectionDAG:

Initial selection DAG: %bb.0 'store_int_at_byteaddress:entry'
SelectionDAG has 12 nodes:
  t0: ch = EntryToken
  t5: i32 = Constant<0>
    t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t7: i32,ch = load<(load (s32) from %ir.argv, align 8, !tbaa !4, addrspace 2)> t0, t4, undef:i32
      t9: i32 = sra t7, Constant:i32<24>
      t2: i32,ch = CopyFromReg t0, Register:i32 %0
    t10: ch = store<(volatile store (s32) into %ir.vmpcl_result, align 8, !tbaa !4, addrspace 2)> t7:1, t9, t2, undef:i32
  t11: ch = VMPISD::RET_FLAG t10

For LOAD and STORE the byte pointers have to be converted to word pointers. Finally, it is lowered and optimized to the final SelectionDAG:

Optimized legalized selection DAG: %bb.0 'store_int_at_byteaddress:entry' SelectionDAG has 15 nodes:
  t0: ch = EntryToken
        t4: i32,ch = CopyFromReg t0, Register:i32 %1
      t16: i32 = VMPISD::convertByteAddressToWordAddress t4
  t18: i32,ch = load<(load (s32) from %ir.argv, align 8, !tbaa !4, addrspace 2)> t0, t16, undef:i16
      t9: i32 = sra t18, Constant:i32<24>
          t2: i32,ch = CopyFromReg t0, Register:i32 %0
        t12: i32 = VMPISD::convertByteAddressToWordAddress t2
    t15: ch = store<(volatile store (s32) into %ir.vmpcl_result, align 8, !tbaa !4, addrspace 2)> t18:1, t9, t12, undef:i16
  t11: ch = VMPISD::RET_FLAG t15

The standard code entry for tests is (similar to int main(int argc, char**argv)):

// required attribute definition from a header file
#define __byteaddress __attribute__((byteaddress))
#define __wordaddress __attribute__((wordaddress))

__kernel
void vmpcl_main(__wordaddress __local volatile int *vmpcl_result,
		int argc, __wordaddress __local int const *argv) {
  store_int_at_byteaddress(vmpcl_result, argv);
}

receives two word pointer from the boot-loader code (hand written assembler code). The pointers are converted to byte pointers as expected by LLVM. The code is compiled to the initial SelectionDAG:

Initial selection DAG: %bb.0 'vmpcl_main:entry'
SelectionDAG has 21 nodes:
  t0: ch = EntryToken
  t5: i32,ch = CopyFromReg t0, Register:i32 %1
  t9: i32 = GlobalAddress<void (i32 addrspace(2)*, i32 addrspace(2)*)* @store_int_at_byteaddress> 0
    t11: ch,glue = callseq_start t0, TargetConstant:i32<0>, TargetConstant:i32<0>
      t2: i32,ch = CopyFromReg t0, Register:i32 %0
    t3: i32 = VMPISD::convertWordAddressToByteAddress t2
  t13: ch,glue = CopyToReg t11, Register:i32 $sr0, t3
      t7: i32,ch = CopyFromReg t0, Register:i32 %2
    t8: i32 = VMPISD::convertWordAddressToByteAddress t7
  t15: ch,glue = CopyToReg t13, Register:i32 $sr1, t8, t13:1
  t18: ch,glue = VMPISD::CALL t15, TargetGlobalAddress:i32<void (i32 addrspace(2)*, i32 addrspace(2)*)* @store_int_at_byteaddress> 0, Register:i32 $sr0, Register:i32 $sr1, RegisterMask:Untyped, t15:1
    t19: ch,glue = callseq_end t18, TargetConstant:i32<0>, TargetConstant:i32<0>, t18:1
  t20: ch = VMPISD::RET_FLAG t19

and lowered and optimized to the final SelectionDAG:

Optimized legalized selection DAG: %bb.0 'vmpcl_main:entry'
SelectionDAG has 18 nodes:
  t0: ch = EntryToken
    t11: ch,glue = callseq_start t0, TargetConstant:i32<0>, TargetConstant:i32<0>
      t2: i32,ch = CopyFromReg t0, Register:i32 %0
    t3: i32 = VMPISD::convertWordAddressToByteAddress t2
  t13: ch,glue = CopyToReg t11, Register:i32 $sr0, t3
      t7: i32,ch = CopyFromReg t0, Register:i32 %2
    t8: i32 = VMPISD::convertWordAddressToByteAddress t7
  t15: ch,glue = CopyToReg t13, Register:i32 $sr1, t8, t13:1
  t18: ch,glue = VMPISD::CALL t15, TargetGlobalAddress:i32<void (i32 addrspace(2)*, i32 addrspace(2)*)* @store_int_at_byteaddress> 0, Register:i32 $sr0, Register:i32 $sr1, RegisterMask:Untyped, t15:1
    t19: ch,glue = callseq_end t18, TargetConstant:i32<0>, TargetConstant:i32<0>, t18:1
  t20: ch = VMPISD::RET_FLAG t19

If the called function expected word pointers, then the conversion word → byte → word would have been eliminated during combining.

Let’s clarify some terminology.
A byte is a smallest addressable unit of memory, whether is it 8-bit or not (link). It is usually 8 bits.
A word is the natural unit of data used by a particular processor design (link). A word is usually larger than a byte.

According to the standard, char is one byte in size, regardless of the number of bits in a byte. So, every char occupies exactly one byte of memory, which is 32 bits in our case. This may be memory inefficient, but it allows us to avoid bit arithmetic when working with strings, i.e. the code gets smaller and faster.

We store debugging data in 8-bit bytes, because it is convenient for debug information consumers that run on an x86 machine. The data has no use for the running program itself.

That’s nice, looks like we have pretty similar architectures. The thing I don’t get is how do you convert a “word” address stored in an address register to a “byte” address stored in a register without losing the top bits. Do you mean that general purpose registers are wider than address registers? I’m basically asking for the implementation of the convertByteAddressToWordAddress pseudo-instruction. Which machine instructions is it lowered to?

So, if we had a 256-bit machine {256-bit integers, 256-bit pointers, 256-bit FP}, we should call a container of 256-bits a “word” ??? The surprise factor is too high…

No, this flies in the face of reasonability:
A Byte should be 8-bits (at least on machines supporting this container size)
A Half should be 16-bits (intel be damned)
A Word should be 32-bits
A DoubleWord should be 64-bits
A QuadWord should be 128-bts
…

Otherwise computer scientists will slowly loose the ability to communicate with each other.

A. n.
I. Speech, utterance, verbal expression.
…
13. In extended use.
…
d. Computing . A consecutive string of bits (now typically 16, 32, or 64, but formerly fewer) that can be transferred and stored as a unit.

The size of a word has long been context-dependent.

No, as one of the sibling comments pointed out the size of a word is context-dependent. For instance, in M68k a word is 16 bits.

@Mitch_Alsup @jrtc27 @mshockwave
Thanks for your input.

I agree. And that’s what LLVM is using. And that fails for our/other targets.

Yes, GPRs are 32 bits wide, while address registers are less than 16 bits wide at the moment. The address conversions are replaced by shift left or shift right instructions on GPRs:

// somehow Rx has a byte address
LSR Rx, Rx, #3 // convert byte address -> word address
MV Ay, Rx
ADD Rz, (Ay), #1

I agree. The term word is too generic and depends on the target. It’s a little bit like the C type int that is getting wider and wider depending on the target.

I see. Unfortunately, this will not work for our target, which has both address and general purpose registers of 32 bits.
You said that you had a version that supported unusual byte widths in more “natural” way. Have you completely abandoned it or do you still see this as one of the possible ways forward?

I wouldn’t call the patches “natural”. Hm, I think the existing set of patches that we used introduced a BitsPerAddressableUnit to data layout and did all the alignment there. We replaced most constants 8 in the code by calling method DL.getBitsPerAddressableUnit() for address computation.
But in the end this was too buggy. And with every new release of LLVM we had to check too much code.

This is what we call a “byte width” :slight_smile:

Why not? You replaced hardcoded byte width of 8 with a configurable one. Sounds pretty natural to me.

Yeah, obviously. It would be much stable and maintainable if these changes were in upstream.
I was thinking about it lately, and I could find a reason why the middle end shouldn’t support it. It is easily testable with lit tests. I’d share the part I have (there isn’t much).

1 Like

In K&R C, (int) was supposed to be the “most efficient form” of integer.

Due to sins of the past, 64-bit and larger machines use I32LP64 default type model.
So, on many 64-bit (and larger) machines (int) is left with meaning int32_t and the
LLVM compiler performs arithmetic and then has to smash the calculated result so
it contains no bits of excess significance. So, (int) is no longer the most efficient
integer type !!