[GSoC 2025] Byte Type

LLVM IR can’t correctly represent implementations of memcpy, memcmp, etc., because it lacks a way to represent raw memory. This project aims to add a new ‘byte’ type to LLVM IR to represent raw memory.
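
To make the problem concrete, here is a minimal sketch of the kind of code in question: a hand-written, byte-wise copy routine used to copy a struct that contains a pointer. The names `byte_copy`, `Node`, and `copy_node` are hypothetical and chosen only for illustration. Since Clang currently lowers `unsigned char` to i8, each byte of the `next` pointer passes through an integer type in the loop, which is the situation the byte type is meant to represent faithfully.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical user-written memcpy: copies n bytes one at a time.
void *byte_copy(void *dst, const void *src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    for (std::size_t i = 0; i < n; ++i) {
        // Each element may be plain data or one byte of a pointer.
        d[i] = s[i];
    }
    return dst;
}

// Illustrative struct mixing integer data with a pointer.
struct Node {
    int value;
    Node *next;
};

// Copies a Node, pointer field included, through the byte loop.
Node copy_node(const Node &in) {
    Node out{};
    byte_copy(&out, &in, sizeof(Node));
    return out;
}
```

At the source level the copy is well defined: the copied `next` field must still point to the original object. Whether the IR produced for the loop preserves that guarantee under all optimizations is the question this project addresses.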

In addition to adding the new type, the project involves changing clang to lower chars to the new b8 type instead of i8, fixing incorrect lowerings of memory intrinsics, and tracking down the performance regressions.

There is already a prototype implementation of the byte type for an older version of LLVM.
More information here.

Expected result:
The minimum result is a port of the existing prototype to the current LLVM, fixes for all known incorrect optimizations, support for the byte type in Alive2, and a performance analysis.

Skills: Intermediate C++, familiarity with LLVM, profiling.
Project size: Large
Difficulty: Hard
Confirmed mentors: Nuno Lopes

Hey @nlopes, I’m quite interested in this project. Besides getting familiar with the prototype implementation, are there any other preliminary tasks you suggest?

I think reading through the documentation linked above is already a lot. If you can follow all that, you should be good.

Hey @nlopes! So I just went through the documentation and the initial RFC. It’s funny to see how you guys were fighting with the community as lone warriors back in 2021 :joy:

So I think I have an okay grasp of what’s happening here, but I still have a few questions:

  • How exactly do byte types implicitly carry provenance information? From how I understand it, it’s still just a collection of bits like the i8 so I was wondering how provenance information, which I think of as metadata that distinguishes pointers to the same address, can be attached to this collection of bits.

  • The Alive2 toolkit could easily help us point out optimizations that fail because of the new type, right? Or are there any we would have to look out for due to its limitations (e.g. the lack of inter-procedural analysis, which would be nice for identifying escaping pointers)? Or are pointers theoretically safe now with this new type?

  • “We modify the semantics of the bitcast to allow casts from and to the byte types. If the byte has a value of the same kind (integral or a pointer) as the other type, then the bitcast is a noop.” I was just wondering how the compiler can infer the underlying type of the byte. Was a pass written for this, or does the compiler sort it out trivially?

  • Since the SelectionDAG component has been sorted out in the prototype, is there no need to worry about the backend anymore? Are there any hardware intrinsics to think about?

  • From where George left off, it seems like this is the main direction right now:
    Make optimizations more aware of the byte type (what would this look like?)
    Find a way to automatically fix lowering tests in clang
    Find sources of regressions
    Run benchmarks on machines other than a 64-bit ARM one
    Other than these, what are the main things my proposal should touch on? And do you have a desired end goal for the summer?

  • And just curious: is the community more open to changing the memory model now? Or are we still trying to prove that it’s worth it with the results?

No, the byte type is not just bits. It contains all the information that is stored in memory, including metadata like the type of the data.
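
One way to picture this is a toy model in which an abstract byte is either plain integer data or one fragment of a pointer that still remembers which pointer it came from. The sketch below (C++17) is only an illustration under that assumption. It is not how LLVM, the prototype, or Alive2 represent bytes, and all names in it (`AbsPtr`, `IntByte`, `PtrByte`, `AbsByte`, `store_pointer`, `load_pointer`, `kPtrSize`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <variant>

// Toy abstract pointer: an allocation id (the provenance) plus an offset.
struct AbsPtr {
    int alloc_id;
    std::uint64_t offset;
    bool operator==(const AbsPtr &o) const {
        return alloc_id == o.alloc_id && offset == o.offset;
    }
};

// A byte holding plain integer data.
struct IntByte {
    std::uint8_t bits;
};

// A byte holding the index-th fragment of a pointer, provenance included.
struct PtrByte {
    AbsPtr ptr;
    std::size_t index;
};

// An abstract byte remembers which kind of value it holds.
using AbsByte = std::variant<IntByte, PtrByte>;

// Assumed pointer width for this toy model.
constexpr std::size_t kPtrSize = 8;

// Storing a pointer produces kPtrSize pointer fragments.
std::array<AbsByte, kPtrSize> store_pointer(AbsPtr p) {
    std::array<AbsByte, kPtrSize> out;
    for (std::size_t i = 0; i < kPtrSize; ++i) {
        out[i] = PtrByte{p, i};
    }
    return out;
}

// Loading a pointer succeeds only if every byte is the matching fragment
// of the same pointer. Returning nullopt otherwise is a choice made for
// this toy model only.
std::optional<AbsPtr> load_pointer(const std::array<AbsByte, kPtrSize> &bytes) {
    const auto *first = std::get_if<PtrByte>(&bytes[0]);
    if (first == nullptr || first->index != 0) {
        return std::nullopt;
    }
    for (std::size_t i = 1; i < kPtrSize; ++i) {
        const auto *b = std::get_if<PtrByte>(&bytes[i]);
        if (b == nullptr || b->index != i || !(b->ptr == first->ptr)) {
            return std::nullopt;
        }
    }
    return first->ptr;
}
```

In this model, copying the eight bytes one at a time keeps the allocation id attached to each fragment, so the pointer can be reassembled with its provenance intact. A byte array of raw `uint8_t` values would have no place to keep that information.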

Yes, provided that the byte type is implemented in Alive2.

The compiler doesn’t need to infer the underlying type. Semantics is one thing, what you can infer using static analysis is another, and what you can optimize is yet another.

I think the backend part is done.

Yes, that’s roughly what’s left. That’s already a lot.

That’s a good question. Maybe; people have been more receptive to fixing long-standing bugs.

Hi @nlopes, it’s Abhinav Thakare here. I’m a computer science engineer and am interested in contributing to this project. Should I send my GSoC proposal here?

The LLVM IR currently lacks a way to represent raw memory access, which causes issues with correctly implementing functions like memcpy and memcmp. To address this, a project aims to introduce a new ‘byte’ type to LLVM IR. Here are the key points about this project:

LLVM IR uses integers (particularly i8) to represent raw memory access, which can lead to incorrect optimizations and aliasing issues.

Add a new ‘byte’ type to LLVM IR to represent raw memory access
Modify Clang to lower char and unsigned char to the new b8 type instead of i8
Track down and address performance regressions

This is what I gathered through a brief overview of the task:
Port the existing prototype to the current LLVM version.
Add support for the byte type to Alive2

Feedback on whether what I gathered is correct, and on how I should proceed, would be appreciated.

Hi @nlopes, I’m very interested in working on this project. I have gone through the shared links and the above conversation, and I follow so far. I will go through the prototype code and prepare further by contributing to LLVM on some good first issues, probably something to do with the type system. If there’s anything else you recommend I do, please let me know. I’ll try to share a draft of my proposal in a couple of days.

You can upload it to the GSoC website directly. Thanks!