PCH/preamble files that exceed 512M

Hi all,

I have been investigating crashes and assertion failures in clangd and found that Clang doesn’t support PCH files that exceed 512M. This happens because PCH stores 32-bit bit offsets from the beginning of the file in some indexes and other data structures, and 2^32 bits is exactly 512M bytes. The simplest example is SOURCE_LOCATION_OFFSETS in the source manager block, but there are other similar cases. I identified some of them and made them 64-bit to double-check my hypothesis. It did help in my case, but I see other places where uint32_t is used to store bit offsets in the file. I see two possible approaches to fix this issue:

  1. Use uint64_t in all places where Clang needs a bit offset from the beginning of the file. It is a relatively straightforward approach, but it increases file size even where uint32_t is enough. In my case I saw about a 4% increase for a 700MB preamble file. Such an increase sounds modest to me, and I implemented it in https://reviews.llvm.org/D76295

  2. Store offsets from the beginning of the corresponding data structure, i.e. the 512M size limit would apply to individual blocks instead of the whole file. It won’t increase file size much, but it will add some complexity to the loading/storing logic and may still require 64-bit offsets where it is not possible to find good anchors for relative offsets.
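To make the trade-off between the two options concrete, here is a minimal sketch of option 2. It is not the code from either patch: `RelativeOffsetTable`, `baseBitOffset`, `add` and `absolute` are hypothetical names chosen for illustration. The idea is that the anchor is stored once as a 64-bit value, while each entry stays 32 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical model of one offset table (not an actual Clang type).
struct RelativeOffsetTable {
  // Absolute bit position of the anchor, e.g. the start of the owning block.
  // Stored once per table, so the 64-bit cost is not paid per entry.
  std::uint64_t baseBitOffset;
  // Per-entry offsets, relative to the anchor, still 32 bits each.
  std::vector<std::uint32_t> relOffsets;

  // Writer side: convert an absolute bit offset into a relative one.
  void add(std::uint64_t absBitOffset) {
    assert(absBitOffset >= baseBitOffset && "entry precedes its anchor");
    std::uint64_t rel = absBitOffset - baseBitOffset;
    // This is the remaining limitation of option 2: the block itself
    // must still fit in 2^32 bits (512M bytes).
    assert(rel <= std::numeric_limits<std::uint32_t>::max() &&
           "block too large for 32-bit relative offsets");
    relOffsets.push_back(static_cast<std::uint32_t>(rel));
  }

  // Reader side: recover the absolute bit offset.
  std::uint64_t absolute(std::size_t i) const {
    return baseBitOffset + relOffsets[i];
  }
};
```

Under option 1 the table would instead hold `std::uint64_t` entries and need no anchor, which is simpler but doubles the size of every entry.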

The question is: which approach do you think is best?

Dmitry

Subscribing because I’m interested and adding Richard for visibility.

+clangd-dev, this is definitely interesting! We hadn’t seen such large preambles before.

I have to confess from the clangd side we’ve mostly been reusing the modules/preamble format without deeply probing into it, so I don’t have an informed opinion (both your options seem reasonable to me, with 2 sounding nicer but more work).

+Michael, Bruno, and Volodymyr

I agree with Sam that option 2 seems cleaner. I’m not sure whether 4% (option 1) would be a blocker for us, but probably it would be fine.

I agree with Sam that option 2 seems cleaner.

+1

I’m not sure whether 4% (option 1) would be a blocker for us, but probably it would be fine.

Ultimately I believe it would be fine.

OK, it seems there is a rough consensus that the second option is better and that the additional complexity is worth it. My main concern about the second option is that it is a somewhat temporary solution that just delays the point when 32 bits won’t be enough for bit offsets.
I’ll explore the second option in more detail and post the results here.

https://reviews.llvm.org/D76594 implements relative offsets for 32-bit bit offsets. I tried to find all potentially problematic places and found one more that I didn’t identify initially: PPD_ENTITIES_OFFSETS.

I found 2 more cases where 32-bit offsets are not enough: DeclOffsets and TypeOffsets. They belong to DECLTYPES_BLOCK, which usually takes about 2/3 of the file and, on a 900M preamble file, exceeds 512M on its own. In such cases I’m using 64-bit offsets.
The diff at https://reviews.llvm.org/D76594 was updated to handle these cases. If you are interested in support for large AST files, please review.
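For readers who want to check the arithmetic behind the 512M figure and the DECLTYPES_BLOCK case, here is a small sketch. The helper names (`fitsIn32BitBitOffset`, `narrowBitOffset`) are hypothetical and not part of Clang. The block size is only the rough "2/3 of 900M" estimate from the message above, with M taken as MiB.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest bit offset representable in a uint32_t.
constexpr std::uint64_t kMaxBitOffset32 =
    std::numeric_limits<std::uint32_t>::max();

// 2^32 addressable bits / 8 bits per byte = 512 MiB.
constexpr std::uint64_t kLimitBytes = (kMaxBitOffset32 + 1) / 8;
static_assert(kLimitBytes == 512ull * 1024 * 1024, "2^32 bits is 512 MiB");

// Hypothetical check: can a position this many bytes past the anchor
// be expressed as a 32-bit bit offset?
constexpr bool fitsIn32BitBitOffset(std::uint64_t byteOffset) {
  return byteOffset * 8 <= kMaxBitOffset32;
}

// Rough estimate: about 2/3 of a 900M file, i.e. about 600M.
constexpr std::uint64_t kApproxDeclTypesBytes = 900ull * 1024 * 1024 * 2 / 3;
static_assert(!fitsIn32BitBitOffset(kApproxDeclTypesBytes),
              "a block this large overflows even relative 32-bit offsets");

// Hypothetical writer-side helper: narrow a bit offset, asserting it fits
// instead of silently truncating.
inline std::uint32_t narrowBitOffset(std::uint64_t bitOffset) {
  assert(bitOffset <= kMaxBitOffset32 && "bit offset needs more than 32 bits");
  return static_cast<std::uint32_t>(bitOffset);
}
```

Since the block alone is larger than 512M, moving the anchor to the start of the block cannot help here, which is consistent with falling back to 64-bit offsets for these two tables.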