Aggregate load/stores

Hi all,

As most of you may know, LLVM is completely unable to do anything reasonable with aggregate loads/stores, besides just legalizing them into something in the backend.

This is not a good state of affairs. Aggregates are part of LLVM IR, and as such, LLVM should do something sensible with them.

This is a bit of a chicken-and-egg issue: front ends just implement their own tricks to avoid aggregates, or plainly don’t care about the resulting executable as long as it works. As a result, pretty much everybody who cares about this has already implemented something in the front end.

Which is honestly rather stupid: everybody is doing the same work again and again because LLVM is not doing it.

That being said, I now know why LLVM is not doing it. Any attempt at making things move on that front results in someone finding the solution not good enough and stalling the process.

The thing is, pretty much anything is better than nothing. Comparing any concrete solution to a hypothetical, nonexistent perfect solution is not constructive, and at this stage it is close to being disrespectful. I have http://reviews.llvm.org/D9766 (from May) with no actionable item on it. It was written as per feedback on previous discussions of the subject. There is no proposal to improve the code, no proposal to do it another way, nothing. FROM MAY!

I’d like to get things moving here. If you don’t give a s*** about it because clang already has a workaround, then fine. The good thing is that it won’t affect clang, for the very same reason: it is not using these. But there are numerous front ends out there that do not have the manpower backing clang, and they all have to jump through hoops to handle aggregates so that LLVM does not mess up. So please be considerate of the smaller guys in town.
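
For concreteness, here is the kind of IR I am talking about (a minimal sketch, with a hypothetical two-field struct, in the typed-pointer syntax of the day):

%pair = type { i32, i32 }

define void @copy(%pair* %src, %pair* %dst) {
  ; A first-class aggregate (FCA) load and store: today LLVM mostly
  ; just punts these to the backend instead of optimizing them.
  %v = load %pair, %pair* %src
  store %pair %v, %pair* %dst
  ret void
}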

Hello,

I see things a little differently and I'll do my best to explain my
position.

First of all, LLVM's backend doesn't handle large aggregates very well, and
there appears to be no desire from the greater community to fix this. Last
year I tried to fix PR21513, which involved a fairly large store of aggregate
type totaling around 64 KB. It turns out that the backend's representation
requires having a node with > 64,000 operands. I submitted a patch to fix
this, http://reviews.llvm.org/D6164, but others in the community reasoned
that the cure was worse than the disease, as it results in all SDNodes
becoming a little larger.

Second of all, turning large aggregate memory operations into large scalar
memory operations, via the integer types, doesn't work for memory
operations beyond 1 MB because the largest integer type is (2**23)-1 bits.
I think it would be quite costly to make the scalarization scale
significantly beyond this. Beyond that, InstCombine is not supposed to
generate types which aren't considered legal by the datalayout. Targets out
there rely on InstCombine respecting this to mitigate the creation of IR
which doesn't map well to the hardware.
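
(For reference: the "n" component of the datalayout string is what declares the legal, i.e. native, integer widths that InstCombine tries to stay within. The string below is the common x86-64 one, quoted purely as an illustration.)

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"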

What I tried, but perhaps failed, to intimate was that today's status quo
is considered a reasonable engineering compromise. LLVM doesn't provide a
*completely* abstract and normalized interface to computation, but we try
our best within the constraints we face. This is why clang's technique is,
to me, reasonable.

I hope this explains where I am coming from.

Thanks,
David Majnemer

I understand these objections. They end up being a problem at the limit
(i.e. the example of the 64 KB store or the 1 MB+ aggregate). These probably
require their own fix, or probably just shouldn't be supported.

That being said, there is a vast space between what is done now and
aggregates so big that they cause real, hard problems like the ones you
mention. Reducing that gap seems like a win to me. Additionally, there is a
vast bag of tricks that can be deployed to mitigate the problem (for
instance, load-to-store forwarding can be changed into memcpy). The thing
is, one has to start somewhere.

More generally, it doesn't seem right to me to reject something because
it doesn't cover all bases. The fact that it covers more bases than what
exists now should be enough (unless it creates some kind of bad precedent).

I would argue that a fix in the wrong direction is worse than the status
quo.

The argument that targets rely on InstCombine to mitigate IR requiring
legalization seems dubious to me. First, both aggregates and large scalars
require legalization, so, while not ideal, the proposed change does not make
things any worse than they already are. In fact, as far as legalization is
concerned, these are pretty much the same. It should also be noted that
InstCombine is not guaranteed to run before the backend, so relying on it
there seems like a bad idea to me.

InstCombine is not guaranteed to run before IR hits the backend, but the
result of legalizing the machinations of InstCombine's output during
SelectionDAG is worse than generating illegal IR in the first place.

As for the big integer thing, I really don't care. I can change it to
create multiple loads/stores respecting the data layout; I have the code
for that and could adapt it for this PR without too much trouble. If this
is the only thing blocking this PR, then we can proceed, but I'd like some
notion that we are making progress. Would you be willing to accept a
solution based on creating a series of loads/stores respecting the datalayout?

Splitting the memory operation into smaller operations is not
semantics-preserving from an IR-theoretic perspective. For example,
splitting a volatile memory operation into several volatile memory
operations is not OK. The same goes for atomics. Some targets provide
atomic memory operations at the granularity of a cache line, and splitting
at legal integer granularity would be observably different.
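
A minimal illustration, assuming a hypothetical %pair = type { i32, i32 }:

; Splitting this volatile aggregate store into two volatile i32
; stores would be observably different on such targets:
store volatile %pair %v, %pair* %p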

With the above in mind, I don't see it as unreasonable for frontends to
generate IR that LLVM is comfortable with. We seem fine telling frontend
authors that they should strive to avoid large aggregate memory operations
in our performance tips guide
<http://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type>.
Implementation experience with Clang hasn't shown this to be particularly
odious to follow, and none of the LLVM-side solutions seem satisfactory.

I would argue that a fix in the wrong direction is worse than the status
quo.

How is the proposed change worse than the status quo?

The argument that targets rely on InstCombine to mitigate IR requiring
legalization seems dubious to me. First, both aggregates and large scalars
require legalization, so, while not ideal, the proposed change does not make
things any worse than they already are. In fact, as far as legalization is
concerned, these are pretty much the same. It should also be noted that
InstCombine is not guaranteed to run before the backend, so relying on it
there seems like a bad idea to me.

InstCombine is not guaranteed to run before IR hits the backend, but the
result of legalizing the machinations of InstCombine's output during
SelectionDAG is worse than generating illegal IR in the first place.

That does not follow. InstCombine is not creating new things that require
legalization; it changes one thing that requires legalization into another
that a larger part of LLVM can understand.

As for the big integer thing, I really don't care. I can change it to
create multiple loads/stores respecting the data layout; I have the code
for that and could adapt it for this PR without too much trouble. If this
is the only thing blocking this PR, then we can proceed, but I'd like some
notion that we are making progress. Would you be willing to accept a
solution based on creating a series of loads/stores respecting the datalayout?

Splitting the memory operation into smaller operations is not
semantics-preserving from an IR-theoretic perspective. For example,
splitting a volatile memory operation into several volatile memory
operations is not OK. The same goes for atomics. Some targets provide
atomic memory operations at the granularity of a cache line, and splitting
at legal integer granularity would be observably different.

That is off topic. The proposed patch explicitly gates on this.

With the above in mind, I don't see it as unreasonable for frontends to
generate IR that LLVM is comfortable with. We seem fine telling frontend
authors that they should strive to avoid large aggregate memory operations
in our performance tips guide
<http://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type>.
Implementation experience with Clang hasn't shown this to be particularly
odious to follow, and none of the LLVM-side solutions seem satisfactory.

Most front ends do not have clang's resources. Additionally, this tip is
not quite accurate. I'm not interested in large aggregate loads/stores at
this stage; I'm interested in ANY aggregate load/store. LLVM is just unable
to handle any of them in a way that makes sense. It could certainly do
better for small aggregates, without too much trouble.

I would argue that a fix in the wrong direction is worse than the status
quo.

How is the proposed change worse than the status quo?

Because a solution which doesn't generalize is not a very powerful
solution. What happens when somebody says that they want to use atomics +
large aggregate loads and stores? Give them yet another, different answer?
That would mean our earlier, less general approach was either a band-aid
(bad) or the new answer requires a parallel code path in their frontend
(worse).

The argument that targets rely on InstCombine to mitigate IR requiring
legalization seems dubious to me. First, both aggregates and large scalars
require legalization, so, while not ideal, the proposed change does not make
things any worse than they already are. In fact, as far as legalization is
concerned, these are pretty much the same. It should also be noted that
InstCombine is not guaranteed to run before the backend, so relying on it
there seems like a bad idea to me.

InstCombine is not guaranteed to run before IR hits the backend, but the
result of legalizing the machinations of InstCombine's output during
SelectionDAG is worse than generating illegal IR in the first place.

That does not follow. InstCombine is not creating new things that require
legalization; it changes one thing that requires legalization into another
that a larger part of LLVM can understand.

I'm afraid I don't understand what you are getting at here. InstCombine
carefully avoids ptrtoint to weird types, truncs to weird types, etc. when
creating new IR.

As for the big integer thing, I really don't care. I can change it to
create multiple loads/stores respecting the data layout; I have the code
for that and could adapt it for this PR without too much trouble. If this
is the only thing blocking this PR, then we can proceed, but I'd like some
notion that we are making progress. Would you be willing to accept a
solution based on creating a series of loads/stores respecting the datalayout?

Splitting the memory operation into smaller operations is not
semantics-preserving from an IR-theoretic perspective. For example,
splitting a volatile memory operation into several volatile memory
operations is not OK. The same goes for atomics. Some targets provide
atomic memory operations at the granularity of a cache line, and splitting
at legal integer granularity would be observably different.

That is off topic. The proposed patch explicitly gates on this.

Then I guess we agree to disagree about what is "on topic". I think that
our advice to frontend authors regarding larger-than-legal loads/stores
should be uniform, and not depend on whether or not the operation was
volatile.

With the above in mind, I don't see it as unreasonable for frontends to
generate IR that LLVM is comfortable with. We seem fine telling frontend
authors that they should strive to avoid large aggregate memory operations
in our performance tips guide
<http://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type>.
Implementation experience with Clang hasn't shown this to be particularly
odious to follow, and none of the LLVM-side solutions seem satisfactory.

Most front ends do not have clang's resources. Additionally, this tip is
not quite accurate. I'm not interested in large aggregate loads/stores at
this stage; I'm interested in ANY aggregate load/store. LLVM is just unable
to handle any of them in a way that makes sense. It could certainly do
better for small aggregates, without too much trouble.

I'm confused about what you mean by "clang resources" here; you haven't
made it clear what the burden is for your frontend. I'm not saying that
there isn't such a burden, I just haven't seen it articulated, and I have
heard nothing similar from other folks using LLVM. What prevents you from
performing field-at-a-time loads and stores, or calls to the memcpy
intrinsic?

Because a solution which doesn't generalize is not a very powerful
solution. What happens when somebody says that they want to use atomics +
large aggregate loads and stores? Give them yet another, different answer?
That would mean our earlier, less general approach was either a band-aid
(bad) or the new answer requires a parallel code path in their frontend
(worse).

Atomics/volatile are expected to work differently; that is the whole
point of them.

A lot of optimizations in InstCombine plainly ignore atomic/volatile
loads/stores. That is expected.

The argument that targets rely on InstCombine to mitigate IR requiring
legalization seems dubious to me. First, both aggregates and large scalars
require legalization, so, while not ideal, the proposed change does not make
things any worse than they already are. In fact, as far as legalization is
concerned, these are pretty much the same. It should also be noted that
InstCombine is not guaranteed to run before the backend, so relying on it
there seems like a bad idea to me.

InstCombine is not guaranteed to run before IR hits the backend, but the
result of legalizing the machinations of InstCombine's output during
SelectionDAG is worse than generating illegal IR in the first place.

That does not follow. InstCombine is not creating new things that require
legalization; it changes one thing that requires legalization into another
that a larger part of LLVM can understand.

I'm afraid I don't understand what you are getting at here. InstCombine
carefully avoids ptrtoint to weird types, truncs to weird types, etc. when
creating new IR.

What I'm getting at is that it doesn't make the situation any worse. In
fact, it makes things actually better, as in some cases you get something
that does not need legalization, whereas today you always get something
that needs legalization.

Think about it: aggregates require legalization. Large scalars require
legalization. Without the patch, you always need legalization. With the
patch, you sometimes need legalization, and optimization gets back into the
game.

If your point is that less legalization is better, then the change is a
win. If you are willing to go with the multiple load/store solution for the
non-volatile/atomic case, then even more so.
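
To illustrate, with a hypothetical i32-only target and a hypothetical %pair = type { i32, i32 }: the big-scalar form still needs the backend to split it, while the sliced form is already legal.

; big-scalar form: the i64 load must be legalized on an i32-only target
%pi = bitcast %pair* %p to i64*
%v = load i64, i64* %pi

; sliced form: nothing left to legalize
%p0 = getelementptr inbounds %pair, %pair* %p, i32 0, i32 0
%v0 = load i32, i32* %p0
%p1 = getelementptr inbounds %pair, %pair* %p, i32 0, i32 1
%v1 = load i32, i32* %p1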

As for the big integer thing, I really don't care. I can change it to
create multiple loads/stores respecting the data layout; I have the code
for that and could adapt it for this PR without too much trouble. If this
is the only thing blocking this PR, then we can proceed, but I'd like some
notion that we are making progress. Would you be willing to accept a
solution based on creating a series of loads/stores respecting the datalayout?

Splitting the memory operation into smaller operations is not
semantics-preserving from an IR-theoretic perspective. For example,
splitting a volatile memory operation into several volatile memory
operations is not OK. The same goes for atomics. Some targets provide
atomic memory operations at the granularity of a cache line, and splitting
at legal integer granularity would be observably different.

That is off topic. The proposed patch explicitly gates on this.

Then I guess we agree to disagree about what is "on topic". I think that
our advice to frontend authors regarding larger-than-legal loads/stores
should be uniform, and not depend on whether or not the operation was
volatile.

Come on. Atomic/volatile are expected to be handled differently; that is
the whole point of atomic/volatile. Bringing up atomic/volatile as an
argument that something should not be done in the general case sounds like
backward rationalization.

With the above in mind, I don't see it as unreasonable for frontends to
generate IR that LLVM is comfortable with. We seem fine telling frontend
authors that they should strive to avoid large aggregate memory operations
in our performance tips guide
<http://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type>.
Implementation experience with Clang hasn't shown this to be particularly
odious to follow, and none of the LLVM-side solutions seem satisfactory.

Most front ends do not have clang's resources. Additionally, this tip is
not quite accurate. I'm not interested in large aggregate loads/stores at
this stage; I'm interested in ANY aggregate load/store. LLVM is just unable
to handle any of them in a way that makes sense. It could certainly do
better for small aggregates, without too much trouble.

I'm confused about what you mean by "clang resources" here; you haven't
made it clear what the burden is for your frontend. I'm not saying that
there isn't such a burden, I just haven't seen it articulated, and I have
heard nothing similar from other folks using LLVM. What prevents you from
performing field-at-a-time loads and stores, or calls to the memcpy
intrinsic?

clang has many developers behind it, some of them paid to work on it.
That's simply not the case for many others.

But to answer your questions:
- Per-field loads/stores generate more loads/stores than necessary in many
cases. These can't be merged back together because of padding.
- memcpy only works memory to memory. It is certainly usable in some
cases, but it certainly does not cover all uses.

I'm willing to do the memcpy optimization in InstCombine (in fact, had
things not degenerated into so much bikeshedding, it would already be done).
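
For the memory-to-memory case, that optimization would emit something like this sketch (using the memcpy intrinsic signature of the day; %pair = type { i32, i32 } is again a hypothetical example type):

declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i32, i1)

; A %pair load forwarded straight into a %pair store becomes a
; copy of the struct's 8 bytes at its 4-byte alignment:
%src8 = bitcast %pair* %p to i8*
%dst8 = bitcast %pair* %q to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst8, i8* %src8, i64 8, i32 4, i1 false)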

David, speaking as the guy who wrote the documentation you're quoting, you're twisting the intent of that document. The document was explicitly intended to document current status, warts and all. Please do not use it to justify not fixing those warts. :)

In general, I feel that a solution which worked for FCAs under some fixed size (64k, 1MB, fine!) would be better than one that worked for none. We could just document the limitation and call it a day. (I'll note that I am not endorsing or discouraging any *particular* solution to said problem.)

Philip

I agree with this particular point. If we limited the optimizer to treating all loads and stores the same, we'd have a much weaker optimizer. Treating atomic vs non-atomic FCAs differently from an optimization standpoint seems potentially reasonable. I would not want to treat them differently from a correctness/lowering-strategy standpoint (i.e. both the input to and the output from instcombine need to trigger somewhat sane results from the backend).

Philip

Yes, that was pretty much my point, put in a much clearer manner. I'm also
not particularly attached to this way of doing it, but if this is the wrong
way, then let's discuss what alternatives exist and would be better, rather
than leaving the matter stuck in limbo.

Hi,

Because a solution which doesn't generalize is not a very powerful
solution. What happens when somebody says that they want to use atomics +
large aggregate loads and stores? Give them yet another, different answer?
That would mean our earlier, less general approach was either a band-aid
(bad) or the new answer requires a parallel code path in their frontend
(worse).

+1 with David’s approach: making things incrementally better is fine *as
long as* the long-term direction is identified. Small incremental changes
that make things slightly better in the short term but drive us away from
the long-term direction are not good.

Don’t get me wrong, I’m not saying that the current patch is not good,
just that it does not seem clear to me that the long-term direction has
been identified, which explains why some can be nervous about adding stuff
prematurely.
And I’m not for the status quo; while I can’t judge it definitively
myself, I even bugged David last month to look at this revision and try to
identify what the long-term direction really is and how to make your (and
other) frontends’ lives easier.

As long as there is something to be done. Concern has been raised about
very large aggregates (64 KB, 1 MB+), but there is no way good codegen can
come out of these anyway. I don't know of any machine that has 1 MB of
registers available to tank the load. Even if we had a good way to handle
it in InstCombine, the backend would have no capability to generate
something nice for it anyway. Most aggregates are small, and there is no
good excuse not to handle them just because someone could generate gigantic
ones that won't map nicely to the hardware anyway.

By that logic, SROA should not exist, as one could generate gigantic
aggregates as well (in fact, SROA fails pretty badly on large aggregates).

The second concern raised is about atomic/volatile, which needs to be
handled differently by the optimizer anyway, so it is mostly irrelevant here.

clang has many developers behind it, some of them paid to work on it.
That's simply not the case for many others.

But to answer your questions:
- Per-field loads/stores generate more loads/stores than necessary in many
cases. These can't be merged back together because of padding.
- memcpy only works memory to memory. It is certainly usable in some
cases, but it certainly does not cover all uses.

I'm willing to do the memcpy optimization in InstCombine (in fact, had
things not degenerated into so much bikeshedding, it would already be done).

Calling "bikeshedding" what other devs think keeps the quality of the
project high is unlikely to help your patch go through; it's probably
quite the opposite, actually.

I understand the desire to keep quality high. That is not where the
problem is. The problem lies in weighing actual proposals against
hypothetical perfect ones that do not exist.

OK, what about this plan?

- Slice the aggregate into a series of valid loads/stores for the non-atomic ones (see the sketch below).
- Use a big scalar for the atomic/volatile ones.
- Try to generate memcpy or memmove when possible.
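
To make the first bullet concrete, here is a sketch for a hypothetical %pair = type { i32, i32 } (non-atomic, non-volatile):

; before:
%v = load %pair, %pair* %p
store %pair %v, %pair* %q

; after, as a series of datalayout-legal operations:
%p0 = getelementptr inbounds %pair, %pair* %p, i32 0, i32 0
%v0 = load i32, i32* %p0
%p1 = getelementptr inbounds %pair, %pair* %p, i32 0, i32 1
%v1 = load i32, i32* %p1
%q0 = getelementptr inbounds %pair, %pair* %q, i32 0, i32 0
store i32 %v0, i32* %q0
%q1 = getelementptr inbounds %pair, %pair* %q, i32 0, i32 1
store i32 %v1, i32* %q1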

I’ve definitely “run into this problem”, and I would very much love to remove my kludges [which are incomplete, because I keep finding places where I need to modify the code-gen to “fix” the same problem - this is probably par for the course for a complete amateur compiler writer, and someone who has only spent the last 14 months working (as a hobby) with LLVM].

So whilst I can’t contribute much on the “what is the right solution” and “how do we solve this”, I would very much like to see something that allows the user of LLVM to use load/store without things like “is this thing that I’m storing big? if so, don’t generate a load, use a memcpy instead”. Not only does this make the usage of LLVM harder, it also causes slow compilation [perhaps this is a separate problem, but I have a simple program that copies a large struct a few times, and if I turn off my “use memcpy for large things” workaround, the compile time gets quite a lot longer - approx 1000x, and 48 seconds is a long time to compile 37 lines of relatively straightforward code - even the Pascal compiler on the PDP-11/70 that I used at my school in the 1980s was capable of doing more than 1 line per second, and it didn’t run anywhere near 2.5GHz and had 20-30 users any time I could use it…]

…/lacsap -no-memcpy -tt longcompile.pas
Time for Parse 0.657 ms
Time for Analyse 0.018 ms
Time for Compile 1.248 ms
Time for CreateObject 48803.263 ms
Time for CreateBinary 48847.631 ms
Time for Compile 48854.064 ms

compared with:
…/lacsap -tt longcompile.pas
Time for Parse 0.455 ms
Time for Analyse 0.013 ms
Time for Compile 1.138 ms
Time for CreateObject 44.627 ms
Time for CreateBinary 82.758 ms
Time for Compile 95.797 ms

wc longcompile.pas
37 84 410 longcompile.pas

Source here:
https://github.com/Leporacanthicus/lacsap/blob/master/test/longcompile.pas

Hi Mats,

The performance issue seems like it is potentially a different issue.
Can you send the input IR in both cases and the list of passes you are running?

Thanks,

Even if I use -O0 [in other words, no optimisation passes at all], it takes the same amount of time.

The time is spent in

12.94% lacsap lacsap [.] llvm::SDNode::use_iterator::operator==
7.68% lacsap lacsap [.] llvm::SDNode::use_iterator::operator*
7.53% lacsap lacsap [.] llvm::SelectionDAG::ReplaceAllUsesOfValueWith
7.28% lacsap lacsap [.] llvm::SDNode::use_iterator::operator++
5.59% lacsap lacsap [.] llvm::SDNode::use_iterator::operator!=
4.65% lacsap lacsap [.] llvm::SDNode::hasNUsesOfValue
3.82% lacsap lacsap [.] llvm::SDUse::getResNo
2.33% lacsap lacsap [.] llvm::SDValue::getResNo
2.19% lacsap lacsap [.] llvm::SDUse::getNext
1.32% lacsap lacsap [.] llvm::SDNode::use_iterator::getUse
1.28% lacsap lacsap [.] llvm::SDUse::getUser

Here’s the LLVM IR generated:

https://gist.github.com/Leporacanthicus/9b662f88e0c4a471e51a

And as can be seen here, -O0 produces “no passes”:
https://github.com/Leporacanthicus/lacsap/blob/master/lacsap.cpp#L76

…/lacsap -no-memcpy -tt longcompile.pas -O0
Time for Parse 0.502 ms
Time for Analyse 0.015 ms
Time for Compile 1.038 ms
Time for CreateObject 48134.541 ms
Time for CreateBinary 48179.720 ms
Time for Compile 48187.351 ms

And before someone says “but you are running a debug build”: if I run the “production” build, it does speed things up quite nicely, about 3x, but it still takes 17 seconds vs. 45 ms with that build of the compiler.

…/lacsap -no-memcpy -tt longcompile.pas -O0
Time for Parse 0.937 ms
Time for Analyse 0.005 ms
Time for Compile 0.559 ms
Time for CreateObject 17241.177 ms
Time for CreateBinary 17286.701 ms
Time for Compile 17289.187 ms

…/lacsap -tt longcompile.pas
Time for Parse 0.274 ms
Time for Analyse 0.004 ms
Time for Compile 0.258 ms
Time for CreateObject 7.504 ms
Time for CreateBinary 45.405 ms
Time for Compile 46.670 ms

I believe I know what happens: the compiler is trying to figure out the best order of instructions, and looks at N^2 instructions that are pretty much independently executable with no code or data dependencies. So it iterates over a vast number of possible permutations, only to find that they are all pretty much equally good/bad… But like I said earlier, although I’m a professional software engineer, compilers are just a hobby project for me, and I only started a little over a year back, so I make no pretense of knowing the answer. Using memcpy instead solves this problem, as it avoids generating that flood of individual loads and stores in the first place.

Forgot the “fast compile” version of the LLVM-IR:
https://gist.github.com/Leporacanthicus/b1f12005ef0c46582d39

The instruction selection for X86 turns:

define void @P.p1(%1* byval) {
entry:
  %y = alloca %1, align 8
  %1 = load %1, %1* %0
  store %1 %1, %1* %y
  %valueindex2 = bitcast %1* %y to [8000 x i32]*
  %valueindex1 = getelementptr [8000 x i32], [8000 x i32]* %valueindex2, i32 0, i32 1
  %2 = load i32, i32* %valueindex1
  call void @__write_int(%0* @output, i32 %2, i32 1)
  call void @__write_nl(%0* @output)
  ret void
}

into 16014 instructions; it sounds pretty terrible :(

That’s kind of my point - it turns a load/store into lots of instructions, and the suggested solution that I got when I pointed that out was “well, you should use memcpy for large data structures, there is an intrinsic for it”. This led to this little function:
https://github.com/Leporacanthicus/lacsap/blob/master/expr.cpp#L305

along with a few other bits and pieces that do similar “if it’s big enough, call memcpy”.

I’m not sure if the results are better on any other processor architecture - since my home setup consists only of x86-64 machines, I haven’t experimented with anything else.

The IR is definitely loading a { [8000 x i32] }. I don’t see how this can end up as anything other than a sludge of loads in the end; X86 does not have 32000-byte-wide load instructions anywhere.