Optimization opportunity

There seems to be a disadvantage to the approach of allocating all
locals on the stack using alloca. Consider the following code:

extern void func(int*);
extern int xyz();

void abc()
{
  int a = xyz();
  int i;

  {
    int b[10];
    for (i = 0; i < 10; i++) b[i] = xyz();
    func(b);
  }

  {
    int c[10];
    for (i = 0; i < 10; i++) c[i] = xyz();
    func(c);
  }

  func(&a);
}

We have two arrays, b and c, only one of which can exist at any given
time. Both g++ and Microsoft VC++ map both arrays to the same bytes in
the stack frame. LLVM doesn't. As there isn't anything in the LLVM
assembly code to mark the scope of locals, it will take some analysis to
determine which locals can share the same memory locations.

Also, the store into the arrays generates two x86 machine instructions:

  lea %ECX, DWORD PTR [%ESP + 16]
  mov DWORD PTR [%ECX + <element offset>], %EAX

These can be combined into a single instruction. I'm tempted to work on
this one myself :slight_smile:

On the plus side, only LLVM unrolled the loops.

There seems to be a disadvantage to the approach of allocating all
locals on the stack using alloca. Consider the following code:

There is nothing intrinsic in LLVM that prevents this from happening, we
just have not yet implemented 'stack packing'.

We have two arrays, b and c, only one of which can exist at any given
time. Both g++ and Microsoft VC++ map both arrays to the same bytes in
the stack frame. LLVM doesn't. As there isn't anything in the LLVM
assembly code to mark the scope of locals, it will take some analysis to
determine which locals can share the same memory locations.

Note that using scoping is really only a hack. Those two arrays could
very well have been declared on entry to the function, and should be
optimized as such. LLVM will eventually support this.

Also, the store into the arrays generates two x86 machine instructions:

  lea %ECX, DWORD PTR [%ESP + 16]
  mov DWORD PTR [%ECX + <element offset>], %EAX

These can be combined into a single instruction. I'm tempted to work on
this one myself :slight_smile:

Yup, there are several things that can be improved in the X86 instruction
selector. If you submit a patch to implement this, we would be happy to
apply it. Continuing improvements in the code generator should eventually
make this kind of thing fall out automatically, but for now they must be
implemented manually.

-Chris

I succumbed to temptation and made the improvement. Diffs are attached
for X86ISelSimple.cpp and X86InstrBuilder.h.

I determined that the reason two instructions are generated in the first
place, instead of being folded immediately into one, is because locals
do not have a physical offset assigned at that time. There is a
peephole optimization pass after the stack frame is finalized, but the
problem with folding them there is that the physical registers have also
been assigned, so register CX (in this case) would be wasted.

So I generalized the concept of a "full address". There were four
variables, base reg, scale, index reg and displacement, that were being
passed around. I converted them into a single struct and added a new
type field that allows the base reg to instead be a frame index. The
extra lea instruction is now folded immediately, and the code for
processing store instructions shrinks quite nicely.

This should be generalizable even further to handle global variables. I
didn't go that far, because it appeared there may be other work
required. It wasn't clear a displacement to a global was supported.

Jeff

X86ISelSimple.cpp.diff (20.9 KB)

X86InstrBuilder.h.diff (2.16 KB)

Jeff,

Chris isn't likely to respond to this for a while as he's on vacation.
I'll take a look at it and will commit it if it looks good. Since code
gen isn't my specialty, could you increase my comfort level a little by
giving me some examples of the test results you got when testing your
patches? Ideally, I'd like to see some of the test/Programs/MultiSource
programs working and generating correct code with this path.

Thanks,

Reid.

Fair enough... The following tests under MultiSource fail:

Benchmarks/Olden/power
Benchmarks/OptimizerEval
Benchmarks/Ptrdist/ks
Benchmarks/MallocBench/perl
Applications/sgefa

However, they also fail in the exact same way without my change.
OptimizerEval appears to be non-deterministic; it produces different
output each time it runs. Everything else passes. I also compared a
few *.s files generated by the tests, with and without my change, to see
if everything is as I expected.

I tried running SingleSource also, and ran into a few problems. Some of
the tests failed because they cannot include alloca.h. This may be a
linux dependency. Anyway, again I saw the same results with or without
my change.

I cannot run qmtest at all. I don't seem to have it or any way of
building it.

There's no rush to commit this. If you don't feel comfortable doing so,
it can wait until Chris gets back. And as it's my first submission, you
shouldn't feel comfortable :slight_smile:

Jeff

Fair enough... The following tests under MultiSource fail:

Benchmarks/Olden/power
Benchmarks/OptimizerEval
Benchmarks/Ptrdist/ks
Benchmarks/MallocBench/perl
Applications/sgefa

However, they also fail in the exact same way without my change.
OptimizerEval appears to be non-deterministic; it produces different
output each time it runs. Everything else passes. I also compared a
few *.s files generated by the tests, with and without my change, to see
if everything is as I expected.

I know for certain that sgefa, perl, and power are XFAIL tests. Not sure
about the other two.

I tried running SingleSource also, and ran into a few problems. Some of
the tests failed because they cannot include alloca.h. This may be a
linux dependency. Anyway, again I saw the same results with or without
my change.

Which platform are you compiling on?

I cannot run qmtest at all. I don't seem to have it or any way of
building it.

qmtest is an external package. You need version 2.03. I suggest you
read:

http://llvm.cs.uiuc.edu/docs/TestingGuide.html

In the requirements section there's a link to the QMTest software.

There's no rush to commit this. If you don't feel comfortable doing so,
it can wait until Chris gets back. And as it's my first submission, you
shouldn't feel comfortable :slight_smile:

On the other hand, two weeks is a long time to wait for your first
submission :slight_smile: I'm going to review the code, if it looks good I'll test
it, if that looks okay, I'll commit it and we can see what it does to
the nightly tester.

Thanks for your willingness to help, Jeff.

Reid.

Jeff,

I tried the Benchmarks/Olden/power, Benchmarks/OptimizerEval, and
Benchmarks/Ptrdist/ks tests. They all worked with your patches. I
suggest you update your tree :slight_smile:

The changes also survived all the Feature and Regression tests on Linux.
So, your changes are committed.

Thanks for the patches!

Reid.

Jeff,

Thought you might like to know. Your changes caused a decrease in the
native code size of executables in the nightly test. Have a look at:
http://llvm.x10sys.com/testresults

and, in particular:
http://llvm.x10sys.com/testresults/running_Olden_machcode_large.png

TYMLTK

Reid.

> I tried running SingleSource also, and ran into a few problems. Some of
> the tests failed because they cannot include alloca.h. This may be a
> linux dependency. Anyway, again I saw the same results with or without
> my change.

Which platform are you compiling on?

FreeBSD 5.2.1

> I cannot run qmtest at all. I don't seem to have it or any way of
> building it.

qmtest is an external package. You need version 2.03. I suggest you
read:

http://llvm.cs.uiuc.edu/docs/TestingGuide.html

In the requirements section there's a link to the QMTest software.

I somehow missed that document...

Jeff,

I tried the Benchmarks/Olden/power, Benchmarks/OptimizerEval, and
Benchmarks/Ptrdist/ks tests. They all worked with your patches. I
suggest you update your tree :slight_smile:

I was using the 1.3 release, though I did pull the current code from CVS to verify no other changes were made to those files. Unfortunately this machine is too underpowered to frequently rebuild LLVM. It was collecting dust in the closet until I decided to put FreeBSD on it. For the record it's a 64MB 300Mhz Pentium II and it takes 30-35 hours to build LLVM from scratch. It's possible some of my test problems are due to LLVM generating X86 code that's not Pentium II-friendly. If LLVM was Windows-friendly I'd build it on my much more powerful Windows system, but relying on cygwin means combining the worst of both Unix and Windows. When FreeBSD 5 goes production in a few months I plan to put it on a more powerful system.

The changes also survived all the Feature and Regression tests on Linux.
So, your changes are committed.

Thanks for the patches!

You're welcome!

FYI,

We have someone in Denmark (Henrik) working on an Interix (Windows
Services For Unix) port, and possibly a native Win32 port. I wouldn't
expect anything for a couple months but hopefully the 1.4 release will
have native windows support. That should help a lot.

Reid.

I succumbed to temptation and made the improvement. Diffs are attached
for X86ISelSimple.cpp and X86InstrBuilder.h.

The patches look great, that's a very elegant way to address the problem,
and also simplified some code, thanks!

I determined that the reason two instructions are generated in the first
place, instead of being folded immediately into one, is because locals
do not have a physical offset assigned at that time. There is a
peephole optimization pass after the stack frame is finalized, but the
problem with folding them there is that the physical registers have also
been assigned, so register CX (in this case) would be wasted.

Yup.

So I generalized the concept of a "full address". There were four
variables, base reg, scale, index reg and displacement, that were being
passed around. I converted them into a single struct and added a new
type field that allows the base reg to instead be a frame index. The
extra lea instruction is now folded immediately, and the code for
processing store instructions shrinks quite nicely.

Sounds good.

This should be generalizable even further to handle global variables. I
didn't go that far, because it appeared there may be other work
required. It wasn't clear a displacement to a global was supported.

That would be cool as well. A displacement from a global should be
supported. Look at the code generated for:

int G;
void foo() { G = 4; }

... and you'll see what I mean. :slight_smile:

-Chris