LowerPacked pass

Hello,

Our software uses 4 x float vectors a lot, and I pass these to LLVM as packed types - but when I do the JIT compile it seems that the LowerPacked pass is never run so the code generation fails. I noticed that most other passes have a header file with a public createXXXPass() function so they can be added to the PassManager, but LowerPacked doesn't have this... What should I do?

m.

PS. Chris, thanks for the feedback on the memory cleanup patch - I'm a bit busy getting LLVM integrated in our app now, but I will incorporate your suggestions and submit a proper patch soon...

> Our software uses 4 x float vectors a lot, and I pass these to LLVM as
> packed types - but when I do the JIT compile it seems that the
> LowerPacked pass is never run so the code generation fails. I noticed
> that most other passes have a header file with a public createXXXPass()
> function so they can be added to the PassManager, but LowerPacked
> doesn't have this... What should I do?

I just added it. There was no reason not to expose it; we just never got
to that point. Note that packed support in LLVM is not complete yet. In
particular, here are some of the big missing pieces:

1. No code generators can generate vector instructions yet (SSE or
   altivec, for example). This should be fairly easy to add though.
2. The lowerpacked pass, which currently converts packed ops into their
   scalar counterparts, has a few limitations:
     A. It does not handle packed arguments to functions
     B. It always lowers all of the way to scalar ops, even if the target
        supports SOME packed types. For example, it would be nice for it
        to eventually lower <16 x float> into 4 <4 x float>'s if the
        target supports them.
     C. It has never been thoroughly tested, primarily because we don't
        have a producer of packed operations yet. I believe it should
        work reasonably well though.
3. LLVM is missing support for a bunch of important vector operations. In
   particular, we need at least 'extract element' and 'build vector out of
   scalars' operations. Given these, we can implement packed arguments to
   functions without a problem. There are probably many others we
   eventually want.
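
To make point 2 concrete, here is a minimal sketch of what "lowering all the way to scalar ops" amounts to, written as plain C++ rather than LLVM IR. `Float4`, `addLowered` and `addLoweredExample` are hypothetical names used only for illustration (a stand-in for a <4 x float> value and for the result of lowering one packed add); this is not the code of the pass itself.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for an LLVM <4 x float> packed value.
using Float4 = std::array<float, 4>;

// Conceptually the input is one packed add:  r = a + b
// After lowering, the same computation is four independent scalar adds,
// one per element, which a scalar-only code generator can handle.
inline Float4 addLowered(const Float4 &a, const Float4 &b) {
  Float4 r{};
  r[0] = a[0] + b[0];
  r[1] = a[1] + b[1];
  r[2] = a[2] + b[2];
  r[3] = a[3] + b[3];
  return r;
}

// Usage: the lowered form produces the element-wise sum.
inline void addLoweredExample() {
  const Float4 a = {1.0f, 2.0f, 3.0f, 4.0f};
  const Float4 b = {10.0f, 20.0f, 30.0f, 40.0f};
  const Float4 r = addLowered(a, b);
  assert(r[0] == 11.0f && r[3] == 44.0f);
}
```

The point of the sketch is only that nothing in the lowered form needs vector hardware, which is also why it gives up the performance a native vector add would provide.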

For your work, it might be most expedient to just ignore the lower packed
pass and add SSE support to the X86 backend: that will get you up and
running quickly and get you the performance you are obviously after. If
backwards compatibility with old hardware is an issue, revisiting the
lower packed pass would make sense.

Let me know what you think. In the very short term, the hook exposed to
create the lower packed pass can be plunked into the X86TargetMachine and
get intra function packed types working for you.

> PS. Chris, thanks for the feedback on the memory cleanup patch - I'm a
> bit busy getting LLVM integrated in our app now, but I will incorporate
> your suggestions and submit a proper patch soon...

No problem.

-Chris

Chris Lattner wrote:

> Note that packed support in LLVM is not complete yet. In
> particular, here are some of the big missing pieces:
>
> 1. No code generators can generate vector instructions yet (SSE or
>    altivec, for example). This should be fairly easy to add though.
> 2. The lowerpacked pass, which currently converts packed ops into their
>    scalar counterparts, has a few limitations:
>      A. It does not handle packed arguments to functions
>      B. It always lowers all of the way to scalar ops, even if the target
>         supports SOME packed types. For example, it would be nice for it
>         to eventually lower <16 x float> into 4 <4 x float>'s if the
>         target supports them.
>      C. It has never been thoroughly tested, primarily because we don't
>         have a producer of packed operations yet. I believe it should
>         work reasonably well though.

It works reasonably well, quite impressive really considering it's not been tested ;) B is not much of a problem for my use, but A is a bit annoying even though I mostly pass pointers to packed types anyway. Can you elaborate a bit on what the problem is with this? I have calls going back into our code by adding mappings to the JIT, but I'm not sure if I can get it to call functions with R32x4 (<4 x float>) args without making a wrapper that takes a pointer.

> For your work, it might be most expedient to just ignore the lower packed
> pass and add SSE support to the X86 backend: that will get you up and
> running quickly and get you the performance you are obviously after. If
> backwards compatibility with old hardware is an issue, revisiting the
> lower packed pass would make sense.

Is it easy to add intrinsics to do things like dot product of packed types using SSE instructions? That's probably all I need...

> Let me know what you think. In the very short term, the hook exposed to
> create the lower packed pass can be plunked into the X86TargetMachine and
> get intra function packed types working for you.

The patch you did was missing the actual implementation of createLowerPackedPass, so I'm including my own differences. I guess you don't want to apply the changes to X86TargetMachine, as I'm the only one actually generating packed types, but I include them for completeness.

m.

lowerpacked.patch.txt (2.51 KB)

> Chris Lattner wrote:
> > Note that packed support in LLVM is not complete yet. In
> > particular, here are some of the big missing pieces:
> >
> > 1. No code generators can generate vector instructions yet (SSE or
> >    altivec, for example). This should be fairly easy to add though.
> > 2. The lowerpacked pass, which currently converts packed ops into their
> >    scalar counterparts, has a few limitations:
>
> >      C. It has never been thoroughly tested, primarily because we don't
> >         have a producer of packed operations yet. I believe it should
> >         work reasonably well though.

> It works reasonably well, quite impressive really considering it's not
> been tested ;)

You can thank Brad Jones for that, he did a great job! :)

> >      B. It always lowers all of the way to scalar ops, even if the target
> >         supports SOME packed types. For example, it would be nice for it
> >         to eventually lower <16 x float> into 4 <4 x float>'s if the
> >         target supports them.

> B is not much of a problem for my use,

Yup, I suspected not. You know what you're generating. :)
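
For readers following point B, the partial lowering it describes could look roughly like the sketch below: a <16 x float> add split into four <4 x float> adds instead of sixteen scalar adds. This is plain C++ for illustration only; `Float4`, `Float16`, `add4`, `add16ViaFloat4` and the example function are hypothetical names, and `add4` merely stands in for whatever native 4-wide add a target would provide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Float4 = std::array<float, 4>;    // hypothetical: a width the target supports
using Float16 = std::array<float, 16>;  // hypothetical: a width it does not

// Stand-in for a native <4 x float> add on the target.
inline Float4 add4(const Float4 &a, const Float4 &b) {
  Float4 r{};
  for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] + b[i];
  return r;
}

// Partial lowering: one 16-wide add becomes four 4-wide adds.
inline Float16 add16ViaFloat4(const Float16 &a, const Float16 &b) {
  Float16 r{};
  for (std::size_t chunk = 0; chunk < 4; ++chunk) {
    Float4 x{}, y{};
    for (std::size_t i = 0; i < 4; ++i) {  // pull out one 4-element chunk
      x[i] = a[chunk * 4 + i];
      y[i] = b[chunk * 4 + i];
    }
    const Float4 s = add4(x, y);           // one supported-width operation
    for (std::size_t i = 0; i < 4; ++i) r[chunk * 4 + i] = s[i];
  }
  return r;
}

// Usage: every element of the result is the element-wise sum.
inline void add16Example() {
  Float16 a{}, b{};
  for (std::size_t i = 0; i < 16; ++i) {
    a[i] = static_cast<float>(i);
    b[i] = 100.0f;
  }
  const Float16 r = add16ViaFloat4(a, b);
  for (std::size_t i = 0; i < 16; ++i) assert(r[i] == a[i] + 100.0f);
}
```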

> >      A. It does not handle packed arguments to functions

> but A is a bit annoying even though I mostly pass pointers to packed
> types anyway. Can you elaborate a bit on what the problem is with this?
> I have calls going back into our code by adding mappings to the JIT, but
> I'm not sure if I can get it to call functions with R32x4 (<4 x float>)
> args without making a wrapper that takes a pointer.

The basic problem is that a FunctionPass, like lower packed, cannot change
the prototype of the function it is running on, so you can't change
void foo(<4 x float>) into void foo(float, float, float, float).

Like you said, passing a pointer is a work-around. A better solution
would be to implement insert/extract operations, and require all code
generators to support passing Packed types by value, but only require them
to implement the insert/extract operations; all other ops would be
lowered. This is reasonable because the insert/extract ops can be
implemented as simple memory copies.
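
Both ideas in this reply can be sketched in plain C++, under the assumption that a packed value is just four contiguous floats in memory. All names here (`Float4`, `extractElement`, `insertElement`, `sumByValue`, `sumByPointer`) are hypothetical and are not LLVM APIs. The first two functions show insert/extract done as simple memory copies; the last shows the pointer-taking wrapper that JIT-compiled code could call in place of a by-value function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a <4 x float> value: four contiguous floats.
using Float4 = std::array<float, 4>;

// "extract element": copy one float out of the packed value's storage.
inline float extractElement(const Float4 &v, std::size_t idx) {
  assert(idx < v.size());
  float out;
  std::memcpy(&out, v.data() + idx, sizeof out);
  return out;
}

// "insert element": copy one float into a copy of the packed value.
inline Float4 insertElement(Float4 v, std::size_t idx, float x) {
  assert(idx < v.size());
  std::memcpy(v.data() + idx, &x, sizeof x);
  return v;
}

// Hypothetical host function that wants its packed argument by value.
inline float sumByValue(Float4 v) {
  return extractElement(v, 0) + extractElement(v, 1) +
         extractElement(v, 2) + extractElement(v, 3);
}

// The work-around: a wrapper taking a pointer, so the generated code never
// has to pass a packed value as a function argument.
inline float sumByPointer(const Float4 *v) { return sumByValue(*v); }
```

With a wrapper like this, the mapping registered with the JIT would point at the pointer-taking function rather than the by-value one.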

> > For your work, it might be most expedient to just ignore the lower packed
> > pass and add SSE support to the X86 backend: that will get you up and
> > running quickly and get you the performance you are obviously after. If
> > backwards compatibility with old hardware is an issue, revisiting the
> > lower packed pass would make sense.

> Is it easy to add intrinsics to do things like dot product of packed
> types using SSE instructions? That's probably all I need...

Yes, it's quite easy, take a look at this for some more info:
http://llvm.org/docs/ExtendingLLVM.html#intrinsic

Before you start adding a bunch of X86 specific intrinsics, please ping
the list with information (ideally in the form of a LangRef.html patch :),
about the intrinsics. While it is not necessarily a problem to have X86
specific intrinsics, we only want them for truly X86 specific operations.
I would think that dot product can be implemented successfully in multiple
different vector ISAs.
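
As a rough illustration of why a dot product need not be X86 specific, the sketch below gives one portable scalar definition and one SSE1 implementation of the same operation, selected at compile time. It uses the standard `<xmmintrin.h>` intrinsics from C++ rather than anything inside the LLVM backend; `Float4`, `dot4Scalar`, `dot4` and `dot4Example` are hypothetical names, and the `__SSE__` check assumes a GCC- or Clang-style compiler (elsewhere the scalar path is used).

```cpp
#include <array>
#include <cassert>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE1 intrinsics
#endif

using Float4 = std::array<float, 4>;  // hypothetical stand-in for <4 x float>

// Target-independent meaning of the operation: usable on any ISA.
inline float dot4Scalar(const Float4 &a, const Float4 &b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Same operation, using a 4-wide SSE multiply where it is available.
inline float dot4(const Float4 &a, const Float4 &b) {
#if defined(__SSE__)
  const __m128 prod =
      _mm_mul_ps(_mm_loadu_ps(a.data()), _mm_loadu_ps(b.data()));
  float p[4];
  _mm_storeu_ps(p, prod);            // spill the four products
  return p[0] + p[1] + p[2] + p[3];  // horizontal sum done in scalar code
#else
  return dot4Scalar(a, b);
#endif
}

// Usage: both paths agree on a small exact example.
inline void dot4Example() {
  const Float4 a = {1.0f, 2.0f, 3.0f, 4.0f};
  const Float4 b = {5.0f, 6.0f, 7.0f, 8.0f};
  assert(dot4(a, b) == dot4Scalar(a, b));
}
```

Another vector ISA would only need to replace the body of the guarded branch, which is the argument for defining the operation once in target-independent terms.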

Also, I would assume you will want simple things like add and multiply of
packed values as well. These can be added directly to X86ISelSimple.cpp,
like the intrinsics. If you have questions about that process, let me
know. :)

> > Let me know what you think. In the very short term, the hook exposed to
> > create the lower packed pass can be plunked into the X86TargetMachine and
> > get intra function packed types working for you.

> The patch you did was missing the actual implementation of
> createLowerPackedPass, so I'm including my own differences -- I guess

Sounds good, applied. Sorry for not doing it right in the first place!:
http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20041115/020968.html

> you don't want to apply the changes to X86TargetMachine as I'm the only
> one actually generating packed types, but I include it for completeness.

This should definitely go in in the future, but I'd rather wait until
packed types work 100% before doing so. Maybe under control of an
-enable-simd flag or something would work.

Speaking of flags, if you look at the top of X86TargetMachine.cpp, there
is an SSEArg command line argument that is currently #ifdef'd out. I would
appreciate it if you enabled it and used it to control the instructions
being emitted by the X86 backend (SSE1-3), if you start working on it.

Thanks!

-Chris