Upstream PTX backend that uses target independent code generator if possible

Hi there,

I have a working prototype of PTX backend, and I would like to
upstream it if possible. This backend is implemented by LLVM's target
independent code generator framework; I think this will make it easier
to maintain.

I have tested this backend to translate a work-efficient parallel scan
kernel ( http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
) into PTX code. The generated PTX code was then executed on real
hardware, and the result is correct.

So far I have to hack clang to generate bitcode for this backend, but
I will try to patch clang to parse CUDA (or OpenCL) while I am
upstreaming this backend.

I am new to LLVM. Any comments are welcome.

Regards,
Che-Liang

Hi there,

I have a working prototype of PTX backend, and I would like to
upstream it if possible. This backend is implemented by LLVM's target
independent code generator framework; I think this will make it easier
to maintain.

I have tested this backend to translate a work-efficient parallel scan
kernel ( http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
) into PTX code. The generated PTX code was then executed on real
hardware, and the result is correct.

So far I have to hack clang to generate bitcode for this backend, but
I will try to patch clang to parse CUDA (or OpenCL) while I am
upstreaming this backend.

Clang already has an OpenCL front end, but I don't know what it would
take to make it work with your backend (I'm assuming there're quite a
few intrinsics that have to be dealt with).

I am new to LLVM. Any comments are welcome.

The GPU and multiprocessor possibilities are one of the many things
that have interested me in LLVM, and it's great to see someone working
on that.

I'm rather new here myself, so really the only other question I have
is: is the code currently hosted anywhere to look at?

Hi Michael,

Sorry for the late reply. I didn't check mail last weekend.

Clang already has an OpenCL front end, but I don't know what it would
take to make it work with your backend (I'm assuming there're quite a
few intrinsics that have to be dealt with).

Cool. I will try that later.

The GPU and multiprocessor possibilities are one of the many things
that have interested me in LLVM, and it's great to see someone working
on that.

Thanks.

I'm rather new here myself, so really the only other question I have
is: is the code currently hosted anywhere to look at?

I didn't put code on the web. Sorry.

Regards.
Che-Liang

Che-Liang Chiou <clchiou@gmail.com> writes:

Hi there,

I have a working prototype of PTX backend, and I would like to
upstream it if possible. This backend is implemented by LLVM's target
independent code generator framework; I think this will make it easier
to maintain.

How does this relate, at all, to the backend here:

PTX Backend for LLVM download | SourceForge.net

If they are unrelated, can you do a comparison of the two? Perhaps
there are holes in each that can be filled by the other. It would be
a shame to have two completely different PTX backends.

I have tested this backend to translate a work-efficient parallel scan
kernel ( http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
) into PTX code. The generated PTX code was then executed on real
hardware, and the result is correct.

How much of the LLVM IR does this support? What's missing?

So far I have to hack clang to generate bitcode for this backend, but
I will try to patch clang to parse CUDA (or OpenCL) while I am
upstreaming this backend.

I think it's a lot of work to do CUDA support for not much benefit.
The OpenMP committee is working on accelerator directives and that's
the better long-term approach, IMHO. Clang/LLVM would be a great
vehicle to generate/test ideas for such directives.

http://openmp.org/wp/

                         -Dave

Hi David,

Thanks for asking.

Che-Liang Chiou <clchiou@gmail.com> writes:

Hi there,

I have a working prototype of PTX backend, and I would like to
upstream it if possible. This backend is implemented by LLVM's target
independent code generator framework; I think this will make it easier
to maintain.

How does this relate, at all, to the backend here:

PTX Backend for LLVM download | SourceForge.net

If they are unrelated, can you do a comparison of the two? Perhaps
there are holes in each that can be filled by the other. It would be
a shame to have two completely different PTX backends.

I surfed their code, and it seems that they didn't use the code generator.
That means their design should be similar to the CBackend or CPPBackend.
So I guess it can't generate some machine instructions like MAD,
and there are some PTX instruction-set features that are hard to exploit
without using the code generator.

But I didn't study their code thoroughly, so I might be wrong about this.

I have tested this backend to translate a work-efficient parallel scan
kernel ( http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
) into PTX code. The generated PTX code was then executed on real
hardware, and the result is correct.

How much of the LLVM IR does this support? What's missing?

I have to add some intrinsics, calling conventions, and address spaces.
I would say these are relatively small changes.
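
For concreteness, here is a rough, hand-written PTX sketch of what those three items correspond to (it is illustrative only and not taken from the backend; the kernel name, parameter names, and register numbering are made up): kernels use the .entry calling convention with .param arguments, thread-ID intrinsics map onto special registers such as %tid, and address spaces show up as .global/.shared/.param qualifiers on memory instructions.

    // Hypothetical fragment, PTX 2.x style; omits the .version/.target header.
    .entry scale_by_two (.param .u32 inptr, .param .u32 outptr)
    {
        .reg .u32 %r<6>;
        .reg .f32 %f<3>;

        ld.param.u32   %r1, [inptr];          // kernel parameters: the PTX "calling convention"
        ld.param.u32   %r2, [outptr];
        mov.u32        %r3, %tid.x;           // thread-id intrinsic -> %tid special register
        mul.lo.u32     %r4, %r3, 4;
        add.u32        %r5, %r1, %r4;
        ld.global.f32  %f1, [%r5];            // .global address space
        mul.f32        %f2, %f1, 0f40000000;  // scale by 2.0
        add.u32        %r5, %r2, %r4;
        st.global.f32  [%r5], %f2;            // ld.shared/st.shared would be used for the .shared space
        exit;
    }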

So far I have to hack clang to generate bitcode for this backend, but
I will try to patch clang to parse CUDA (or OpenCL) while I am
upstreaming this backend.

I think it's a lot of work to do CUDA support for not much benefit.
The OpenMP committee is working on accelerator directives and that's
the better long-term approach, IMHO. Clang/LLVM would be a great
vehicle to generate/test ideas for such directives.

Thanks for the suggestion.

http://openmp.org/wp/

                    -Dave

Regards,
Che-Liang

Hi!

Hi there,

I have a working prototype of PTX backend, and I would like to
upstream it if possible. This backend is implemented by LLVM's target
independent code generator framework; I think this will make it easier
to maintain.
      

How does this relate, at all, to the backend here:

PTX Backend for LLVM download | SourceForge.net

If they are unrelated, can you do a comparison of the two? Perhaps
there are holes in each that can be filled by the other. It would be
a shame to have two completely different PTX backends.

I surfed their code, and it seems that they didn't use the code generator.
That means their design should be similar to the CBackend or CPPBackend.
So I guess it can't generate some machine instructions like MAD,
and there are some PTX instruction-set features that are hard to exploit
without using the code generator.

But I didn't study their code thoroughly, so I might be wrong about this.
  

Yes, we don't use the target-independent code generator and the backend is based on the CBackend.
We decided to not use the code generator because PTX code is also an intermediate language. The
graphics driver contains a compiler which compiles PTX code to machine code targeting a particular
GPU architecture. It performs register allocation, instruction scheduling, dead-code elimination, and
other late optimizations. Thus we don't need most of the target-independent code generator
features in the PTXBackend.

We already support most of the PTX instruction set: texture lookups, structs & arrays, function calls, vector types,
different address spaces, and many intrinsics. Not all intrinsics are implemented yet because they are not required
by our application, but it is easy to add them. Only the fused operations (e.g. MAD) are not supported, and adding
them will probably not be as easy as in the target-independent code generator. But it might be that they are also
inserted by the graphics driver compiler. I'm not sure about that, but I remember seeing it once (indeed, it would
make sense to do that during instruction selection). Too bad that NVIDIA does not release any detailed information
on their compiler.
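
To make the fusion point concrete, a tiny hand-written comparison (register names are made up; whether the driver would fuse the unfused form on its own is, as said, unknown): a codegen-based backend can match the (fadd (fmul a, b), c) pattern during instruction selection and emit the fused instruction directly.

    // Unfused: a literal translation of separate multiply and add
    mul.f32  %f3, %f1, %f2;
    add.f32  %f4, %f3, %f0;

    // Fused: the single instruction the pattern could be selected into
    mad.f32  %f4, %f1, %f2, %f0;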

How does this relate, at all, to the backend here:

PTX Backend for LLVM download | SourceForge.net

If they are unrelated, can you do a comparison of the two? Perhaps
there are holes in each that can be filled by the other. It would be
a shame to have two completely different PTX backends.
  

I don't know much about the target-independent code generator but I think we use distinct approaches which cannot
be merged in a reasonable way. Probably both approaches have their own pros and cons.

Is there work to upstream this? I've got a relatively unused NVIDIA
card at home. :-)

                                 -Dave
  

The PTXBackend probably needs more test cases. I'm currently covering a lot of LLVM and PTX features but the test suite is still not exhaustive.
I took the coding standards into account and the license is now compatible with LLVM. I don't know what else needs to be done.

Helge

Che-Liang Chiou <clchiou@gmail.com> writes:

I surfed their code, and it seems that they didn't use the code generator.
That means their design should be similar to the CBackend or CPPBackend.
So I guess it can't generate some machine instructions like MAD,
and there are some PTX instruction-set features that are hard to exploit
without using the code generator.

But I didn't study their code thoroughly, so I might be wrong about this.

I haven't had a chance to look at it yet either.

I have tested this backend to translate a work-efficient parallel scan
kernel ( http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
) into PTX code. The generated PTX code was then executed on real
hardware, and the result is correct.

How much of the LLVM IR does this support? What's missing?

I have to add some intrinsics, calling conventions, and address spaces.
I would say these are relatively small changes.

Are you generating masks at all? If so, how are you doing that?
Similarly to how the ARM backend does predicates (handling all the
representation, etc. in the target-specific codegen)?

I've been wanting to see predicates (vector and scalar) in the
LLVM IR for a long time. Perhaps the PTX backend is an opportunity
to explore that.

                           -Dave

Helge Rhodin <helge.rhodin@alice-dsl.net> writes:

But I didn't study their code thoroughly, so I might be wrong about this.
  

Yes, we don't use the target-independent code generator and the
backend is based on the CBackend. We decided to not use the code
generator because PTX code is also an intermediate language. The
graphics driver contains a compiler which compiles PTX code to machine
code targeting a particular GPU architecture. It performs register
allocation, instruction scheduling, dead-code elimination, and other
late optimizations. Thus we don't need most of the target-independent
code generator features in the PTXBackend.

Some of these could still be useful to aid the NVIDIA compiler. But I
don't have any hard data to support that assertion. :-)

We already support most of the PTX instruction set. Texture lookup,
structs&arrays, function calls, vector types, different address spaces
and many intrinsics.

Do you generate masked operations? If so, are you managing
masks/predicates with your own target-specific representation à la the
current ARM backend?

If they are unrelated, can you do a comparison of the two? Perhaps
there are holes in each that can be filled by the other. It would be
a shame to have two completely different PTX backends.
  

I don't know much about the target-independent code generator but I
think we use distinct approaches which cannot
be merged in a reasonable way. Probably both approaches have their own
pros and cons.

Certainly.

Is there work to upstream this? I've got a relatively unused NVIDIA
card at home. :-)

                                 -Dave
  

The PTXBackend probably needs more test cases. I'm currently covering a
lot of LLVM and PTX features but the test suite is still not exhaustive.
I took the coding standards into account and the license is now
compatible with LLVM. I don't know what else needs to be done.

Checking it in. :-) Really, we probably should do some sort of code
review, but Chris would have to indicate what he wants.

                            -Dave

David A. Greene writes:

Helge Rhodin <helge.rhodin@alice-dsl.net> writes:

>> But I didn't study their code thoroughly, so I might be wrong about this.
>
> Yes, we don't use the target-independent code generator and the
> backend is based on the CBackend. We decided to not use the code
> generator because PTX code is also an intermediate language. The
> graphics driver contains a compiler which compiles PTX code to machine
> code targeting a particular GPU architecture. It performs register
> allocation, instruction scheduling, dead-code elimination, and other
> late optimizations. Thus we don't need most of the target-independent
> code generator features in the PTXBackend.

Some of these could still be useful to aid the NVIDIA compiler. But I
don't have any hard data to support that assertion. :-)

[Villmow, Micah] For the AMD backend that I work on, having these turned on is invaluable. If the NVIDIA compiler is anything like the ATI graphics compiler, it is written for speed and assumes smaller graphics kernels, but with more generic compute kernels, doing some preliminary optimizations/scheduling/allocation helps generate better code.

David A. Greene writes:

Che-Liang Chiou <clchiou@gmail.com> writes:

> I surfed their code, and it seems that they didn't use the code generator.
> That means their design should be similar to the CBackend or CPPBackend.
> So I guess it can't generate some machine instructions like MAD,
> and there are some PTX instruction-set features that are hard to exploit
> without using the code generator.
>
> But I didn't study their code thoroughly, so I might be wrong about this.

I haven't had a chance to look at it yet either.

>>> I have tested this backend to translate a work-efficient parallel scan
>>> kernel ( http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
>>> ) into PTX code. The generated PTX code was then executed on real
>>> hardware, and the result is correct.
>>
>> How much of the LLVM IR does this support? What's missing?
>
> I have to add some intrinsics, calling conventions, and address spaces.
> I would say these are relatively small changes.

Are you generating masks at all? If so, how are you doing that?
Similarly to how the ARM backend does predicates (handling all the
representation, etc. in the target-specific codegen)?

I've been wanting to see predicates (vector and scalar) in the
LLVM IR for a long time. Perhaps the PTX backend is an opportunity
to explore that.

[Villmow, Micah] From looking at the llvmptxbackend, it does not fully support vector types.
This, in my perspective, is one of the greatest benefits of the backend code generator: automatic support,
via vector splitting, for vector types in LLVM IR that are not natively supported by the target machine.
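
A rough illustration of that point (hypothetical registers, hand-written): PTX only has vector forms for memory operations, so a <4 x float> addition in LLVM IR has to be legalized into scalar arithmetic around v4 loads and stores, which is exactly the splitting the target-independent code generator performs automatically.

    // Hypothetical lowering of a <4 x float> add
    ld.global.v4.f32  {%f1, %f2, %f3, %f4}, [%r1];   // vector loads exist...
    ld.global.v4.f32  {%f5, %f6, %f7, %f8}, [%r2];
    add.f32  %f1, %f1, %f5;                          // ...but the arithmetic must be split
    add.f32  %f2, %f2, %f6;
    add.f32  %f3, %f3, %f7;
    add.f32  %f4, %f4, %f8;
    st.global.v4.f32  [%r3], {%f1, %f2, %f3, %f4};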

I think AMD's stream SDK uses LLVM already, at least I've seen some llc
invocations and some amdil backend. I don't know if it compiles down to
GPU instructions or just to some IL which is then compiled again.
Unfortunately it's not open source, and it only works with fglrx, not
Mesa, so I stopped looking further.

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=120683&%20enterthread=y

Best regards,
--Edwin

[should have looked at the From address before pressing Reply. You
obviously know this already.
Thought you were trying to develop another LLVM backend.
]

Best regards,
--Edwin

It needs to be code reviewed. Please split it into reasonable sized chunks and get them reviewed.

-Chris

My implementation of predicated instructions is similar to the ARM
backend's. I traced the ARM and PowerPC backends for reference.

If, David, you meant an implementation of predication in LLVM IR,
I didn't do that. It was partly because I was not (and am still not)
very familiar with LLVM's design, so I didn't know how to do that.

I agree with what Micah said; LLVM's code generator has vector splitting,
among many other reusable components.

Regards,
Che-Liang

"Villmow, Micah" <Micah.Villmow@amd.com> writes:

I've been wanting to see predicates (vector and scalar) in the
LLVM IR for a long time. Perhaps the PTX backend is an opportunity
to explore that.

[Villmow, Micah] From looking at the llvmptxbackend, it does not fully
support vector types. This, in my perspective, is one of the greatest
benefits of the backend code generator: automatic support, via vector
splitting, for vector types in LLVM IR that are not natively supported
by the target machine.

Actually, one doesn't really need vector types for PTX at the codegen
level unless one wants to make use of the 128-bit memory loads and
stores. This is because the programming model at the asm level is
almost entirely scalar. But we do certainly want masking capability on
scalar instructions. If the backend doesn't have that it's going to
generate horribly performing code.
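
For readers who haven't looked at PTX predication: a small hand-written sketch (labels, registers, and the surrounding code are made up) of a divergent branch versus the if-converted, predicated form a backend with mask support could emit.

    // Branchy form
    setp.lt.f32   %p1, %f1, %f2;    // %p1 declared elsewhere with .reg .pred
    @!%p1 bra     ELSE;
    add.f32       %f3, %f3, %f4;
    bra           DONE;
    ELSE:
    mov.f32       %f3, 0f00000000;
    DONE:

    // If-converted form: the same logic as predicated (masked) scalar instructions
    setp.lt.f32   %p1, %f1, %f2;
    @%p1  add.f32 %f3, %f3, %f4;
    @!%p1 mov.f32 %f3, 0f00000000;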

                              -Dave

Che-Liang Chiou <clchiou@gmail.com> writes:

My implementation of predicated instructions is similar to the ARM
backend's. I traced the ARM and PowerPC backends for reference.

Cool.

If, David, you meant an implementation of predication in LLVM IR,
I didn't do that. It was partly because I was not (and am still not)
very familiar with LLVM's design, so I didn't know how to do that.

No, I wouldn't have expected you to do that, but I think long-term we
will want to consider it.

                               -Dave

Hi,

Chris Lattner wrote:

The PTXBackend probably needs more test cases. I'm currently covering a lot of LLVM and PTX features but the test suite is still not exhaustive.
I took the coding standards into account and the license is now compatible with LLVM. I don't know what else needs to be done.
      

Checking it in. :-) Really, we probably should do some sort of code
review, but Chris would have to indicate what he wants.
    
It needs to be code reviewed. Please split it into reasonable sized chunks and get them reviewed.

-Chris
  

Are there any volunteers out there? :-) Thanks!

David A. Greene wrote:

Do you generate masked operations? If so, are you managing
masks/predicates with your own target-specific representation à la the
current ARM backend?
  

No, currently not. I only insert predicates for the conditional branch implementation. But I don't think they are that important. A divergent branch (inside one warp) is more or less the same. Still, it would be nice to have them and investigate the integration into LLVM.

[Villmow, Micah] From looking at the llvmptxbackend, it does not fully support vector types.
This, in my perspective, is one of the greatest benefits of the backend code generator: automatic support,
via vector splitting, for vector types in LLVM IR that are not natively supported by the target machine.

You are right, my backend only supports vector types for loads, stores, texture fetches, and extract-element instructions (every vector instruction PTX supports). Nothing like vector splitting is done.

Villmow, Micah wrote:

Yes, we don't use the target-independent code generator and the
backend is based on the CBackend. We decided to not use the code
generator because PTX code is also an intermediate language. The
graphics driver contains a compiler which compiles PTX code to machine
code targeting a particular GPU architecture. It performs register
allocation, instruction scheduling, dead-code elimination, and other
late optimizations. Thus we don't need most of the target-independent
code generator features in the PTXBackend.

Some of these could still be useful to aid the NVIDIA compiler. But I
don't have any hard data to support that assertion. :-)

[Villmow, Micah] For the AMD backend that I work on, having these turned on is invaluable. If the NVIDIA compiler is anything like the ATI graphics compiler, it is written for speed and assumes smaller graphics kernels, but with more generic compute kernels, doing some preliminary optimizations/scheduling/allocation helps generate better code.

I agree with you that the target-independent code generator approach has some benefits (vector splitting, maybe late optimizations) in comparison to my implementation. But I think my approach is the simpler one (at least in the short run) and it is further along. What do you think, does it make sense to put more work into my backend (code review, etc.)? The backend already gives us the opportunity to develop and test new GPU-related optimizations and improve LLVM (predicates) and Clang (address spaces). From my point of view the higher-level optimizations, like choosing the "right" address space, are more interesting and profitable. Such optimizations are not dependent on the backend implementation and could seamlessly be used with the new PTX backend once it is finished.

--Helge

Helge Rhodin <helge.rhodin@alice-dsl.net> writes:

I agree with you that the target-independent code generator approach has
some benefits (vector splitting, maybe late optimizations) in comparison
to my implementation. But I think my approach is the simpler one (at
least in the short run) and it is further along. What do you think, does
it make sense to put more work into my backend (code review, etc.)? The
backend already gives us the opportunity to develop and test new
GPU-related optimizations and improve LLVM (predicates) and Clang
(address spaces). From my point of view the higher-level optimizations,
like choosing the "right" address space, are more interesting and
profitable. Such optimizations are not dependent on the backend
implementation and could seamlessly be used with the new PTX backend
once it is finished.

--Helge

[Villmow, Micah] I think having multiple backends makes learning easier and would allow
some experimentation that other backends do not. However, I do not believe that it will
be able to handle as broad a range of inputs as a target-independent backend could.

Hi there,

Thanks to Nick for kindly reviewing the patch. Here is the link to the
source code of the PTX backend; it would help Nick review the patch.
http://lime.csie.ntu.edu.tw/~clchiou/llvm-ptx-backend.tar.gz

The source code from the above link is a working prototype. So it will
not be upstreamed as is; I will refactor and add unimplemented
features while upstreaming it. That said, the source code from the above
link
* is not guaranteed to compile on other machines,
* is not stable or bug-free, and
* should not be considered the final version for upstream.

I decided to take the code generator approach (referred to as the codegen
approach) rather than the C backend approach (referred to as the cbe approach)
for the following reasons (in fact, I had my first prototype in the cbe
approach, but later I abandoned it and rewrote it in the codegen approach).
This should partly answer the previous questions about the comparison between
the two approaches.

* LLVM should not rely on nVidia's design of its CUDA toolchain. To
my knowledge, nVidia does not make any commitment on how much
optimization would be implemented in its graphics driver compiler. A
backend with little optimization support would be in trouble if nVidia
decides to move most of the optimizer from its graphics driver compiler
to its CUDA compiler.

* nVidia's CUDA compiler has a non-trivial optimizer; this should
suggest that late optimization alone is not sufficient. If LLVM's PTX
backend is trying to provide a comparable alternative to nVidia's CUDA
compiler, the backend should have a good code optimizer. In my
experiment, the prototype PTX backend generates better optimized code
than nVidia's CUDA compiler in some cases.

* PTX is a virtual instruction set that is not designed for an
optimizer; for one thing, it is not even in SSA form (see the short
illustration after this list). So the graphics driver compiler's
optimizer might not do its job very well, and I would suggest we not
rely on its optimization.

* The codegen approach is actually simpler than the cbe approach. PTX
is mostly RISC-like, so the codegen approach can leverage most of the
*.td infrastructure and the implementations of existing, mature RISC
backends such as ARM, PowerPC, and Sparc. Besides, I guess most
developers would be more familiar with *.td than with the C backend. In
fact, it only took me two weeks to write a working prototype from
scratch -- and I had no prior experience with LLVM's codegen.

* So far my backend is less complete than other backends based on the
cbe approach, but considering the simplicity of the codegen approach, a
backend based on it should catch up with them in a short time.

* Masked operations, as well as branch folding and the like, are much
easier to implement in the codegen approach. I am not sure how much
performance improvement could be achieved from these optimizations,
but it is worth trying.
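
Regarding the SSA point above, a two-line illustration (hand-written, registers made up): PTX registers are ordinary mutable virtual registers, so values can be overwritten, and an optimizer working on PTX first has to reconstruct def-use/SSA information.

    add.f32  %f1, %f2, %f3;   // %f1 defined here...
    mul.f32  %f1, %f1, %f4;   // ...and overwritten here: the code is not in SSA form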

All in all, having implemented both, I would propose a PTX backend that
uses the codegen approach.

Regards,
Che-Liang

Che-Liang Chiou <clchiou@gmail.com> writes:

Hi there,

Thanks to Nick for kindly reviewing the patch. Here is the link to the
source code of the PTX backend; it would help Nick review the patch.
http://lime.csie.ntu.edu.tw/~clchiou/llvm-ptx-backend.tar.gz

Great!

I decided to take the code generator approach (referred to as the codegen
approach) rather than the C backend approach (referred to as the cbe approach)
for the following reasons (in fact, I had my first prototype in the cbe
approach, but later I abandoned it and rewrote it in the codegen approach).
This should partly answer the previous questions about the comparison between
the two approaches.

I think the codegen approach is the right one long-term, but I don't
necessarily agree with all of your reasons. :-)

* LLVM should not rely on nVidia's design of its CUDA toolchain. To
my knowledge, nVidia does not make any commitment on how much
optimization would be implemented in its graphics driver compiler. A
backend with little optimization support would be in trouble if nVidia
decides to move most of the optimizer from its graphics driver compiler
to its CUDA compiler.

This is true.

* nVidia's CUDA compiler has a non-trivial optimizer; this should
suggest that late optimization alone is not sufficient. If LLVM's PTX
backend is trying to provide a comparable alternative to nVidia's CUDA
compiler, the backend should have a good code optimizer. In my
experiment, the prototype PTX backend generates better optimized code
than nVidia's CUDA compiler in some cases.

LLVM will never completely replace the cuda compiler because PTX is not
the final ISA. We'll always need some piece of the cuda compiler to
translate to the metal ISA.

* PTX is a virtual instruction set that is not designed for an
optimizer; for one thing, it is not even in SSA form. So the graphics
driver compiler's optimizer might not do its job very well, and I would
suggest we not rely on its optimization.

Not being in SSA form is no problem. Converting to SSA is a well-known
transformation. LLVM IR doesn't start out in SSA either.

* The codegen approach is actually simpler than the cbe approach. PTX
is mostly RISC-like, so the codegen approach can leverage most of the
*.td infrastructure and the implementations of existing, mature RISC
backends such as ARM, PowerPC, and Sparc. Besides, I guess most
developers would be more familiar with *.td than with the C backend. In
fact, it only took me two weeks to write a working prototype from
scratch -- and I had no prior experience with LLVM's codegen.

I believe that. PTX is a really simple instruction set and quite
orthogonal.

* So far my backend is less complete than other backends based on the
cbe approach, but considering the simplicity of the codegen approach, a
backend based on it should catch up with them in a short time.

The one thing we'll have to add is mask support.

* Masked operations, as well as branch folding and the like, are much
easier to implement in the codegen approach. I am not sure how much
performance improvement could be achieved from these optimizations,
but it is worth trying.

I'm not sure why these would be easier with one model over another.
It's a lot of hand-lowering and manual optimization either way. Can you
explain?

All in all, having implemented both, I would propose a PTX backend that
uses the codegen approach.

The fact that PTX is a moving target seals the deal for me. It's really
easy to generate variants of PTX using TableGen's predicate approach.

                            -Dave