avoid live range overlap of "vector" registers

a "vector" register r0 is composed of four 32-bit floating scalar
registers, r0.x, r0.y, r0.z, r0.w.

each scalar reg can be assigned individually, e.g.

  mov r0.x, r1.y
  add r0.y, r1.x, r2.z
  
or assigned simultaneously with vector instructions, e.g.

  add r0.xyzw, r1.xzyw, r2.xyzw
  
My question is: how do I define these registers in the .td file so that
the code generator does not overlap the live ranges of vector registers?

I could write a definition for each scalar register, but it's tedious:

class FooReg<string n> : Register<n> {}

def r0_x: FooReg<"r0.x">;
def r0_y: FooReg<"r0.y">;
def r0_z: FooReg<"r0.z">;
def r0_w: FooReg<"r0.w">;
def r1_x: FooReg<"r1.x">;
def r1_y: FooReg<"r1.y">;
def r1_z: FooReg<"r1.z">;
def r1_w: FooReg<"r1.w">;
...

and there are 32 vector registers!
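If the version of TableGen in use supports the multiclass/defm feature, the repetition can be factored out so each vector register takes one line instead of four. A hedged sketch only (FooVecComponents is a made-up name, and availability of multiclass depends on the TableGen version):

```tablegen
// Hypothetical sketch: a multiclass stamps out the four scalar
// components of one vector register in a single "defm".
class FooReg<string n> : Register<n> {}

multiclass FooVecComponents<string n> {
  def _x : FooReg<!strconcat(n, ".x")>;
  def _y : FooReg<!strconcat(n, ".y")>;
  def _z : FooReg<!strconcat(n, ".z")>;
  def _w : FooReg<!strconcat(n, ".w")>;
}

defm r0 : FooVecComponents<"r0">;  // defines r0_x, r0_y, r0_z, r0_w
defm r1 : FooVecComponents<"r1">;
// ... one "defm" line per vector register, 32 in all
```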

I've read Target.td:

// RegisterGroup - This can be used to define instances of Register which
// need to specify aliases.
// List "aliases" specifies which registers are aliased to this one. This
// allows the code generator to be careful not to put two values with
// overlapping live ranges into registers which alias.
class RegisterGroup<string n, list<Register> aliases> : Register<n> {
  let Aliases = aliases;
}

but RegisterGroup seems not to be what I need.

a "vector" register r0 is composed of four 32-bit floating scalar
registers, r0.x, r0.y, r0.z, r0.w.

each scalar reg can be assigned individually, e.g.

mov r0.x, r1.y
add r0.y, r1.x, r2.z

or assigned simultaneously with vector instructions, e.g.

add r0.xyzw, r1.xzyw, r2.xyzw

My question is: how do I define these registers in the .td file so that
the code generator does not overlap the live ranges of vector registers?

If you want to access each part individually, I would suggest doing the tedious thing and including them all. The IA64 backend has 3*128 registers, so there is precedent for this...
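Following that suggestion, one way to tie the explicit scalar definitions back to their vector register is the RegisterGroup class from Target.td, so the allocator knows that r0 aliases its four components. A hedged sketch, not tested against any backend; names are illustrative:

```tablegen
// Each vector register aliases its four scalar parts, so the
// allocator will not assign overlapping live ranges to r0 and,
// say, r0_x.
class FooReg<string n> : Register<n> {}

def r0_x : FooReg<"r0.x">;
def r0_y : FooReg<"r0.y">;
def r0_z : FooReg<"r0.z">;
def r0_w : FooReg<"r0.w">;

// The vector view of r0, aliased to its components:
def r0 : RegisterGroup<"r0", [r0_x, r0_y, r0_z, r0_w]>;
```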

-Chris


_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://mail.cs.uiuc.edu/mailman/listinfo/llvmdev


Chris Lattner wrote:

You're right, that would be a better way to go. To start, I would suggest adding extract/inject intrinsics (not instructions) because it is easier. If you're interested in doing this, there is documentation for this here:

http://llvm.cs.uiuc.edu/docs/ExtendingLLVM.html
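In the IR dialect of the examples later in this thread, such extract/inject intrinsics might be declared along these lines. These are hypothetical declarations only; the names and signatures do not exist in LLVM and are purely illustrative:

```llvm
; Hypothetical sketch -- names and signatures are illustrative.
%f32v4 = type <4 x float>

; extract: read one scalar element out of a packed value
declare float %llvm.extract(%f32v4 %V, uint %Idx)

; inject: produce a new packed value with one element replaced
declare %f32v4 %llvm.inject(%f32v4 %V, float %Elt, uint %Idx)
```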

-Chris

quote <http://llvm.cs.uiuc.edu/docs/LangRef.html#intrinsics>:
"To do this, extend the default implementation of the
IntrinsicLowering class to handle the intrinsic. Code generators use
this class to lower intrinsics they do not understand to raw LLVM
instructions that they do."

but to which LLVM instructions should the extract/inject (or
shuffle/pack) intrinsics be lowered? LLVM instructions do not allow
access to the individual scalar values in a packed value.

None, that documentation is out of date and doesn't make a ton of sense for your application. I would suggest that you implement it in the context of the SelectionDAG framework that all of the code generators either currently use or are moving to. I updated the documentation here: http://llvm.cs.uiuc.edu/ChrisLLVM/docs/ExtendingLLVM.html#intrinsic

This will allow you to do something like this:

%i32v4 = type <4 x uint>

%f32v4 = type <4 x float>

declare %f32v4 %swizzle(%f32v4 %In, %i32v4 %Form)

%G = external global %f32v4

void %test() {
         %A = load %f32v4* %G
         %B = call %f32v4 %swizzle(%f32v4 %A, %i32v4 <uint 1, uint 1, uint 1, uint 1>) ;; splat XYZW -> YYYY
         store %f32v4 %B, %f32v4* %G
         ret void
}

... Except using llvm.swizzle instead of 'swizzle'.

Unfortunately the code generator currently does not support packed types, so this will require some work. However, this certainly is the closest match for your model.

-Chris

Chris Lattner wrote:

None, that documentation is out of date and doesn't make a ton of sense for your application. I would suggest that you implement it in the context of the SelectionDAG framework that all of the code generators either currently use or are moving to. I updated the documentation here: http://llvm.cs.uiuc.edu/ChrisLLVM/docs/ExtendingLLVM.html#intrinsic

This will allow you to do something like this:

%i32v4 = type <4 x uint>

%f32v4 = type <4 x float>

declare %f32v4 %swizzle(%f32v4 %In, %i32v4 %Form)

%G = external global %f32v4

void %test() {
        %A = load %f32v4* %G
        %B = call %f32v4 %swizzle(%f32v4 %A, %i32v4 <uint 1, uint 1, uint 1, uint 1>) ;; splat XYZW -> YYYY
        store %f32v4 %B, %f32v4* %G
        ret void
}

... Except using llvm.swizzle instead of 'swizzle'.

I much prefer the name chosen in the SSE instruction set: 'shuffle'

Unfortunately the code generator currently does not support packed types, so this will require some work. However, this certainly is the closest match for your model.

This work needs to be done for SSE code generation, which I think would be of interest to several people (including me) -- our front-end generates code that uses packed datatypes a lot, and I'm not entirely happy with the current situation using the LowerPacked pass... If SSE code generation were working, we would use LLVM for a lot more; at the moment we have a small runtime library with SSE-optimized functions for things like trilinear interpolation, but the LLVM optimizer can't do very much with these functions since they are just external calls.

m.

Hi,

This work needs to be done for SSE code generation, which I think would be of interest to several people (including me) -- our front-end generates code that uses packed datatypes a lot, and I'm not entirely happy with the current situation using the LowerPacked pass... If SSE code generation were working, we would use LLVM for a lot more; at the moment we have a small runtime library with SSE-optimized functions for things like trilinear interpolation, but the LLVM optimizer can't do very much with these functions since they are just external calls.

I've been working on using LLVM for compilation to vector architectures. One of the things I've been working on is a vector type (essentially an extension of the packed type to arbitrary vector lengths) with vector operations. I hope to contribute my vector-LLVM extensions to the LLVM source base, and integrate them with the packed type, by the end of the summer.

Code generation for subword-SIMD vector instructions (like SSE) is definitely on our radar screen, although we may be focusing on AltiVec.

Rob

Robert L. Bocchino Jr.
Ph.D. Student, Computer Science
University of Illinois, Urbana-Champaign

Chris Lattner wrote:

void %test() {
        %A = load %f32v4* %G
        %B = call %f32v4 %swizzle(%f32v4 %A, %i32v4 <uint 1, uint 1, uint 1, uint 1>) ;; splat XYZW -> YYYY
        store %f32v4 %B, %f32v4* %G
        ret void
}

... Except using llvm.swizzle instead of 'swizzle'.

I much prefer the name chosen in the SSE instruction set: 'shuffle'

Shuffle sounds fine to me :)

Unfortunately the code generator currently does not support packed types, so this will require some work. However, this certainly is the closest match for your model.

This work needs to be done for SSE code generation, which I think would be of interest to several people (including me) -- our front-end generates code that uses packed datatypes a lot, and I'm not entirely happy with the current situation using the LowerPacked pass... If SSE code generation were working, we would use LLVM for a lot more; at the moment we have a small runtime library with SSE-optimized functions for things like trilinear interpolation, but the LLVM optimizer can't do very much with these functions since they are just external calls.

I agree, many people are interested in it. The only question is who will step up to do it (first). ;)

-Chris