Instruction selection problem with type i64 - mistaken as v8i64?

     I am writing a back end that combines the existing BPF LLVM back end with the Mips MSA vector extensions (from the Mips back end).
     I have encountered an error when compiling with llc: the instruction selector uses a vector register where a scalar register of type i64 is expected.

     I have the following fragment of an LLVM IR program:
       vector.body.preheader: ; preds = %min.iters.checked
         br label %vector.body

       vector.body: ; preds = %vector.body.preheader, %vector.body
         %index = phi i64 [, %vector.body ], [ 0, %vector.body.preheader ]
         %vec.phi = phi <8 x i64> [ %0, %vector.body ], [ zeroinitializer, %vector.body.preheader ]

     The ASM code generated from it is the following:
LBB0_3: // %vector.body.preheader
         REGVEC0 = 0
         mov r0, 0
         std -48(r10), r0
         std -128(r10), REGVEC0
         jmp LBB0_4
LBB0_4: // %vector.body
         ldd REGVEC0, -128(r10)
         ldd r0, -48(r10)

     I am surprised that the BPF scalar instructions ldd and std use the vector register REGVEC0, which has type v8i64.
     For example, the TableGen definition of the LOAD instruction taken from is:
       class LOADi64<bits<2> SizeOp, string OpcodeStr, PatFrag OpNode>
           : LOAD<SizeOp, OpcodeStr, [(set i64:$dst, (OpNode ADDRri:$addr))]>;

     So I am surprised that the instruction selector matches operand i64:$dst to the vector register REGVEC0, which has type v8i64 as defined below (inspired by lib/Target/Mips/):
       def MSA128D: RegisterClass<"Connex", [v8i64], 512,
                            (sequence "Wd%u", 0, 31)>;
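
     The expectation behind this question can be sketched as a toy model in C++. This is not LLVM code: ValueType, RegClassModel, classHoldsType and the two helper functions are hypothetical names made up for illustration, and the scalar class name "GPR" is an assumption, since the scalar register class is not shown in this post. The model only captures the idea that a register class lists the value types it may hold, so a class declared with [v8i64] should not be chosen for an i64 operand.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy stand-ins for LLVM value types; not the real MVT enum.
enum class ValueType { i64, v8i64 };

// Toy model of a TableGen RegisterClass: a name plus the value types
// that registers of this class are declared to hold.
struct RegClassModel {
    std::string name;
    std::vector<ValueType> legalTypes;
};

// Returns true if a value of type vt may live in a register of class rc.
bool classHoldsType(const RegClassModel &rc, ValueType vt) {
    return std::find(rc.legalTypes.begin(), rc.legalTypes.end(), vt) !=
           rc.legalTypes.end();
}

// Hypothetical scalar class (the name "GPR" is an assumption).
RegClassModel scalarGPR() { return {"GPR", {ValueType::i64}}; }

// Models the MSA128D definition above, which lists only v8i64.
RegClassModel msa128D() { return {"MSA128D", {ValueType::v8i64}}; }
```

     Under this model, classHoldsType(msa128D(), ValueType::i64) is false, which is why seeing REGVEC0 as the destination of ldd looks wrong to me.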

     Can anybody suggest what I can do to fix this problem?

     Below are a few possibly useful lines from the output of llc, related to the instruction selection and register allocation of the above piece of code:
===== Instruction selection ends:
Selected selection DAG: BB#3 'foo:vector.body.preheader'
SelectionDAG has 11 nodes:
   t0: ch = EntryToken
         t1: i64 = MOV_ri TargetConstant:i64<0>
       t3: ch = CopyToReg t0, Register:i64 %vreg23, t1
         t11: v8i64 = VLOAD_D TargetConstant:i64<0>
       t6: ch = CopyToReg t0, Register:v8i64 %vreg24, t11
     t8: ch = TokenFactor t3, t6
   t9: ch = JMP BasicBlock:ch<vector.body 0xa61440>, t8


Spilling live registers at end of block.
Spilling %vreg31 in %R0 to stack slot #5
Spilling %vreg32 in %Wd0 to stack slot #6
BB#3: derived from LLVM BB %vector.body.preheader
     Predecessors according to CFG: BB#2
         %Wd0<def> = VLOAD_D 0
         %R0<def> = MOV_ri 0
         STD %R0<kill>, <fi#5>, 0
         STD %Wd0<kill>, <fi#6>, 0
         JMP <BB#4>
     Successors according to CFG: BB#4(0)


>> JMP <BB#5>
Regs: R0 R1=%vreg31* R2=%vreg0 Wd0=%vreg32* Wd1
<< JMP <BB#5>
Spilling live registers at end of block.
Spilling %vreg31 in %R1 to stack slot #5
Spilling %vreg32 in %Wd0 to stack slot #6
BB#4: derived from LLVM BB %vector.body
     Predecessors according to CFG: BB#3 BB#4
         %Wd0<def> = LDD <fi#6>, 0
         %R0<def> = LDD <fi#5>, 0
         INLINEASM <es:int index;
for (index = 0; index < N - (N % 8); index += 8) {.
         EXECUTE_IN_ALL(> [sideeffect] [attdialect]
         INLINEASM <es:connex->writeDataToArray(&C[index], /*numVectors*/ 1, /*offset*/ 3);> [sideeffect] [attdialect]
         %Wd1<def> = LD_D 3; mem:LD64[inttoptr (i64 3 to <8 x i64>*)](align=8)
         %Wd0<def> = ADDV_D %Wd1<kill>, %Wd0<kill>
         INLINEASM <es: );.
   connex->executeKernel(TEST_PREFIX + to_string((long long int)BatchNumber));


BB#6: derived from LLVM BB %for.body.preheader8
     Predecessors according to CFG: BB#1 BB#2 BB#5
         %R0<def> = LDD <fi#3>, 0
         %R1<def> = MOV_ri 0
         STD %R0<kill>, <fi#7>, 0
         STD %R1<kill>, <fi#8>, 0
         JMP <BB#7>
     Successors according to CFG: BB#7(0)

   Thank you,


I vaguely remember hitting something like this when I was implementing MSA. IIRC, there was an optimization (in DAGCombine or somewhere around there) that was folding CopyToReg instructions into the load without checking whether the new register class was acceptable. I remember adding a target hook to limit this optimization based on the EVTs involved, but I'm not sure whether that's the patch I upstreamed or just an initial attempt at fixing it. I had a quick look for a likely hook in the Mips backend and couldn't find one, so I'm probably remembering an initial attempt.
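
The kind of guard described above can be sketched as a toy C++ model. This is not LLVM code and not the actual hook: ValueType, isFoldIntoRegLegal and shouldFoldCopyIntoLoad are hypothetical names, and the rule "fold only when the types match" is an assumption standing in for whatever a real target would check.

```cpp
// Toy stand-ins for LLVM value types; not the real EVT/MVT classes.
enum class ValueType { i64, v8i64 };

// Hypothetical target hook: may a load producing loadVT write directly
// into a register that holds regVT? Here: only if the types match.
bool isFoldIntoRegLegal(ValueType loadVT, ValueType regVT) {
    return loadVT == regVT;
}

// Decide whether to fold a copy into the load that feeds it.
// checkTypes == false models the unchecked optimization.
bool shouldFoldCopyIntoLoad(ValueType loadVT, ValueType regVT,
                            bool checkTypes) {
    if (!checkTypes)
        return true;  // unchecked: folds even across register classes
    return isFoldIntoRegLegal(loadVT, regVT);
}
```

Without the check, an i64 load is folded into a v8i64 register; with the check, that fold is refused and only same-type folds go through.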

Hello, Daniel,
     I was about to argue that, since it works in Mips and not in BPF, it has to be related to the back end code and not the common source code. But I just updated my local LLVM to the latest 3.9 version from the beginning of July and the bug disappeared - so I guess somebody fixed this problem in the common code, probably around llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (I didn't manage to ).
     If you know exactly what the change is, please point me to it.

   Thank you,