Prologue and epilogue for vectorized code

Hello.
     I'd like to generate a kind of prologue and epilogue for a code block that is produced by the LLVM loop vectorizer and runs on a SIMD architecture. My SIMD processor receives data from the CPU via DMA transfer and sends data out via DMA transfer or a FIFO.
    I need the prologue and epilogue precisely for these transfers. They are relatively simple, e.g. a call to a function like TransferViaDMA().
    Although it doesn't seem very difficult, I'm curious what the best way to do it is.

    I haven't found anyone who has written a prologue and epilogue for vector code (produced by the loop vectorizer). It shouldn't be very different from the prologue and epilogue of a function call, but I'd still like to know whether there is an established approach.
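
    To make "prologue and epilogue" concrete, here is a minimal host-side sketch in C++. It is only a model under stated assumptions: the signature of TransferViaDMA(), the simdLocalMemory buffer, and the runVectorKernel() stand-in are all hypothetical, and the "DMA transfer" is just a memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical model of the SIMD unit's local memory (1024 floats).
// A real target would expose this through its own driver or runtime.
static std::vector<float> simdLocalMemory(1024);

// Hypothetical stand-in for TransferViaDMA(): a real version would
// program a DMA engine; this model simply copies the bytes.
void TransferViaDMA(void *dst, const void *src, std::size_t sizeBytes) {
    std::memcpy(dst, src, sizeBytes);
}

// Stand-in for the vectorized code block: doubles every element
// that currently sits in the SIMD unit's local memory.
void runVectorKernel(float *local, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        local[i] *= 2.0f;
}

// Prologue, vector code, epilogue, in that order.
void runWithTransfers(float *hostData, std::size_t n) {
    assert(n <= simdLocalMemory.size());
    const std::size_t sizeBytes = n * sizeof(float);

    // Prologue: copy the input from CPU memory to SIMD local memory.
    TransferViaDMA(simdLocalMemory.data(), hostData, sizeBytes);

    // The code block that would run on the SIMD unit.
    runVectorKernel(simdLocalMemory.data(), n);

    // Epilogue: copy the results from SIMD local memory to CPU memory.
    TransferViaDMA(hostData, simdLocalMemory.data(), sizeBytes);
}
```

    Calling runWithTransfers(data, n) on a host array doubles its elements, with every access by the "vector code" going through the modeled local memory rather than the host array.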

    Please let me know what you recommend.

  Thank you,
    Alex

Hello.
     I'm coming back to this question, rephrased a bit. I guess the question is also relevant to the NVPTX LLVM back end, should it one day automatically generate code for both the CPU and the NVIDIA device, together with the memory transfers done with cudaMemcpy().

     Given LLVM scalar and vector code, I want to generate code for both the scalar CPU and my research Connex SIMD unit. The CPU and the SIMD unit have different memory spaces, so we need to perform a memory transfer from the CPU to the Connex SIMD unit, via DMA, to "synchronize" the two memories.

     Therefore, somewhere on the way to code generation, I need to add to the LLVM code containing vector instructions a call to a function that performs the memory transfer from the CPU to the Connex SIMD unit. More precisely, for the LLVM code below (obtained from LLVM's opt tool):
       ...
       %8 = getelementptr inbounds [10000 x float], [10000 x float]* @A, i64 0, i64 %7
       %9 = bitcast float* %8 to <32 x float>*
       %wide.load = load <32 x float>, <32 x float>* %9, align 4
       [more...]
     on the CPU I want to add a call to an external function writeDataToArray(), like this:
         ...
         %8 = getelementptr inbounds [10000 x float], [10000 x float]* @A, i64 0, i64 %7
         %9 = bitcast float* %8 to <32 x float>*
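         ; 2nd argument: transfer size in bytes (32 floats x 4 bytes = 128)
         ; 3rd argument: offset at which to write in the local memory of the SIMD unit
         ; (the i32 argument types are illustrative)
         call void @writeDataToArray(<32 x float>* %9, i32 128, i32 0)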
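       and then run only the following code on the SIMD unit:
         ; the pointer type matches the vector load; address 0 is the local-memory offset used above
         %newVar = getelementptr inbounds <32 x float>, <32 x float>* inttoptr (i64 0 to <32 x float>*), i64 0
         %dst = load <32 x float>, <32 x float>* %newVar, align 4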
         [more...]
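
     To illustrate the intended semantics of the call, here is a minimal host-side model in C++. It is a sketch under assumptions, not the real Connex runtime: the parameter types of writeDataToArray(), the connexLocalMemory buffer and its size, and the helpers readLocalFloat() and transferVectorChunk() are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// One <32 x float> vector occupies 32 * sizeof(float) bytes,
// which is 128 on targets with 4-byte floats.
constexpr std::size_t kVectorLanes = 32;
constexpr std::size_t kTransferBytes = kVectorLanes * sizeof(float);

// Assumed size of the SIMD unit's local memory, in bytes.
constexpr std::size_t kLocalMemoryBytes = 1024;

// Hypothetical model of the local memory of the SIMD unit.
static unsigned char connexLocalMemory[kLocalMemoryBytes];

// Model of writeDataToArray(): copy sizeBytes bytes from CPU memory
// to the given byte offset in the SIMD unit's local memory.
// A real implementation would start a DMA transfer instead.
void writeDataToArray(const void *src, std::size_t sizeBytes,
                      std::size_t offset) {
    assert(offset + sizeBytes <= kLocalMemoryBytes);
    std::memcpy(connexLocalMemory + offset, src, sizeBytes);
}

// Helper for inspection: read one float at a byte offset of the
// modeled local memory.
float readLocalFloat(std::size_t offset) {
    assert(offset + sizeof(float) <= kLocalMemoryBytes);
    float value;
    std::memcpy(&value, connexLocalMemory + offset, sizeof value);
    return value;
}

// Mirrors the IR above: transfer the 32 floats starting at A[index]
// to offset 0, where the SIMD-side vector load expects them.
void transferVectorChunk(const float *A, std::size_t index) {
    writeDataToArray(A + index, kTransferBytes, 0);
}
```

     After transferVectorChunk(A, i), the 32 floats A[i] .. A[i+31] sit at byte offsets 0 .. 127 of the modeled local memory, which is where the load from address 0 on the SIMD side would read them.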

     Should I insert this function call in LLVM's llvm/lib/Transforms/Vectorize/LoopVectorize.cpp, in the method:
        /// Vectorize Load and Store instructions,
        virtual void vectorizeMemoryInstruction(Instruction *Instr) ?
     Or should I do it in a separate LLVM pass, or maybe in the back end?

   Thank you,
     Alex