1. Motivation
View-style operations are common in deep learning algorithms, especially in domain-specific layers such as bounding-box and anchor computations in detection (e.g., an offset2bbox function) and agent-environment interaction in reinforcement learning. The output of a view-style operation shares the same physical memory as its input; the output is therefore called a view of that physical memory. Typical examples are NumPy-style basic indexing operations and deep learning operations such as reshape, squeeze, and unsqueeze. In fact, in-place operations can be regarded as a special form of view behavior, whose input and output operands not only share the same physical memory but also have the same memory layout.
1.1 Related Work
However, mainstream deep learning compilers, such as TVM, Halide, TorchScript, XLA, and various MLIR dialects, offer limited support for view-style operations. These works take rigid tensor descriptions as input, and translating high-level programs (e.g., the offset2bbox example mentioned above) into such descriptions requires nontrivial effort, since the tensor description must be in SSA form. In this proposal, we first show that the traditional static-single-assignment compilation technique for scalars cannot handle tensor-level view operations, due to their complex memory-layout and read-write relationships. We then design an intermediate representation that captures the layout and read-write relationships among multiple views of the same physical memory, and present an efficient tensor-level SSA resolution algorithm that avoids redundant memory accesses while preserving correctness. Finally, we present ViewDialect, built on the LLVM MLIR compiler infrastructure, and perform a thorough evaluation.
1.2 Motivational Example
We use the example in the following listing. The input is a 2-dimensional tensor. Let us walk through the function body. Lines 3 and 4 are two indexing operations on tensor A; tensors B and C do not allocate new memory. In line 5, tensor D is materialized in a new memory location. Then, line 6 adds the constant 2 to C in place. Because C shares memory with A, the mutation is applied to A and could, in principle, also affect B. In this case, B and C map to disjoint regions of A's memory, so the update in line 6 does not influence B. Line 7 produces a new memory location for tensor E, which equals A + D, where A has already been partially updated by line 6. We can see that tensor-level multiple assignments are more complex than their scalar-level counterparts because of the complex memory layouts and read-write sequences. This motivates us to propose the tensor-level SSA ViewDialect.
import torch
def view_example(A):
    B = A[1,:]
    C = A[2,:]
    D = B + C
    C += 2
    E = A + D
    return E
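To make the aliasing concrete, here is a small runnable check that reuses the listing above; the input values are arbitrary and chosen only for illustration:

A = torch.arange(12, dtype=torch.float32).reshape(3, 4)
E = view_example(A)
# The in-place update `C += 2` wrote through the view, so row 2 of A itself changed:
print(A[2])  # tensor([10., 11., 12., 13.]) instead of the original tensor([ 8.,  9., 10., 11.])
# D was computed before the write, while E = A + D uses the already-updated A.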
2. Rationale
Each view has a tuple (dims, strides, offset) that expresses its position relative to the physical memory: dims is a vector storing the size of the view in each dimension, strides is a vector storing, for each dimension, the step length in physical memory between adjacent elements of the view, and offset is a scalar giving the position of the first element of the view in the physical memory. Naturally, a tensor can also be regarded as a view of itself. Given an N-dimensional tensor with shape (S1, S2, …, SN) in row-major layout, the default tuple is dims = (S1, S2, …, SN), strides = (S2·S3···SN, S3···SN, …, SN, 1), and offset = 0. For more details, please refer to NumPy indexing.
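As a minimal illustration (not part of the dialect itself), the default tuple of a contiguous row-major tensor can be computed as follows:

def default_view_tuple(shape):
    # Row-major (C-order) defaults: dims = shape, offset = 0,
    # strides[i] = product of the sizes of all dimensions after i.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return list(shape), strides, 0

print(default_view_tuple((3, 4, 5)))  # ([3, 4, 5], [20, 5, 1], 0)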
In fact, multiple view operations can be composed. For example, let A be a tensor with shape (3,4,5). B = A[1][2] is equivalent to C = A[1] followed by B = C[2], so the view tuple of B can be obtained in two steps: first compute the view tuple of C, and from it compute the view tuple of B. Next, we show how the view tuple is used, again with tensor A of shape (3,4,5). Consider B = A[:, 1:4:2, :]. Applying the rules above (dimensions 0 and 2 are kept unchanged, dimension 1 takes every second element starting at index 1, so the view starts at offset 1·5 = 5), the resulting view tuple is ([3,2,5], [20,10,1], 5).
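This tuple can be cross-checked against PyTorch's own view bookkeeping, since Tensor.stride() reports element strides and Tensor.storage_offset() the element offset:

import torch

A = torch.arange(60).reshape(3, 4, 5)   # contiguous: default tuple ([3, 4, 5], [20, 5, 1], 0)
B = A[:, 1:4:2, :]
print(list(B.shape), B.stride(), B.storage_offset())
# [3, 2, 5] (20, 10, 1) 5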
In summary, the view attribute has the following three features.
- The memory layouts of the input and output may differ even though they share the same physical memory; the view tuple captures this relationship.
- A single piece of physical memory may have multiple views, and view operations can be composed: a view of physical memory can itself derive a new view, as shown in the following Figure (a). Essentially, all of them are views of the same physical memory, organized like a disjoint-set data structure, as shown in Figure (b).
- A mutation of a single view is reflected in the physical memory and thus may influence the other views, as shown in Figure (c).
3. TensorSSA
3.1 IR for View
We use a ViewNode data structure to record the memory layout information as follows:
struct ViewNode {
    Tensor *PhysicalMem;   // pointer to the underlying physical memory
    uint64_t *Shape;       // view tuple: per-dimension sizes (dims)
    uint64_t *Stride;      // view tuple: per-dimension strides
    uint64_t Offset;       // view tuple: offset of the first element
};
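For illustration only, here is a hypothetical Python analogue of ViewNode, populated with the nodes the motivational example would create for B = A[1,:] and C = A[2,:] on a contiguous 3x4 tensor (the values follow the stride and offset rules of Section 2):

from dataclasses import dataclass

@dataclass
class PyViewNode:
    physical_mem: object  # handle to the underlying storage (here just a name)
    shape: tuple          # view tuple: shape
    stride: tuple         # view tuple: strides
    offset: int           # view tuple: offset

# Default tuple of a contiguous 3x4 tensor A: ([3, 4], [4, 1], 0)
node_B = PyViewNode(physical_mem="A", shape=(4,), stride=(1,), offset=4)  # B = A[1, :]
node_C = PyViewNode(physical_mem="A", shape=(4,), stride=(1,), offset=8)  # C = A[2, :]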
Next, we focus on the read-write relationships between multiple views and the physical memory. We categorize view operations into two types: read (lines 3 and 4 in the motivational example) and write (line 6 in the motivational example).
- Read. We design an Access operator: A_access = Access(A, view_tuple). It reads the elements of A described by view_tuple.
- Write. We design an Assign operator: A_assign = Assign(A, B, view_tuple). It writes B into the region of A described by view_tuple and yields the updated tensor.
To preserve the static single assignment property, A_access and A_assign both denote new memory locations rather than aliases of A.
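As a rough reference for the intended semantics (a sketch, not the actual ViewDialect implementation), Access and Assign can be modeled in PyTorch with as_strided, clone, and copy_:

import torch

def access(src, shape, stride, offset):
    # Read the elements of `src` selected by the view tuple and materialize
    # them in a fresh tensor, so the result does not alias `src`.
    return torch.as_strided(src, shape, stride, offset).clone()

def assign(src, value, shape, stride, offset):
    # Return a new tensor equal to `src` with the region selected by the
    # view tuple overwritten by `value`; `src` itself is left unmodified.
    dst = src.clone()
    torch.as_strided(dst, shape, stride, offset).copy_(value)
    return dst

The clone calls only make the value semantics explicit; in the actual lowering, the Step 2 optimizations described below remove redundant accesses and copies.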
3.2 Lowering Algorithm
We have presented the IR designed for view operations. In this section, we introduce the lowering algorithm. The input is an operation list that contains both view operations and other operations; the output is an operation list without the view attribute. The algorithm has two steps: the first is an SSA converter, and the second consists of two optimizations.
- Step 1. We visit the operations in the input list one by one and, depending on whether an operation has the view attribute, handle it with one of the following methods (a worked sketch of this step on the motivational example follows this list).
  - View read operation. When we encounter the first view operation, we create a ViewTree whose root node is the source physical memory. For each view read operation, we first add a node and the corresponding edge to the ViewTree, and then append an Access operation to the output operation list.
  - View write operation. A view write updates the physical memory and therefore also the values of the related views. When we encounter a view write operation, we first add a view node and an edge to the view disjoint-set. To handle the influence of the in-place mutation on the other views, we first perform a pass-up assign process by adding an Assign operation on the source tensor, and then a pass-down access process that broadcasts the update to all views by adding the corresponding Access operations.
  - Operation without the view attribute. We replace its operands with their latest SSA versions and append the operation to the output operation list.
- Step 2. After Step 1, the output operation sequence is in static single assignment form. We then apply two optimizations.
  - Access/Assign removal. The pass-down access process adds Access operations for all other views. Using liveness analysis and memory-layout overlap analysis, we perform dead code elimination and remove the unnecessary Access operations.
  - Access/Assign merging. Based on use-define chain analysis, an Access or Assign operation can be merged with the operation that uses its result, which inlines the computation into the same loop nest.
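As promised above, here is a hand-worked sketch of Steps 1 and 2 applied to the motivational example, written with the access/assign model from Section 3.1; the view tuples are spelled out for a 3x4 input, and this is an illustration of the algorithm rather than actual compiler output:

import torch

A0 = torch.arange(12, dtype=torch.float32).reshape(3, 4)
view_row1 = ((4,), (1,), 4)   # view tuple of A[1, :]
view_row2 = ((4,), (1,), 8)   # view tuple of A[2, :]

# Step 1: SSA conversion. No operation below mutates its operands.
B0 = access(A0, *view_row1)           # B = A[1, :]      (view read)
C0 = access(A0, *view_row2)           # C = A[2, :]      (view read)
D0 = B0 + C0                          # D = B + C
A1 = assign(A0, C0 + 2, *view_row2)   # C += 2: pass-up assign on the source tensor
B1 = access(A1, *view_row1)           # pass-down access for view B
C1 = access(A1, *view_row2)           # pass-down access for view C
E0 = A1 + D0                          # E = A + D, using the latest version A1
# E0 equals view_example(A0.clone()) from Section 1.2, but without any in-place
# mutation of A0.

# Step 2: B1 and C1 have no later uses (and B does not overlap the written
# region), so Access/Assign removal deletes them; Access/Assign merging can
# then inline the remaining access/assign into the surrounding loop nests.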
4. Implementation of ViewDialect
4.1 Operation Definition Specification
In ViewDialect, we implement five operations, which can be divided into three groups as follows:
- ViewOnTensorOp and CopyOnTensorOp. They represent the view syntax at the MLIR front end and serve as the input dialect. The Operation Definition Specification (ODS) of these two operations is as follows:
def view_ViewOnTensorOp : View_Op<"view"> {
let arguments = (ins
AnyTensor: $source,
I64ArrayAttr: $shape,
I64ArrayAttr: $stride,
I64Attr: $offset
);
let results = (outs
AnyTensor: $ref
);
}
def view_CopyOnTensorOp : View_Op<"copy"> {
let arguments = (ins
AnyTensor: $from,
AnyTensor: $to
);
}
- AccessOnTensorOp and AssignOnTensorOp. They represent the SSA operations proposed in Section 3.1. Their ODS is as follows:
def view_AccessOnTensorOp : View_Op<"access",
[NoSideEffect]> {
let arguments = (ins
AnyTensor: $source,
I64ArrayAttr: $shape,
I64ArrayAttr: $stride,
I64Attr: $offset
);
let results = (outs
AnyTensor: $access
);
}
def view_AssignOnTensorOp : View_Op<"assign",
[NoSideEffect]> {
let arguments = (ins
AnyTensor: $source,
AnyTensor: $assignee,
I64ArrayAttr: $shape,
I64ArrayAttr: $stride,
I64Attr: $offset
);
let results = (outs
AnyTensor: $result
);
}
- LinkOp. It represents the edges in the view disjoint-set.
def View_LinkOp : View_Op<"link"> {
let arguments = (ins
AnyTensor: $source,
AnyTensor: $ref,
I64ArrayAttr: $shape,
I64ArrayAttr: $stride,
I64Attr: $offset
);
}
4.2 Lowering Pipeline
In ViewDialect, we implement the lowering algorithm from Section 3.2. First, ViewOnTensorOp and CopyOnTensorOp are converted into AccessOnTensorOp and AssignOnTensorOp, while LinkOp records a snapshot of the disjoint-set. After this lowering, the output operations are in SSA form, and fusion passes can be applied to them at the linalg level. Next, linalg operations are lowered to the scf dialect, then to the MLIR NVVM dialect, and finally the NVVM dialect is translated to PTX or a cubin.
5. Evaluation
The evaluation environment is Python 3.6, an NVIDIA GeForce GTX 1660 (6 GB) GPU, and CUDA 10.1. We implement a simple Python CUDA runtime wrapper to run the PTX code generated by MLIR, and we compare TensorSSA implemented in MLIR with PyTorch (v1.7.0) and TorchScript (shipped with PyTorch) on representative operators from recent deep learning models. ViewDialect achieves consistent speedups over these state-of-the-art frameworks: 11.23X over PyTorch and 5.26X over TorchScript. The performance benefits come from two main aspects. First, our algorithm converts multiple view ops into a single access op, so no unnecessary memory-assignment statements are generated. Second, since there are no implicit assignments after TensorSSA, view ops and element-wise compute ops can be fused into a single kernel.
Because I am a new user, I can only embed one media item in this post. More details can be found in our paper: Overleaf, Online LaTeX Editor