Hello,
My work deals with non-temporal loads and stores. I found the !nontemporal metadata in the LLVM documentation, but it does not show up in the IR I generate.
How do I get the non-temporal metadata emitted?
llvm\test\CodeGen\X86\nontemporal-loads.ll shows how to create non-temporal vector loads in IR. Is that what you're after?
Simon.
Actually, I am working on a vector accelerator that will execute the instructions marked as non-temporal.
For instance, if I have this loop:
for(i=0;i<2048;i++)
a[i]=b[i]+c[i];
it currently emits the following IR:
%0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
%1 = bitcast i32* %0 to <16 x i32>*
%wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1
%8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
%9 = bitcast i32* %8 to <16 x i32>*
%wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1
%16 = add nsw <16 x i32> %wide.load14, %wide.load
%20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
%21 = bitcast i32* %20 to <16 x i32>*
store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1
However, I want it to emit the following IR:
%0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
%1 = bitcast i32* %0 to <16 x i32>*
%wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1, !nontemporal !1
%8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
%9 = bitcast i32* %8 to <16 x i32>*
%wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1, !nontemporal !1
%16 = add nsw <16 x i32> %wide.load14, %wide.load, !nontemporal !1
%20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
%21 = bitcast i32* %20 to <16 x i32>*
store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1, !nontemporal !1
so that I can offload the load, add, and store to the accelerator hardware. Is that possible here? Do I need a separate pass to detect whether the loop has non-temporal data, or will Polly help here? What do you think?
From C/C++ you just need to use the __builtin_nontemporal_store/__builtin_nontemporal_load builtins to tag the stores/loads with the nontemporal flag.
for(i=0;i<2048;i++) {
__builtin_nontemporal_store( __builtin_nontemporal_load(b+i) + __builtin_nontemporal_load(c + i), a + i );
}
There may be an attribute you can tag pointers with instead, but I don’t know of one offhand.
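For reference, here is a minimal, self-contained sketch of the loop above wrapped in a function. The names vector_sum_nt, NT_LOAD and NT_STORE are made up for illustration. The builtins are a Clang extension, so the sketch assumes Clang and falls back to plain loads and stores on compilers that do not report them through __has_builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Older compilers may not provide __has_builtin at all. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* NT_LOAD / NT_STORE are hypothetical helper macros for this sketch.
   With Clang they map to the nontemporal builtins; elsewhere they
   degrade to ordinary memory accesses. */
#if __has_builtin(__builtin_nontemporal_load) && \
    __has_builtin(__builtin_nontemporal_store)
#define NT_LOAD(p) __builtin_nontemporal_load(p)
#define NT_STORE(v, p) __builtin_nontemporal_store((v), (p))
#else
#define NT_LOAD(p) (*(p))
#define NT_STORE(v, p) (*(p) = (v))
#endif

/* Computes a[i] = b[i] + c[i], tagging every access as nontemporal
   when the builtins are available. */
void vector_sum_nt(int *a, const int *b, const int *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        NT_STORE(NT_LOAD(b + i) + NT_LOAD(c + i), a + i);
    }
}
```

Compiling this with clang -O2 -S -emit-llvm should let you check whether the loads and stores in the IR carry !nontemporal metadata; whether the vectorized form keeps it may depend on the Clang version and target.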
I have already seen the usage of __builtin_nontemporal_store, but I want to automate the identification of non-temporal loads and stores. I think I need to write a pass. Is it possible to detect non-temporal loops without Polly?
Yes, but we don’t have anything that does that right now. The cost modeling is non-trivial, however. In your loop, which of those accesses would you expect to be nontemporal? Each of those arrays spans only 8 KB (2048 x 4 bytes), and that’s certainly smaller than many L1 caches. Turning those into nontemporal accesses could certainly lead to a performance regression for that loop, subsequent code, or both. If we do this more generally, I suspect that we’d need to split the loop so that small trip counts don’t use nontemporal accesses at all, and so that for larger trip counts we don’t disturb data-reuse opportunities that would otherwise exist.
-Hal
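To make the loop-splitting idea concrete, here is a rough sketch in C of a trip-count-based dispatch. Everything in it is an assumption for illustration: the function names, the NT_LOAD/NT_STORE macros, and especially the threshold, which a real cost model would have to derive from the target's cache hierarchy and from the surrounding code's data reuse. It is not something LLVM does today.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical helper macros: Clang builtins when available,
   plain accesses otherwise. */
#if __has_builtin(__builtin_nontemporal_load) && \
    __has_builtin(__builtin_nontemporal_store)
#define NT_LOAD(p) __builtin_nontemporal_load(p)
#define NT_STORE(v, p) __builtin_nontemporal_store((v), (p))
#else
#define NT_LOAD(p) (*(p))
#define NT_STORE(v, p) (*(p) = (v))
#endif

/* Bytes touched by a[i] = b[i] + c[i] over n iterations (three arrays).
   Ignores overflow for extremely large n. */
size_t footprint_bytes(size_t n) {
    return 3 * n * sizeof(int);
}

/* Assumed policy: use nontemporal accesses only when the working set
   exceeds a caller-supplied threshold, e.g. a last-level cache size. */
int should_use_nontemporal(size_t n, size_t threshold_bytes) {
    return footprint_bytes(n) > threshold_bytes;
}

/* Two versions of the loop, selected at run time by trip count. */
void vector_sum_split(int *a, const int *b, const int *c, size_t n,
                      size_t threshold_bytes) {
    assert(a != NULL && b != NULL && c != NULL);
    if (should_use_nontemporal(n, threshold_bytes)) {
        for (size_t i = 0; i < n; i++)
            NT_STORE(NT_LOAD(b + i) + NT_LOAD(c + i), a + i);
    } else {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];   /* small case: keep data in cache */
    }
}
```

With 4-byte ints, the 2048-element loop touches 3 x 8 KB = 24 KB in total, so any threshold around the size of a typical L1 or L2 cache or larger would keep it on the ordinary path. Note that this only captures the size aspect; it says nothing about whether later code reuses the data.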
Thank you.
If I execute the same vector-sum code with a much larger number of iterations, say 100000000000, will the non-temporal loads and stores be effective?