Eliminating redundant loads

Hi,

I am generating the following code:

  %base = getelementptr inbounds %ravi.CallInfo* %6, i32 0, i32 4, i32 0
  %7 = load %ravi.TValue** %base
  %8 = bitcast %ravi.TValue* %7 to i8*
  %9 = bitcast %ravi.TValue* %5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %8, i8* %9, i32 16, i32 8, i1 false)
  %10 = load %ravi.CallInfo** %L_ci
  %base1 = getelementptr inbounds %ravi.CallInfo* %10, i32 0, i32 4, i32 0
  %11 = load %ravi.TValue** %base1
  %12 = getelementptr inbounds %ravi.TValue* %11, i32 1
  %13 = getelementptr inbounds %ravi.TValue* %5, i32 1
  %14 = bitcast %ravi.TValue* %12 to i8*
  %15 = bitcast %ravi.TValue* %13 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %14, i8* %15, i32 16, i32 8, i1 false)
  %16 = load %ravi.CallInfo** %L_ci
  %base2 = getelementptr inbounds %ravi.CallInfo* %16, i32 0, i32 4, i32 0
  %17 = load %ravi.TValue** %base2

Now base, base1 and base2 are really the same - i.e., nothing has
happened to change the pointer held at this location. So should I
expect the redundant getelementptr and load instructions to be
eliminated during the optimization phase?

I am using the optimisation passes described in the Kaleidoscope
tutorial, but it seems that the redundant loads are not being removed.

Is it the call to memcpy that's preventing this?

Thanks and Regards
Dibyendu

Not sure if this is your problem, but it was mine:

You must create (or obtain) a DataLayout and install it into the Module.

It is possible to generate machine code for IR and not install the DataLayout into the Module. Rather, the DataLayout is used locally at the point where code is generated. However, if you do this, then the alias analyses required to get rid of your redundant loads and stores cannot reason about possible aliasing.

Hi David,

I tried setting the module's DataLayout to the engine's DataLayout.
Don't see any improvement.
The memcpy() is to perform a struct assign, so I tried replacing that
with member by member store.
But even then the loads are not being eliminated so I guess the
memcpy() isn't the issue.

Regards
Dibyendu
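
For readers following along, the three ways of copying a value mentioned above (struct assignment, memcpy, and member-by-member stores) can be compared in C. This is a minimal sketch: the TValue layout below is a hypothetical stand-in, not the actual ravi.TValue definition, and the function names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a Lua-style tagged value; the real
   ravi.TValue layout is an assumption here. */
typedef struct {
    double n;   /* value payload */
    int    tt;  /* type tag */
} TValue;

/* 1. Plain struct assignment (a compiler may lower this to memcpy). */
static void copy_assign(TValue *dst, const TValue *src) { *dst = *src; }

/* 2. Explicit memcpy of the whole struct. */
static void copy_memcpy(TValue *dst, const TValue *src) {
    memcpy(dst, src, sizeof *dst);
}

/* 3. Member-by-member stores. */
static void copy_members(TValue *dst, const TValue *src) {
    dst->n  = src->n;
    dst->tt = src->tt;
}

/* Returns 1 if all three forms leave the same field values in dst. */
static int copies_agree(void) {
    const TValue src = { 3.5, 3 };
    TValue a = { 0 }, b = { 0 }, c = { 0 };
    assert(sizeof(TValue) >= sizeof(double) + sizeof(int));
    copy_assign(&a, &src);
    copy_memcpy(&b, &src);
    copy_members(&c, &src);
    return a.n == 3.5 && a.tt == 3
        && b.n == a.n && b.tt == a.tt
        && c.n == a.n && c.tt == a.tt;
}
```

All three forms store the same field values, which is consistent with the observation that swapping memcpy for member stores did not change whether the reloads were removed.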

From: "Dibyendu Majumdar" <mobile@majumdar.org.uk>
To: "David Jones" <djones@xtreme-eda.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Sunday, February 22, 2015 4:40:52 PM
Subject: Re: [LLVMdev] Eliminating redundant loads

If you run the IR through opt -O3 do you get the optimization you expect?

-Hal

Hi,
Tried that - no improvement.

Also tried removing the redundant GEP instructions, leaving just the
loads. Here is a dump that shows the output after running the
optimizer passes (this is from the passes in my program not opt -O3;
this version does not use memcpy() either):

  %L_ci = getelementptr inbounds %ravi.lua_State* %L, i64 0, i32 6
  %0 = load %ravi.CallInfo** %L_ci, align 8
  %base = getelementptr inbounds %ravi.CallInfo* %0, i64 0, i32 4, i32 0
  %1 = bitcast %ravi.CallInfo* %0 to %ravi.LClosure***
  %2 = load %ravi.LClosure*** %1, align 8
  %3 = load %ravi.LClosure** %2, align 8
  %Proto = getelementptr inbounds %ravi.LClosure* %3, i64 0, i32 5
  %4 = load %ravi.Proto** %Proto, align 8
  %k = getelementptr inbounds %ravi.Proto* %4, i64 0, i32 14
  %5 = load %ravi.TValue** %k, align 8
  %6 = load %ravi.TValue** %base, align 8
  %srcvalue = getelementptr inbounds %ravi.TValue* %5, i64 0, i32 0, i32 0
  %destvalue = getelementptr inbounds %ravi.TValue* %6, i64 0, i32 0, i32 0
  %7 = load double* %srcvalue, align 8
  store double %7, double* %destvalue, align 8
  %srctype = getelementptr inbounds %ravi.TValue* %5, i64 0, i32 1
  %desttype = getelementptr inbounds %ravi.TValue* %6, i64 0, i32 1
  %8 = load i32* %srctype, align 4
  store i32 %8, i32* %desttype, align 4
  %9 = load %ravi.TValue** %base, align 8
  %srcvalue1 = getelementptr inbounds %ravi.TValue* %5, i64 1, i32 0, i32 0
  %destvalue2 = getelementptr inbounds %ravi.TValue* %9, i64 1, i32 0, i32 0
  %10 = load double* %srcvalue1, align 8
  store double %10, double* %destvalue2, align 8

Regards
Dibyendu

Hi Dibyendu,

It would be very helpful if you could post the original source code or snippet.
That way, one can investigate deeper to understand the problem.

Regards,
Kamal Sharma

Hi Kamal,

Sure. I guess I ought to create a test that one can look at in isolation.

I am working on building a JIT compiler for Lua (actually a derivative
of Lua). It is currently work in progress.
The approach is to compile Lua bytecodes into LLVM IR.

The IR generated from my code is here (after optimization, I should add):
https://github.com/dibyendumajumdar/ravi/blob/master/clang-output/lua_op_loadk_return_ravi.ll

I am using the output from Clang as a guide to generating IR. So I
write small snippets of code in C which are equivalent to Lua
bytecodes - then use Clang to emit the IR. I use this to work out the
IR I need to build.

The C equivalent of the program I am compiling is here:
https://github.com/dibyendumajumdar/ravi/blob/master/clang-output/lua_op_loadk_return.c

The difference between the C version and what I generate is that I put
a load of the "base" pointer at the beginning of every Lua opcode.
This is because some Lua opcodes can reallocate the memory pointed to
by base. I was hoping that the optimizer would get rid of the redundant
loads.

The code generation is all done here:

https://github.com/dibyendumajumdar/ravi/blob/master/src/ravijit.cpp

I don't expect you to wade through all this - but I will be grateful
for any help / guidance you can provide.

Thanks and Regards
Dibyendu
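
The reason for reloading "base" at the start of every opcode can be sketched in C. This is an illustrative toy, not code from the Ravi sources: the Stack and TValue types and the op_grow / op_loadk helpers are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double n; int tt; } TValue;           /* hypothetical */
typedef struct { TValue *base; size_t size; } Stack;   /* hypothetical */

/* An "opcode" that may grow the stack. realloc can move the block,
   so any copy of s->base cached before this call may be stale. */
static void op_grow(Stack *s, size_t new_size) {
    TValue *p = realloc(s->base, new_size * sizeof *p);
    assert(p != NULL);
    s->base = p;
    s->size = new_size;
}

/* An "opcode" that only writes one slot; it never changes s->base. */
static void op_loadk(Stack *s, size_t slot, double k) {
    TValue *base = s->base;   /* reload at the start of the opcode */
    assert(slot < s->size);
    base[slot].n  = k;
    base[slot].tt = 3;        /* arbitrary type tag for the sketch */
}

/* Runs a short opcode sequence and returns the sum of two slots. */
static double run_demo(void) {
    Stack s = { NULL, 0 };
    op_grow(&s, 2);
    op_loadk(&s, 0, 1.5);
    op_grow(&s, 1024);        /* the stack may move here */
    op_loadk(&s, 1, 2.5);     /* still safe: base is reloaded inside */
    double sum = s.base[0].n + s.base[1].n;
    free(s.base);
    return sum;
}
```

Between two opcodes like op_loadk, nothing changes s->base, so the second reload is redundant in principle; the optimizer can only drop it if it can prove the intervening stores do not touch the location holding the pointer.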

You have not installed the DataLayout in the Module, as I had pointed out earlier.

Compile a small program using clang, to get IR. You will notice a “target datalayout” declaration at the top of the IR. There is no such declaration in your IR.

This is precisely the problem I had. You need to add the DataLayout to the Module, at which point a “target datalayout” declaration will appear in your IR, and the optimization passes will have enough alias information to be able to eliminate your redundant loads.

Hi David,

I reported earlier that I tried this but there was no improvement.
Well I ran another test to be sure. The results are below. As you can
see the loads are still present.

; ModuleID = 'ravi_module_ravif1'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc-elf"

%0 = type { %ravi.TValue*, i32*, i64 }

....

  %6 = load %ravi.TValue** %base
  %srcvalue = getelementptr inbounds %ravi.TValue* %5, i32 0, i32 0, i32 0
  %destvalue = getelementptr inbounds %ravi.TValue* %6, i32 0, i32 0, i32 0
  %7 = load double* %srcvalue
  store double %7, double* %destvalue
  %srctype = getelementptr inbounds %ravi.TValue* %5, i32 0, i32 1
  %desttype = getelementptr inbounds %ravi.TValue* %6, i32 0, i32 1
  %8 = load i32* %srctype
  store i32 %8, i32* %desttype
  %9 = load %ravi.TValue** %base
  %10 = getelementptr inbounds %ravi.TValue* %9, i32 1
  %11 = getelementptr inbounds %ravi.TValue* %5, i32 1
  %srcvalue1 = getelementptr inbounds %ravi.TValue* %11, i32 0, i32 0, i32 0
  %destvalue2 = getelementptr inbounds %ravi.TValue* %10, i32 0, i32 0, i32 0
  %12 = load double* %srcvalue1
  store double %12, double* %destvalue2
  %srctype3 = getelementptr inbounds %ravi.TValue* %11, i32 0, i32 1
  %desttype4 = getelementptr inbounds %ravi.TValue* %10, i32 0, i32 1
  %13 = load i32* %srctype3
  store i32 %13, i32* %desttype4
  %14 = load %ravi.TValue** %base
  %15 = getelementptr inbounds %ravi.TValue* %14, i32 2
  %16 = getelementptr inbounds %ravi.TValue* %5, i32 2
  %srcvalue5 = getelementptr inbounds %ravi.TValue* %16, i32 0, i32 0, i32 0
  %destvalue6 = getelementptr inbounds %ravi.TValue* %15, i32 0, i32 0, i32 0
  %17 = load double* %srcvalue5
  store double %17, double* %destvalue6
  %srctype7 = getelementptr inbounds %ravi.TValue* %16, i32 0, i32 1
  %desttype8 = getelementptr inbounds %ravi.TValue* %15, i32 0, i32 1
  %18 = load i32* %srctype7
  store i32 %18, i32* %desttype8
  %19 = load %ravi.TValue** %base

Hi,

Is it the case that I need to generate tbaa metadata and associate
it with the store/load instructions? If so, is there any doc on how
to get started with this?

Thanks and Regards
Dibyendu

Hi Dibyendu,

Aliasing might pose a problem in your case.
I didn’t look at your code in depth.
But, you could try running the -tbaa optimization pass with opt.
You should have the appropriate metadata generated for this.

There are other optimization routines as well. Try exploring them.

Regards,
Kamal Sharma

I cannot find any doc on how to get started with tbaa. Any pointers?

Thanks and Regards
Dibyendu

Hi,

I added tbaa metadata and that seems to have improved things - now the
redundant loads are getting removed.
I did not find any documentation on how to use the API, but eventually
figured it out by looking at what Clang produces, and how the Julia
project is using this.

I will write this up in the next few days but if anyone wants to see
how I am doing it then here is the link to the commit:

https://github.com/dibyendumajumdar/ravi/commit/87f3720f7eb661385be1199e48f4107d04a66671

Regards
Dibyendu
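
As background on why type information helps here: TBAA (type-based alias analysis) metadata tells LLVM which accesses cannot overlap because of their types, and Clang derives it from C's aliasing rules. The C sketch below mirrors the pattern in this thread; the TValue and CallInfo layouts and the copy_two function are hypothetical simplifications, not the Ravi definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double n; int tt; } TValue;   /* hypothetical */
typedef struct { TValue *base; } CallInfo;     /* hypothetical */

/* Copies two constants into the first two stack slots. */
double copy_two(CallInfo *ci, const TValue *k) {
    assert(ci != NULL && ci->base != NULL && k != NULL);

    ci->base[0].n = k[0].n;   /* a store of type double */

    /* Under C's type-based aliasing rules, a store through a double
       lvalue cannot modify ci->base, whose type is TValue *. A
       compiler that uses this rule may reuse the earlier load of
       ci->base instead of loading it again. */
    ci->base[1].n = k[1].n;

    return ci->base[0].n + ci->base[1].n;
}

/* Small driver for the sketch. */
static double demo(void) {
    TValue k[2]     = { { 1.0, 3 }, { 2.0, 3 } };
    TValue stack[2] = { { 0.0, 0 }, { 0.0, 0 } };
    CallInfo ci = { stack };
    return copy_two(&ci, k);
}
```

Whether a given compiler actually removes the second load depends on the optimization level and flags (for example, type-based aliasing can be turned off with -fno-strict-aliasing). A front end that builds IR by hand has to attach the equivalent metadata itself, which is what the commit above does.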