MCJIT generating loads of just-stored constants

Hello,

I end up with the following IR, exhibiting an apparent missed
optimisation opportunity, namely loading of just-stored constants:

...
  %5 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 0
  store i32 1, i32* %5, align 4
  %6 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 1
  store i32 1, i32* %6, align 4
  %7 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 2
  store i32 0, i32* %7, align 4
  %8 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 6
  store i32 2, i32* %8, align 4
  %9 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 8
  store i32 2, i32* %9, align 4
  %10 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 10
  store i32 16, i32* %10, align 4
  %11 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 11
  store i32 16, i32* %11, align 4
  %12 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 12
  store i32 0, i32* %12, align 4
  %13 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 13
  store i32 0, i32* %13, align 4
  %14 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 15
  store i32 8, i32* %14, align 4
  %15 = getelementptr inbounds %class.A* %self, i32 0, i32 9, i32 17
  store i32 0, i32* %15, align 4
  %16 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 0
  %17 = load i32* %16, align 4
  %18 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 3
  %19 = load float* %18, align 4
  %20 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 4
  %21 = load float* %20, align 4
  %22 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 5
  %23 = load float* %22, align 4
  %24 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 6
  %25 = load i32* %24, align 4
  %26 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 7
  %27 = load float* %26, align 4
  %28 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 8
  %29 = load i32* %28, align 4
  %30 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 9
  %31 = load float* %30, align 4
  %32 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 10
  %33 = load i32* %32, align 4
  %34 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 11
  %35 = load i32* %34, align 4
  %36 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 13
  %37 = load i32* %36, align 4
  %38 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 14
  %39 = load float* %38, align 4
  %40 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 15
  %41 = load i32* %40, align 4
  %42 = getelementptr inbounds %class.A* %self, i64 0, i32 9, i32 16
  %43 = load float* %42, align 4
...

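The above happens after a callee gets inlined: all the stores come
from the caller, and all the loads come from the inlined callee. Please
note that the stored and loaded fields overlap only partially: fields
0, 6, 8, 10, 11, 13 and 15 are both stored and loaded, while the
remaining loads (the float fields) read fields that are not stored
here.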
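
As a rough source-level analogue of the situation, here is a minimal
sketch. The struct layout, field names and values are hypothetical
stand-ins (not the actual class A), and the always_inline attribute
spelling is GCC/Clang-specific:

```cpp
#include <cassert>

// Hypothetical stand-in for the nested member of class A that is
// accessed above; the real layout is not shown in this post.
struct Params {
    int   i0, i1, i2;
    float f3, f4, f5;
    int   i6;
    float f7;
    int   i8;
    float f9;
};

struct A {
    Params p;

    // Callee: starts by reading fields, like A::foo().
    // GCC/Clang-specific attribute, standing in for AlwaysInline.
    __attribute__((always_inline)) float foo() const {
        return p.i0 + p.f3 + p.f4 + p.f5 + p.i6 + p.f7 + p.i8 + p.f9;
    }
};

// Wrapper: stores literals to some of the fields, then calls foo()
// on the same object, like bar().
float bar(A &self) {
    self.p.i0 = 1;
    self.p.i1 = 1;
    self.p.i2 = 0;
    self.p.i6 = 2;
    self.p.i8 = 2;
    return self.foo();
}

// Small driver: the float fields are set by the caller of bar(),
// so only the int loads in foo() read just-stored constants.
float run_example() {
    A a = A();
    a.p.f3 = 0.5f;
    a.p.f4 = 0.25f;
    a.p.f5 = 0.25f;
    a.p.f7 = 1.0f;
    a.p.f9 = 2.0f;
    float r = bar(a);
    assert(r == 9.0f);
    return r;
}
```

Once foo() is inlined into bar(), I would expect the loads of i0, i6
and i8 to be replaced by the constants just stored, leaving only the
float loads.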

The general steps leading to the above:

1. Load a module containing a function A::foo(), which starts by
reading fields from an object of class A.
2. Add to the module a wrapper function bar() which takes as an
argument an object of class A, stores literals to (most of the) fields
of the object, then calls A::foo() with the same object.
3. Update the original A::foo() with an AlwaysInline attribute.
4. Pass the module to MCJIT from clang 3.4.2, set up as:

...
                llvm::PassRegistry &registry =
                    *llvm::PassRegistry::getPassRegistry();
                llvm::initializeCore(registry);
                llvm::initializeScalarOpts(registry);
                llvm::initializeObjCARCOpts(registry);
                llvm::initializeVectorization(registry);
                llvm::initializeIPO(registry);
                llvm::initializeAnalysis(registry);
                llvm::initializeIPA(registry);
                llvm::initializeTransformUtils(registry);
                llvm::initializeInstCombine(registry);
                llvm::initializeTarget(registry);
                llvm::initializeCodeGen(registry);
                llvm::initializeLoopStrengthReducePass(registry);
                llvm::initializeLowerIntrinsicsPass(registry);
                llvm::initializeUnreachableBlockElimPass(registry);

                llvm::TargetOptions opt;
                opt.PositionIndependentExecutable = false;

                const std::string& triple = llvm::sys::getProcessTriple();
                const std::string& hostcpu = llvm::sys::getHostCPUName();
                const std::string& features = "";
                std::string error;
                const llvm::Target *const target =
                    llvm::TargetRegistry::lookupTarget(triple, error);

                llvm::TargetMachine *const tm = target->createTargetMachine(
                    triple, hostcpu, features, opt,
                    llvm::Reloc::Default,
                    llvm::CodeModel::JITDefault,
                    llvm::CodeGenOpt::Aggressive);

                // Set up IR pass management
                llvm::FunctionPassManager fpm(module);
                llvm::PassManager pm;

                tm->addAnalysisPasses(pm);
                tm->addAnalysisPasses(fpm);

                // Use a pass manager builder for C-style optimisations
                llvm::PassManagerBuilder passBuilder;
                passBuilder.OptLevel = 3;
                passBuilder.SizeLevel = 0;
                // suppress llvm.lifetime.* intrinsics
                passBuilder.Inliner = llvm::createAlwaysInlinerPass(false);
                passBuilder.BBVectorize = true;
                passBuilder.SLPVectorize = true;
                passBuilder.LoopVectorize = true;
                passBuilder.LateVectorize = true;

                passBuilder.populateFunctionPassManager(fpm);
                passBuilder.populateModulePassManager(pm);

                fpm.doInitialization();
                for (llvm::Module::iterator it = module->begin(),
                         endit = module->end(); it != endit; ++it) {
                    fpm.run(*it);
                }
                fpm.doFinalization();
                pm.run(*module);

                execEngine = llvm::EngineBuilder(module)
                    .setEngineKind(llvm::EngineKind::JIT)
                    .setUseMCJIT(true)
                    .create(tm);
                execEngine->finalizeObject();
...

I guess there's something obvious I'm missing from the MCJIT setup in
order to get these results. Any hints are greatly appreciated.

Regards,
Martin