[Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

Hi,

Here's another case, different at a high level, but similar at a low level. When a Fortran allocatable array is defined in a module, its actual dimensions are kept in an internal structure. The loads that read these dimensions confuse Polly on any use of the array.

Attachments:

  1. Sample Fortran source code (to be compiled with and without -DMODULE to see the failing and working versions, respectively).
  2. LLVM IR for both cases right before Polly analysis (initialization loop "array = 2").

Below is a diff for a quick look:

marcusmae@M17xR4:~/forge/kernelgen/tests/behavior/module_array$ diff -u array.loop.9.ll module_array.loop.9.ll
--- array.loop.9.ll	2013-01-04 01:37:40.312259953 +0100
+++ module_array.loop.9.ll	2013-01-04 01:37:50.036259544 +0100
@@ -12,34 +12,39 @@
   br label %"10.cloned"

 "16.cloned":                                      ; preds = %"15.cloned"
-  %1 = add i64 %indvar3, 1
-  %exitcond12 = icmp eq i64 %1, 64
-  br i1 %exitcond12, label %"17.exitStub", label %"10.cloned"
+  %1 = add i64 %indvar1, 1
+  %exitcond9 = icmp eq i64 %1, 64
+  br i1 %exitcond9, label %"17.exitStub", label %"10.cloned"

 "10.cloned":                                      ; preds = %"Loop Function Root.split", %"16.cloned"
-  %indvar3 = phi i64 [ 0, %"Loop Function Root.split" ], [ %1, %"16.cloned" ]
-  %2 = mul i64 %indvar3, 4096
+  %indvar1 = phi i64 [ 0, %"Loop Function Root.split" ], [ %1, %"16.cloned" ]
+  %indvar.next2 = add i64 %indvar1, 1
+  %2 = load i64* inttoptr (i64 47280713800 to i64*), align 8
+  %3 = mul i64 %2, %indvar.next2
+  %4 = add i64 %3, -4160
   br label %"12.cloned"

 "15.cloned":                                      ; preds = %"14.cloned"
-  %3 = add i64 %indvar1, 1
-  %exitcond = icmp eq i64 %3, 64
+  %5 = add i64 %indvar3, 1
+  %exitcond = icmp eq i64 %5, 64
   br i1 %exitcond, label %"16.cloned", label %"12.cloned"

 "12.cloned":                                      ; preds = %"10.cloned", %"15.cloned"
-  %indvar1 = phi i64 [ 0, %"10.cloned" ], [ %3, %"15.cloned" ]
-  %4 = mul i64 %indvar1, 64
-  %5 = add i64 %2, %4
+  %indvar3 = phi i64 [ 0, %"10.cloned" ], [ %5, %"15.cloned" ]
+  %indvar.next4 = add i64 %indvar3, 1
+  %6 = load i64* inttoptr (i64 47280713776 to i64*), align 16
+  %7 = mul i64 %6, %indvar.next4
+  %8 = add i64 %4, %7
   br label %"14.cloned"

 "14.cloned":                                      ; preds = %"12.cloned", %"14.cloned"
-  %indvar = phi i64 [ 0, %"12.cloned" ], [ %7, %"14.cloned" ]
-  %6 = add i64 %5, %indvar
-  %scevgep = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %6
+  %indvar = phi i64 [ 0, %"12.cloned" ], [ %10, %"14.cloned" ]
+  %9 = add i64 %8, %indvar
+  %scevgep = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %9
   store float 2.000000e+00, float* %scevgep, align 4
-  %7 = add i64 %indvar, 1
-  %exitcond9 = icmp eq i64 %7, 64
-  br i1 %exitcond9, label %"15.cloned", label %"14.cloned"
+  %10 = add i64 %indvar, 1
+  %exitcond7 = icmp eq i64 %10, 64
+  br i1 %exitcond7, label %"15.cloned", label %"14.cloned"

 "17.exitStub":                                    ; preds = %"16.cloned"
   ret void

2013/1/2 Dmitry Mikushin <dmitry@kernelgen.org>

array.loop.9.ll (1.87 KB)

module_array.loop.9.ll (2.1 KB)

module_array.F90 (2.62 KB)

Hi Dmitry,

> Here's another case, different at a high level, but similar at a low level.

I think it's completely different at low-level too!

> When a Fortran allocatable array is defined in a module, its actual dimensions
> are kept in an internal structure. The loads that read these dimensions confuse
> Polly on any use of the array.
>
> Attachments:
> 1) Sample Fortran source code (to be compiled with and without -DMODULE to see
> the failing and working versions, respectively).
> 2) LLVM IR for both cases right before Polly analysis (initialization loop
> "array = 2")
>
> Below is a diff for a quick look:

The problem here seems to be:

- at the start of @main the externally visible global variable @__mod_MOD_array
is initialized with explicit values;
- later, functions like _gfortran_st_write are called, which as far as the
optimizers know may be modifying the contents of @__mod_MOD_array;
- thus after these calls the array contents are reloaded.
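
Reduced to a sketch, the pattern is roughly the following (hypothetical names, a made-up three-word descriptor and a simplified _gfortran_st_write signature, purely for illustration; the real IR is in the attachments):

@__mod_MOD_array = global [3 x i64] zeroinitializer    ; stand-in for the array descriptor

declare void @_gfortran_st_write(i8*)                  ; opaque to the optimizers; real signature differs

define void @example(i8* %dtparm) {
entry:
  ; the dimension is stored once at startup ...
  store i64 64, i64* getelementptr inbounds ([3 x i64]* @__mod_MOD_array, i64 0, i64 2), align 8
  ; ... but this call may overwrite the externally visible global, for all LLVM knows ...
  call void @_gfortran_st_write(i8* %dtparm)
  ; ... so the dimension must be reloaded instead of being folded to the constant 64
  %extent = load i64* getelementptr inbounds ([3 x i64]* @__mod_MOD_array, i64 0, i64 2), align 8
  ret void
}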

Remarks:

(1) Does @__mod_MOD_array have to be externally visible? If you give it
internal linkage and reoptimize then all of the loads you don't like go away.
It is the Fortran front-end that gives it external linkage, but maybe that's
a bug in the Fortran front-end?
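
For instance, on the hypothetical descriptor global from the sketch above, the change is just the linkage keyword; after reoptimizing, nothing outside the translation unit can touch the global, so the reloads around _gfortran_st_write calls can be eliminated:

@__mod_MOD_array = internal global [3 x i64] zeroinitializer    ; internal: no external accesses possible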

(2) If it does have to be externally visible, maybe LLVM's alias analysis can be
taught stuff about _gfortran_st_write and friends, which I think only read/write
memory via their arguments (so won't be sneakily accessing @__mod_MOD_array).
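
As a sketch of the kind of fact that would have to be communicated (reusing the simplified i8* declaration from above): later LLVM releases can state "only accesses memory reachable from its pointer arguments" directly on the declaration, which is not yet possible in the toolchain used in this thread:

declare void @_gfortran_st_write(i8*) argmemonly    ; AA may then assume the call leaves @__mod_MOD_array alone unless its address is passed in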

Ciao, Duncan.

PS: Another possibility is to do link-time optimization, since at that point the
optimizers are capable of finding out if that global is used anywhere else or
not.

Hi Duncan,

Thanks for looking into this!

> (1) Does @__mod_MOD_array have to be externally visible?

Fortran modules by definition share their entire contents with their users (functions, other modules), unless a member is explicitly marked PRIVATE. So all data coming from modules is essentially global by default.

> (2) If it does have to be externally visible, maybe LLVM's alias analysis can be
> taught stuff about _gfortran_st_write and friends

> Another possibility is to do link-time optimization

Right, but this is a limited solution, and so is LTO. The parallelizing compiler we are developing must be able to handle the general case; otherwise people will never be happy with it.

> I think it's completely different at low-level too!

I meant that these cases are similar with respect to unexpected loads. Trying to eliminate them at compile time using function-specific knowledge may suit only well-known APIs, e.g. C99 math, libgfortran, etc.; in the general case this approach leads nowhere. I agree with Tobi here that injecting runtime analysis could help very well, if it is inserted after the memory analysis from ISL is available, but before Polly starts to drop invalid SCoPs. This is currently impossible, because ISL only works on valid SCoPs supplied by Polly. I'm looking for design routes to let ISL work on regions that are not yet valid SCoPs, but may become valid if, according to the memory analysis, we are eligible to replace the problematic loads with constants.

- D.

2013/1/4 Duncan Sands <baldrick@free.fr>

> PS: Another possibility is to do link-time optimization, since at that point the
> optimizers are capable of finding out if that global is used anywhere else or
> not.

2013/1/4 Duncan Sands <baldrick@free.fr>

By the way, do the GCC optimizers get this, i.e. remove the loads?

Ciao, Duncan.

Hi Duncan,

You mean - what happens if DragonEgg is invoked with GCC optimization? I tried -O3 and -fplugin-arg-dragonegg-llvm-ir-optimize=1, but nothing changed.

- D.

2013/1/8 Duncan Sands <baldrick@free.fr>

Hi Dmitry,

> Hi Duncan,
>
> You mean - what happens if DragonEgg is invoked with GCC optimization? I tried
> -O3 and -fplugin-arg-dragonegg-llvm-ir-optimize=1, but nothing changed.

no, I meant if you run gcc with optimization and don't use dragonegg at all.
If I understand right, your problem is that the LLVM optimizers don't remove
some loads, and this blocks polly. If the GCC optimizers are able to remove
the loads, then that means that the gcc optimizers have access to some info
that the LLVM optimizers don't have (because removing the loads is invalid
given the information currently in the IR). We would then just have to teach
dragonegg to extract that information and pass it on to the LLVM optimizers.

Ciao, Duncan.

Hi Duncan,

Got it. So, it looks like in [case] it is really sufficient to use -O1 for GCC to remove the loads (not really remove them, but rather pull them outside of the loops, which is sufficient).

And in this case, GCC -O1 does not help.

Still, is the [case] sufficient for you to track down the missing GCC info and make it accessible to LLVM? Meanwhile, I think we will use GCC -O1; however, I guess it will likely introduce other issues in turn. Oh.

[case] http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-February/059313.html

Thanks,

- D.

2013/1/16 Duncan Sands <baldrick@free.fr>

Update: in the case discussed in this thread, GCC optimization also helps: with -O2, GCC pulls the dimension loads out of the loops. Polly still has a problem due to a bitcast, but that is another issue.
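
For reference, the "pulled out" shape looks roughly like this (a hand-reduced sketch based on the module_array IR above, not actual compiler output):

"Loop Function Root.split":
  ; the dimension loads are hoisted out of the loop nest, so the loop bodies become affine again
  %dim1 = load i64* inttoptr (i64 47280713800 to i64*), align 8
  %dim2 = load i64* inttoptr (i64 47280713776 to i64*), align 16    ; %dim2 feeds "12.cloned" (not shown)
  br label %"10.cloned"

"10.cloned":                                      ; preds = %"Loop Function Root.split", %"16.cloned"
  %indvar1 = phi i64 [ 0, %"Loop Function Root.split" ], [ %next, %"16.cloned" ]
  %indvar.next2 = add i64 %indvar1, 1
  %tmp = mul i64 %dim1, %indvar.next2
  %offset = add i64 %tmp, -4160
  br label %"12.cloned"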

- D.

2013/2/10 Dmitry Mikushin <dmitry@kernelgen.org>