speculative parallelization in LLVM

Hi,

I plan to do some speculative parallelization in LLVM using Polly, targeting loops that contain pointers and indirect references. As far as I know, Polly generates optimized code starting from the SCoPs, so I plan to replace all pointer accesses with array accesses so that Polly will accept the code. Each array access should use a linear function of the enclosing loops' indices.

This is how I plan to implement it:

  1. parse the alloca instructions and find declarations of pointers
  2. for each pointer:
     a. delete the alloca instruction
     b. create a new alloca instruction for the new array
     c. replace all uses of the pointer with the new array

Then I feed this code into Polly, get an optimized version, and replace the array accesses back with the original pointers.
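To make this concrete, here is a rough C-level sketch of the kind of rewrite I have in mind (the bound of 10 and the function names are just placeholders I invented for illustration):

/* Original: the loop walks a linked list through a pointer. */
struct node { int val; struct node *next; };

void walk(struct node *head) {
    for (struct node *curr = head; curr; curr = curr->next)
        curr->val += 1;
}

/* After the speculative rewrite: the same work expressed as accesses
 * into an array, indexed by a linear function of the loop counter.
 * The bound 10 is a placeholder; the real size would be speculated
 * or obtained by instrumentation. */
void walk_as_array(struct node nodes[10], int n) {
    for (int i = 0; i < n; i++)
        nodes[i].val += 1;
}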

Please advise me whether this is a good strategy to replace the pointer accesses.

Thank you,
Alexandra

Hi,

I'm no expert in auto-vectorization, but I guess you'll be restricted
to pointers that are guaranteed not to overlap, which, depending on how
you structure your code, may not be many.

You can't just transform *any* pointer access into an array access by
replacing it with an alloca (which, afaik, cannot overlap with other allocas).
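To give a trivial C example of the aliasing point (nothing Polly-specific, just a sketch):

/* The two pointer arguments may alias, so treating them as two
 * independent arrays is only valid if we can prove they don't overlap. */
void scale(int *a, const int *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2 * b[i];
}

/* Two distinct local arrays (allocas) can never overlap, so the same
 * loop over them is trivially independent. */
void scale_local(int n) {
    int a[64] = {0}, b[64] = {0};
    for (int i = 0; i < n && i < 64; i++)
        a[i] = 2 * b[i];
}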

Also, the allocas inside functions with pointer arguments store a
pointer to the original pointer, so you'd have to go back to the
caller and change *that* alloca. Only you can't do that, because you're
not sure the callee will always be called, unless you want to do a
whole refactoring of all pointer allocas in all functions in all
modules; and even so, you'll be restricted to the accesses that don't
overlap.

I could be talking nonsense (please let me know if I am), but I feel
that if it were that simple, we'd see many more mainstream
auto-vectorizing compilers...

cheers,
--renato

Hi Alexandra,

good to hear from you, and even better that you plan to look into Polly. :wink:

Regarding your plan, I still do not exactly get what you want to do. (I still feel pretty new to all this stuff, so just let me know if something I say sounds wrong.)

In LLVM, load and store instructions always load from and store to a pointer, and there is no way to express an array access directly in LLVM-IR. What comes closest to an array access is the combination of a getElementPtr instruction that calculates a pointer into an array plus a load/store instruction. However, this is actually not very different from directly calculating the pointer into the array with normal pointer arithmetic.
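For example (plain C, nothing Polly-specific), these two functions are lowered to the same getElementPtr + load pattern, so an "array access" at the IR level is really just pointer arithmetic with a known base:

int read_subscript(int A[], long i) {
    return A[i];        /* lowered to: getelementptr on A with index i, then load */
}

int read_pointer(int *A, long i) {
    return *(A + i);    /* lowered to exactly the same gep + load */
}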

As a result, in Polly we always analyze the pointer arithmetic (or its effects) and try to prove that it is conceptually equivalent to an access to a normal array. To prove this we use the normal LLVM-provided alias analysis. Here we can also derive that the virtual arrays do not overlap (as mentioned by Renato). However, we never transform any pointers into arrays. We just treat them conceptually as virtual arrays.

Maybe an example would help me to understand what you plan to do. Can you give a very simple example (ideally in both C and LLVM-IR) that shows what transformation you plan to perform and what kind of code you would be able to optimize?

There are currently quite a few parts in and around Polly that need to be improved. Preparing code for Polly so that it fits the format we expect is definitely one of them. Indirect references are something I have not even started to think about. Hence, ideas in this direction are more than welcome.

All the best to Strasbourg
Tobi

Hi Tobi,

Thank you for your reply :).

I know that array accesses are handled as pointers in LLVM, but as I understand it, Polly is focused on statically analysable code: as you mentioned, on proving that pointer accesses actually represent virtual array accesses.

In the case of a linked list, for example, traversed with a pointer p = p->next, I expect that Polly will not handle the code. So I intend to change this pointer into an array (I do not expect to get correct code, but a correct SCoP).

Something along the lines of:

struct linked {
    int val;
    struct linked *next;
};

typedef struct linked item;
item *curr, *head;

while (curr) {
    do_something();
    curr = curr->next;
}

This becomes in LLVM IR:

%curr = alloca %struct.linked*, align 8

while…

%tmp15 = load %struct.linked** %curr, align 8

%tmp16 = getelementptr inbounds %struct.linked* %tmp15, i32 0, i32 1   ; curr = curr->next

And I want to transform this into a SCoP like this:

%curr_array = alloca [10 x %struct.linked], align 8

while…
%tmp16 = getelementptr inbounds [10 x %struct.linked]* %curr_array, i32 0, i32 1

(replace all pointers similarly)

I only want Polly to accept the code (although incorrect at this point) and to generate optimized code. Next I will replace all the fake arrays I introduced back with the original pointer accesses and correct the code. Finally, at runtime, I perform code instrumentation on the optimized code to check its correctness.

Is this approach going to work with Polly? Or can I generate optimized code versions with Polly in a different manner when there are pointers and indirect references inside the code?

Thanks again and good luck with all your work on Polly!

Alexandra

Hi Alexandra,

Can you guarantee that the linked list will be allocated in contiguous memory?

cheers,
--renato

Hi Renato,

No, I cannot, but in case it is, I want to take advantage of this. In case it is not, the instrumentation code will detect this at runtime and simply roll back to the original version. I will always keep an original version available, in addition to the ones I modify with Polly. However, initially I will speculate that it is allocated contiguously.

Thanks,
Alexandra

Keeping the original version should be easy, as Polly always leaves both versions in the LLVM-IR.

At the moment we have

if (true)
   new_version
else
   old_version

You could just replace the true with your runtime check.
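Conceptually (just a C-level sketch; the helper names are made up for illustration), the structure is:

/* Conceptual sketch of the structure Polly leaves in the IR.
 * The three helpers are placeholders, not real Polly functions. */
struct node { int val; struct node *next; };

int  speculation_holds(struct node *head, int n);  /* your runtime check, replacing 'true' */
void optimized_version(struct node *head, int n);  /* code generated by Polly */
void original_version(struct node *head, int n);   /* untouched original code */

void computation(struct node *head, int n) {
    if (speculation_holds(head, n))
        optimized_version(head, n);
    else
        original_version(head, n);
}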

Cheers
Tobi

Hi Tobi,

Thank you for your reply :).

I know that array accesses are handled as pointers in LLVM, but as I
understood Polly is focused on statically analysable code. As you
mentioned: proving that pointer accesses actually represent virtual
array accesses.

[...]

Is this approach going to work with Polly? Or can I generate optimized
code versions with Polly in a different manner when there are pointers
and indirect references inside the code?

OK. Perfect. Here we are. Thanks for the nice explanation. Now I get what you plan to do and I am extremely interested in how you will solve this.

So yes, I assume the translation of your code into statically analyzable code should work. The only problem I see is that it may take some time to generate code that is really statically analyzable and that at the same time can easily be converted back to the original code, especially if the code is/was further optimized afterwards.
Furthermore, you may trigger some cases that Polly cannot yet handle.

One thing I have been thinking about for a while is whether it is possible to simplify the generation of code that Polly can recognize, such that frontends like clang, but also your tool, can generate code that we can be sure Polly can handle. Here I think it would be especially interesting to make Polly also parse some kind of virtual accesses, where the details of the accesses are hidden and Polly only gets the information that the access acts like an access to a virtual array.

Describing these virtual accesses using some kind of intrinsic combined with metadata may be possible.

%1 = polly.virtual_read() !polly !"{A[i][2j][3k]}"
polly.virtual_write(%ptr) !polly !"{A[i][2j][3k]}"

This way you can simply transform your linked list into virtual accesses, Polly generates the code that executes these accesses, and at the end you replace them with the actual code.

(The definition of this is far from complete and the example above probably needs to be changed to be actually usable. Still, I hope it gives the general idea.)

Let me know what you think and what you need exactly. Maybe we can work out together a good way to represent such virtual accesses.

Cheers
Tobi

I might need a somewhat more complicated switching mechanism to handle instrumentation, several parallel versions, and the original version. Also, the “original” version I send to Polly is not the one generated from the source code, but the one with pointers replaced by arrays.
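As a rough C sketch of the switching I have in mind (all names are invented, just to illustrate the idea):

/* Rough sketch of the version switching; all names are invented. */
enum version { ORIGINAL, INSTRUMENTED, PARALLEL_V1, PARALLEL_V2 };

enum version pick_version(void);   /* driven by profiling / runtime checks */
void run_original(void);           /* unmodified code                      */
void run_instrumented(void);       /* collects access coefficients         */
void run_parallel_v1(void);        /* Polly version, one coefficient set   */
void run_parallel_v2(void);        /* Polly version, another set           */

void dispatch(void) {
    switch (pick_version()) {
    case INSTRUMENTED: run_instrumented(); break;
    case PARALLEL_V1:  run_parallel_v1();  break;
    case PARALLEL_V2:  run_parallel_v2();  break;
    default:           run_original();     break;
    }
}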

But I will investigate the solution you propose; maybe I can adapt it and use it more easily, as it is already integrated in Polly.

Alexandra

Sorry, I got it wrong, let me rephrase my question...

How are you going to find out at compile time that the objects are in
contiguous memory?

In a pointer access such as:

int foo(int* a, int* b) { ... }

if a and b point to a list (or if the code in foo() assumes they do),
at least you know that the elements of a and b are in contiguous
memory.

But in the case of:

int foo(llist a, llist b) { ... }

since the next element is in the elm->next pointer, it could be
pointing anywhere in the memory space.

I mean, in some cases you could have them in contiguous memory, but
how will you identify such cases at compile time?

Depending on the caller, and depending on the run time path the
program takes, the outcome could be different. For instance, even if
there is only one caller, and that caller iterates through a list to
allocate objects on the heap for the linked list, there is no
guarantee that the OS (or the run-time library) will give you
contiguous memory blocks.
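For example (plain C, just to illustrate), whether the nodes end up contiguous depends entirely on how the caller allocates them:

#include <stdlib.h>

struct node { int val; struct node *next; };

/* Nodes carved out of a single block: contiguous, next == &pool[i+1]. */
struct node *build_from_pool(int n) {
    struct node *pool = malloc(n * sizeof *pool);
    for (int i = 0; i < n; i++) {
        pool[i].val  = i;
        pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
    }
    return pool;
}

/* Nodes allocated one by one: no contiguity guarantee whatsoever. */
struct node *build_one_by_one(int n) {
    struct node *head = NULL;
    for (int i = 0; i < n; i++) {
        struct node *p = malloc(sizeof *p);
        p->val  = i;
        p->next = head;
        head    = p;
    }
    return head;
}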

cheers,
--renato

This is exactly what I need to achieve with Polly, actually. I think a good idea would be to define intrinsics / metadata, as you mentioned, to notify Polly that, even though it cannot analyse these accesses, it should ignore them and perform the code transformations. We can go even further and maybe describe these accesses with some parametric linear functions.

For instance:

while (cond1) {
    while (cond2) {
        p = p->next;
    }
}

to introduce virtual iterators of the enclosing loops, i and j, and replace the accesses inside the loop with virtual accesses that have the form a*i + b*j + c

%1 = polly.virtual_read() !polly !" {a1*i + b1*j + c1}"
polly.virtual_write(%ptr) !polly !" {a2*i + b2*j + c2}"

Then, at runtime, it will be easier to change the virtual accesses back to the original pointers, and to compute the values of the coefficients a1, b1, … to check whether they follow the linearity. I perform dynamic instrumentation to compute the coefficients.
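As a rough sketch of that instrumentation (plain C, one virtual iterator only, all names invented): derive a and c from the first two observed accesses and check every further access against the affine prediction:

#include <stdint.h>

/* Per-access model: speculate addr(i) = a*i + c, derive a and c from
 * the first two observed samples, then verify all later samples.
 * Initialize with: struct access_model m = { .linear = 1 };          */
struct access_model {
    long     i0;       /* iterator value of the first sample     */
    intptr_t x0;       /* address of the first sample            */
    intptr_t a, c;     /* derived coefficients                   */
    long     samples;
    int      linear;   /* stays 1 while the accesses fit a*i + c */
};

void observe(struct access_model *m, long i, const void *addr) {
    intptr_t x = (intptr_t)addr;
    if (m->samples == 0) {
        m->i0 = i; m->x0 = x;
    } else if (m->samples == 1) {
        m->a = (x - m->x0) / (i - m->i0);   /* assumes i != i0 */
        m->c = m->x0 - m->a * m->i0;
    } else if (x != m->a * i + m->c) {
        m->linear = 0;   /* speculation failed: roll back to the original version */
    }
    m->samples++;
}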

However, for applying the transformations, Polly should either totally ignore the virtual accesses, or assign some default values to the coefficients and take them into consideration. Our plan is to create several versions, some with different values, let's say a = 1, b = 1, c = 0, and one version where all the virtual accesses are ignored.

What is important is to be able to track the virtual accesses and to be able to replace them with the original ones.

Do you think this represents a lot of work in Polly? And do you plan to include this kind of support to handle non-statically analysable code? In case this doesn’t imply significant changes in Polly, I could start working on this. It might be a better approach than converting the code into a form accepted by Polly :slight_smile:

Alexandra

This is exactly what I need to achieve with Polly actually. I think a
good idea would be to define intrinsics / metadata, as you mentioned, to
notify Polly that even though it cannot analyse these accesses, to
ignore them and perform the code transformations. We can go even further
and maybe describe these accesses with some parametric linear functions.

For instance:

while (cond1) {
    while (cond2) {
        p = p->next;
    }
}

to introduce virtual iterators of the enclosing loops, i and j , and
replace the accesses inside the loop with virtual accesses that have the
form a*i + b*j + c

%1 = polly.virtual_read() !polly !" {a1*i + b1*j + c1}"
polly.virtual_write(%ptr) !polly !" {a2*i + b2*j + c2}"
Next at runtime it will be easier to change the virtual accesses to the
original pointers, and to compute the values to the coefficients a1, b1
... to check if they follow the linearity. I perform dynamic
instrumentation to compute the coefficients.

Mh. I believe we should distinguish the data for Polly from the data for your calculations. I assumed we would use affine linear relations in the access functions (actually isl_maps like {[i,j] -> List[10i + 30j + 10]}) to define accesses, such that Polly can use these access functions to calculate dependences and reschedule the code accordingly.

The possibly non-affine accesses would then be hidden behind the virtual access.

However, for applying the transformations, Polly should either totally
ignore the virtual accesses, or assign some default values to the
coefficients and take them into consideration. Our plan is to create
several versions, some with different values, let's say a = 1, b = 1, c =
0, and one version where all the virtual accesses are ignored.

What do you mean by totally ignoring the virtual accesses? This would mean Polly would not detect any dependences at all and would always generate fully parallel code? Is this what you want even if the code accesses an element of a list several times and actually has dependences between these elements?

I was thinking that in your tool you could generate several versions of the code, where each version has a different set of affine linear access functions initialized from your parametric affine linear functions. This way Polly would schedule each version differently.

What is important is to be able to track the virtual accesses and to be
able to replace them with the original ones.

What information do you need from the surrounding code? Do you need the values of i and j?

Then I would propose we add them as arguments of our intrinsics:
%1 = polly.virtual_read(%i, %j) !polly !" {a1*i + b1*j + c1}"
polly.virtual_write(%ptr, %i, %j) !polly !" {a2*i + b2*j + c2}"

Do you need anything else?

Do you think this represents a lot of work in Polly? And do you plan to
include this kind of support to handle non-statically analysable code?

Adding support for those virtual access functions should not be too difficult, and I am happy to help you add such support. I currently have no plans to work on support for non-statically analyzable code, but am also happy to support and/or integrate your work.

In case this doesn't imply significant changes in Polly, I could start
working on this. It might be a better approach than converting the code
into a form accepted by Polly :slight_smile:

Sounds good. What about discussing a full-blown example to understand what needs to be changed in Polly and how exactly we want to represent this?

Cheers
Tobi

Hi

while (cond1) {
    while (cond2) {
        p = p->next;
    }
}

to introduce virtual iterators of the enclosing loops, i and j, and
replace the accesses inside the loop with virtual accesses that have the
form a*i + b*j + c

%1 = polly.virtual_read() !polly !" {a1*i + b1*j + c1}"
polly.virtual_write(%ptr) !polly !" {a2*i + b2*j + c2}"
Next at runtime it will be easier to change the virtual accesses to the
original pointers, and to compute the values to the coefficients a1, b1
… to check if they follow the linearity. I perform dynamic
instrumentation to compute the coefficients.

Mh. I believe we should distinguish the data for Polly from the data for your calculations. I assumed we would use affine linear relations in the access functions (actually isl_maps like {[i,j] -> List[10i + 30j + 10]}) to define accesses, such that Polly can use these access functions to calculate dependences and reschedule the code accordingly.

The possibly non-affine accesses would then be hidden behind the virtual access.

However, for applying the transformations, Polly should either totally
ignore the virtual accesses, or assign some default values to the
coefficients and take them into consideration. Our plan is to create
several versions, some with different values, let's say a = 1, b = 1, c =
0, and one version where all the virtual accesses are ignored.

What do you mean by totally ignoring the virtual accesses? This would mean Polly would not detect any dependences at all and would always generate fully parallel code? Is this what you want even if the code accesses an element of a list several times and actually has dependences between these elements?

I was thinking that in your tool you could generate several versions of the code, where each version has a different set of affine linear access functions initialized from your parametric affine linear functions. This way Polly would schedule each version differently.

Yes, this is what I meant: to assign different values to the coefficients at compile time and build several versions with Polly. By “ignoring them” I was thinking that Polly could build one more version in which it takes into consideration only the statically computable accesses and generates the transformations, and at runtime I will check whether the pointer accesses comply (whether the schedule is still valid with the dependences they introduce). This might be suitable for codes with a lot of array accesses and maybe just a few pointers. But this is just an idea, not necessarily relevant. The main strategy is to generate several versions using different access functions.

What is important is to be able to track the virtual accesses and to be
able to replace them with the original ones.

What information do you need from the surrounding code? Do you need the values of i and j?

Then I would propose we add them as arguments of our intrinsics:
%1 = polly.virtual_read(%i, %j) !polly !" {a1*i + b1*j + c1}"
polly.virtual_write(%ptr, %i, %j) !polly !" {a2*i + b2*j + c2}"

Do you need anything else?

Do you think this represents a lot of work in Polly? And do you plan to
include this kind of support to handle non-statically analysable code?

Adding support for those virtual access functions should not be too difficult, and I am happy to help you add such support. I currently have no plans to work on support for non-statically analyzable code, but am also happy to support and/or integrate your work.

In case this doesn’t imply significant changes in Polly, I could start
working on this. It might be a better approach than converting the code
into a form accepted by Polly :slight_smile:
Sounds good. What about discussing a full-blown example to understand what needs to be changed in Polly and how exactly we want to represent this?

OK, I will keep this in mind. For the moment we have not yet decided upon the strategy to apply, but in case I start extending Polly, I will reply to this message with a concrete example.
Thanks for your support.

Cheers,
Alexandra