Parallel C++

Chris Lattner suggested that I post to this mailing list.

I used Clang/LLVM to build a prototype for parallel
interpretation of C++. It’s based on the idea that C++
objects can be constructed remotely, and accessed via
remote pointers, without changing the C++ syntax.
I am a mathematician, not an expert on compilers.
I am proposing changes to the C++ standard and to the
compiler architecture, so I’m very interested to hear from
experts.
My paper is
https://arxiv.org/abs/1811.09303
Best regards,
Ed

I’ll probably have more detailed comments later, but here is some related work you may wish to consider:

There are quite a few dead parallel C++ dialects from the last millennium but it’s probably not worth your time to find and read about them.

I’m very glad that you used MPI as your communication runtime. This will save you lots of pain.

Jeff

Jeff,
Multi-core CPUs and all the associated software technologies (shared memory, threads, etc.) are a technological dead end.
I argue more than that: all software technologies that use processes
are dead on arrival. This includes the technologies you mention in
your presentation
https://www.ixpug.org/images/docs/KAUST_Workshop_2018/IXPUG_Invited2_Hammond.pdf

People got used to processes over decades, so when they talk about parallelism they immediately talk about processes, but this is the root of the problem. I propose object-level parallelism. An object is more than a process. It is a virtual machine.

I propose to introduce remote pointers into C++. I am very surprised nobody thought of this before. I’d be curious to know how much work
people think it would take to do this in LLVM. I know it may be possible to introduce something like remote_ptr as a library type, but I don’t think that is a good idea.
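
To be concrete, by “something like remote_ptr” I mean a library type along these lines. This is only a rough sketch; node_id and rpc_read are hypothetical placeholders, not an existing API:

struct node_id { int rank; };

// Hypothetical transport hook; a placeholder, not an existing API.
template <class T> T rpc_read(node_id where, const T* addr) { return T{}; }

// A library-level remote pointer records *where* the object lives as
// well as its address in that node's address space.
template <class T>
class remote_ptr {
    node_id where_;  // which node owns the object
    T*      addr_;   // only meaningful on node where_
public:
    remote_ptr(node_id where, T* addr) : where_(where), addr_(addr) {}
    node_id where() const { return where_; }
    // Every access must marshal the operation to the owning node; this
    // stub only shows where that would have to happen.
    T get() const { return rpc_read<T>(where_, addr_); }
};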

I am also proposing a new way of executing code, which I call causal asynchronous execution. I’d like to know if people find it natural to write code like this.

> Multi-core CPUs and all the associated software technologies (shared memory, threads, etc.) are a technological dead end.

Wow.

> I propose to introduce remote pointers into C++. I am very surprised nobody thought of this before.

They have and this is why Jeff pointed you to a variety of related work that is not cited in the paper. Just as one example, UPC has the concept of a pointer to shared data that may exist with another thread on another node. Two recent attempts at a PGAS model for C++ (UPC++ and Coarray C++) have similar concepts.
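
As a quick taste, a remote allocation and read in UPC++ look roughly like this (a sketch against the public UPC++ API; consult the current spec for exact details):

#include <upcxx/upcxx.hpp>

int main() {
    upcxx::init();
    // Allocate an int in this rank's shared segment; the resulting
    // global_ptr<int> can be shipped to, and dereferenced from, other ranks.
    upcxx::global_ptr<int> gp = upcxx::new_<int>(42);
    // Remote reads return futures rather than blocking implicitly.
    int value = upcxx::rget(gp).wait();
    upcxx::delete_(gp);
    upcxx::finalize();
    return value == 42 ? 0 : 1;
}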

> I am also proposing a new way of executing code, which I call causal asynchronous execution.

Again, that’s also not new. There are a lot of existing concepts in your paper, which isn’t by itself bad – perhaps you are combining them all together in some nice new way – but please don’t represent the component ideas as brand new.

> The object-oriented computing framework provides the foundation for a new computer architecture, which we call a multiprocessor computer. [From the linked paper]

That really, really needs a new name. :-)

-Troy

Troy,

“Multi-core CPUs are a dead end” is not a sensationalist statement. Multi-core CPUs are limited to about a dozen cores. Nobody knows how to program CPUs with 1000 cores.
Some “religious” people may actually like to program with threads, but there are at least 20 million programmers who use OO programming. How many of them enjoy threads or MPI?
Not to mention PGAS…

About remote pointers: my question is specifically why can’t we write
Object * some_object = new Object(a, b, c, d);
in C++ where some_object is not an ordinary pointer, but a remote pointer?
It is possible to implement this without breaking existing code.

Could you please provide me with a reference to causal asynchronous execution?
I think you may very well be right about it, but I searched and I did not find anything like this.
It seems to me logical that someone thought about it before me, but I’d like to see and read the work.

> Could you please provide me with a reference to causal asynchronous execution?

https://en.wikipedia.org/wiki/Futures_and_promises

-Troy

Thank you. I do know about the futures mechanism, though I didn’t know that futures can be implicit. Causal asynchronous execution does indeed sound like an implicit-futures mechanism.
I will check it out a bit more and I’ll reference it accordingly.

There are similar mechanisms in actor-based languages, but I am proposing to execute
a compound C++ statement in a non-sequential way, with causal asynchronous execution.
This would certainly break a lot of code, but I am proposing this as a compiler flag.
There is a lot of code that can be automatically translated from sequential to causal asynchronous code. There is also a lot of code that can be translated without too much human programming effort. And then people could write new causal asynchronous code,
which will be naturally parallel. I show some extensive examples of such code in the paper. It would be nice to see whether people like such code.
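
To illustrate with today’s C++: executing a compound statement causally and asynchronously amounts to roughly what one would write by hand with std::async. This is only a sketch; in my proposal the compiler would perform the translation:

#include <future>
#include <iostream>

int f(int x) { return x * 2; }
int g(int x) { return x + 3; }

int main() {
    // Sequential source:  int a = f(1); int b = g(2); int c = a + b;
    // f(1) and g(2) have no data dependency, so a causal interpretation
    // may run them concurrently; only c must wait for both.
    auto a = std::async(std::launch::async, f, 1);
    auto b = std::async(std::launch::async, g, 2);
    int c = a.get() + b.get();
    std::cout << c << "\n";  // prints 7
    return 0;
}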

> About remote pointers: my question is specifically why can’t we write
> Object * some_object = new Object(a, b, c, d);
> in C++ where some_object is not an ordinary pointer, but a remote pointer?

I gave you a couple of trouble areas in my private-email response, but I’ll repeat them for the record here too.

Circa 2012 I worked for a startup […] allowing Objective-C objects to live anywhere, even on other machines. Then we used hooks in the Objective-C runtime to intercept messages passed to those objects, marshal them, and transfer them across the wire to where the object lived. We were doing this for the purpose of “Mobile Device Management” — that is, we wanted to allow an employer to provide basically an “iPhone app as a service,” so that the app would run on the employer’s server but all the UIKit objects would live on the employee’s mobile device.

We had two problems with this:

  • First, what does the app do when you go into a tunnel and lose connectivity? You have something that looks to the program like a method call — say, result = object->ExecuteMethod(some, parameters) — which can return a value, or throw an exception, or abort; but which can also “hang” due to a lost network connection. And if the caller treats this as a TimeoutException, then we have the problem that the callee might unexpectedly “resume” sometime later with a return value that the caller (who has unwound the stack and moved on) is no longer equipped to deal with. ExecuteMethod is acting spookily like a coroutine, here. (A sketch of this hazard follows the list.)

  • Second, how does the marshaller deal with memory, and how does it deal with behaviors? We need to be able to implement qsort across a wire boundary. That means we need to be able to pass an arbitrarily large chunk of memory (the array to sort), and we also need to be able to pass a function pointer (the comparator). These are both extremely intractable problems. Our startup solved these problems by cheating. You need to solve them for real.
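
Here is the first problem in miniature, with std::future standing in for our marshalled method call (a sketch, not our actual code):

#include <chrono>
#include <future>
#include <stdexcept>

// Stand-in for the marshalled remote call; the real one crossed the wire.
int remote_execute_method(int some, int parameters) {
    return some + parameters;  // imagine a network round trip here
}

int caller() {
    auto pending = std::async(std::launch::async, remote_execute_method, 1, 2);
    if (pending.wait_for(std::chrono::seconds(5)) == std::future_status::timeout) {
        // We unwind and move on, but the callee may "resume" later with a
        // result that nobody is equipped to receive. Worse, the destructor
        // of a std::async future blocks until the task finishes, so even
        // this throw path hangs until the remote call finally returns.
        throw std::runtime_error("lost connectivity?");
    }
    return pending.get();
}

int main() { return caller() == 3 ? 0 : 1; }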

To elaborate on the “behaviors” part: Let’s suppose I have

class Object {
    int a, b, c, d;
public:
    Object(int a, int b, int c, int d) : a(a), b(b), c(c), d(d) {}
    virtual int method() { return a + b + c + d; }
    ~Object() {}  // note: non-virtual
};

int test() {
    remote_ptr<Object> some_object = handwave(new Object(1,2,3,4));
    int x = some_object->method();
    if (x < 0) throw "oops";
    return x;
}

First of all, if you don’t understand why I wrote remote_ptr<T> instead of T*, you’re probably in trouble already, C++-language-wise.
Second, you’re trying to make method execute on the remote machine, right? How does the remote machine get a copy of the code of Object::method?
Third, when we hit the throw and unwind the stack, destructors get called. Object has a non-virtual destructor. Where does it run: on our machine, or on the remote machine? Presumably it must run on the remote machine, which means our stack-unwind is held up waiting for the result of each destructor we have to run (and those destructors must happen in serial, not in parallel).
Fourth, consider

void helper(Object& o) {
    o.Object::method();  // qualified call: non-virtual dispatch
}
void nexttest() {
    remote_ptr<Object> some_object = handwave(new Object(1,2,3,4));
    Object onstack(5,6,7,8);
    helper(*some_object);  // a "fancy reference" to a remote object?
    helper(onstack);       // an ordinary reference to a local object
}

Any attempt to invent non-trivial “fancy pointers” needs to come with a full-fledged idea of how to invent “fancy references,” or it will not be able to get off the ground in C++. (See P0773R0.) Also, I snuck a non-virtual method call in there; you need a way to handle that.
This problem is easier in Objective-C, where every handle is the same kind of pointer and every method call is virtual by definition. It is very hard in C++. And when I say “very hard,” I’m fairly confident that I mean “impossible.”

Fifth, consider subobjects of remote objects:

struct FifthObject { Object m{1,2,3,4}; };
void fifthtest() {
    remote_ptr<FifthObject> some_object = handwave(new FifthObject());
    helper(some_object->m);  // a reference to a subobject of a remote object
}

Finally, even if you invent a new system for dealing with remote objects, you still have to figure out where the code is going to physically run: on which CPU, which process, which thread… avoiding cache ping-pong(*)… all this real-world-multi-processing kind of stuff. And you must be able to handle it all with zero runtime cost, or else people concerned about performance will just go under you and use those “dead-end” but efficient mechanisms, leaving your neat abstraction without a userbase.

(* — In a domain without global shared memory, the analogue of “cache ping-ponging” would be “marshalling the entire array across a wire boundary every time someone calls qsort.” We hit this problem, and, as I said, solved it by cheating on the demo.)

-Arthur

Arthur,

Your post is very helpful to me, with the exception of remarks like
“if you don’t understand why I wrote remote_ptr<T> instead of T*, you’re probably in trouble already, C++-language-wise”
because although you cannot avoid using something like remote_ptr today,
I don’t understand why the ordinary star cannot be interpreted as a remote pointer.
In fact, I am sure it can be.

The rest of your post actually provides reasons why it should not be possible, but I think
it is worth tackling them. I think it is a good list. As I wrote to you in private, I think there are many issues that I am not addressing in my paper, and there are likely many issues that I
am missing entirely. The reason I posted the paper here was to get this kind of feedback.

The first issue, loss of connectivity, is an operating system issue. It is external to the application. A similar issue is: what if the application deadlocks? The problem is that on clusters you don’t have a concept of an application, and you don’t have an OS. I define an application as a collection of objects. You need an OS that can track applications (= collections of objects). In that sense the situation is no different from a single CPU, where processes can hang. In the new OS you get applications that hang.

How does the remote machine get a copy of the code? Good question, with lots of possible answers. Let’s start with the simplest: the compiler does it, as in MPI.
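
For example, in MPI’s SPMD model the question never even arises, because every rank runs the same binary, so every method’s code is already present on every node (a minimal sketch):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // SPMD: each rank executes this same program text; only data moves.
    std::printf("rank %d has a full copy of the program\n", rank);
    MPI_Finalize();
    return 0;
}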

About destructors: yes, like constructors they run on the hardware where the remote object lives. As you say, it is likely that there are lots of dependencies that force sequential execution,
not parallel. Well, that’s life. We get parallelism elsewhere.

The problem of references is interesting. I’m not sure I understand what you mean.
I mention this very briefly in my paper. When you pass an object by reference, you will
need to copy the entire object to the remote processor and then copy the entire object back
at the end. I suppose this sounds monstrous to you… in some cases it may be possible to send less than the full object, but this is a matter of optimization.
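
A sketch of these copy-in/copy-out semantics, simulated locally; ship_out and ship_back are hypothetical stand-ins for the marshalling the compiler would generate:

#include <iostream>

struct Obj { int a, b; };

Obj  ship_out(const Obj& o)              { return o; }   // whole object goes out
void ship_back(Obj& local, const Obj& r) { local = r; }  // and comes back

template <class F>
void call_by_reference_remotely(Obj& o, F f) {
    Obj remote = ship_out(o);  // entire object crosses the wire
    f(remote);                 // the callee mutates the remote copy
    ship_back(o, remote);      // entire object crosses back at the end
}

int main() {
    Obj o{1, 2};
    call_by_reference_remotely(o, [](Obj& r) { r.a = 10; });
    std::cout << o.a << "\n";  // prints 10: the caller sees the mutation
}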

Basically, there is no choice: your objects cannot “all live in the single CPU”, and you cannot pretend that you can modify a remote object without copying it. I think this is a big topic for discussion.

On subobjects: yes, it’s a good question where subobjects live. I think this can also be relegated to the compiler and the OS. (In other words, let other people solve this… :-))

Seriously, I’d like to think of a desktop computer with many thousands of processors (or cores), and I’d like to run code on such a thing. For this, all ideas of shared memory and processes are irrelevant. I am proposing a way to write ordinary code, which is almost like the code that you’re used to, but you will probably need to make some changes.
Basically, you’ll write your code as a collection of many objects that talk to each other.
Because there are so many processors in the system, you will not be handling how and where these objects live: the compiler and the OS will do this for you. They will map millions of your objects to thousands of processors.

The concerns that you raise about speed are very serious. I think it should be possible to run such applications as fast as, and perhaps faster than, they run now, but I admit this is a difficult and huge project. However, I think the benefit of lower power consumption is huge, and this is why I said that this can affect the world’s energy consumption. It may sound bombastic, but this is likely easier to achieve than actual
speed-up from parallelization.

Thank you very much for your input.

I know from personal testing that building LLVM with ninja scales well to at least the 96 cores on a Cavium ThunderX AArch64 server.

In x86 land, going from an m5.12xlarge (48 threads, 8m05s) to an m5.24xlarge (96 threads, 6m40s) gives only about a 20% boost, but it scales almost linearly (maintaining a roughly constant cost on AWS of around $0.30 to $0.35 per build) from a single-thread instance up to a 48-thread instance.

Yes, 1000 cores is more of a challenge.

The path to remote pointers in ISO C++ is long but likely goes through the heterogeneous working group led by Michael Wong. I recommend taking a look at https://github.com/codeplaysoftware/standards-proposals to see what has been discussed so far.

Jeff

> I propose to introduce remote pointers into C++. I am very surprised nobody thought of this before.

0.) People have thought of this before. The earliest work I know of
was published in 1993.
1.) People have proposed this before.*

I read your paper this afternoon. I think your paper needs to explain
how it differentiates itself from the plethora of related work in this
problem space.

SG1, the C++ subcommittee for concurrency and parallelism, has decided
to not pursue "inaccessible memory" (e.g. remote memory) at this time;
we'd like to tackle affinity, memory access, and heterogeneous memory
first. This decision was made via a straw poll at the 2017 Toronto
committee meeting:

Straw poll: In the near term, ignore inaccessible memory except for
cross-process sharing on the same device.
SF  F  N  A  SA
 7  7  5  2   1

(SF/F/N/A/SA = Strongly For / For / Neutral / Against / Strongly Against.)

Before any work can be done on extending C++ to support diverse memory
kinds, we must first explore how the C++ memory model and abstract
machine will be impacted - what breaks, what needs to be extended,
etc. We still don't have a proposal for that. I would welcome an
SG1-targeted paper on that subject matter.

*: I don't have a full list of prior proposals in this space, but
here's some material to get you started:

https://wg21.link/P0009

https://wg21.link/P0567

http://stellar.cct.lsu.edu/pubs/pgas14.pdf

Bruce, Jeff, Bryce,

I am just a user of C++ and am not familiar with all the work that goes into language development and standardization, so this is certainly interesting to me.
Perhaps due to my being an outsider, and unaware of all the difficulties,
my proposal is radical.
It is not about extending C++, but redefining it at its core.
The only definition of an object that I found in the standard is “a region of storage”.
Perhaps there is a better definition somewhere. I’d be interested to read it.
I am proposing that an object is a virtual machine.
When I talk about a remote pointer, it is not just another pointer into some memory.
It is a way to access a virtual machine. I don’t know what the architecture will look like,
and I don’t know what memory is or where it is.

If I may be so rude, I propose to get rid of the C++ memory model and the C++ abstract machine entirely. I did not know about the existence of SG1, the C++ subcommittee for concurrency and parallelism. An object-oriented language (like C++) is conceptually inherently parallel, and it is bizarre that for several decades it has been a sequential language. SG1 should not even exist, or it should be the committee for the whole of the C++ language.

I have sketched a trivial abstract model of object-oriented computation in my paper. It does not mention memory and it does not mention network. It doesn’t even mention a processor. I propose it as a starting point for a more detailed and workable model.
I wrote in my paper that
“We perceive the world primarily as a world of objects,
a world where a multitude of objects interact simultaneously.
From the programmer’s point of view
interactions between objects are meaningful and memorable,
unlike interactions of processes exchanging messages.”
The fact that C++ is a sequential language is kinda perverse!

To summarize, I am not proposing a mechanism, a language extension, a library, a tool, a framework or a paradigm. I am proposing to redefine the existing language.
I know it sounds completely crazy and impractical, but I believe that continuing to build on the current base is not going to work. The hardware can only go in the direction of parallelism. There is a need to build desktops with thousands, or millions, of processors, and the obstacle is programmability. The sequential framework of C++ won’t work, and adding threads, language extensions, libraries, tools is futile.

The reason that I so presumptuously named my paper as a solution to the problem of parallel programming is that while my proposal is radical, I also think that a fairly smooth and gradual transition is possible from sequential C++ to parallel C++. Parallel interpretation will break a lot of code, but a lot of code can be translated to a parallel interpretation automatically, or with a relatively small programming effort. Moreover, I have shown in my paper that you can run your sequential code on parallel hardware without breaking it, and since the hardware will be more energy efficient, you will be able to run your code at a lower cost.

Bryce, thanks for the pointers. I am looking over them.
Again, thanks to everybody for their remarks. The topic is too complex for one man, and this input has helped me tremendously already.

AFAICT, it seems like what you’re suggesting to build isn’t a good fit for C++. I may be wrong.

If you think it should in fact be how C++ evolves, you should take your proposal to the C++ standards committee. To do so, you really need to dig into the references folks have sent and integrate them into your thinking, which you seem on track to do. I also suggest having some believable implementation, because that type of idea basically doesn’t see the light of day without someone committing to implementing it.

If you think it’s a new language you’re creating, do you think LLVM has the right model for what you’re trying to do? Again, it seems like implementation experience would help figure that out.

I like Clang a lot, but it seems to me that creating a serious prototype needs more than
one person’s effort. I feel that the best approach may be to first create a dialect of C++ and see whether the community adopts it.

Is LLVM/Clang suitable for the project?
Naively, it seems to me that LLVM is a sequential VM, so perhaps its architecture needs to be extended.

I am proposing an intermediate representation that encodes object operations;
let’s call it IOR. IOR is translated to interconnect hardware instructions as well as to LLVM’s IR.

I am proposing a dedicated back end to generate code for the interconnect fabric.
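
Purely as a sketch, and with every name hypothetical, an IOR instruction might look something like this:

#include <cstdint>
#include <vector>

enum class ior_op : std::uint8_t {
    construct,  // create an object on a compiler/OS-chosen node
    invoke,     // call a method on a possibly-remote object
    destroy,    // run the destructor where the object lives
};

struct ior_instr {
    ior_op op;
    std::uint64_t object_id;          // abstract handle, not an address
    std::uint32_t method_id;          // which method or constructor
    std::vector<std::uint64_t> args;  // operand object ids or immediates
};
// A back end would lower `invoke` on a local object to an ordinary call
// through LLVM IR, and on a remote object to interconnect instructions.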

I am just throwing out some ideas here, but people here should know better than I what needs to be done. I’d like to know what people think about this.

Since I don’t think I can do this on my own, I’m interested in any ideas about how this
could be done. I think that many different kinds of companies should be interested in this for different reasons.
For example, if you are planning to manufacture a 1000-core CPU, you have a problem programming it. You are probably implementing some kind of communications library for
the network on the chip, but with a compiler like this you’ll be able to:

(1) run standard sequential C++ code, where the compiler allocates objects onto many different cores. This may not be much slower than sequential execution, and it would be possible to run many such applications simultaneously, so there should be large savings in electricity.

(2) do programming in parallel C++

Presumably, you have ~4 choices here:

1- Figure out how to do the work on your own.

2- Figure out how to convince enough people to help you implement it for free.

3- Pay a bunch of people to implement it for you.

4- Figure out how to convince a company/multiple companies to either:

a. Pay a bunch of people to implement it for you.

b. Pay a bunch of people to put together a standards document for a new language, then presumably pay a bunch of people to implement it.

Edward, it sounds to me like you are trying to reinvent Smalltalk. Its core is really about message passing, and perhaps people have already made attempts to make it parallel.

On a more serious and specific note, I think you are ignoring the "abstract C machine" on which both C and C++ languages are built. Fundamentally, objects are laid out in memory (let's ignore the stack for now) and are built off primitive and user-defined types. These types are known (and stable) throughout the compilation process of a single program and so are the offsets of various fields that comprise the objects. All these objects (and often their sub-objects) can be read and written anywhere in a single-threaded program. Multi-threaded programs must be data-race-free, but essentially follow the same model.

The point I am trying to make is that the whole model is built on memory accesses that are eventually lowered to the ISA. There is no rigid protocol for invoking a member function or reading a member variable - things just happen in the program's address space. And then there is the code optimizer. The memory accesses (expressed via LLVM IR, for example) go through various techniques that reduce and eliminate pointless work... at which point you have the target's ISA and absolutely no notion of a "method" or "object" (as a well-formed program cannot tell that the code has been rearranged, reduced, reordered, etc.).

I suggest that you take a look at https://godbolt.org and see what the compiler emits with -O3 for a few short class/function templates as well as normal procedural code.
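
For example, take a simplified version of the Object class Arthur sketched earlier. With -O3, clang typically collapses the whole thing (the assembly below is representative; exact output varies by compiler and version):

struct Object {
    int a, b, c, d;
    Object(int a, int b, int c, int d) : a(a), b(b), c(c), d(d) {}
    virtual int method() { return a + b + c + d; }
};

int test() {
    Object o(1, 2, 3, 4);
    return o.method();  // devirtualized, inlined, constant-folded
}

// clang -O3 typically emits just:
//   test():
//       mov eax, 10
//       ret
// No "object" and no "method call" remain; the abstraction is gone.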

Oleg.

Oleg,

Maybe I am misunderstanding what you’re saying… Since I am proposing a different framework for execution, the architecture, which has an abstract machine and a memory model, will have to change.

Since I’d like to have remote objects that are native to C++, unlike the existing objects, which are all local,
I am proposing this IOR layer. Access to objects will have to change.
An object access will no longer be a memory access, unless some
compiler optimization determines that the object is local.

So this probably means that it requires changes to the LLVM IR?
As I said, I don’t know enough about the current LLVM architecture
to make a detailed plan, but I think it is an interesting problem.

Ed

Ed, it sounds like you have an idea for a new language with a new execution model and a special object representation. I was merely trying to point out that these ideas have little to do with what C++ compilers do today.

Oleg.

Yes, I agree. The new language is what C++ should be, a parallel language, and in my opinion what it will inevitably become with hardware evolution.