PHP Zend LLVM extension (SoC)

Hi,

PHP has a Google Summer of Code project approved to create an LLVM extension
for PHP's VM (Zend):
http://code.google.com/soc/2008/php/appinfo.html?csaid=73D5F5E282F9163F
I'll be mentoring that project (and the student is CC'ed).
Although I've already contributed a few patches to clang, I haven't hacked
on LLVM much, so I would like to gather some advice before misleading the
student too much :P

So my idea is to use the current PHP parser to produce PHP bytecode and then
convert the PHP bytecode to LLVM bitcode. The extra pass to create PHP
bytecode seems necessary for now, as it makes things simpler on the PHP end.
The first step would be to convert the PHP bytecode to LLVM by just
producing function calls to the PHP interpreter's opcode handlers. This has
two advantages: it's a simple task, and we can get something working quickly.
The disadvantage is that it would only bypass the opcode dispatcher, leaving
not much room for optimization.
In the second phase, we would start to inline some simple PHP bytecodes,
like arithmetic operations and so on, by emitting LLVM assembly instead of
calling the opcode handler. Eventually we could reach a point where no opcode
handlers are necessary.
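
To make the two phases concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy bytecode and its opcode names are not Zend's, and the "code generator" emits Python source rather than LLVM IR. It only shows the shape of the idea: phase 1 resolves each handler once and leaves a straight-line list of calls, while phase 2 emits the body of simple opcodes inline and falls back to a handler call for everything else.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy bytecode: (opcode, result, operand1, operand2).
# The opcode names are illustrative and are not Zend's real opcodes.
Op = Tuple[str, str, object, object]
Env = Dict[str, object]


def handle_assign(env: Env, result: str, value: object, _unused: object) -> None:
    env[result] = value


def handle_add(env: Env, result: str, left: str, right: str) -> None:
    env[result] = env[left] + env[right]


HANDLERS: Dict[str, Callable[..., None]] = {
    "ASSIGN": handle_assign,
    "ADD": handle_add,
}


def interpret(code: List[Op]) -> Env:
    """Baseline interpreter: dispatch on the opcode for every instruction."""
    env: Env = {}
    for opcode, result, a, b in code:
        HANDLERS[opcode](env, result, a, b)
    return env


def translate_phase1(code: List[Op]) -> Callable[[], Env]:
    """Phase 1: resolve each handler once, leaving only direct handler calls."""
    calls = [(HANDLERS[opcode], result, a, b) for opcode, result, a, b in code]

    def run() -> Env:
        env: Env = {}
        for handler, result, a, b in calls:
            handler(env, result, a, b)
        return env

    return run


def translate_phase2(code: List[Op]) -> Callable[[], Env]:
    """Phase 2: emit simple opcodes inline, call the handler for the rest."""
    lines = ["def run():", "    env = {}"]
    for opcode, result, a, b in code:
        if opcode == "ASSIGN":
            lines.append(f"    env[{result!r}] = {a!r}")
        elif opcode == "ADD":
            lines.append(f"    env[{result!r}] = env[{a!r}] + env[{b!r}]")
        else:
            lines.append(f"    HANDLERS[{opcode!r}](env, {result!r}, {a!r}, {b!r})")
    lines.append("    return env")
    namespace = {"HANDLERS": HANDLERS}
    exec("\n".join(lines), namespace)
    return namespace["run"]


program: List[Op] = [
    ("ASSIGN", "a", 2, None),
    ("ASSIGN", "b", 3, None),
    ("ADD", "c", "a", "b"),
]
```

All three of `interpret(program)`, `translate_phase1(program)()` and `translate_phase2(program)()` should produce the same environment; only the amount of per-instruction dispatch differs.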

So does this look like a sane approach? Any helpful advice?
Another question: once we have the LLVM assembly, how should the binary code
be produced, loaded into memory, and then executed? I assume we can link
directly against the LLVM code generation and optimization libs. And does
LLVM support emitting the code directly to memory so that we can run it from
there without much magic (and then cache it somewhere)?

Thanks,
Nuno

Hi Nuno,

> PHP has a Google Summer of Code project approved to create an LLVM extension for PHP's VM (Zend): http://code.google.com/soc/2008/php/appinfo.html?csaid=73D5F5E282F9163F I'll be mentoring that project (and the student is CC'ed). Although I've already contributed a few patches to clang, I haven't hacked on LLVM much, so I would like to gather some advice before misleading the student too much :P

This is very exciting!

> So my idea is to use the current PHP parser to produce PHP bytecode and then convert the PHP bytecode to LLVM bitcode. The extra pass to create PHP bytecode seems necessary for now, as it makes things simpler on the PHP end. The first step would be to convert the PHP bytecode to LLVM by just producing function calls to the PHP interpreter's opcode handlers. This has two advantages: it's a simple task, and we can get something working quickly. The disadvantage is that it would only bypass the opcode dispatcher, leaving not much room for optimization.

As far as I know, this is exactly how Apple's OpenGL shader JIT works in Mac OS X. Unfortunately, LLVM will rarely make dramatic changes to your memory representation, so this probably won't be as effective as it is in the OpenGL context. (LLVM will only do aggregate->scalar memory reorganizations; it probably won't be able to prove this safe for a dynamic language very often.) The main challenge in generating very fast code will likely be type inference.

> In the second phase, we would start to inline some simple PHP bytecodes, like arithmetic operations and so on, by emitting LLVM assembly instead of calling the opcode handler. Eventually we could reach a point where no opcode handlers are necessary.

> So does this look like a sane approach? Any helpful advice? Another question: once we have the LLVM assembly, how should the binary code be produced, loaded into memory, and then executed? I assume we can link directly against the LLVM code generation and optimization libs. And does LLVM support emitting the code directly to memory so that we can run it from there without much magic (and then cache it somewhere)?

You can use the facilities of ExecutionEngine to run code in-memory without ever touching the filesystem. The LLVM tutorial has information on how to do this.

http://llvm.org/doxygen/classllvm_1_1ExecutionEngine.html
http://llvm.org/docs/tutorial/LangImpl4.html

You'll probably want to provide your opcode handlers as an LLVM IR module. Your JIT can start up and “seed” the execution environment with the predefined handlers, then progressively incorporate more functions into the module as execution progresses.

Hope that helps,
Gordon

Nuno Lopes wrote:

> The first step would be to convert the PHP bytecode to LLVM by just
> producing function calls to the PHP interpreter's opcode handlers.

> [...]

> In the second phase, we would start to inline some simple PHP bytecodes,
> like arithmetic operations and so on, by emitting LLVM assembly instead of
> calling the opcode handler. Eventually we could reach a point where no opcode
> handlers are necessary.

There is some presentation on the LLVM website (by Chris, I guess) mentioning that this can be done almost automatically, by letting LLVM compile the PHP opcode handlers themselves, via the gcc or clang front-end. LLVM can then inline the opcode handlers and apply further optimizations.

-- Alain

Thank you both for your answers!
Type inference was going to be my second question. PHP uses a structure with a union to represent a variable (because a variable can have different types, like a long, a double, a stream, etc.), but often a single variable will only have one type throughout the program (e.g. $i when iterating in a loop). Will LLVM automagically see that we always use the same type for a certain variable, discard the whole union, and use a single scalar (and also discard all the type checking done in the opcode handlers)? We can do some type inference on our side with a pass over the bytecode, but I would like to know whether that's needed or whether LLVM will do it on its own.
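
For readers unfamiliar with this representation, here is a minimal sketch of a tagged union, written with Python's `ctypes` so that the memory layout is explicit. The names `Payload`, `Variable`, `IS_LONG`, `IS_DOUBLE` and `add_handler` are hypothetical and only loosely modelled on the description above; they are not the Zend engine's actual definitions. It shows why a generic handler has to inspect the tags on every call, and what a specialized version could look like once the types are known.

```python
import ctypes

# Hypothetical type tags; the Zend engine defines its own constants.
IS_LONG = 1
IS_DOUBLE = 2


class Payload(ctypes.Union):
    # Both members share the same 8 bytes of storage.
    _fields_ = [("lval", ctypes.c_int64), ("dval", ctypes.c_double)]


class Variable(ctypes.Structure):
    # The tag is the only record of which union member is currently valid.
    _fields_ = [("value", Payload), ("type", ctypes.c_uint8)]


def make_long(n: int) -> Variable:
    var = Variable()
    var.type = IS_LONG
    var.value.lval = n
    return var


def make_double(x: float) -> Variable:
    var = Variable()
    var.type = IS_DOUBLE
    var.value.dval = x
    return var


def as_double(var: Variable) -> float:
    """Read the payload as a double, converting from long if needed."""
    return var.value.dval if var.type == IS_DOUBLE else float(var.value.lval)


def add_handler(a: Variable, b: Variable) -> Variable:
    """Generic handler: must inspect both tags at run time on every call."""
    if a.type == IS_LONG and b.type == IS_LONG:
        return make_long(a.value.lval + b.value.lval)
    return make_double(as_double(a) + as_double(b))


def add_long_specialized(a: int, b: int) -> int:
    """What could be emitted once both operands are known to be longs."""
    return a + b
```

The question in the paragraph above amounts to asking whether an optimizer can get from `add_handler` to `add_long_specialized` without being told anything about the meaning of the `type` field.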

As for the opcode handlers, it's great news that we don't need to inline them by hand. Now I only need to fix clang to compile PHP :P

Thanks,
Nuno

LLVM likely won't be able to do type inference for you. That kind of high-level language information will be lost by the time you hit LLVM IR. Take a look at how llvm-gcc or the online demo does codegen for C unions, especially for ones of any complexity. If you want to see a real performance win from this, some degree of type inference at a stage where high-level information has not yet been discarded will be very helpful.

--Owen

I'd put it another way: an existing LLVM pass won't do type inference for you. The right way to tackle this is to write a language-specific pass on LLVM IR that knows your runtime and can propagate types around.

-Chris

Or just do it at the Zend bytecode level.

The meta-point is that performing that kind of optimization requires higher level knowledge than what is explicitly represented in the LLVM IR. To obtain it, you either need to optimize at a higher level or write an LLVM optimization that encodes language-specific high-level knowledge. Either approach will work, and it's your call which one is easier for you to write.
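
As an illustration of the "optimize at a higher level" option, here is a minimal sketch in Python of a flow-insensitive type inference pass over a toy bytecode. The bytecode format, the opcode names, and the typing rule in `add_result_type` are assumptions made for this example and do not reflect Zend's real opcodes or PHP's full conversion rules (for instance, integer overflow is ignored). The pass iterates to a fixed point, collecting the set of types each variable can hold; variables with exactly one known type are candidates for unboxed, check-free code.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy bytecode, not Zend's real opcodes:
#   ("CONST", result, literal, None)   result = literal
#   ("ADD",   result, left, right)     result = left + right
Op = Tuple[str, str, object, object]


def add_result_type(left: str, right: str) -> str:
    """Simplified typing rule for ADD (assumption: overflow is ignored)."""
    numeric = {"int", "float"}
    if left == "int" and right == "int":
        return "int"
    if left in numeric and right in numeric:
        return "float"
    return "unknown"


def infer_types(code: List[Op]) -> Dict[str, Set[str]]:
    """Collect, for each variable, every type it may hold anywhere in the code."""
    types: Dict[str, Set[str]] = {}
    changed = True
    while changed:  # iterate until no type set grows any further
        changed = False
        for opcode, result, a, b in code:
            if opcode == "CONST":
                found = {type(a).__name__}
            elif opcode == "ADD":
                found = {
                    add_result_type(x, y)
                    for x in types.get(a, set())
                    for y in types.get(b, set())
                }
            else:
                found = {"unknown"}
            known = types.setdefault(result, set())
            if not found <= known:
                known |= found
                changed = True
    return types


def monomorphic(types: Dict[str, Set[str]]) -> Dict[str, str]:
    """Variables that hold exactly one known type throughout the code."""
    return {
        var: next(iter(seen))
        for var, seen in types.items()
        if len(seen) == 1 and seen != {"unknown"}
    }


loop_body: List[Op] = [
    ("CONST", "i", 0, None),
    ("CONST", "step", 1, None),
    ("ADD", "i", "i", "step"),   # $i stays an int
    ("CONST", "x", 1.5, None),
    ("ADD", "y", "i", "x"),      # int + float
    ("CONST", "z", "7", None),
    ("ADD", "w", "i", "z"),      # int + string: left to the generic handler
]
```

Here `monomorphic(infer_types(loop_body))` reports `i` as an int and `y` as a float, while `w` is left out because this toy rule cannot type a number plus a string.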

--Owen

Owen, Chris,

Owen Anderson wrote:

> LLVM likely won't be able to do type inference for you.

> I'd put it another way: an existing LLVM pass won't do type inference
> for you. The right way to tackle this is to write a language-specific
> pass on LLVM IR that knows your runtime and can propagate
> types around.

The thing is, LLVM IR is type-checked as it is constructed (e.g. BinaryOperator::CreateAdd(A, B) checks A->type == B->type, and BinaryOperator::CreateFDiv(A, B) checks that A and B are floating point and that A->type == B->type). That's a showstopper for performing type inference on the LLVM IR. To perform type inference on the IR, one would have to add or modify the following:

1) Allow the arguments of a function to be set as Opaque (or something like that, at least something that can be redefined later)
2) IR construction: no type check, no type-oriented instructions (e.g. no CreateFDiv)
3) Add some meta-information to each instruction (is that the Annotation thing?)
4) Perform the type inference.
5) Perform type checking.
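
The constraint described above can be mimicked with a small sketch in Python. This is not the LLVM API: `Value`, `create_add`, `create_fdiv` and `create_runtime_call` are hypothetical stand-ins that reproduce only the checks mentioned in this message. The point is that a typed constructor needs the operand types up front, so a front end that does not yet know them can only emit a call to a runtime handler on boxed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    name: str
    type: str  # hypothetical type names such as "i64", "double" or "boxed"


def create_add(a: Value, b: Value) -> Value:
    """Typed add: rejects operands whose types differ, at construction time."""
    if a.type != b.type:
        raise TypeError(f"add: operand types differ ({a.type} vs {b.type})")
    return Value(f"add {a.name}, {b.name}", a.type)


def create_fdiv(a: Value, b: Value) -> Value:
    """Typed floating-point divide: both operands must already be doubles."""
    if a.type != "double" or b.type != "double":
        raise TypeError("fdiv: both operands must be double")
    return Value(f"fdiv {a.name}, {b.name}", "double")


def create_runtime_call(handler: str, a: Value, b: Value) -> Value:
    """Fallback when types are unknown: call a runtime handler on boxed values."""
    return Value(f"call {handler}({a.name}, {b.name})", "boxed")


def emit_add(a: Value, b: Value) -> Value:
    """Use the typed instruction only when inference has already fixed the types."""
    if a.type == b.type and a.type in ("i64", "double"):
        return create_add(a, b)
    return create_runtime_call("add_handler", a, b)
```

With inferred types, `emit_add(Value("%i", "i64"), Value("%one", "i64"))` yields a typed add; with a boxed operand it falls back to the handler call.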

> Or just do it at the Zend bytecode level.

I think that's currently the only solution. However:

A) Suppose you don't have a bytecode for your language, or you would like to transform your language directly to LLVM IR. You don't want to end up creating an instruction set that is very similar to LLVM IR. That costs a lot of time and memory, since you're creating new objects for your own instruction set that are very similar to LLVM IR objects. (I may be thinking too much about JIT here, since it's less of a problem with static compilation.)

B) This is the kind of pass that many languages will need, and everyone writing their own is not very consistent with the LLVM philosophy ;)

I'm not saying 1) to 5) are the right steps to achieve this. But I do think we need the LLVM IR to carry some high-level type information for type inference, or even for type-based optimizations such as type-based alias analysis.

Nicolas

Hi Nuno,

This can be a great project. Some PHP opcodes can be optimized a lot by LLVM (like branches or function calls), while others, like operations on variables, can't be optimized so easily due to the dynamic nature of PHP. For the latter, maybe you can use some automatic type inference, like that used in languages such as Haskell, but this is a big project, and there are also mixed cases like adding a number to a string. I think for these you can use the PHP handlers for now. Even so, I feel that the speed gain will be considerable.
Another thing you can do with only a little more work is to create an abstraction layer between the webserver module and the content source, one that works only with LLVM compiled files (.bc). In that scenario you can compile PHP files to the LLVM .bc file format. These files can also be used as a cache, thus eliminating future parsing and compilation time. The speed gain can be very high, because on heavily accessed sites some pages are requested hundreds of times per minute. The generated .bc files will call the handlers from the PHP runtime and libraries where needed.
In the long term this abstraction layer, which is in fact a webserver module, can be used with many frontends that generate .bc code from different source languages (Ruby, Python, Lua, etc. come to mind), turning the whole thing into a framework similar to the ones based on the .class or .NET CLI formats. This of course can only be done if the .bc format is mature and stable; otherwise it can only be used as a cache.
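
The caching part of this suggestion could be sketched as follows, in Python. The function name `cached_compile`, the choice of a content hash as the cache key, and the `fake_compile` stand-in are all assumptions made for this example; a real implementation would invoke the PHP-to-bitcode compiler and would also need an invalidation policy for included files.

```python
import hashlib
import os
import tempfile
from typing import Callable, List


def cached_compile(source: str, cache_dir: str,
                   compile_fn: Callable[[str], bytes]) -> bytes:
    """Return the compiled artifact for `source`, compiling only on a cache miss."""
    # Assumption: the cache key is a hash of the source text itself.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".bc")
    if os.path.exists(path):
        with open(path, "rb") as cached:
            return cached.read()
    artifact = compile_fn(source)
    with open(path, "wb") as out:
        out.write(artifact)
    return artifact


compile_calls: List[str] = []


def fake_compile(source: str) -> bytes:
    """Stand-in for a real compiler; records each call so misses can be counted."""
    compile_calls.append(source)
    return b"compiled:" + source.encode("utf-8")


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_compile("<?php echo 1;", cache_dir, fake_compile)
    second = cached_compile("<?php echo 1;", cache_dir, fake_compile)
```

The second request is served from the cached file, so `fake_compile` runs only once for the two requests.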

Good luck,
Razvan

Writing a webserver module is probably not the first thing one should
do. All webserver modules have serious trouble with security in a
multiuser environment (not a surprise: the module runs as the Apache
user, so the scripts of different users could interfere with each
other).
If you target the mass hosting / mass scripting market, start with a
FastCGI application server; these are standalone processes that can be
run with the proper owner set and hence don't have these problems.
Besides, you can use the same FastCGI server for all web servers, while
you'd need to write separate interface code for Apache, lighttpd, Zope,
or whatever you'd want to target.

Regards,
Jo

OK, thank you for your help and suggestions! We'll digest your e-mails and get back to you when we have something working :)

Thanks,
Nuno