A "Java Backend"

My idea was to create a complete backend treating Java as a normal platform, to enable LLVM to compile programs to Java Bytecode (.class) and Java Archive files (.jar). This could be useful in situations where we need to compile a program for a platform still not natively supported by LLVM.

I don't know if it exists already, I've heard about this "LLJVM" but I don't think it does the same thing as my idea.
What do you think?

I think that it will be difficult. Java bytecode is intrinsically designed to be memory safe, whereas LLVM IR is not. There is no equivalent of inttoptr or ptrtoint in Java bytecode and the closest equivalent of a GEP is to retrieve a field from an object (though that’s only really for GEP + load/store).

You could potentially do something a bit ugly and treat all of LLVM memory as one big ByteBuffer object, and make pointers indexes into this, but then you’d make it very hard for your LLVM-originating code to interoperate with Java-originating code and so you’d have to write a lot of code to take the place of the system call layer.
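As a rough illustration of the "one big ByteBuffer" approach described above, here is a minimal sketch in Java. The class `LinearMemory` and its methods (`alloc`, `storeI32`, `loadI32`, `gepI32`) are hypothetical names made up for this example, not part of any existing backend; the bump allocator and the little-endian byte order are assumptions too. It only shows that pointers become plain integers, so GEP, ptrtoint and inttoptr reduce to integer arithmetic.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: models all "LLVM memory" as one ByteBuffer.
// Pointers are plain int offsets into that buffer.
final class LinearMemory {
    private final ByteBuffer heap;
    private int next = 8; // bump pointer; offset 0 is kept free to act as null

    LinearMemory(int sizeInBytes) {
        // Byte order is an assumption; a real backend would follow the data layout.
        heap = ByteBuffer.allocate(sizeInBytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    // malloc-like bump allocation, rounded up to 8 bytes; nothing is ever freed.
    int alloc(int bytes) {
        if (bytes < 0) throw new IllegalArgumentException("negative size");
        int aligned = (bytes + 7) & ~7;
        if (aligned < 0 || next > heap.capacity() - aligned) {
            throw new OutOfMemoryError("linear memory exhausted");
        }
        int addr = next;
        next += aligned;
        return addr;
    }

    void storeI32(int addr, int value) { heap.putInt(addr, value); }

    int loadI32(int addr) { return heap.getInt(addr); }

    // A GEP over an i32 array is just base + 4 * index.
    static int gepI32(int base, int index) { return base + 4 * index; }

    public static void main(String[] args) {
        LinearMemory mem = new LinearMemory(1024);
        int array = mem.alloc(4 * 4); // room for four i32 values
        for (int i = 0; i < 4; i++) {
            mem.storeI32(gepI32(array, i), i * 10);
        }
        int p = gepI32(array, 2); // "pointer" to element 2
        long asInt = p;           // ptrtoint: widen the offset
        int back = (int) asInt;   // inttoptr: narrow it again
        System.out.println(mem.loadI32(back)); // prints 20
    }
}
```

Note that nothing in this buffer is a Java object reference, which is exactly the interoperability problem: Java-originating code cannot hand an object to code that only understands offsets.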

Oh, and I doubt that you’ll find many more platforms that have a fully functional JVM than are LLVM targets. Even big-endian MIPS64 is not well-supported by Java (JamVM - a pure interpreter - is the only thing that we’ve managed to find that works).

David

Hi Lorenzo,

One related project (though not exactly what you want) is Sulong (graalvm/sulong on GitHub; that repository is now obsolete and has moved to oracle/graal).

-- Sanjoy

Hi,

>> My idea was to create a complete backend treating Java as a normal platform, to enable LLVM to compile programs to Java Bytecode (.class) and Java Archive files (.jar). This could be useful in situations where we need to compile a program for a platform still not natively supported by LLVM.
>>
>> I don't know if it exists already, I've heard about this "LLJVM" but I don't think it does the same thing as my idea.
>> What do you think?
>
> I think that it will be difficult. Java bytecode is intrinsically designed to be memory safe, whereas LLVM IR is not. There is no equivalent of inttoptr or ptrtoint in Java bytecode and the closest equivalent of a GEP is to retrieve a field from an object (though that’s only really for GEP + load/store).
>
> You could potentially do something a bit ugly and treat all of LLVM memory as one big ByteBuffer object, and make pointers indexes into this, but then you’d make it very hard for your LLVM-originating code to interoperate with Java-originating code and so you’d have to write a lot of code to take the place of the system call layer.

The caveat here is that Java has this "private"
but-not-really-in-practice API called sun.misc.Unsafe that can be used
to access native memory. So you can have (I'm paraphrasing, the
method names may not match):

   long addr = unsafe.allocateMemory(64);  // size in bytes
   unsafe.putInt(addr + 48, 9001);
   int val = unsafe.getInt(addr + 48);

etc. You may even get decent performance out of this since JIT
compilers tend to have to optimize these well (they're commonly used
in the implementation of some popular JDK classes).
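For anyone who wants to try this, a self-contained sketch follows. It assumes a JDK that still ships `sun.misc.Unsafe` (in the `jdk.unsupported` module) and obtains the instance by reflecting on the private `theUnsafe` field, which is the usual workaround rather than a supported API. Recent JDKs print deprecation warnings for these memory-access methods and may remove them in the future. The class name `UnsafeDemo`, the helper `roundTrip`, and the 64-byte allocation size are made up for this example.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical demo class; only the sun.misc.Unsafe calls are real API.
final class UnsafeDemo {
    // Unsafe.getUnsafe() rejects application code, so read the singleton field instead.
    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible", e);
        }
    }

    // Writes an int into native (off-heap) memory and reads it back.
    static int roundTrip(int value) {
        Unsafe unsafe = loadUnsafe();
        long addr = unsafe.allocateMemory(64); // raw address, 64 bytes
        try {
            unsafe.putInt(addr + 48, value);
            return unsafe.getInt(addr + 48);
        } finally {
            unsafe.freeMemory(addr); // no GC here: memory must be released by hand
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(9001));
    }
}
```

Here a "pointer" is a `long` holding a real native address, so there are no bounds checks at all: an out-of-range access can crash the whole JVM rather than throw an exception.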

But you're right that it will still be difficult to naively
inter-operate between Java and C++ objects. Which is why it will be
an interesting research project. :)

-- Sanjoy

> Oh, and I doubt that you’ll find many more platforms that have a fully functional JVM than are LLVM targets. Even big-endian MIPS64 is not well-supported by Java (JamVM - a pure interpreter - is the only thing that we’ve managed to find that works).

If that’s your goal, then you might have better luck doing source-to-source than going via LLVM IR. In the past, I’ve managed to do most of Objective-C (not goto, pretty much everything else) -> JavaScript and Dart using clang AST visitors, with Objective-C classes being represented as native objects, with a bit of glue code to paper over the differences in the object models. With C++, you would most likely want to implement subclassing as composition and make each C++ class a Java interface so that you could do all of the kinds of casts required for multiple inheritance.
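To make the "subclassing as composition, each C++ class a Java interface" idea concrete, here is a small sketch. Everything in it is hypothetical: the C++ classes `Counter`, `Label` and `LabeledCounter` exist only in the comments, and the `*Impl` naming is just one possible convention. The point is that a derived object implements the interfaces of all its bases and delegates to embedded base subobjects, so upcasts, downcasts and cross-casts all become ordinary Java casts.

```java
// Assumed C++ input:
//   struct Counter { int count = 0; void inc() { ++count; } };
//   struct Label   { std::string text; };
//   struct LabeledCounter : Counter, Label { std::string describe(); };
final class MultipleInheritanceDemo {
    public static void main(String[] args) {
        LabeledCounter lc = new LabeledCounterImpl("hits");
        Counter asCounter = lc; // upcast to the first base
        Label asLabel = lc;     // upcast to the second base
        asCounter.inc();
        // Cast from one base back to the derived type; works because
        // the same Java object implements every interface.
        LabeledCounter back = (LabeledCounter) asLabel;
        System.out.println(back.describe()); // prints hits=1
    }
}

// One Java interface per C++ class.
interface Counter { int count(); void inc(); }
interface Label { String text(); }
interface LabeledCounter extends Counter, Label { String describe(); }

final class CounterImpl implements Counter {
    private int count;
    public int count() { return count; }
    public void inc() { count++; }
}

final class LabelImpl implements Label {
    private final String text;
    LabelImpl(String text) { this.text = text; }
    public String text() { return text; }
}

// The derived class holds its bases by composition and delegates to them.
final class LabeledCounterImpl implements LabeledCounter {
    private final Counter counterBase = new CounterImpl(); // base subobject 1
    private final Label labelBase;                         // base subobject 2

    LabeledCounterImpl(String text) { labelBase = new LabelImpl(text); }

    public int count() { return counterBase.count(); }
    public void inc() { counterBase.inc(); }
    public String text() { return labelBase.text(); }
    public String describe() { return text() + "=" + count(); }
}
```

This sketch ignores virtual inheritance, overriding of base methods from the derived class, and field access by offset, all of which a real translation would have to handle.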

There have been a few attempts at C to Java compilation, though I’m not sure of the current status of any of them. gcj implements Java classes using the same ABI as C++ classes, effectively treating Java as a subset of C++ (plus garbage collection). It might be interesting to start with the same subset of C++ that Microsoft has used for one of their various C++-on-the-CLR implementations and work from there. I think Alp Toker had some parsing for MS managed C++ extensions and CLR code generation working a few years back, but I’m not sure what happened to his code.

David

If you're trying to bypass/replace JNI, you're in for a surprise. :)

The number of bugs I found while interacting with Java from C or C++
on different VMs (MS, Sun, OpenJDK) was astounding.

Apart from the usual C++ class layout (which may be better in gcj as
David says), we had corruption in the stack because the VMs weren't
understanding the unwind information.

I originally found the stack bug in 2002 on Windows, later checked in
2008 and it was still there. I'd be surprised if that's fixed, and
even more surprised if that's the only remaining problem.

And those were only through JNI, a relatively safe interface. If you
try to send C++ directly to Java Bytecode, you'll find a huge list of
"implementation details" that are not just undefined, but thoroughly
undocumented and different on purpose (like memory allocation,
signals, asynchronous I/O, threads, etc).

Good luck! :)

cheers,
--renato

Also potentially interesting:
http://nestedvm.ibex.org/

I thought about something like that, but I don't think it's a good idea.
Writing an AST visitor on Clang, for example, would be cool, but it wouldn't be open to other frontends, and I think that this is a job for LLVM.
What about java-* attributes that can be put on certain IR operations to describe structures the Java bytecode output needs to know about, or to mark operations that should be translated in a specific way for Java?
These attributes could be added to IR modules optionally, like debug info.

That’s an interesting project but I’d like to give LLVM this “power” lol
Isn’t it applied to GCC?

I’m not sure what problem you’re trying to solve. Providing these attributes from the front end and ensuring that they’re correctly preserved by optimisers would likely be more effort than writing a Java bytecode back end (and a lot more for any language that has a simple mapping to the Java object model). By the time code is in LLVM IR, it’s already got a lot of assumptions about things like calling conventions and data layout embedded in it. You’d need to treat the Java back end as another target in your front end (and a particularly weird one, at that).

If I had a front end that I wanted to use to target the JVM then I would not go via LLVM IR to get there. The same probably holds for the CLR, though it’s a little bit less clear cut.

David

Passing through LLVM IR would enable LLVM to apply its optimizers and, as I said, I know that creating a custom standalone backend could be easier, but enabling LLVM to translate to Bytecode would enable all of its frontends to do it.
I know that it might be quite hard to treat Java or .NET as LLVM targets, but could this be a challenge? Let LLVM IR be compiled to Bytecode and CLI, which are higher level than our IR.

> Passing through LLVM IR would enable LLVM to apply its optimizers and, as I said, I know that creating a custom standalone backend could be easier, but enabling LLVM to translate to Bytecode would enable all of its frontends to do it.

Generally, this is a bad idea (I believe; Sanjoy is far more experienced here and can correct me if I’m wrong). Modern JVMs benefit a lot more from Java bytecode that is easy to analyse. The first thing that Hotspot does when it loads bytecode is undo a bunch of optimisations that javac does, because they make it harder to do the optimisations that the JIT performs.

> I know that it might be quite hard to treat Java or .NET as LLVM targets, but could this be a challenge? Let LLVM IR be compiled to Bytecode and CLI, which are higher level than our IR.

Going from a high-level language to a high-level IR via a low-level IR is certainly a challenge. Many things are difficult, but not all of them are worthwhile.

David