Hello!

This is OK if the code is too small to be worth optimization or
complex codegen and needs to be generated in huge volumes (something
that Google has, for sure). For all other cases, I can't see the
point.

In that sense, they've created a domain-specific compiler for a
domain-specific language, which will obviously perform better than
anything else in their case and worse than anything else for all other
cases. I don't see any novelty in that...

cheers,
--renato

Reclaim your digital rights, eliminate DRM, learn more at
http://www.defectivebydesign.org/what_is_drm

Even though LLVM does little optimization at -O0, there is still a
fair amount of work involved in translating to LLVM IR.

As said in my previous message, the most significant part of the work
is in generating native code from the LLVM IR.

And register allocation. That said, John's email sums it up rather well.

-eric

Right. Another common comparison is between clang and TCC. TCC generates terrible code, but it is a great example of a one pass compiler that doesn't even build an AST. Generating code as you parse will be much much much faster than building an AST, then generating llvm ir, then generating assembly from it. On X86 at -O0, we use FastISel which avoids creating the SelectionDAG intermediate representation in most cases (it fast paths LLVM IR -> MachineInstrs, instead of going IR -> SelectionDAG -> MachineInstrs).
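The one-pass idea can be sketched with a toy example (hypothetical Python, nothing like TCC's actual C code): the parser emits stack-machine instructions the moment it recognizes each operator, so no AST or IR ever exists.

```python
def compile_expr(src):
    """Toy one-pass compiler: parse space-separated integers, '+' and '*'
    (no parentheses) and emit stack-machine ops during parsing. No AST,
    no IR: code comes out as the parser consumes tokens."""
    toks = src.split()
    pos = 0
    code = []

    def atom():
        nonlocal pos
        code.append(("PUSH", int(toks[pos])))  # emit the moment we see it
        pos += 1

    def term():  # '*' binds tighter than '+'
        nonlocal pos
        atom()
        while pos < len(toks) and toks[pos] == "*":
            pos += 1
            atom()
            code.append(("MUL",))

    def expr():
        nonlocal pos
        term()
        while pos < len(toks) and toks[pos] == "+":
            pos += 1
            term()
            code.append(("ADD",))

    expr()
    return code

def run(code):
    """Tiny stack machine, just to check the emitted code."""
    stack = []
    for op in code:
        if op[0] == "PUSH":
            stack.append(op[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op[0] == "ADD" else a * b)
    return stack[-1]
```

The compiler never allocates a tree node; the cost is that any transformation needing a global view of the function becomes impossible, which is exactly the trade-off TCC makes.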

I'm still really interested in making Clang (and thus LLVM) faster at -O0 (while still preserving debuggability of course). One way to do this (which would be a disaster and not worth it) would be to implement a new X86 backend directly translating from Clang ASTs or something like that. However, this would obviously lose all of the portability benefits that LLVM IR provides.

That said, there is a lot that we can do to make the compiler faster at O0. FastISel could be improved in several dimensions, including going bottom-up instead of top-down (eliminating the need for the 'dead instruction elimination pass'), integrating simple register allocation into it for the common case of single-use instructions, etc. Another good way to speed up O0 codegen is to avoid generating as much horrible code in the frontend that the optimizer (which isn't run at O0) is expected to clean up.
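The single-use register allocation idea can be sketched like this (a toy allocator, not FastISel's actual algorithm): since most values are used exactly once, their register can be recycled immediately at that use, so a tiny linear pass suffices.

```python
def allocate(instrs, nregs=4):
    """Toy allocator: instrs is a list of (dest, operands) in program
    order, with every operand defined earlier in the list. A value's
    register is recycled as soon as its last use is seen, which for
    single-use values means immediately. No spilling."""
    uses = {}
    for _, ops in instrs:
        for o in ops:
            uses[o] = uses.get(o, 0) + 1
    free = list(range(nregs))[::-1]  # pop() hands out r0 first
    reg = {}
    for dest, ops in instrs:
        for o in ops:
            uses[o] -= 1
            if uses[o] == 0:
                free.append(reg[o])  # last use: recycle the register
        reg[dest] = free.pop()
    return reg
```

In the common case every value dies at its only use, so register pressure stays near one or two registers no matter how long the instruction list is.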

-Chris

Hi there,
Just my tuppence worth:

I for one would love it if the code-gen pass was quicker.
It makes LLVM even more appealing for JIT compilers.

One approach might be to generate code directly from the standard IR,
rather than creating yet another IR (the instruction DAG).

Would it be possible to convert the standard IR DAG to a forest of trees with a simple linear pass, either before or after register allocation, then use a BURG code generator on the trees?

BURG selectors are both fast and optimal (in theory, assuming all instructions can be given a cost, and ignoring scheduling issues).
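The bottom-up, cost-driven selection that a BURG generator automates can be sketched by hand (hypothetical patterns and costs, not a real instruction set): each node is labeled with the cheapest way to cover it, considering both single-node rules and larger fused patterns.

```python
def cost(node):
    """Toy bottom-up tiler. node is ('const', n) or (op, left, right).
    Returns (total cost, instruction list). Costs are made up:
    load 0, add 1, mul 2, fused madd 2."""
    if node[0] == "const":
        return 0, ["ld %d" % node[1]]
    op, l, r = node
    lc, li = cost(l)
    rc, ri = cost(r)
    best = (lc + rc + (2 if op == "mul" else 1), li + ri + [op])
    # Larger pattern: add(mul(a, b), c) covered by one 'madd', cost 2.
    if op == "add" and l[0] == "mul":
        ac, ai = cost(l[1])
        bc, bi = cost(l[2])
        cand = (ac + bc + rc + 2, ai + bi + ri + ["madd"])
        if cand[0] < best[0]:
            best = cand
    return best
```

A real BURG tool precomputes this dynamic program into state tables (and memoizes per node rather than re-walking subtrees), which is where its speed comes from.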

Chris Lattner wrote:

Chris Lattner writes:

I'm still really interested in making Clang (and thus LLVM) faster at -O0 (while still preserving debuggability of course).

Why?

Arnt

I want the compiler to build things quickly in all modes, but -O0 in particular is important for a fast compile/debug/edit cycle. Are you asking why fast compilers are good?

-Chris

Yes, we already do this; that is what FastISel is. A BURG code generator is slower than what we already do.

-Chris

Chris Lattner writes:

Chris Lattner writes:

I'm still really interested in making Clang (and thus LLVM) faster at -O0 (while still preserving debuggability of course).

Why?

I want the compiler to build things quickly in all modes, but -O0 in particular is important for a fast compile/debug/edit cycle. Are you asking why fast compilers are good?

Sort of. Why you think more speed than LLVM currently provides is a significant benefit.

Here's a suggestion, anyway. Teach LLVM to store an optional input-related opaque byte array for each symbol. Define it only as something of arbitrary size that must change if it exists and the source changes.

In clang, use that to store the preprocessed source for each function, and if clang sees that the byte array in the output file is the same as its preprocessed source, it can leave the output and skip compiling that function. Then think hard about how this can be extended to skip some optimisation passes too.
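A minimal sketch of that scheme (hypothetical names; a hash stands in for the opaque byte array):

```python
import hashlib

class FunctionCache:
    """Sketch of the suggestion above (hypothetical API): keep an opaque
    byte array per symbol, here a SHA-256 of the preprocessed source,
    and skip codegen when it matches the previous build's array."""

    def __init__(self):
        self.fingerprint = {}  # symbol -> digest from the last build
        self.objects = {}      # symbol -> compiled artifact

    def compile(self, symbol, preprocessed_src, codegen):
        key = hashlib.sha256(preprocessed_src.encode()).digest()
        if self.fingerprint.get(symbol) == key:
            return self.objects[symbol], False  # unchanged: reuse output
        obj = codegen(preprocessed_src)         # the expensive part
        self.fingerprint[symbol] = key
        self.objects[symbol] = obj
        return obj, True                        # recompiled
```

The definition requires only that the byte array changes whenever the source changes, which any collision-resistant hash of the preprocessed text satisfies.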

Someone else said that nothing beats a splat compiler except another splat compiler. I think a compiler that mostly avoids compiling may have a fair chance.

Arnt

Arnt Gulbrandsen <arnt@gulbrandsen.priv.no> writes:

I'm still really interested in making Clang (and thus LLVM) faster
at -O0 (while still preserving debuggability of course).

Why?

I want the compiler to build things quickly in all modes, but -O0 in
particular is important for a fast compile/debug/edit cycle. Are you
asking why fast compilers are good?

Sort of. Why you think more speed than LLVM currently provides is a
significant benefit.

My compiler supports LLVM as a backend. The language heavily relies on
compile-time environment-dependent code generation, so it needs the
JIT. One of the things that is holding back LLVM on production systems
is that it needs minutes to JIT a medium-sized application. That's
barely tolerable for long-running server applications, and a big no-no
for client apps.

Here's a suggestion, anyway.

[snip]

I'm afraid that you are not being too original with that suggestion :-)

Faster is always better, and I see why increasing speed at all
optimization levels matters to any compiler. I just fail to see why we
should hack -O0 to make it faster.

Whenever you increase the speed of O0, you're obviously increasing it
for all levels, but hacking your way through just to compete with Go (or
any other) won't provide the same benefit. Besides, every program has
to make compromises between generality and speed. As LLVM is a
compilation infrastructure, I fail to see the point of domain-specific
changes to its internals.

Why is O0 any good for general purposes, besides debugging?

On the other hand, creating a clang-o0 with the changes and keeping it
outside of the core libraries would achieve the same results without
compromising generality...

cheers,
--renato


Óscar Fuentes writes:

Arnt Gulbrandsen <arnt@gulbrandsen.priv.no> writes:

Here's a suggestion, anyway.

[snip]

I'm afraid that you are not being too original with that suggestion :-)

Yeah. It's the kind of suggestion one expects to see either implemented or in a FAQ, "why doesn't LLVM do X", followed by a nontrivial explanation. But I didn't see either.

Arnt

>> I've tested it and LLVM is indeed 2x slower to compile, although it
>> generates
>> code that is 2x faster to run...
>>
>>> Compared to a compiler in the same category as PCC, whose pinnacle of
>>> optimization is doing register allocation? I'm not surprised at all.
>>
>> What else does LLVM do with optimizations turned off that makes it
>> slower?
>
> I haven't looked at Go at all, but in general, there is a significant
> overhead to creating a compiler intermediate representation. If you
> produce assembly code straight out of the parser, you can compile
> faster.

Right. Another common comparison is between clang and TCC. TCC generates
terrible code, but it is a great example of a one pass compiler that
doesn't even build an AST. Generating code as you parse will be much much
much faster than building an AST, then generating llvm ir, then generating
assembly from it. On X86 at -O0, we use FastISel which avoids creating the
SelectionDAG intermediate representation in most cases (it fast paths LLVM
IR -> MachineInstrs, instead of going IR -> SelectionDAG -> MachineInstrs).

I found LLVM was 2x slower than Go at a simple 10,000 Fibonacci functions
test. Do you have any data on Clang vs TCC compilation speed?

I'm still really interested in making Clang (and thus LLVM) faster at -O0
(while still preserving debuggability of course). One way to do this
(which would be a disaster and not worth it) would be to implement a new
X86 backend directly translating from Clang ASTs or something like that.
However, this would obviously lose all of the portability benefits that
LLVM IR provides.

That sounds like a lot of work for relatively little gain.

That said, there is a lot that we can do to make the compiler faster at O0.
FastISel could be improved in several dimensions, including going
bottom-up instead of top-down (eliminating the need for the 'dead
instruction elimination pass'), integrating simple register allocation into
it for the common case of single-use instructions, etc. Another good way
to speed up O0 codegen is to avoid generating as much horrible code in the
frontend that the optimizer (which isn't run at O0) is expected to clean
up.

HLVM generates quite sane and efficient IR directly and I've been more
than happy with LLVM's JIT compilation times when using it interactively
from a REPL.

So I'm not sure that LLVM is so slow as to make it worth ploughing much
effort into optimizing compilation times. If you want to go down that
route then I'd
certainly start with higher level optimizations like memoizing previous
compilations and reusing them. What about parallelization?

Sort of. Why you think more speed than LLVM currently provides is a
significant benefit.

My compiler supports LLVM as a backend. The language heavily relies on
compile-time environment-dependent code generation, so it needs the
JIT. One of the things that is holding back LLVM on production systems
is that it needs minutes to JIT a medium-sized application. That's
barely tolerable for long-running server applications, and a big no-no
for client apps.

Not that I'm against faster jitting but have you tried an interpreter/jit solution, where you're jitting only hot functions in your application? Or using an O0 jit and then recompiling hot functions?

Not that I know exactly what you're doing, but thought I'd try to help a bit.
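The interpret-first, JIT-the-hot-functions idea can be sketched like this (a toy tiering driver with a simple call-count heuristic; the heuristic itself is hypothetical):

```python
class TieredRunner:
    """Toy tiering driver (hypothetical API): interpret every call,
    count per-function calls, and switch a function to 'jitted' code
    once it crosses a threshold. The compile cost is paid only once."""

    def __init__(self, jit_threshold=100):
        self.counts = {}
        self.compiled = {}
        self.threshold = jit_threshold

    def call(self, name, interp_fn, jit_compile, *args):
        if name in self.compiled:
            return self.compiled[name](*args)   # fast jitted path
        self.counts[name] = self.counts.get(name, 0) + 1
        if self.counts[name] >= self.threshold:
            self.compiled[name] = jit_compile(name)
            return self.compiled[name](*args)
        return interp_fn(*args)                 # slow interpreted path
```

Whether this wins depends entirely on how well the threshold identifies the truly hot functions, which is the difficulty raised in the reply.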

-eric

I thought about that for a while, but if you keep your classes/files
small, intra-unit parallelization gains are probably not worth the
time invested. Compiling multiple files is embarrassingly parallel.
[1]

MHO is that, though inter-unit optimizations can take much longer, the
benefits are worthwhile. Multiple threads/processes with a message
passing interface in between them would be a start, but compiling a
unix kernel that way would be tricky memory-wise. ;-)

cheers,
--renato

[1] http://en.wikipedia.org/wiki/Embarrassingly_parallel

Eric Christopher <echristo@apple.com> writes:

Sort of. Why you think more speed than LLVM currently provides is a
significant benefit.

My compiler supports LLVM as a backend. The language heavily relies on
compile-time environment-dependent code generation, so it needs the
JIT. One of the things that is holding back LLVM on production systems
is that it needs minutes to JIT a medium-sized application. That's
barely tolerable for long-running server applications, and a big no-no
for client apps.

Not that I'm against faster jitting but have you tried an
interpreter/jit solution, where you're jitting only hot functions in
your application? Or using an O0 jit and then recompiling hot
functions?

My compiler produces pseudo-instructions for a virtual stack-based
machine. This code can be executed or translated to some other backend
(LLVM, for instance). If the LLVM backend is used, all the code needs to
be jitted, as the code produced by LLVM can't mix with the
pseudo-instructions.

The emulated stack-machine is quite fast (several times faster than
Python, for instance) and I guess that is faster than the LLVM
interpreter.

It is very difficult to determine which functions are "hot". There are
several possible heuristics, but failing to determine just one or two
functions as hot may severely impact the final performance, due to the
slowness of the interpreter.

Finally, my measurements say that optimization is not where most of the
time is used (IIRC, going from -O0 to -O2 adds 30% of compile
time). Generating the LLVM IR takes almost no time. It is the process of
turning the LLVM IR into executable code that is "slow" and, worse
still, the time required grows faster than the LLVM IR code to be
JITted.

Not that I know exactly what you're doing, but thought I'd try to help
a bit.

I appreciate your help.

Just as some anecdotal data, I'll mention that one of the backends
that my compiler uses simply maps each pseudo-instruction to assembler
code doing *very* simple optimizations and register allocation. The
resulting code still emulates the stack machine, so it is possible to
mix assembled code with pseudo-instructions, which makes the resulting
assembler suck even more. The assembler code is saved to a file, an
assembler is invoked and the resulting raw binary file is loaded into
memory, ready to be executed. The process is very fast compared to LLVM
-O0 (about 3x faster for my current test application) but the most
surprising part is that the native code runs significantly faster than
LLVM JITted code at -O0 level. At LLVM -O2 level, you can come up with
benchmarks that run several times faster than the assembled code, but
on real-world applications within my application domain (database
managers) the difference turns out to be only 20% on average in favor
of LLVM -O2 for CPU-intensive tasks.

I guess that an approach that simply translates LLVM IR to assembler
doing the simplest register allocation would be several times faster
than the current JIT, and even produce faster code.
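That guessed approach, one fixed template per instruction with no DAG and no scheduling, can be sketched like this (hypothetical opcodes and templates, not real LLVM IR):

```python
def emit(instrs):
    """Toy direct translation: one fixed template per opcode, looked up
    per instruction. No DAG, no scheduling, no register allocation
    beyond what the caller already chose. Opcodes are made up."""
    templates = {
        "mov": "movl {0}, {1}",
        "add": "addl {0}, {1}",
        "mul": "imull {0}, {1}",
    }
    return [templates[op].format(*args) for op, *args in instrs]
```

Per-instruction cost is a dictionary lookup and a string format, which is why such translators can be so much faster than building and walking an intermediate DAG.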

Renato Golin wrote:

What about parallelization?

I thought about that for a while, but if you keep your classes/files
small, intra-unit parallelization gains are probably not worth the
time invested. Compiling multiple files is embarrassingly parallel.
[1]

Compiling multiple files is embarrassingly parallel ... which doesn't matter if your developers are doing a build+edit cycle on a single file.

I don't know about the backend, but I've thought about how to parallelize the middle end. We already have the notion of FunctionPass, but running FunctionPasses in parallel will encounter a lot of contention around the reference counts for things like ConstantInt 'i32 0'. That led me to my next idea, which is to turn the use lists for llvm::Constant (except llvm::GlobalValue) off. Just remove that and you lose the lock contention. That means that you can't look at a Constant and iterate over its list of users, but then again, you already weren't doing that.

I might experiment with trying out a parallel optimizing LLVM -- mostly to find out what the problems really are -- but I suspect it won't really matter for performance until the backend is parallel.

Nick

The place to start would be to make 'llc' parallel on functions. This would directly benefit lto and normal single file compiles.

-Chris
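Per-function parallel codegen can be sketched like this (a toy driver, not llc; a real implementation would also need thread-safe access to shared module state such as the constant pool discussed above):

```python
from concurrent.futures import ThreadPoolExecutor

def compile_module(functions, codegen_one, workers=4):
    """Toy per-function parallel codegen: functions are independent at
    this stage, so farm them out to a thread pool. Output order matches
    input order, so the final object layout stays deterministic."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(codegen_one, functions))  # preserves order
```

Because codegen of one function never reads another function's body, the work partitions cleanly; the hard part is the shared mutable state, not the task decomposition.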