[RFC] ConstExprPreter — clang constexpr interpreter

Nandor_Licker1 · July 3, 2019, 5:48pm

TL;DR: Fast interpreter for C++ constexpr to replace the existing tree evaluator

I am a PhD student at the University of Cambridge currently interning at Apple,
working with JF Bastien and Michael Spencer on improving constexpr performance.
Constexpr evaluation is presently slow and difficult since it relies on a
monolithic AST walker. We propose replacing the tree evaluator with a bytecode
interpreter to improve compilation times. The tree evaluator also poses
significant maintenance and scalability problems, which we intend to ameliorate
using the interpreter. While being generally faster, the interpreter does
present two critical issues: a slightly increased memory footprint and added
complexity to the compiler. This tradeoff is justified, as efficient constexpr
evaluation could prove to be valuable as the language evolves.

We would like to integrate this interpreter into clang. This RFC details the
benefits of an interpreter and describes an initial implementation, along with
a roadmap for replacing almost all of the tree evaluator with the interpreter.
Even without optimizations, the performance of the interpreter matches that of
the tree evaluator, thus the short-term focus is on feature completeness, not
evaluation speed, as reflected by known inefficiencies in the current
implementation. We would highly appreciate comments related to integration into
clang and our roadmap towards replacing the evaluator, as well as feedback on
the initial patch. This project serves mostly as a prototype in order to
determine what kind of bytecode and compiler is required to efficiently evaluate
code and emit useful diagnostic messages, paving the way for a future fast,
potentially JIT-ed, interpreter.

What?

The ConstExprPreter is an interpreter for C++ constexpr embedded into the clang
frontend: instead of evaluating constant expressions by walking the AST, the
constexpr interpreter compiles C++ to safe bytecode and executes the bytecode
in accordance with the constexpr semantics, emitting all appropriate
diagnostics. It aims to replace the existing AST walker, which is less efficient
and does not scale well in complexity as the constexpr subset of the C++
language is expected to increase in the future.

Why?

The present constexpr evaluator is a 12.000 LOC monolith defined in
clang/lib/AST/ExprConstant.cpp and poses a performance and maintenance problem.
The tree interpreter limits the complexity of constexpr evaluated in a module by
bounding recursion depth (-fconstexpr-depth=) and bounding the number of
execution steps (-fconstexpr-steps). This severely limits the complexity of
expression which can be evaluated in a given time budget. Furthermore, because
of complexity, the implementation of certain potential future features on top
of the evaluator, such as exception handling, pose serious difficulties. An
efficient constexpr interpreter is expected to be faster and easily extensible,
ameliorating some of the limitations of the tree evaluator, especially regarding
performance. By improving the evaluation speed of constexpr, we expect to enable
C++ users to replace instances of automatically generated code with complex
constexpr, simplifying and improving the reliability of builds.

Proposed Roadmap

* Commit the initial patch which embeds a simple bytecode compiler and
interpreter into clang, alongside the existing constexpr evaluator. This
interpreter only supports a subset of constexpr and is disabled by default.

* Add features to the interpreter, reaching a point where it supports all the
features of the existing evaluator.

* Turn the interpreter on by default.

* Move the entry point of the interpreter from function calls to arbitrary
  toplevel expressions. In order to avoid performance regressions, a small
  subset of features will be evaluated without entering the bytecode compiler
  and the VM: for example, frequent integer constant definitions, such as
  constexpr int kMask = 1 << sizeof(T). This strategy requires keeping parts of
  the existing Int evaluator, but allows the removal of the LValue, Complex,
  etc. evaluators, significantly reducing the complexity of ExprConstant.cpp.

* Remove most of the toplevel evaluator, minus the parts required to interpret
simple expressions. Roles will be reversed: if the evaluator encounters an
unsupported feature, it falls back to the interpreter.

* Remove the flags enabling/forcing the interpreter.

Initial Implementation

The initial contribution is available in D64146. This only contains the
minimal interpreter required to compile a basic example. Further patches have
been developed, implementing pointers, which will be submitted for review
afterwards.

The implementation of the interpreter resides in lib/AST/ExprVM, in the
clang::vm namespace. The sole entry point is from ExprConstant.cpp, in the
HandleFunctionCall method. When this method is invoked in order to evaluate a
function application, the vm::Context attached to each module is retrieved from
ASTContext and an attempt is made to compile and interpret the function with
the given parameters. If this fails, the tree interpreter is used as a fallback,
unless the command line flags explicitly ask for a diagnostic pointing to
unimplemented features. If the interpreter succeeds, the result is converted to
a format compatible with the tree evaluator.

The bytecode compiler simply walks the AST of each function generating a
vm::Function, linking them together into a vm::Program. Certain peephole
optimizations are performed in the compiler in order to optimize
local/parameter accesses and remove redundant instructions, avoiding the use of
pointers whenever possible. The compiler heavily relies on type information:
the Classify method in vm::Context is used to approximate a type from the AST
with a base type supported by the interpreter. The types and the necessary
utilities are defined in Type.cpp. Presently, only 32-bit integers are
supported, however 8, 16 and 64 bit integers will be added and a fallback onto
APSInt is planned for 128-bits and beyond.

Internally, the compiler relies on a type classification–PrimType–to decide what
opcodes to emit for a particular operation. The Context::Classify method relies
on target information to decide what internal types to map C/C++ types to,
returning llvm::Optional<PrimType>. In the future, the classification is
expected to be complete, removing a large number of the nullopt-checks from the
compiler.

The opcodes supported by the VM are defined in Opcodes.inc, a header used to
generate most of the interpreter, as well as the disassembler. The VM is stack
based, since such bytecode is fast to generate and the high upfront cost of
register allocation is avoided. In order to accurately emit diagnostics, the VM
needs to cooperate with the tree interpreter—this is achieved by isolating
diagnostics into the vm::State class, inherited by both EvalInfo and
InterpState. Stack frames now use vm::Frame as their base class, inherited by
InterpFrame and CallStackFrame. This allows the stack frame to be traced through
both the VM and the tree walker effectively. The present focus is on
correctness — a path is kept open to optimize the interpreter and improve
instruction dispatch and memory layout, however this is not the current
priority. The interpreter could also be specialized into two variants: one that
only detects problems, falling back to a slower version which tracks
significantly more metadata for informative diagnostics.

The interpreter needs to detect pointer accesses that would result in Undefined
Behavior: dereferences and arithmetic on pointers which point to invalid memory
locations. To achieve this, each allocation site has a unique descriptor,
containing the metadata required to emit a diagnostic message. Each allocation
creates a block tracking the descriptor, along with all live pointers to that
block. Whenever a block goes out of scope, all the pointers are invalidated:
instead of pointing to the block, they now point to the descriptor. If such a
pointer is used, a diagnostic is generated and execution stops. This scheme
adds an overhead to pointer arithmetic, as the intrusive linked list of
pointers to a block needs to be maintained. If pointers correctly track the
lifetime of stack objects, no additional cost is paid at deallocation sites as
there are no pointers to invalidate. In the future, we might investigate
different schemes, such as the one in asan (shadow memory + epoch counters), or
garbage collection to keep invalid blocks alive as long as there are pointers
to them.

Usage

The proposed patch adds two new flags to clang:

-fexperimental-constexpr-interpreter:

Enables the interpreter, but falls back to the tree evaluator when encountering
language features which are not supported.

-fexperimental-constexpr-force-interpreter:

Forces the use of the interpreter, generating an error diagnostic whenever an
unsupported feature is encountered.

The behaviour of the compiler and the format of emitted diagnostics remains
unchanged, unless a bug in clang’s diagnostics is identified. In such a case,
we emit what we consider to be the correct diagnostic.

Performance

Since the present evaluator is not optimised for performance, a fair comparison
is difficult to obtain, but we expect the interpreter to outperform the tree
evaluator on most repetitive structures. Benchmarks on 1.000.000 iterations of
the sum_pow function in test/ExprVM/loop.cpp show a ~10× improvement in the
running time of the evaluator. Further tests are required in order to compare
the performance of the interpreter to the tree evaluator when evaluating
smaller, non-repetitive expressions. Once a reasonable subset of constexpr is
implemented in the VM, performance benchmarks on open-source constexpr code
will help us compare the cost of compilation and interpretation to that of
direct evaluation. Since the interpreter is significantly faster, the currently
tight limits on the number of statements and the number of steps can be relaxed.

Memory Usage

The constexpr interpreter requires a significant amount of metadata in order to
detect cases of undefined behavior and correctly report diagnostics. The
existing evaluator imposes a massive overhead, since all integers are
represented as APInt and all pointers keep a significant amount of metadata.
This overhead is lowered in the bytecode interpreter by better exploiting type
information: integer constants are fixed width and require no additional
information, while the size of pointer metadata is significantly reduced - 3x
increase in pointer size and a 16-byte overhead per allocation (the interpreter
tracks actual pointers for fast dereferencing, while the evaluator maintains an
lvalue and a path into that lvalue for structure fields, incurring a massive
overhead). Since the present implementation focuses on maintainability and
feature completeness, this can be further reduced in the future.

The compiled bytecode, quite dense due to the type-specialized opcodes, is
maintained in memory as long as the AST is live, which currently happens to be
live throughout all compilation phases. This adds to the total memory used by
the compiler. In the future, mitigations might be required. Given that the
ground truth—the AST—is present in memory, compiled functions could be
discarded and recompiled on demand, reducing peak memory usage.

Complexity

The interpreter duplicates the functionality of the existing evaluator,
presently adding significant complexity to the frontend. Unlike the monolithic
ExprConstant.cpp implementation, the constexpr interpreter is significantly more
modular, spreading the complexity across the compiler and the interpreter. It
will be possible to test the compiler separately from the interpreter, allowing
for easier maintenance. We expect the frontend to become simpler and more
maintainable after the AST walker is removed.

The full implementation of the interpreter is expected to involve significant
engineering effort. While development is in progress, an implementation of
reduction rules is required in both the tree walker and interpreter, adding
redundancy.

Extensibility

The interpreter should evolve alongside the language in the future, allowing for
new features included in future standards to be supported. We refrain from
performing any optimizations that would hinder the implementation of additional
features, such as constexpr support for alloca and exception handling.

Thanks for reading!

Nandor Licker

Tim_Rakowski · July 3, 2019, 6:54pm

Hi,

I'd just like to let you know that I am looking forward to trying out
the "ConstExprPreter".

I've had issues with the speed of constexpr evaluation in the past:
http://clang-developers.42468.n3.nabble.com/constexpr-JIT-td4058592.html
So your mail comes as a godsend

I cannot promise anything, but I'll try to come back with feedback "soon".

Regards,
Tim Rakowski

zygoloid · July 3, 2019, 7:14pm

Hi Nandor,

This sounds like a really cool project, and something that we would significantly benefit from.

This is an area I’ve put some thought into in the past. A lot of your decisions below seem good to me; I think there are a few other things you should consider:

The current APValue representation is extremely bloated. Each instance is 72 bytes, and every subobject of an object is stored as a distinct APValue, so for instance a single char[128] variable will often occupy 9288 bytes of storage and incurs 128 distinct memory allocations.
The current representation of an lvalue as an explicit subobject path is likewise very expensive in terms of memory and distinct allocations.

I have some pretty concrete ideas for how to solve these problems that I could write up if you’re interested in tackling them.

(Your project name makes me cringe a little; maybe referring to it as ExprVM to match the subdirectory name would be better?)

TL;DR: Fast interpreter for C++ constexpr to replace the existing tree evaluator

I am a PhD student at the University of Cambridge currently interning at Apple,
working with JF Bastien and Michael Spencer on improving constexpr performance.
Constexpr evaluation is presently slow and difficult since it relies on a
monolithic AST walker. We propose replacing the tree evaluator with a bytecode
interpreter to improve compilation times. The tree evaluator also poses
significant maintenance and scalability problems, which we intend to ameliorate
using the interpreter. While being generally faster, the interpreter does
present two critical issues: a slightly increased memory footprint and added
complexity to the compiler. This tradeoff is justified, as efficient constexpr
evaluation could prove to be valuable as the language evolves.

We would like to integrate this interpreter into clang. This RFC details the
benefits of an interpreter and describes an initial implementation, along with
a roadmap for replacing almost all of the tree evaluator with the interpreter.
Even without optimizations, the performance of the interpreter matches that of
the tree evaluator, thus the short-term focus is on feature completeness, not
evaluation speed, as reflected by known inefficiencies in the current
implementation. We would highly appreciate comments related to integration into
clang and our roadmap towards replacing the evaluator, as well as feedback on
the initial patch. This project serves mostly as a prototype in order to
determine what kind of bytecode and compiler is required to efficiently evaluate
code and emit useful diagnostic messages, paving the way for a future fast,
potentially JIT-ed, interpreter.

What?

The ConstExprPreter is an interpreter for C++ constexpr embedded into the clang
frontend: instead of evaluating constant expressions by walking the AST, the
constexpr interpreter compiles C++ to safe bytecode and executes the bytecode
in accordance with the constexpr semantics, emitting all appropriate
diagnostics. It aims to replace the existing AST walker, which is less efficient
and does not scale well in complexity as the constexpr subset of the C++
language is expected to increase in the future.

Why?

The present constexpr evaluator is a 12.000 LOC monolith defined in
clang/lib/AST/ExprConstant.cpp and poses a performance and maintenance problem.
The tree interpreter limits the complexity of constexpr evaluated in a module by
bounding recursion depth (-fconstexpr-depth=) and bounding the number of
execution steps (-fconstexpr-steps).

Minor point: while the depth limit does exist primarily to work around problems caused by the recursive implementation, the steps limit is provided to catch infinite loops. (I assume your point is that we could raise this limit commensurately with any performance improvements in the evaluator; that seems reasonable.)

This severely limits the complexity of
expression which can be evaluated in a given time budget. Furthermore, because
of complexity, the implementation of certain potential future features on top
of the evaluator, such as exception handling, pose serious difficulties. An
efficient constexpr interpreter is expected to be faster and easily extensible,
ameliorating some of the limitations of the tree evaluator, especially regarding
performance. By improving the evaluation speed of constexpr, we expect to enable
C++ users to replace instances of automatically generated code with complex
constexpr, simplifying and improving the reliability of builds.

Proposed Roadmap

Commit the initial patch which embeds a simple bytecode compiler and
interpreter into clang, alongside the existing constexpr evaluator. This
interpreter only supports a subset of constexpr and is disabled by default.

I notice that this is effectively building another system to convert the AST into a control flow graph. Have you considered using the existing CFG layer for (the majority of) this?

Add features to the interpreter, reaching a point where it supports all the
features of the existing evaluator.

Turn the interpreter on by default.

Move the entry point of the interpreter from function calls to arbitrary
toplevel expressions. In order to avoid performance regressions, a small
subset of features will be evaluated without entering the bytecode compiler
and the VM: for example, frequent integer constant definitions, such as
constexpr int kMask = 1 << sizeof(T). This strategy requires keeping parts of
the existing Int evaluator, but allows the removal of the LValue, Complex,
etc. evaluators, significantly reducing the complexity of ExprConstant.cpp.

Remove most of the toplevel evaluator, minus the parts required to interpret
simple expressions. Roles will be reversed: if the evaluator encounters an
unsupported feature, it falls back to the interpreter.

I like this approach a lot.

Remove the flags enabling/forcing the interpreter.

Initial Implementation

The initial contribution is available in D64146. This only contains the
minimal interpreter required to compile a basic example. Further patches have
been developed, implementing pointers, which will be submitted for review
afterwards.

The implementation of the interpreter resides in lib/AST/ExprVM, in the
clang::vm namespace. The sole entry point is from ExprConstant.cpp, in the
HandleFunctionCall method. When this method is invoked in order to evaluate a
function application, the vm::Context attached to each module is retrieved from
ASTContext and an attempt is made to compile and interpret the function with
the given parameters. If this fails, the tree interpreter is used as a fallback,
unless the command line flags explicitly ask for a diagnostic pointing to
unimplemented features. If the interpreter succeeds, the result is converted to
a format compatible with the tree evaluator.

The bytecode compiler simply walks the AST of each function generating a
vm::Function, linking them together into a vm::Program. Certain peephole
optimizations are performed in the compiler in order to optimize
local/parameter accesses and remove redundant instructions, avoiding the use of
pointers whenever possible. The compiler heavily relies on type information:
the Classify method in vm::Context is used to approximate a type from the AST
with a base type supported by the interpreter. The types and the necessary
utilities are defined in Type.cpp. Presently, only 32-bit integers are
supported, however 8, 16 and 64 bit integers will be added and a fallback onto
APSInt is planned for 128-bits and beyond.

Internally, the compiler relies on a type classification–PrimType–to decide what
opcodes to emit for a particular operation. The Context::Classify method relies
on target information to decide what internal types to map C/C++ types to,
returning llvm::Optional. In the future, the classification is
expected to be complete, removing a large number of the nullopt-checks from the
compiler.

The opcodes supported by the VM are defined in Opcodes.inc, a header used to
generate most of the interpreter, as well as the disassembler. The VM is stack
based, since such bytecode is fast to generate and the high upfront cost of
register allocation is avoided. In order to accurately emit diagnostics, the VM
needs to cooperate with the tree interpreter—this is achieved by isolating
diagnostics into the vm::State class, inherited by both EvalInfo and
InterpState. Stack frames now use vm::Frame as their base class, inherited by
InterpFrame and CallStackFrame. This allows the stack frame to be traced through
both the VM and the tree walker effectively. The present focus is on
correctness — a path is kept open to optimize the interpreter and improve
instruction dispatch and memory layout, however this is not the current
priority. The interpreter could also be specialized into two variants: one that
only detects problems, falling back to a slower version which tracks
significantly more metadata for informative diagnostics.

The interpreter needs to detect pointer accesses that would result in Undefined
Behavior: dereferences and arithmetic on pointers which point to invalid memory
locations. To achieve this, each allocation site has a unique descriptor,
containing the metadata required to emit a diagnostic message. Each allocation
creates a block tracking the descriptor, along with all live pointers to that
block. Whenever a block goes out of scope, all the pointers are invalidated:
instead of pointing to the block, they now point to the descriptor. If such a
pointer is used, a diagnostic is generated and execution stops. This scheme
adds an overhead to pointer arithmetic, as the intrusive linked list of
pointers to a block needs to be maintained. If pointers correctly track the
lifetime of stack objects, no additional cost is paid at deallocation sites as
there are no pointers to invalidate.

There are a bunch of other ways that pointers might be invalidated, due to lifetime events rather than storage events. For example, changing the active member of a union would prevent access through old pointers to the old active member, and changing the active union member /back/ would make such accesses valid again. Similarly, an explicit destructor call on an object prevents that object from being used, but certain other operations can bring it back to life and make the old pointers / references usable again. How do you intend to handle such cases?

efriedma-quic · July 3, 2019, 7:23pm

Have you given any thought to using the bytecode in other parts of the compiler? Other parts of clang, like static analysis and IR generation, could benefit from a simplified representation of a function's semantics. It doesn't immediately impact this project, but it would be a good idea to design the representation in a way that allows those uses in the future.

-Eli

Bigcheese · July 3, 2019, 8:15pm

Have you given any thought to using the bytecode in other parts of the compiler? Other parts of clang, like static analysis and IR generation, could benefit from a simplified representation of a function’s semantics. It doesn’t immediately impact this project, but it would be a good idea to design the representation in a way that allows those uses in the future.

-Eli

We have. I don’t really think it makes sense to try to use this for more than constexpr evaluation. The bytecode is designed to do one thing well, and changing it to better support the needs of the static analyzer and codegen could harm that use case and make it harder to change as constexpr evolves. If a clang-ir makes sense we should build a clang-ir, but this is orthogonal to that.

Michael Spencer

Nandor_Licker1 · July 3, 2019, 10:19pm

Hello Richard,

The current APValue representation is extremely bloated. Each instance is 72 bytes, and every subobject of an object is stored as a distinct APValue, so for instance a single char[128] variable will often occupy 9288 bytes of storage and incurs 128 distinct memory allocations.

Arrays and structures will be stored in a compact, contiguous form in memory, so we can save a lot of space here.

The current representation of an lvalue as an explicit subobject path is likewise very expensive in terms of memory and distinct allocations.

LValues are simply pointers in the VM. Pointers into structures will be a combination of a block and a descriptor: the
block represents the allocated memory, while the descriptor characterises the current field and limits what is accessible
through the pointer. There should be a unique descriptor for each field, so this option is more compact.

I have some pretty concrete ideas for how to solve these problems that I could write up if you’re interested in tackling them.

Would be great - we do have some ideas on how to proceed, but we’d appreciate any other input. One thing to mention is that
we first want a working and simple implementation which leaves the door open for further optimisations. We would like to avoid
optimisations now if they come with significant complexity.

(Your project name makes me cringe a little; maybe referring to it as ExprVM to match the subdirectory name would be better?)

I’ll let JF Bastien argue this point.

Minor point: while the depth limit does exist primarily to work around problems caused by the recursive implementation, the steps limit is provided to catch infinite loops. (I assume your point is that we could raise this limit commensurately with any performance improvements in the evaluator; that seems reasonable.)

The numbers are a ballpark measure either way - we could allow the VM to run 10x the number of steps if we expect it to be faster.
But this is something to be implemented in the future. Same goes for recursion depth, although we could experiment with tail call
elimination as well.

I notice that this is effectively building another system to convert the AST into a control flow graph. Have you considered using the existing CFG layer for (the majority of) this?

The bytecode is just a linear sequence of instructions - we do not build a CFG or explicit basic blocks at any point.
I guess that would be required if we were to apply non-trivial optimisations, but that is not planned. We perform
some peephole optimisations and emit specialised opcodes, but we do not plan to go beyond that as we are not sure
the optimisations would pay for themselves. Given that diagnostics are required whenever UB is encountered,
the opportunities are severely limited.

There are a bunch of other ways that pointers might be invalidated, due to lifetime events rather than storage events. For example, changing the active member of a union would prevent access through old pointers to the old active member, and changing the active union member /back/ would make such accesses valid again. Similarly, an explicit destructor call on an object prevents that object from being used, but certain other operations can bring it back to life and make the old pointers / references usable again. How do you intend to handle such cases?

The present implementation requires all allocated blocks to track all pointers which point to them. This
mechanism should allow unions and invalidation to be implemented. This is obviously not the fastest
option and we definitely need to think more about the invalidation aspect. Up until now, we mostly
considered the lifetime problem and a GC-like solution to it, but invalidation further complicates that.

Thank you,
Nandor Licker

David_Chisnall3 · July 4, 2019, 10:37am

Hi Nandor,

Nandor_Licker1 · July 4, 2019, 3:31pm

Hello David,

Can you contrast your approach with the other proposal that's come up a few times of simply JITing the constexpr bits of the code using the existing compiler infrastructure?

A JIT could be the next step following the interpreter, if the cost of the JIT can pay for itself.
Unfortunately, given the way the current evaluator is implemented, we cannot just go for
"simply JITing the constexpr bits of the code": a more fine-grained value representation and
a better pointer representation is needed, as well as a mechanism to manage and link
all the constexpr functions and methods in a compilation unit. We develop these in the
context of a bytecode interpreter.

Thank you,
Nandor Licker

JF_Bastien2 · July 5, 2019, 5:49pm

I agree that a JIT is purely additive to Nandor’s approach. You need to do all the same work as Nandor proposes, and then a lot more for the JIT to work. If JIT-compilation is the only support mechanism then some platforms that can’t JIT won’t even be able to compile.

I really want to stress one point Nandor makes: the JIT has to pay for itself in runtime. His approach has a handful of optimizations that pay for themselves, and he still requires really simple tree interpretation because an interpreter costs too much for extremely simple expressions. Writing a JIT is a balance between compile-time and runtime performance, and if you want both you end up with tiering, and eventually you have an interpreter anyways.

A JIT is therefore something to keep in mind, but definitely not something to prioritize. It’ll only start making sense *if* people start writing extremely complex constexpr code. If that happens, the interpreter is the right starting point.

Topic		Replies	Views
constexpr JIT? Clang Frontend	1	122	November 6, 2017
Outsourcing the constexpr evaluation to the LLVM IR interpreter? Clang Frontend	8	1253	February 24, 2023
Will clang do some optimization when evaluate the constexpr expression for faster compile speed? Clang Frontend clang	2	181	December 24, 2023
Status of constexpr Clang Frontend	4	71	May 11, 2011
Constexpr Evaluation and CodeGen Questions Beginners	2	525	November 27, 2019

[RFC] ConstExprPreter — clang constexpr interpreter

Related topics