I am researching various static analyzers and want to get a better understanding of how the symbolic execution in the clang static analyzer works.
Every implementation of symbolic execution has to make certain tradeoffs to be feasible.
I'm interested in the following issues:
How does the static analyzer approach the path explosion problem?
In this  presentation it is mentioned that there are budgets which limit the exploration after visiting the same node multiple times.
Are there any additional techniques to alleviate path explosion?
Yes, basically a number of different deterministic budgets and limits. A budget for the total number of nodes (the closest thing we have to a hard time limit), a budget for visiting a CFG block too many times (which sounds like what you're describing), a limit for callee function size and a recursion budget in interprocedural analysis, etc. One of my "favorite" limits is the rule to not consider a function for interprocedural analysis if it hit a budget last time we tried it; this one even works across otherwise unrelated symbolic execution entry points (within the same Clang instance, of course) (in hindsight, this was probably a bad idea; we're considering to replace it with something more predictable).
Secondly, I'm am wondering how it deals with symbolic values as indices for arrays (or symbolic pointer arithmetic).
I have read the paper about its memory model , but it doesn't seem to address this question.
All the approaches to the problem I know rely either on address concretization, creating a new path for every possible index or using a more powerful constraint solver (for example using theory of arrays).
An access to array through symbolic index produces a symbol that remembers, as part of its identity, that it was obtained by loading from *that* array by *that* symbolic index.
This is where support basically ends This guarantees that loading twice with the same symbolic index will produce the same value but we don't yet properly write down constraints over the new symbol in terms of constraints over array elements and constraints over the index value.
This probably leads to bugs / false positives but I haven't seen much of those in practice (even though I can easily craft one). If we were to improve upon this, we'd probably go ahead with writing down a constraint like
a[i] == a || a[i] == a || a[i] == a
without splitting the execution path. And then hope that the solver will handle that.
Generally, our experience with powerful constraint solvers is relatively interesting. We have our custom constraint solver that's extremely primitive and barely correct but extremely fast. Replacing it with a full-featured SMT solver causes an unacceptable slowdown (like 20x or so). Our overall future direction is to go for cross-checking instead: use our fast solver for discovering bugs, and then invoke a stronger solver only when a potential bug is discovered. We have this implemented (with Z3 as a motivating example) and it was shown to noticeably improve precision without causing any noticeable slowdowns. One reason this isn't enabled for all users is because nobody is ready to contribute enough manpower to take responsibility of keeping Z3 running in practice - say, if there's a bug in Z3 that causes the static analyzer to crash, it may harm our users way more than the false positives it eliminates, and we'll most likely be unable to address it quickly enough. Also licensing concerns and usual LLVM external dependencies concerns.
If we actually had a chance to ship a strong SMT solver, we could totally take advantage of its array theory support as part of the cross-check pass.