In a project we’re working on, we wish to build a tool that automatically manipulates C code.
We wish to be able to track the control flow of the program and manipulate it (flattening, adding blocks of code etc.).
We want the tool to be very general and automatic, to work properly on every C file, and to generate compilable code.
Do you think this is achievable using clang? Do you have any tips for implementing this tool?
With this sort of thing, “the devil is in the detail”. Yes, clang has tools for transforming source code, and there are other tools that work on the intermediate form in LLVM.
But without understanding in more detail what changes you want to make, and how they are meant to interact with the rest of the source code, it’s hard to say exactly what the best approach is.
(I’m far from sure I can give any appropriate advice, but I know enough about the subject to know that there can be difficulties, and the exact details are what make the difference between “approach A is good”, “approach B is good” and “it’s impossible no matter what”.)
Let’s say I want to create a tool that simply replaces every “while” block with an “if” block (and a goto at the end – to make it work the same).
Is it possible to create such a tool with clang, that will work on any compilable C file?
That would very much depend on exactly what the code looks like inside the loop. For example, if there is a break or continue, you need to deal with that. If you have loops nested inside each other, you need to handle that in some way, etc. And whilst clang may be able to HELP, it certainly doesn’t have a ready-made tool that does this. And since you mention flattening, I presume this means “remove nesting from loops”, which brings a whole heap of trouble if anyone ever declares variables inside those nested scopes: you need to move the declarations out to the outer level, and then rename anything that collides (preferably without causing new name collisions).
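To make the break/continue point concrete, here is a hand-written sketch of the rewrite for one small loop. The function names and the labels loop_head and loop_exit are made up for illustration; a real tool would have to generate labels that are unique within each function, and this only shows a single, non-nested loop.

```c
#include <stddef.h>

/* Original: add up the elements of a[], skipping zeros and
   stopping at the first negative value. */
int sum_while(const int *a, size_t n) {
    int total = 0;
    size_t i = 0;
    while (i < n) {
        int v = a[i++];
        if (v < 0)
            break;
        if (v == 0)
            continue;
        total += v;
    }
    return total;
}

/* Hand-rewritten: the while becomes an if plus a goto back to the
   test. Every break and continue that belonged to the loop has to
   become a goto as well, to a label placed where control would
   have gone. The labels are hypothetical names. */
int sum_goto(const int *a, size_t n) {
    int total = 0;
    size_t i = 0;
loop_head:
    if (i < n) {
        int v = a[i++];
        if (v < 0)
            goto loop_exit;   /* was: break */
        if (v == 0)
            goto loop_head;   /* was: continue */
        total += v;
        goto loop_head;       /* the goto at the end of the block */
    }
loop_exit:
    return total;
}
```

Even here the rewrite needs to know which loop each break or continue belongs to, which is the part that gets harder with nesting (and a break inside a switch inside the loop must be left alone).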
Of course, the LLVM toolchain will recognize just about every “does the same thing” loop construct (this is called loop normalization) and produce the same IR for any variant of the equivalent code [at least for any reasonable input code]. So I’m not sure there is that much to be gained from this, whether you think you’re going to get better code or more unreadable code. There was a thread looking to “obfuscate the code”, and I think my conclusion in that case was that this would be a lot of work, and LLVM generated almost identical code either way (because it figures out that the goto-loop is actually the same as the original while-loop).
Obviously, it’s still possible to mess up the code enough to confound the compiler into generating extra code that isn’t optimal, and eventually end up with something that is ultimately hard to understand - but things like LTO and other optimization tools will also do a good job of “mangling” the code into something rather hard to understand, with much less effort.
Of course, without understanding what you are ultimately trying to achieve, I’m only guessing on the “obfuscation” side of things - but my advice if you are going down that route is to experiment by hand on some medium-complex code, and see if the machine code generated is actually sufficiently different to make the whole project worthwhile.
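As a starting point for that kind of experiment, here is a small sketch with a plain loop next to a hand-flattened version of it. This assumes “flattening” in the dispatcher sense (a state variable driving a switch inside one loop), which may not be what you have in mind; the function names and state names are made up for illustration.

```c
/* Plain version: Euclid's algorithm with an ordinary while loop. */
unsigned gcd_plain(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hand-flattened version: each basic block becomes a case, and a
   state variable decides which block runs next. */
unsigned gcd_flat(unsigned a, unsigned b) {
    enum { TEST, BODY, DONE } state = TEST;
    unsigned t;               /* moved out of the loop body's scope */
    for (;;) {
        switch (state) {
        case TEST:            /* the original loop condition */
            state = (b != 0) ? BODY : DONE;
            break;
        case BODY:            /* the original loop body */
            t = a % b;
            a = b;
            b = t;
            state = TEST;
            break;
        case DONE:            /* code after the loop */
            return a;
        }
    }
}
```

Compiling both with something like clang -O2 -S (or adding -emit-llvm to look at the IR) and comparing the output for the two functions is the by-hand check I mean. Note that t had to be moved out of the loop body, which is the small version of the renaming problem mentioned earlier.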