Instruction Set Randomization

Hi. I’ve been turning this over in my head for a while, but after several hours of grepping and reading LLVM documentation, it seems to be more complex than I expected. Or it is just an abstraction in MC / CodeGeneration / the DAG / clang-driver / clang cc1as / … that I have not yet been able to find.

What I am looking for is a C++ object that holds the current instruction's opcode after assembly parsing, in binary or hexadecimal / numeric form, during the step in clang where LLVM IR is compiled to target-dependent machine code. Besides the opcode, it would carry further instruction info in some sort of struct {} / C++ object, but be backend / architecture independent. I am aiming to implement ISR (Instruction Set Randomization) in clang (LLVM) and am curious whether a synthetic LLVM backend has to be generated or created for each existing backend (RISC-V, ARM64, …) (it seems TableGen is one part of that), or whether there is a more clever way to do it (Instr…cpp abstraction, XOR on the opcode, done, without having to care about the backends' .td files).

There is clearly literature on this topic. It uses a random, secret “XOR key” to operate on the bits of instruction opcodes, so I am adapting that. How the key stays secret depends on the execution implementation: in our case the key is shared with an HQEMU code base before compilation, probably via ./configure; other implementations added a register to VHDL code that already emulates SPARCv8 and patched the Linux kernel, or use Intel PIN and XOR with the key every time ‘the processor fetches the next instruction’. If someone reading this has a better idea, please say so. For modifying opcodes, XOR with a secret value seems to me about as good as any other binary cryptographic method or algorithm.

Either way, no one would have much use for an opcode-randomized ELF executable, or even a bootable Linux kernel, unless we have an emulator to run it as a proof of concept, or can generate a piece of Verilog, synthesized from our LLVM ISR backend, to run on an FPGA. I was thinking of generating an HQEMU build from a randomized LLVM backend for now, though the FPGA route definitely seems interesting. (Unless I am wrong, and this is a target-specific implementation in clang/lib that has little to do with the LLVM backends.)

Here are some academic links:

This may be very lame, but just to make my point in Python:

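# The secret XOR key; only the first byte is used below.
mystery_bytes = [0x42, 0x23] # x86-64 supports 2 byte instructions officially?

# A few single-byte x86 opcodes, for illustration.
opcodes = {
    "MOV_AX_DISP32":    0xa0,
    "POP_SEG_SHORT":    0x07,
    "JUMP_PC_RELATIVE": 0xeb,
    "INT_OPCODE":       0xcd,
    "NOP_OPCODE":       0x90,
    "REX_OPCODE":       0x40,
}

# Keep the originals intact and store the randomized values separately.
scrambled_opcodes = {}

for op in opcodes:
    scrambled_opcodes[op] = opcodes[op] ^ mystery_bytes[0] # xor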
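
To make the other half of the idea concrete, here is a rough sketch of what the emulator side might do when it fetches: XOR with the same key again to get the original byte back. This is only an illustration under my own assumptions. `KEY`, `scramble` and `fetch` are hypothetical names, not anything from LLVM or HQEMU, and for simplicity it XORs every byte of the stream (immediates included) rather than only the opcode bytes, since opcode-only scrambling would need instruction boundaries.

```python
# Hypothetical one-byte key; in the scheme described above it would be
# shared with the emulator at build time.
KEY = 0x42


def scramble(code: bytes, key: int = KEY) -> bytes:
    """Compiler side: XOR every byte of the code stream with the key."""
    return bytes(b ^ key for b in code)


def fetch(code: bytes, pc: int, key: int = KEY) -> int:
    """Emulator side: undo the XOR on the byte at position pc."""
    return code[pc] ^ key


# NOP, then INT 0x80 (opcode 0xcd followed by its immediate byte).
original = bytes([0x90, 0xcd, 0x80])
randomized = scramble(original)

# Fetching byte by byte with the same key recovers the original stream.
recovered = bytes(fetch(randomized, pc) for pc in range(len(randomized)))
assert recovered == original
```

Because XOR is its own inverse, the same key works in both directions; code that was not scrambled with the key decodes to different bytes.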

I would be thankful for any guidance or feedback on doing this.