[ms] [LLVM-ML] Reserved words in assemblers

Hi all,

I’m still working on MASM support via LLVM-ML, though it has become a back-burner project for lack of reviewers. If you’d be interested in reviewing, please contact me!

However, I’ve also encountered an ambiguity in MASM syntax. For example:

    DWORD 5

is a valid declaration of a 32-bit variable with value 5… but

    CALL DWORD PTR []

is a valid x86 call instruction. (Yes, MASM has infix directives, and most of its directives are valid identifiers.)

It looks like ML.EXE resolves this by maintaining a set of reserved words that can’t be used as identifiers: specifically, all native instructions, as well as MASM directives, operators, and other predefined symbols. Unfortunately, MCTargetAsmParser currently has no interface for checking whether a string is an instruction name, so reproducing this is less trivial than it might be.
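
To make the reserved-word idea concrete, here is a minimal standalone sketch. It is not LLVM code: the helper names (toUpper, reservedWords, isReservedWord, canBeIdentifier) are hypothetical, the word list is a tiny illustrative sample, and the case-insensitive lookup is my assumption about how the check should behave.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical helper: upper-case a token so that lookups ignore case.
std::string toUpper(std::string S) {
  std::transform(S.begin(), S.end(), S.begin(), [](unsigned char C) {
    return static_cast<char>(std::toupper(C));
  });
  return S;
}

// Illustrative and deliberately incomplete. A real table would be built
// from the target's instruction names plus the MASM directive, operator,
// and predefined-symbol lists.
const std::unordered_set<std::string> &reservedWords() {
  static const std::unordered_set<std::string> Words = {
      "CALL", "MOV",  "ADD",  // instruction mnemonics
      "DWORD", "WORD", "BYTE", // data directives / type names
      "PTR",  "OFFSET",        // operators
  };
  return Words;
}

// True if Token matches a reserved word, ignoring case.
bool isReservedWord(const std::string &Token) {
  return reservedWords().count(toUpper(Token)) != 0;
}

// A token may name a user symbol only if it is not reserved.
bool canBeIdentifier(const std::string &Token) {
  assert(!Token.empty() && "expected a non-empty token");
  return !isReservedWord(Token);
}
```

With a table like this, the parser could reject a reserved word as a symbol name up front; the hard part in practice is populating the instruction portion of the table for each target.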

I see 3 ways to resolve this problem:

  1. (easy) disambiguate size directives (DWORD PTR) from variable declarations (DWORD) by looking ahead for a “PTR” token. This covers the only case I know of so far, but there could be cases I haven’t spotted yet. Draft Phabricator patch: https://reviews.llvm.org/D103257

  2. (medium) introduce a new function: MCTargetAsmParser.isValidInstructionMnemonic (name to be bikeshedded). This will have to be introduced for all MCTargetAsmParsers, which is not ideal… but it can leverage existing GET_MNEMONIC_CHECKER infrastructure to recognize the names from TableGen files. Using this, we can define our list of reserved words, and work from there.

  3. (hard) introduce a new function: MCTargetAsmParser.tryParseInstruction, which parses an instruction if present and otherwise backtracks, restoring parser & lexer state as if it had never been called. Again, this needs to be introduced for all MCTargetAsmParsers. (We’ve already done this with tryParseRegister, but that was relatively simple.) MasmParser can call this first, and only try to recognize a directive if the instruction fails to parse.
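
For illustration, the lookahead in option 1 can be sketched on a toy token list as follows. This is a simplified standalone example, not the code in the draft patch: the token representation and the names TypeKeywordUse, equalsIgnoreCase, and classifyTypeKeyword are hypothetical, and I am assuming the comparison against “PTR” should ignore case.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// The two readings of a type keyword such as DWORD.
enum class TypeKeywordUse { SizeDirective, DataDeclaration };

// Compare Token against an already upper-case keyword, ignoring case.
bool equalsIgnoreCase(const std::string &Token,
                      const std::string &UpperKeyword) {
  if (Token.size() != UpperKeyword.size())
    return false;
  for (std::size_t I = 0; I < Token.size(); ++I)
    if (std::toupper(static_cast<unsigned char>(Token[I])) != UpperKeyword[I])
      return false;
  return true;
}

// Tokens[Pos] is assumed to be a type keyword (e.g. DWORD). Peek at the
// next token without consuming it: "PTR" means a size directive inside an
// operand; anything else (or end of input) means a data declaration.
TypeKeywordUse classifyTypeKeyword(const std::vector<std::string> &Tokens,
                                   std::size_t Pos) {
  assert(Pos < Tokens.size() && "Pos must index the type keyword");
  const bool NextIsPtr =
      Pos + 1 < Tokens.size() && equalsIgnoreCase(Tokens[Pos + 1], "PTR");
  return NextIsPtr ? TypeKeywordUse::SizeDirective
                   : TypeKeywordUse::DataDeclaration;
}
```

The check needs only one token of lookahead and no target-specific knowledge, which is why it is the cheapest option; the trade-off is that it handles only the “PTR” case and none of the other potential reserved-word collisions.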

I’m currently leaning towards option #1. Is anyone opposed, or does anyone see a significant benefit to implementing one of the other options for other reasons?

Thanks,

- Eric