Increase the flexibility of the AsmLexer in parsing identifiers.


I would like to gather some ideas and opinions on how to make the default AsmLexer more flexible when dealing with Identifiers.

When the lexer emits something as an “Identifier” (read: a string of characters), the whole string has to be parsed in a single go, even if it contains elements that one might want to parse as separate entities.
In that case the target has to implement custom logic that lexes and parses the identifier string in place in order to emit the operands into the operand vector, which is not ideal.

At the moment the default AsmLexer lexes tokens like this:

A number of symbols are lexed directly into tokens (such as #, %, etc.), then there are integer and float literals, and finally there is a fairly big default category, covering anything that matches none of the previous cases, which is handled by the LexIdentifier() function.

In fact, in the current default AsmLexer this function doesn’t always emit an Identifier token; it may return Float literals or Dot tokens in some special cases, so it works more like a “handle what I couldn’t directly recognize” kind of function.

On multiple occasions I have found myself wanting to change what this function treats as a single Identifier and what it splits into separate tokens.

A use case would be this.

Let’s say that my target’s assembly syntax has this fancy characteristic where different operands are separated by ‘$’ (dollar) like in:

add r0$5$r3

The default AsmLexer would lex the entire r0$5$r3 as a single “Identifier”, and there is no way to lex each operand separately. Instead, some custom lexing logic must be applied to the returned “Identifier” token to split it and recognize each of the operands.
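
To make the workaround concrete, here is a minimal sketch of the kind of ad hoc splitting a target ends up writing today. It is plain standard C++ and does not use any LLVM API; `splitIdentifier` is a hypothetical helper name chosen for illustration, and the real code would still have to turn each piece into an operand.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of LLVM: splits the text of one
// "Identifier" token at every occurrence of Sep and returns the pieces.
std::vector<std::string> splitIdentifier(std::string_view Text,
                                         char Sep = '$') {
  std::vector<std::string> Pieces;
  std::size_t Start = 0;
  while (true) {
    std::size_t End = Text.find(Sep, Start);
    if (End == std::string_view::npos) {
      // No further separator: the remainder is the last piece.
      Pieces.emplace_back(Text.substr(Start));
      break;
    }
    Pieces.emplace_back(Text.substr(Start, End - Start));
    Start = End + 1; // Skip over the separator itself.
  }
  return Pieces;
}
```

Calling `splitIdentifier("r0$5$r3")` yields the three pieces `r0`, `5` and `r3`, but each piece is still just a string: the target has to re-classify it as a register or an immediate by hand, which is work the lexer would normally do.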

This is a silly example, but there are other cases where something similar happens and it can be a hassle to deal with, because what counts as an Identifier depends entirely on some arbitrary logic in the Lexer.

To override this logic, the entire default Lexer and Parser need to be replaced (probably copying most of the existing logic for the rest of the parsing anyway).

I would like to find an easier way to specify what is returned as an identifier and what is returned as separate tokens, allowing for more flexibility.

I developed a tentative patch that adds this flexibility to the current MCAsmLexer infrastructure.
I would like to gather opinions on this approach or ideas on other possible approaches to achieve something similar and find out if somebody else finds this kind of concept useful or not.


configurable_asmlexer.patch (5.23 KB)

I think allowing MCAsmParserExtensions to control this behavior by overriding methods would be cleaner than adding more setters. I’m imagining that each target is allowed to supply its own table or virtual method to implement ‘IsIdentifierChar’ in AsmLexer.cpp. This would handle AllowAtInIdentifier and your use case.
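
A rough sketch of the idea, using a toy lexer rather than the real AsmLexer: the class names `ToyLexer` and `DollarSeparatedLexer` and the method `isIdentifierChar` are made up for illustration, and the default character set below is only an assumption, not what AsmLexer.cpp actually accepts.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Toy stand-in for a lexer; NOT the real llvm::AsmLexer.
class ToyLexer {
public:
  virtual ~ToyLexer() = default;

  // Splits Input into tokens: either a maximal run of identifier
  // characters, or a single non-identifier, non-space character.
  // Unlike a real lexer, digits are not classified as integers here.
  std::vector<std::string> lex(std::string_view Input) const {
    std::vector<std::string> Tokens;
    std::size_t I = 0;
    while (I < Input.size()) {
      if (std::isspace(static_cast<unsigned char>(Input[I]))) {
        ++I; // Whitespace only separates tokens.
        continue;
      }
      std::size_t Start = I;
      if (isIdentifierChar(Input[I])) {
        while (I < Input.size() && isIdentifierChar(Input[I]))
          ++I;
      } else {
        ++I; // Punctuation becomes a one-character token.
      }
      Tokens.emplace_back(Input.substr(Start, I - Start));
    }
    return Tokens;
  }

protected:
  // Assumed default: '$' and '.' count as identifier characters.
  virtual bool isIdentifierChar(char C) const {
    return std::isalnum(static_cast<unsigned char>(C)) || C == '_' ||
           C == '$' || C == '.';
  }
};

// Hypothetical target whose syntax uses '$' as an operand separator.
class DollarSeparatedLexer : public ToyLexer {
protected:
  bool isIdentifierChar(char C) const override {
    return C != '$' && ToyLexer::isIdentifierChar(C);
  }
};
```

With the default predicate, `add r0$5$r3` lexes as `add` and `r0$5$r3`; with the override it lexes as `add`, `r0`, `$`, `5`, `$`, `r3`, without any other change to the lexing loop.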

Hello Reid,

I’m not sure I completely understand your proposal.
Are you proposing to add overridable virtual methods to MCAsmParserExtensions to be used by AsmLexer to specify which characters are part of an identifier or not?

If that is the case, I’m not sure how MCAsmLexer or AsmLexer could make use of them: MCAsmParserExtensions can see MCAsmLexer, but the latter doesn’t know anything about the former.

Or … maybe I completely misinterpreted what you were saying? :)