
Lexical Analysis with Extended Identifiers and Disambiguation by Table Look-up.

I present an extension to lexical analysis that allows extended identifiers, which may be almost any character string. This feature (some may say mess) requires only a few simple additions to a lexer. Allowing literally any character string, for example one that begins and ends with whitespace, really would be a mess, so some limits can and probably should be placed on extended identifiers.

First, the lexer needs a way to recognize an extended identifier when it first occurs. For example, assume the first occurrence of an extended identifier must be enclosed between !! and !, as in !!a four word identifier!.

A forcing (escape) character, such as \, can be used to put an ! inside an identifier.
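As a minimal sketch of scanning such a first occurrence, assuming the !! ... ! delimiters and the \ forcing character from the example above (the function name scan_declaration and its return convention are my own, purely for illustration):

```python
def scan_declaration(text, pos=0):
    """Scan a first occurrence written as !!...! starting at text[pos].

    Returns (identifier, next_pos), or None if no well-formed
    declaration starts at pos. The forcing character (a backslash)
    puts the following character into the identifier verbatim.
    """
    if not text.startswith("!!", pos):
        return None
    chars = []
    i = pos + 2
    while i < len(text):
        c = text[i]
        if c == "\\" and i + 1 < len(text):
            chars.append(text[i + 1])  # forced character, e.g. '!'
            i += 2
        elif c == "!":
            return "".join(chars), i + 1  # closing delimiter found
        else:
            chars.append(c)
            i += 1
    return None  # ran off the end: unterminated declaration


print(scan_declaration("!!a four word identifier! = 1"))
```

Any limits on extended identifiers, such as rejecting leading or trailing whitespace, could be checked just before the function returns.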

After scanning the first occurrence of an extended identifier, the lexer saves it in an identifier table. From then on, the lexer searches the identifier table as it begins scanning each token. Subsequent occurrences of the extended identifier do not need to be surrounded by !! and !, because the lexer can find them in its identifier table.
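One way the identifier table might be organized is as a trie, so that the search at the start of each token is a single walk over the input rather than a comparison against every stored identifier. This is only a sketch under that assumption; the class name IdentifierTable and its methods are hypothetical, and a plain hash set or sorted list would also work.

```python
class IdentifierTable:
    """Trie of extended identifiers with longest-prefix lookup."""

    _END = object()  # marks a node at which a stored identifier ends

    def __init__(self):
        self._root = {}

    def add(self, name):
        if not name:
            raise ValueError("an extended identifier must not be empty")
        node = self._root
        for ch in name:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def longest_match(self, text, pos=0):
        """Return the longest stored identifier found at text[pos:], or None."""
        node = self._root
        best_end = None
        i = pos
        while True:
            if self._END in node:
                best_end = i  # a stored identifier ends just before index i
            if i >= len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
        return text[pos:best_end] if best_end is not None else None


table = IdentifierTable()
table.add("a four word identifier")
print(table.longest_match("x = a four word identifier + 1", 4))
```

The lookup needs no delimiters in the input: it simply reports the longest stored identifier that begins at the current position.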

This identifier search may be done before, after, or in parallel with the standard lexer scan that uses rules to scan and identify tokens. When the search for an extended identifier and the conventional scan rules identify different tokens at the same position, the maximal munch rule (prefer the longest match) may be used to select the token. See: http://en.wikipedia.org/wiki/Maximal_munch.

Assume the extended identifier a+b has been declared as !!a+b! and stored in the identifier table. The lexer will then always scan a+b as the single token "a+b" instead of three separate tokens. For instance, in the expression "a+b*3" the leading a+b is scanned as one identifier. Although this technique can lead to program obfuscation, it permits identifiers that read like natural language, for example "The Statue of Liberty".
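To make the a+b example concrete, here is a rough sketch of a lexer that tries both the identifier table and a few conventional rules at each position and keeps the longer candidate (maximal munch). The rule set, the token kind names, and the choice to let ties go to the table are all assumptions of mine, and the table is a plain set for brevity.

```python
import re

# Conventional scan rules (hypothetical, for illustration only).
RULES = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("OP", re.compile(r"[-+*/=()]")),
]


def table_match(text, pos, table):
    """Longest entry of the identifier table that occurs at text[pos:]."""
    best = ""
    for name in table:
        if len(name) > len(best) and text.startswith(name, pos):
            best = name
    return best


def rule_match(text, pos):
    """Longest match among the conventional rules, as (kind, lexeme)."""
    best = (None, "")
    for kind, pattern in RULES:
        m = pattern.match(text, pos)
        if m and len(m.group()) > len(best[1]):
            best = (kind, m.group())
    return best


def tokenize(text, table):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace between tokens is skipped
            continue
        ext = table_match(text, pos, table)
        kind, lexeme = rule_match(text, pos)
        # Maximal munch: keep the longer candidate; ties go to the table.
        if ext and len(ext) >= len(lexeme):
            kind, lexeme = "EXT_IDENT", ext
        if not lexeme:
            raise ValueError(f"cannot scan input at position {pos}")
        tokens.append((kind, lexeme))
        pos += len(lexeme)
    return tokens


print(tokenize("a+b*3", {"a+b"}))
# [('EXT_IDENT', 'a+b'), ('OP', '*'), ('NUMBER', '3')]
```

Note that under this sketch the input "a+bc" is scanned as a+b followed by c, because the table match is longer than the ordinary identifier a. A real lexer would probably want an extra boundary check to avoid that kind of surprise.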

What are the pros and cons of using this lexer extension?

Has this technique already been discussed in the literature?

Are there any languages that use such a lexer?