Lexical Analysis with Extended Identifiers and Disambiguation by Table Look-up.

I present an extension to lexical analysis that allows extended identifiers: identifiers that may be any character string. This feature (some may say mess) requires only simple additions to a lexer. That said, allowing truly any character string, for example one that begins and ends with whitespace, really would be a mess. Thus, some limits can and probably should be placed on extended identifiers.

First, the lexer needs a way to identify an extended identifier when it first occurs. For example, assume the first occurrence of an extended identifier must be enclosed by !! and !. An example is !!a four word identifier!.

A forcing character, such as \, can be used to put a literal ! inside an identifier.

After scanning the first occurrence of an extended identifier, the lexer saves it in an identifier-table. Subsequently, the lexer searches the identifier-table as it begins scanning each token. Subsequent occurrences of the extended identifier do not need to be surrounded by !! and !, because the lexer can find them in its identifier table.
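To make the registration step concrete, here is a minimal sketch in Python of scanning a first occurrence of the !!...! form and saving it in the identifier table. The delimiter syntax and the forcing character \ are the ones proposed above; everything else (function names, the use of a set as the table) is my own illustrative choice, not part of any real lexer.

```python
import re

# First occurrence has the form !!...!, where \ forces a literal !.
FIRST_OCCURRENCE = re.compile(r"!!((?:\\.|[^!\\])*)!")

def register(text, pos, table):
    """If an extended identifier is declared at `pos`, add it to the
    identifier table and return (name, position after the declaration)."""
    m = FIRST_OCCURRENCE.match(text, pos)
    if not m:
        return None
    name = m.group(1).replace(r"\!", "!")   # un-force any escaped !
    table.add(name)
    return name, m.end()

table = set()
register("!!a four word identifier! = 3", 0, table)
register(r"!!a\!b!", 0, table)
# table now holds "a four word identifier" and "a!b"
```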

This identifier search may be done before, after, or in parallel with the standard lexer scan that uses rules to scan and identify tokens. When both the search for an extended identifier and conventional scan rules identify different tokens, the maximal munch rule may be used to select the token to be used. See: http://en.wikipedia.org/wiki/Maximal_munch.

Assume the token !!a+b! has been stored in the identifier table. Then whenever the lexer scans an expression containing a+b, it produces the single token "a+b" instead of three separate tokens. An expression such as "a+b*3" would always be scanned with a+b as an identifier. Although this technique can lead to program obfuscation, it permits identifiers that read like natural language, for example "The Statue of Liberty."
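The table look-up combined with maximal munch can be sketched as follows. This is a toy illustration in Python, assuming "a+b" has already been registered; the rule set and tie-breaking policy are my own assumptions, not a specification.

```python
import re

# Conventional scan rules (hypothetical, for illustration only).
RULES = [("NUMBER", re.compile(r"\d+")),
         ("NAME",   re.compile(r"[A-Za-z_]\w*")),
         ("OP",     re.compile(r"[+\-*/=]"))]

def tokenize(text, table):
    """Scan `text`, consulting the extended-identifier `table` and the
    conventional rules at each position; the longer match wins."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Longest registered extended identifier starting here.
        ext = max((n for n in table if text.startswith(n, i)),
                  key=len, default="")
        # Longest conventional-rule match starting here.
        kind, lexeme = None, ""
        for k, rx in RULES:
            m = rx.match(text, i)
            if m and len(m.group()) > len(lexeme):
                kind, lexeme = k, m.group()
        # Maximal munch: prefer the longer match; ties favor the table.
        if ext and len(ext) >= len(lexeme):
            tokens.append(("IDENT", ext))
            i += len(ext)
        elif lexeme:
            tokens.append((kind, lexeme))
            i += len(lexeme)
        else:
            raise SyntaxError(f"unexpected character {text[i]!r}")
    return tokens

print(tokenize("a+b*3", {"a+b"}))
# [('IDENT', 'a+b'), ('OP', '*'), ('NUMBER', '3')]
```

With an empty table the same input scans as five tokens (a, +, b, *, 3), which shows how registration changes the token stream.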

What are the pros and cons of using this lexer extension?

Has this technique already been discussed in the literature?

Are there any languages that use such a lexer?


Do you *really* need an extensible lexer?

What is your root goal here? I mean if it's just natural language programming, that can be done without doing this sort of mutation of the lexer.

"phrases" in Tangent (disclosure: Tangent is one of my older toy languages) for example allow for the specialization you're looking for without the !!/! hoops for creating new identifiers, while still using a static lexer and parser (though it does add a third step after parsing to determine 'part of speech' for lack of a better analogy). That provides at least some measure of consistency to the language, as well as allowing more... ambiguity in how natural language phrases are dealt with.

Sadly, the slapdash software VM for it is dog slow, and I've not worked on it for some time. But there's some documentation and it seems like it might be quite a bit of what you're looking for.

I'm sure others will have better alternatives, and perhaps more formal research on the topic.


Yes, more toward natural language, but in a programming language. I googled lots of things and didn't find anything. Wondering what others have come up with.

Answer: Do I really need an extensible lexer?

This is a controversial subject. But, you asked for my opinion.

I have often felt constrained by naming conventions. It is not a particular language that has bothered me; all that I have used or read about bother me.

I don't blame the 60's (and before) naming conventions. Computers were massive and massively expensive. After all, magnetic core memory was all hand threaded under microscopes, one core (bit) at a time. Memories were small; the first supercomputer I used, a Cray-designed CDC 6600, had only about 1 MB of memory (60-bit words).

However, today computers 1000x more powerful than the CDC 6600 sit on everyone's desk, and memories often exceed 1 GB. Compiler performance is no longer a reason for name constraints; there is no performance reason to follow 60's naming conventions.

The relevant issues are human factors and compiler complexity. People's native tongues have all kinds of words and symbols that cannot be used for programming in most popular languages, for example The Statue of Liberty and 1st. Moreover, writers make up words and acronyms to improve communication. I was prevented from using some names that would have improved my documentation and/or saved time, and I don't believe I am the only one.

This position supports my belief that computer language naming conventions should be more English like.

The quoting convention of !!...! that I used as an example was chosen to simplify my explanation. In many cases, the language syntax that defines code, data, and objects already encloses the tokens to be defined, which allows enhanced identifiers. For example, "int x = 0;" could easily be "int an iterator = 0;" This statement could easily (except for standards) define "an iterator" as a lexeme, and the lexer could always identify it as such.
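One way to read that suggestion is that the declaration syntax itself registers the lexeme. A hypothetical sketch in Python: everything between a type keyword and "=" in a C-like declaration becomes an entry in the identifier table. The keyword list and regular expression are my own assumptions for illustration.

```python
import re

# Hypothetical: in "int an iterator = 0;", the text between the type
# keyword and '=' is registered as a single lexeme.
DECL = re.compile(r"\b(?:int|float|char)\s+(.+?)\s*=")

def register_declared_names(source, table):
    """Scan declarations in `source` and add each declared name,
    possibly multi-word, to the identifier table."""
    for m in DECL.finditer(source):
        table.add(m.group(1))
    return table

table = register_declared_names(
    "int an iterator = 0; an iterator = an iterator + 1;", set())
print(table)   # {'an iterator'}
```

After this pass, the lexer could treat every later occurrence of "an iterator" as one token, which is the behavior described above.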

tyvm for the question

For example, "int x = 0;"

For example, "int x = 0;" could easily be "int an iterator = 0;" This statement could easily (except for standards) define "an iterator" as a lexeme, and the lexer could always identify it as such.

I don't think this is quite as simple as you're making out, unless you're making some fairly limiting constraints on things.

int an iterator = 0; can be dealt with in a variety of ways. You have three traditional identifiers there, which could be dealt with in a variety of permutations. At lexing time, you have no idea what a type is. Given your description here, you shouldn't even know that = should be some non-variable operator. Maybe if you supplied an example of how that would be set up via lexer definitions, it would be clearer. I can't think offhand of a way that doesn't involve tons of duplication, fragility, or limitations on what can be used where.

Further, you state a bit lower that languages with arguments separated by spaces should not use multi-word variable names. I question this strongly. 'The Statue of Liberty' shouldn't behave too differently than 'The Capitol of <>' (fill in the blank with a variable of the proper type).

Either way, I think this post indicates that you've missed my original point. My point wasn't arguing against more natural sorts of programming languages. That's your business, and I definitely want people to feel free to explore/research new and interesting things. In fact, I tend to agree with the basic premise. I just feel that achieving that via lexer shenanigans as you describe might not be the best way to achieve that.

int x=0

Lexing time can mean three things, maybe more.

1) Lexer works with keystrokes and called by parser...calculator.
2) Lexer goes through a line/statement before parsing...interpreter.
3) Lexer goes through a program before parsing...compiler.

Calculator, interpreter and compiler are examples. A lexer for a calculator can be called by the parser with an argument that tells what kind of lexing to do. When lexing a line, statement, or program before the parser runs, coordination is difficult to impossible.

Is your complexity issue a matter of the compiler tools used or inherent in the syntax I propose?

I'll back off my objection to multi-word variable names in languages with arguments separated by spaces, (add 5 tics on Fluffy) is OK with me, unless tests of readability contraindicate.

Forgive me if I'm being

Forgive me if I'm being unclear. Lexing time (as I understand/meant it) is the process of taking your stream of characters and tokenizing them. I rather presumed that you were talking in the context of a new programming language; something with no context, no existing state. Something where a specific directive defines a new lexeme which is then valid for the rest of the process.

If this sort of thing is built on top of another language/framework, then I really have to question the utility of not just extending that language or using existing parser generators or conventional plug-in systems to accomplish the behavior.

If not, then there's still the question of how it's going to handle things.


New languages are a dime a dozen, and many only have users who also developed the language. I am definitely suggesting an enhancement for languages being used, and new ones.

I don't mean to trivialize how much effort would go into making any change to a language. The simplest bug is a big effort to fix. I'm familiar with language validation software that checks syntax and semantics. To begin, the standards committee would have to authorize any change--not trivial.

Otherwise, do you believe adding enhanced identifiers to an existing language is technically challenging? On a scale of 1-10, with changing C into C++ being 10, where would it rank?

If "using existing parser generators or conventional plug-in systems to accomplish the behavior" is possible, then that is certainly the way to go.

However, defining "a+b" as a variable causes an ambiguity when it is used in an expression such as "a+b*2", especially if a and b are also defined. Disambiguation is necessary in such cases, and using extended identifiers can lead to such ambiguities.

Languages are designed to avoid ambiguities. Thus, I suggested a technique for disambiguation whenever a programmer makes an identifier that leads to an ambiguity.

I don't want to be unclear either. Unfortunately, I know that I am not a good communicator, partly because I am dyslexic and thus often misread things.

I reread your posts. If we are still not on the same wavelength, then perhaps I am missing some fundamental information that you know.

The only purpose of this post is for me to complain that identifiers are limiting, and that programmers may be able to do a tiny bit better job with extended identifiers. LOL, trash my naive technical ideas if appropriate. Let me have extended identifiers.

degree of difficulty

Otherwise, do you believe adding enhanced identifiers to an existing language is technically challenging? On a scale of 1-10, with changing C into C++ being 10, where would it rank?

Up there. You have algorithm challenges to make an efficient lexer. You have language design issues working out the exact rules for extended identifiers. You have human factors issues like giving sane error messages. You have unknowns about how it plays with existing language syntaxes (or how to build a novel syntax around it). It's pretty open (i.e., unknown) just how hard it is, other than that it isn't trivial.


Otherwise, do you believe adding enhanced identifiers to an existing language is technically challenging? On a scale of 1-10, with changing C into C++ being 10, where would it rank?

The technical challenge isn't with the implementation, it's with marrying that change with the rest of the grammar's design. Having a lexer that defines an identifier differently isn't in and of itself too difficult. Doing it consistently with the rest of the grammar, and doing it in a way that is clear and consistent to the programmer are hard. Allowing user-defined grammars that are arbitrary (not quite sure if you're looking for that or not anymore) is super-hard.

However, defining "a+b" as a variable causes an ambiguity when used in an expressions such as "a+b*2", especially if a and b are also defined. Disambiguation is necessary in such cases, and using extended identifiers can lead to such ambiguities.

Indeed, which is why I pointed you towards some of the Tangent documentation (and why I cautioned against doing this in the lexer). It effectively uses your suggested method for disambiguation, but it also uses a few others. I'm often lurking on the irc channel (#ltu on freenode) if you're looking for detailed discussion, as the forum isn't really appropriate for that.

Anyways, I'm sorry if I came across as trying to trash anything. On the contrary, I currently lean towards more natural-language-like programming languages. Just trying to share the lessons I've learned.


You didn't come across as trying to trash anything. Forgive me for giving that impression.

I've learned a few lessons over the past 40 years. Technology changes have had little effect on programmer productivity and program quality.

To avoid creating a bug, programmers make the simplest possible changes to code when fixing a bug. They leave cryptic names rather than risk making a change. No one cares about the next maintenance programmer having trouble understanding a program.

How can we focus programmers' attention on making clear names? It is a small issue with a big impact on maintenance. Enhanced identifiers cannot solve the problem, but they will not hinder making a clear name.

Done before

Oz uses `a long symbol`. (Though you'll need to use the quotes each time.)

I've also seen (in an in-house scripting language) \"a string as a variable name\n", which has the advantage of simply reusing the string's parser, the initial '\' turning it into a symbol.

Oz and Mozart

Thanks for letting me know about this feature in Oz. Still working to get it running on my sys.

Racket has this already

You can already do this in Racket:

(define |The Statue of Liberty| "A big metal thing in NY")

syntax form

It appears that understanding your example requires me to understand syntax forms, and that your example is not as trivial as it appears. On the other hand, maybe I'm reading the wrong section, atm.

This is interesting, in any case.

When one does (define |my var| "My Value"), the form |my var| must be used everywhere, apparently. My suggested solution would allow my var without the "|" to be used everywhere except the declaration.

In fact, it appears to me the "|" aren't needed at all.

Nothing to do with syntax

The example is exactly as trivial as it appears, and has nothing to do with the syntax form. The relevant link in the documentation is here. The form |my var| is indeed required everywhere.


previous post edited.

Added value?

I don't want to criticize your efforts, but I'm wondering if there is such a difference between a variable named "The Statue of Liberty", which is hard to parse, and another named The_Statue_of_Liberty, which is nearly as easy to read and very easy to parse.

Added value?

Spaces are closer to natural language than _.

Personal preference:

My fingers are fat and stubby; when I try to key in _, I often hit + or (, which causes a bug.

If I realize I've keyed + or (, my attention is distracted from coding an algorithm for a second or two, and it annoys me. The distraction and annoyance may lead to my coding another bug.

If possible, I use CamelCase when coding to avoid keying _. However, I made a reading test to determine whether _ or CamelCase is faster to read, and _ wins; CamelCase is slower to read.

I doubt I am the only one who doesn't type _ perfectly and smoothly.

In the case of Lisp-like languages, which use spaces instead of commas to separate arguments, space-separated identifiers would be hard to read and might cause errors. However, the compiler would never be confused using my suggestion. To avoid confusion, languages that separate arguments with spaces shouldn't use space-separated identifiers. Oddly enough, - is easier for me to key than _.

Thanks for asking.

I think it's a neat idea

It's a neat idea. Programmers can shoot themselves in the foot with it and it might be inappropriate for languages where the number of programmers extending a given namespace has to scale arbitrarily -- but I think it could lead to very nicely readable code.

Heh. Here's an example of the limits you'd want on extended identifiers. Suppose you declare

var !!345! = 12;

Now, what does

print 345 + 1

mean? The "longest munch" rule isn't sufficient to disambiguate: 345 matches both the numeric literal and the registered identifier, and the two matches are the same length.


Also: seems like a fun algorithms problem to actually try and build that lexer! Bonus points for being tolerant to variations in whitespace and some variations in capitalization, etc.
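A rough sketch of that tolerant matching, in Python: each registered multi-word identifier is compiled into a pattern where word gaps accept any run of whitespace and matching ignores case. The function names and the regex-per-identifier approach are my own assumptions, just one way to attempt it.

```python
import re

def make_matcher(names):
    """Return a function that matches any registered multi-word
    identifier at a position, tolerating whitespace runs and
    differences in capitalization."""
    # Longest identifiers first, so maximal munch still applies.
    alternatives = [r"\s+".join(re.escape(w) for w in n.split())
                    for n in sorted(names, key=len, reverse=True)]
    rx = re.compile("|".join(alternatives), re.IGNORECASE)
    def match(text, pos=0):
        m = rx.match(text, pos)
        return m.group() if m else None
    return match

match = make_matcher({"The Statue of Liberty"})
print(match("the  Statue OF liberty stands in NY"))
# prints: the  Statue OF liberty
```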

...neat idea

LOL, it can be abused. It would help with the most obfuscated program contest. But programmers do try to make programs readable.

IMO var 345, var "345", and any other similar syntax trying to use a literal as a variable name should cause an error, even with extended identifiers. Same goes for comments, keywords, operators, etc. that are already described with a syntax.

Moderately Related

You might wish to check out Inform 7's syntax, which can deal with such things as Statue of Liberty as identifiers.

A few related topics on LtU: [1], [2].