questions re common lisp readtable hacks

The Common Lisp reader has a feature known as The Readtable, which can be used to assign, in an ad hoc but interesting way, a syntax type or a reader macro to individual characters.

I'm interested in historic and current experience with these features. I'm especially interested in experience that does not make use of the reader macro feature but that uses the mutability of the "syntax type" of characters in interesting ways.
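
To make the kind of thing I mean concrete, here is a minimal sketch of my own (not taken from any particular system; the name *tweaked-readtable* is made up for the example) that changes only the syntax types of characters and uses no reader macros at all: commas become whitespace and semicolons become ordinary constituent characters.

    (defvar *tweaked-readtable* (copy-readtable nil)) ; fresh copy of the standard readtable

    ;; Give #\, the syntax of a space, so it merely separates tokens.
    (set-syntax-from-char #\, #\Space *tweaked-readtable*)

    ;; Give #\; the syntax of #\a, so it becomes an ordinary constituent
    ;; character instead of starting a comment.
    (set-syntax-from-char #\; #\a *tweaked-readtable*)

    (let ((*readtable* *tweaked-readtable*))
      (read-from-string "(1,2,foo;bar)"))
    ;; => (1 2 |FOO;BAR|), read as three tokens inside one list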

How good or bad an approach to lexical analysis is this? How general-purpose has this approach proved to be in a practical rather than theoretical sense? Has Common Lisp's list of character syntax types been compellingly extended?

The more general question I'm contemplating asks what good alternatives there are to regular expressions for specifying lexical syntax. Common Lisp syntax tables have some appeal to me because the character syntax types reflect how a person might describe a lexical syntax to another person: "These characters can begin an identifier. This other, overlapping set of characters can be constituents of an identifier. This is the single-escape character for identifiers."... and so forth.

I'm wondering about the possibility of a system for defining lexers in those kinds of familiar terms.
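
As a very rough sketch of what I mean (a toy of my own; the names *identifier-start*, *identifier-constituent*, and lex-identifier are invented for the example), the "describe it the way you'd tell a person" style might be written down directly like this:

    ;; Character classes written out the way you would explain them to a person.
    (defparameter *identifier-start* "abcdefghijklmnopqrstuvwxyz_")
    (defparameter *identifier-constituent* "abcdefghijklmnopqrstuvwxyz0123456789_-")
    (defparameter *single-escape* #\\)

    (defun lex-identifier (stream)
      "Read one identifier from STREAM, or return NIL if none starts here."
      (let ((c (peek-char nil stream nil)))
        (unless (and c (find c *identifier-start*))
          (return-from lex-identifier nil)))
      ;; The start character is consumed by the loop below, since the start
      ;; set is a subset of the constituent set.
      (with-output-to-string (out)
        (loop for ch = (peek-char nil stream nil)
              while ch
              do (cond ((char= ch *single-escape*)
                        (read-char stream)           ; drop the escape itself
                        (let ((next (read-char stream nil)))
                          (when next (write-char next out))))
                       ((find ch *identifier-constituent*)
                        (write-char (read-char stream) out))
                       (t (loop-finish))))))

    ;; (with-input-from-string (in "foo-bar baz") (lex-identifier in)) => "foo-bar"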

why not regexps?

A little post-script about my motivations in asking:

Stepping back a little bit further, my interest is in approaches to lexical analysis that can be bootstrapped starting from very little, and using very little code.

A lexical analyzer generator that parses regexps and eventually spits out minimized DFA tables -- or anything similar -- is a fine piece of work but is not the kind of thing I'm interested in here. My problem with such approaches (for my goals) is their complexity.

I look admiringly at the Common Lisp reader because it is a simple, table-driven approach with hand-made tables, one that (macros aside) can be implemented with comparatively little code, even in machine code.
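
For what it's worth, here is the shape I have in mind (a deliberately tiny sketch of my own; *syntax-type* and tokenize are made-up names, and the real reader does far more): one syntax type per character code, stored in a vector filled in by hand, and a loop that does nothing but consult it.

    ;; One syntax type per character code; reconfigure by editing the table.
    (defparameter *syntax-type*
      (let ((table (make-array 128 :initial-element :constituent)))
        (dolist (c '(#\Space #\Tab #\Newline))
          (setf (aref table (char-code c)) :whitespace))
        (dolist (c '(#\( #\) #\" #\'))
          (setf (aref table (char-code c)) :terminating))
        table))

    (defun syntax-type (char)
      (if (< (char-code char) 128)
          (aref *syntax-type* (char-code char))
          :constituent))

    (defun tokenize (string)
      "Split STRING into tokens by consulting *SYNTAX-TYPE* and nothing else."
      (let ((tokens '()) (start nil))
        (flet ((flush (end)
                 (when start
                   (push (subseq string start end) tokens)
                   (setf start nil))))
          (loop for i from 0 below (length string)
                for type = (syntax-type (char string i))
                do (ecase type
                     (:whitespace (flush i))
                     (:terminating (flush i)
                                   (push (string (char string i)) tokens))
                     (:constituent (unless start (setf start i))))
                finally (flush (length string))))
        (nreverse tokens)))

    ;; (tokenize "(foo bar-baz 42)") => ("(" "foo" "bar-baz" "42" ")")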

The FORTH approach to lexical analysis is similarly minimalistic but is seemingly less flexible, at least in classic form.

I'm interested in lexical analyzers that can be configured by hand, easily, without relying on a generator; analyzers that are flexible and that hopefully can be dynamically reconfigured for a variety of interesting syntaxes.

"Hand coding" a lexer can be done concisely in many practical cases but the result isn't flexible.

A table-driven lexer can be flexible, but a regexp->DFA generator to drive it is more complexity than I want.

What are sweet spots in between?

Not sure if this is helpful, but ...

... since you're not getting any responses, I figured it can't hurt to throw in my 2 cents.

Have you considered the Brzozowski derivative method for regular expressions? This is a pretty straightforward approach.
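
In case it's useful, here is roughly what the idea looks like in a few lines (a bare-bones sketch of my own, with regexes as s-expressions; a real implementation would also simplify and memoize the derivatives, which is what turns this into DFA construction):

    ;; Regexes as s-expressions: a character, :empty (matches ""), :null
    ;; (matches nothing), (:seq r s), (:alt r s), (:star r).
    (defun nullable-p (re)
      (cond ((eq re :empty) t)
            ((eq re :null) nil)
            ((characterp re) nil)
            (t (ecase (first re)
                 (:seq (and (nullable-p (second re)) (nullable-p (third re))))
                 (:alt (or (nullable-p (second re)) (nullable-p (third re))))
                 (:star t)))))

    (defun deriv (c re)
      "Brzozowski derivative of RE with respect to the character C."
      (cond ((eq re :empty) :null)
            ((eq re :null) :null)
            ((characterp re) (if (char= c re) :empty :null))
            (t (ecase (first re)
                 (:seq (let ((r (second re)) (s (third re)))
                         (if (nullable-p r)
                             `(:alt (:seq ,(deriv c r) ,s) ,(deriv c s))
                             `(:seq ,(deriv c r) ,s))))
                 (:alt `(:alt ,(deriv c (second re)) ,(deriv c (third re))))
                 (:star `(:seq ,(deriv c (second re)) ,re))))))

    (defun matches-p (re string)
      "T if RE matches all of STRING."
      (nullable-p (reduce (lambda (r c) (deriv c r)) string :initial-value re)))

    ;; (matches-p '(:star (:alt #\a #\b)) "abba") => T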

I'm currently working on a "dynamically reconfigurable" lexer/parser in my spare time that I was hoping to write about here on LtU shortly. I think my approach might go to the opposite extreme of what you have in mind, though, since I'm treating lexers as "just" a parallel sequence of LALR(1) parsers. Then I reuse the standard DeRemer/Pennello method to build both the lexer and the parser. Finally, I add a minimal grammar with just a few rules that let you define lexemes or grammar rules.

The grammar compilation process runs at each such statement so you can immediately pinpoint where you've introduced ambiguity. It also makes syntactic extension pretty straightforward and modular.

I wasn't sure if it was worth posting to LtU, since parsing is generally considered uninteresting.