Lambda the Ultimate

Larry Wall: Apocalypse 5: Regular Expressions
started 6/4/2002; 11:46:45 PM - last post 6/11/2002; 10:57:20 AM
Ehud Lamm - Larry Wall: Apocalypse 5: Regular Expressions
6/4/2002; 11:46:45 PM (reads: 2016, responses: 16)
Larry Wall: Apocalypse 5: Regular Expressions
(via the DG. Thanks Keith!)

Given all this, I need to warn you that this Apocalypse is going to be somewhat radical. We'll be proposing changes to certain "sacred" features of regex culture, and this is guaranteed to result in future shock for some of our more conservative citizens. Do not be alarmed. We will provide ways for you to continue programming in old-fashioned regular expressions if you desire. But I hope that once you've thought about it a little and worked through some examples, you'll like most of the changes we're proposing here.

Since this essay is quite long and detailed, I tried to single out a few things I found interesting.

Page 2 lists problems with the current regex approach. It is always interesting to see a language critiqued by its own designer. Naturally, this critique largely determines the direction of the redesign of regular expressions.

Values that are determined within a regular expression should usually be viewed as speculative, subject to cancellation if backtracking occurs. This applies not only to the values captured by (...) within the regex, but also to values determined within closures embedded in the regex. The scope of these values is rather strange, compared to ordinary variables. They are dynamically scoped, but not like temp variables. A temporary variable is restored at the end of the current block. A hypothetical variable keeps its value after the current block exits, and in fact keeps that value clear to the end of its natural lifetime if the regex succeeds (where the natural lifetime depends on where it's declared). But if failure causes backtracking over where the variable was set, then it is restored to its previous state. (p. 13). [I'll try to write more about this later.]
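A rough way to picture this behavior, in plain Python rather than anything like the proposed Perl 6 syntax (this is only my own sketch of the semantics, with made-up names): a binding made while matching is kept only if the rest of the match succeeds, and is rolled back when the matcher backtracks past the point where it was set.

    def match_size(text, bindings):
        """Match <digits> 'px'; bind 'num' hypothetically to the digits."""
        snapshot = dict(bindings)          # state before the speculative part
        i = 0
        while i < len(text) and text[i].isdigit():
            i += 1
        if i > 0:
            bindings['num'] = text[:i]     # speculative ("hypothetical") binding
            if text[i:] == 'px':           # the rest of the "regex"
                return True                # success: the binding survives
        bindings.clear()
        bindings.update(snapshot)          # failure: backtrack, undo the binding
        return False

    b = {}
    print(match_size("42px", b), b)        # True {'num': '42'}
    b = {}
    print(match_size("42em", b), b)        # False {} -- the binding was rolled back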

Variable scope rules

Defining Your Own Rules

It is fun to watch a strong-willed, dominant language designer try to regain control of core features of "his" language. Can the community tolerate so much change?

I'll have to think about how all this compares with Icon. Some connections are pretty obvious, but the general approach is - as was to be expected - quite different. It is also going to be interesting to watch how these changes will influence other languages that support regexes, like Python.


Posted to general by Ehud Lamm on 6/5/02; 12:53:25 AM

Alex Moffat - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/5/2002; 4:09:35 AM (reads: 2036, responses: 0)
"The underlying visual problem is the overuse of parentheses, as in Lisp." I hope it's just a throwaway statement. It's not like parentheses mean different things in different places in Lisp, or that, using any reasonable editor, you have to match them by eye.

andrew cooke - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/5/2002; 7:14:32 AM (reads: 2016, responses: 0)
There's a possibly interesting discussion of an S-expression based syntax for regular expressions in Scheme at http://www.ai.mit.edu/~shivers/sre.txt (it has some sensible-ish philosophical comments about developing free software at the start too).

(I may have posted this before, but so long ago that I suspect the audience has changed!)

John Wiseman - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/5/2002; 10:33:32 AM (reads: 1997, responses: 0)
"We're frogs who are getting boiled in a pot full of single-character morphemes, and we don't notice."

Except that when you're really being boiled, the part that feels hot-tub warm and cozy lasts longer.

Ehud Lamm - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/6/2002; 6:49:24 AM (reads: 1899, responses: 0)
Another view on the role of the language designer:

Saying that the language is ready, i.e., that no more changes should be introduced, at any given point in the language's evolution is the same as saying that van Rossum cannot make correct decisions on language design. But saying that the language is ready is also the same as saying that van Rossum can indeed make correct decisions on language design, since he has done so with the improvements in the previous versions, and with the original version too. Herein lies the paradox. (This is about Python, of course.)

Ed Heil - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/6/2002; 6:17:37 PM (reads: 1858, responses: 2)
Larry sez:

Notionally, a regex is an organization of assertions that either succeed or fail.

But Olin Shivers sez (in the above-referenced document):

In principle, a regexp doesn't specify a matching algorithm, it specifies a *set of strings*.

Historically speaking, who's right?

Ehud Lamm - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/7/2002; 3:15:17 AM (reads: 1888, responses: 0)
I think both statements you quote focus on the declarative meaning of a regular expression.

From a CS pov I'd say that the distinction between regular expressions (as a declarative specification of a regular language) and finite automata (which recognize regular languages) is quite fundamental. The connection between the two (Kleene's theorem) is something you actually prove.

(BTW: notice that Perl-like regexes are more powerful than what are called regexes in CS; for example, they support backreferences, and are typically implemented by backtracking).
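A quick illustration, using Python's re module rather than Perl (my own example, not from the Apocalypse): the backreference \1 below matches "some word, a space, the same word again". The set of strings containing such a repetition is not a regular language, so no finite automaton can recognize it, which is one sense in which these patterns go beyond regular languages.

    import re

    # group 1 captures a word; \1 must repeat exactly what group 1 matched
    doubled_word = re.compile(r'\b(\w+) \1\b')

    print(bool(doubled_word.search("it happened happened twice")))   # True
    print(bool(doubled_word.search("it happened only once")))        # False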

Ehud Lamm - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/7/2002; 6:14:39 AM (reads: 1878, responses: 0)
I asked around and the feeling here was that regular languages were devised as a way to specify the properties of finite automata (so the recognizer was there first).

Notice that this is pre-history. Regular expressions in programming languages haven't really been regular languages for quite a while, so the history is more involved.

Ehud Lamm - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/7/2002; 8:39:17 AM (reads: 1823, responses: 0)
One more thing, before I forget. Re speculative values. In Icon you have a reversible assignment operator (<- instead of :=). The assignment is undone when backtracking occurs. Other side effects are not subject to this special treatment.
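Roughly, in Python terms (my own sketch, not Icon), reversible assignment amounts to something like this: the new value is kept only if the rest of the expression succeeds, and the old value is restored when the attempt fails.

    class Fail(Exception):
        pass

    def attempt(env, name, new_value, rest):
        """Reversibly set env[name]; if rest() fails, restore the old value."""
        old = env[name]
        env[name] = new_value              # the "<-" assignment
        try:
            return rest()
        except Fail:
            env[name] = old                # backtracking undoes the assignment
            raise

    def rest_of_expression():
        raise Fail()                       # pretend the rest of the goal fails

    env = {'x': 1}
    try:
        attempt(env, 'x', 99, rest_of_expression)
    except Fail:
        pass
    print(env['x'])                        # 1 -- the assignment was rolled back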

Ed Heil - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/7/2002; 9:19:12 AM (reads: 1818, responses: 3)
"not really regular languages" -- can you explain what the term "regular" means in this context, or point me to a resource that explains it?

andrew cooke - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/7/2002; 12:06:04 PM (reads: 1858, responses: 2)
http://www.wikipedia.com/wiki/Chomsky_hierarchy

Ehud Lamm - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/7/2002; 1:01:29 PM (reads: 1926, responses: 1)
Or if you want a more in depth look, grab the classic book used in many CS departments.

Regular languages have many interesting properties (closure under union, concatenation, Kleene star, intersection, and complement, among others).

To connect this to programming languages: Regular languages are often used to specify the tokens found during the lexing phase (the job of the lexical analyzer), for example recognizing numbers, white space etc. Context-free grammars are usually used to specify the syntax (which is handled by the parser, which builds the abstract syntax tree). Since programming languages are often not really context free, context information is usually handled as part of the semantic analysis of the program, which is done by analyzing the syntax tree. This is usually much simpler than using a context-sensitive grammar.
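To make the division of labor concrete, here is a tiny sketch (a made-up toy grammar, my own example, in Python): a regular expression specifies the tokens, and a small recursive-descent parser handles the context-free structure and builds a little syntax tree.

    import re

    TOKEN_RE = re.compile(r'\s*(?:(\d+)|([+*]))')   # numbers and operators; spaces skipped

    def lex(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if not m:
                raise SyntaxError("bad input at position %d" % pos)
            tokens.append(m.group(1) or m.group(2))
            pos = m.end()
        return tokens

    # Context-free grammar:  expr -> term ('+' term)*   term -> NUM ('*' NUM)*
    def parse(tokens):
        def term(i):
            node, i = ('num', int(tokens[i])), i + 1
            while i < len(tokens) and tokens[i] == '*':
                node, i = ('*', node, ('num', int(tokens[i + 1]))), i + 2
            return node, i
        node, i = term(0)
        while i < len(tokens) and tokens[i] == '+':
            rhs, i = term(i + 1)
            node = ('+', node, rhs)
        return node

    print(parse(lex("1 + 2 * 3")))
    # ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))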

I tried to make this as simple as possible, hope I didn't make too many wrong claims along the way...

jon fernquest - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/8/2002; 4:28:13 AM (reads: 1776, responses: 0)
Going back to first principles and then dumbing it down for programmer productivity would seem to be the best strategy.

There are more powerful regular expression packages available than Perl's, based on finite-state machines and the underlying mathematics.

Try Xerox's:

http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/home.en.html

Or AT&T's:

http://www.research.att.com/sw/tools/fsm/

And yes, you can be mathematically rigorous about relating regular expression matching to finite-state machine theory. See, for instance, this link and author:

http://www.cs.sunysb.edu/~algorith/implement/watson/implement.shtml

You can even use regular expressions to parse languages more complex than regular and context-free languages, such as Tree Adjoining Grammars, which are used to parse natural languages...

http://www.geocities.com/SoHo/Square/3472/complingrefs.html#tree

See the following book especially:

http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-9/mtr-9-7.htm

andrew cooke - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/8/2002; 4:49:59 AM (reads: 1955, responses: 0)
The book at http://www.cs.vu.nl/~dick/PTAPG.html also includes quite a good introduction to the hierarchies of grammars and languages.

Ed Heil - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/11/2002; 6:35:25 AM (reads: 1748, responses: 1)
Thanks, guys. I'd heard of the Chomsky Hierarchy before but never really had a context in which to understand it. The Wikipedia article was extremely informative. (I know Chomsky's natural language work pretty well, and don't like it at all, but I think his main problem in natural languages is that he treats them too much like computer languages, so it stands to reason that his work is in fact solid when applied to computer languages! :)

Ehud Lamm - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/11/2002; 7:44:55 AM (reads: 1799, responses: 0)
don't like it at all...

We can have some Chomsky linguistic bashing, if you like (no politics though, please!)

But are you thinking of Chomsky's early work, or his more recent views?

Ed Heil - Re: Larry Wall: Apocalypse 5: Regular Expressions
6/11/2002; 10:57:20 AM (reads: 1736, responses: 0)
Both his early work and his more recent views. As far as I can tell, his more recent views (GB and Minimalist) seem to be a matter of taking his older stuff and making it progressively more vacuous and less falsifiable.

(BTW, I don't know enough about politics to even begin to address his politics, so no worries on that issue from me. :)

Rather than directly criticizing Chomsky let me say what kind of linguistics I prefer:

I find theories like Ronald Langacker's "Cognitive Grammar" (originally called "Space Grammar" but George Lakoff convinced him to change it for marketing reasons -- heh) much more compelling. Langacker's main work in Cognitive Grammar is published in two large volumes called Foundations of Cognitive Grammar.

Charles Fillmore, Paul Kay, and Adele Goldberg (yes, THAT Adele Goldberg!) have done some work which is essentially compatible with Langacker's stuff, which they call "Construction Grammar."

George Lakoff has also done a lot of work on this sort of linguistics.

The basic idea is that "grammatical correctness" -- which is what Chomsky is all about; those grammars are all about deciding whether a string is "grammatical" -- is epiphenomenal.

A grammar is nothing but a list of constructions, where a construction is a form-meaning pairing. Some of the forms are more abstract than others. Forms can be embedded within forms, in which case meanings are combined in particular ways.

There are no grammatical rules whatsoever except for the above -- and "grammatical" means nothing more nor less than "parsable as a construction or combination of constructions" -- where "combination" here refers to both juxtaposition/embedding of form *and* blending of meaning.

The blending of meaning aspect is particularly interesting because meaning-construction by simple componential assemblage is understood to be a degenerate case; the basic process of which it is a degenerate instance is very flexible and contextually sensitive (Mark Turner and Gilles Fauconnier have done some very interesting work on this "blending" phenomenon as a general piece of cognitive equipment).

Chomsky didn't have much to say about meaning construction and what he did say was mostly about establishing correct reference for pronouns and other matters of what he called Logical Form.

Construction/cognitive grammarians like the above tend to be very interested in language as an embodied product of a big wet brain, rather than as an abstract mathematical system of symbols. They tend to be interested in the roles that imagistic and gestalt phenomena play in language understanding, and they reject Chomsky-style isolation of language from general cognitive mechanisms. They tend to posit very little in the way of special mental equipment which is just for language and nothing else, preferring to look at language in terms of specialized, focused usage of abilities that we also display in non-linguistic contexts.

I'm aware there are many other approaches to language besides the straight Chomsky-style and the Cognitive/Construction Grammar theorists, but those are the only ones I'm familiar with so I'm contrasting the two of them. I got to know Chomsky's stuff at first and then moved on to the latter, several years ago when I went on a big linguistics binge.

Let me end by saying I am definitely not a professional linguist and am just giving my impressions as a layman who's taken an interest in these things.