New Regexp

While working on new programming metalanguage, I encountered a parsing problem. I wanted to use Javascript Regexp to parse arbitrary tokens when parsing with recursive descent code. But, there is a big but with Javascript and Regexp: it is not possible to efficiently check does Regexp match at specific string offset. Splicing string on every check (to check supported beginning of string) was not my option, so I decided to implement my own Regexp library. I followed some documentation and got this implementation in my metalanguage:

RegExp <= (
    Union <= (@SimpleRE, '|', @RegExp) |
    SimpleRE <= (
        Concatenation <= (@BasicRE, @SimpleRE) |
        BasicRE <= (
            OneOrMore <= (@ElementaryRE, ('+?' | '+')) |
            ZeroOrMore <= (@ElementaryRE, ('*?' | '*')) |
            ZeroOrOne <= (@ElementaryRE, '?') |
            NumberedTimes <= (
                @ElementaryRE,
                '{',
                In <= (
                    Exactly <= @Integer |
                    AtLeast <= (@Integer, ',') |
                    AtLeastNotMore <= (@Integer, ',', @Integer)
                ),
                ('}?' | '}')
            ) |
            ElementaryRE <= (
                Group <= ('(', @RegExp, ')' |
                Any <= '.' |
                Eos <= '$' |
                Bos <= '^' |
                Char <= (
                    @NonMetaCharacter |
                    '\\', (
                        @MetaCharacter | 
                        't' | 'n' | 'r' | 'f' | 'd' | 'D' | 's' | 'S' | 'w' | 'W' | 
                        @OctDigit, @OctDigit, @OctDigit
                    )
                ) |
                Set <= (
                    PositiveSet <= ('[', @SetItems, ']') |
                    NegativeSet <= ('[^', @SetItems, ']')
                ) <~ (
                    SetItems <= (
                        SetItem <= (
                            Range <= (@Char, '-', @Char) |
                            @Char
                        ) |
                        @SetItem, @SetItems
                    )
                )
            )
        )
    )
)

It would work with some javascript back-end, but when I compared "union" to "set" in original Regexp definition, I concluded they are about the same thing, a choice of values detected at parse time. I didn't like this redundancy, so I decided to slightly change the definition of Regexp and to develop my own version which looks like this:

ChExp <= (
    Choice <= (@ConExp, '|', @ChExp) |
    ConExp <= (
        Concatenation <= (@WExp, @ConExp) |
        WExp <= (
            Without <= (QExp, '!', @WExp) |
            QExp <= (
                OneOrMore <= (@GExp, ('+?' | '+')) |
                ZeroOrMore <= (@GExp, ('*?' | '*')) |
                ZeroOrOne <= (@GExp, '?') |
                NumberedTimes <= (@GExp, '{', @Integer, '}') |
                GExp <= (
                    Group <= ('(', @ChExp, ')') |
                    Exp <= (
                        Any <= '.' |
                        Range <= (@Char, '-', @Char) |
                        Char <= (
                            @NonMetaCharacter |
                            '\\', (
                                @MetaCharacter |
                                't' | 'n' | 'r' | 'f' |
                                '0x', @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit
                            )
                        )
                    )
                )
            )
        )
    )
)

So, what is the difference? I completely removed Set notation (noted by [xyz]) because it can be replaced by "Union" notation (x|y|z). Initially, there was a thing missing in my new Regexp: negative set ([^xyz]). To support this I decided to introduce a new operator: "Without" (!), that parses expression to the left of '!' excluding expression to the right of '!'. So, '[^xyz]' now becomes '.!(x|y|z)'. What is interesting is that '!' operator allows to exclude a character sequence from parsing like in example '(.*)!(keyword1|keyword2)', in which we can parse a stream that is different from "keyword1" or "keyword2". Traditional Regexp supports only one character exclusion through negative set, so I observe a greater expressiveness of the new RegExp I designed. The new Regexp does not have redundant set-union colision, while its definition looks a bit cleaner than traditional Regexp.

I have to say, probably structured way of defining the grammar in my metalanguage saved me from the pitfalls which original authors of Regexp had in the seventies when they probably used unstructured plain BNF for defining Regexp.

So, what do you say? Is this new Regexp appearance worth of abandoning traditional standard that is widely spread across the globe? Do you have any sugestions on my modified solution? Traditional Regexp has been around for decades, do you think it is time for modernizing it?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Technical assistance

I wanted to use Javascript Regexp to parse arbitrary tokens when parsing with recursive descent code. But, there is a big but with Javascript and Regexp: it is not possible to efficiently check does Regexp match at specific string offset. Splicing string on every check (to check supported beginning of string) was not my option, so I decided to implement my own Regexp library.

I was recently optimizing a parser I had written in JavaScript, and the best change I made by far was to add a regex-based tokenizer. My tokenizer probably does exactly what you're trying to do! The code looks like this, and it just uses the standard lastIndex property:

var tokenRegex =
    /[ \t]+|[\r\n`=;',\.\\/()\[\]]|[^ \t\r\n`=;',\.\\/()\[\]]*/g;
tokenRegex.lastIndex = currentCodeIndex;
var token = regex.exec( codeString )[ 0 ];
currentCodeIndex += token.length;

The exec method looks at the lastIndex property and uses it as the starting point for the search. Since the regex I'm using has no possibility of failure, the match always begins precisely at the specified lastIndex, rather than proceeding to search for a match later on in the stream.

Depending on your token syntax, it might be tricky to write a regex that always succeeds like that, but someday soon it could be a little easier to write these. ECMAScript 6 adds a sticky flag, /y, which specifically makes it so a regex does not proceed to try later starting positions if it fails to match at lastIndex. Apparently only Firefox supports this at the moment.

What is interesting is that '!' operator allows to exclude a character sequence from parsing like in example '(.*)!(keyword1|keyword2)', in which we can parse a stream that is different from "keyword1" or "keyword2". Traditional Regexp supports only one character exclusion through negative set, so I observe a greater expressiveness of the new RegExp I designed.

Are you aware of negative lookahead (?!...)? The JavaScript regex /^(?!keyword1|keyword2).*$/ will exactly match the strings "foo" and "not keyword1" but will not match "keyword1" or "keyword2". In a regex, /[^abc]/ and /(?!a|b|c)./ have the same effect.

I'm not saying JavaScript regexes are without limits, but it looks like they might be a little more capable than you expect. Hopefully that's good news. :)

So, what do you say? Is this new Regexp appearance worth of abandoning traditional standard that is widely spread across the globe?

There's a variety of regex syntaxes out there for different languages, not to mention many different forms of EBNF for specifying entire language syntaxes. I think your alternate appearance will be in good company, which is perhaps a cheerier way of saying it won't stand out.

Regexes provide a suite of basic operators for combining regexes, like concatenation /AB/ and alternation /A|B/. In this way, they're a special case of the broader category of parser combinator libraries, so a Parsec clone like ReParse will probably serve as a nice regex substitute. If you still want to experiment, this could be a nice starting point for that experimentation; maybe your syntax could compile to use this underneath.

If you just want a parser that works already, a very popular option for parsers in JavaScript is PEG.js.

Tx for assistance

That /y flag would be of use, too bad it is not yet supported everywhere.

I thought of negative lookahead, but I wasn't aware that javascript supports it. Good to know, thanks.

So, out there is a great deal of different Regexps? Maybe my version doesn't even deserve a different name... Anyway, I'll stick to my version if speed performance lets me... I kind of like it...

PEG.js is cool and fast, but I can't use it because I want to mix syntax definitions and actual instances of data in the same file. I also want to support left recursion (by seed growing algorithm) .

left recursion

I have my doubts about left recursion in top down parsers.

I've read a few papers on it where the methods turned out to be impractical.

Though there was an early paper I'd like to see about left recursion using continuations, I have a weak spot for continuation and I find papers from the 70s easier to read than current ones anyway.

Re: left recursion

I've tested seed growing algorithm in Javascript. It works like a charm. I've even extended it to deal with CFG instead of PEG. Thank you guys from the web (Alessandro Warth, James R. Douglass, Todd Millstein)
:)

Not addressing a real problem?

[Edit.. sorry, just realised I didn't read that clearly enough, so my main point was obsolete. Now edited to make the seconary point the main point :-(]

So, '[^xyz]' now becomes '.!(x|y|z)'.

So ! is a negative look-behind assertion.

Set and inverted set are nice efficient syntactic sugars. No one wants to write (0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F) instead of [0-9A-F]. But the real problem with regexps is that they can't be composed. Perl6 has some nice answers to this. PEG>js looks nice, thanks Ross, I'll investigate further.

Not really syntactic sugar

These two forms are not really syntactic sugar for one another - they overlap in what they can do but there are differences.

For example: (x|y|zz) doesn't translate directly into character-sets. This is the basis for handling escaping sequences in regexp.

The real difference between them is that a character set is a fixed width construct, but the choice is being made over arbitrary regexp's (it is an operator). This implies that [^xyz] has a closed set of matches, while !(x|y|z) does not. For example, is "kk" in the set of matches for !(x|y|z) ?

Operator "!"

In proposed Regexp form, operator "!" would be read as "without" (or negative look behind, if you like), so "!(x|y|z)" isn't really independent expression. It depends on what's left to it. Only if we write ".*!(x|y|z)" then we can read it as ".*" without "(x|y|z)".

The dot was subtle

Ok, so it is a compound modifier. How does the matcher work - match the expression on the left, then check the match a second time against the excluded set?

Would something like [0-9]+!(12+) work?

Yes, that's the idea,

except I would write it like "0-9*!(12+)". I completely removed "[...]" notation from the syntax and introduced character range (like "A-Z") without brackets.

One thing about pattern matching is you always want binding.

I would look at the regexp above and wonder whether it matched any number that didn't start with '12' or any number that didn't contain the sequence '12', and go reading to see what the conventions were. But mostly I'd be reading to see whether the regexp language has variable binding and reference built in, because they are incredibly useful and inserting references to (or bindings of) them is the main reason IMO for square brackets.

One hardly ever uses a regexp unless picking a value out of a stream. And what one does with that value, at least half the time, is to bind it to a variable. So I think that a regexp sublanguage needs a way to specify variables to bind to the values it picks out of the stream.

I like being able to write something like

success=regmatch( (a-z)^5[varName=]((1-9)*) ,inputstring)

where "success" is a boolean, and if its value after this expression is #true then the expression matched 5 characters of lowercase alphabetic followed by an integer, and varName is now bound to the integer. Or even

success=regmatch( [val=]((1-9)*)_[val], inputstring )

to see if something matched a number, underscore, the same number.

I would write the definition

I would write the definition in multiline string:

def = `
AssignExp <= (
    VarName <= /A-z+/,
    op      <= ':=',
    Value   <= /1-9+/
)`;

and then i'd use the full blown parser:

abstractSyntaxTree = parser.parse(def, inputstring);

after which I'd get a parsed tree for analyzing.

re how does the matcher work

How does the matcher work - match the expression on the left, then check the match a second time against the excluded set?

If the discussion is limited to actual regular expressions as the terms for "!", then the operator is a set difference and the pattern "x!y" is always regular. Multiple passes might be desirable but aren't necessary.

I would actually like to

I would actually like to write (0-9|A-F)