archives

New Regexp

While working on new programming metalanguage, I encountered a parsing problem. I wanted to use Javascript Regexp to parse arbitrary tokens when parsing with recursive descent code. But, there is a big but with Javascript and Regexp: it is not possible to efficiently check does Regexp match at specific string offset. Splicing string on every check (to check supported beginning of string) was not my option, so I decided to implement my own Regexp library. I followed some documentation and got this implementation in my metalanguage:

RegExp <= (
    Union <= (@SimpleRE, '|', @RegExp) |
    SimpleRE <= (
        Concatenation <= (@BasicRE, @SimpleRE) |
        BasicRE <= (
            OneOrMore <= (@ElementaryRE, ('+?' | '+')) |
            ZeroOrMore <= (@ElementaryRE, ('*?' | '*')) |
            ZeroOrOne <= (@ElementaryRE, '?') |
            NumberedTimes <= (
                @ElementaryRE,
                '{',
                In <= (
                    Exactly <= @Integer |
                    AtLeast <= (@Integer, ',') |
                    AtLeastNotMore <= (@Integer, ',', @Integer)
                ),
                ('}?' | '}')
            ) |
            ElementaryRE <= (
                Group <= ('(', @RegExp, ')' |
                Any <= '.' |
                Eos <= '$' |
                Bos <= '^' |
                Char <= (
                    @NonMetaCharacter |
                    '\\', (
                        @MetaCharacter | 
                        't' | 'n' | 'r' | 'f' | 'd' | 'D' | 's' | 'S' | 'w' | 'W' | 
                        @OctDigit, @OctDigit, @OctDigit
                    )
                ) |
                Set <= (
                    PositiveSet <= ('[', @SetItems, ']') |
                    NegativeSet <= ('[^', @SetItems, ']')
                ) <~ (
                    SetItems <= (
                        SetItem <= (
                            Range <= (@Char, '-', @Char) |
                            @Char
                        ) |
                        @SetItem, @SetItems
                    )
                )
            )
        )
    )
)

It would work with some javascript back-end, but when I compared "union" to "set" in original Regexp definition, I concluded they are about the same thing, a choice of values detected at parse time. I didn't like this redundancy, so I decided to slightly change the definition of Regexp and to develop my own version which looks like this:

ChExp <= (
    Choice <= (@ConExp, '|', @ChExp) |
    ConExp <= (
        Concatenation <= (@WExp, @ConExp) |
        WExp <= (
            Without <= (QExp, '!', @WExp) |
            QExp <= (
                OneOrMore <= (@GExp, ('+?' | '+')) |
                ZeroOrMore <= (@GExp, ('*?' | '*')) |
                ZeroOrOne <= (@GExp, '?') |
                NumberedTimes <= (@GExp, '{', @Integer, '}') |
                GExp <= (
                    Group <= ('(', @ChExp, ')') |
                    Exp <= (
                        Any <= '.' |
                        Range <= (@Char, '-', @Char) |
                        Char <= (
                            @NonMetaCharacter |
                            '\\', (
                                @MetaCharacter |
                                't' | 'n' | 'r' | 'f' |
                                '0x', @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit
                            )
                        )
                    )
                )
            )
        )
    )
)

So, what is the difference? I completely removed Set notation (noted by [xyz]) because it can be replaced by "Union" notation (x|y|z). Initially, there was a thing missing in my new Regexp: negative set ([^xyz]). To support this I decided to introduce a new operator: "Without" (!), that parses expression to the left of '!' excluding expression to the right of '!'. So, '[^xyz]' now becomes '.!(x|y|z)'. What is interesting is that '!' operator allows to exclude a character sequence from parsing like in example '(.*)!(keyword1|keyword2)', in which we can parse a stream that is different from "keyword1" or "keyword2". Traditional Regexp supports only one character exclusion through negative set, so I observe a greater expressiveness of the new RegExp I designed. The new Regexp does not have redundant set-union colision, while its definition looks a bit cleaner than traditional Regexp.

I have to say, probably structured way of defining the grammar in my metalanguage saved me from the pitfalls which original authors of Regexp had in the seventies when they probably used unstructured plain BNF for defining Regexp.

So, what do you say? Is this new Regexp appearance worth of abandoning traditional standard that is widely spread across the globe? Do you have any sugestions on my modified solution? Traditional Regexp has been around for decades, do you think it is time for modernizing it?