User loginNavigation |
New RegexpWhile working on new programming metalanguage, I encountered a parsing problem. I wanted to use Javascript Regexp to parse arbitrary tokens when parsing with recursive descent code. But, there is a big but with Javascript and Regexp: it is not possible to efficiently check does Regexp match at specific string offset. Splicing string on every check (to check supported beginning of string) was not my option, so I decided to implement my own Regexp library. I followed some documentation and got this implementation in my metalanguage: RegExp <= ( Union <= (@SimpleRE, '|', @RegExp) | SimpleRE <= ( Concatenation <= (@BasicRE, @SimpleRE) | BasicRE <= ( OneOrMore <= (@ElementaryRE, ('+?' | '+')) | ZeroOrMore <= (@ElementaryRE, ('*?' | '*')) | ZeroOrOne <= (@ElementaryRE, '?') | NumberedTimes <= ( @ElementaryRE, '{', In <= ( Exactly <= @Integer | AtLeast <= (@Integer, ',') | AtLeastNotMore <= (@Integer, ',', @Integer) ), ('}?' | '}') ) | ElementaryRE <= ( Group <= ('(', @RegExp, ')' | Any <= '.' | Eos <= '$' | Bos <= '^' | Char <= ( @NonMetaCharacter | '\\', ( @MetaCharacter | 't' | 'n' | 'r' | 'f' | 'd' | 'D' | 's' | 'S' | 'w' | 'W' | @OctDigit, @OctDigit, @OctDigit ) ) | Set <= ( PositiveSet <= ('[', @SetItems, ']') | NegativeSet <= ('[^', @SetItems, ']') ) <~ ( SetItems <= ( SetItem <= ( Range <= (@Char, '-', @Char) | @Char ) | @SetItem, @SetItems ) ) ) ) ) ) It would work with some javascript back-end, but when I compared "union" to "set" in original Regexp definition, I concluded they are about the same thing, a choice of values detected at parse time. I didn't like this redundancy, so I decided to slightly change the definition of Regexp and to develop my own version which looks like this: ChExp <= ( Choice <= (@ConExp, '|', @ChExp) | ConExp <= ( Concatenation <= (@WExp, @ConExp) | WExp <= ( Without <= (QExp, '!', @WExp) | QExp <= ( OneOrMore <= (@GExp, ('+?' | '+')) | ZeroOrMore <= (@GExp, ('*?' | '*')) | ZeroOrOne <= (@GExp, '?') | NumberedTimes <= (@GExp, '{', @Integer, '}') | GExp <= ( Group <= ('(', @ChExp, ')') | Exp <= ( Any <= '.' | Range <= (@Char, '-', @Char) | Char <= ( @NonMetaCharacter | '\\', ( @MetaCharacter | 't' | 'n' | 'r' | 'f' | '0x', @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit, @HEXDigit ) ) ) ) ) ) ) ) So, what is the difference? I completely removed Set notation (noted by [xyz]) because it can be replaced by "Union" notation (x|y|z). Initially, there was a thing missing in my new Regexp: negative set ([^xyz]). To support this I decided to introduce a new operator: "Without" (!), that parses expression to the left of '!' excluding expression to the right of '!'. So, '[^xyz]' now becomes '.!(x|y|z)'. What is interesting is that '!' operator allows to exclude a character sequence from parsing like in example '(.*)!(keyword1|keyword2)', in which we can parse a stream that is different from "keyword1" or "keyword2". Traditional Regexp supports only one character exclusion through negative set, so I observe a greater expressiveness of the new RegExp I designed. The new Regexp does not have redundant set-union colision, while its definition looks a bit cleaner than traditional Regexp. I have to say, probably structured way of defining the grammar in my metalanguage saved me from the pitfalls which original authors of Regexp had in the seventies when they probably used unstructured plain BNF for defining Regexp. So, what do you say? Is this new Regexp appearance worth of abandoning traditional standard that is widely spread across the globe? Do you have any sugestions on my modified solution? Traditional Regexp has been around for decades, do you think it is time for modernizing it? By Ivan V. at 2016-02-16 19:03 | LtU Forum | previous forum topic | next forum topic | other blogs | 5382 reads
|
Browse archives
Active forum topics |
Recent comments
16 weeks 4 days ago
16 weeks 4 days ago
16 weeks 4 days ago
38 weeks 5 days ago
43 weeks 13 hours ago
44 weeks 4 days ago
44 weeks 4 days ago
47 weeks 2 days ago
51 weeks 6 days ago
52 weeks 1 hour ago