archives

Advancement in TDFA and POSIX submatch extraction

It came up in an old LTU thread about regular expressions. There was an argument whether tagged FA invented by Ville Laurikari can support POSIX disambiguation semantics. It turns out, they can: it is possible to construct efficient Laurikari TDFA with POSIX semantics, as well as with leftmost greedy semantics (details in this paper).

Back in 2007 Chris Kuklewicz suggested an algorithm which is implemented in his Regex-TDFA Haskell library. He also wrote an informal description of his algorithm, but never fully formalized it. Some ten years later I stumbled upon this thread when I was trying implement fast submatch extraction in the open source lexer generator re2c. I revised both the original algorithm by Laurikari and the modification by Kuklewicz and found a number of improvements and bugs in Regex-TDFA.