## What makes backreferences impossible in a DFA regular expressions evaluator?

Hello all,

I've been lurking for a while, but this is my first post, so I apologize if I'm off-topic. I have read in papers such as this one that matching regular expressions with a "Thompson's NFA" or DFA makes matching backreferences inefficient (the above paper says it's NP-complete). Since I've written a regex engine that uses the "Thompson's NFA" algorithm as I understand it, and it matches backreferences efficiently in most cases, I'm assuming I'm misunderstanding either Thompson's NFA or the use cases that can cause exponential runtime.

I've made the entire code available here, and I'm pasting the relevant code at the bottom of my post (it is short).

Why are backreferences considered impossible with a DFA regex engine?

BEGIN RELEVANT CODE

-- the compiler compiles a regex to a single MatchSequence
data RegexOp where
  MatchEndAnchor  :: RegexOp
  MatchChar       :: Char -> RegexOp
  MatchAny        :: RegexOp
  MatchSequence   :: [RegexOp] -> RegexOp
  MatchGreedy     :: Int -> Int -> RegexOp -> RegexOp -- min, max, operation
  MatchNonGreedy  :: Int -> Int -> RegexOp -> RegexOp
  MatchClass      :: [RegexOp] -> RegexOp -- used for character classes and (..|..)
  MatchNegative   :: RegexOp -> RegexOp
  MatchBackRef    :: Int -> RegexOp
  BeginCapture    :: RegexOp
  EndCapture      :: RegexOp
  deriving (Eq, Show)

...

-- a possible state, our runtime keeps a list of all possible states by using the List Monad
data ReState = ReState {
     stream          :: String
    ,current_groups  :: [String]
    ,groups          :: [String]
    ,current_match   :: String
    } deriving (Show, Eq)

...

-- start a new capture group
--
push_capture :: ReState -> ReState
push_capture state = state { current_groups = [] : current_groups state }

...

-- close the current capture and append it to the finished groups
--
pop_capture :: ReState -> ReState
pop_capture state = state { current_groups = tail (current_groups state)
                          , groups = groups state ++ [head (current_groups state)] }

...

-- dfa_mkop is our main execution function, it takes an op and a state
-- and returns a list of all possible states
--
dfa_mkop :: RegexOp -> ReState -> [ReState]

...

dfa_mkop (MatchBackRef refnum) state =
  let g = groups state
  in if length g < refnum
       then []
       else let gval = g !! (refnum - 1)
                st   = stream state
            in if length st < length gval
                 then []
                 else if take (length gval) st == gval
                        then [matched (length gval) state]
                        else []

dfa_mkop BeginCapture state = [push_capture state]
dfa_mkop EndCapture state = [pop_capture state]


### Looks hard to me!

Can you find ten strings, A-J, of binary digits that, when concatenated ABCDEFGHIJ, give,

110001110101000101001011010010010010110110110010101000111010001010101011010111010010100

and, when concatenated in reverse order, JIHGFEDCBA, give,

110100101000101010110101101000100101010001111100100100100101101101010001010010110111100

?

That should be easy to solve with backreferences - just label the parts A-J and then ask to find them in reverse order. I'm too lazy to see how long it takes your code to solve this. The example above has at least one solution. If for some reason this example is too easy to solve, I'd wager you can find an example in this class.
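One way to hand this puzzle to a backreference engine (my sketch, not from the post) is to join the two target strings with a separator, here assumed to be '#', capture ten groups against the first half, and demand them again in reverse order in the second half:

```haskell
-- Hypothetical encoding of the puzzle: ten capture groups A-J matched
-- against the first string, then required again in reverse via the
-- backreferences \10 through \1. The '#' separator is an assumption of
-- this sketch, not part of the original puzzle.
puzzlePattern :: String
puzzlePattern =
  "^" ++ concat (replicate 10 "(.*)")              -- capture A through J
      ++ "#"                                       -- separator between the halves
      ++ concatMap (\i -> "\\" ++ show i) [10,9..1] -- \10\9...\1 in reverse order
      ++ "$"

main :: IO ()
main = putStrLn puzzlePattern
```

Exploring the segmentations of the first string that make all ten backreferences line up in the second is exactly where the search space blows up.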

### No, I cannot. Too many

No, I cannot. Too many states are created at each step, and it fails in some cases just as backtracking does. It still performs efficiently most of the time, though.

If these pathological cases have not prevented backrefs from being implemented in backtracking engines such as PCRE or Perl, then why do the current DFA implementations not support them? The literature I've found gives the impression that backreferences are "too hard" computationally to implement in such engines. Is there something that makes them inherently more dangerous in a DFA implementation?

### The fact that it "performs

The fact that it "performs efficiently most of the time" does not negate the claim that the problem is NP-complete. A problem is NP-complete if there exists even one bad example that cannot be solved in polynomial time.

In general, I think computer scientists are unreasonably afraid of NP-complete algorithms. There are lots of examples of NP-complete algorithms that work very well and very usefully on real-world problems. The difficulty is that it is very hard to characterize which example inputs will turn out to be explosive.

Perhaps I should say: if you had a way to explain to the user: this input will explode and should be reconsidered, using an NP-complete approach would not be problematic. The real problem is that recognizing the bad cases is usually as hard as solving them.

### More precisely...

> A problem is NP-complete if there exists even one bad example that cannot be solved in polynomial time.

There has to be an infinite set of instances not solvable in polynomial time. Otherwise they could be recorded in a table and disposed of in the constant time it takes to look up the answers.

That said, Jonathan is right. We deal 'easily' with lots of formally intractable, or even insoluble, problems all the time. (Like theorem proving, halting, bin packing, Sudoku, ...)

Though it's almost never plausible to draw a narrow line around the set of intractable instances, it's often easy to construct a fence that has lots of interesting cases on the inside and keeps all the monsters at bay. HM is a good example of this. Lots of type inference problems are uncomputable or have exponential lower bounds (or worse!). HM can be made to run in pseudo-linear time (in some interesting cases) yet corresponds to an extremely rich type system.

### Recalling formal language theory

I am a bit rusty, but from what I recall, DFAs/NFAs can only express regular languages. By adding backreferences to regular expressions, you can express non-regular languages. Therefore a DFA can't express regexes with backreferences.

Wikipedia's article mentions a language called "square", consisting of a word repeated twice, or "(.*)\1" as a regex. That this language is not regular can be proved using the pumping lemma.

The pumping lemma states that any string in the language L of length at least p, a constant depending on the language, can be decomposed into three sections xyz, where y is not empty and |xy| ≤ p, such that every string xy*z (y repeated any number of times, including zero) is also in L.

If we take a string in the square language longer than p, there is no way to pump the y substring and stay within the square language: y lies entirely in the part matched by "(.*)" or in the part matched by the backreference, and changing the length of either half alone makes the two halves unequal, invalidating the backreference.
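As a concrete illustration (my sketch, not from the comment), the square language is trivial to decide directly, even though no finite automaton recognizes it:

```haskell
-- Decide membership in the "square" language { ww | w any string },
-- i.e. the language of the regex (.*)\1: split the input in half and
-- compare the halves. Pumping a nonempty y inside one half changes
-- that half's length but not the other's, so the result leaves the
-- language -- which is the heart of the non-regularity argument.
isSquare :: String -> Bool
isSquare s = even n && take h s == drop h s
  where
    n = length s
    h = n `div` 2

main :: IO ()
main = print (map isSquare ["abab", "abba", "aba", ""])
```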

I haven't read your code, but the pushing and popping of state makes me think you've actually implemented a push-down automaton, which can express context-free languages and can perform some kinds of backreference.

### Formal languages and

Formal languages and programming languages are not the same topic. Please make the thread relevant to PLT, or continue it elsewhere. Thanks.

### Eh?

This isn't relevant, but Google Chrome is? At least this post deals directly with an aspect of implementation necessary in many programming languages.

Chrome's a "tech adoption" story - and may well be a significant step in the practical evolution of Javascript. Granted this isn't PLT so much as PL practice, but it's more the end of practice that's historically been followed on LtU.

### This was a mixup of

This was a mixup of terminology and a lack of understanding of Thompson's NFA on my part (in my defense, I only read the above link after writing the code...).

While my code is technically a deterministic finite-state machine, I move the problem of calculating all the possible states into the runtime and generate them on the fly, so my runtime can take exponential time in the same cases where backtracking fails.
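A back-of-the-envelope way to see that blow-up (my sketch, not from the thread): for a pattern like (a|aa)* run in the List monad, every distinct way of segmenting a run of n 'a's into pieces of length 1 or 2 is a separate live state, and that count obeys the Fibonacci recurrence:

```haskell
-- Number of ways to segment n consecutive 'a's into pieces of length
-- 1 or 2 -- one live state per segmentation for a List-monad run of
-- (a|aa)* -- satisfies the Fibonacci recurrence, so it grows
-- exponentially in n.
stateCount :: Int -> Integer
stateCount 0 = 1
stateCount 1 = 1
stateCount n = stateCount (n - 1) + stateCount (n - 2)

main :: IO ()
main = print (map stateCount [0 .. 10])
```

A Thompson-style simulation avoids this by merging states that agree on position alone, but capture groups and backreferences make otherwise-identical positions distinguishable, so the merge no longer applies.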

In reply to Ehud, I don't see how formal languages can not be on topic for PLT, but I will restrict my posts in the future to make sure that there is a definite PLT theme. Sorry.

### In reply to Ehud, I don't

> In reply to Ehud, I don't see how formal languages can not be on topic for PLT, but I will restrict my posts in the future to make sure that there is a definite PLT theme. Sorry.

Roughly speaking, "implementation" syntax issues aren't PLT except insofar as noteworthily horrific ones (C++, I'm looking at you) can make various things difficult. Issues that're still present in abstract syntax are different, and ones that