determining subsumption of regular languages

I recently came across the concept of regular expression types, such as in XDuce. The idea seems promising, but the current implementations all seem excessively restrictive (disallowing nontrivial patterns), so I was wondering about the feasibility of further development in this area. The bottleneck for regular types would definitely be in trying to determine subsumption of these types in an automated manner, but I can't seem to find much literature on the subject.

Computationally how hard is it to determine equivalence/subsumption of regular languages in general. Finite? Hard? Easy?

For at least some languages (a+ <: a*) this sort of comparison seems plausible. If this comparison isn't plausible in general, is there any way to foresee which languages are easy to compare/hard to compare?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

PSPACE-complete, I believe

NFAs (which capture regular languages perfectly) are trivially closed under all set operations (union, intersection, complementation). We also know how to convert NFAs to DFAs (NP-complete, so likely exponential time), and emptiness testing (whether the automaton accepts nothing) on DFAs is also trivially decidable: enumerate all the strings of length at most the number of states in the DFA and test for membership. That's atrocious in terms of performance... but, iirc, DFA emptiness-testing is PSPACE-complete, so the previous algorithm would be optimal.

In short, it's decidable, but not tractable.

Note: PSPACE is the worst-case performance...

I don't think, particularly for real-world XML grammars, you would necessarily hit the worst case very often -- that is, if you pick a more clever algorithm than the one you've proposed. ;)

DFA minimization

DFA minimization yields the unique DFA and runs in O(n3) (that's just the naive algorithm; faster ones exist). So if you wanted to see if Language(DFA1) is a subset of Language(DFA2), you could do: Minimize(DFA1 - DFA2), then check the result to see if it's the minimal empty DFA.

[EDIT: I just realized that going through DFA minimization to check for emptiness is kind of silly. What's wrong with just doing a graph reachability test (see if you can reach any of the DFA accept states from the start state)? That should be O(states + transitions).]

But like pkhuong said, getting the DFA is the hard part. Going regex to NFA to DFA isn't great because NFA to DFA is potentially exponential time (though in practice it'll probably work fine).

However, I think NFAs are slightly more "expressive" than regexes for the same size (somebody please correct me if I'm wrong here). So maybe there's a way to go from regex to DFA directly that is more efficient than going through an NFA?

Right, it's probably

Right, it's probably NFA-emptiness testing that's PSPACE-complete then. Either way, you can hit really bad cases, if only when converting NFAs to DFAs... Not that atrocious worst-case performance has ever stopped Hindley-Milner with let-polymorphism from being adopted.

re dfa minimization

However, I think NFAs are slightly more "expressive" than regexes for the same size (somebody please correct me if I'm wrong here). So maybe there's a way to go from regex to DFA directly that is more efficient than going through an NFA?

By "regex" I assume that you mean alphabet, kleene star, and alternation. In particular, we're leaving out some operators we don't need but that can shorten patterns like intersection or complement.

A minimal DFA from a regex can be exponentially larger than the regex.

A minimal regex from a DFA can be exponentially larger than the DFA.

In that sense, regexs are not more expressive than DFAs and DFAs are not more expressive than regexs - and the lower bound on conversion between them is exponential.

Every DFA of size K is trivially an NFA of size K.

There exist NFAs, therefore, whose corresponding minimal regex is exponentially larger.

Every regex trivially corresponds to an NFA of the same size.

Therefore, you are correct that, in that sense, NFAs are more expressive, for given every such regexp a same sized NFA can be produced - but given some NFAs, an exponentially larger regex is needed.

Suppose that we allow intersection and complement operators in our regexs:

Then there will exist regexs whose minimal NFA is exponentially larger.

I don't know (but it is an interesting puzzle for some rainy day) whether or not there exist DFAs whose corresponding minimal regexs using intersection and complement are exponentially larger. Supposing that there are, then none of the three forms is more expressive than the others. Supposing that there are not, then such regexs (with intersection and complement) are maximally expressive - which (conjecturing, now) seems rather unlikely.

In any event, a lower worst-case bound on direct regex to DFA conversion with or without intersection and complement is exponential so, no, you don't have any short-cut there that is more efficient (in that sense) than going through an NFA.

Funny enough, this came up on another forum...

I've implemented an algorithm from another XML-typesafe language (XOBE); I'm not sure what you mean by "disallowing nontrivial patterns" since it appears to cover all regular languages. Maybe that's a restriction from elsewhere in XDuce?

Consider a right-regular grammar, with nonterminals collapsed into or-patterns. If you're interested in deciding/proving that A >= B, then (this is a heavily paraphrased version of the algorithm):

     prove(A > B) ->
       if seen(A > B): True 
       else: add_to_seen(A > B); prove(rhs(A) > rhs(B))
     prove(xA > yB) -> prove_terminal(x > y) && prove(A > B)
     prove( A|B > C ) -> prove(A > C) || prove(B > C)
     prove( C > A|B ) -> prove(C > A) && prove(C > B)
     prove( [Empty] > [Empty] ) -> True
     prove( _ ) -> False

The trickiness is (only) in the handling of recursion. A more formal treatment (unfortunately only a google books preview) of the approach (via XOBE's citations):

(Edited to add: I'm not sure what the performance of this is, but it seems to work well enough for me. :) )

Nontrivial patterns

What I was referring to when I mentioned disallowing nontrivial patterns applies more to testing a single string for membership than it does to construction of such languages. What excited me about this topic was the prospect of parsing for free, but that is clearly not what these languages are about.

In XDuce or XOBE or XHaskell, constructing a regular type refers to constructing a type that describes the parse tree corresponding to a string in that language, not the string itself. Removing that limitation would certainly make static guarantees harder to come by, but such is the fate of any dependent typing scheme.

I'm not 100% sure what you're getting at...

Rather than thinking about parse trees, it might be more productive to think of "XML Types" as representing guarantees over the values that parser or generator code can accept or create (a transformer does both at once). Those guarantees can be applied by the type system regardless of how the value is represented internally, which may be a parse tree but could just as well be a stream of SAX-style events or a string literal.

What do you mean by "parsing for free"? It's easy to automatically derive a parser from a type signature: that's the status quo out in the 'real world' for dealing with SOAP web services, after all.