## CFL parsing, and another way to look at the CNF...

I'll implement it and try to break it when I get a chance, but it seems the below should exhibit O(|G| |U|^2 n^2 log n) worst-case behavior (in time), where |G| is the total size of the grammar, |U| is the number of distinct nonterminals appearing in unary productions U -> w of G, and n is the input's length.

Or, what am I missing?

We consider only unambiguous CFGs in Chomsky Normal Form (CNF).

We devise a sort of "type system" over rewritings of the input as we process it from right to left, from end to beginning.

We'll look at the algorithm's work environment as being made of a left context, say Lc, a current symbol, say w (talking about a terminal) and a right context, say Rc.

The left context (Lc) is always a (length-decreasing*) prefix string w0 ... wi ... w of terminals of the actual input (equal to the entire input s = w0 ... wn-1 initially, and empty at the end, when no more production-type rewritings can be applied).

(* albeit, non-strictly)

The right context (Rc) is a mutating list of computed S-expressions storing what has been consumed and "typed" by reading the input from right to left, from end to beginning.

In this scheme, it is constructed so as to also be the recorded value of the rightmost derivative of the parse tree at the "point" Lc \ w, before Lc gets empty and reduction to the grammar start symbol (if possible) happens, eventually finalizing Rc as the single S-expr (S w0 ... wn-1) (or failing to do so).

Initially, Rc is empty, and Lc equals the entire input.

In the CNF grammar, we treat the two distinct types of productions thusly:

Productions of the form U -> w (aka rule R1)

Rule R1: U -> w  will be said to "type" w between Lc and Rc as... w: U ("the utterance of w is of type U between any Lc and Rc")


that is, whenever Lc ends with w

rewrite...

Rewriting via R1

Lc Rc ~> (Lc \ w) (U w) Rc
 new Lc> \------/ \------/ <new Rc


(takes a list of S-exprs as Rc and prepends the new S-expr (U w) to it, making a new Rc longer by one S-expr, while Lc gets shorter by one terminal, w)

Productions of the form U -> P R (aka rule R2)

Rule R2: U -> P R  will be said to "type" P followed by R in Rc as... P: R -> U ("the utterance of P applied to Rc typed as R promotes Rc to U")


that is, whenever head(Rc) = (P x) and head(tail(Rc)) = (R y)

rewrite...

Rewriting via R2

Lc Rc ~> Lc (U (P x) (R y)) tail(tail(Rc))
            \----------------------------/ <new Rc


(takes a list of S-exprs as Rc and turns it into a new Rc shorter by one S-expr, Lc left unchanged)

Notice, btw, the one-to-one correspondence between mutually ambiguous grammar productions and mutually unsound production types:

E.g., having, in the grammar, the mutually ambiguous

A -> BC
D -> BC


would yield, in this "type system", attempts at applying the mutually unsound (if only signature-wise)

B: C -> A
B: C -> D


Which would obviously be intrinsically problematic from either / both perspectives, no matter what the input may be, as soon as said input contains an utterance of (a substring already reduced to) B, followed by (a substring already reduced to) C.

Anyway.

Now for a concrete example (full parse / derivation); consider the grammar in CNF (informal meaning given in parentheses):

(nonterminal-only RHS)

S  -> NP  VP  ("a noun phrase followed by a verb phrase reduces as a sentence")
VP -> Vt  NP  ("a transitive verb followed by a noun phrase reduces as a verb phrase")
NP -> Det N   ("a determinative followed by a noun reduces as a noun phrase")
N  -> Adj N   ("an adjective followed by a noun reduces as a noun")


and

(terminal-only RHS)

Det -> a      ("'a' reduces as a determinative")
Det -> the    ("'the' reduces as a determinative")
Adj -> young  ("'young' reduces as an adjective")
N   -> boy    ("'boy' reduces as a noun")
N   -> dragon ("'dragon' reduces as a noun")
Vt  -> saw    ("'saw' reduces as a transitive verb")


Which we'll use to "type" thru rewritings of S-exprs in Rc, as:

NP : VP -> S  ("the utterance of 'NP' applied to Rc typed as 'VP' promotes Rc to 'S'")
Vt : NP -> VP ("the utterance of 'Vt' applied to Rc typed as 'NP' promotes Rc to 'VP'")
Det: N  -> NP ("the utterance of 'Det' applied to Rc typed as 'N' promotes Rc to 'NP'")
Adj: N  -> N  ("the utterance of 'Adj' applied to Rc typed as 'N' promotes Rc to 'N'")


and

a     : Det   ("the utterance of 'a' is of type 'Det' between any Lc and Rc")
the   : Det   ("the utterance of 'the' is of type 'Det' between any Lc and Rc")
young : Adj   ("the utterance of 'young' is of type 'Adj' between any Lc and Rc")
boy   : N     ("the utterance of 'boy' is of type 'N' between any Lc and Rc")
dragon: N     ("the utterance of 'dragon' is of type 'N' between any Lc and Rc")
saw   : Vt    ("the utterance of 'saw' is of type 'Vt' between any Lc and Rc")


Here is the full derivation, one line per iteration (Lc on the left; Rc, a list of root S-expressions, on the right):

Iter  Lc                              Rc

  0   ( the young boy saw a dragon )  ( )  init. (Rc = empty list)
  1   ( the young boy saw a )         ( N dragon )  via R1
  2   ( the young boy saw )           ( Det a ) ( N dragon )  via R1
  3   ( the young boy saw )           ( NP ( Det a ) ( N dragon ) )  via R2
  4   ( the young boy )               ( Vt saw ) ( NP ( Det a ) ( N dragon ) )  via R1
  5   ( the young boy )               ( VP ( Vt saw ) ( NP ( Det a ) ( N dragon ) ) )  via R2
  6   ( the young )                   ( N boy ) ( VP ( Vt saw ) ( NP ( Det a ) ( N dragon ) ) )  via R1
  7   ( the )                         ( Adj young ) ( N boy ) ( VP ( Vt saw ) ( NP ( Det a ) ( N dragon ) ) )  via R1
  8   ( the )                         ( N ( Adj young ) ( N boy ) ) ( VP ( Vt saw ) ( NP ( Det a ) ( N dragon ) ) )  via R2
  9   ( )                             ( Det the ) ( N ( Adj young ) ( N boy ) ) ( VP ( Vt saw ) ( NP ( Det a ) ( N dragon ) ) )  via R1 (Lc now empty)
 10   ( )                             ( NP ( Det the ) ( N ( Adj young ) ( N boy ) ) ) ( VP ( Vt saw ) ( NP ( Det a ) ( N dragon ) ) )  via R2
 11   ( )                             ( S ( NP ( Det the ) ( N ( Adj young ) ( N boy ) ) ) ( VP ( Vt saw ) ( NP ( Det a ) ( N dragon ) ) ) )  via R2


Success, with the expected parse binary tree:

the     young   boy     saw     a       dragon
 |        |      |       |      |         |
Det      Adj     N       Vt    Det        N
        \---------/            \--------------/
            N                        NP
\-----------------/     \--------------------/
        NP                      VP
\--------------------------------------------/
                      S
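For the curious, the whole rewriting loop above is small enough to execute mechanically. Below is a minimal Python sketch of it, under assumptions of mine (not anyone's actual implementation): the deterministic case only, i.e. at most one production per terminal and per pair of nonterminals, so plain dicts suffice to encode the grammar, and the eager strategy visible in the trace, i.e. apply R2 whenever the two leading S-exprs of Rc allow it, otherwise apply R1 to the last terminal of Lc. The name `parse_rl` and the dict encoding are just illustrative choices.

```python
def parse_rl(unary, binary, tokens, start):
    """R1/R2 rewriting, right to left.
    unary:  dict terminal w -> nonterminal U   (rules U -> w)
    binary: dict (P, R)    -> nonterminal U    (rules U -> P R)
    Returns the final S-expression as nested tuples, or None on failure."""
    lc, rc = list(tokens), []              # left context, right context
    while True:
        if len(rc) >= 2 and (rc[0][0], rc[1][0]) in binary:
            # R2: fuse the two leading S-exprs (P x) (R y) into (U (P x) (R y)).
            rc[:2] = [(binary[(rc[0][0], rc[1][0])], rc[0], rc[1])]
        elif lc and lc[-1] in unary:
            # R1: type the last terminal w of Lc, prepend (U w) to Rc.
            w = lc.pop()
            rc.insert(0, (unary[w], w))
        else:
            break                          # neither rule applies: stop
    # Accept iff Lc is empty and Rc is one S-expr rooted at the start symbol.
    return rc[0] if not lc and len(rc) == 1 and rc[0][0] == start else None

# The toy English grammar of the worked example.
unary  = {'the': 'Det', 'a': 'Det', 'young': 'Adj',
          'boy': 'N', 'dragon': 'N', 'saw': 'Vt'}
binary = {('NP', 'VP'): 'S', ('Vt', 'NP'): 'VP',
          ('Det', 'N'): 'NP', ('Adj', 'N'): 'N'}

tree = parse_rl(unary, binary, 'the young boy saw a dragon'.split(), 'S')
print(tree)
# -> ('S', ('NP', ('Det', 'the'), ('N', ('Adj', 'young'), ('N', 'boy'))),
#         ('VP', ('Vt', 'saw'), ('NP', ('Det', 'a'), ('N', 'dragon'))))
```

Under those assumptions, the printed tree is exactly the final Rc of the derivation above.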


Disclaimer edit:

this is just genuine amateur research from a practitioner who's never been in academia (that would be me). I've obviously posted it here on a friendly LtU only because I'm (still) wondering if I'm on to something / or what am I missing. Bear with me (and thank you in advance).

### Just FTR

Just for the record wrt. the connections: I've been familiar with CNF and the CYK algorithm since circa 1993, and I've found both interesting for their theoretical generality from the beginning, of course, all practical matters left aside.

More recently, I've been playing with Lisp-like language interpreters, followed by H-M type inference over Lisp-ish S-expressions.

But re: the latter, it's Matt M's remark (*) on cases of unsoundness leaks in a tangent topic that got me intrigued again by the curious (deep?) similarity (so to speak) between parsing ambiguity arising from our choices of grammar design, and unsound or undecidable type inference arising from our choices of type system design.

Critiques / pointers welcome, if this algorithm turns out to be incomplete (fails to answer correctly for some grammars and inputs).

(*) credit where it's due

### Parsing and type checking

I find combining parsing and type checking interesting, as you can have a typed stack for type-safe parsing; however, the downside is that the algorithms for parsing and type checking become conflated, increasing the complexity of the code in a single module. There is something to be said for separating concerns and keeping parsing and type inference/checking as separate passes.

### Thank you for your comment.

On this, though:

keeping parsing and type inference/checking as separate passes

sorry for the confusion, maybe I should have put it more explicitly in my post. It's actually not about mixing parsing and type inference in the usual sense of both. It's more about leveraging what is known about strong normalization in the latter towards the former.

It's only about a CFL parsing objective in the post, and only for grammars in CNF, specifically. There's nothing about type inference going on further, semantically / interpretatively speaking (eg, that would in turn be used to guide some subsequent parse tree transformation, eg, for code generation, or, etc)

But true, at the core, the (admittedly, strange?) curiosity I had is to explore the applicability of the strong normalization property of the β-reduction rule found in the STLC (and the likes) to the a priori unrelated problem of CFL parsing, thru this S-expr rewriting scheme I've described.

The latter sort of / so to speak "type-annotates" a reification (thru mutating S-exprs after Lc) of the ongoing parsing progress itself, as it also carries out its traditional primary operation -- parse tree construction.

Hence, the genuine question, btw:

is it known whether (or not) the parsing of sentences that belong to a language described by an (unambiguous) CFG is in fact computably equivalent(?) or nearly, to finding the principal type of a well-formed term in a type system having the principal type property?

### shallow transformations

There is something to be said for separating concerns and keeping parsing and type inference/checking as separate passes.

I think you can accomplish this via shallow transformations. That is, you parse and type check totally separately, but interleave them level-by-level, node-by-node as you traverse the syntax tree. This means you actually have three separate concerns: 1) parsing 2) type checking and 3) AST traversal.

Assume you're parsing separately from reading, so you've already got some s-expr in memory. For example, you may parse (if b (f x) (g y)) into something like (if-node (symbol "b") (list (symbol "f") (symbol "x") ... without recursively parsing the arguments. Then a full parse is defined by (traverse shallow-parse sexpr). Then you can also do something like (traverse (compose shallow-check shallow-parse) sexpr). The tricky part, of course, is that you need to thread context through, and the traversal needs to be able to change the shape of the result, etc.

### Conflated?

Why do they become conflated?

In my (as yet unimplemented) design the type checking is static and done entirely by the compiler (i.e. the host language type checker). Instead, the burden is on the grammar writer to get the types right. Obviously, one has to actually compile such a parser.

Basically the parser at present uses an untyped parser stack, and AST nodes are essentially S-expressions. If instead we switch to a typed stack, the stack type is determined by the compiler from the action codes in the productions.

I have had this working in a weak way but the type combinators were not representable. Since then I have added GADTs to the language so that typed pushes and pops from the parser stack are representable.

[I have some extra tricky stuff to handle since the data types associated with a symbol are not the attribute type, but channels which transmit that attribute. So when generating AST terms, I'm dealing with a parser stack of typed channels which have synchronisation requirements above and beyond a standard spatial product.]

### Conflated

By conflated I mean you cannot write one algorithm without knowledge of how the other one works. In a clean multi-pass system the parser outputs an AST of untyped nodes (and type annotations), and the type inference/checker runs in a separate pass. No knowledge of the parser's node generation order is needed to write the checker, and no knowledge of the checker is needed to write the parser. They can be written by different people and take different approaches to order, parallelism, etc. The only commonality is the agreed AST format. You can even completely change the parsing or checking algorithm independently with no problems.

It should be clear that any attempt to type the stack requires the parser to check as it goes along, which precludes separating the passes, and prevents the independent selection of algorithms.

### Ambiguities

Are you sure you don't want to deal with ambiguities? Natural language processing is full of them.

Can this parser parse the sentence "I shot an elephant last night in my pajamas."? (Who is wearing the pajamas?)

Unfortunately, dealing with ambiguities puts parsing time on a logarithmic scale.

### Structure

This is only non-deterministic to parse in a logic that does not allow for probabilities. If you abandon the black and white of binary logic, you can construct a graph representing the situation. The 'I' object is linked to the 'elephant' object by a 'shot' relationship (in the direction "I ---shot--> elephant") with a probability of 100%. 'pajamas' is linked to both 'I' and 'elephant' with a 'worn by' relation, with a probability of, say, 50% each. At this point the weighting can be adjusted based on prior experience, say, for example, the number of times this sentence structure has been used and subsequent information has resolved the bifurcation one way or the other (let's guess 80% 'I', 20% 'elephant').

In any case, there is not necessarily any non-determinism here; there need not be different parse orders, nor multiple outputs of complete parses. This only occurs if you view the output with a binary understanding of truth.

### Sorry, no, I'm not interested

Sorry, no, I'm not interested in dealing with ambiguous grammars. Or parse forests.

More humbly, I'm just exploring how far can we go with non-ambiguous CFGs and for hopefully cheaper in time / space complexity (of parsing).

### This algorithm always terminates, is sound, and complete

This algorithm always terminates, is sound, and complete at least for those inputs that have a unique parse tree.

Termination

For a finite grammar (basic assumption), the algorithm's R2 rewriting / reduction rule can only try finitely many productions before finding an applicable one, or none. Failing that, the R1 rewriting / reduction rule always consumes exactly one terminal, the one that ends Lc, so Lc strictly shrinks. If neither rule applies, the algorithm stops and fails.

Soundness

By the very bottom-up construction of the nested S-exprs from the nonterminals appearing on the left-hand side of the productions (starting with U -> w reductions whenever R1 is applied).

Completeness (at least for single-parse tree sentences)

(left to do)

### A non-LR(1) example

Consider the language,

L = { cabxy, cabxz }

described by the CFG (G1)

S -> Q
Q -> A x y
Q -> B x z
A -> C a b
B -> C a b
C -> c

(discussed in [Gayler Harford, Heuring, and Main, 1992])

Grammar (G1) isn't ambiguous as a description of L, but it just isn't LR(1) -- as we need more input look-ahead to find the parse trees:

c      a      b      x      y
|
C
\-------------/
       A
\---------------------------/
              Q
              |
              S


and

c      a      b      x      z
|
C
\-------------/
       B
\---------------------------/
              Q
              |
              S


It's fair to ask why would one write (G1) like this in the first place, if it wasn't by accident.

Well, for instance, one may want to discriminate the left context's "need" to reduce to A vs. B, but not until we're ready to reduce to Q after consuming y or z.

We can try to rewrite (G1) into (G2) (in CNF), and allow for the same sort of discrimination (below, thru U vs. V), as a desired "constraint" on the parse trees, in the "neighborhood" (so to speak) of X -> x :

(G2)

S -> H Q
H -> C A
Q -> B U
Q -> B V
U -> X Y
V -> X Z
C -> c
A -> a
B -> b
X -> x
Y -> y
Z -> z


which yields:

c      a      b      x      y
|      |      |      |      |
C      A      B      X      Y
\------/             \------/
   H                     U
              \-------------/
                     Q
\---------------------------/
              S


and

c      a      b      x      z
|      |      |      |      |
C      A      B      X      Z
\------/             \------/
   H                     V
              \-------------/
                     Q
\---------------------------/
              S


So, we want to test our right-to-left, bottom-up parse algorithm, in the familiar "left-to-right-plus-look ahead" parsing setup / use case:

        c       a       b       x       y

Lc>(c       a       b       x       y) ( )<Rc (init)
(empty)

Lc>(c       a       b       x)    (Y y)<Rc (via R1)

Lc>(c       a       b)    (X x)  (Y y)<Rc (via R1)

Lc>(c       a       b)  (U(X x)  (Y y))<Rc (via R2)

Lc>(c       a)    (B b) (U(X x)  (Y y))<Rc (via R1)

Lc>(c       a)  (Q(B b) (U(X x)  (Y y)))<Rc (via R2)

Lc>(c)    (A a) (Q(B b) (U(X x)  (Y y)))<Rc (via R1)

Lc>( ) (C c)  (A a) (Q(B b) (U(X x)  (Y y)))<Rc (via R1)
(empty)

Lc>( ) (H(C c) (A a)) (Q(B b) (U(X x)  (Y y)))<Rc (via R2)
(empty)

Lc>( ) (S (H(C c) (A a)) (Q(B b) (U(X x)  (Y y))))<Rc (via R2)
(empty)


and likewise, for input = c a b x z, to yield instead:

Lc>( ) (S (H(C c) (A a)) (Q(B b) (V(X x)  (Z z))))<Rc
(empty)
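As a sanity check, here's a minimal Python sketch of the R1/R2 loop run on (G2) for both inputs. Assumptions of mine: deterministic dict-encoded rules (at most one production per terminal and per nonterminal pair) and the eager apply-R2-first strategy inferred from the traces; the name `parse_rl` is just an illustrative choice, not anyone's actual implementation.

```python
def parse_rl(unary, binary, tokens, start):
    """R1/R2 rewriting, right to left (S-exprs as nested tuples)."""
    lc, rc = list(tokens), []
    while True:
        if len(rc) >= 2 and (rc[0][0], rc[1][0]) in binary:
            # R2: (P x) (R y) ... ~> (U (P x) (R y)) ...
            rc[:2] = [(binary[(rc[0][0], rc[1][0])], rc[0], rc[1])]
        elif lc and lc[-1] in unary:
            # R1: consume the last terminal of Lc.
            w = lc.pop()
            rc.insert(0, (unary[w], w))
        else:
            break
    return rc[0] if not lc and len(rc) == 1 and rc[0][0] == start else None

# (G2), encoded as dicts.
unary  = {'c': 'C', 'a': 'A', 'b': 'B', 'x': 'X', 'y': 'Y', 'z': 'Z'}
binary = {('C', 'A'): 'H', ('B', 'U'): 'Q', ('B', 'V'): 'Q',
          ('X', 'Y'): 'U', ('X', 'Z'): 'V', ('H', 'Q'): 'S'}

print(parse_rl(unary, binary, list('cabxy'), 'S'))
print(parse_rl(unary, binary, list('cabxz'), 'S'))
```

Under those assumptions, both inputs reduce to a single S-rooted S-expr matching the derivations above.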


Reference which discusses the (G1) grammar above:

Gayler Harford, A., Heuring, V.P. and Main, M.G., 1992. A New Parsing Method for Non-LR(1) Grammars.

### Reinventing RL(0) parsing

As far as I can tell, you've reinvented RL(0) parsing. If you try the reverse grammar, you should see why it doesn't always work.

### RL(0)?

I don't know about RL(0) parsing and I don't see what you mean by "reverse" the grammar? Can you provide an example in CNF?

### Knuth introduced it

Knuth introduced it in the first paper on LR parsing from 1965, "On the translation of languages from left to right." (I cannot over-emphasize the importance of reading original sources.) It's defined symmetrically to LR(0); you can think of it as LR(0) on the reversed grammar and input. The "reversed" grammar just means reverse the symbols on the right hand side of every production rule. E.g., the reverse of (G2) is:
S -> Q H
H -> A C
Q -> U B
Q -> V B
U -> Y X
V -> Z X
with the same rules for A, B, C, X, Y, Z as before. It describes the reversed language, { yxbac, zxbac }. This should be the hard case for your method as it's not RL(0) or even RL(1), although it's LR(0).

### Ah ok, thanks

Ah ok, thanks for the pointer. Now I see. So you were alluding to what could possibly be a right-to-left vs. left-to-right input processing sensitivity of the method.

Fair concern, that needs to be checked, yes.

Given,

yxbac

can we parse it against,

(G3)

S -> Q H
H -> A C
Q -> U B
Q -> V B
U -> Y X
V -> Z X
C -> c
A -> a
B -> b
X -> x
Y -> y
Z -> z


with the same algorithm?

With Lc = all of the input s, and Rc empty initially;

R1: whenever Lc ends in w, and there is U -> w in the grammar, rewrite:

Lc , Rc ~> Lc \ w     , ( U w )  Rc
(shorter by w)^                  ^(same)

R2: whenever Rc = ( P x ) ( R y ) ... and there is U -> P R in the grammar, rewrite:

Lc , Rc ~> Lc , ( U ( P x ) ( R y ) ) ...
(same)^                          ^tail(tail(Rc))

Stop when either Lc is empty and/or neither of R1 or R2 is applicable;

s is a word of the language iff Rc = ( S ... ) where S is the grammar start symbol.


Here is the full derivation:

Lc            Rc

( y x b a c ) ( ) init.
( y x b a )   ( C c ) via R1
( y x b )     ( A a ) ( C c ) via R1
( y x b )     ( H ( A a ) ( C c ) ) via R2
( y x )       ( B b ) ( H ( A a ) ( C c ) ) via R1
( y )         ( X x ) ( B b ) ( H ( A a ) ( C c ) ) via R1
( )           ( Y y ) ( X x ) ( B b ) ( H ( A a ) ( C c ) ) via R1
( )           ( U ( Y y ) ( X x ) ) ( B b ) ( H ( A a ) ( C c ) ) via R2
( )           ( Q ( U ( Y y ) ( X x ) ) ( B b ) ) ( H ( A a ) ( C c ) ) via R2
( )           ( S ( Q ( U ( Y y ) ( X x ) ) ( B b ) ) ( H ( A a ) ( C c ) ) ) via R2


Success, yxbac is a word of the language generated by (G3), as we have a final Rc representation of the parse tree, with S at the root:

y      x      b      a      c
|      |      |      |      |
Y      X      B      A      C
\------/             \------/
   U                     H
\-------------/
       Q
\---------------------------/
              S


### My mistake

My mistake. I didn't bother to check that your transformation of G1 to CNF preserved the property of not being LR(1); I think G2 is actually LR(0) so that G3 is RL(0). Try G4:
S -> Q
Q -> Y P
Q -> Z R
P -> X A
R -> X B
A -> a
B -> a
X -> x
Y -> y
Z -> z
which generates "yxa" and "zxa". I made sure this one isn't RL(0).

### Not as-is

Granted, your grammar isn't usable as-is for my algorithm if only because of productions like these:

A -> a

B -> a


(I was already aware of that, at least.)

But we can rewrite it as this grammar (arguably simpler, too), hopefully in the same LL / LR / RL classes (unsure?):

S -> Q
Q -> Y P
Q -> Z P
P -> X A
A -> a
X -> x
Y -> y
Z -> z


which generates the same language, and still without ambiguity --

both points obviously desirable (as I had no interest in parsing against inherently ambiguous grammars or languages anyway).

Then,

Lc        Rc

( y x a ) ( ) init.
( y x )   ( A a ) via R1
( y )     ( X x ) ( A a ) via R1
( y )     ( P ( X x ) ( A a ) ) via R2
( )       ( Y y ) ( P ( X x ) ( A a ) ) via R1
( )       ( Q ( Y y ) ( P ( X x ) ( A a ) ) ) via R2
( )       ( S ( Q ( Y y ) ( P ( X x ) ( A a ) ) ) ) via R2

(Success)


and

Lc        Rc

( z x a ) ( ) init.
( z x )   ( A a ) via R1
( z )     ( X x ) ( A a ) via R1
( z )     ( P ( X x ) ( A a ) ) via R2
( )       ( Z z ) ( P ( X x ) ( A a ) ) via R1
( )       ( Q ( Z z ) ( P ( X x ) ( A a ) ) ) via R2
( )       ( S ( Q ( Z z ) ( P ( X x ) ( A a ) ) ) ) via R2

(Success)


I really should take the time to write a minimal implementation finally, instead of executing this by hand.

Thanks for stressing it, though.

### Right. I mainly wanted to

Right. I mainly wanted to point out that you're on well trodden ground here. Your earlier posts made it sound like you were claiming to have a linear algorithm capable of parsing arbitrary unambiguous CFGs in CNF. In fact, what you have is something strictly weaker than RL(0), I think (because the state stack provides more opportunities to determine which of two applicable reductions is the right one). This is a perfectly respectable approach, but you should expect to run into all of the same issues that deterministic left to right parsers run into, just in their symmetric form. In particular, determining whether a given "naturally" expressed CFG is unambiguous and can be refactored somehow to the class of grammars accepted by your parser is in general a far from trivial problem.

### Be reassured

I mainly wanted to point out that you're on well trodden ground here. Your earlier posts made it sound like you were claiming to have a linear algorithm capable of parsing arbitrary unambiguous CFGs in CNF.

Be reassured, I took all your comments here that way. And I did make a hasty claim that I know by now is false. Thanks to Andrew, who was quick to make me think harder. And you pointed me to good, relevant knowledge I must keep in mind, so thank you.

What I still don't know is if I put the finger on something new anywhere between (excluded) linear time and cubic time.

But, code is fun anyway :)

### Language is LR(1).

The example is somewhat synthetic as the grammar is a pathological case of describing the language. The language itself is LL(0), although the grammar is artificially constructed to make a non-LR(1) representation of it.

How would your algorithm handle cleaner cases: for example $$L = \{ x^i \cdot y^j \mid i > j \}$$, where the language itself is LR, rather than LL.

### Another great point

And another great point you make, thank you for that.

Well, as you can see, James is stressing it against the RL cases, while so far I had only played with a few LL / non-LL / LR / non-LR toy grammars. But I must reckon I haven't done much checking yet on the intrinsic properties of the languages themselves.

If you have some examples readily expressed in CNF and parseable only thru the most general methods such as CYK (because too difficult to express with LL / LR / RL grammars otherwise) I'll be glad, in fact, very interested, to check them out.

Thank you!

### { a^i b^j, i > j > 0 }

Okay, so that was a bit tedious (for me, anyway) to get it right in CNF, but I couldn't resist / was feeling too itchy to not know:

For the language { a^i b^j, i > j > 0 } described by,

S -> a S | a P
P -> a P b | a b


(which Grammophone reports as SLR(1), LALR(1), and LR(1), but not LL(1) and not LR(0))

here's a possible CNF version of it, for the same language:

S -> A S
S -> A P
P -> A Q
P -> A B
Q -> P B
A -> a
B -> b


(maybe there's a simpler one?)

which generates, eg,

a a b
a a a b
a a a b b
a a a a b
a a a a b b
a a a a a b
a a a a b b b

(... etc)


And for the input,

s = a a a b b

the full derivation is:

( a a a b b ) ( ) init.
( a a a b )   ( B b ) via R1
( a a a )     ( B b ) ( B b ) via R1
( a a )       ( A a ) ( B b ) ( B b ) via R1
( a a )       ( P ( A a ) ( B b ) ) ( B b ) via R2
( a a )       ( Q ( P ( A a ) ( B b ) ) ( B b ) ) via R2
( a )         ( A a ) ( Q ( P ( A a ) ( B b ) ) ( B b ) ) via R1
( a )         ( P ( A a ) ( Q ( P ( A a ) ( B b ) ) ( B b ) ) ) via R2
( )           ( A a ) ( P ( A a ) ( Q ( P ( A a ) ( B b ) ) ( B b ) ) ) via R1
( )           ( S ( A a ) ( P ( A a ) ( Q ( P ( A a ) ( B b ) ) ( B b ) ) ) ) via R2

(Success)
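The same hand derivation can be checked mechanically with a minimal Python sketch of the R1/R2 loop. Assumptions of mine: deterministic dict-encoded rules (one production per terminal and per nonterminal pair) and the eager apply-R2-first strategy inferred from the traces; `parse_rl` is an illustrative name, not anyone's actual implementation.

```python
def parse_rl(unary, binary, tokens, start):
    """R1/R2 rewriting, right to left (S-exprs as nested tuples)."""
    lc, rc = list(tokens), []
    while True:
        if len(rc) >= 2 and (rc[0][0], rc[1][0]) in binary:
            # R2: (P x) (R y) ... ~> (U (P x) (R y)) ...
            rc[:2] = [(binary[(rc[0][0], rc[1][0])], rc[0], rc[1])]
        elif lc and lc[-1] in unary:
            # R1: consume the last terminal of Lc.
            w = lc.pop()
            rc.insert(0, (unary[w], w))
        else:
            break
    return rc[0] if not lc and len(rc) == 1 and rc[0][0] == start else None

# The CNF grammar above for { a^i b^j, i > j > 0 }.
unary  = {'a': 'A', 'b': 'B'}
binary = {('A', 'S'): 'S', ('A', 'P'): 'S',
          ('A', 'Q'): 'P', ('A', 'B'): 'P', ('P', 'B'): 'Q'}

print(parse_rl(unary, binary, list('aaabb'), 'S'))  # member: full parse tree
print(parse_rl(unary, binary, list('ab'), 'S'))     # non-member (i = j): None
```

Under those assumptions, a a a b b yields the same final S-expr as the derivation above, while a b is rejected.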


### Interesting

I think you've described an RR parser. The processing rules for the tail of Lc are definitely describing right-to-left processing of the input string. It looks like it builds rightmost derivations in that example, as the process of applying a new root to the tree in Rc looks like the handle reduction in an LR parser.

Boundary cases for language classes are somewhat difficult to find. I actually found that one on stackoverflow a few months ago as I've been looking at ways to implement something a little bit more sane than bison to give to students to work with. That has slowly morphed into reimplementing parsec in c++, which seems to be interesting enough in its own way.

If anybody knows of any more boundary cases, e.g. a simple language that is context-free but not in LR, then I would also be very interested in seeing them. There's a slightly more complex version of your natural language example on the wiki page for CYK. I guess it's well known which class it is in, but I would have to think about it for a while to work it out.

The notes about complexity at the bottom imply that the minimum complexity for parsing should be $$\mathcal{O}(n^2\cdot|G|)$$ - I think, from memory, the predicted best case for binary matrix multiplication is $$\mathcal{O}(n^2)$$, but it is not known how to achieve it, as it would yield a linear-time multiplication algorithm for the integers. I have to repeat this is from memory, so I'm not entirely sure. But I suspect this means there is a missing term in your complexity analysis. It would probably show up on an example of a grammar from a hard language...

### Well, I shall know soon

Well, I shall know soon about that, with an actual implementation, less toy-ish grammars, and much longer inputs.

Both CYK and my algorithm are easy to implement and are mostly of theoretical interest, I'd agree with that.

Just to be clear, myself, I'm still not quite 100% sure about the generality of my method or its actual complexity, so I can surely understand the skepticism here.

But just like a naive recursive Fibonacci can eat the entire computer stack or take days to terminate as opposed to milliseconds for the non-recursive version, a few actually executable and less trivial test cases should tell us in a more definitive way (eg, when compared to CYK's) what this algorithm's actual O(...) looks like eventually.

### Tempered skepticism

I'm a little skeptical about the complexity - but I would emphasize the word "little" rather than the word skeptical. I think what you've done looks really interesting and I would definitely like to see more about it. The traces that you've shown to illustrate how the algorithm works are very clear and easy to follow. Looking at the way it folds up the parse tree for the LR boundary case gives me the feeling that you have something very interesting here. The steps that it goes through to consume the string from Lc, and build up the tree in Rc, are very simple. This suggests that small set of evaluation steps could be quite powerful.

Definitely looking forward to reading more about it as you work out the details.

### Time will tell

Thanks for the encouragement. I'll try to find time over the next few days, if the domestic allows...

Btw, on your other comment re: the traditional state stack found in bottom-up LR methods: my feeling is that mine gets its information flow over time embedded, in fact, in that S-expr it builds as the parse tree's rightmost derivative, sort of (just my intuition). My algorithm eats the input's terminals from one end to the other, and in the process refines the parse at the point Lc \ w by digging into the prior state of that parse on the right, not just by a look-ahead token, because there isn't any, really, as we've already noticed.

Yet, this does seem enough to actually parse against grammars from various classes, just as CYK does. Arguably, CYK doesn't have any look-ahead either, does it? Hence why I believe I'm closer to having found a variant of CYK, if anything, than a weaker or stronger LR or RL. The right-to-left direction I chose doesn't seem to matter, in fact, for getting a correct parse tree, before or after reversing a grammar. (James made a good point that this is something to be aware of as potentially impacting generality and effectiveness, though.) So, that's what I've found intriguing from the start: this independence from the grammar's "shape". It made me think of an analogy with the inside-out information flow of the App rule in H-M type inference, which doesn't care about the order in which the argument types have been introduced (relative to one another) before calling Unify(...) -- but okay, no idea if that'll make sense to anyone else but me so far.

More practically, what ballpark of test grammar sizes and input lengths would you recommend me to prepare and feed the unbiased computer to answer us more convincingly and definitively, either way?

(Toys may be important to test ideas, but are rather boring when it's demo time, so while I'm at it...)

### Ah! Wait. Why the independence?

After editing the above comment, it finally sank in:

why does the effectiveness of my algorithm seem independent of the direction chosen for eating the input?

Well, I suspect it's simply because of the shape of a CFG in CNF with respect to shift and reduce operations: it is totally symmetric for both of them (ie, the computations they carry out). "Duh."

Indeed, for the CYK algorithm also, it just doesn't matter for building the parse tree, which direction you used to reduce the terminals on the diagonal of the matrix... The other steps don't need to care... because the CNF will still be symmetric in essence anyway, modulo the pair: language described vs. input being processed.

And there is also an invariant location where to find (or not) the reduction to the grammar start symbol, depending on whether the word belongs to the language -- for CYK, it's "in that corner, over there, facing the diagonal", and can be nowhere else; for my algorithm, it's that last, single outer S-expr, and can be nowhere else.
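That invariant corner is easy to exhibit in code. Below is a minimal CYK recognizer sketch in Python (the `(lhs, rhs)` tuple encoding of CNF productions and the function name are my own assumptions, not from the post); the membership answer lives in `table[0][n-1]` and nowhere else:

```python
def cyk_recognize(grammar, start, s):
    """Minimal CYK recognizer over a CNF grammar.

    grammar: list of productions, each (U, (w,)) for U -> w,
             or (U, (P, R)) for U -> P R.
    s: list of terminals.  Returns True iff s is in the language.
    """
    n = len(s)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals deriving the span s[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(s):                  # the "diagonal": rules U -> w
        for U, rhs in grammar:
            if rhs == (w,):
                table[i][0].add(U)
    for l in range(1, n):                      # span length minus one
        for i in range(n - l):                 # span start
            for k in range(l):                 # split point inside the span
                for U, rhs in grammar:
                    if (len(rhs) == 2 and rhs[0] in table[i][k]
                            and rhs[1] in table[i + k + 1][l - k - 1]):
                        table[i][l].add(U)
    # the invariant "corner", facing the diagonal: the whole-input span
    return start in table[0][n - 1]

# Eg, the even-palindrome CNF grammar given further down this thread:
G = [("P", ("AP", "A")), ("P", ("BP", "B")),
     ("AP", ("A", "P")), ("BP", ("B", "P")),
     ("P", ("A", "A")), ("P", ("B", "B")),
     ("A", ("a",)), ("B", ("b",))]
```

With that grammar, `cyk_recognize(G, "P", list("abba"))` comes out True, and `cyk_recognize(G, "P", list("abab"))` False.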

### I'm thinking of benchmarking it

I'm thinking of benchmarking it vs. CYK for |G| ~ 20 to 50 productions, and input lengths ~ 1,000 to 10,000 terminals (*)

Thoughts?

(* typical of what a human can grok in minutes throughout a few pages, be it in a natural or computer language)

### Et voilà! Fail.

The failure that James was looking for:

The reverse of this language is,

{ b^i a^j, 0 < i < j }

Here's the (obviously) working CNF version:

S -> S A
S -> P A
P -> Q A
P -> B A
Q -> B P
A -> a
B -> b


(interestingly enough, btw: still not LL(1) or LR(0), but still SLR(1), LALR(1), and LR(1), says Grammophone)
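To sanity-check that this CNF generates exactly { b^i a^j, 0 < i < j } up to a bounded length, a quick bottom-up enumerator sketch can help (Python; the `(lhs, rhs)` tuple encoding of productions is my assumption):

```python
def generate(grammar, start, max_len):
    """Enumerate all words of length <= max_len derivable from `start`
    in a CNF grammar, by a bottom-up fixpoint over the productions."""
    lang = {}  # lang[X] = set of derivable terminal strings, as tuples
    for U, rhs in grammar:
        if len(rhs) == 1:
            lang.setdefault(U, set()).add(rhs)
    changed = True
    while changed:
        changed = False
        for U, rhs in grammar:
            if len(rhs) != 2:
                continue
            P, R = rhs
            for u in tuple(lang.get(P, ())):     # snapshots: lang mutates
                for v in tuple(lang.get(R, ())):
                    w = u + v
                    if len(w) <= max_len and w not in lang.setdefault(U, set()):
                        lang[U].add(w)
                        changed = True
    return {"".join(w) for w in lang.get(start, ())}

# The CNF above, productions as (lhs, rhs) tuples:
G = [("S", ("S", "A")), ("S", ("P", "A")),
     ("P", ("Q", "A")), ("P", ("B", "A")),
     ("Q", ("B", "P")),
     ("A", ("a",)), ("B", ("b",))]
```

Up to length 6, `generate(G, "S", 6)` yields exactly the words b^i a^j with 0 < i < j and i + j <= 6.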

So, my algorithm lamentably fails on it... while it worked like a charm on its mirror, and other mirrored languages. Was just "bad-good luck".

But this language is "nastily" anti-symmetric, so to speak.

So, I made another false claim earlier in my "Why the independence" comment.

Thus, as it stands in the post, the algorithm is in fact, not independent of the direction of eating the terminals, for such pairs of mirrored languages.

Thank you Andrew and James, that's team work, now :)

But, guess what, there is a way to amend (almost trivially) the algorithm, to make it parse correctly against both grammars / recognize words from either mirrored languages.

Just... guess how! :)

Hint: time vs space (but not in complexity, in logic, rather).

### Complexity

Could you explain your reasoning behind O(|G|*n) for the complexity?

If I look at the reduction step then it must scan each rule in the grammar to determine which rule can reduce the handle; O(|G|). This may be repeated up to O(|G|) times, depending on the structure of the grammar and the length of the chains between rules (i.e. C->B, B->A ...), and this reduction step needs to be performed for each terminal. Is it not at least $$\mathcal{O}(|G|^2\cdot n)$$ ?

### At each step,

At each step, we have a left context Lc consisting of n-i terminals, where n is the input's length, for i varying from 0 to n-1.

The right context Rc is an evolving S-expr capturing what has already been reduced via either form of production in the CNF.

At each iteration, from the one (or two) S-exprs cached at the head of Rc (ie, the leftmost ones, S1 and S2 in Rc = S1 S2... Sk), we know which right-hand side, eg, P and R, in productions of the form

U -> P R


we need to look up in G.

That's because S1 and S2 will be of the form (P (some child S-expr...)) and (R (some child S-expr...)) assuming P and R have already been reduced from former states of Rc (from past greater values of n-i, or smaller ones of i, equivalently).

The lookup of the pair P R is bounded in proportion to / linear in |G|, but it can be improved if we have provisioned, for instance, an inverse map whose keys are those pairs and whose values are the productions. We cannot avoid this lookup throughout the input, of course, but it needs to be done no more than once at each iteration; hence my O(|G| * n) estimate.
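That inverse map is cheap to build once, up front; a sketch (Python; the `(lhs, rhs)` tuple encoding of productions and the names are mine):

```python
from collections import defaultdict

def build_inverse_map(grammar):
    """Index the binary CNF productions U -> P R by their right-hand
    side, so the pair (P, R) read off the head of Rc is looked up in
    O(1) average time instead of scanning all of G."""
    inv = defaultdict(set)
    for U, rhs in grammar:
        if len(rhs) == 2:       # binary productions only
            inv[rhs].add(U)
    return inv

# Eg, with the toy grammar from the post:
G = [("S", ("NP", "VP")), ("VP", ("Vt", "NP")), ("NP", ("Det", "N")),
     ("Det", ("a",)), ("Det", ("the",)),
     ("N", ("boy",)), ("N", ("dragon",)), ("Vt", ("saw",))]
```

Then `build_inverse_map(G)[("Vt", "NP")]` gives `{"VP"}` in one dictionary access.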

### Wrong.

but it needs to be done no more than once

(Wrong. See Andrew's comment below.)

My bad, Andrew. I finally got your point. Thank you. I'm slow. You're right that this lookup wouldn't be only in O(|G|).

Actually, the per-iteration cost would more likely be in O(|G| * log n) for the parse trees induced by the general shape of the CNF because, at each step, as you point out, we have on average a number of potentially applicable reductions proportional to the depth of the actual (always binary) parse tree, not just one (reduction).

So, I'd re-estimate the overall cost as O(|G| n log n). I'll try to lay it out more thoroughly using the usual T(n) formulas, with a complexity analysis more worthy of the name.

E.g., just looking at the non-LR(1) example (as I should have noticed much earlier) :

n = 5

depth of parse tree = 4

5 * log2(4) = 10


corresponding, seemingly, to the 9 or 10 steps of what turns out to be the actual derivation (excluding only the lookups over G, which contribute an extra time complexity factor in O(|G|)).

### Not.

So, I'd re-estimate overall as O(|G| n log n)

Heck, not. Revised below.

Your attentive criticism is the sort of feedback I was / am hoping for. Thank you.

### Grammophone

Michael Daines' Grammophone tool is a neat grammar sanity checker / analyzer, btw.

### Worth giving my shot at it

So, I'll be coding this over the weekend hopefully, with more palatable test data.

Will push the code and whatever results I get on GitHub.

Cheers.

### Other non-LR(1) CF grammars and/or languages (FYI / test suite)

Palindromes of even length

Classed: not LL(1), not LR(0), not SLR(1), not LALR(1), not LR(1)

CFG:

P -> a P a | b P b | a a | b b

in CNF:

P -> AP A
P -> BP B
AP -> A P
BP -> B P
P -> A A
P -> B B
A -> a
B -> b


Sample sentences:

a a
b b
a a a a
a b b a
b a a b
b b b b
a a a a a a
a a b b a a
a b a a b a
a b b b b a
b a a a a b
b a b b a b
b b a a b b
b b b b b b
a a a a a a a a
a a a b b a a a
...


{ a^m b^n, 0 < m <= n <= 2m }

Classed: not LL(1), not LR(0), not SLR(1), not LALR(1), not LR(1)

CFG:

S -> a S b | a S b b | a b b | a b


in CNF:

S -> A SB
S -> ASB B
S -> AS BB
S -> A B
S -> A BB
SB -> S B
ASB -> AS B
AS -> A S
BB -> B B
A -> a
B -> b


Sample sentences:

a b
a b b
a a b b
a a b b b
a a b b b b
a a a b b b
a a a b b b b
a a a b b b b b
a a a b b b b b b
a a a a b b b b
a a a a b b b b b
a a a a b b b b b b
a a a a b b b b b b b
a a a a b b b b b b b b
a a a a a b b b b b
a a a a a b b b b b b
...


{ a^n b^n, n > 0 } U { a^n b^2n, n > 0 }

Classed: not LL(1), not LR(0), not SLR(1), not LALR(1), not LR(1)

CFG:

S -> a C b | a D b b
C -> a C b | .
D -> a D b b | .


in CNF:

S -> AC B
S -> A B
S -> AD BB
S -> A BB
C -> AC B
C -> A B
D -> AD BB
D -> A BB
AC -> A C
AD -> A D
BB -> B B
A -> a
B -> b


Sample sentences:

a b
a b b
a a b b
a a b b b b
a a a b b b
a a a b b b b b b
a a a a b b b b
a a a a b b b b b b b b
a a a a a b b b b b
...


...

(stay tuned)

### Tests updated

Tests updated with two more languages found in the literature; cf. 06-basic-grammar.txt / 06-basic-input.txt and 07-basic-grammar.txt / 07-basic-input.txt.

Interesting results for the longest inputs; cf. test-results-CYK-20170228.txt vs. test-results-CJ2-20170228.txt.

See the notes, btw.

### No doubts any longer about the time complexity.

No doubts any longer about the time complexity (the machine cannot lie or care less about that).

What I am now pondering (anew) is the algorithm's generality wrt. the CFL classes (its termination is guaranteed, and soundness was easy to check too, but my completeness proof sketch was flawed; hence the deletion edit).

### Bah, let's just keep fishing

I think I'm just going to continue fishing for a bunch of grammars mentioned by LL / LR / GLR users and see where that takes us after CNF conversion.

### Incomplete

Unsurprisingly, as it turns out, this algorithm is incomplete, including for CFLs that can be described by an unambiguous grammar and for inputs yielding single parse trees.

It isn't difficult to build a counter-example exhibiting this on less trivial languages, be their grammars LL, LR, or neither.

However, while scrutinizing its incompleteness on failure cases (where it should have found the parse tree and couldn't), I stumbled on new insights wrt. CFGs in CNF, and (or am I rediscovering it?) it seems useful to introduce the notion of a finite "kernel" of a language.

It seems that this notion is closely related, for instance, to what makes the language of palindromes fall into the nondeterministic CFL class.

More later.

### Definition: kernel K(L/G) of L relative to G

Informally

The kernel of a (context-free) language, relative to a (context-free) grammar that describes it, is the set of pairs of terminals such that at least one instance of some pair (aka "a kernel witness") appears in any word of the language; IOW, any word of the language thus contains one or more witness instances from that kernel (different words possibly containing different witnesses);

further, there is exactly one witness instance in the word which leads to the full parse tree (via the given grammar's applicable reductions) after computing both the rightmost and the leftmost parse tree derivatives at the point of that witness instance (which we may call "the parse witness", among the other witnesses possibly also present) in the input word.

More formally

Given a context-free language L described by a context-free grammar G (in CNF),

G = (N, T, P, S) where N is the set of nonterminals, T is the set of terminals, P is the set of productions, and S is the start symbol,

we define,

the kernel K(L/G) of L relative to G,

K(L/G) = { A . B in N², such that (starting from the production(s) S -> U V), A is in dL(U) and B is in dR(V) }


with, for all X in N,

dL(X) = {
if X -> w in P, and there is Y -> C X in P
reachable from U in S -> U V:
include X
or
if X -> A B in P, and there is Y -> C X in P
reachable from U in S -> U V:
include all of dL(B) (by transitive closure)
}


dR(X) = {
if X -> w in P, and there is Y -> X D in P
reachable from V in S -> U V:
include X
or
if X -> A B in P, and there is Y -> X D in P
reachable from V in S -> U V:
include all of dR(A) (by transitive closure)
}
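Here's a sketch of these two closures, and of K(L/G) built from them (Python; the production encoding and names are my assumptions, and for simplicity this version drops the "there is Y -> C X ... reachable" side condition, so in general it may compute a superset):

```python
def d_closure(grammar, X, side, seen=None):
    """dL (side=0) / dR (side=1): the nonterminals with a unary
    production that can emit the rightmost (dL) / leftmost (dR)
    terminal of X's yield.  The side condition is omitted here."""
    if seen is None:
        seen = set()
    if X in seen:                            # guard against cycles
        return set()
    seen.add(X)
    out = set()
    for U, rhs in grammar:
        if U != X:
            continue
        if len(rhs) == 1:
            out.add(X)                       # X -> w: include X itself
        else:
            # X -> A B: dL recurses into B (right child), dR into A
            out |= d_closure(grammar, rhs[1 - side], side, seen)
    return out

def kernel(grammar, start):
    """K(L/G): pairs (A, B) with A in dL(U) and B in dR(V),
    over the top-level binary productions S -> U V."""
    K = set()
    for S, rhs in grammar:
        if S == start and len(rhs) == 2:
            U, V = rhs
            for A in d_closure(grammar, U, 0):
                for B in d_closure(grammar, V, 1):
                    K.add((A, B))
    return K

# Eg, the even-palindrome CNF grammar used elsewhere in this thread:
G = [("P", ("AP", "A")), ("P", ("BP", "B")),
     ("AP", ("A", "P")), ("BP", ("B", "P")),
     ("P", ("A", "A")), ("P", ("B", "B")),
     ("A", ("a",)), ("B", ("b",))]
```

On that grammar, `kernel(G, "P")` comes out as the four pairs A.A, A.B, B.A, B.B, matching the kernel claimed for the palindromes elsewhere in this thread.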


Eg, with the Wikipedia example:

https://en.wikipedia.org/wiki/CYK_algorithm#Example

K(L/G) = { N . V, NP . V }

(ie, the set of two pairs (in N²) which generates the terminal pairs of L's finite kernel: { "she eats", "fish eats", "fork eats" })


I will put this idea to test in the next version of the algorithm; roughly:

in the input, the instance(s) (*) of terminal pairs generated by the language's kernel (aka its witnesses in the input) shall give us (a lot of?) useful information wrt. deciding the membership of the input word in the language and/or finding its possible parse(s), "simply" by "looking" at what appears on both sides, left and right, of some or all of these witnesses.

(* clearly potentially multiple, in the case of recursive languages)

### And what makes the CFL of palindromes so "hard"

And what makes the CFL of palindromes so "hard" that it is impossible to reject a non-word without having read the entire input?

Simple:

that's because (one or more) words from its kernel -- { a a, b a, a b, b b } -- (ie, witnesses thereof) are found "all over the place" in the input.

Actually, inherent to the language itself, a non-empty subset of that kernel is guaranteed to appear throughout the entire input, all along its length.

IOW, "always too much information is the same as never enough information," in this case.

No wonder / hence another way to see why we need more computational power on that one (ie, beyond deterministic pushdown automata).

### Easier / more friendly (recursive) CFLs and their kernels

L1 = { a^n b^n, n > 0 }

K(L1/G1) = { a b }

Expected witness occurrences: exactly one


L2 = { a^n b^n c, n > 0 }

K(L2/G2) = { a b }

Expected witness occurrences: exactly one


or,

K(L2/G2) = { b c }

Expected witness occurrences: exactly one


(depending on how we design / write L2's grammar G2 in CNF)

Etc, etc.

At the language level, I believe the size of L1's and L2's kernels, combined with the trivial multiplicity of their witnesses (expected to be exactly 1 for all words in the respective languages), makes it much easier / a non-issue for the known LL or LR algorithms to reject non-words, as soon as that is visible (to a pushdown automaton).

(At the exact opposite from the palindromes)

### An upper bound, and on the significance of the kernel

Upper bound

for an input s of length n, and if s is in L(G), s may contain at most n-1 occurrences of words generated by the language's kernel.

This is by construction: the words generated by L's finite kernel are pairs of terminals, and there are at most n-1 possible locations for such pairs of adjacent terminals throughout the input.

Eg, in the ("extreme") case of L = the palindromes of even length,

given G,

P -> a P a | b P b | a a | b b


or, in CNF, as G':

P -> AP A
P -> BP B
AP -> A P
BP -> B P
P -> A A
P -> B B
A -> a
B -> b


we have:

K(L/G') = { p1, p2, p3, p4 }

where,

p1 = a a

p2 = b a

p3 = a b

p4 = b b


Significance of the kernel

Eg, let,

s = b a b a a b a b

be the input to recognize and parse.

Thus, after computing K(L/G'), we can match s against it in one pass, yielding,

s = b      a      b      a      a      b      a      b
--(p2)--      --(p2)--      --(p3)--      --(p3)--
--(p3)--      --(p1)--      --(p2)--


Now let us simply index these witnesses of K(L/G'), throughout their locations,

s = b      a      b      a      a      b      a      b
--(k1)--      --(k3)--      --(k5)--      --(k7)--
--(k2)--      --(k4)--      --(k6)--
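Indexing those witnesses takes a single linear scan over the input; a sketch (Python; the names are my assumptions):

```python
def find_witnesses(kernel_pairs, s):
    """Return the (position, pair) occurrences in s of the terminal
    pairs generated by the kernel -- the k1, k2, ... indexed above."""
    return [(i, (s[i], s[i + 1]))
            for i in range(len(s) - 1)
            if (s[i], s[i + 1]) in kernel_pairs]

# For the even palindromes, the kernel generates all four terminal pairs:
K = {("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")}
```

Here, `find_witnesses(K, "babaabab")` returns seven entries, k1 through k7; the third, at position 2, is k3 = (b, a).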


Let us pick, say, k3 ( = p2 = b a ) and compute the rightmost parse tree derivative of the prefix found on k3's left (as described in the post):

initial Lc = b a b

b      a      b          (empty)
(init.) Lc> \_____________/          ( ) <Rc

b      a
Lc> \______/                  B <Rc (via R1)

b
Lc>                           A B <Rc (via R1)

(empty)
Lc>                           B A B <Rc (via R1)


So far, this doesn't seem to get us anywhere.

Maybe we didn't pick a useful witness of the kernel. Or, did we?

So, nevertheless, let's just continue with computing, this time, the (dual) leftmost parse tree derivative of the suffix found on k3's right.

initial Rc = a a b a b

(empty)        a      a      b      a      b
Lc> ( )    \___________________________/ <Rc (init.)

a      b      a      b
Lc> A             \____________________/ <Rc (via R1)

b      a      b
Lc> A A                  \_____________/ <Rc (via R1)

b      a      b
Lc> ( P A A )            \_____________/ <Rc (via R2)

a      b
Lc> ( P A A ) B                 \______/ <Rc (via R1)

b
Lc> ( P A A ) B A                        <Rc (via R1)

(empty)
Lc> ( P A A ) B A B                      <Rc (via R1)


Still, nothing exciting, it seems.

Certainly, though, we could record which reduction outcomes we got last, for each derivative, respectively -- B and ( P A A ) in this example -- and try to bridge both ends / S-expr lists from there.

That is, using, on the left:

      B      A      B


and, on the right:

      ( P A A )     B      A      B


we will further reduce, from now on, by exploiting only either the immediately adjacent left or the immediately adjacent right of the last reduction's outcome (ie, of the previous iteration), starting from the chosen kernel witness's location:

      B      A      B      ( P A A )      B      A      B
( found: BP -> B P )

can reduce to:
B      A    ( BP B ( P A A ) )      B      A      B

B      A    ( BP B ( P A A ) )      B      A      B
( found: P -> BP B )

can reduce to:
B      A    ( P ( BP B ( P A A ) ) B )     A      B

B      A    ( P ( BP B ( P A A ) ) B )     A      B
( found: AP -> A P )

can reduce to:
B    ( AP A ( P ( BP B ( P A A ) ) B ) )   A      B

B    ( AP A ( P ( BP B ( P A A ) ) B ) )   A      B
( found: P -> AP A )

can reduce to:
B    ( P ( AP A ( P ( BP B ( P A A ) ) B ) ) A )  B

B    ( P ( AP A ( P ( BP B ( P A A ) ) B ) ) A )  B
( found: BP -> B P )

can reduce to:
( BP B ( P ( AP A ( P ( BP B ( P A A ) ) B ) ) A ) ) B

( BP B ( P ( AP A ( P ( BP B ( P A A ) ) B ) ) A ) ) B
( found: P -> BP B )

can reduce to:
( P ( BP B ( P ( AP A ( P ( BP B ( P A A ) ) B ) ) A ) ) B )

(Success)


and, indeed:

      b    a    b    a    a    b    a    b
B    A    B    A    A    B    A    B
B(k3)P.....
BP.........    B
P...............
AP...................    A
P.........................
BP.............................    B
P...................................


Or, in CYK-flavored layout:

   ((b((a((b (a  a))b))a))b)

b----------------BP--P
|  |
a----------AP--P  |
|  |  |
b----BP--P  |  |
(k3)   |  |  |  |
a--P  |  |  |
|  |  |  |
a  |  |  |
|  |  |
b  |  |
|  |
a  |
|
b


We would have failed parsing it, if we had picked any of the other witnesses: k1, k2, k4, k5, k6, or k7, to carry out our kernel-directed reductions.

### Conjecture

Conjecture:

given a grammar G, in CNF, of a context-free language L whose words have a single parse tree (relative to G),

computing (and exploiting) the kernel K(L/G) of L relative to G should allow us to parse words of L against G in:

O(|G| |U|² n² log n)

time.

Quick rationale:

a) the kernel witnesses present in the input can be as many as n-1 (as shown above); obviously, there is at least one such witness if the input is a word of the language (because witnesses are generated at the meet of dL(U) with dR(V), from a production S -> U V, where S is the start symbol), but there can't be more than n-1 of them in any input of length n, anyway -- and, as we've seen, the language of palindromes is one CFL example (are there others, btw?) which exhibits the upper bound / worst case in that respect;

b) computing the input's left or right derivatives at some terminal location w (wrt. Lc and Rc as described in the post) is believed to be in O(|G| n log n) (cf. other comments and some non-contradicting empirical results on GitHub);

c) the size of the kernel is bounded in O(|U|²) (|U| being the number of distinct nonterminals on the left-hand side of the unary productions U -> w of the grammar);

d) hence, combining (a), (b), and (c),

( n - 1 ) * |G| * n * log n * |U|²

~

O(|G| |U|² n² log n)

(in worst case).

Eg, as in the post:

S  -> NP  VP
VP -> Vt  NP
NP -> Det N

Det-> a
Det-> the
N  -> boy
N  -> dragon
Vt -> saw


we have,

|G| = 8, |U| = 3 (Det, N, Vt)

note that in practice, though, we can reasonably expect Card(K(L/G)) to be smaller, often much smaller, than its upper bound, |U|²;

in this example, it's just the singleton:

K(L/G) = { N . Vt }

which generates the witnesses,

{ "boy saw", "dragon saw" }

### Ingenuous question

Ingenuous question:

has anybody else noticed what I call the kernel for the particular class of context-free languages that can be described by unambiguous grammars, before?

If yes, then clearly I must have missed it; so, pointers welcome.

If not, then I also welcome whoever is interested in picking up and carrying on this investigation; to help me find a solid (formal) completeness proof (assuming, of course, that one exists).

(ysharp \dot\ design |at| gmail /dot/ com)

Thank you.

### The ATIS grammar

I've gotten hold of a version of the ATIS grammar converted to CNF (thus CYK- and my-algorithm-friendly).

(Found in an assignment for the course "ANLP 2015 / Advanced Natural Language Processing" of Prof. Dr. Alexander Koller.)

With ~ 20,000 productions (rules), it's quite an interesting candidate of a non-toy grammar, for benchmarks.

I've refactored my CYK implementation to make it complete (and able to yield parse forests, not just a random single parse tree). I'll push this to GitHub again soon, if only to have a better point of reference.

Thus, that CYK is now able to parse 73% (72 out of 98) of a sample made of sentences like these, uttered in airports:

(excerpt)

i want to leave before noon .
what flights leave boston to pittsburgh .
i 'd like an afternoon flight .
are any fares cheaper than these .
what is the flying time from .
which flights use a large plane .
what area does canadian airlines international service .
what flights leave las vegas to oakland .
show me the airlines and flight numbers .
what airlines fly from toronto to detroit .
which city is cheapest to fly into .
does delta fly to boston from denver .
what is the duration of this flight .
what flights leave los angeles arriving in minneapolis .
i can only spend a hundred fifty dollars .
i 'd like to leave on a saturday .
orlando to kansas city and then to minneapolis .
is there a flight from memphis to los angeles .
what if i wanted to leave on may fifth .
list u s air flights from dallas to boston .
i need a flight from philadelphia to westchester county .
what is the cheapest ticket from memphis to miami .
find a flight from washington d c to milwaukee .
find a flight from washington d c to montreal .
show american flights after twelve p.m. from miami to chicago .

(...)


Etc, etc.

Then, I'll see where integrating this notion of kernel K(L/G) into my algorithm takes us.

### Complete CYK implementation and ATIS sample

Complete CYK implementation and ATIS sample committed.