archives

CFL parsing, and another way to look at the CNF...

I'll implement it and try to break it when I get a chance, but it seems the below should exhibit O(|G| |U|2 n2 log n) worst case behavior (in time), where |G| is the total size of the grammar, |U| is the number of distinct nonterminals in the subset of unary productions U -> w in G, and n the input's length.

Or, what am I missing?

We consider only unambiguous CFGs in Chomsky Normal Form (CNF).

We devise a sort of "type system" over rewritings of the input as we process it from right to left, from end to beginning.

We'll look at the algorithm's work environment as being made of a left context, say Lc, a current symbol, say w (talking about a terminal) and a right context, say Rc.

The left context (Lc) is always a (length-decreasing*) prefix string w0...wi...w of terminals of the actual input (equal to the entire input s = w0...wn-1 initially, and empty at the end, when no more production type rewritings can be applied).

(* albeit, non-strictly)

The right context (Rc) is a mutating list of computed S-expressions storing what has been consumed and "typed" by reading the input from right to left, from end to beginning.

In this scheme, it is constructed so as to be, also, the recorded value of the rightmost-derivative of the parse tree, at the "point" Lc \ w, before Lc gets empty and reduction to the grammar start symbol (if possible) happens, eventually finalizing Rc as the single S-expr (S w0...wn-1) (or failing to do so).

Initially, Rc is empty, and Lc equals the entire input.

In the CNF grammar, we treat the two distinct types of productions thusly:

Productions of the form U -> w (aka rule R1)

Rule R1: U -> w  will be said to "type" w between Lc and Rc as... w: U ("the utterance of w is of type U between any Lc and Rc")

that is, whenever Lc ends with w

rewrite...

Rewriting via R1

Lc Rc ~> (Lc \ w) (U w) Rc
 new Lc> \------/ \------/ <new Rc

(takes a list of S-exprs as Rc and prepends the new S-expr (U w) to it, making a new Rc longer by one S-expr, while Lc gets shorter by one terminal, w)

Productions of the form U -> P R (aka rule R2)

Rule R2: U -> P R  will be said to "type" P followed by R in Rc as... P: R -> U ("the utterance of P applied to Rc typed as R promotes Rc to U")

that is, whenever head(Rc) = (P x) and head(tail(Rc)) = (R y)

rewrite...

Rewriting via R2

Lc Rc ~> Lc (U head(Rc) head(tail(Rc))) tail(tail(Rc))
            \----------------------------------------/ <new Rc

(takes a list of S-exprs as Rc and turns it into a new Rc shorter by one S-expr, Lc left unchanged)

Notice the one to one correspondence between mutually ambiguous grammar productions and mutually unsound production types, btw:

E.g., having, in the grammar, the mutually ambiguous

A -> BC
D -> BC

would yield, in this "type system", attempts at applying the mutually unsound (if only signature-wise)

B: C -> A
B: C -> D

Which would obviously be intrinsically problematic from either / both perspectives, and no matter what the input may be, as soon as said input will contain utterances of (a substring already reduced to) B, followed by (a substring already reduced to) C.

Anyway.

Now for a concrete example (full parse / derivation); consider the grammar in CNF (informal meaning given in parentheses):

(nonterminal-only RHS)

S  -> NP  VP  ("a noun phrase followed by a verb phrase reduces as a sentence")
VP -> Vt  NP  ("a transitive verb followed by a noun phrase reduces as a verb phrase")
NP -> Det N   ("a determinative followed by a noun reduces as a noun phrase")
N  -> Adj N   ("an adjective followed by a noun reduces as a noun")

and

(terminal-only RHS)

Det-> a       ("'a' reduces as a determinative")
Det-> the     ("'the' reduces as a determinative")
Adj-> young   ("'young' reduces as an adjective")
N  -> boy     ("'boy' reduces as a noun")
N  -> dragon  ("'dragon' reduces as a noun")
Vt -> saw     ("'saw' reduces as a transitive verb")

Which we'll use to "type" thru rewritings of S-exprs in Rc, as:

NP : VP -> S  ("the utterance of 'NP' applied to Rc typed as 'VP' promotes Rc to 'S'")
Vt : NP -> VP ("the utterance of 'Vt' applied to Rc typed as 'NP' promotes Rc to 'VP'")
Det: N  -> NP ("the utterance of 'Det' applied to Rc typed as 'N' promotes Rc to 'NP'")
Adj: N  -> N  ("the utterance of 'Adj' applied to Rc typed as 'N' promotes Rc to 'N'")

and

a     : Det   ("the utterance of 'a' is of type 'Det' between any Lc and Rc")  
the   : Det   ("the utterance of 'the' is of type 'Det' between any Lc and Rc")
young : Adj   ("the utterance of 'young' is of type 'Adj' between any Lc and Rc")
boy   : N     ("the utterance of 'boy' is of type 'N' between any Lc and Rc")
dragon: N     ("the utterance of 'dragon' is of type 'N' between any Lc and Rc")
saw   : Vt    ("the utterance of 'saw' is of type 'Vt' between any Lc and Rc")

After iteration 0 (init)

   (the          young          boy         saw         a         dragon)
Lc>\--------------------------------------------------------------------/ ( ) <initial Rc

(NB: Rc = empty list)

After iteration 1, via R1

   (the          young          boy         saw         a)        dragon
Lc>\-----------------------------------------------------/       (N dragon)1
                                                                 \--------/ <new Rc
(NB: Rc = list of one root S-expression)

After iteration 2, via R1

   (the          young          boy         saw)        a          dragon
Lc>\-------------------------------------------/       (Det a)1   (N dragon)2
                                                       \-------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 3, via R2

   (the          young          boy         saw)        a         dragon
Lc>\-------------------------------------------/   (NP (Det a)   (N dragon))1
                                                   \-----------------------/ <new Rc
(NB: Rc = list of one root S-expression)

After iteration 4, via R1

   (the          young          boy)        saw          a         dragon
Lc>\-------------------------------/       (Vt saw)1(NP (Det a)   (N dragon))2
                                           \--------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 5, via R2

   (the          young          boy)        saw         a         dragon
Lc>\-------------------------------/
    the          young          boy    (VP (Vt saw)(NP (Det a)   (N dragon)))1
                                       \------------------------------------/ <new Rc

(NB: Rc = list of one root S-expression)

After iteration 6, via R1

   (the          young)         boy          saw         a         dragon
Lc>\------------------/
    the          young       (N boy)1   (VP (Vt saw)(NP (Det a)   (N dragon)))2
                             \-----------------------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 7, via R1

   (the)         young           boy          saw         a         dragon
Lc>\---/
    the         (Adj young)1  (N boy)2   (VP (Vt saw)(NP (Det a)   (N dragon)))3
                \-------------------------------------------------------------/ <new Rc

(NB: Rc = list of three root S-expressions)

After iteration 8, via R2

   (the)         young          boy          saw         a         dragon
Lc>\---/
    the      (N (Adj young)  (N boy))1  (VP (Vt saw)(NP (Det a)   (N dragon)))2
             \---------------------------------------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 9, via R1 (final Lc is empty)

    the           young          boy          saw         a         dragon
   (Det the)1 (N (Adj young)  (N boy))2  (VP (Vt saw)(NP (Det a)   (N dragon)))3
   \--------------------------------------------------------------------------/ <new Rc

(NB: Rc = list of three root S-expressions)

After iteration 10, via R2

    the          young          boy          saw         a         dragon
(NP(Det the) (N (Adj young)  (N boy)))1 (VP (Vt saw)(NP (Det a)   (N dragon)))2
\----------------------------------------------------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 11, via R2

       the          young          boy         saw         a         dragon
(S (NP(Det the) (N (Adj young)  (N boy))) (VP (Vt saw)(NP (Det a)   (N dragon))))1
\-------------------------------------------------------------------------------/ <new Rc

(NB: Rc = list of one root S-expression)

Success, with the expected parse binary tree:

       the          young          boy         saw         a         dragon
       Det          Adj            N           Vt          Det       N
                            N                                   NP
               NP                                     VP
                                   S

Disclaimer edit:

this is just genuine amateur research from a practitioner who's never been in academia (that would be me). I've obviously posted it here on a friendly LtU only because I'm (still) wondering if I'm on to something / or what am I missing. Bear with me (and thank you in advance).