Perl Cannot Be Parsed: A Formal Proof

Perl Cannot Be Parsed: A Formal Proof via Perl Monks.

Elegantly proved via reduction to (from?) the Halting Problem.

Computers cannot work

The same reasoning proves that it's impossible to run any program, since the ability to run a program would imply a solution to the halting problem.

Uh, no...

Assuming you are being serious: running a program only determines whether it halts on one particular set of inputs. And since the running program itself might not halt, thus failing to yield (in finite time) an answer to the question "does this halt?", you might not learn even that much.

More specifically...

the static structure of a Perl 5 program cannot be determined, in the general case, from static inspection of the code (i.e. without running the program). The correct parse of a program may depend on its inputs.
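For reference, the snippet at the heart of the linked proof (concocted by Randal Schwartz) is roughly this:

```perl
whatever  / 25 ; # / ; die "this dies!";
```

If whatever has previously been declared with an empty prototype (sub whatever();), the / is division and everything after the # is a comment. If not, whatever is a list operator, / 25 ; # / is a regex-match argument passed to it, and the die executes.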

This is something long suspected, and now apparently proven.

Obviously this doesn't mean "interpretation" of Perl is impossible, as Perl scripts are parsed lazily. One is tempted to compare this to the situation with dynamic vs. static typing--dynamic typing systems can capture program logic that cannot be captured in purely static typing systems (without casts), because only those parts of the program which actually need typechecking are typechecked (generally via the dynamic dispatch mechanism). Of course, this severely constrains the sort of static analyses which may be made.

Should we call Perl an example of "dynamic syntax"?

Fear

Should we call Perl an example of "dynamic syntax"?

No, we shouldn't, because then somebody will think "dynamic means good, right?", and start imitating Perl syntax.

Even more specifically

The static structure of some Perl 5 programs cannot be determined.

If the behavior of the Exporter module is well known (which it is), and the export lists can be proven to be static, then the effect on the symbol table of the consuming module can be statically determined, and hence the remaining source code can be parsed statically.
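For instance, a minimal sketch (module and sub names invented) of an export list that is provably static:

```perl
package My::Utils;                # hypothetical module
use Exporter 'import';
our @EXPORT_OK = qw(frobnicate);  # a literal, hence statically knowable, export list
sub frobnicate { return 42 }
1;
```

A consuming module's `use My::Utils qw(frobnicate);` then has a statically determinable effect on its symbol table, so the code that follows it can be parsed without running anything.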

This proof rests on the fact that you *can* put nondeterministic behavior in code that runs at compile time and therefore affects parsing -- but most code doesn't actually do that.

That aside, statically parsing Perl is still very difficult ;-)

Well, sure

However, if you want to write a refactoring tool, you really want it to work for all legal code—otherwise, people just aren't going to trust it.

At best, you could refuse to make any changes when the code isn't statically parseable.

On a side note: Perl supports code that overrides the parser for the rest of the file, right? Does that get used much, for EDSLs and what-not? (I used Perl heavily in 2000-2001, but very little since then.) That would throw static parsing right out the window.
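If memory serves, the mechanism is source filters. A minimal sketch using the real Filter::Simple module (the filter name and rewrite rule here are invented):

```perl
package WheneverFilter;   # hypothetical filter module
use Filter::Simple;
# Rewrites the remainder of any file that says "use WheneverFilter"
# before Perl's parser ever sees it:
FILTER { s/\bwhenever\b/if/g };
1;
```

Since the filter can apply arbitrary transformations to the raw source text, any static parser would have to either understand the filter's code or give up.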

The "dynamic syntax" cannot

The "dynamic syntax" cannot be determined for the same reason. The halting problem cannot be magically solved by invoking the Perl interpreter.

Otherwise I'm a bit surprised that no one here on LtU argues against the claim that the proof is formal, since the parsing operation is nowhere defined. Hand-waving is o.k. when things get clarified in the comments, but otherwise...

BTW not too long ago I responded to the article in a blog posting.

I agree

Does a given Perl program even have a consistent "parse" during a single run? Or can the same source text be parsed different ways at different times? I don't know enough Perl to answer, but my guess from the nature of the comments here is that the latter is the case. If so, it makes me wonder what is supposed to have been proven here.

A Strange Misunderstanding

There are only a few means by which you can change the way Perl 5 will parse a line of code, and there are only a few types of syntactic constructs where the parse can change.

The only way a given Perl 5 program could parse differently between runs is if executing the compile-time declaration of the whatever sub from the example in the linked proof depended on indeterminate input.

No one writing serious Perl 5 code would write such code -- code that, depending on a random number, user input, or the last digit of the position of the moon at the time of launching the program, may or may not declare a whatever sub with zero arity.
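A contrived sketch of what such code would have to look like (sub name from the linked proof; the coin flip stands in for any indeterminate input):

```perl
# Contrived: whatever's arity is fixed at compile time,
# but *which* arity depends on a value computed at that moment.
BEGIN {
    if (rand() < 0.5) {
        eval "sub whatever() { 1 }";   # declare nullary (empty prototype)
    } else {
        eval "sub whatever  { 1 }";    # declare as a list operator
    }
}
# The parse of the next line now depends on the coin flip above:
whatever  / 25 ; # / ; die "this dies!";
```

The BEGIN block runs during compilation, before the final line is parsed, which is the only reason the trick works at all.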

My point is basically that

My point is basically that once you formalize what the parser does you can see its limitations.

An ambiguous parse creates at least two trees T1, T2, and a context-sensitive analysis might be applied to decide which tree is used (in C one tracks typedefs and builds a symbol table). In Perl there are situations where this decision-making process always fails, no matter when it gets processed. But this doesn't have to bother Perl's parser. The parser can create a third node T3(T1, T2) which represents a switch executed at runtime. Instead of reducing (T1, T2) to either T1 or T2, the final parse tree is a little expanded and contains both T1 and T2.

This might not be appropriate for every language, and it has the disadvantage that the parse tree grows exponentially (in theory) rather than linearly with the number of tokens, but it works in Perl's case and keeps the parser simpler than one might expect initially.
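To make the T3(T1, T2) idea concrete, here is a conceptual lowering to Perl source -- note that Perl 5 does not actually do this; the real prototype() builtin serves as the discriminator, and &-style calls bypass prototype checks so both branches can coexist:

```perl
# Conceptual sketch only: a runtime switch between the two parses
# of "whatever / 25 ; # / ; die ...".
my $nullary = defined prototype('main::whatever')
           && prototype('main::whatever') eq '';
if ($nullary) {
    my $r = &whatever() / 25;         # T1: division; rest of the line was a comment
} else {
    &whatever(scalar m{ 25 ; [#] });  # T2: the slash opened a regex argument
    die "this dies!";
}
```

Both subtrees are kept, and the "context-sensitive analysis" is simply deferred to run time.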

Decision problem

Although it is not made abundantly clear, presumably the decision problem here is the membership problem for Perl programs.

Unfortunately I too have very little idea about Perl, but I think the author is arguing that the syntax of a given Perl program depends on the execution of arbitrary Perl and, importantly, the execution of the Perl sometimes results in a syntactic construction that is a member of the class of Perl programs and sometimes results in a construction that is not. Hence, the membership problem for the class of Perl programs is undecidable.

Having some kind of additional construction in the parse tree, as suggested elsewhere in this thread, would be useful for static analysis, but it would not rescue the situation. It is impossible, if we believe this proof, to decide the validity of (the syntax of) a Perl program.

Decision problem

Yes, that would make more sense to me. I don't know if that's what the author has in mind, though, considering statements like this:

To prove Kennedy's Lemma, we assume that we can parse Perl. In particular this means we can take the following devilish snippet of code, concocted by Randal Schwartz, and determine the correct parse for it:...

Lisp, too

An interesting point is that the same applies to Lisp, thanks to macros—and, even more so, to reader macros. Kind of amusing, since Lisp is famous for having such simple syntax.

Lisp is famous for simple

Lisp is famous for simple syntax, but it's useful (in part) because of its flexible syntax. If that means one or more compile phases can't be proven to terminate, so be it.

Perl programs worse than line noise

Yet more proof, if any is needed, that TECO was a better language than Perl.

C++ too.

Due to the Turing-complete nature of templates, C++ has the same property. It obviously hasn't hurt its popularity.

More practically, though, this could be a good test to determine which languages might take an unreasonable amount of time to compile. Lisp and C++ can, but I don't know if this is usually a problem with Perl.

Huh?

The "Turing Complete" nature of templates in C++ doesn't mean that the C++ language can't be statically parsed.

While C++ syntax may be ugly

and have more than a few ambiguities-resolved-by-fiat; it brackets well. One of the key "difficulties" of C++ (and C as well) is that the correct parse for certain constructs depends on whether a given symbol corresponds to an in-scope typename or not. This complicates parsing, as the parser has to consult the symbol table while generating a parse tree.

However, the set of valid symbols and their kinds at a given point in a C++ program is always statically known, as the language doesn't support things like dynamic scoping or mutable reflection, which might make the symbol table dependent on the program's execution.

Macros may be another issue altogether; though the behavior of a C/C++ macro doesn't depend on program input.

There's definitely a

There's definitely a parallel here. Although C++ can definitely be parsed, parsing is only a small piece of the puzzle from a tooling perspective - refactoring, browsing, complexity analysis. To enable those things, you need to finish the compilation process, where we finally figure out the definition of every type in the program and the static type of every variable. And that is undecidable in C++, because the means by which types are assembled during compilation is itself a Turing complete dynamically-typed interpreted language.

And more...

C has a similar ambiguity, but instead of symbol-table state modified by code run at compile time, it's the type definitions that you need to keep track of to tokenize correctly. See http://calculist.blogspot.com/2009/02/c-typedef-parsing-problem.html

And of course there's Forth, where the parser is actually a stateful tokenizer as well as a pretty direct compiler that appends instructions to memory after almost every token. It's probably the only example I can think of that is both extremely elegant and minimalist at such a low level, and actually functional. Forth's immediate words, which are in a way compiler extensions, are very similar to the runtime-at-compile-time behavior in Perl. The difference is that Forth uses these everywhere, but the behaviors of most are well known (e.g. if/then, and the ; word).

C's ambiguity is not similar

C's typedef ambiguity is not the same. If your parser has access to a source file and all the appropriate headers then it can solve the typedef ambiguity without executing the C program. The analysis may be more complex than a simple LR(k) parse or whatever, but it's still doable with just statically available information.

The original article is about an ambiguity in Perl where the syntactic class of code following a "/" depends on the value dynamically computed for the expression on its left. In his example, some following code can be a comment under one range of values and a non-comment under another. In short, you must run a Perl program to "parse" it at all. Unlike the case with C, no amount of static analysis can resolve this.

I'm not an expert on Forth, but AFAIK it's very easily parsed. It's just that the parse tree can't carry very much information beyond distinguishing between comments, various literal types, and identifiers.

Running is not enough

And even running the program just gives you one possible syntax tree. If the program does I/O, then it can even "parse" differently when run a second time.

Even more fun

is to write Perl scripts which execute their own output. :)

Aw, heck. If we can tolerate the fun of parsing Perl, let's just dispense with the restricted levels of the Chomsky hierarchy altogether and design programming languages with Type 0 grammars.

I didn't say it was the

I didn't say it was the same, I said there are similarities, and the similarity I was pointing out is that the parsing is much more complicated than it should have been. I'm well aware of the particular behavior in Perl, and that it's different (much worse), but both are examples of the language semantics leaking into the syntax and making parsing harder.

Secondly, Forth is very easily parsed into a series of words, but that is just a list of non-whitespace strings; it wouldn't tell you much about what the Forth program does. Anything more involves execution of immediate words, such as the ; word which terminates compilation of another word. In that sense it's just as bad as Perl, except that most Perl code found in the real world doesn't actually exhibit this behavior, whereas in Forth such practices are pretty standard.

Parsing is Static, at Compilation Time

The parsing does not depend on the value of the expression; the parsing is fixed at compile time. The parsing of this expression is always unambiguous when Perl 5 executes the parse tree.

The parsing does depend on any compile-time declarations which change the arity of the whatever sub, but only the parser cares about those. By run time, that information is no longer relevant in this case. Thus the behavior of this code has nothing to do with the value to which the expression whatever evaluates; it only depends on the syntactic category into which the parser places it.

In the face of non-deterministic branching which could designate the arity of this sub one way or another, you cannot completely predict how it will parse without running code -- but that's the difficulty here.

Given what that proof

Given what that proof depends on, we can be more specific than just talking about Perl. I don't know Perl at all, but from that description, a subroutine in Perl can apparently consume any number of tokens that appear after the subroutine's name at the call site, and it's necessary to execute the subroutine to discover just how many it will consume. In other words, at runtime, a subroutine in Perl is passed not a definite set of arguments, but the remainder of the program text, to do with as it pleases.

Clearly any language (or language feature) will not be statically parseable if the number of symbols consumed by a given expression is determined at runtime. Stated that way, none of this is a surprise, is it?

What's surprising to me is that this language ever managed to achieve widespread use - but I guess it's just another example of how you can break a whole bunch of precious rules and the sky doesn't necessarily fall in. Software is full of people declaiming their "thou shalt not" lists, and right across the street there's another bunch of people breaking those very rules quite profitably.

Arity is Fixed Into the Parse Tree

In other words, at runtime, a subroutine in Perl is passed not a definite set of arguments, but the remainder of the program text, to do with as it pleases.

That is not how Perl works.

It's possible to change the arity of functions as Perl 5 parses a program. That will change how the parser handles bareword functions where it might have to guess at the expected arity. (Even the word "guess" is a misnomer, as the combination of greedy variable-length parameter lists and operator precedence produces a deterministic heuristic.)

Changing the arity of a function for the parse affects only code the parser has yet to parse. It doesn't change previously parsed code.
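A short illustration of both points (sub names invented; the classic empty-prototype effect on parsing):

```perl
sub f { return shift }    # no prototype: f parses as a greedy list operator
my $a = f + 1;            # already parsed as f(+1)

sub g() { 10 }            # empty prototype, known to the parser from here on
my $b = g + 1;            # later code parses as g() + 1
```

The definition of g changes how the parser treats the line *after* it, but the earlier f line keeps the parse it already got.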

The parser produces a parse tree and discards, as you say, the remainder of the program text. There is no reparsing to resolve an ambiguous parse at runtime; the parser resolves any ambiguity during the initial compilation phase.

See On Parsing Perl 5 for more.

Ah, that's much clearer. In

Ah, that's much clearer. In essence, you can tweak the parsing rules while they are being used to parse your code, just as in C++ you can meddle with type definitions while the type definitions are being figured out.

I'm not convinced

IMO the example given CAN be statically parsed. The behaviour of that source code is conditional on run-time values, thus the resulting AST must include a conditional. What you have is basically an implicit conditional of the form if-"whatever"-is-a-function which must be converted to an explicit conditional in the AST, choosing between two alternative interpretations of the rest of the line.

One of those conditional alternatives might never be used, and determining that is subject to the halting problem, but you can say that about any conditional.

Very likely the conditional doesn't fit the normal simple-nesting pattern for a traditional AST. There would need to be a node that is potentially near the root of the AST, high enough to encapsulate all the alternatives as simple cases, but that means which case we are dealing with cannot be determined when that node is executed (using an interpreter metaphor). Therefore, there need to be evaluate, test, and switch-interpretations nodes lower down in the AST -- where "whatever" is evaluated, in the Perl example. That seems to imply duplication of non-ambiguous code fragments in the AST (a duplicate for each case), but that issue is bypassed if you use an abstract syntax digraph instead.

Not pretty, not supported by yacc, but I'm pretty sure it could be done. A variant of tabular LR or memoized backtracking LR should be able to cope, for example, unless I'm badly mistaken.

Similar issues exist WRT parsing C++ without access to all #include files - you can still do a static parse, but you have to defer some decisions until the missing information is available. Maybe it should be called an incomplete parse - a kind of partial evaluation - since what is deferred includes parsing decisions. Anyway, you can still determine the form of that missing information, and still build conditionals into the resulting AST to handle the alternative interpretations.

It's just that I *really* don't want to write that parser.

Note from an abject Perl user

(It's not that I like the language. I just use it.)

Using a branching strategy to enable static parsing of Perl looks like a promising approach.

The main problem with parsing Perl ... one of the main problems with parsing Perl ... one of the very many problems with parsing Perl ... is the meta-parsing concept of "contexts." (For those that don't know Perl, an expression can be evaluated in scalar, list, or -- as in this case -- void context.) There is no syntactic support for disambiguating between contexts. Indeed, the quoted fragment demonstrates that Perl is quite capable of going out of its way to remove syntactic disambiguation: either an enforceable prototype, or the requirement for a pair of parentheses, would make it obvious that "whatever" is nullary or not.
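The runtime side of context is at least introspectable: the real wantarray builtin reports the calling context from inside a sub (sub name invented here):

```perl
sub which_context {
    return 'void' unless defined wantarray;   # undef means void context
    return wantarray ? 'list' : 'scalar';
}
my @l = which_context();   # called in list context
my $s = which_context();   # called in scalar context
which_context();           # called in void context
```

But that is a runtime query; there is nothing in the call site's syntax that a parser could use to tell the three apart.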

Given Larry Wall's predilection for throwing infix operators at a problem, I'm surprised that he hasn't considered smileys as a solution to this one. Scalar context could be considered default. :( might represent void context for the following closure. ;) might represent list context for the following closure. There are probably other contexts out there I'm not aware of: I'm sure a blessed context would come in handy.

Of course, the big problem with parsing Perl is that there is absolutely no reference implementation whatsoever. I'm being unfair here: at any given time, there's a reference implementation for 5.8.1.21, or whatever ... but woe betide anyone who writes a nice, clean, static parser -- now with extra gleaming branching! -- for that.

It's never gonna work for 5.8.1.22.

My favorite perl proof is

My favorite perl proof is reducing a fragment of the regex implementation to 3-coloring. Always fun to give out as extra credit after teaching basic models of computation :)

http://perl.plover.com/NPC/NPC-3COL.html

Opinions on Perl syntax

What do the younger programmers among us think about the informal, inconsistent syntax of languages like Perl?* I became a computer guy in the late '60s and early '70s, when we were searching for reasonably formal, easily parseable, consistent programming languages. The syntax of Perl makes me cringe.

Does it matter? Is there less cognitive load with more formal syntaxes?

~~ Paul

* At least half of the Perl programmers I've asked do not know the term "sigil."

Re: Opinions on Syntax

Personally, the entire Perl language makes me cringe. But the breadth and scope of CPAN is amazing, and Perl 6 looks like it will be much cleaner and a substantial improvement over Perl 5.

I find that syntax itself does not carry much cognitive load once you are accustomed to it; typically I can get accustomed to almost any syntax within a week. My experience has been that the semantics of a language imposes the ongoing cognitive burden, one that becomes lighter only gradually.

Syntax is not unimportant, but you should read the first page or two of the first chapter of PLAI.

I suspect that many people here are not big fans of Perl, although I'm also sure there are more than a few Perl lovers around LtU. :-)

Error in the proof?

The Halting Problem is not to produce "true" for every program that halts. That would be easy: just run the program and then produce "true".

The Halting Problem is to produce "true" for every program that halts, AND "false" for every program that doesn't.

UPDATE: Hmm... I suppose this might be more of a glitch in the explanation rather than an actual error in the proof?