.NET Compiler Platform ("Roslyn")

The .NET Compiler Platform (Roslyn) provides open-source C# and Visual Basic compilers with rich code analysis APIs. You can build code analysis tools with the same APIs that Microsoft is using to implement Visual Studio!

In a nutshell: OPEN SOURCE C# COMPILER. Putting aside possible practical implications of this for the .NET ecosystem, I think it is good for programming language geeks to be able to peruse the source code for compilers and language tools.

For the debate about MS being evil, you can head directly to HN where you'll also find an explanation of what bootstrapping a compiler means.

Extremely jealous of high school and college students

Back in my day, if you wanted to independently learn compilers, ML or gcc were your only open source compilers to study, with the latter looking like incoherent chicken scratch scrawl.

Now you have LLVM, OpenJDK, Mono, Roslyn, PyPy, LuaJIT, etc. Kids these days.

Yet I suspect fewer kids

Yet I suspect fewer kids these days want to build compilers.

They even mix implementation

They even mix implementation with language, unable to differentiate one from the other.

I still remember the early 80's where most languages had compilers and interpreters to choose from.

Nowadays many think only C and C++ can produce native binaries and being memory safe requires a VM!

On the other hand, they

On the other hand, they mostly take functional programming and garbage collection for granted. That wasn't really true when I was a kid. They are also more open to distributed and parallel computing, mostly by necessity.

The real problem is the kids who aren't growing up as hackers because the only computer they use is an iPad.

You got a point there.

You got a point there.

Any large institution

Any large institution is naturally inclined to treat people like sheep, and our society is dominated by such institutions. The Wikimedia Foundation has recently been developing software to "shield" users from wiki markup (alleging it's "too hard" — yes, this is the markup that enabled Wikipedia by being so simple to use), with the discernible result that new users are increasingly unable to do anything unless it is explicitly spoon-fed to them by the "Visual Editor" interface (in contrast to the way things have been in the past on Wikipedia and its sister projects, where a user learns how to do things by seeing how others have done them, and adds to their skill set at occasional need).

Like when we stopped

Like when we stopped programming in binary with fancy assemblers, or when we stopped programming in assembly with fancy compilers? I feel like I've traveled back to mid evil Europe whenever I'm re-exposed to emacs...I like to be isolated from the crap that a modern IDE can hide.

At least call it mid eval.

At least call it mid eval.

Mid Eval

Very amusing, as usual Ehud

I'm studying eviluation as a

I'm studying eviluation as a more wicked alternative to evaluation.

The situation with wiki

The situation with wiki markup isn't analogous to those things, no. What's being hidden is as simple as, and more uniform, informative, and versatile than, the interface it's being replaced with. More crucially, from the perspective of the social dynamics of the wikimedian projects, which are fundamentally crowdsourced, wiki markup is essentially plastic under editing by the lay public, whereas the interface is fundamentally limited by the software, which is not crowdsourced to anything remotely like the same degree. They're creating a high priesthood without providing a convenient path by which the masses can readily work up smoothly to becoming high priests. (I note, btw, that to the best of my understanding we never actually programmed in binary, per se.)

What's being hidden is as

What's being hidden is as simple as, and more uniform, informative, and versatile than, the interface it's being replaced with.

But this is, I would say, exactly what's under dispute. In your hands, it's more uniform and versatile. For others, not so. For you, markup is as simple as a GUI, but not for others who Wikipedia wants to be accessible to.

The vast majority of Wikipedia work is done by people from a very small demographic, who largely agree with your perspective, I would guess. But to broaden the appeal, we shouldn't be surprised that new tools are needed.

Actually, no, this isn't an

Actually, no, this isn't an idea that came from the established contributors, it was imposed by the foundation. Lots of us figure it's not easier, at best just different. Of course it's supposed to appeal to a wider demographic, but I've already observed the predictable phenomenon that users using the new interface don't get the benefit of seeing how others wrote the markup (which is how one would ordinarily learn most of the few, simple things one needs to write wiki markup), meaning for example they have no way of realizing something is possible by seeing that somebody else did it; and, in learning how to do something with the new interface, learn nothing except how to do that particular thing with the new interface. There's no path to slowly learn more and eventually become an expert. The foundation doesn't realize that the old generation of expert contributors aren't going to be replaced by a new generation of expert contributors if there's no continuous path by which someone can gradually learn by contributing to become an expert.

This is a good point that

This is a good point that should be turned into a universal design principle (if it isn't already): make construction observable so it can be learned from. Like having the LaTeX source to a document, especially when the number of possible features is high (e.g. if the word processor is library extensible). I think we can have this with wysiwyg, eventually.

You might be interested in what Daan has been doing recently:

http://madoko.codeplex.com

Madoko looks very nice

Thanks for the link, I hadn't heard about Madoko and it looks interesting. I'd be interested in an opportunity to use Koka a bit; besides, there is something strange about the almost-disruptive success of Pandoc, and getting a researchy perspective on what exactly these programs can do could be very interesting.

Actually, we do explore the

Actually, we do explore the concept of "navigability" in the "It's Alive" paper. In this case, you maintain a textual and visual representation of your program that are automatically synced (via live programming), manipulating either will cause the other to change. This is one way where one could "see" how something is done; it is simply a matter of maintaining strong connections between the wysiwyg editor and markdown textual representation.

I think...

...you may have just doubled my blog posting rate by pointing me to Daan's work. I had my own private horrible pile of hacks for blogging, and what I really wanted was Markdown + math, which is exactly what this is.

Thanks!

I don't think I suggested

I don't think I suggested that it came from established contributors. My point is the opposite -- things that appeal to established contributors won't necessarily work for others.

That Word is less powerful than LaTeX is obvious, and I certainly use the latter. But if word processing required LaTeX, there would be less of it done. That seems like precisely the point here.

And there are definitely Word experts, even though the Word interface isn't a language.

As a general principle, I

As a general principle, I agree what you're describing can happen. I don't believe it's the case with the wikimedia change I'm describing, though. (I'm also suppressing wikimedia politics that doesn't seem to me relevant to the current LtU theme, which is the remark about young people not learning to hack.)

kids

Well I have a HS student (same kid as the alligator lambda example from our 2007 conversation). They almost all work in some variety of interpreted languages; a REPL is fairly standard. Though Java is the norm in high school (because the AP exam is in Java). She certainly gets the idea of the language underneath the language she's playing with. So for example last year they did Scratch, and the stronger kids in the class implemented additional functionality in Squeak to get their Scratch programs to do stuff beyond what the language allows. She's sort of a Logo wiz and I taught her some ideas from LISP to get her Logo functions to do stuff beyond what Logo is normally designed for, and she would like to learn a LISP/Scheme.

Her boyfriend is the typical hacker type. I don't think he's clear in his own head what the distinction is between the kernel API and shells. His interest is primarily Python, and that's obviously way too hard a compiler to even begin to attack.

I wouldn't trust a HS teacher with a full compiler class. But there is an advanced elective class where they do some primitive electrical design, i.e. hooking transistors together to get something like a simple ALU working. She's not going to take it, but the boyfriend likely will, and if they are still together I'll see if he likes it.

So looking back, the next generation of HS students interested in computers is at least as good as we were at that age.

channeling yegge

a hs class could probably cover things that are small-time, lightweight, simple compilers; they don't have to be gcc. a stupid simple dsl or something, to whet the appetite at least for the kind of odd balls who like compilers ;-) but don't know it yet

This is great! I'm really

This is great! I'm really happy about where we are going as a company.

Yah...

I glanced a bit through the code. There still are good reasons not to want to write a compiler in an OO language.

(A somewhat silly comment, since I embarked on another project and am now writing an interpreter in C++. Well, C++-ish; I don't get all the ins and outs of the language yet.)

There are still good reasons

There are still good reasons to write a compiler in an OO language; in fact most compilers in production are written in OO languages. The only ones that are not are generally compilers for functional languages.

Are the reasons for writing

Are the reasons for writing a compiler in an OO language different than the reasons to use these languages for other kinds of software?

Not really, but the same

Not really, but the same could be said for using an OO vs. an FP language in general: the differences are much more nuanced. The reason most production compilers are written in OO languages is probably because most production languages are OO; eat that dog food!

One could argue that pattern matching is a big win for compiler writing, and that FP should have an intrinsic advantage with ADTs (or at least case classes); but in reality, this doesn't really dominate the compiler's complexity (and virtual dispatch can be used to achieve similar effect).
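
To make the comparison concrete, here is a deliberately tiny constant-folding step written twice in C#: once in the intrinsic style (virtual dispatch on node classes) and once in the extrinsic style (one function casing over node types, roughly what ADTs plus pattern matching buy you). All names are invented for illustration; this is a sketch, not production code.

    // Tiny illustrative AST; none of these names come from a real compiler.
    abstract class Expr
    {
        // Intrinsic style: the operation lives on the node, via virtual dispatch.
        public abstract Expr Fold();
    }

    class Lit : Expr
    {
        public readonly int Value;
        public Lit(int value) { Value = value; }
        public override Expr Fold() { return this; }
    }

    class Add : Expr
    {
        public readonly Expr Left, Right;
        public Add(Expr left, Expr right) { Left = left; Right = right; }
        public override Expr Fold()
        {
            var l = Left.Fold();
            var r = Right.Fold();
            if (l is Lit && r is Lit)
                return new Lit(((Lit)l).Value + ((Lit)r).Value);
            return new Add(l, r);
        }
    }

    static class Folder
    {
        // Extrinsic style: one function that cases over node types,
        // roughly what an ADT plus pattern matching gives you directly.
        public static Expr Fold(Expr e)
        {
            var add = e as Add;
            if (add == null) return e;
            var l = Fold(add.Left);
            var r = Fold(add.Right);
            var ll = l as Lit;
            var rr = r as Lit;
            return (ll != null && rr != null)
                ? (Expr)new Lit(ll.Value + rr.Value)
                : new Add(l, r);
        }
    }

Either way the interesting logic is a few lines; the difference is mostly where it lives and how easy it is to add a new node type versus a new operation.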

Right. So can we say that

Right. So can we say that production compilers are, at the end of the day, medium-size software systems with all that is entailed, and localized advantages of the problem domain become negligible?

Correct, but then I wasn't

Correct, but then I wasn't referring to localized advantages either.

Sure. No argument. I just

Sure. No argument. I just wanted to make the point clear.

Indifference?

You claim it should be irrelevant since medium-sized software systems all share the same characteristics? How droll. I choose to disagree fervently.

Compilers are usually big language manipulation experiments. Parsing, pretty printing, transforming, and translating terms are where functional languages shine. Disadvantages of functional programming such as the expression problem also don't come up as much since the problems are usually pretty much self-contained.

The shown compiler exposes the verbosity of OO. That is sometimes a good thing; in this case it isn't. Understanding and maintaining a gazillion different classes is not good compiler technology, though eating your own dog food is.

Size is a bad metric. By that metric any problem should be tackled in the language which fits the size of the project. What a silly argument.

I spend most of my time

I spend most of my time writing incremental compilers and execution technology. Pure functional programming would most definitely not be very useful in this case, and it turns out the best way to tackle this cookie is through lots of state encapsulation, a very OO concept. There are times I wish I was using Scala rather than C# for pattern matching, but it doesn't come up very often. I prefer the intrinsic approach of virtual methods vs. the extrinsic approach of case matching anyways. I enjoy parsing and type checking in the same method via Glitch.

I always hear FP is good for X, where X is pretty much unjustified. The biggie of course is parallel or concurrent computing, where it hasn't shown many advantages at all (hey, let's share immutable state...which is like not sharing at all, doh). But compilers are another X where I find the arguments bizarre. If FP has so many natural advantages in expressing compilers, why do most compiler writers still choose OOP languages and styles? I suggest it is because its advantages are quite minimal and so momentum and preference still hold sway.

If FP has so many natural

If FP has so many natural advantages in expressing compilers, why do most compiler writers still choose OOP languages and styles? I suggest it is because its advantages are quite minimal and so momentum and preference still hold sway.

Even if the advantage were enormous, do you really think it would affect the outcome significantly? I doubt it. So your argument says more about the psychology of programmers than it does about the technical merits of FP.

As a counterpoint, consider the number of Objective Caml programs that actually make use of the "objective" part vs. the functional part. I think that's much more representative of superiority when you have both OO and FP available.

Well. It isn't true.

I agree with you on parallel computing but that's a change of subject.

Your statement about most languages being implemented in an OO language simply doesn't look to be true. Where's the data?

I guess most programmers don't know better than imperative or OO languages, and I agree with them that if you want to write an interpreter you should write it in C, C++, Java, or C# for portability reasons. It doesn't make those languages particularly well suited to the task. (The fact that you need additional tools to automate, for instance, lexing and parsing actually tells you that they aren't that good at it.)

OCaml, MLton, GHC, Clean and a number of other programming languages are bootstrapped and written in an (imperative) FP style. So the notion that most compilers are written in OO isn't true; in fact, I would claim that most academic toy languages start off in an FP language.

Having short, concise, fault-free functions over term languages beats having verbose methods divided over node classes. Conceptually, I don't think a transformation should be expressed as a method of term nodes anyway.

Glitch seems to be solving a problem I don't understand. Is it an OO language? The fact that you need another language to solve a problem better than you can in OO hints that OO isn't good at the problem you were solving anyway.

I never use additional tools

I never use additional tools for parsing or lexing; they just get in the way and don't add much. Note I'm very much influenced by Martin Odersky's scalac design, which is a mix of OO and FP with everything done directly.

Node classes work well for me because....I have to persist nodes through multiple compilations. Likewise, given the very incremental nature of my project, I must focus on mutating rather than transforming! Unless you can come up with some automatic memoization that preserves history for error recovery, transformation is just the wrong approach.

Glitch is not a language, but rather an execution engine. It fixes coordination issues for imperative styles, and makes incremental computation more feasible.

So the notion that most compilers are written in OO isn't true; in fact, I would claim that most academic toy languages start off in an FP language.

Take my anecdote and raise it with your own :) It really depends on whether the academic is into FP or not.

Immutable data can be shared implicitly

Purely functional data structures share data implicitly. You can have multiple threads "share" this immutable data by transforming it and passing it back and forth among them. The transformation will be as "shallow" as possible due to the nature of the data structure. And clearly staying away from mutation is a good way of avoiding insane bugs even in compilers - believe me, I've seen them in a mainstream language compiler - namely javac.
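
To make the sharing explicit, here is a minimal, hand-rolled persistent list in C# (illustrative only; real code would more likely reach for a library such as System.Collections.Immutable). "Adding" allocates a single cell and reuses the whole old spine, so older versions stay valid and can be handed between threads without copying or locking.

    using System;

    // Illustrative persistent (immutable) singly linked list, not a real library.
    // "Adding" never mutates: it allocates one new cell that points at the old
    // list, so every previous version stays valid and can be shared across threads.
    sealed class PList<T>
    {
        public readonly T Head;
        public readonly PList<T> Tail;   // shared as-is, never copied
        private PList(T head, PList<T> tail) { Head = head; Tail = tail; }

        public static readonly PList<T> Empty = null;

        public static PList<T> Cons(T head, PList<T> tail)
        {
            return new PList<T>(head, tail);
        }
    }

    class SharingDemo
    {
        static void Main()
        {
            var v1 = PList<int>.Cons(1, PList<int>.Empty);
            var v2 = PList<int>.Cons(2, v1);   // v2 reuses v1's cell
            var v3 = PList<int>.Cons(3, v1);   // so does v3; v1 itself is untouched
            Console.WriteLine(v1.Head + " " + v2.Head + " " + v3.Head);   // 1 2 3
        }
    }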

The problems come when the

The problems come when the data must be mutable over time. No one has figured out how to build a compiler using FRP yet.

OMeta and Enso

OMeta (Alessandro Warth and Ian Piumarta) and Enso (William Cook) have demonstrated very effective approaches to language manipulation in the context of OO.

Can we reasonably argue a program "exposes the verbosity of OO" when, it seems likely, there are more parsimonious idioms known to OO? The phrase 'verbose code can be written in any language' comes to mind.

Pattern Matching

I'm not inclined to think that FP has any inherent parsimony advantages from pattern matching. To the contrary, pattern matching is inherently a source of verbose, redundant expressions (we syntactically repeat patterns and variable names in code). The most parsimonious approaches to parsing I've ever encountered are those developed in J or APL. Parser combinators are a close second, and don't rely on language-layer pattern matching.

Not my experience

There's a simple, naive counting argument that employing OO methods is more verbose than employing a straightforward recursive function on terms. That is: you need to repeat the function/method declaration for each term, often over a large number of text/code files. Pretty horrible.

And if you use some tricks you can often get away with writing recursive functions on terms which use higher-order functions to deal with everything except the few interesting cases where you actually do something, bypassing the expression problem somewhat.
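
A sketch of that trick in C#, with all names invented: the recursion is written once in a generic bottom-up rewriter, and a pass supplies only the interesting cases as a function, returning null for "nothing to do here".

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Illustrative term type: a node name plus children, enough to show the idiom.
    sealed class Term
    {
        public readonly string Name;
        public readonly List<Term> Kids;
        public Term(string name, params Term[] kids) { Name = name; Kids = kids.ToList(); }
    }

    static class Rewriter
    {
        // The higher-order part: recursion is written once; a pass only supplies
        // 'interesting', which handles the few cases it cares about.
        public static Term Rewrite(Term t, Func<Term, Term> interesting)
        {
            var kids = t.Kids.Select(k => Rewrite(k, interesting)).ToArray();
            var rebuilt = new Term(t.Name, kids);
            return interesting(rebuilt) ?? rebuilt;
        }
    }

    class RewriteDemo
    {
        static void Main()
        {
            // A pass that only knows about "neg(neg(x))" and leaves everything else alone.
            var t = new Term("neg", new Term("neg", new Term("x")));
            var simplified = Rewriter.Rewrite(t, n =>
                n.Name == "neg" && n.Kids.Count == 1 &&
                n.Kids[0].Name == "neg" && n.Kids[0].Kids.Count == 1
                    ? n.Kids[0].Kids[0]
                    : null);
            Console.WriteLine(simplified.Name);   // prints "x"
        }
    }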

But I guess it's all in the eye of the beholder.

traits, dci

If you're counting each declaration of a method name, then an OOP language based on traits might come out ahead of an FP language that matches terms. Traits effectively offer an ability to create new classes on the fly by combining partial class-definitions. This seems to be a feature missing from mainstream OOP languages, but it can more or less be emulated.
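
As a rough sketch of that emulation in C# (interfaces plus extension methods standing in for stateless traits; all names invented), each "trait" is an interface with a bundle of extension methods, and a class picks up the combined behavior just by implementing the interfaces it wants.

    using System;

    // A rough emulation of stateless traits in C# (illustrative only):
    // each "trait" is an interface plus a bundle of extension methods,
    // and a class composes behaviors by implementing the interfaces it wants.
    interface IHasName { string Name { get; } }
    interface IHasChildren { IHasName[] Children { get; } }

    static class DescribableTrait
    {
        public static string Describe(this IHasName self)
        {
            return "node " + self.Name;
        }
    }

    static class CountableTrait
    {
        public static int Size(this IHasChildren self)
        {
            return 1 + self.Children.Length;   // flat count; just a sketch
        }
    }

    class Node : IHasName, IHasChildren
    {
        public string Name { get; set; }
        public IHasName[] Children { get; set; }
    }

    class TraitDemo
    {
        static void Main()
        {
            var leaf = new Node { Name = "leaf", Children = new IHasName[0] };
            var root = new Node { Name = "root", Children = new IHasName[] { leaf } };
            Console.WriteLine(root.Describe() + ", size " + root.Size());   // node root, size 2
        }
    }

No state and no overriding, so it only goes so far, but it does cut down on repeating the same method bodies across classes.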

I agree that shotgun editing is typical of conventional OOP, and is not fun. Clever OO developers have created idiomatic ways to reduce this issue - e.g. layers of indirection involving runners and roles, such that an object is wielded through purpose-specific adapters (cf. DCI).

Anyhow, I'm not an advocate of OO. But my reasons for disfavoring OO are unrelated to parsimony. I think most paradigms can be parsimonious if the programming language and popular idioms allow it, and I'd hate to conflate properties of a paradigm with properties of the hosting PL. FP doesn't need pattern matching, for example... it's just commonly associated with FP.

No there aren't

Sean McDirmid:

There are still good reasons to write a compiler in an OO language

I write compilers for a living, have done so in both FP and OO, and couldn't disagree more -- with just about everything you assert further down that thread.

This discussion will become

This discussion will become so interesting if we turn to specifics...

Note I focus on live

Note I focus on live programming environments where FP's aversion to state just doesn't work well (heck, even the code is mutable). Even if an FP language was used, one would be forced into an OO style just given the nature of the problem.

Also plenty of compilers get written in OO styles/languages and the projects don't fail (e.g. Roslyn), so what specifically about FP makes them do much better for writing compilers that you can convince us we are doing it wrong?

Recursive descent parser?

Interestingly enough, the C# language parser's implementation has the feel of being a simple, hand-crafted(?) recursive descent parser:

http://roslyn.codeplex.com/SourceControl/latest#Src/Compilers/CSharp/Source/Parser/LanguageParser.cs

Yep. This is very common in

Yep. This is very common in industry (scalac does this also). You get the most control when it comes to error recovery, and whatever performance loss is really in the noise compared to everything else. I think only academics use parser generators :)

Dude, that code is horrible.

I snipped two pieces of code which should do roughly the same thing. Compare parsing a parenthesized list of things functionally

    def parse_expr_tuple: parser ast =
        parse_position <*> \p ->
            parse_parenthesized (plus_sep parse_expr parse_comma) <@> 
            [ cons e nil    -> e
            | ee            -> expr_tuple p ee ]

with doing the same in OO:

       internal AttributeArgumentListSyntax ParseAttributeArgumentList()
        {
            if (this.IsIncrementalAndFactoryContextMatches && this.CurrentNodeKind == SyntaxKind.AttributeArgumentList)
            {
                return (AttributeArgumentListSyntax)this.EatNode();
            }

            AttributeArgumentListSyntax argList = null;
            if (this.CurrentToken.Kind == SyntaxKind.OpenParenToken)
            {
                var openParen = this.EatToken(SyntaxKind.OpenParenToken);
                var argNodes = this.pool.AllocateSeparated();
                try
                {
                    bool shouldHaveName = false;
                tryAgain:
                    if (this.CurrentToken.Kind != SyntaxKind.CloseParenToken)
                    {
                        if (this.IsPossibleAttributeArgument() || this.CurrentToken.Kind == SyntaxKind.CommaToken)
                        {
                            // first argument
                            argNodes.Add(this.ParseAttributeArgument(ref shouldHaveName));

                            // comma + argument or end?
                            while (true)
                            {
                                if (this.CurrentToken.Kind == SyntaxKind.CloseParenToken)
                                {
                                    break;
                                }
                                else if (this.CurrentToken.Kind == SyntaxKind.CommaToken || this.IsPossibleAttributeArgument())
                                {
                                    argNodes.AddSeparator(this.EatToken(SyntaxKind.CommaToken));
                                    argNodes.Add(this.ParseAttributeArgument(ref shouldHaveName));
                                }
                                else if (this.SkipBadAttributeArgumentTokens(ref openParen, argNodes, SyntaxKind.CommaToken) == PostSkipAction.Abort)
                                {
                                    break;
                                }
                            }
                        }
                        else if (this.SkipBadAttributeArgumentTokens(ref openParen, argNodes, SyntaxKind.IdentifierToken) == PostSkipAction.Continue)
                        {
                            goto tryAgain;
                        }
                    }

                    var closeParen = this.EatToken(SyntaxKind.CloseParenToken);
                    argList = syntaxFactory.AttributeArgumentList(openParen, argNodes, closeParen);
                }
                finally
                {
                    this.pool.Free(argNodes);
                }
            }

            return argList;
        }

Hmm, local variables, infinite loops, breaks, gotos, explicit memory management and comments to explain the voodoo...

Where is your error recovery

Where is your error recovery logic, marco? Real parsers are about much more than parsing, something many in the FP crowd don't seem to get...hey, look how easy idealized parsing is in FP, and how ugly real parsing is in OO.

Suffice it to say, I could do better also (using OO even), but I'm also not writing real production code.

Error recovery can be handled by the parser combinators.

Error recovery can also be handled (hidden) in the parser combinators trivially. It wouldn't change the functional code much: an extra line here and there within the function to handle a mismatch. It wouldn't change this particular piece of code.
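
For instance, a hedged sketch of such a recovery wrapper, written in C# so it is comparable to the other snippets in this thread; the Parser<T> delegate, the token representation, and the sync-token idea are all invented for illustration.

    using System;
    using System.Collections.Generic;

    // Illustrative types only: a "parser" maps a token position to a result plus
    // the next position, or null on failure. Nothing here comes from a real library.
    sealed class ParseResult<T> { public T Value; public int Next; }
    delegate ParseResult<T> Parser<T>(IList<string> tokens, int pos);

    static class Recovery
    {
        // Wrap any parser so that, on failure, we report an error, skip ahead to a
        // synchronization token (say ";" or ")"), and continue with a fallback value.
        public static Parser<T> Recover<T>(Parser<T> p, string syncToken, T fallback,
                                           Action<int> report)
        {
            return (tokens, pos) =>
            {
                var ok = p(tokens, pos);
                if (ok != null) return ok;
                report(pos);                                   // the "extra line" of recovery
                var i = pos;
                while (i < tokens.Count && tokens[i] != syncToken) i++;
                return new ParseResult<T> { Value = fallback, Next = i };
            };
        }
    }

The point is only that the recovery policy becomes one more combinator you apply where you want it, rather than logic threaded through every production by hand.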

Production code is a bad excuse for bad code. I am not sure the programmer was aware of what he was doing, beyond being pushed along by a specific design.

Your slur regarding FP

Regarding your slur with respect to the FP crowd: I am not in the FP crowd; I believe in using the right tool for the right job.

But parsing happens to be one of those places where FP got it right. It's damn easy to program out a simple set of parser combinators which will handle parsing correctly, robustly, ok-ishly efficiently, and far more concisely than anything you'll be able to cook up in an imperative or OO language.

I consider that a fact. Show me the lines of OO code where you do the same in four or five short lines.

Parser combinators

... are hardly specific to FP. Please look into OMeta, as I mentioned earlier. It's an OO parser combinator model developed in the context of VPRI's FoNC project, which aims to develop a desktop system in 20k lines of code. The papers on the subject cover many examples.

Verbose code can be written in any paradigm. If you want to argue that Roslyn's parser code is exceedingly verbose, I wouldn't disagree. But I feel you're misrepresenting Roslyn's verbose encoding as inherent to OOP.

Show me the money

Extra tooling not allowed in this game. You can solve any problem with extra tooling and usually the extra tooling falls short of solving a problem right in a different context. Hence the reason that people handcraft their own parsers in mature tools. (The extra tooling necessary just tells me that OO is bad at expressing parsers.)

I don't see the error reporting, the position information, the error recovery, etc. A lot of lexers and parser generators in, for example, C also hold some global mutable state, making it near impossible to parallelize the parsing of texts.

I am not interested in a holy war. But I have no difficulty admitting that FP is just good at expressing efficient parsers with minimal fuss. (Okay, LL(k) parsers, not discussing the rest.)

Just show me the code of an LL parser in OO.

Is it that hard to admit?

OMeta is a library that has

OMeta is a library that has been implemented in a few dynamic OO languages. It is not what I'd consider 'extra tooling'. If you want to see the code, feel free to look up the OMeta papers. More generally, if you want a good argument that OO doesn't need to be verbose, look into the FoNC project.

OMeta does not parse as efficiently as parsers generated by external tools. Of course, this issue is common to many parser combinators, which cannot always perform deep optimizations.

I'm not an OO advocate. I consider many properties commonly defined as aspects of OO to become problematic at scale (local state, reference aliasing, eventful updates, lack of general composition operators). You won't see a holy war from me in defense of OO. But I'd prefer arguments against OO at least be based on properties inherent to OO. To me, it seems fundamentally wrong to attribute to a paradigm a problem whose cause lies significantly with PL design, library design, or lack of knowledge or concern by the developer.

Read the Wikipedia page, glanced at an article. Now show me.

OMeta is an object-oriented programming language developed by Alessandro Warth and Ian Piumarta in 2007 under the Viewpoints Research Institute. The language is based on Parsing Expression Grammars (PEGs) rather than Context-Free Grammars and pattern-matching with the intent of providing “a natural and convenient way for programmers to implement tokenizers, parsers, visitors, and tree-transformers”.

Looks to me you're talking about a different project than me. FoNC doesn't give any hits on google.

Dude. You're just stringing words together. I don't see because you're not showing me.

Show me concise production code of an LL parser written in an OO language.

Concise production OO LL

How about Boost.Spirit? Here's an example of a real production grammar. It's ugly for sure, but it is concise, particularly by C++ standards... And it works well and has good enough error reporting, which could be made even better at the cost of a bit more clutter.

I'm no huge fan of C++, so I'll only defend this so far, but this code works and works well, and you did ask for examples...

Now we're talking

Parser combinators in boost templates! Yeah, I am asking for concrete code examples because otherwise you're just stuck discussing vague philosophical ideas. Boost beats C# here.

It'll take a while to digest this. Roughly I can read the syntax and it does have a cluttered line-noise feeling to it; but in the eye of the beholder I guess. Must have taken some time to get right.

Does it augment the AST with position information and can it handle UTF-8? In that case I'll probably go and try to use boost myself.

Uh-oh...

Position information is tracked, yes, although there's (of course) some extra boilerplate elsewhere to set that up. I *think* it can handle UTF-8 as the char type is (of course) yet another template parameter throughout.

I warned you that I don't love this code, though, and I will repeat that warning. It did indeed take "some time" to get right, and perhaps 80% of that time was spent waiting for compiles. And when the compile fails, the error message is so incomprehensible that I pretty much just started regarding gcc as a boolean oracle: thumbs up, thumbs down, that's all you get. Abandon all hope, ye who enter here!

I very much agree that the syntax is cluttered and line-noisy, but if that were the only problem, I'd be glad to make the trade: if you're using C++, you're already mostly doomed in that regard anyway.

Now that this code is done and working, I think using Spirit was a good choice. When it wasn't working, I felt very differently...

I really prefer not to take sides

I really prefer not to take sides re: verbosity being inherent to OOP (or to C#'s OO facet, more specifically) or not.

(Simply because, AFAICT, I don't really have much to speculate about it in general...)

However, I certainly agree with your point implying that for the same kind of parser, say, a simple recursive descent one, and for a very simple data format like JSON, this can be written in C# alone with fairly significant variations in verbosity vs. terseness between implementations... and for comparable sets of features in the parsers' error reporting, public interface usability, statically-typed deserialization capabilities, etc.
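
For what it's worth, roughly the scale of thing I have in mind: a deliberately tiny recursive-descent sketch for a JSON-like subset (non-negative integers and nested arrays only), with every name and design choice purely illustrative.

    using System;
    using System.Collections.Generic;

    // A deliberately tiny recursive-descent parser for a JSON-like subset
    // (non-negative integers and nested arrays); a sketch, not production code.
    class MiniParser
    {
        private readonly string src;
        private int pos;
        public MiniParser(string s) { src = s; }

        public object Parse()
        {
            var v = ParseValue();
            SkipWs();
            if (pos != src.Length) throw new Exception("trailing input at " + pos);
            return v;
        }

        private object ParseValue()
        {
            SkipWs();
            if (pos < src.Length && src[pos] == '[') return ParseArray();
            return ParseNumber();
        }

        private List<object> ParseArray()
        {
            Expect('[');
            var items = new List<object>();
            SkipWs();
            if (pos < src.Length && src[pos] == ']') { pos++; return items; }
            while (true)
            {
                items.Add(ParseValue());
                SkipWs();
                if (pos < src.Length && src[pos] == ',') { pos++; continue; }
                Expect(']');
                return items;
            }
        }

        private int ParseNumber()
        {
            SkipWs();
            var start = pos;
            while (pos < src.Length && char.IsDigit(src[pos])) pos++;
            if (pos == start) throw new Exception("digit expected at " + pos);
            return int.Parse(src.Substring(start, pos - start));
        }

        private void Expect(char c)
        {
            SkipWs();
            if (pos >= src.Length || src[pos] != c)
                throw new Exception("'" + c + "' expected at " + pos);
            pos++;
        }

        private void SkipWs()
        {
            while (pos < src.Length && char.IsWhiteSpace(src[pos])) pos++;
        }
    }

    class MiniParserDemo
    {
        static void Main()
        {
            var value = new MiniParser("[1, [2, 3], []]").Parse();
            Console.WriteLine(value is List<object> ? "parsed an array" : "parsed a number");
        }
    }

Another author could easily make the same parser half as long or twice as long, which is exactly the verbosity-vs-terseness variation I mean.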

Verbose code can be written in any paradigm.

Indeed, I had noticed it before, and I also eventually verified it by myself, for that matter.

Intuitively, I would assume that the implementation bias coming from the programmer, which can be observed for the same kind of parser (and simple input languages) in that given implementation language, still holds for more complex input languages as well, yes.

Just my .02

And yet you still haven't

And yet you still haven't shown me anything that doesn't just parse. You don't even have a clue what the Roslyn code is doing or you wouldn't have thrown up such a flimsy strawman.

Many people are stuck in the ideal past where parsing is just parsing, and they are using emacs or vim anyways and don't believe in IDEs. They'll make some excuse about how code completion or online error detection isn't needed because of their superior type checkers and the intellects needed to use their language. Then they'll slam the code that can do this as being "verbose" because they don't care about those features anyways.

Parser combinators were great in the emacs world of the 80s. But it's 2014 and we need much much more. C# isn't just another FP language with crappy tool support. And none of that is related to FP, of course; this problem occurs whenever elegance is prioritized over reality.

It parses, generates the

It parses, generates the correct ast, and does error reporting.

So it's bloody equivalent to

So it's bloody equivalent to the Roslyn code you are slamming? This is quite easy to test. Plop each parser into a code editor and measure user experience. Oh...your parser is just batch?

Whatever man

You know as well as I do that by looking at the source code of that particular parsing snippet that there's nothing special about it; certainly there's no code handling the snapshots. It might as well be a batch compiler.

You don't need snapshots to

You don't need snapshots to be incremental, and you really don't know what you are talking about.

Ad hominem since you lack arguments

I'll talk to you again after you implemented and bootstrapped your own functional language to C compiler.

See, your own bias is

See, your own bias is showing up again. No one in the production world cares about your toy FP-to-C compiler. It's all about IDEs and tooling these days and you just don't get it. I'm tired of arguing, but judge Roslyn by its intended role in Visual Studio, not by your bias.

Don't care

Dude, I don't care what people think about my 'toy' compiler. I wrote it because I liked to. I don't care what you did with your life but it can't have amounted to much.

I do get it: The shown code is verbose. I am interested in discussing that. You're interested in discussing ad hominems.

Grow up and learn to stick to a topic without addressing other issues. If OO code is verbose and unnecessarily complex it is verbose and unnecessarily complex until someone shows me that it isn't.

See. Grown up life ain't that hard.

If you want to be grown up,

If you want to be grown up, then don't poop on people who have to write compilers with real requirements for a living. If you want a counter example, check out this implementation of an if statement parse (configurable, so the "if" and "else" keywords are parameterized as Keyword and ChildKeyword):

protected sealed override void Parse(ref Cursor cursor)
{
  if (Parser.Expect(this, ref cursor, Keyword.NotNull()) == null) return;
  ParseRest(ref cursor);
}

protected override void ParseRest(ref Cursor cursor)
{
  Condition = Expr.DoExpr(this, ref cursor).Or(() => Condition);
  base.ParseRest(ref cursor);
  if (cursor.Current.Kind == ChildKeyword)
    Child = this.Do<ChildT>(ref cursor);
  else Child = null;
}

Then you can do magical things like

http://youtu.be/MtYyn_Rt8zk

I could try to find a better example, but my code is heavily object-oriented, so most of the parsing code is distributed across multiple classes.

I didn't poop on people

I said a particular piece of code in the Roslyn compiler is horrible (w.r.t. what it could have been if written in an FP style). Not people, code. That's different.

I get your point of view on updating, but parsing remains parsing even if it is side-effecting and probably restricted to the piece of text that was updated.

I think the point is that had C# the syntax to concisely denote simple data constructors and simple higher-order functions, and a fast operational semantics for higher-order function evaluation, you could have organized the code in such a way that you wouldn't need to resort to variables, gotos, and while loops.

But it doesn't.

So it probably isn't worth implementing parser combinators in C# for this particular problem.

The fact that we can read

The fact that we can read and understand the code at all means that it isn't that bad. And ya, it is incremental, you can see it on the first line.

If you want another viewpoint, you should check out Martin Odersky's Parsers.scala file (linked in a post below) to see something quite similar, if less verbose because you know...Scala. This is where I learned how to do most of my real parsing. Before I got it, I was like...Martin...why didn't you use a parser generator?

Note the distinct lack of higher order function use. I think there is a good reason for that.

Yeah well

I looked at the Scala parser; I find that horrible code too. (He uses higher-order functions to parse parentheses out, though. That's a start.)

I can think of a number of reasons why Scala doesn't use parser combinators. Can't express them concisely. Can't compile the combinators out. Can't partially reduce the code generated. And, the thought just never came up.

I am hard to please.

Parsers should be argued from toolset, not hosting paradigm

Error recovery is addressed by many parser combinator libraries. The library I use most often - Haskell's parsec - only handles a few error messages by default (covering the 80% case with the `<?>` operator), but (via monad transformers, and assuming a clever developer) could be extended to handle probabilistic corrections or even reactive updates for time-varying hierarchical elements.

I don't think it's right to characterize this whole parsing issue along FP vs. OO lines. The toolset issue of hand-crafting (like Roslyn's and so many others) vs. parser combinator (like parsec or OMeta) vs. external generator (like YACC or Happy) seems mostly orthogonal to paradigm.

How specific can you get in

How specific can you get in a parser combinator about your error recovery? Is it incremental? How many parser combinators run as presentation compilers?

I don't think this is an FP vs. OO issue either, just the fact that if you want N features in your parser, you'll have to pay for it in code no matter what you are using. And your parser framework is useless if it just supports N - 1 of those features, so let's just do everything by hand because ITS NOT THAT HARD ANYWAYS. From this perspective, the Roslyn design is very acceptable and shouldn't be compared against a "just parsing" one-off.

N-1 or N+1 of N features

if you want N features in your parser, you'll have to pay for it in code no matter what you are using. And your parser framework is useless if it just supports N - 1 of those features

I agree this is a common issue... and not just in parsers. I observe it as a common issue across all software design. Usually, we seek some extensibility properties to address a few remaining features. Parsec is relatively extensible, leveraging monad transformers and typeclasses. But my experience is that there are limits where extension becomes inconvenient or incomprehensible compared to designing features in from the start.

On the flip side, if you have N+1 of N (cross-cutting) features you need, you often end up paying significant complexity or performance overheads for a feature you don't use. This creates pressure to specialize code, and results in a lot of redundancy.

There's an interesting middle ground:

* implement a custom parser-combinator library by hand
* implement a parser using the parser-combinator library

This has advantages for maintenance, extension, reuse of the grammar in different contexts by generalizing the combinators. And IT'S NOT THAT HARD ANYWAYS. I've done it two times, for short-lived DSLs. :)

My impression is that William Cook's 'Object Algebras' generalize this in some way.

Right, you can implement a

Right, you can implement a parser combinator by hand, but what do you gain by that beyond not being able to debug your code anymore?

I had an interesting discussion with William about this last year, comparing his grammar based approach to Glitch, actually. My position was that we could enhance procedural code with a new execution model rather than going declarative, but we have different goals (I'm big on live programming and debugging).

what do you gain by that

what do you gain by that beyond not being able to debug your code anymore?

As a PL maintainer, you gain much ability to debug, maintain, extend, and reuse the grammar. It's a simple separation of concerns.

As a PL user, you gain potential benefits from custom tooling, features developed within the parser combinator. That custom tooling might (or might not) be designed for debugging, live programming, probabilistic or history-driven error recovery, or whatever.

Much like a hand-crafted parser must be explicitly developed to address debugging concerns, so must be a custom parser combinator implementation. You don't automatically "gain" some inability to debug. Why did you assume otherwise?

(The main loss, rather, is optimizations and performance. For example, it's difficult to implement differential grammar optimizations for a custom parser combinator library. It's difficult even for a popular, reusable parser combinator library, but at least the potential ROI is higher.)

we could enhance procedural code with a new execution model rather than going declarative

I wouldn't say you escaped 'going declarative'. Your recent designs make heavy use of (via assumptions about) commutativity, monotonicity, and idempotence. :)

I still haven't seen any

I still haven't seen any production code that relies on parser combinators, even Martin Odersky's scalac; e.g.

https://github.com/scala/scala/blob/master/src/compiler/scala/tools/nsc/ast/parser/Parsers.scala

I'm curious if even GHC uses parser combinators? (Edit: they don't, they use a yacc clone called Happy; at least it's a generator.)

That being said, I think the main thing you get with parser combinators is ease of programming, but for a tool that will be used by thousands or millions of programmers, it is really worth the effort to do things by hand to get the performance and features that you need.

The whole debugging data flow code is an argument to have another day, but it is one I'm very passionate about: by converting control flow into data flow, you lose the ability to debug your code very easily; you've basically abstracted away the control flow from even yourself. Any sort of declarative parser generator or parser combinator does this, and there is a good reason to avoid them if you don't get a bigger gain than the capabilities you lose. It is also an intrinsic problem of the declarative approach which necessarily works by hiding control flow.

I wouldn't say you escaped 'going declarative'. Your recent designs make heavy use of (via assumptions about) commutativity, monotonicity, and idempotence. :)

Very true. It's not like OO contradicts those features.

Haskell probably isn't recursive descent

Haskell is that old; the parser generator generates tables, so I bet it isn't recursive descent/LL(k).

You usually don't pay for simple LL(k) combinators in a functional language but it depends on the implementation. (The parser combinators I have shown don't even do first-set determination.)

I just found out Daan Leijen implemented a parser combinator library for F#.

Haskell is sometimes used in industry so I am pretty sure Parsec shows up there in several applications. Not sure I like Parsec.

by converting control flow

by converting control flow into data flow, you lose the ability to debug your code very easily

Spreadsheets are a fine example of debuggable dataflows. Box-and-wire pipelines are also debuggable, as we can easily place a debug view on any wire. Are you imagining a specific problem scenario when you speak of 'data flow'?

In any case, I think what you imagining when you read 'parser combinator' is very different from what I'm imagining when I write it. Most parser combinators in Haskell are monadic, meaning you can (and often do) compose them very much like imperative code, at least with respect to the sequential axis.

Further, at least with the `parsec` package, we don't use backtracking (search) unless we specify it explicitly - it's all LL(1) by default. So there isn't any losing sight of this critical performance aspect.

I still haven't seen any production code that relies on parser combinators

I've seen a few cases of people using Boost spirit, but I haven't paid it much attention. It seems many developers prefer to just toss in an XML or JSON parser, and format their 'code' accordingly, unless they're actually building an IDE.

Haskell developers make effective use of parser combinators for DSLs. Even its new suggested 'Read' class is based on them (though ReadPrec is a parser combinator with awful error reporting). But Haskell also has a long history of using parser generators, such as Happy. Based on glancing through a few comments, I think GHC still uses Happy.

Erm, spreadsheets are

Erm, spreadsheets are notoriously hard to debug, and this is a major research topic as well as a problem that costs the world millions, or even billions, of dollars a year.

Box and wire systems allow you to debug a program like you would debug an electrical circuit; you only get to see the state for "now" and need something like an oscilloscope to even begin making this feedback comprehensible.

But this is a topic for another post in the future :)

notoriously hard to debug,

notoriously hard to debug, and this is a major research topic as well as a problem that costs the world millions, or even billions, of dollars a year.

Can't you say the same of imperative programs? :)

Spreadsheets can become quite complicated, of course, as can any program. I wonder if the notoriety of spreadsheets arises from them becoming 'complicated' most often in the hands of people who don't program for a living.

you only get to see the state for "now"

Again, this is a complaint that applies to imperative programs just as well, e.g. with step debuggers. Logs and time-traveling debuggers are good all around, I imagine, regardless of control flow vs. dataflow.

One advantage I feel exists for dataflow is an ability to see ALL the state 'now', along with some clear relationships between states, not just a peephole view through a control-flow thread with implicit relationships. Also, managed (logical) time is useful for a consistent concept of 'now'.

Anyhow, I'll happily leave such a discussion to the future. :)

Spreadsheets can become

Spreadsheets can become quite complicated, of course, as can any program.

The problem is that there is no way to really debug a spreadsheet. Live execution is nice, but not sufficient.

Also, managed (logical) time is useful for a consistent concept of 'now'.

Right, I'm actively working on having cake and eating it also :)

Slate

There are spreadsheet debugging tools, e.g. Slate (which I stumbled across one day while searching for Brian T. Rice's language).

I'm not sure why you'd claim there is no way to debug a spreadsheet. I can imagine quite a few: automatic clustering of relationships, highlighting sources and sinks in different colors, drawing or animating lines for every cell a given source affects, coloring those lines based on the operations involved, and so on.
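
As a sketch of the "every cell a given source affects" idea (C#, with invented cell names and a toy representation): given each cell's direct references, invert the graph and walk it to collect the transitive dependents a debugger could highlight or animate.

    using System;
    using System.Collections.Generic;

    // Illustrative only: given each cell's direct references, invert the graph and
    // walk it to find every cell a given source ultimately affects.
    static class SheetDebug
    {
        public static HashSet<string> Affected(string source,
            IDictionary<string, string[]> referencesOf)
        {
            // Invert "cell -> cells it reads" into "cell -> cells that read it".
            var readers = new Dictionary<string, List<string>>();
            foreach (var kv in referencesOf)
                foreach (var r in kv.Value)
                {
                    if (!readers.ContainsKey(r)) readers[r] = new List<string>();
                    readers[r].Add(kv.Key);
                }

            var seen = new HashSet<string>();
            var work = new Stack<string>();
            work.Push(source);
            while (work.Count > 0)
            {
                var cell = work.Pop();
                List<string> deps;
                if (!readers.TryGetValue(cell, out deps)) continue;
                foreach (var d in deps)
                    if (seen.Add(d)) work.Push(d);
            }
            return seen;
        }
    }

    class SheetDemo
    {
        static void Main()
        {
            var refs = new Dictionary<string, string[]>
            {
                { "B1", new[] { "A1" } },          // B1 reads A1
                { "C1", new[] { "B1", "A2" } },    // C1 reads B1 and A2
            };
            // Prints B1 and C1 (in some order): the cells a change to A1 would touch.
            Console.WriteLine(string.Join(", ", SheetDebug.Affected("A1", refs)));
        }
    }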

Whether popular spreadsheet software implements such features, or can readily be extended to implement them, is a separate issue.

Right. I do think the

Right. I do think the declarative/FP community needs to invest more in debuggers. Such debuggers are not impossible; they just don't really exist now.

Spreadsheet debugging is a wildly active research area right now, especially for the company that I work for.

Please do post about debugging.

spreadsheets are notoriously hard to debug

Debugging technique is worth writing a lot about, in both abstract and concrete terms. I wish LtU had a whole department devoted to seeing what happens in code, plus design of interfaces and running systems to support effective feedback loops, so both devs and users can perform experiments rapidly with meaningful, fast turn-around.

Clearly your work in live coding is oriented toward such issues in IDEs with immediate user feedback. I'm more interested in feedback from backend systems in production code as well as debug builds, which must be compared to decide whether Heisenbugs might be lurking. (Does it only fail when I'm not trying to watch it run?)

Poor visibility of behavior, and effects, impedes grasp of systems at every level of experience from casual non-technical users to professional systems programmers. Of the two groups, newbies suffer more, because devs have experience forming (sloppy?) mental models about what might be implied by what little evidence can be seen. But everyone seems to enjoy a closer approximation to direct manipulation. For a dev, a REPL command line is sometimes a road to instant means of checking current values fast enough to experiment while experiencing a state of flow. (For end users, insert stock remark about WYSIWYG here and make clever analogy.)

The more powerful a system, the more wildly different its current behavior can be from what is expected, because it's easy to ask far more be done than you can check easily by eye. Given a version of HyperCard with coroutine threads inside, it can be doing something all the time that has nothing to do with any current user action, and this would be mystifying without means of somehow presenting a view of it.

This is one of the reasons it's hard to make programming systems for non-experts, who might not even know to pose questions like: "Did I just give permission to mail off all my secrets, then erase my file system?" Absent the ability to destroy locally stored data (assuming one hopes to prevent erasing file systems), a user can then shoot themselves in the foot by forking threads consuming all resources, filling storage until none is left.

A user should not only be able to find out what happens as the effect of using a system, but also why those things occur, which is likely a harder design problem. Cause and effect is easy to get with a ball and bat, slightly harder to get the first time you ride a bike, but often opaque in computing tools.

There is nothing to parser combinators.

It is trivial to implement error recovery parser combinators. There's actually nothing to the naive recursive descent combinators used (which is how I like it.) The ones used in the snippet take about thirty lines and everybody should be able to implement recursive descent parser combinators which handle some amount of error recovery in an afternoon.

The point being: it is usually straightforward to write a one-off parser (recursive descent) combinator library in an FP language which implements exactly the N features you want. No tools needed.

You have a point w.r.t. presentation compilers, but the C# code snippet doesn't show they're doing anything special. At least it doesn't show why their code wouldn't be better organized using a few higher-order constructs.

The C# code remains unconvincing in its verbosity.

Well. I concede. Undoable in C++.

Thought I would implement parser combinators in C++. It's technically feasible but at the cost of a tremendous amount of line noise.

Guess there are good reasons for not using parsing combinators in C++, or C#, for that matter. (Though a library for C# just became available. Not sure what amount of line noise that would involve.)

Guess there are good reasons

Guess there are good reasons for not using parsing combinators in C++, or C#, for that matter. (Though a library for C# just became available.)

Parsing combinators in C# have been available since soon after the release of generics in .NET 2.0 (that implementation is dated 2001). The more recent examples now available are exploiting LINQ's monadic structure, and have been around since at least 2007.

I just personally no longer use parser combinators because they are insufficiently general. For instance, separate mechanisms are needed to handle precedence, and non-trivial transformations are needed for certain recursive grammars (last I looked, which has been a few years I admit).
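
For readers who haven't seen the LINQ flavor, here is a minimal sketch of the idea, with the parser type and every name invented rather than taken from any real library: once the type has a suitable Select/SelectMany, C#'s query syntax gives you monadic sequencing over parsers.

    using System;

    // Minimal illustrative parser type: consume from a string, return value + rest,
    // or null on failure. Not a real library.
    sealed class PResult<T> { public T Value; public string Rest; }
    delegate PResult<T> P<T>(string input);

    static class Combinators
    {
        public static P<char> Item = input =>
            input.Length == 0 ? null
                              : new PResult<char> { Value = input[0], Rest = input.Substring(1) };

        public static P<char> Digit = input =>
        {
            var r = Item(input);
            return (r != null && char.IsDigit(r.Value)) ? r : null;
        };

        // These two methods are what LINQ query syntax desugars to,
        // so 'from ... select ...' works over parsers.
        public static P<U> Select<T, U>(this P<T> p, Func<T, U> f)
        {
            return input =>
            {
                var r = p(input);
                return r == null ? null : new PResult<U> { Value = f(r.Value), Rest = r.Rest };
            };
        }

        public static P<V> SelectMany<T, U, V>(this P<T> p, Func<T, P<U>> bind, Func<T, U, V> project)
        {
            return input =>
            {
                var r = p(input);
                if (r == null) return null;
                var r2 = bind(r.Value)(r.Rest);
                return r2 == null ? null
                                  : new PResult<V> { Value = project(r.Value, r2.Value), Rest = r2.Rest };
            };
        }
    }

    class LinqParserDemo
    {
        static void Main()
        {
            // Two digits in sequence, combined into a number, via query syntax.
            P<int> twoDigits =
                from a in Combinators.Digit
                from b in Combinators.Digit
                select (a - '0') * 10 + (b - '0');

            var result = twoDigits("42!");
            Console.WriteLine(result.Value + " rest: " + result.Rest);   // 42 rest: !
        }
    }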

Wrong parser combinators

Hmm. I don't know what you mean by needing separate mechanisms, because functionally it isn't very hard to define precedence rules. Maybe you are using the parsing combinators wrong; possibly it isn't expressible in the C# library you use.

But actually, it is the problem I am trying to solve at the moment. The following is the (largely untested) code which handles precedence in my compiler and I am scratching my head about how to translate it to C++. (Unsure it's entirely correct but this is the general idiom, after parsing an operator you pass it as an argument to the parse function which parses the rest of an arithmetic expression.)

    def parse_infix_expr: ast -> ast -> parser ast =
        app = [ e0, e1 -> expr_application (pos e0) e0 e1 ];
        i =
        [ e0, op0, e1, op1 ->
            if op_less (txt op1) (txt op0)
            then parse_infix_expr (app (app op0 e0) e1) op1
            else if op_less (txt op0) (txt op1)
            then parse_infix_expr e1 op1 <@> \e2 ->
                app (app op0 e0) e2
            else if op_left (txt op0)
            then parse_infix_expr (app (app op0 e0) e1) op1
            else if op_right (txt op1)
            then parse_infix_expr e1 op1 <@> \e2 ->
                app (app op0 e0) e2
            else failure ];
        [ e0, op0 ->
            parse_prefix_expr <-> \e1 ->
            (parse_infix_op <-> \op1 -> i e0 op0 e1 op1)
            <+> success (app (app op0 e0) e1) ]

    def parse_arithmetic_expr: parser ast =
        parse_position <*> \p ->
        (parse_prefix_expr </> \e ->
            parse_infix_op <-> \op -> parse_infix_expr e op)

Anyone with bright ideas is welcome to come forward and show that I completely don't understand the merits of C++ template programming. (Or show me that my code doesn't work. That would be good too.)

ADDENDUM: Bleh. Now I am thinking about correctness of this code which is what I don't have time for.

I once wrote an entire

I once wrote an entire parser for a real language using only precedence-parsing and brace-matching components (ok, also add some linear propagation to set some ambiguous token-kind precedences). This was quite cool because it was incredibly error tolerant...being bottom-up it would produce a tree for anything you gave it! Of course, you still have to walk the resulting tree to check the grammar (a form of grammar-based parsing without tree building), but this didn't negate the error recovery capabilities.

I guess the moral of this story is that this rule-based parsing thingy, which seems to be heavily reinforced by parser combinators, is not the only way, or even the best way, to build parse trees. My current style is a hybrid of recursive descent, some precedence parsing, and brace matching (because you can match braces before you even begin to form trees, which is great for error robustness and brace completion features in a live programming environment).
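
For anyone unfamiliar with the precedence-parsing part, a bare-bones precedence-climbing loop looks roughly like the following C# sketch (invented names, a toy token stream, left-associative operators only, and no error recovery).

    using System;
    using System.Collections.Generic;

    // Bare-bones precedence climbing over a pre-tokenized expression.
    // Illustrative only: integers and left-associative + - * /, no error handling.
    class PrecedenceParser
    {
        private static readonly Dictionary<string, int> Prec =
            new Dictionary<string, int> { { "+", 1 }, { "-", 1 }, { "*", 2 }, { "/", 2 } };

        private readonly IList<string> toks;
        private int pos;
        public PrecedenceParser(IList<string> tokens) { toks = tokens; }

        public string Parse() { return ParseExpr(0); }

        private string ParseExpr(int minPrec)
        {
            var left = toks[pos++];                       // a number, in this toy grammar
            while (pos < toks.Count && Prec.ContainsKey(toks[pos]) && Prec[toks[pos]] >= minPrec)
            {
                var op = toks[pos++];
                var right = ParseExpr(Prec[op] + 1);      // +1 gives left associativity
                left = "(" + left + " " + op + " " + right + ")";
            }
            return left;
        }
    }

    class PrecDemo
    {
        static void Main()
        {
            var p = new PrecedenceParser(new[] { "1", "+", "2", "*", "3", "-", "4" });
            Console.WriteLine(p.Parse());   // ((1 + (2 * 3)) - 4)
        }
    }

Because it is bottom-up over operators rather than grammar-rule driven, it keeps producing a tree even for malformed input, which is where the error tolerance comes from.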

Since you replied to me in the first place

Since you replied to me in the first place, just thought I could add, FWIW, and after I've read a little more of it (at CodePlex):

well, I rather agree with you; in fact, this probably would not have been my own style of putting it in C#, had I had to follow the same approach for the same task.

AFAICT, I've seen C# more appealing to the eye, and yes, even beyond the ad-hoc gotos, I don't quite get the point either of putting up with all these "ref..." arguments, or of memory management not sticking more simply (more uniformly) to the common "using (...) { ... }" / IDisposable pattern that's been around since C# 1.0... but heck, maybe I'm still missing some subtlety sitting elsewhere, also(?)

Now, to play the devil's advocate (but still, not trying to excuse it completely): I've also seen much worse and/or much hairier production code in C# many times, and that wasn't even about a parser for a HLL striving to do some decent error recovery, etc...

I wouldn't flame it, but I wouldn't praise it either, anyway, as far as C# writing style alone is concerned (independently of what the code does).

As I'm still eating a lot of that thing these days, after a decade plus, I have a pretty rudimentary litmus test with C#: I try to read and understand every single aspect of it, pretending no comments at all were provided, besides those of its public interface, or the specs given elsewhere (and no, I'm no masochist... just of a curious nature ;)

And there, this just didn't read easily enough to be "very agreeable", granted ;)

goto considered useful

I use goto sometimes too. There are obvious use cases where it is useful, like when you occasionally (but not usually) want to do something again, or when you have some hairy control flow to worry about.

P.S.

P.S.

But hey, it's still a little more useful free stuff(?) coming from Microsoft, so I'm not going to complain too loudly or for too long!

(I, for one, certainly do not want to discourage them, there, anyway...)

"Better than nothing" they say!

Good grief

Good grief. Well, I know it's always within the realm of the possible with "dear OO", but I wasn't expecting to open such a can of worms with it through this comment thread...

Oops! ;)

Idealism and reality always

Idealism and reality always cause explosions when mixed together.

Dude, please.

Stick to the facts. Insofar as you were referring to me, I am not an idealist.

But your idea of standard

But your idea of standard practice is not standard practice. This is quite common; it's like how systems professors like to believe that bank account balances are always maintained atomically, when eventual consistency rules in reality. Likewise, a lot of people believe that parser generators are ubiquitously used for building parsers; this is what we are taught in compilers 101, after all! But the truth is far more nuanced than that: you'll see recursive descent in production code more often than not, for very good reasons.

You don't know me

You consistently assume you know what I think. You don't. You're incapable of having a discussion outside of the realm of your own fantasies.

I haven't stated anything on my ideas about standard practice. I have said the above C# snippet is horrible code. I still believe so.

You can't win a discussion by labelling other people idealists or by making assumptions about their ideas of standard practice. Please try to formulate a technical argument. At the moment you're proving that you lack the intellectual capacity to do so.

Fine. You find the C# code

Fine.

You find the C# code to be horribly convoluted, and I find your counter example to be horribly toyish and insulting to professionals in the field (which may or may not have been your intention; that's how I took it). We can leave it at that.

Now we should just chill. The debate has gotten too emotional, and I didn't mean to offend you or make assumptions about who you are. Technically speaking, it is hard to reason about these things rationally given that:

  • Your requirements are probably quite different from their requirements.
  • We can't judge simplicity simply by judging verbosity.

So we have devolved into an experience pissing contest. Sorry about that. This is a subject that I'm quite passionate about, and one where I think the academic thinking (which you may or may not buy into) is leading us down the wrong path.

Not letting go

"I find your counter example to be horribly toyish and insulting to professionals in the field"

The whole point of my insulting code, or of all the programs I write, or of the languages I design, is to show how, by trivializing programs, one can arrive at simple, concise, readable, reasonably performant specifications which will do exactly what one thinks. (You think trivializing stuff is easy?)

I understand why the C# code is what it is. The tight coupling to the underlying imperative model. The unavailability of higher-order constructs; or rather, the inability to express higher-order constructs concisely. The lack of a simple syntax for tagged records. The paradigm which dictates an avoidance of superfluous method declarations that refine algorithms. The near inability of the compiler to optimize out functional code, or generate efficient code for it, even if it were written in a functional style. The necessity of try/catch/finally blocks.

I agree with what they wrote given C#. It works. I don't even mind gotos when they are needed. I have seen far worse. And it is horrible, verbose and unnecessarily complex code.

The whole point of the language I designed is to arrive at something toyish which is good enough for certain problem domains. Parsing in my language, as in most FP languages, is just a problem domain which fits the paradigm very well.

Now I will let go.

Point taken. We shall see

Point taken. We shall see what approach yields the best results in the long run.

Incommensurable entities

Incommensurable entities.

Parser combinators are just

Parser combinators are just a way to write a recursive descent parser, though. They encapsulate the common patterns that you see when writing a recursive descent parser the old-fashioned way. In an old-fashioned recursive descent parser you have idioms for handling repetition, alternatives, precedence, etc. Parser combinators just implement those idioms as functions, taking advantage of higher-order functions to customize the bits of each idiom that vary from case to case.
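
To make the correspondence concrete, here is a small sketch using the ReadP combinators that ship with GHC (an illustrative example of mine, not any particular library under discussion): the loop and the if/else of a hand-written recursive descent parser reappear as named higher-order functions.

    import Text.ParserCombinators.ReadP
    import Data.Char (isAlpha, isDigit)

    -- "loop while the next character is a letter" becomes munch1
    identifier :: ReadP String
    identifier = munch1 isAlpha

    intLiteral :: ReadP String
    intLiteral = munch1 isDigit

    -- "try one branch, otherwise the other" becomes a choice combinator
    atom :: ReadP String
    atom = identifier <++ intLiteral

    -- "zero or more, separated by commas" is another captured idiom
    argList :: ReadP [String]
    argList = atom `sepBy` (skipSpaces *> char ',' <* skipSpaces)
    -- readP_to_S (argList <* eof) "x,1, y"  ==  [(["x","1","y"], "")]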

Parser combinators for LR grammars

Parser combinators do not necessarily produce recursive descent parsers, though. They provide an interface for describing a grammar, and you could check that the grammar is LR and generate an LR parser automaton from it.

(It would be easier with an explicit representation of recursion, instead of relying on that of the host language, but you never know what kind of hacks could make this subtlety go away.)

Indeed you could take a

Indeed you could take a subset of the functions in a parser combinator library, replace them with functions that simply build a grammar AST, and then create an LR automaton from that. For recursion in an impure language you could generate a circular AST, and then reintroduce explicit recursion by looking at the object identity of the nodes.

If you want to be able to express the full range of constructs that most parser combinator libraries provide, you're pretty much stuck with recursive descent, though. In particular, the monadic bind lets you look at what you've parsed so far and, based on that, decide how you're going to parse the rest. That allows you to parse unrestricted grammars (the same class of grammars that a Turing machine can recognize). It's not just a question of being LR; it's not even context-free. There isn't even any construct in context-free grammars that corresponds to the parser's monadic bind.
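
A tiny illustration of that last point (my own example, again with GHC's bundled ReadP): the value bound on the left decides how the rest of the input is parsed, which no context-free production can express.

    import Text.ParserCombinators.ReadP
    import Control.Monad (replicateM)
    import Data.Char (isDigit)

    -- A length-prefixed string such as "3:abc": the count parsed first
    -- determines how many characters are consumed afterwards.
    lengthPrefixed :: ReadP String
    lengthPrefixed = do
      n <- read <$> munch1 isDigit   -- look at what has been parsed so far...
      _ <- char ':'
      replicateM n get               -- ...and let it drive the rest of the parse

    -- readP_to_S (lengthPrefixed <* eof) "3:abc"  ==  [("abc", "")]
    -- readP_to_S (lengthPrefixed <* eof) "4:abc"  ==  []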

Stuff just fell out of fashion

In the old days of language development people were grammar nuts. These days, somewhat context-sensitive LL(k) parsers just solve most problems encountered in the definition of programming languages.

I've forgotten the reasons why I shouldn't use LL in the definition of languages, but in the eighties people had pretty strong opinions on them. I like LL(k). LL in a programming language, for me, just means that most production alternatives start with a visual clue (you can even use those clues to restart parsing). KISS.
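
A throwaway sketch of the "visual clue" point (my example, in Haskell with GHC's ReadP rather than the language being discussed): when every alternative opens with a distinct keyword, one token of lookahead picks the branch and a left-biased choice never backtracks.

    import Text.ParserCombinators.ReadP

    data Stmt = IfStmt | WhileStmt | ReturnStmt deriving (Eq, Show)

    -- Each alternative announces itself with a keyword, so the first match wins.
    stmtKeyword :: ReadP Stmt
    stmtKeyword =     (string "if"     *> pure IfStmt)
                  <++ (string "while"  *> pure WhileStmt)
                  <++ (string "return" *> pure ReturnStmt)
    -- readP_to_S stmtKeyword "while (x) ..."  ==  [(WhileStmt, " (x) ...")]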

Monadic bind? It somewhat works against you if you want to implement more than LL parsers. You can construct more elaborate parsers if you separate rules (in the form of data) from actions (in the form of functions). But there simply is no need, and a scheme like that always presses you into a straitjacket with its own disadvantages.

Personally, I think the Helium (Utrecht) Haskell developers were foolish for developing an attribute grammar compiler to parse and compile Haskell. But, yeah, Doaitse Swierstra is a grammar nut.

(Then again. No idea whether Haskell is LL.)

But that is just it: parser

But that is just it: parser combinators emphasize expressing grammars at the expense of everything else. You basically leave your control flow to a bunch of higher-order functions applied under the covers, which you can only tweak by hacking up your generic parser combinator.

But really, this applies to functional programming in general: as soon as you go with higher-order functions, your program is no longer under your control literally. This is not just a problem with functional programming, but also with the less disciplined Hollywood principle (at least combinators don't require you to literally pass around HOFs), and with any system where spooky control flow can happen at a distance (e.g. databinding, FRP, Haskell [1] in general). Parser combinators are then like WPF, whereas recursive descent is like a simple polling-based immediate-mode UI framework.

Once we take back control of our parsers (and programs in general), we can do really cool things, like error recovery that doesn't suck, or memoization to make parsing incremental. But in order to do that, we need to make control flow more appealing so our first thought isn't just to hide it.

I'm really only latching on to parsing because this is where I realized that the solutions were ill-suited to the challenges I was tasked with in getting scalac to work well in Eclipse. But the solution I came up with is far more general.

[1] Using monads for effectful programming is like vowing not to kill anyone and hiring an assassin to do it for you...your hands are clean!

But in order to do that, we

But in order to do that, we need to make control flow more appealing so our first thought isn't just to hide it.

I think this is a good approach, as long as you can decompose the parsing process into a simple set of impure primitives. For instance, Pratt parsing requires a token stream, a binding power for each token, a map describing the null denotations and left denotations for tokens, and a stack. You then have Turing-complete top-down operator precedence parsing, and you can make it incremental if the stack is reified as an explicit value.

Exploiting the Turing completeness is also pretty easy: the Pratt parser has an advance() operation that lets you manually advance the token stream and process tokens using any sort of logic. So parsing a list of integers delimited by [] is as simple as:

// Sketch against a hypothetical Pratt-parser API ("Node" stands in for whatever
// parser.Parse returns); assumes the opening "[" has already been consumed.
var list = new List<Node>();
for (var tok = parser.CurrentToken;
     tok.Id == "(int)"; // some unique identifier for the integer rule
     tok = parser.CurrentToken)
{
     list.Add(parser.Parse(0)); // parse one element with right binding power 0
}
parser.Advance("]"); // expect and consume the closing "]"
return list;

dysfunctional programming

Ya, this is what I found also: most of the impurities of a parser are not really that impure and can be managed fairly easily. The question is how far we can go by managing impurities (say, the ones that can be rolled back and are commutative)... Heck, the same applies to type checkers...

In theory you don't really

In theory you don't really give up any control, since you can always parse any way you want using the lower-level combinators. It's true that if you're doing that throughout your parser, then most of the built-in combinators aren't useful to you, and the usefulness of the library becomes questionable. In order to support e.g. memoization in the standard combinators, you'd need to modify the library. I don't think this is a good reason not to use parser combinators, however. A parser combinator library is a very small piece of code. If I were given the task of writing a parser and I did not have any parser combinator library available, I'd first write the parser combinator library, and then write the parser using that. If you refactor a recursive descent parser to remove the duplication, you'd end up with something like parser combinators. Similarly, if I were given the task of writing a memoizing parser, I'd first write the memoizing parser combinator library, and then write the parser using that.
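
To give a sense of how small "a very small piece of code" can be, here is a minimal list-of-successes combinator core in Haskell (a sketch of the general idea, not anyone's library in particular):

    -- A backtracking parser is a function from input to all possible
    -- (result, remaining input) pairs.
    newtype Parser a = Parser { runParser :: String -> [(a, String)] }

    instance Functor Parser where
      fmap f p = Parser $ \s -> [ (f x, rest) | (x, rest) <- runParser p s ]

    instance Applicative Parser where
      pure x    = Parser $ \s -> [(x, s)]
      pf <*> px = Parser $ \s ->
        [ (f x, s2) | (f, s1) <- runParser pf s, (x, s2) <- runParser px s1 ]

    instance Monad Parser where
      p >>= f = Parser $ \s -> concat [ runParser (f x) s' | (x, s') <- runParser p s ]

    failP :: Parser a
    failP = Parser (const [])

    item :: Parser Char                          -- consume one character
    item = Parser $ \s -> case s of { c:cs -> [(c, cs)]; [] -> [] }

    satisfy :: (Char -> Bool) -> Parser Char     -- conditional consumption
    satisfy ok = item >>= \c -> if ok c then pure c else failP

    orElse :: Parser a -> Parser a -> Parser a   -- the "alternatives" idiom
    orElse p q = Parser $ \s -> runParser p s ++ runParser q s

    many0 :: Parser a -> Parser [a]              -- the "repetition" idiom
    many0 p = ((:) <$> p <*> many0 p) `orElse` pure []

Everything else in a typical library (precedence helpers, error messages, memoization) is layered on top of a core of roughly this shape.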

Do you separate your

Do you separate your type checking code from your parsing code, then? Because I don't do this currently. I don't bother to generate trees at all.

Possible

The parse functions follow the grammar rules, can take arguments, and can return information. They can therefore easily express attribute grammars, where inherited and synthesized information can be used during the parse to do the type checking.

So yes. It should be doable to typecheck the code without explicitly generating an AST, but I haven't seen it around so far.

(Which is the reason why I think it doesn't make too much sense to create a compiler compiler for attribute grammars for Haskell. Haskell should allow you to define stuff like that trivially in the language itself.)
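
A deliberately tiny reading of that in Haskell, using GHC's ReadP rather than the language above (my sketch, not the poster's code): the environment is passed down as an ordinary argument (inherited) and the inferred type is the parser's result (synthesized), so type errors simply become parse failures.

    import Text.ParserCombinators.ReadP
    import Data.Char (isAlpha)

    data Ty  = TInt | TBool deriving (Eq, Show)
    type Env = [(String, Ty)]          -- the inherited attribute

    -- Parse a variable and synthesize its type; no AST is ever built.
    var :: Env -> ReadP Ty
    var env = do
      name <- munch1 isAlpha
      maybe pfail pure (lookup name env)

    -- Parse "x+y"-style sums, checking that every operand is an Int as we go.
    sumExpr :: Env -> ReadP Ty
    sumExpr env = do
      t <- var env
      if t /= TInt then pfail else do
        rest <- many (char '+' *> var env)
        if all (== TInt) rest then pure TInt else pfail

    -- readP_to_S (sumExpr [("x", TInt), ("y", TInt)]  <* eof) "x+y" == [(TInt, "")]
    -- readP_to_S (sumExpr [("x", TInt), ("y", TBool)] <* eof) "x+y" == []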

Instead of building a parse

Instead of building a parse tree you can directly type check of course. What you do during the parsing process is entirely up to you.

I'm not sure if anyone has

I'm not sure if anyone has come up with type checking combinators yet, while attribute grammars never really worked out.

I don't really understand

I don't really understand what you mean here. Parser combinators are no different from a handwritten recursive descent parser in this respect: you don't need to build a parse tree, and you can do anything you wish the moment some subexpression is parsed. For instance, when parsing arithmetic expressions you could calculate the answer on the fly instead of building an AST and then calculating the answer from it. The same goes for type checking and anything else you might wish to do.
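
For instance, a sketch of my own with GHC's ReadP (not any code discussed above): an arithmetic parser that never builds a tree and simply returns the value.

    import Text.ParserCombinators.ReadP
    import Data.Char (isDigit)

    -- Evaluate while parsing: each rule returns the value of what it parsed.
    expr, term, factor :: ReadP Int
    expr   = term   `chainl1` (char '+' *> pure (+))
    term   = factor `chainl1` (char '*' *> pure (*))
    factor = (char '(' *> expr <* char ')') <++ (read <$> munch1 isDigit)

    eval :: String -> [(Int, String)]
    eval = readP_to_S (expr <* eof)
    -- eval "2+3*4"   == [(14, "")]
    -- eval "(2+3)*4" == [(20, "")]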

Yes

You need a few design tricks (abstract binding trees; breaking ASTs into a syntax functor and an explicit fixed point; a distributive law of syntax functors over applicatives; and finally a context monad), but you can end up with very simple and readable typecheckers.

I'll blog about it some time, maybe.
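
Two of those ingredients are easy to show in a few lines of Haskell (a sketch of my own, so the names and the exact split are illustrative, not the poster's): the AST is broken into a syntax functor plus an explicit fixed point, and the typechecker becomes a fold that threads a context.

    {-# LANGUAGE DeriveFunctor #-}

    newtype Fix f = Fix (f (Fix f))

    -- One layer of syntax, with the recursion abstracted out.
    data ExprF r = Lit Int | Var String | Add r r | If r r r
      deriving Functor

    type Expr = Fix ExprF

    cata :: Functor f => (f a -> a) -> Fix f -> a
    cata alg (Fix layer) = alg (fmap (cata alg) layer)

    data Ty  = TInt | TBool deriving (Eq, Show)
    type Ctx = [(String, Ty)]

    -- The typechecker is a catamorphism; the context is threaded Reader-style.
    check :: Expr -> Ctx -> Either String Ty
    check = cata alg
      where
        alg (Lit _)    _   = Right TInt
        alg (Var x)    ctx = maybe (Left ("unbound " ++ x)) Right (lookup x ctx)
        alg (Add a b)  ctx = do
          ta <- a ctx
          tb <- b ctx
          if (ta, tb) == (TInt, TInt) then Right TInt else Left "+ expects Ints"
        alg (If c t e) ctx = do
          tc <- c ctx
          if tc /= TBool then Left "condition must be Bool" else do
            tt <- t ctx
            te <- e ctx
            if tt == te then Right tt else Left "branch types differ"

    -- check (Fix (Add (Fix (Lit 1)) (Fix (Var "x")))) [("x", TInt)] == Right TInt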

Sounds interesting, it would

Sounds interesting, it would be nice to read more about it.

That sounds like a really good idea!

Wonderful thought.

Old news

as soon as you go with higher-order functions, your program is no longer under your control literally

In the 60s and 70s, Real Programmers were arguing against programming languages in the same spirit: All these Algols, Pascals, PL/1s and Cs were taking away the "control" they had in assembly language.

I suppose one person's lack of control is the next person's gain in abstraction and compositionality.

I gather in the 50s

I gather in the 50s programmers objected to assembler programs for reducing programmer control of the machine. Nor were they wrong.

Except at that point we were

Except at that point we were moving from something low-level to something more concise. With HOFs, we just move from something direct to something indirect.

As in

As in: for-loops are as concise and high-level as 'map' or 'iter'?

Depends. If I don't care

Depends. If I don't care about debugging, I am not against using map or filter. If I need to debug, for loops are much better of course.

If only

If only for-loops weren't so often the reason that you need to do debugging in the first place...

That is just it, isn't it?

That is just it, isn't it? Haskell requires equational reasoning because you can't see very well how the program executes (beyond going the impure print route, what is the state of Haskell debugging these days?). And yet, throw a few monads into the mix and your equational reasoning isn't really there anymore; now what?

The dream of programming without debugging is still just a dream.

The dream of programming without debugging is still just a dream

Well, I agree with that. And I can't say I like Haskell a lot. That is, I like some of the ideas behind it, and moderately to fervently disagree with some other ideas, but I simply don't much like what the language has become.

But I don't think people in FP claim that you don't need to debug your programs. Debug less, maybe. Debug different errors than in imperative languages. Debug differently, maybe true too.

(The point of FP being that it should allow you to concentrate on logical errors only after you've fixed the typos in your code. But, for the moment, you do pay heavily for all that abstraction.)

Tim Sweeney once posted numbers on that

Something like 90% of all bugs were out-of-bounds errors, if I remember correctly.

Spooky control flow

I've been thinking about this for a while, but I would claim that in most functional languages you have the least control over control flow. It's all function composition. In a strict language, that means: evaluate the arguments, then apply the function. (No gotos or breaks, usually.)

Since you have HOFs, you can exploit that feature to express more elaborate control flow mechanisms by combining HOFs and having them evaluate in a particular fashion.

In my library, I didn't even bother to name the parser combinators parse combinators, but named them search combinators, since that is what they abstractly do. They search in a particular fashion over the input (with parsers, a list of tokens) to generate output (with parsers, usually an AST).

The code (which relies on calc, a poor man's monadic composition; <* is bind) has primitive error reporting and exception mechanisms (I opted to handle exceptions explicitly, though the language has exception support built in):

    type search_result = \t => [ _top t | _bot text | _exc text ]

    type search = \a \b => calc b (search_result a)

    def success:a -> search a b =
        [ r -> nret (_top r )]

    def failure:search a b =
        nret (_bot "")

    def exception:search a b =
        nret (_exc "")

    def message:search a b -> text -> search a b =
        [ p, m -> p <*
            [ _bot "" -> nret (_bot m)
            | _exc "" -> nret (_exc m)
            | x       -> nret x ] ]

    def nofail:search a b->search a b =
        [ p -> p <*
            [ _bot m -> nret (_exc m)
            | x      -> nret x ] ]

    def parallel:search a b -> search a b -> search a b =
        [ p0, p1, s0 ->
            [ (_bot r1, s1) -> p1 s0
            | l             -> l ] (p0 s0) ]

    def serial: search a c -> (a -> search b c) -> search b c =
        [ p0, p1 -> p0 <*
            [ _top x  -> nofail (p1 x)
            | _bot m  -> nret (_bot m)
            | _exc m  -> nret (_exc m) ] ]

A parser is something which searches over a list of tokens. (A lexer searches over a string.)

    type parser = \a => search a (list token)

So parser combinators are actually things which express a different control flow within a minimalist control-flow setting. Abstractly, search combinators define things which combine search functions over an input, which might be a list of tokens but could also be a tree you want to query.

Still wondering what more you can do except for expressing search primitives though.

PS: Sorry for the code, but I thought I would post it since people still seem to think parser combinators are necessarily complex things. They can be, but there is usually no real need for that.
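
For readers who don't speak the language above, here is a rough Haskell transliteration of the core (my reading of the snippet, so treat it as an approximation rather than the author's code):

    -- A search threads a state s and yields success, plain failure, or an
    -- exception (an unrecoverable failure).
    data Result a = Top a | Bot String | Exc String

    type Search s a = s -> (Result a, s)

    success :: a -> Search s a
    success x s = (Top x, s)

    failure, exception :: Search s a
    failure   s = (Bot "", s)
    exception s = (Exc "", s)

    -- Try the first search; on plain failure, backtrack and try the second.
    parallel :: Search s a -> Search s a -> Search s a
    parallel p0 p1 s0 = case p0 s0 of
      (Bot _, _) -> p1 s0
      r          -> r

    -- Run the first search, feed its result to the second; a plain failure in
    -- the second is promoted to an exception, so there is no backtracking
    -- past a committed prefix.
    serial :: Search s a -> (a -> Search s b) -> Search s b
    serial p0 p1 s0 = case p0 s0 of
      (Top x, s1) -> case p1 x s1 of
                       (Bot m, s2) -> (Exc m, s2)
                       r           -> r
      (Bot m, s1) -> (Bot m, s1)
      (Exc m, s1) -> (Exc m, s1)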

Searching vs Parsing

Searching seems inefficient? I think I would choose a simple L{L|R}(1) parser in preference: a parser either returns false (consuming no input), or consumes input and returns true or an exception.

With parallel, how do you deal with both parsers succeeding? How do you limit the scope of a search? Surely you need to know what is next, not what is 100 tokens ahead.

Parallel tests two alternatives

When the first search fails, it tries the second. (Unless something exceptional happens in the first search.)

On the efficiency

The combinators abstractly define how to combine functions which recursively search a space. They are written such that an optimizing functional compiler, by inlining and partial evaluation, would compile them away. The efficiency of the combinators in a functional language is therefore equivalent to that of a handwritten recursive descent solution, which is what most people writing compilers usually default to at some point.

It's the usual abstraction vs speed trade-off. With these parser combinators you have a reasonable guarantee that your parser is correct by construction. And cheap higher-order attribute grammar definitions from within the language aren't something you can even easily fully express with tooling (if you don't mind LL).

But if you want raw efficiency, this approach will never beat a C lexer which tokenizes a text by bumping a pointer. But you do get 'correct by construction' back.

Correctness, Tokenising and Efficiency

How do sequenced searches work? For example say I have:

abc 123
345 def

So if letter blocks and number blocks are alternatives, the first will go wrong if you search for numbers before letters, and the second will go wrong if you search in the other order.

On the tokeniser issue: I don't usually use a tokeniser, as you run into problems with overlapping symbols; I prefer a single-pass parser.

I also find the argument that 'an optimising functional compiler' will solve the efficiency problem to be problematic in my experience. It is an often-repeated mantra, but I find it takes domain experts to code the abstractions to make things efficient. The compiler's problem is that an optimisation in one domain is a pessimisation in another. No amount of inlining and partial evaluation can make up for a bad abstraction.

I also disagree with the abstraction-vs-speed tradeoff: I believe that the right abstraction (if you can find it) is both efficient and beautiful (but may not be supported in any given language).

It just follows the grammar

I am sorry, but I don't get the question and don't understand the grammar you're using. If you have an LL(k) grammar, you just follow the production rules, like you would when writing out a recursive descent parser by hand, and the parser combinators deal with the usual idioms. (I have more operators than shown; some primitive, some convenience. It's just that par and seq are the normal ones to showcase.)

I haven't given it much thought, just implemented it, so here's a short explanation. (Don't expect published-paper quality; my laptop battery is running low.)

Say you have a language which describes numbers nested in braces:

N → ('0' | ... | '9')+

E → '{', E, '}' | N

The first thing to observe for LL(k) grammars is that everything can be expressed in terms of three operators: <->, <*>, and <+>. <-> tests two production rules sequentially, where the second one may not fail (for LL(1)). <*> tests two production rules sequentially, where the second one may fail (for LL(k) expressions; not used often). <+> tests two alternative production rules: if the first alternative fails, the second is tried. (There is more, but that's mostly convenience for nice error reporting and often-used idioms.) Also, we're doing attribute grammars, and the two sequential operators thread a state and a return value, and therefore have a type like the monadic bind.

You analyze the production rule for expressions. It has two alternatives: one alternative for numbers, one alternative for brace-enclosed expressions.

E → '{', E, '}' | N
                  ^--- just parse this if the first alternative fails, <+>
     ^---------------- you test on this, if it succeeds
         ^------------ then this _needs_ to succeed, so <->,
             ^-------- as does this, so <-> again

Informally, a rough approximation without attributes is (by simple substitution):

E = ('{' <-> E <-> '}') <+> N

Say we want to return the number recognized. Decorating the grammar with attributes gives the next informal description:

E(n) = ('{' <-> E(n) <-> '}') <+> N(n)

You're going to return the number. So you're just writing out the attribute grammar like you would in Bison or Yacc. And then the code will look something like this (untested):

def match: text -> parser text = ... // do something smart with input here, note 'analytical' information is just returned

def parse_plus: parser t -> parser (list t) = ... // a higher-order function, usually predefined

def number: parser int = 
    parse_plus (match "0" <+> ... <+> match "9") <-> 
       \dd -> success (digits_to_int dd)

def expression: parser int =
       (match "{" <-> \_ ->       // first alternative
        expression <-> \n ->      // note the synthesized information going in
        match "}" <-> \_ ->
            success n)              // synthesized information going out
    <+>
       number                       // second alternative. synthesized information going up is implicit

It's all very straightforward if you know some parsing or compiler theory and some functional programming. This is sophomore level programming. And you can't really do it in C++.
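
For comparison, here is the same grammar in Haskell using the ReadP combinators bundled with GHC (my transliteration, without the attribute-grammar flavour of the operators above):

    import Text.ParserCombinators.ReadP
    import Data.Char (isDigit)

    -- N -> ('0' | ... | '9')+
    number :: ReadP Int
    number = read <$> munch1 isDigit

    -- E -> '{' E '}' | N   (the synthesized number is the parser's result)
    expression :: ReadP Int
    expression = (char '{' *> expression <* char '}') <++ number

    parseE :: String -> Maybe Int
    parseE s = case readP_to_S (expression <* eof) s of
                 [(n, "")] -> Just n
                 _         -> Nothing
    -- parseE "{{42}}" == Just 42
    -- parseE "{42"    == Nothing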

Spooky control flow II

Okay, so in the above elaborate post I have shown that you can express different control-flow primitives in a language which has minimal control flow. (I hope we're on the same page on that; I'm unsure I got what you meant.)

Moreover, you can do that abstractly, concisely and with a reasonable guarantee, by visual inspection, that whatever you expressed will work^tm.

So now I don't get what your beef is. Where's the problem? Isn't this a good thing?

It's hard to predict which

It's hard to predict which thread will take off.

And in what direction.

So true. I'd blame it on Gödel... him, again! ;)

So true. I'd blame it on Gödel... him, again! ;)