Looking for a good online forum on compiler design and implementation

Hi, I'm looking for a good online forum that focuses on the practical details of building compilers. I've posted a lot of material on LLVM over the years, but that forum is really dedicated to issues specific to LLVM, not compilers in general. Similarly, LtU is more focused on theory than practice (which is not to say that I'm not interested in theory).

I'd like to find some place where I could post questions about things like efficient multimethod dispatch, garbage collection algorithms, and so on, and not feel like I was off-topic.

Also, when it comes to feedback I would favor quality over quantity.

Any suggestions?

Is comp.compilers still

Is comp.compilers still active? Google groups is blocked here, unfortunately.

It seems to be.

It seems to be. comp.compilers also has a Twitter feed.

Twitter is blocked in China

Twitter is blocked in China also :)

When I get to the office, I can look at these things.

Comp.compilers used to be the place for compiler discussions; I was subscribed a long time ago. It also had a strong moderator who was well known in the community. If it is still as active as it was before, this is the place for compiler heads.

They have a web site of their own too

Can you access http://compilers.iecc.com/?

I second the recommendation for comp.compilers, too.

comp.compilers

I spent a few hours perusing topics. I agree with the recommendations. It seems active and high quality.

seconded

Yah. Looks like I was a bit hasty; it looks high quality.

Do they take newbie questions? If so, I'll see whether I can convince them to say something sane about implementing parser combinators in C++.

You should run some

You should run some searches first. Parsing comes up quite often on this list. And the website has a full 18-year history, all searchable.

Too far out of their scope

Doesn't look like I'll get an answer there; the answers I scanned don't go further than what I know. Maybe a very, very good C++ template programmer could show how to implement the specific parser combinator library I want without cruft.

I guess in the end it's not a parser question but a 'how do I express these HOFs in C++ such that they give a reasonably usable library' question.

I doubt it's doable but it wouldn't hurt to check.

Doubtful

I don't know how I got to be so cynical, but based on my experience, the idea that "a very, very good C++ template programmer" could show how to implement something "without cruft" is pretty funny.

I don't mean any disrespect to those who've invested in those skills, it's impressive and useful, but the entire edifice is cruft.

Parser Combinators

What would you like the parser combinators to do?

Spec

I bootstrapped my compiler. I shouldn't have but I did. And I would like to see whether I can use the code as a spec for implementing an interpreter in C.

The spec of the parser combinators is in another post. (I find it readable, but if not: the things in the square brackets are pattern-matching rewrites.) Basically, under the hood, any parse can succeed, fail, or fail with an exception; you then chain these together by trying either alternatives or sequences of parsers.

I think I now know just enough C++ to implement it myself, but I wouldn't be very sure about the memory management of all the temporaries (parse results) being created.

Secondly, you also usually create intermediates which store a parse result up to a certain point and may be discarded when another parse sequentially fails. So memory management becomes nontrivial there.

Memory Management?

I'll take a look; I have a little parser that I have used in a few things, so I'll see if I can adapt it. Regarding the memory management comment, that sounds like very old-style C++. I'm sure you know this already, but you never need to call new (and I would suggest you should not) unless you are implementing a new container. In modern C++ you can return containers with zero cost (the compiler is clever enough to automatically use move semantics when the local copy is going out of scope), and you can declare containers on the stack and pass them in by reference. The key is making sure containers have value semantics (i.e. they can be copied and assigned as if they were primitive types), which all the standard-library containers are; again, this is only relevant if implementing new containers, and you should prefer to use the standard ones.
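
As a tiny illustration of the zero-cost return (tokenize is just a made-up example):

#include <string>
#include <vector>

// Built locally and returned by value: the compiler moves the vector out
// (or elides the copy entirely), so no deep copy of the elements happens.
std::vector<std::string> tokenize(std::string const &src) {
    std::vector<std::string> tokens;
    // ... split src and push_back the pieces ...
    return tokens; // moved or elided, never deep-copied
}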

I generally pass the AST in by reference and add to it incrementally. AST nodes have vectors to contain their children. You use subclassing where you would use disjoint unions in functional languages (you can do a more or less mechanical translation, with the type being the base class and the constructors as subclasses of the base class; this keeps to a shallow two-level hierarchy, just like you would implement in a functional language). I generally use the visitor pattern to implement operations on the AST.
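
As a rough sketch of that mechanical translation (Expr, Num, and Add are invented names):

#include <memory>
#include <vector>

struct Num; struct Add;

// One visit overload per 'constructor' subclass.
struct Visitor {
    virtual void visit(Num &) = 0;
    virtual void visit(Add &) = 0;
    virtual ~Visitor() = default;
};

// The 'type' becomes the base class...
struct Expr {
    virtual void accept(Visitor &v) = 0;
    virtual ~Expr() = default;
};

// ...and each constructor of the disjoint union becomes a subclass.
struct Num : Expr {
    int value = 0;
    void accept(Visitor &v) override { v.visit(*this); }
};

struct Add : Expr {
    std::vector<std::unique_ptr<Expr>> children; // child nodes in a vector
    void accept(Visitor &v) override { v.visit(*this); }
};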

C++ Parser Combinators

So I started looking at this, and at the moment I don't see the advantage of parser combinators in C++, as they make everything far more complex. You can get almost the same with a parser class (encapsulating the stream state and row/col information for error reports) with a couple of simple methods:

bool accept(char_pred const &t, string *s = nullptr);
bool expect(char_pred const &t, string *s = nullptr);

char_pred is a function object for filtering a single character, and the output is put into the string (if it is not null). Because the methods return bool and the state is in the object, they can be combined using the normal short-circuit and, or, and comma operators. You can use while to match multiple times. Some examples:

return accept(is_backslash) && (space(v) || accept(not_eof, v));
while (parse_escape(v) || space(v) || accept(not_rsqbrkt_or_eof, v));

You can even use this in an expression like this:

bool parse_value(string *v = nullptr) {
    return accept(is_lsqbrkt) && (
        ({ // GCC statement-expression extension
            while (parse_escape(v)
                || space(v)
                || accept(not_rsqbrkt_or_eof, v));
        }), expect(is_rsqbrkt) && (space() || true));
}

but it may be more readable using a slightly more explicit control flow:

bool parse_value(string *v = nullptr) {
    if (accept(is_lsqbrkt)) {
        while (parse_escape(v) || space(v) || accept(not_rsqbrkt_or_eof, v));
        return expect(is_rsqbrkt) && (space() || true);
    }
    return false;
}

The whole library is 207 lines, and that includes character predicates for all the 'C' isspace-style operators, and some utility parsers for names, numbers, signed numbers and spaces; no templates needed.

Typesafe (higher-order) LL(k) attribute grammar descriptions

Is what it gives you. But, granted, it's a difficult thing to express, or use, in C++. Lots of people have tried, and failed, to implement (weaker) usable parser combinators in C++.

It's more an exercise in how far you can push C++ than anything else; though it is somewhat of a weakness of the language that you can't do it that easily. But most C programmers won't like that it'll probably give you something in the MB/s instead of the GB/s range, so whatever.

Maybe I'm not getting this

How is this not equivalent to parser combinators? In Haskell you need to use the monadic approach, as the state needs to be threaded through the computation. In C++ we can store the state in a mutable object and provide the parsing functions as methods of that object. The choice and sequence combinators:

bool choice = parser1 || parser2
bool sequence = parser1 && parser2

provide the same functionality as parser combinators but use the language builtins. Really it's just a different mechanism for maintaining the state between parsers (object vs monad). With the mutable object we do not need to replace the '||' and '&&' operators, as the state is passed implicitly in the object to which the parser1 and parser2 methods belong. In my example the parser is implemented as a subclass of the parser library, hence the methods do not need to be prefixed with the parser object. In Haskell we need to implement new monadic versions of '||' and '&&' for each different monad, which is where parser combinators come from.

It seems to me parser combinators are a response to a set of language limitations (features), and are idiomatic of functional languages. State encapsulation in mutable objects is the C++ idiomatic way of doing this. Trying to use the wrong model just introduces unnecessary complexity and layers of abstraction.

Having said that, there are some efficiency improvements using function-objects for parsers (and parser combinators) in C++ due to the ability to inline function-object definitions. So, I'm having a go at converting my parser library to this style for comparison.

I don't do monads unless necessary. And don't do this in C++

I really don't like to discuss the monadic aspects of parser combinators. If you want to write good functional programs fast, it helps a lot if you don't over-concentrate on algebraic observations; if you need to thread state, something you should avoid at all costs, then use the monadic idiom, or a self-defined variant of it. Monadic programming is just a lousy fix for I/O and for people who can't. (Please don't inform other people about my opinion on monads. Some get confused easily.)

Uhm. I am going to have a go at writing these combinators out in C++ myself but I would strongly advise against going in this direction.

Unless you're really, really, really good. Meaning, so good that you fully understand what's going to happen to all the local temporaries created and the underlying space you're skipping over, AND you know how to overload C++ operators such that you can create (in a Boost.Spirit kind of manner) concise grammar descriptions.

Don't do this unless you want to publish a paper, or blog post, somewhere on it where you can abstractly discuss the merits, or disadvantages, of C++. It's an academic problem; I would never encourage anybody to try it in C++.

It's very likely to become a gigantic mess if you try this. The language isn't suited for it.

Partial implementation.

So I have put an initial attempt to implement my small parsing library as parser combinators here:

https://github.com/keean/Parser-Combinators

Here are what some simple parsers look like:

auto const is_minus = is_char('-');
auto const space = many1(accept(is_space));
auto const number = many1(accept(is_digit));
auto const signed_number = option(accept(is_minus)) && number;
auto const name = accept(is_alpha) && many0(accept(is_alnum));

Currently the parsers read input from a stream wrapper (which counts chars and lines) and output strings. I am looking at ways to incorporate non-string output in the results.

For someone who did it

Look here.

I didn't take a close look yet so I don't know what he's doing in his approach.

Well. He's doing something else. No idea what he's smoking.

Oh that's really good

Ah. I am reading your code now. Very good code. I've got the feeling you can do it a bit simpler but I am not sure yet. (Why the either type?)

Ah, you don't do attribute grammars. Hence the string remark.

Yah, I think, hope, you can do it like I described in the other post. You basically need these primitives where 'parser a' things are functions, or classes with associated functions:

def <->: parser a -> (a -> parser b) -> parser b // any synthesized information a goes into the next production rule as analytical information
def <*>: parser a -> (a -> parser b) -> parser b
def <+>: parser a -> parser a -> parser a

And something like:

def success: a -> parser a // the attribute grammar epsilon rule
def failure: parser unit // Prolog's cut rule; rarely, if ever, used
def exception: parser unit // rarely, if ever, used

Some people have programmed out monads, and as you can see the first two operators follow the type of monadic bind. It probably all boils down to looking at some of their code, instances of monads, and just rewriting it in such a way that a state can be threaded and a parse result can be inspected.

It's a 95% solution, but I am not sure whether the use of anonymous functions as arguments to bind, to express attribute information flow, would render the result unreadable in C++.
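
For what it's worth, a minimal sketch of that bind-with-anonymous-functions style in C++ (C++17 for std::optional; all names invented, and the parser type is purely functional over strings, which is slow but keeps the shape visible):

#include <functional>
#include <optional>
#include <string>
#include <utility>

// A parser takes the remaining input and yields (attribute, rest) or fails.
template <typename A>
using P = std::function<std::optional<std::pair<A, std::string>>(std::string)>;

// Monadic bind: run p, feed its attribute into f to get the next parser.
// (Template arguments must be given explicitly, e.g. bind<char, int>(...).)
template <typename A, typename B>
P<B> bind(P<A> p, std::function<P<B>(A)> f) {
    return [p, f](std::string in) -> std::optional<std::pair<B, std::string>> {
        auto r = p(in);
        if (!r) return std::nullopt;
        return f(r->first)(r->second); // thread the rest of the input
    };
}

// The epsilon rule: succeed with attribute a, consuming nothing.
template <typename A>
P<A> success(A a) {
    return [a](std::string in) {
        return std::make_optional(std::make_pair(a, in));
    };
}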

Parsing non-AST data

Although I have used this parser in its non-combinator guise for languages, I also use it for reading data files; for example, conjunctive normal form for my SAT solver, and smart-game-format for my Go engine. Although SGF could be read into an AST, CNF files are best read straight into a vector (of vectors). So I think I would need something more general than attribute grammars. At the moment I am thinking of a parser combinator that takes a parser returning a string and reads the argument into an arbitrary other type.

Oh, but you can?

Any definition

def R: a -> parser b

can just be thought of as an attributed production rule for R where analytical data a flows in (from the top) and synthesized data b flows out (from the bottom.) Abstractly, I just used the monadic idiom for bind to patch together specific manners of fixing the data flow around sequences of production rules. It's looking at a type from a different perspective.

It can be any data of any type. It doesn't need to be an AST, though in a compiler it usually is.

Sorry for the attribute grammar rehearsal but that's just a simple mental model for thinking about data flow around a (recognized) production rule.

If you can live with LL, it really, really doesn't get more general than this though it may become a pain to work with in C++. (You can even describe grammars which take grammars as arguments. Never seen that before.)

I guess you can do what you proposed, but I don't see how you're going to patch together the pieces which the string parsers return. Though I guess if you only need to read sequences of numbers that'll work.

Simple attempt to make it more general

I have fixed the rvalue handling, added a simple attempt to handle more complex parsing, added some new combinators, and got rid of some unneeded ones. I have included an example parser that reads an integer. You can use this, for example, to parse a CSV:

auto const parse_csv = many(
    lift_vector(sepby(
        lift_vector(parse_int),
        option(space) && accept(is_char(',')) && option(space)
    ))
);

vector<vector<int>> a;
bool success = parse_csv(in, &a);

Looks like an LL(1) recognizer over strings to me

Yeah. I scanned the code once again. I am studying lvalue and rvalue semantics in C++ at the moment. Funny how you find Algol remnants back in C. But with the references, the template mechanism, type hacks and other semantics it has developed far further than I thought. And the local form of discourse they use to reason about type hacks. Hmm..

Looks like you programmed an LL(1) recognizer over strings to me; I don't see backtracking on the input. That's cool. And it's fine for a lexer, but I want more.

Polymorphic in parse result

It's not over strings; the parser result is polymorphic. To show this I have just added a "lift_map" parser, which enables any parse result to be inserted into a map with a label:

auto const parse_line = sepby(lift_vector(parse_int), option(space)
    && accept(is_char(',')) && option(space));
auto const parse_csv =
    lift_map(make_pair(string("line1"), parse_line))
    && lift_map(make_pair(string("line2"), parse_line))
    && lift_map(make_pair(string("line3"), parse_line))
    && lift_map(make_pair(string("line4"), parse_line));

The result type of this parser is a map<string, vector<int>>. Both "lift_vector" and "lift_map" have the drawback that all the parsers have to be the same type; what is needed is a "lift_tuple". Of course you can always define a custom parser to return a specific object/record type, which could be an AST node, but a tuple (or a tuple of name-value pairs) would be more generic, maybe with type-level labels.

I don't see a need for LL(k) with k > 1 at the moment. Any particular motivating examples?

I don't see a need for LL(k) with k > 1 at the moment

Not really. It depends mostly on whether you're in control of the language you need to parse. I just don't like to do left-factorization manually, so I avoid it by backtracking LL(k) if I can. This has the additional advantage that your parser code can usually just follow the original grammar.

Optional suffixes in programming languages can be a pain with LL(1) though. The start of an optional suffix to something can often be just the start of another production rule somewhere higher. It happens a lot when you start 'overloading' a character, like '.' in an OO language, to have several uses.

I don't have the feeling you will solve your typing problems unless you do something which has the monadic bind type, like I have shown above. It chains production rules with several attribute types into one production rule with the correct type; you can't really do that in another manner. But I don't get all of your code yet.

(But I guess you can split it. You'd need something like

def &&: parser a -> parser b -> parser (a, b) // sequencing

And a different 'epsilon' rule which will just modify the attributes:

def epsilon: parser a -> (a -> b) -> parser b // twiddle the attributes but don't parse input

Effectively separating the combining and update facilities of the monadic bind combinator type.

To match a parenthesized expression you end up with code like:

match '(' && expression && match ')' epsilon(\((l,e),r) -> e) // pseudo code

But that will be a pain too. Pairing all the synthesized attributes of a sequence of production rules is overhead, you manually need to check that your pairing matches the combined production rules, and it's somewhat unreadable when you need to decompose that result. But likely more readable in C++.)

Associated types

Each parser object has an associated type value_type that indicates to other parsers the type it produces. This is like the 'a' in Parser a. For the simple parsers (accept / expect) this is a simple type, but for the combinators it is a type-level production rule, for example:

typedef pair<typename P1::value_type, typename P2::value_type> value_type;

would be for a parser Parser a -> Parser b -> Parser (a, b)

Currently my sequencing operator does this:

(&&) :: Container a => Parser a -> Parser a -> Parser a

Where 'a' is constrained to be a container that implements the push_back method. However the alternative combination rule:

(&&) :: Parser a -> Parser b -> Parser (a, b)

is equally possible. I think you probably need both, and my suggestion at the moment is a variadic tuple, taking any number of parsers:

lift_tuple :: Parser a -> ... -> Parser z -> Parser tuple<a ... z>
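
To sketch how the associated type could propagate through a pairing sequence combinator (seq_parser is an invented name, and rewinding the input when p2 fails is omitted):

#include <utility>

template <typename P1, typename P2>
struct seq_parser {
    // the type-level production rule: Parser a -> Parser b -> Parser (a, b)
    using value_type =
        std::pair<typename P1::value_type, typename P2::value_type>;

    P1 p1; P2 p2;

    template <typename Stream>
    bool operator()(Stream &in, value_type *result) const {
        typename P1::value_type a;
        typename P2::value_type b;
        if (p1(in, &a) && p2(in, &b)) {
            if (result) *result = {std::move(a), std::move(b)};
            return true;
        }
        return false; // note: input consumed by p1 is not rewound here
    }
};

template <typename P1, typename P2>
seq_parser<P1, P2> seq(P1 p1, P2 p2) { return {p1, p2}; }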

Yah

Conceptually I agree you don't want tupling but something where ((a,b),c) = (a,(b,c)) = (a,b,c): Ta x Tb x Tc.

No idea if that flies in C++.

But the advantage, as I noted before, is that if you separate combining grammar descriptions from the attribute calculations you can do something 'stronger' like calculate first sets.

I had the feeling this is what they did in the C# parser combinator library. No idea though.

Applicative

What Keean is describing with the tuples is essentially part of a variation of the Applicative class.

class (Functor f) ⇒ Applicative f where
  unit :: f ()
  prod :: f a → f b → f (a,b)

class Functor f where
  fmap :: (a → b) → f a → f b

This is equivalent to the Applicative class used in Haskell.

He could probably flesh it out the rest of the way. It's still not as expressive as monadic bind, but applicatives are pretty useful too.

The whole `((a,b),c) = (a,(b,c)) = (a,b,c)` is usually a bad idea for generic programming. But I've done it with templates before (via template specialization).

This is a deep and profound insight!

Great thinking.

The whole ((a,b),c) = (a,(b,c)) = (a,b,c) is usually a bad idea

But it would help here, if you want to go in the direction of separating the construction and attribute evaluation. (I, of course, mean that this is what you want to generalize to tuples of values and types.)

You're still stuck with the fact that attributes of production rules are essentially bound at the wrong place. But I agree you can probably write stronger parser combinators this way.

(It's actually funny that you can express type constructs in C++ you can't in Haskell. Or can you?)

Out arguments

In parser generators like ANTLR/OMeta they have a neat solution to this. e.g. in OMeta:

<num>:a $+ <num>:b  =>  [a+b]

So when you call a rule within a rule, you can name the result with <rule>:result. You can then use those variables on the right-hand side to compute the result (usually you'd build an AST there). Maybe you can make this look almost as good in C++ with by-reference arguments.

what's the point?

That's very similar to what I am doing, but it makes me question the whole parser combinator thing again. The combinators are generative, not applicative. With a simple parsing method we get:

int a, b;
if (parse_int(a) && parse_int(b)) {
    return a + b;
}

With combinators which are generative we can parse two ints:

auto const parse_int2 = parse_int && parse_int;

pair<int, int> i2;
if (parse_int2(in, i2)) {
    return i2.first + i2.second;
}

You end up trying to push more and more into the generative code, so that you can do something like:

auto const parse_int2 = parse_int + parse_int;

You could even add meta-variables. But you have just implemented a whole heap of code to get the generative form to look like the applicative one. In a functional language you do this to hide the state and the input stream, but in an imperative OO language it is already hidden in the object that has the parser methods. You end up re-implementing conditional and looping constructs in your generative sub-language, when you could just use the applicative versions and save all that code.

One difference is that in

One difference is that in the imperative version it's harder to do backtracking since the parse functions imperatively modify the current location in the input. But if you can address that I don't see any reason why it would be bad. Parser combinators are functions of type input -> (rest of input, result | error) | fail after all. You can model that in lots of equivalent ways, one of which is to pass the result value via a by reference argument, and the error value by throwing an exception.

The point?

Well. I want to trick C++ into an identity crisis and make it believe it's Bison.

Polymorphic fold over parsers

There's something about parser combinators that keeps me coming back: the generative model lets you implement things like 'many' and encapsulate control flow. For example, the following is a common pattern:

if (accept(is_x, &v)) {
    while(accept(is_x, &v));
}

And with plain old (applicative) parsing functions you can't do the higher order stuff in C++, so the generative / combinator approach looks good at first. However, you quickly get sucked into a combinator explosion. I think if you can get the combinators right there might be something small and elegant to use in there (although the implementation type-mechanics is not).

So the question is what is the minimal, sufficient set of combinators?

To that end I have implemented a polymorphic fold over parsers in C++. This lets you supply a polymorphic reduction function of the form:

template <typename T> struct acc_int {
    using value_type = RESULT_TYPE;
    void operator() (value_type *a, T b) {}
};

The result is passed by nullable-reference (aka a pointer), so the type is almost the same as a functional fold: reduce :: (a -> b -> a) -> a -> [b] -> a. (I will use the same type-letters for the rest of this description) However the initial value for the accumulator is passed on application, not generation, and the function argument 'f' returns the result by reference, so the generative type is reduce :: (f(a, b), P b...) -> P a where P b is a parser that returns type 'b'. The applicative part is the same as all the parsers: (in, a) -> bool.

So a simple polymorphic higher order function to illustrate would be:

template <typename T> struct acc_int {
    using value_type = vector<int>;
    void operator() (value_type *a, T b) {}
};

template <> struct acc_int<int> {
    using value_type = vector<int>;
    void operator() (value_type *a, int b) {
        a->push_back(move(b));
    }
};

This accumulates integer parse results into a vector, ignores parsers which do not return integers, and can be used to parse a CSV like this:

auto reduce_int = many(reduce<acc_vector>(
    many(reduce<acc_int>(parse_int, accept(is_char(',')) && space))
    && reduce<acc_int>(space)));

Generative vs Applicative. And don't.

I don't really get what you mean exactly with the distinction between generative and applicative so it would be helpful if you could explain that to me.

I don't know exactly what you're trying to solve but it doesn't look helpful to me. From experience I would say that:

  1. It's a good idea to allow for mixed typed attribute rules
  2. It's a good idea to do second level, attribute, calculations in the host language

The monadic or functorial combinator style previously discussed will solve that problem for you. BUT, for N sequential production rules with N attributes, the monadic style forces you to write N nested anonymous functions, which is a pain in C++. AND, for N sequential production rules with N attributes, the functorial style forces you to dissect an N-nested tuple. Both are cumbersome.

I don't see a lot in unifying the attribute types of production rules. It might solve trivial problems, but the more complex things I think you want to do will become kludges.

I think I am going for the functorial approach, since it allows creating a grammar described as objects, but I am lazy at the moment.

(I think you're solving the wrong problem. I am more wondering what state, from an OO point of view, is handy to store in the parser and whether it is even a good idea to want that.)

(Ah. Hmm. My mind's eye doesn't really see how to express analytical information with the functorial approach. Hmmm.)

only functors not monads

Without closures, monads in C++ are going to be difficult and ugly. It might be possible using a map in the hidden state, or using nested lambdas, but I don't think you would want to use the result.

Generative vs applicative is about when the execution happens. Functions are applicative: you apply them to values and get results. Some functors (like C++ function objects) are also generative, in that they generate stuff.

The "many" combinator is a good example:

many(accept(is_space))

Looked at applicatively, if many and accept are ordinary C++ functions, accept will be run once before many and its result passed in, which is not what you want. You can pass a function pointer in, but how then do you provide the arguments to accept? Many could take a character as its second argument, but that would severely limit its uses. (edit: this might work with a variadic template)

Using it generatively is one way to solve the problem. accept(...) is the object constructor and generates an object containing the argument as a member. many takes this generated object as an argument, and returns a new generated object that, when applied, applies the accept function object as many times as needed using operator().
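
A minimal sketch of that generative shape (all names invented here; the actual library linked above is organised differently):

#include <cstddef>
#include <string>

using char_pred = bool (*)(char);

// Generation time: accept(...) just stores the predicate in an object.
struct accept_t {
    char_pred pred;
    explicit accept_t(char_pred p) : pred(p) {}
    // Application time: operator() consumes one matching character.
    bool operator()(std::string const &in, std::size_t &pos) const {
        if (pos < in.size() && pred(in[pos])) { ++pos; return true; }
        return false;
    }
};
inline accept_t accept(char_pred p) { return accept_t(p); }

// many takes the generated object and loops over its application.
template <typename Parser>
struct many_t {
    Parser p;
    explicit many_t(Parser q) : p(q) {}
    bool operator()(std::string const &in, std::size_t &pos) const {
        while (p(in, pos)) {} // apply as many times as needed
        return true;          // zero or more: always succeeds
    }
};
template <typename Parser>
many_t<Parser> many(Parser p) { return many_t<Parser>(p); }

inline bool is_space(char c) { return c == ' ' || c == '\t' || c == '\n'; }
// usage: auto const spaces = many(accept(is_space));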

I don't understand your two points. The polymorphic fold is precisely to allow a polymorphic reduction function. You provide a reduction rule in the host language (specialised template function objects) and reduce applies it in turn to each of the parsers. The correct template function object to apply is selected by the result type of the parser. The example I give pushes integer parse results into a vector and ignores other results. It could populate a shared structure with different typed members.

This has been an interesting

This has been an interesting discussion so far. It's given me a few ideas.

You can pass a function pointer in, but how then do you provide the arguments to accept? Many could take a character as its second argument, but that would severely limit its uses. (edit: this might work with a variadic template)

If all arguments are generalized to lists of characters, then this doesn't seem problematic. Then you just need combinators to convert lists into the data types you want.

Edit: although I should note that it's probably better if many is not provided thus requiring users to write their own looping function using the base combinators. It's not that bad.

Ah well

Well. It was nice for me to find out wtf I was doing. But I decided that what I encoded doesn't have an easy translation in C++, neither do I see myself arriving at a library people will use. Too specific and people probably won't get the mental model. Shame really.

Embedded prolog, codata, and parsing

I have a Prolog interpreter written in C++, and I was thinking that it could be used to write backtracking parsers if you could find the right embedded syntax.

Prolog = grammar with unification on attributes

Yeah well. The combinators I have now have directed flow, which is what I needed, whereas Prolog has DFS on the unification of attributes, with somewhat undirected flow, where simple tests are epsilon rules. It got me thinking too about how to express that functionally, and I guess you could fix it with generators for the attributes and general unification. Got the cut rule wrong though.

What I wanted to say

Prolog combines DFS with (backtracking) unification. Unification is more expressive than rewriting; it's another operational model, but it's usually overkill for compilers, where you simply want to process the information as you derive it.

It might be interesting for natural language processing or testing more elaborate typing schemes but I simply don't see much use for it in compiler technology.

Compiling is Prolog

Unification is the core of type inference and type checking, so you are going to have it for that. If you want backtracking for parsing too, then you find both of these stages fit into that model. Prolog can also fit nicely with the optimisation (which can be viewed as partial evaluation/partial deduction). You can write very neat short compilers in prolog, although that is not quite what I am suggesting.

The idea is that if many stages of compiling fit the same model of computation, you can abstract this out, and leave the rest simpler and easier to understand.

- Parsing: backtracking
- Type inference: unification, maybe backtracking for type classes.
- Type checking: unification
- Optimisation: partial deduction, which is theorem proving and can use PTTP model elimination. For example, see register allocation as graph colouring. I think you can go further here and use answer-set programming to express the graph colouring in a few lines, and effectively have a SAT solver do the hard work. Not only is this an elegant modular design, it will probably outperform hand-rolled approaches by leveraging all the advances in SAT solving in recent years.

About the only part of compilation that I don't see neatly fitting this model is code generation, but post optimisation this can be trivial anyway. All the code generation layer does is output opcodes, and provide a description of those opcodes to the optimiser. You can view this as: parser generates a source language AST, which then has type inference/checking done, then the optimiser transforms the source language AST into the destination language AST, which is trivially output by the code generator.

This is the compiler architecture I am currently working on. So far I have modular type checking in embedded Prolog. I plan to slowly upgrade this to an embedded alpha- or lambda-Prolog, which allows HM type inference to be described in 3 (short) lines. In effect the compiler can read the type checking rules from the source file, effectively allowing custom extensions (or entirely different type checking) to be used on a basic lambda calculus. Type classes can easily be expressed as predicates in the meta-language. I am just starting to look at optimisation at the moment, but already have an embedded SAT solver (an API can allow an external SAT solver, but I implemented my own in order to better understand it, and how that implementation may relate to other parts of the compiler).

Did anyone ever bootstrap Prolog?

And got a working, performant system? (I thought SWI-Prolog, but it looks like it's written in C.)

I mean, you have nice academic ideas, but I wonder if anyone ever pushed Prolog into a bootstrap. As I said, and you seemed to agree, I think you can abuse it for some language exploration, primarily on type systems, since you get backtracking unification for free, but I doubt anyone ever got a real performing bootstrap to work. Just wondering.

Bootstrapping Prolog

Warren definitely implemented Prolog compilers in Prolog, although I don't think the WAM was written in Prolog. Erlang's JAM was written in C, and the compiler in Prolog, to bootstrap it.

Why a polymorphic fold?

What I want is for the parsers to return the AST, and not a tree consisting of vectors and tuples, as it seems more efficient to build one node at a time than to return a massive vector/tuple structure. However I don't want the user to have to write their own parser combinators, because the template stuff is complicated. So the idea was to allow a polymorphic fold, or polymorphic map, to insert the user-specific code without them having to write a full combinator. Of course you can just use 'seq' to return a tuple/vector tree, and process the output at the top level; it's an extra function, not taking anything away.

Sometimes LL(k) for k>1

Sometimes LL(k) for k>1 results in a more natural expression of a parser.

For example, my current language has a rule to avoid visual confusion between words and numbers: a word starting with '+', '-', or '.' must not follow this character with a digit. Numbers may not actually start with '+' or '.', but they may start with '-' for negatives. Thus, upon reading '-', I don't know whether I'm parsing a number or a word.

Technically, I could implement this as an LL(1) grammar, by having a special global rule to handle all symbols starting with '-', returning either a number or a word. However, I find it more natural to express and organize this as a composition of two separate LL(2) rules - one to handle all numbers, and one to handle all words.

Presumably, a grammar optimizer could reduce this back to LL(1). But most parser combinators don't do that sort of optimization. (Though, it is certainly feasible to create optimizing combinators using staged computation. A language with explicit JIT might even make it worthwhile, for performance.)

Much of traditional parsing

Much of traditional parsing theory is not useful for parsing programming languages. Context free grammars are not the right formalism since they have the following three problems:

  1. They are too powerful: the ability to express ambiguous grammars gets in the way when parsing programming languages. It also gets in the way of generating good error messages.
  2. They are not powerful enough: you can't express context sensitive grammars, which you need for things like user defined operators.
  3. They don't have sufficient mechanisms for abstraction. Programming language grammars, when expressed as a context free grammar, have a lot of duplication.

Greedy parsing is the way to go: PEGs, greedy recursive descent, or greedy parser combinators. Unlimited lookahead, but greedy sequential composition. If you have a rule S = A B then A parses greedily as much as it can, and then B parses the rest. These kind of grammars are deterministic: there is at most 1 way to parse an input. This makes generating error messages far easier, since you can always point to the location in the input that caused the parse to fail, unlike with context free grammars. It's easy to express context sensitive grammars, and it's easy to add an abstraction mechanism: grammar rules can be parameterized by values in the host language. e.g. you can have a rule infix(S) where S is the set of operators with their precedence.
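
For example, a sketch of such a parameterized infix(S) rule (everything here is invented for illustration; precedence-climbing style, greedy and deterministic):

#include <cctype>
#include <cstddef>
#include <map>
#include <string>

struct OpInfo { int prec; };

struct ExprParser {
    std::string in;
    std::size_t pos = 0;

    int primary() { // just unsigned decimal numbers, for brevity
        int v = 0;
        while (pos < in.size() && std::isdigit((unsigned char)in[pos]))
            v = v * 10 + (in[pos++] - '0');
        return v;
    }

    // The rule is a host-language function taking the operator set as a value.
    int infix(std::map<char, OpInfo> const &ops, int min_prec = 0) {
        int lhs = primary();
        while (pos < in.size()) {
            auto it = ops.find(in[pos]);
            if (it == ops.end() || it->second.prec < min_prec) break;
            char op = in[pos++];
            int rhs = infix(ops, it->second.prec + 1); // greedy, left-assoc
            lhs = (op == '+') ? lhs + rhs : lhs * rhs;
        }
        return lhs;
    }
};

// usage: ExprParser{"1+2*3"}.infix({{'+', {1}}, {'*', {2}}}) evaluates to 7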

Hooray for USENET!

Hooray for USENET! Comp.compilers was/is definitely the place. I was sure it was dead, or at least much less frequently visited. I certainly learned a lot by reading it religiously for several years.

Not very active

As far as I see.

I also want a place to talk about that. I have some answers like http://programmers.stackexchange.com/questions/209382/why-is-float-the-default-in-the-majority-of-languages and http://programmers.stackexchange.com/questions/234900/go-like-interfaces-multi-methods-make-sense and others around, but none are proper places to talk at length (and find people with the same passion).

For example, I still can't understand how to do this: http://stackoverflow.com/questions/20365649/how-create-a-debugger-for-self-made-language

Instead of the common path of building a language (do the lexing, parse, define the grammar) I want to start with the hardest things: the debugger, the REPL, and the stuff around them.

What about starting a forum? Where would be the best place?

Well, that's my entire

Well, that's my entire research area (throw in IDEs + code completion), but I'm not sure of any place that talks about the traditional ways to implement debuggers or REPLs.

Good to think about debugging early.

You can almost talk about that here. If you can make it about programming language(s), it's on-topic enough for a forum discussion topic here at LtU. The PL angle is a sort of filter for what's suitably on topic, though folks do veer off on side topics. Online communities need some gardening to keep them healthy, so odd content must be treated like weeds to be pruned, lest chaos ensue and regular members depart to escape noise. When the ObPL (obligatory programming language) element is present, you can probably keep going if someone wants to discuss an idea. Your goal should be to get PL into posts. For example, a debugger DSL (domain specific language) might keep debugger implementation on topic indefinitely.

Instead of the common path of building a language (do the lexing, parse, define the grammar) I want to start with the hardest things: the debugger, the REPL, and the stuff around them.

That's a good idea, since debugging is something PL devs should worry about. Best would be some kind of standard debugger for a language that was cross-platform, and not tied to one implementation, specified as some kind of abstraction of the execution model.

But you either need to use an existing IDE or write your own, since at minimum you need to be able to edit and/or view source code when debugging. Otherwise it might be so clunky folks avoid your tool, the same way no one prefers to use ed instead of vi as an editor on Linux. You might be able to work up integration with other tools like emacs. For example, emacs has a gdb mode, where gdb is the debugger, but emacs navigates to source code that gdb says is being viewed currently. Last I looked a while ago, the MacOS XCode IDE also used gdb as the actual debugger, while XCode acted as a front-end.

Actually messing around with gdb might be mind-numbing though, unless you've been working with the open source parts involved. I find the utils around managing symbols creepy myself. (Suppose you wanted to write your own utils to deal with symbols in executables, and started by looking at existing tools to think about cloning things in a rational, orderly manner. If your skin crawls in horror, that's what I mean by creepy.) If you're thinking about compiling to native code, and folks will actually use gdb to debug, you might not have any choice but to ensure gdb gets suitable debug symbols it expects.

Rolling your own editor as part of the process sounds painful, so I suggest using some widget library as a frontend, or even a browser as the user interface, since that would at least be very platform independent. You want the frontend of the debugger to talk to the backend that is actually executing. If you write the executing engine, you can make the way they communicate anything you like. This is where you can make the means of coordination a language if you want. You could try using a PL to coordinate front and back ends of the debugging tools. Old traditional dumb debugging tools deal a lot with line numbers, but you can do better than that, if you want to track offsets of expressions.

If you compile to some other language, like C, then its compiler copes with generating debug info for the debugger used for that language. It's possible, but maybe unpleasant, to support source level debugging of your original version at the same time as debugging the low level language you target. (I think I have more to say, but here's a good place to pause and take a breath.)

Well, (and sorry because I'm

Well, (and sorry because I'm naive about this stuff) I have been thinking of mimicking the way I used to work in Visual FoxPro, which let you have a combined REPL + debugger + visualizers (e.g. show a table in a grid) all at the same time. It was the most intuitive debugging experience I have used so far (even counting Python).

I took a look at Lua (https://github.com/slembcke/debugger.lua), and it looks like it's possible to build one on top of it. I have been thinking of using ZeroMQ as the protocol to control the debugger (and maybe be able to do it remotely).

But the problem for me is how to do the minimal stuff. Yes, I see the Lua debugger code, but then how can I use my own language with a different syntax? I.e., I don't know how to attach a debugger to a language, only that some languages have a debugger.

From there, being able to break and do step-in and step-out, I think having decent introspection in the language is my next goal.

For example, if I have a function called "sum" and I step inside it (all immutable):

- this.name tells me the name of the thing that wraps my scope
- this.params tells me the params
- this.callstack the stack
- global.vars, etc.

and stuff like that. If the debugger backend is small but the introspection is good, I could eventually connect it to a HTML5 front-end and build visualizers for the stuff, or perhaps do the FoxPro thing and provide a secondary screen in a terminal or similar where the REPL, debugger, watches, etc. are available.

But all of that depends on knowing how to start...

framing the problem

It sounds like you're approaching this from near the beginning of a new project, and you're still in a "framing the problem" phase, during which you tune the description of goals and sub-goals. You might get something from problem-solving topics you find by Googling "framing the problem". Your original focus above sounds like a "how?" question, but asking "what?" first might be helpful too. For yourself, try writing about what a debugger does, and what must occur for it to do those things. When you write definitions for things — what does it mean to suspend a program? (define suspend) — any basic assumptions you make have a chance to make themselves evident so you can refine requirements.

I find problem-framing a lot of fun: actually the most fun part of development. Your language's compiler or interpreter (output) will execute in a process, either native OS process or a green user-space pseudo-process. When stopped by a debugger, you don't want a debugged process to do anything when scheduled, if it is scheduled, if you support single-stepping. You might want to write (as a plan) about how you intend to have your style of process stopped until stepped, and how this relates to events causing a break in your debugger. If you're going to compile to native object code and debug that, you likely need to study how this sort of thing is done in existing debuggers. I bet it's easier to compile first to a virtual machine of your own devising, so you can define and control every bit of what all of it means.

There's a lot of interesting questions to answer. If your language has green (user) processes, can a debugger run in the same OS process as the one running code you're debugging, because only one green process gets suspended? Stuff like that. The better you write your initial definitions, the easier it might be to resolve complex interplay among features in a work-flow you want. It's easiest to fix bugs when you find a rule that's violated. If you have no rules, finding violations will be hard. The crisper your rule set, the better. If someone writes an app using your language, and it has a graphical user interface, what happens to the GUI when the debugger has a debugged process suspended? Ask "What if?" questions early to think of solutions sooner.

Above it sounds like you know what kind of work-flow you would like to experience when using a debugger, when you're the app developer using this tool. This sounds more like a frontend focus, compared to a language's backend implementation and debugger that may not have a user interface as such, unless and until you add it. You might want to consider the UI as the view in a model-view-controller system, where the app state machine is the model and the debugger is an exotic controller with its own model too. (Even if this doesn't fit, trying to make it fit and seeing what is wrong may be an effective "what if?" exercise.)

Knowing how to start comes from thinking and writing. The writing part is optional, since you can substitute another mechanism for criticizing ideas as you go. It might depend on how many viewpoints you can hold in mind at once. You need at least one dense virtual reader in mind who seems determined to misinterpret everything. If your description still fits a wrong interpretation, it's not a very good description. (Think of it as debugging definitions.) I like to actually write my plans out as text, because I often discover my plan is broken as a result of review in what is basically a rubber-ducking process. First I'll do X, then I'll do Y, and then Z because ... ah, darn, it can't possibly work ... hmm. It's cheap to discover bugs like this early.

Debugging Compiled vs Interpreted

Debugging for object code is hard:

1) Symbol table: All the symbol names that the compiler usually strips out need to be included. If you are outputting to a standard binary format like ELF (this will be platform specific), then you should read the ELF docs on how to put symbol/address maps into the binary. When the debugger loads the binary it substitutes the addresses back into symbols to make it easier to read. Source line references in the output are important if you don't want to just debug the disassembled code.

2) Breakpoint insertion: This is highly processor specific, but the idea is that there is normally a register in the CPU you can flip that will make it fire an exception after every instruction. There is normally a separate register set for exceptions, and you can then inspect the 'user' mode registers (like the program counter) from the exception. When you return from the exception the next instruction is executed. Some processors don't have multiple register sets, in which case you need to push all the registers you are going to clobber onto the stack. Some processors do not have debug exceptions, in which case you need to have a virtual program counter in a register: you inspect the memory pointed to and manually calculate the length of the instruction from the op-codes (which means you must know this for all op-codes), then store the original memory on the stack, overwrite the memory after the instruction with a jump back into the debugger, manually set up the registers and stack frame, then jump to your virtual program counter.
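
To make the single-step mechanism concrete, a minimal Linux-only sketch using ptrace (error handling omitted; other platforms do this differently):

#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, nullptr, nullptr); // let the parent trace us
        execl("/bin/true", "true", (char *)nullptr);
    }
    int status;
    waitpid(child, &status, 0); // child stops at exec
    long steps = 0;
    while (WIFSTOPPED(status)) {
        ++steps;
        // the kernel flips the CPU trap flag: run one instruction, then stop
        ptrace(PTRACE_SINGLESTEP, child, nullptr, nullptr);
        waitpid(child, &status, 0);
    }
    printf("executed %ld instructions\n", steps);
}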

Debugging for an interpreter is trivial:

Instead I would suggest creating a language that can be interpreted and compiled, and writing a specific interpreter for debugging. This allows debugging at the level of the language itself. You should be able to track the source code and environment quite easily. Basically the REPL is a debugger: you read some command (run function, step to next in debug mode), you evaluate the code (in the interpreter), you print the result (and you can print the complete state, environment, etc. in debug mode).
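
A toy sketch of that "the REPL is a debugger" loop for a tree-walking interpreter (every name here is invented):

#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Node {
    std::string op;               // "num", "var" or "add"
    int value = 0;
    std::string name;
    std::unique_ptr<Node> lhs, rhs;
};

using Env = std::map<std::string, int>;

int eval(Node const &n, Env &env, bool debug) {
    if (debug) { // stop before every evaluation step
        std::cout << "step: " << n.op << " > ";
        std::string cmd;
        std::getline(std::cin, cmd);  // empty line = step
        if (cmd == "env")             // inspect the complete environment
            for (auto const &kv : env)
                std::cout << "  " << kv.first << " = " << kv.second << '\n';
    }
    if (n.op == "num") return n.value;
    if (n.op == "var") return env[n.name];
    return eval(*n.lhs, env, debug) + eval(*n.rhs, env, debug);
}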

"creating a language that

"creating a language that can be interpreted and compiled" is something that I have think, specially after see the LLVM library support for both. But the integration with LLDB look scary as like with GDB (plus: I have no expertise in C/C++, only python, obj-c, pascal) and if the xcode experience is the best thing in this area, then it sucks (as I say in the thread above, my gold standard is foxpro in this area).

How "Debugging for an interpreter" is trivial?

By the way, I have nothing

By the way, I have nothing firm yet, just researching this. I have been thinking of using LuaJIT (converting my language to Lua) or LLVM. I will not use raw C/C++ for sure. Maybe run on top of Obj-C.

My top "Is hard, need to solve what to do about" list of "requirements" are:

- Python-like syntax: Easy
- Go-like class system, or Julia's: Julia harder? Do both?
- UTF-8 strings: Easy
- Implement Go-like channels or actors: Hard
- Functional abilities, not crazy about purity: Easy
- LINQ/SQL-like way to query collections: Hard?
- Integrate a database (SQLite) and data manipulation: Easy
- Be able to run on iOS: because if it doesn't run on at least one mobile platform it has no future (IMHO)
- Have a REPL
- Have a (decent) debugger

Whatever makes it easiest to fulfill this list is what I will do.

Code maps & runtime code modification

The two approaches you suggest for debugging have the rather significant disadvantage of decreasing runtime performance and operating at the target language layer (in this case, machine language) rather than the host language. In practice, this loss in performance means that people often won't be using the debug implementation when problems occur, and will need to rebuild and run (slowly!) until they hit the bug again.

An alternative is to maintain a map, a program database (e.g. `.pdb` file), separate from the runtime program, which tracks how the runtime program relates to the original source. If your compiler has multiple intermediate languages, you might need many such maps. When an error occurs in the target language, it can then be mapped back to the source... but without slowing the target down at all in the common case where no error occurs.
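
A sketch of the heart of such a map (a hypothetical, much-simplified format; real `.pdb` or DWARF data is far richer): a table sorted by generated-code offset, consulted only when something goes wrong:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct MapEntry {
    uint32_t target_offset; // offset into the generated code
    uint32_t source_line;   // line in the original source
};

// Entries sorted by target_offset; find the last entry at or before 'off'.
uint32_t source_line_for(std::vector<MapEntry> const &map, uint32_t off) {
    auto it = std::upper_bound(
        map.begin(), map.end(), off,
        [](uint32_t o, MapEntry const &e) { return o < e.target_offset; });
    return it == map.begin() ? 0 : std::prev(it)->source_line;
}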

But a code map doesn't cover everything, like stepping through code or watches. Rather than 'interpret' the entire subprogram from the outside, which is really slow and complicates the compiler with two modes, a useful technique is to modify specific runtime code where needed to call debug functions. One can modify an entire function's code, perhaps relocating the function via a forwarding jump.

Besides offering effective performance, both of these techniques are relatively easy to tie back to the semantics of the source code program. Unless you're debugging the compiler, I think we shouldn't be working with debuggers at the target language layer.

Ok, any place where I can

Ok, any place where I can see this implemented?

MSVC++, for one. Though

MSVC++, for one. Though their `.pdb` (program database) format is proprietary.

gdb can do it too

You can output the symbol table to a separate file, so that it can be loaded by the debugger but the program can run without it.

In practice there is no performance penalty for having the symbol table in the same file as those pages are simply not mapped by the MMU unless used, so I don't understand the original comment.

There is a performance penalty for single stepping, which is what I was describing. Setting breakpoints or trapping exceptions into the debugger has no penalty with the method I was describing.

In practice there is no

In practice there is no performance penalty for having the symbol table in the same file as those pages are simply not mapped by the MMU unless used

If they're kept in separate pages, I agree there isn't a performance penalty. I'm not experienced with ELF (other than as a user). I'm familiar with a few bytecode formats where symbol information was kept with each function or class. The performance can depend on the format.

There is a performance penalty for single stepping, which is what I was describing.

We really don't want single-stepping at the machine-code layer, e.g. 'exception after each instruction' at the CPU. We need it at some higher layer in the source code, e.g. at steps within a function.

More generally, single-stepping and breakpoints is an awful approach to debugging, no matter how conventional. We are better off recording traces, states, input streams so we can review, replay, and compare them side-by-side or in overlay.

I Agree

I agree with this, but that's "state of the art" in most compiled languages. I hardly ever debug like that myself. I generally use logging (conditional on some debug flag) and unit test functions, but I got the impression I was in the minority, and that IDE-based debugging (using single step and breakpoints) was the most popular way.

Just in case it wasn't clear: single-stepping one line of code is normally done the same way as machine-code single-stepping. Effectively the debugger inserts a breakpoint at a machine-code address derived from debugging symbols in the code. If you look at the object files with debugging enabled you will see a symbol for each line number that translates into an address. When you single-step one line of code the debugger sets a breakpoint (using the platform-native CPU features) at the next line, sets up the registers, and jumps into the code. When the breakpoint is triggered, the debugger removes the breakpoint and inserts a new one after the next line, etc.

Well I look at embed dtrace

Well, I'm looking at embedding DTrace or something similar to get the same functionality as "log everything" but without filling the code with logs (without a real need).

Or have a kind of "attach" to the events of the language (this is what kicked off the idea):


# Function hook

def startDef:
    self.cache['start'] = now

def endDef:
    performance.register(self.function.__name, 'time', now - self.cache['start'])

hook(sample, pre = startDef, post = endDef)

# Instrument

server = instrument.Attach('/c:/Proyecto/Programa.exe', ApiKey = '****')

server.hook(sample, pre = startDef, post = endDef)

print performance

But still, the ability to step into the code is too convenient, not only for new developers, but also to check why something doesn't work as expected. An avalanche of log data is hard to grasp...

Logging is not record and

Logging is not record and replay. If you have a sufficient trace for record and replay, you can step through the code. You can also step backwards, or spread code out on a timeline, or compare two traces and highlight the differences, or view it in a variety of different ways.
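
A toy sketch of the idea (the two-field 'machine state' is made up): snapshot the state at every step, and stepping backwards is just indexing into the trace:

#include <iostream>
#include <vector>

struct State { int pc; int acc; };

int main() {
    std::vector<State> trace;
    State s{0, 0};
    while (s.pc < 10) {      // the 'program': sum the numbers 0..9
        trace.push_back(s);  // record a snapshot before each step
        s.acc += s.pc;
        s.pc += 1;
    }
    // replay in reverse: no inverse execution needed, just indexing
    for (int t = (int)trace.size() - 1; t >= 0; --t)
        std::cout << "t=" << t << " pc=" << trace[t].pc
                  << " acc=" << trace[t].acc << '\n';
}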

Search for phrases 'time travelling debugger', 'conversational programming', 'live programming', and 'learnable programming' to see some of the other ideas and IDEs out there. Also check out 'code bubbles' and 'xiki'. Don't let your mind be trapped by mainstream convention.

Just note that you can't

Just note that you can't really step backwards unless your programming model is invertible and therefore energy free. What you can do with record and replay is provide lots of illusions that you are stepping backward.

Illusions or History

Perhaps 'rewind' would be a better description, a metaphor to working with recorded video or sound. You can step backwards, and forwards, but the timeline is already set.

If you are using replay, you

If you are using replay, you can flip back to the last checkpoint and then step forward. If you are recording everything (like in Glitch...), then it is just a matter of running your rendering code in the right time context. But Glitch doesn't really allow for stepping; you are really just scrubbing through events.

Stepping Backwards

Reversible computing is interesting, but requires storing the 'junk' bits to allow undoing. I find complete logging of everything too much, and prefer to insert specific debug logging statements. I can see the use of a trace of everything, but it would need a good UI to work well, as otherwise it is likely to overload you with irrelevant information. I prefer only to monitor the specific things I am interested in.

From a language perspective, a meta-conditional 'if-debug' (or better, debug levels) that generates the code block if a debug flag is set, and does not generate the code (but still type checks it) if the flag is not set (so there is no runtime cost without the debug flag), is the best option for me personally. Include the ability to read meta-data (function names, function call stack, argument values and types, etc.). Combine this with unit tests, such that you can make a test suite as an alternative to the main program or library, which allows all tests or specific individual or groups of tests to be run, and I will be happy.
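
A close C++ analogue of that meta-conditional (kDebugLevel and debug_log are invented names): the branch is always type-checked, but with a constant-false condition the optimizer emits no code for it:

#include <iostream>

constexpr int kDebugLevel = 0; // set > 0 for a debug build

template <typename... Args>
void debug_log(int level, Args const &...args) {
    if (kDebugLevel >= level) {      // type-checked either way,
        (std::cerr << ... << args);  // dead-code eliminated when disabled
        std::cerr << '\n';
    }
}

int sum(int a, int b) {
    debug_log(1, "sum(", a, ", ", b, ")");
    return a + b;
}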

Record and replay doesn't mean reversible

I wasn't talking about reversible computing, simply of recording and replaying. In this context, stepping back is like rewinding a video recording: you can't change the past, but you're free to review it. The advantage comes from viewing many perspectives.

Of course, one might leverage short segments of reversibility to optimize the storage requirements (fewer snapshots). But brute force recording and compression would often work just as well.

I prefer only to monitor the specific things I am interested in.

A fundamental weakness of this approach is that you don't initially know what you should be interested in. You can make a hypothesis and insert a few logs or watches. But then you must run the program again, study these in an ad-hoc fashion, and (unless you're lucky) develop a new hypothesis and try again. Ultimately, these repeated efforts consume a large amount of time, and their ad-hoc nature hinders development of generic tools to explore them.

A good recording, on the other hand, allows trying and exploring many hypotheses from a single run. And since such histories would have a uniform structure (up to compiler) they also can work with generic visualization and diff tools. There is great potential to improve your efficiency with debugging.

is the best option for me personally

Is it really? If you haven't experienced and achieved proficiency with all of the alternatives, how would you know?

I can see the use of a trace of everything, but it would need a good UI to work well as otherwise it is likely to overload with irrelevant informations.

I agree. You need a development environment that can take advantage of these traces.

worth a try

You're right, of course; it might be even better than what I am currently doing. Can you point me at an open-source tool I can use with C++ to do this kind of thing? I would want something standalone, as I don't use an IDE, and would prefer flexibility in which editor I use. Preferably something that you run from the command line with the program to trace as an argument, which then opens a GUI window.

Having said that I am pretty efficient as things are now. Often a single print statement inserted into the code is enough to find and fix the bug.

I've read of some time

I've read of some time travelling debuggers for C++, but I've never tried them and don't recall their names. (I do recall a short, fairly ugly video clip.) I've recently sworn off C++, so I don't plan to look. Record and replay features are certainly not evenly distributed across the PL landscape. :)

I sincerely doubt there are enough options for this feature in C++ to be picky about command line features and so on.

Well this sounds interesting,

Well this sounds interesting, and probably easier to implement. I wonder if this could have limitations because of the inability to inspect a variable, or whether it requires a well-typed language.

I also would be interested

I also would be interested in answers.

I suspect there should be some suitable mailing lists, but I don't know them. There doesn't seem to be a good Google+ community for compilers yet.

Haven't seen alternatives yet

It's a broad field. You're unlikely to get very sane answers on the Internet unless you join a community around a specific compiler.

Theory vs. Practice

There’s something to be said for bringing more of the practice of compiler implementation here. Many of us are interested in practical applications of the theory; I, for one, often use LtU as a paper trove to plunder for implementation ideas.

We always had an

We always had an implementation department. But it has mostly been about run time issues, and static analysis, since parsing is typically not that exciting.

Just Theory vs. Practice?

Just Theory vs. Practice? Design always feels left out.

The practice is horrible

Nine out of ten compilers are absolutely horrible. And then there are some (usually non-performant) compilers which die of academic beauty. And once in a while a compiler comes along which, I think mostly by chance, gets it right.

Then again, I might have scanned too many open source or academic compilers. I don't think IBM gets away with spaghetti code.

Which ones do you think are

Which ones do you think are really good?

Depends

It depends on what you want to do. I guess if you want to know how compilers are written, Go is a good start.

I don't think IBM gets away

I don't think IBM gets away with spaghetti code.

IBM would be a bad example. There is one guy in France who did most of the Java IDE compiler for Eclipse. Your head would probably explode if you saw his code (it is mostly heuristic), but it gets the job done. Then there are people like David Grove who are very famous in the compiler community but whose code (like Jikes, JRVM, X10) probably might not match your idealistic vision of how compilers should be written.

Sure dude

You got it all figured out.

Nope. Just not hopelessly

Nope. Just not hopelessly idealistic (and I have worked for and dealt with lots of IBM's code in the past).

Yah.

Okay. To be honest I was thinking about Intel's C compiler and confusingly wrote IBM. I would love to see that code.

Well, for people interested,

Well, for people interested, it looks like this could be a good place:

https://www.reddit.com/r/Compilers/

It's still a bit empty, so come join!