Looking for papers describing advanced language topics in terms of C programming

I'm interested in increasing my understanding of some of the more advanced programming-language constructs I see discussed here and in other places, specifically the ones that are common in functional programming (first-class everything, continuations, closures, etc.). Now, the best way to understand them is to actually use them in a language that implements them, but that requires a good investment in learning those languages (Haskell, Lisp, etc.).

What I'm looking for are some papers that describe the various advanced topics in terms that a regular C programmer, like myself, can understand. In other words, pretend that you are extending the C language -- what would the syntax look like for the various constructs?

Here's an example explanation for first class functions:
In C, you can take the name of a function and pass it as a parameter to another function, as with the standard library's qsort(). You can also assign those function names to variables and array elements (which is useful for creating state machines). However, you have to define those functions somewhere else in the program. This can make it difficult to read the code and figure out what is going on. In the case of qsort(), which requires (among other things) a pointer to a comparison function, that function is probably only used once. So it would make sense to define the function right where it is used. This would require first-class anonymous functions. If C supported this, the syntax would probably look like this:

qsort(my_array, 25, sizeof(char *),
      lambda(const void *p1, const void *p2) {
          return strcmp(*(char * const *) p1, *(char * const *) p2);
      });

In this example, the keyword "lambda" creates a function on the fly and returns the address of that function as its value. You can also use the expression:
x = lambda(int arg1, int arg2) { /* contents of function */ };
And you can now call the function via: "x(arg1, arg2);".

So, if C added the "lambda" keyword with the semantics given above, then it would have full first-class anonymous functions.

...

As you can see, I think I've got an understanding of first-class functions. What I'd like to see are write-ups similar to the above explaining other concepts, such as closures, continuations, monads, or anything else that could be expressed in an imperative language such as C. Or, if something doesn't fit in that context, an explanation of that as well.

A couple of postings that I've found that come close are Joel Spolsky's article "Can Your Programming Language Do This?" (which gives examples in JavaScript), and the essay "Functional Programming For The Rest of Us" on defmacro.org.

In case you are wondering why I'm looking for this: first, I kind of understand some of the concepts, but would like to further cement my understanding. Second, I'd like something I can send to other people I know who are primarily C (or similar-language) programmers, since I don't understand the concepts well enough to explain them myself. Also, I'd like to extend the language I'm developing to support as much of this as is relevant (yes, I know, why write another language -- I'm doing it mostly for my own education, and possibly to use as a resource for others to learn from).

Thanks.


I would suggest not doing

I would suggest not doing this in C, mostly so you don't bog yourself down. For example, while I agree that what you've described is a first-class function, a key point you did not demonstrate was lexical scoping of free variables in the function body: we call this a closure, and, oversimplifying, it is why Douglas Crockford saw JavaScript as The Next Big Thing. It's a pain to encode this notion in traditional C/C++, yet it is also the starting point for many interesting things.

Considering you seem more interested in novel control structures than in building verifiable languages, and want to build up language features, the build-your-own-interpreter approach sounds right up your alley. PLAI and HtDP both have a great presentation of this, and PLAI is freely available online, so you can give it a trial run. While you could probably implement a good chunk of it in something like JavaScript, you'd miss out on being able to experiment with the features you're trying to implement (e.g., continuations and macros done right), and would again get bogged down with things like syntax. Instead, go the traditional (PLT) Scheme road. In my undergrad PL course, students learned all the necessary Scheme to get going in the first night; it ain't bad. It might be worth trying to figure out how to turn your Scheme code into machine code (or at least C code), but there is a lot of noise, like dealing with parsing and C's type system, that takes too much time: you explore a *lot* of language features in this approach, so you don't want to deal with non-essential issues.

After, you'd be in a great position to go more theoretical (type theory with TAPL), or more low level (I'd like to recommend a compiler book, but am not really satisfied with any intro level ones for interesting languages).

Finally, "Concepts, Techniques and Models of Computer Programming" (CTM) covers a bit more, such as concurrency, but skips other things like syntactic abstraction (macros). I haven't seen it used in a learning environment, so I can't yet vouch for it. All three of these resources are *very* clearly written.

I've been skimming through CTM...

I have learned a few things from CTM, but in order to get further I feel like I need to get my head wrapped around the Oz language that the examples are presented in. Maybe I'm just lazy, but I'm looking for that "ah-ha" moment with a bit lighter reading (although I do agree that CTM is very well written). I'll start looking through the other texts that you mentioned.
a key point you did not demonstrate was lexical scoping of free variables in the function body: we call this a closure,
Yep, I forgot about scoping issues. What I've done in my language, 2e, is as follows:
A function is defined by specifying "@name(expression)", where expression is any amount of code. This not only binds the expression to the given name, so it can be called like a C function, "name(p1, p2, ...)", but the definition itself also returns a value that can be assigned or called. Anonymous functions have two forms: "@@(expression)" and "@(expression)". Both can be assigned to a variable or array element, but the second form has its variables in the same scope as the parent function. So in that sense, you can do closures (I think). Like this:

addfunc = @($1 + $2);
add5 = @(addfunc($1, 5));

result = add5(7)   #result should be equal to "12"

In this case, the "add5()" function closes over the "addfunc()" function. This is the example that I've seen in other discussions on closures, hopefully I've got it right. If anyone wants to try out some variations, I've uploaded the latest version of my language to lang2e.sourceforge.net -- version 0.9. If I have got closures correct, then I'm going to go for continuations next (or at least get enough in so I can implement co-routines as described by Knuth).

BTW, one weakness that I've got is that all variables in my functions are "static". So you can use recursion, but it is a bit tricky, with variables getting clobbered (I've got code samples showing how to work around this). Another weakness is that I haven't done anything with tail-call optimization yet. I've figured out an easy way to do it; I just need to find time to test it.

Thanks for the reading list, it looks like I've got some (fun) work ahead of me.

Weakness

BTW, one weakness that I've got is that all variables in my functions are "static".

If all your variables are global, and your syntax has no way to refer to outer function arguments either (which seems to be the case), then you don't have closures. For the same reason, C does not have closures. Closures are an implementation technique for accessing variables (including function arguments) that are local, but to some enclosing scope.

So your example does not exhibit closures. Here it is in lambda-calculus notation:

add = \(x,y).x+y
add5 = \x.add(x,5)
result = add5(7)

You witness closures when you are able to express it like this:

add = \x.(\y.x+y)
add5 = add(5)     -- closes over x
result = add5(7)

Of course, this is a rather boring example. It becomes more interesting if the add function actually does something before returning the inner function, e.g.:

add = \x.(let y = some_f(x); \z.y+z)

Then, some_f only gets called once, when add5 is defined. With your version you could not avoid calling it each time add5 is used.

What I mean by static variables

What I mean by static variables is that within a function, variables' values don't disappear after the function exits, similar to C when you use the "static" keyword to declare a variable. (This is partly due to laziness on my part in implementing the interpreter.) But there are two types of functions -- one type shares variable scope with whatever function it was defined in (function definitions can be nested), and the other type has variables that are in a private scope (but they are still persistent after the function exits). Only anonymous functions defined with a single "@" symbol have shared variable scope -- ones defined with two "@@" symbols have separate scope, same as named functions. So far I don't have any syntax to get at global variables (except for the fact that the names of named functions are global).

But I can see now that this fact still doesn't help me with closures. I think I understand why I don't have closures, but will have to experiment a bit to see what it would take syntax-wise to get them. From what it appears, you have to be able to have a function with local variable scope, but also be able to attach a group of variables to it that are seen within its scope as well as the scope of whatever function attached them to it. Kind of similar to a C++ class definition (but somewhat different)?

Closures and objects

Kind of similar to a C++ class definition (but somewhat different)?

There is a very strong correspondence between OO style objects and closures.

I think I understand why I don't have closures, but will have to experiment a bit to see what it would take syntax-wise to get them.

You probably don't need syntax. Most languages with closures don't do anything to specially flag variables that can be captured. They typically just follow the normal block-scoping conventions for the language. E.g., in the curly-braced syntax of JavaScript*, it looks something like this:

function add(x) {
  return function(y) {
    return x + y;
  };
}

var add5 = add(5);
alert(add5(7));

At the implementation level there may very well be a difference between captured variables and non-captured variables but that doesn't mean the syntax needs to show it.

Also, one point to clarify from your original post is that anonymous lambdas aren't a prerequisite for first-class functions and closures. Again, some JavaScript:

function add(x) {
  function addx(y) {
    return x + y;
  }
  return addx;
}

var add5 = add(5);
alert(add5(7));

If a language has local first class named functions then anonymous lambdas are just a syntactic convenience.

* note that JavaScript doesn't really have block scoping, but the difference isn't that important for this example.

Changes I need to make...

So in my language's context:

f1 = @@(
    fv1 = 123;
    f2 = @(
        print($1, fv1)
    );
    ### additional code;
    f2("foo")
)


In this case, f1() has a separate namespace from the rest of the program, since it is defined with two "@@" symbols. But f2() shares a namespace with f1(). Therefore, the print statement when f2 is called prints "foo123". So in this sense it looks like I have closures (f1 closes over f2).
But here is where it falls apart:

some_variable = 123;
f1 = @(
    print(some_variable)
);
f2 = @@(
    some_variable = 456;
    $1()
);

### now call f2, passing f1 as its argument
f2(f1)

In this case, the variable name "some_variable" that f1() sees is the one that was defined above it, so it prints out "123", even when it is called from within f2(). I think that to make this useful, f1 should access the variable as it is set within f2 when it is called from there.

(Also, to make it useful, I should define a way so that named functions can get either a separate or shared namespace, just like I have for anonymous functions).

Scoping

Most languages are lexically scoped by default. That just means that the meaning of a symbol is bound based on its position in the (static) code rather than when the variable is examined during the dynamic execution of the program. Assuming that in your second example you have two different variables named "some_variable" and you aren't just mutating one global variable, then lexical scoping would say that "123" is the right answer.

The alternative you propose is called dynamic scoping (again, assuming you aren't just mutating one variable). Before committing one way or the other, be sure to dive into some explanations of the trade-offs involved. There's a very good reason that lexical scoping has been the far more common default for the last several decades: it's usually much easier to reason about. On the other hand, dynamic scoping does have power, and even some very recent languages offer dynamically scoped variables as a non-default alternative.

I'm confused by what you are trying to accomplish with namespacing.

What I'm trying to accomplish

Right. The way the interpreter is currently implemented, each variable, when it is found by the parser, is assigned an index number into the variable storage data structure. The variable name is combined with the ID of the function it is found in, and this is used to find the variable's index. This index number is what gets loaded into the final compiled code. So lexical scoping is the most natural at the moment.

But right now I don't have any implementation for object-oriented features. So as a quick fix, I thought that if a function could access the variables of a wrapper function, then that would be a way to get a form of objects. They would act differently than C++ classes, but that also fits in with one of my goals -- by re-inventing the wheel without exactly copying what was done before, I might discover something a bit more powerful (kind of like looking at old ideas with fresh eyes).

Another idea was to emulate objects strictly through operator overloading. So far I've got a syntax that works great for using these "objects", but setting them up is a pain and non-intuitive. But I guess that is part of the learning experience.

BTW, I like your idea of defaulting to lexical scoping and using alternate syntax to get dynamic scoping. I was thinking of something like "namespace::varname", where namespace is the name of some other function, and this will retrieve the variable from that other function ("::varname" will grab variables from the main function, which is unnamed).

Recursion and stuff

The variable name is combined with the ID of the function it is found in, and this is used to find the variable's index. This index number is what gets loaded into the final compiled code.

Does that mean functions aren't reentrant? Well, I have to say that's...bold :-)

for now...

Yes. Even though functions have separate namespaces, if a function calls itself then the inner call can clobber variables that may be needed once it returns. The way I get around it is to pass a counter parameter to the function, and replace any variable that needs to be preserved with an array, using that counter parameter as an index. So:

foo(var1, 0);

@foo(
    x[$1] = something;
    ...
    foo(var1, $1 + 1);
    bla(x[$1]);
    ...
)

In this example, x is a variable that is used before and after a recursive call to foo(). So it was replaced by an array, and every time foo calls itself it increments the second parameter by one.

In a future version, I'll use a built-in internal function, called "varpush(x)", which will push the current value of x onto a stack at the beginning of the function. Then, when the function exits, it will pop the old value back off the stack. This way, instead of setting up and tearing down all of a function's variables every time it is entered and exited, only the variables that need to be preserved will get extra handling, thereby keeping the performance up.

Yes, bold, I know, and potentially error prone. But I couldn't bear the overhead of messing with all the variables every time I entered or exited a function (and it makes the interpreter code that much easier to understand).

Based off of this

Based on this conversation, I really suggest following something like PLAI and using Scheme. You'll write the entire thing in 20 lines of functions and lists and quickly progress well beyond this point. For example, if you write the interpreter properly, making such changes should only impact 1-5 lines of code.

Recursive functions, lexical scoping, mutable variables, continuations, futures, etc. require minimal but conceptually powerful changes in such an approach. You'll figure out the naive implementation approach, and, more importantly, the basic 'clean' semantic design space and sample usage patterns. After you understand what's interesting to include, then start thinking about what optimizations (local or global) are possible for your restricted set of features: for example, profile-driven inlining of functions and subsequent expression elimination transforms might yield a much bigger win than a funky stack discipline.

In essence

In essence, all your variables are mutable as well as global (though not necessarily globally visible). That's the opposite of lexical scoping -- or any scoping at all, for that matter -- and a sure recipe for disaster. Not only is it extremely error-prone, and precludes recursion, as James points out; it also makes first-class functions practically useless.

I feel that this discussion is evidence that your approach of trying to understand advanced PL concepts in terms of C does not fly. You should take the advice already given by others and invest time to actually learn some high-level language and the respective implementation techniques.

Confused

My variables are mutable (so it's not purely functional), and internally they are all stored in the same data structure, but from the 2e script's point of view a variable "foo" in one function is unique to that function, and the same variable named "foo" in another function (or the main program) is unique there, just like in C or Perl, etc. Isn't that what lexical scoping is? They are just statically allocated at "compile" time instead of dynamically allocated when a function is entered (which I'll change at some point).

Another idea I've toyed with is to have a function inherit variables from its parent function, but copy-on-write. In that case, they would behave like environment-variable inheritance in regular OS processes. That may be closer to what some of the functional languages provide?

As for mutability, if I wanted to make a purely functional language, then a couple of lines of code in the assignment-operator handler could make sure that variables can only be assigned a value once. At the moment, however, I don't want to go for that, since I want to keep a regular procedural language that has some functional features (maybe this is a contradiction?).

BTW, I've been reading through CTM and find it extremely interesting, although it is a text that I will have to put a lot of energy into in order to get enough out of it. PLAI also appears to be helpful. I just hope it doesn't have a lot of "use this magic Scheme feature to implement the same feature in our Scheme-like interpreter", since that can become a circular definition (I've seen this too much in other texts).

So far through this project, every time I've needed to add a new feature or change an underlying construct, I've managed to do it with only a few lines of tweaking. But to get the data management part straightened out is going to require a bit more re-working. I'll make sure I keep good notes on all the wrong paths I've taken, and the corrective actions, so that anyone else interested can learn from it.

Not lexical

a variable "foo" in one function is unique to that function, and the same variable named "foo" in another function (or the main program) is unique there, just like in C or Perl, etc.

That's exactly like C and Perl except that it's totally different. In both languages the memory for locally declared variables is freshly allocated on each invocation. That means that a function can call itself (directly or indirectly) without clobbering data set up in a previous invocation. And all that happens without any extra effort on the part of the user.

Isn't that what lexical scoping is?

Nope. You've got globally scoped variables that are namespaced by their enclosing function. Scoping and namespacing are two different things. Namespacing is just a clever way to let two things appear to have the same simple name because their simple names are in different "spaces"*. But namespacing doesn't really change much - if I can write foo.bar anywhere to access bar in the single globally available foo namespace then that's about the same difference as naming the variable foodotbar and putting it in a global namespace. Whether namespaced or not, it's one binding and it's accessible globally.

A binding is a relationship between a symbol and its "meaning". For variables, you can roughly say that this means the relationship between a variable name and a location in memory (there are more formal ways to describe this, but you're an implementation-focused guy, so "location in memory" will do). In C, if a function foo has a local variable named bar, then there will be a new binding for bar each time foo is invoked, because a new location on the stack will be allocated with each invocation.

Global scoping means that a binding made anywhere in a program holds across the entire program. As I've illustrated above, you've described a language with globally scoped variables. There's only ever one binding for the foo.bar variable, and that binding is globally accessible.

Lexical scoping means that each time a function is called it makes new bindings for its declared variables, and that a binding made in one part of your program is available over a certain chunk of the program that can be determined by the lexical position of the declaration. Often lexical scoping is based on nested blocks. In lexical scoping, bindings are not in general available across chained function invocations (except perhaps to lexically nested functions), nor are they global -- they aren't available anywhere except in their lexical environment.

Dynamic scoping means each function invocation makes new bindings for its declared variables, and each binding made in a function is available across the calls (and the calls from calls) that the function makes, but the binding is lost when the function returns to its caller.

Hope this helps clear up the confusion. Good luck with those books.

* Some languages also add access protection schemes to namespaces - at which point namespaces have become simple modules.

Those books primarily* use

Those books primarily* use lambda and lists (and maybe numbers, if you feel like including them). CTM's use of Oz makes it the more exotic one to me.

*I think they resort to the native macro system when discussing macros, but I never read them that far.

Thanks, you just gave me an "ah-ha" moment...

Now I see. So in my documentation, I'll describe this as a global-scope / local-namespace implementation, since the variables are bound to a memory location at compile time. If I wanted lexical scoping, I'd have to bind the variables to a memory location at run time, whenever a function (or a separate scope) is entered. More likely, the variables will be bound to an index at compile time, but that index will be an offset into a chunk of memory allocated (at run time) whenever a new scope is entered.

And, instead of scoping based strictly on functions, I should have a generalized scoping syntax, such as braces, to denote when I'm entering a new scope. Once this is straightened out, I think I'll have a serious language implementation (I originally didn't intend it to go even this far, but I've got most of the features of some of the mainstream languages, plus a few nice extras that they lack -- but that should probably be a separate topic).

Thanks.

closing over

There's a terminological snafu you keep repeating that suggests a deeper misunderstanding. Functions don't close over other functions (well, they can); they close over their free variables. In your first example, if you had higher-order functions, f2 would be closing over fv1. In your second example, f1 closes over some_variable -- though if some_variable is viewed as being in a "global" scope, then we don't usually say we close over it. It is fine to say so, but then we should say that f1 closes over print as well (unless print is a primitive construct in your language).

interesting

It's a pain to encode this notion in traditional C/C++, yet it is also the starting point for many interesting things.

Makes me wish for a taxonomy / hierarchy of such things, with examples of what each new "level" allows you to do vs. "lower" levels.

While not complete, my

While not complete, my libconcurrency implements coroutines and continuations in C. The copying branch is a simple, naive implementation of continuations by stack copying. The other context functions operate via a portable jmp_buf bit-twiddling technique.

Did you...

Stress test it? If you go massively parallel, or deeply nested, won't you run out of stack-space fast?

(I'm only asking because I've got major problems with ML running out of stack space.)

No stress testing yet to be

No stress testing yet, to be honest, and it's in a state of flux since I discovered that the latest glibc swizzles the jmp_buf in an odd way.

This library is not intended for massive fine-grained parallelism, but for medium-grained parallelism. The approach is basically the same as Lua coroutines. On a 32-bit machine, you could theoretically max out at around 500,000 coroutines with 4 kB stacks (about 2 GB of the address space). But at that point you're already incurring too many context switches, so a better algorithm would be preferable.

Intuition, not mechanism

Derek: I think there is a hazard in your approach. In one sense, C is just a macro assembler, and just about anything can be expressed in it. In another sense, C is a language with a strong history. Users of C develop in their minds a set of idioms, assumptions, lists of hazards, and so forth. These inform and constrain how they use the language. Expressing problems in C (or any language) constrains how you think about them. This is equally true when expressing problems in lambda calculus. The problem with asking "What would this look like in C'?", where C' is some augmentation of C, is that you may get a handle on the mechanism, but it's unlikely that you'll develop a sense of the idiom or impact of a feature.

From an idiom and impact perspective, there are three features of modern languages (in my opinion) that have huge conceptual impact: safety, closures, and GC. These really do change how you look at things in a qualitative way.

If your goal is to understand how this stuff actually works at the bottom, then I invite you to come study the BitC compiler or any number of other compilers that accept a modern language and emit C [1]. Look at how they do the transforms and why. At the risk of bias, I'd especially like you to look at BitC and ask your questions on the bitc-dev list, so that we can document the missing information in the code as we respond to your questions. Everybody would benefit that way, and we're at a stage where a "virgin" set of eyes would help us quite a bit.

[1] Notable exception: GHC emits C, but it's a bad example for study, because (as I understand matters) it emits C that exploits incestuous knowledge of the internal operation of GCC. As a result, there is a lot of "hidden" magic that isn't obvious in the C code.

Not quite what you're looking for, but ...

I wrote a few articles on building a very basic Scheme-like interpreter in C++:

http://blog.lab49.com/archives/2013
http://blog.lab49.com/archives/2220
http://blog.lab49.com/archives/2241

While it's not an extension of the C language, it's still a kind of explanation of things like closures in terms of C++ primitives.

As for continuations, I also made this interpreter in C++:

http://blog.lab49.com/archives/724

You can use it to view the CPS-transform (just a very simple term transformation) of arbitrary terms. The interpreter is implemented in C++ as well, and uses a "trampolining" method to evaluate the CPS-transformed programs. I can make that source code available as well, if you're interested.

I think I get it now...

Now that I've had a chance to digest the advice everyone has given here (and after going through some Scheme in SICP), I think I've got a better handle on namespaces vs. lexical scope, closures, and, to some extent, continuations. I've also found some video lectures from UCB going through SICP. So, let me see if I've got this right:

First, having variable names be reusable in different functions without colliding (separate namespaces) isn't enough -- the variable names need to refer to a unique chunk of memory whenever the function is entered. From an implementation standpoint, that means that variables can't be bound to memory locations (or indexes into a data structure) at compile time -- they have to be bound whenever the function is entered at run time. This is usually done by making the variables offsets into the call stack or a heap-allocated frame.

Secondly, a closure isn't necessarily a language feature; it is a programming technique that needs certain language features to be present in order to be used easily. Basically, a function definition needs to be able to see certain variables in the parent function (assuming nested functions), and those variables then need to persist after the parent function exits (assuming the parent returns the inner function as a result). So either garbage collection needs to be implemented (assuming the inner function can see the outer function's variables), or function definitions need syntax that allows the outer function to specify which variables get passed to the inner function. In this second case, those variables would then get stored with the same data element that holds a pointer to the inner function (possibly passed as default arguments when the function is called).

An example of using a closure is defining a function which returns as its result another function (or function pointer) whose behavior is influenced by the arguments passed to the first function. In other words, you have a function-creator function.

Now, a continuation is something I think I get, but I'm probably still missing a few things. From what I understand, a lexical environment is normally set up each time a function is called. A continuation is then a value that refers to one of these lexical environments (in other words, an "in-flight" function), along with possibly a starting instruction address within that function. And in languages such as C, when f1 calls f2, f1 effectively passes a continuation to f2, which is what f2 invokes when it executes a return.

So, have I got the right idea this time? (or at least some of it right?) Thanks.

If you describe STM in

If you describe STM in C, you can show that it works but you can't show that it's safe.