What is so bad about compiling?

No, we haven’t lost our minds! We know that compilers and the languages they represent are the absolute foundation of computing. Compilers are so basic, and originate so early in the history of computers, that it is easy to forget what a compiled language really is: a compiled language is simply an extension of the machine codes defining the register machine. As long as we insist that a language compile, we are insisting that the register machine must be our computing architecture. This is because, no matter how we disguise it, the basic properties of the register machine will leak through into any language that must compile directly, without an intervening virtual machine.

The property we are most concerned about is “flow”. We read and execute one instruction after another, in order. If we need to execute a different string of instructions, we use conditional jump instructions. At a higher level, a conditional jump would be a goto statement with a label. Gotos are very powerful but can lead to what we used to call “spaghetti code”: it can be impossible to “see” where the program is going next. We bring this up because “spaghetti code” still lives on in the form of if/then/else statements, only in a more subtle way.

Another aspect of the “flow” problem is that the code must be analytic. The register machine has no built in way to back up or recover from errors. We can and must check for errors, but the only option then is usually to terminate the program. The logic has been broken! The analytic assumption is reasonable as long as the program is not connected to the external world somehow, perhaps through user input, sockets, or other external activity. The need to be analytic leads to the “fortress” mentality: radical steps to ensure that the code stays analytic.

A solution exists that we like to call synthetic logic. A synthetic logic is a set of event rules with a works-or-doesn’t-work logic. Synthetic logic is a logic of possibility: the event X is said to work if events A, B, and C work, in order. If something goes wrong, the system simply backs up to the starting point automatically, without any specific error processing. Synthetic logic can coexist with an analytic true-or-false logic. The dilemma is that synthetic logic, in whatever form, will require a virtual machine to execute, because it will be based on backward or forward chaining control. In other words, it won’t compile on a register machine.
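
To make this a little more concrete, here is a minimal sketch, in Python, of the kind of works-or-doesn’t-work evaluation we have in mind. The rule and action names are made up for illustration; it shows only the control idea, not any real rule system:

RULES = {
    # goal: list of alternative bodies, each a list of subgoals tried in order
    "make_coffee": [["boil_water", "grind_beans", "brew"]],
    "boil_water":  [["fill_kettle", "heat_kettle"]],
}

ACTIONS = {
    # primitive events report works (True) or doesn't work (False)
    "fill_kettle": lambda: True,
    "heat_kettle": lambda: True,
    "grind_beans": lambda: True,
    "brew":        lambda: True,
}

def solve(goal):
    """A goal works if it is a primitive that works, or if every subgoal of
    some rule body for it works, in order."""
    if goal in ACTIONS:
        return ACTIONS[goal]()
    for body in RULES.get(goal, []):
        if all(solve(subgoal) for subgoal in body):
            return True       # this alternative worked
        # otherwise simply back up and try the next alternative
    return False

print(solve("make_coffee"))   # True with the actions stubbed out above

Note that failure is not an error condition here: a subgoal that doesn’t work just makes the engine back up and try another alternative, or report that the goal doesn’t work.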

It seems to us that language designers must address the larger problem of languages that are hard to compile or impossible to compile. These languages usually don’t make it into prime time because they don’t compile. If we continue to insist that mainstream languages compile, we will simply get more of the same subtle problems. In the search for the ultimate language, the compiler should not be a consideration. Any language concept can be implemented as a virtual machine, and with the fast computers that we have today this may be all that is needed. If that language should actually turn out to be the ultimate language, there would be plenty of money to design a machine on which it could be compiled directly.

Dude, you are blocking my sun

Nuff said.

Still have to compile

Can't see where you eliminate the need for compilers, as you still have to have some program that translates the high level language used by the programmer into the instruction set used by the VM.

What you are arguing against is not compilers, but rather the target instruction set of the compiler.

Yes to all that. I could hav

Yes to all that. I could have been clearer. It is possible to pseudo-compile on a virtual machine, and certainly the vm must itself be compiled on a register machine, since there is no other machine to run it on. We use C++ for that.

I have no idea what you are talking about

Most of the text doesn't make any sense to me, and its fragments don't have anything in common. Sorry, I think I can distinguish between "I don't understand it because it relies on advanced concepts I'm not familiar with", "I disagree with some assumptions", and "it's total rubbish". This looks like the last category.

To make this critique more concrete, you say that "no matter how we disguise it, the basic properties of the register machine will leak through into any language that must compile directly, without an intervening virtual machine." This is false at a high level (the semantics of a compiled language doesn't depend on the target language of the compiler but on the source language - it's the compiler which decides how constructs will behave), and while in some sense it's true when we get into details (for example, the interface to the external world is constrained by the interfaces available from the target language), the same holds for interpreters.

"The register machine has no built in way to back up or recover from errors" - again, this is no different from interpreters. Errors in the program detected by the semantics of the source language can be recovered from in both compilers and interpreters. Ditto for errors reported by libraries used (I/O errors etc.). OTOH errors in the compiler or interpreter (i.e. "impossible errors") generally can't be recovered from, no matter how the language is implemented.

The paragraph about control flow makes no sense at all. "Spaghetti code" is a characterisation of a particular style of programming; it is completely irrelevant to the way in which the language the program is written in is implemented.

Perhaps we can clear up one t

Perhaps we can clear up one thing. You seem to be talking about compilers and interpreters. My comparison is between compilers and virtual machines. I am also thinking of virtual machines that process if/then rules with backward or forward chaining. This type of virtual machine has very different properties from a register machine. From working with several virtual machines including a Lisp vm I have concluded that the machine, or the vm as the case may be has more to do with the properties of the language than the language itself. It follows that imperative languages built on register machines are also alike in important ways because they are all built on the same machine. The two ways I pointed out have to do with the flow of control in a register machine. Control flow in a rule vm is quite different, either forward or backward chaining. Flow in a register machine is line to line except for conditional jumps.

The behavior of if/then statements used as flow control is quite different from that of if/then rules with backchaining control. It is my opinion that rules are better.

Error recovery is a property of backward chaining. It is possible to recover from any error whatsoever: code errors, execution errors, even a meltdown of external equipment, without explicit code such as throws or terminations. The system will continue to run if the rules tell it what to do.

Thanks for responding!
There is a working example of such a system called Lewis. http://home.sprynet.com/~hthedick/homepage.htm

More clarification

You aren't talking about the difference between compilers and virtual machines, you are talking about the difference between target languages; one of which is a register machine, and another of which is some sort of virtual machine. However, this logic is completely flawed. Compilers are simply functions between programs in given languages. Example:
P1: C program
P2: x86 program
GCC: P1 -> P2
(this is a simplification, since GCC is a collection of compilers).
Now, theoretically, for a compiler to be valid, for all inputs x,
P1(x) = P2(x)
Like functions, compilers can be composed:
P1: C program
P2: LLVM program
P3: x86 program
llvm-gcc: P1 -> P2
llvmc: P2 -> P3
(llvmc o llvm-gcc): P1 -> P3

given llvm-gcc and llvmc are valid:
for all inputs x,
P1(x) = P2(x) and P2(x) = P3(x),
therefore P1(x) = P3(x), so
(llvmc o llvm-gcc) is a valid compiler as well.

Now, an interpreter is a function which takes code in a given language, its input and returns a result by executing it. We can write these as functions too:

P1: Python code
x: some input
y: some output
cpython: P1, x -> y

P2: Java bytecode
x1: some input
y1: some output
jvm: P2, x1 -> y1

P3: x86 binary
x2: some input
y2: some output
athlon: P3, x2 -> y2
etc.

Virtual machines are simply bytecode interpreters. If there is a difference besides the name, I have yet to see one. Physical machines are interpreters too; they "interpret" binary code to produce a result.

Notice as well that compilers and interpreters themselves are programs, so:

gcc(cpython) -> P
where P is x86 asm such that:
P(python code, input) -> output

Now, partial evaluators are programs such that:

P: program in language L
P, x, y -> z
x1: an input, valid as the first input for P
PE(P, x1) -> P1 such that:
P1, y -> P(x1, y)
and P1 is a program in language L

This is actually simple currying, but many partial evaluators do sophisticated optimizations and such.
Notice that partial evaluators take two inputs. Stipulate that there is:

PE: partial evaluator written in language L, over programs written in L
I: interpreter, written in L, which interprets programs written in L1
P, x -> y where P is written in L1
PE(I, P) -> P1
such that
P1: x -> y where P1 is written in L
So we have just compiled P into L.

Furthermore:
PE(PE, I) -> C where
C: P1 -> P2 where
P1 is a program written in L1 and
P2 is a program written in L and
for all inputs x P1(x) = P2(x)
So, from our interpreter we have generated a valid compiler from L1 to L.
Application:
PEx86: a partial evaluator written in x86 over programs written in x86
LBI: a Lewis bytecode interpreter written in x86 (a compiled C/C++ program will do)
LCB: a Lewis compiler which outputs Lewis bytecode, written in x86 (again, it might be written in C/C++ and compiled to x86)
(PEx86(PEx86, LBI) o LCB) -> LCx86
where LCx86 is a valid Lewis compiler targeting x86.

This is my mangling of an old idea, Futamura's second projection. But as you can see there is definitely some form of equivalence between (bytecode or not) interpreters and compilers, and therefore virtual machines as target languages and physical machines as target languages.
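
To make the closure form of this concrete, here is a deliberately trivial sketch in Python. It "specializes" by closing over the static argument instead of generating code, so it does none of the interesting work a real partial evaluator does, but the defining equation PE(p, s)(d) = p(s, d) and the first two projections come out the same. The toy stack language and all the names are invented for illustration:

# A trivial "partial evaluator" in closure form. Real partial evaluators
# unfold and simplify the program; this one only fixes the static argument.

def PE(p, static_arg):
    return lambda dynamic_arg: p(static_arg, dynamic_arg)

def interp(program, x):
    """A tiny stack-machine interpreter: the 'virtual machine' of this thread."""
    stack = [x]
    for op in program:
        if op == "dup":
            stack.append(stack[-1])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        else:                      # anything else is a literal to push
            stack.append(op)
    return stack.pop()

double_plus_one = ["dup", "add", 1, "add"]        # computes 2*x + 1

# First projection: specializing the interpreter on a program "compiles" it.
compiled = PE(interp, double_plus_one)
print(compiled(10))                   # 21, same as interp(double_plus_one, 10)

# Second projection: specializing PE on the interpreter yields a compiler.
compiler = PE(PE, interp)
print(compiler(double_plus_one)(10))  # 21 again

A real partial evaluator would unfold the interpreter's loop over the program and emit residual code with the dispatch removed, which is what makes the generated compilers worth having.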

Being a vague observer of the

Being a vague observer of the field, I found this explanation of the term "Partial evaluator" most educational.

Wading through the Turing tar

Control flow in a rule vm is quite different, either forward or backward chaining. Flow in a register machine is line to line except for conditional jumps.

Nothing stops you from compiling whatever control flow you like onto whatever (Turing-complete) register machine you like. In general, the premise that register machines limit the languages which compile to them is flawed: Turing-completeness and Turing equivalence undermine such arguments.

There's also a more concrete way to see why the premise here is flawed: if a virtual machine exists for a language, and you can run that virtual machine on a register machine, then you can design a compiler which compiles the source language directly to the register machine. All such a compiler has to do is replace each virtual machine instruction that would otherwise be generated, with the register machine instructions that would be executed for that virtual machine instruction.
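
As a rough illustration of that construction (not of any particular system), here is a toy bytecode whose "compiler" is obtained by pasting together, for each instruction, the host code an interpreter would have executed for it; the instruction set and the templates are invented:

# For each VM instruction, the host code its interpreter case would run:
TEMPLATES = {
    "push": "stack.append({arg})",
    "add":  "b = stack.pop(); a = stack.pop(); stack.append(a + b)",
    "mul":  "b = stack.pop(); a = stack.pop(); stack.append(a * b)",
}

def compile_to_host(bytecode):
    """Emit straight-line host code; no dispatch loop is left at runtime."""
    lines = ["def run(x):", "    stack = [x]"]
    for op, arg in bytecode:
        lines.append("    " + TEMPLATES[op].format(arg=arg))
    lines.append("    return stack.pop()")
    return "\n".join(lines)

program = [("push", 3), ("mul", None), ("push", 1), ("add", None)]  # 3*x + 1
namespace = {}
exec(compile_to_host(program), namespace)   # "load" the generated code
print(namespace["run"](10))                 # 31

The dispatching is done once, when the code is generated, rather than every time an instruction is executed.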

In fact, it is possible to automate the generation of such compilers from virtual machines, and plenty of work has been done in this area — try googling for "compiler generator" and also "partial evaluation".

Looking at it from another perspective, if you could show that there was some limitation of register machines that prevented a particular computational language, or type of language, from being compiled to those machines, you would have a result of similar significance to those of Turing. This would require a major adjustment of the Church-Turing thesis, at the very least.

From working with several virtual machines including a Lisp vm I have concluded that the machine, or the vm as the case may be has more to do with the properties of the language than the language itself. It follows that imperative languages built on register machines are also alike in important ways because they are all built on the same machine.

This logic is also flawed. If one chooses to design a virtual machine that is a good semantic match for a language, the source language obviously affects the design of the virtual machine. Any influence that the VM has in the other direction, on the source language, is essentially an illusion: it's by choice that the semantics of the source language and VM are close to each other, not by some unavoidable necessity. The existence of many non-Java-like languages that compile to the Java VM demonstrates this point.

None of this implies that the target machine — whether a register machine or otherwise — limits the design of languages that target it. There are excellent examples which demonstrate otherwise, in fact: the functional languages implement some variation of the lambda calculus, which is a concise mathematical notion of computation that was in no way influenced by register machines. The lambda calculus has a relatively straightforward translation into register-machine code, e.g. via CPS transformation, and many functional language implementations rely on this.
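
For readers who haven't seen it, here is a tiny hand-written example of what a CPS transformation produces, in Python rather than Scheme or ML, and purely as an illustration of the idea, not as the output of any actual compiler:

def fact_direct(n):
    return 1 if n == 0 else n * fact_direct(n - 1)

def fact_cps(n, k):
    # every call takes an explicit continuation k: "what to do with the result"
    if n == 0:
        return k(1)
    # the implicit return address becomes an ordinary value built right here
    return fact_cps(n - 1, lambda r: k(n * r))

print(fact_direct(5))               # 120
print(fact_cps(5, lambda r: r))     # 120

Once every "what happens next" is an explicit value passed along with a call, translating the result into jumps and registers is comparatively mechanical, which is why this route to register-machine code is so well travelled.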

It seems to us that language designers must address the larger problem of languages that are hard to compile or impossible to compile.

There are no languages that can be run by a virtual machine that are impossible to compile. Some languages are easier to compile than others, but there's no evidence to indicate that difficulty of compiling is preventing good languages from being developed. Many languages have been designed without any consideration of register machines, and many of those have had successful compilers written for them. Higher-level hardware might make it easier to compile certain languages, but it would also be likely to make other kinds of languages more difficult to compile.

Yes but- - -

Thanks for the education. You all really do know your stuff. However, this brings up a problem I often face, one that seems to follow any language discussion. The fact that one can transform any language into another is important for theoretical and practical reasons, but it misses the point. A language is about expression: one chooses a language because it expresses the problem better. It is like choosing a coordinate system in physics. Polar coordinates work better for orbital problems than rectangular coordinates.

Choosing a virtual machine may be a necessary consequence of choosing a form of expression. If we want to express backward chaining we will probably write a vm for that. This will express certain types of flow control better than using a register machine expression of flow. This is why we choose one language over another. A better expression makes it easier and quicker to do the job.

My point about compiling is really about the practical environment that we find ourselves in. The register machine is the only machine we have. A large number of languages directly express the control flow found on register machines. These languages are normally compiled directly; there is no transformation through a vm. Other languages that directly express backward chaining flow need a vm as a practical matter. The vm is itself written in something like c++ and compiled on a register machine. Such languages as a practical matter don't compile. Which is to say the vm is the best implementation. We simply run them dynamically on the vm. This is one reason why such languages don't become mainstream. We need to work on these languages and forms of expression in spite of this.

Interpretation isn't a barrier to acceptance

This is one reason why such languages don't become mainstream. We need to work on these languages and forms of expression in spite of this.

But then of course some of the most popular languages at the moment are interpreted: Perl, Python, Javascript. Interpreted Lua gets a lot of use embedded into larger applications, including real-time video games. Interpreted Erlang has been used in high-end commercial products.

Implementation strategies

However, this brings up a problem I often face, one that seems to follow any language discussion. The fact that one can transform any language into another is important for theoretical and practical reasons, but it misses the point. A language is about expression: one chooses a language because it expresses the problem better. It is like choosing a coordinate system in physics. Polar coordinates work better for orbital problems than rectangular coordinates.

Agreed. The problem you refer to, which often follows language discussions, is common enough to have a name: the Turing tar-pit. However, this discussion isn't really an example of this problem - more below.

Choosing a virtual machine may be a necessary consequence of choosing a form of expression. If we want to express backward chaining we will probably write a vm for that. This will express certain types of flow control better than using a register machine expression of flow. This is why we choose one language over another. A better expression makes it easier and quicker to do the job.

Various alternatives are being ignored here. A language can also be implemented as a direct interpreter, or as a compiler to an existing systems language or high-level language. Possible target languages include e.g. C, Scheme, OCaml and Lisp. Some languages, such as Scheme, even allow unusual control flows to be embedded directly in the language, by use of first-class continuations, which, with the assistance of macros, allow you to embed unusual languages directly in Scheme. The register machine is in no way constraining the ability to design and implement new languages that do this kind of thing.

Other languages that directly express backward chaining flow need a vm as a practical matter. The vm is itself written in something like c++ and compiled on a register machine. Such languages as a practical matter don't compile. Which is to say the vm is the best implementation.

Which languages are you thinking of? If "such languages as a practical matter don't compile", I suspect that's more to do with the familiarity of the implementors with the available technical options than with any fundamental constraints seeping through from register machines.

We simply run them dynamically on the vm. This is one reason why such languages don't become mainstream. We need to work on these languages and forms of expression in spite of this.

There are a number of interpreted and VM-based mainstream languages, so the idea that lack of native-code compilers inhibits acceptance is arguable. Also, many people work on non-mainstream languages for which few or no native-code compilers are available. Many Scheme compilers compile to C, for example (although that's mainly for portability rather than due to any constraints of register machines).

I'd agree that most languages should be designed without consideration of the constraints of register machines. In most academic language research, language designers do exactly that. As a result of this, a lot of work has also been done on how to compile languages of all kinds to register machines, and to systems languages like C. Many very successful techniques exist to do this. See e.g. Compiling with Continuations. The barriers you perceive may not be real, although without specifics it's hard to tell.

The differing behavior of if/

The behavior of if/then statements used as flow control is quite different from that of if/then rules with backchaining control. It is my opinion that rules are better.

But such languages can be compiled too. For example, Mercury and Haskell both have very different evaluation semantics from C (Mercury uses backtracking, Haskell is lazy), yet both have compilers which generate C (among other targets).

The point is that having a compiler doesn't limit properties of the language.

Relative performance?

I wonder if you have any info on the relative performance of, say, Haskell on a vm versus Haskell compiled to C?

Performance

One of the Haskell implementations, GHC, has both a compiler and a bytecode compiler with a bytecode interpreter, which use the same runtime (e.g. garbage collector and representation of most objects) and the same compiled libraries. The interpreter is used for the interactive toplevel. Unfortunately I couldn't find any performance comparisons between them.

OCaml has a native code compiler and a bytecode interpreter, which again use the same runtime. In contrast to GHC the bytecode version of OCaml doesn't mix with native code and uses bytecode-compiled libraries (except core functions implemented in C). Note that OCaml's execution model is similar to conventional languages, in particular the native code version uses the system stack (GHC has its own stack). It's hard to find good performance comparisons, but I found some claims that the native code version is 4 or 5 times faster.

I don't think I know of any other language which has a pair of implementations that differ only in this aspect. Lisp and Scheme have both compilers to native code and compilers to bytecode, but there are lots of other differences between particular implementations, so it's hard to say which part of the performance difference comes from the way of encoding instructions.

Generally the difference between compilers and interpreters is fuzzy. There is a continuous spectrum ranging from generating "real" code, through generating code which is mostly driven by data tables, to generating bytecode which is then interpreted, or interpreting another intermediate representation. There are almost no implementations which interpret the source text directly. All kinds of implementations of languages not resembling C contain something which could be described as a VM. They often use a custom calling convention, their own stack, virtual registers, a garbage collector etc., no matter whether instructions are inlined into compiled code or dispatched from an array of bytes or from an abstract syntax tree.

Thanks

Thanks for this info. This is an area I don't know much about, but it is important to me. I think I can explain my previous comments this way. Suppose there is a source S that runs on a virtual machine V. When I said that S doesn't compile, I meant that it doesn't compile directly on a register machine. However one way to compile S and be consistent with theory is to simply build a C source containing the V source code and data structures containing S, and probably a little more stuff and compile it all in one chunk. Bingo! An executable. This is a possibility I hadn't considered before. I think it would be a little faster than loading V and running S dynamically, but it loses the feel of a dynamic environment.

As far as I can tell, all this does is repackage the vm and execute it in a different way. We are still stuck with the problem of trying to develop and promote languages that don't seem to run fast enough, even though they may solve other problems.

Compilers and interpreters

However one way to compile S and be consistent with theory is to simply build a C source containing the V source code and data structures containing S, and probably a little more stuff and compile it all in one chunk.

Exactly! I think this is how NHC compiles Haskell.

This is also how CorelDRAW drawings are printed to PostScript. Instead of producing PostScript commands for particular image features, CorelDRAW emits a fixed Postscript program which interprets the CorelDRAW format, and then the original CorelDRAW file is appended.

This style in its pure form is rare I think. Most language implementations either go one direction and inline instructions as code sequences (for performance), or go in the other direction and either materialize bytecode in standalone files or construct it in memory (for ease of development and portability; in particular it's easier to obtain dynamic code loading this way).
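
For what it's worth, here is the packaging in miniature, using Python instead of C or PostScript; the toy bytecode and the file name are invented for illustration:

# A fixed interpreter text, never changed; only the appended data varies.
INTERPRETER_SRC = '''
def run(program, x):
    stack = [x]
    for op, arg in program:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()
'''

def package(program, out_path):
    """Write one self-contained script: interpreter source plus program data."""
    with open(out_path, "w") as f:
        f.write(INTERPRETER_SRC)
        f.write(f"\nPROGRAM = {program!r}\n")
        f.write('\nif __name__ == "__main__":\n')
        f.write("    import sys\n")
        f.write("    print(run(PROGRAM, int(sys.argv[1])))\n")

package([("push", 5), ("add", None)], "standalone.py")
# Running "python standalone.py 10" would now print 15.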

I've also heard of generating code in memory which consists of pointers to functions instead of instruction numbers, to replace dispatching by number with indirect jumps.
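
A rough sketch of that technique, with the "pointers to functions" represented as Python callables; the operations are invented for illustration:

def make_push(value):
    def push(stack):
        stack.append(value)
    return push

def add(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def mul(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

# The "compiled" code is a sequence of function references, not opcode numbers.
threaded_code = [make_push(2), make_push(3), add, make_push(4), mul]

def run(code):
    stack = []
    for op in code:        # the only dispatch left is the indirect call itself
        op(stack)
    return stack.pop()

print(run(threaded_code))  # (2 + 3) * 4 == 20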

Threaded code [OTish]

"I've also heard of generating code in memory which consist of pointers to functions instead of instruction numbers, to replace dispatching by number with indirect jumps."

A common pattern in Forth VMs. Interestingly, another common way to implement interpretation/compilation in Forth is subroutine threading, where, instead of only storing pointers, the compiler emits calls. It isn't more complex, but provides a nice transition between compilation and interpretation; it is then rather simple to inline simple words (subroutines), and running a peephole optimiser on that instruction stream does not require rewriting the infrastructure. Someone better-versed in Forth will be able to offer more information.

With that in mind, I'm not sure I agree that there is a fork between code generation and bytecode interpretation, but more of a gradient. [EDIT: As you've written earlier. My fault.]

Bytecode performance

I found some claims that the native code version is 4 or 5 times faster.

Update: there is also a case where OCaml compiled to native code runs 21 times faster than OCaml compiled to bytecode: http://groups.google.com/groups?as_umsgid=429e604b$0$7560$ed2619ec@ptn-nntp-reader03.plus.net

I guess it's because OCaml bytecode boxes all floats, while its native code sometimes passes them directly.

Not sure I follow

As far as I can tell you are arguing that compiling is bad because a language that compiles to register machine instructions cannot support backtracking.

However, Scheme can be compiled and can support backtracking using "amb" (easy to implement in terms of call/cc). Prolog can also be compiled and is based on backtracking.

What am I missing?

Definitely not! All backtrac

Definitely not! All backtracking languages run on register machines, usually with an intermediate virtual machine. I think that Prolog compiles to the instructions of its vm or may be interpreted by its vm. Prolog is usually used dynamically, whether compiled or interpreted. I don't know about Scheme. Generating an executable seems to be a matter of compiling the source and the vm together as we talked about above.

Many systems nowadays are both interpreted (or vm-based) and compiled. For example, C++ has a large memory management subsystem that runs as a part of any executable. But I am really not the right person to talk about this.

The issue I wanted to raise is that compiling directly to the instructions of a register machine commits one to the control strategy of the register machine. The most troublesome example is If/Then/Else statements. An alternative exists in the form of If/Then rules with backchaining control. The backchaining vm covers up and replaces one control method with another. This gives us the feeling that we are using a different machine. In fact we are, because we could replace the vm with actual hardware. It is interesting and very fortunate that compiler theory allows us to create a new machine out of a simpler and very different machine, i.e. the register machine.

Tail-recursive discussion

The issue I wanted to raise is that compiling directly to the instructions of a register machine commits one to the control strategy of the register machine.

As I've tried to point out above, this is false by all usual definitions. As Glomek points out, Scheme is one example of a language which has native-code compilers, and supports arbitrary control flow, including backtracking, via first-class continuations. The SML/NJ implementation of SML is another such example.

You seem to be arguing that in order to do this, you effectively have to implement a virtual machine even in a native-code compiled language implementation.

There's some truth to that, in the very broad sense that implementing any language requires implementing something which could be described as a VM for that language. However, such broad definitions eviscerate your own point: if one implements a language using the CPS transformation strategies I referenced earlier, targeting native code, why doesn't that qualify as compiling "directly to the instructions of a register machine"?

By this logic, as you've pointed out, C++ and even C have virtual machines, to support heap management and e.g. runtime type handling in the C++ case. In fact, no language but the most primitive assembler does not have a virtual machine in this sense.

Continuation-based compilation strategies require about the same "virtual machinery" that is required to implement threads — in fact, continuations can be seen as snapshots of threads. For both threads and continuations, the system must manage multiple stacks, which is a capability explicitly supported by register machines.

In short, the register machine model does not actually impose limitations on control flow or control strategies. What does impose such limitations are languages like C and many other procedural or OO languages, which have limited support for anything other than the call/return model of control flow. I think to some extent, you may be incorrectly attributing the limitations of these languages to the underlying register machine.

It is interesting and very fortunate that compiler theory allows us to create a new machine out of a simpler and very different machine, i.e. the register machine.

This comment is more on track, but as I have pointed out, this is required for any language other than a simple assembly language.

Language, compiler and VM

Which reminds me of chapter 12 in EOPL1 (sadly missing in EOPL2): Compiler Derivation.

Section 12.1 is titled: Deriving a Compiler and Machine from an Interpreter...

Partial eval resources

I didn't realize EOPL used to have that. But for anyone interested, there are quite a few resources on readscheme.org and its sister site partial-eval.org. Many of the tutorials, surveys and overviews provide a fairly quick intro. Darius Bacon's A Hacker's Introduction to Partial Evaluation is one of the more easily accessible ones, and covers converting an interpreter to a compiler, with Scheme code. (Not a native code compiler, but the principle is there.)

Perhaps what is needed in thi

Perhaps what is needed in this forum is a more theoretical definition or explanation. Unfortunately I don't have that kind of training, but I will try anyway. By your argument the register machine does not pose limitations on control flows or strategies. You point out that Scheme can easily construct all sorts of control flows including backtracking. Granted! But aren't you simply constructing a virtual machine by adding behaviors not present in the register machine itself?

The point I want to make is that the register machine itself has a certain behavior. Languages that use that behavior and, unlike Scheme, don't modify it are the ones I am talking about: C, C++, and almost everything else.

But in a way even C can modify the basic register machine control when it is configured as a backchaining vm. I do that myself! But as I see it, I am no longer working with C; I have constructed something new, i.e. Lewis.

This way of thinking also has a practical aspect in that the vm can be turned into hardware and Lewis could then run as fast as C.

Perhaps it comes down to this: choosing one's control structure and not simply using what is offered up by default.

Generality of register machine control flow

You point out that Scheme can easily construct all sorts of control flows including backtracking. Granted! But aren't you simply constructing a virtual machine by adding behaviors not present in the register machine itself?

Right, but as I pointed out, we do this with any language other than a simple assembler (i.e. an assembler which directly corresponds to the register machine itself). So there's no special quality of backtracking languages that makes them more or less suited to register machines.

The point I want to make is that the register machine itself has a certain behavior. Languages that use that behavior and, unlike Scheme, don't modify it are the ones I am talking about: C, C++, and almost everything else.

You're making an artificial distinction here, placing C in a special position. C modifies the behavior of register machines by restricting the language it implements, for the most part constraining functions to return directly to their callers. Most register machines do not impose this constraint, but C does. Scheme doesn't impose this constraint, and in that sense is closer to register machines than C is. As I said, I think you're attributing aspects of C and similar languages to the register machine.

Perhaps it comes down to this: choosing one's control structure and not simply using what is offered up by default.

Even C does that. The core control structure "offered up by default" by register machines is the jump. In addition, most register machines provide support for control stacks, but don't constrain you to a single stack. These are very low-level primitives which allow you to construct any sort of control flow. C's behavior is by no means a default one for register machines.

This way of thinking also has a practical aspect in that the vm can be turned into hardware and Lewis could then run as fast as C.

It's likely to be more practical to exploit modern compilation techniques to develop a good compiler.

Virtual machines

But aren't you simply constructing a virtual machine by adding behaviors not present in the register machine itself?

What do you mean by virtual machine? Perhaps here is the misunderstanding.

For me a virtual machine is a low-level language designed to be interpreted by software. Or its implementation.

The distinction between a VM and an intermediate language is fuzzy; the same low-level language can also be compiled into another language (e.g. a real machine code), and then it can be both a VM and an IL. It's a VM when it's being interpreted. It's not a VM when it's always translated as a whole to something else before execution. The borderline case is when it's translated on the fly and the result of the translation is cached. Life is fuzzy.

definitions

The words “virtual machine” are definitely a characterization. Perhaps no more precise than defining a duck in terms of how it walks or quacks. This is bound to be a problem for any theoretical discussion. In the absence of any theoretical definition it is probably best to use it simply as it is used in the literature. I think of a vm as code that can be said to work like some real hardware and substitute for it. Vm’s used for various rule languages can be thought of this way. This makes it easy to compare a register machine and a vm. The vm doesn’t morph or transform through continuous shades of gray.

But this doesn’t address all the issues that compiler designers might face, nor does it completely justify my expression “directly compiles”. By invoking the word compiles in this forum I am invoking a lot of theory that I know little about (i.e. CPS transforms). I tried to follow this link but was blocked. Nevertheless I think I get the idea of Lambda calculus being transformed into machine codes. But this is just the idea I was hoping to avoid by using the word “direct”. I tend to think of compilers in terms of what I see on my debugger screen when it displays the machine codes generated by a line of source.

Nevertheless, “vm’s” are becoming ubiquitous and frequently don’t fit the limited definition used above. I am inclined to say that code that builds new control structures is doing something special: it is adding functionality. Suppose there is a register machine R, a vm built on R called V, and a source that runs on V called S. Is there any theory that suggests that S can be compiled directly to R? The definitions used by S are contained in V. The definitions used to compile V are contained in R. Compiling always seems to involve two steps.

What debugger shows me

I tend to think of compilers in terms of what I see on my debugger screen when it displays the machine codes generated by a line of source.

My debugger shows me the source code, not the generated code.

Push the show assembly button

Push the show assembly button.

Why?

My point was: why would you want to see the generated code? You are presumably debugging the source code, not the compiler.

Code Transport?

Well, if the code in whatever form is being pushed across the network, you need some way to debug it on the receiving end. Proof-carrying code would be another reason to have generated code in some form of container other than raw machine instructions.

Compilation

Is there any theory that suggests that S can be compiled directly to R?

I'm afraid your notion of "directly" can't be formalized, so there can be no theory.

But there is practice. For example, Mercury (there are some papers there which explain its strategy), and a particularly funny way to compile Scheme by putting the stack on the heap and then implementing the heap on the system stack, used in the Chicken compiler.

This was about compilation into C. Generating assembly code has even fewer constraints and doesn't have to employ tricks to realize various things which are easy in principle but not easily available in C (e.g. tail calls, integer arithmetic overflow checking, exceptions, user-level threads).

More Partial Evaluation References

Hank Thediek: Suppose there is a register machine R, a vm built on R called V, and a source that runs on V called S. Is there any theory that suggests that S can be compiled directly to R? The definitions used by S are contained in V. The definitions used to compile V are contained in R. Compiling always seems to involve two steps.

First, I have to say that I think that this is a quite articulate expression of the question—and it's a difficult question! A possible answer has already been alluded to in this thread, but let's see if I can't dig up some more references. First, the alluded-to general answer to your question falls under the rubric of "partial evaluation." Partial evaluation, in basic terms, just means that you have a program, "p," and some data, "d," that you can apply p to, that's known at analysis time. With a partial evaluator, let's call it "a," you can compute a(p, d) and get p', a new program that has done "as much of p as it can, given d," but since not all of the data needed by p was known at analysis time (that is, d is the "static data" and what's left is the "dynamic data"), p isn't done yet. But when we have the dynamic data, let's say "d'," we can compute p'(d') and get the same result as if we had computed p(d, d').

While partial evaluation had been known by various names for years prior, in 1971 Yoshihiko Futamura published his seminal "Partial Evaluation of Computation Process—An Approach to a Compiler-Compiler." This paper formalized the notion of partial evaluation and defined what are now called the "Futamura projections," which are well, and amusingly, summarized here. The Futamura projection that's relevant here is the second, in which the partial evaluator a is applied to itself and an interpreter for language L: a(a, InterpL). The result is a compiler from language L to whatever language InterpL is written in.

If we understand "InterpL" to mean your "vm," we see, then, that the short answer to your question "Is there any theory that suggests that S can be compiled directly to R" is "Yes: just apply a self-applicable partial evaluator to itself and vm." In practice, it's a tough exercise; see The Second Futamura Projection for Type-Directed Partial Evaluation, The Program Generator Generator PGG, Similix, and Tempo for more practical information.

Time

There may be a more subtle aspect to this discussion. It is easy to forget how important time is, and the changes that go along with it. Hence the need to consider not only what to compile but when. The proliferation of "vm's" in whatever form, and their interpreters, certainly suggests the need to be dynamic, another difficult word to use. The idea of an intermediate language may be a key. It leaves some constructs open after the compiler runs?

Compilation at runtime

While generating native code quickly and loading it dynamically is hard (can't be done portably using C alone), there are implementations which do that. For example various Lisps, e.g. CMUCL.

Direct compilation

Nevertheless I think I get the idea of Lambda calculus being transformed into machine codes. But this is just the idea I was hoping to avoid by using the word “direct”.
[...]
Compiling always seems to involve two steps.

Two or more steps, actually. Even C compilers compile through intermediate forms, such as the Static Single Assignment form (SSA).

In general, compilation is the process of transforming one language into another, and it's very common and natural for multiple transformation phases to be used in that process. The intermediate transformation stages often use a third language. This third language is usually designed to better support automated transformation, as in the SSA case.

SSA uses a side-effect-free approach to better support algebraic transformation of intermediate code. In so doing, it becomes a variant of functional programming, as observed in papers such as SSA is Functional Programming and A correspondence between continuation passing style and static single assignment. So, any compiler which compiles through SSA — including e.g. C and C++ compilers — is actually first converting the source language to a slightly strange variety of lambda calculus, then performing additional transformation steps on that, and finally generating machine code. In a sense, functional language compilers could be said to be more direct than e.g. C compilers, since they start with a sugared lambda calculus variant which they desugar and simplify, but it's lambda calculus all the way down, until the transformation to machine code.
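
For readers unfamiliar with SSA, here is a tiny hand-written illustration of the renaming it performs and of why the result reads like functional code; it is not the output of any real compiler:

# Imperative original:
#     x = 1
#     x = x + 2
#     y = x * 3

# SSA form: every variable is assigned exactly once, so each value has a
# single defining expression.
x1 = 1
x2 = x1 + 2
y1 = x2 * 3

# The same computation as nested functional bindings:
y = (lambda x1: (lambda x2: x2 * 3)(x1 + 2))(1)
print(y1, y)    # 9 9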

A "direct" compiler which somehow goes from a source language (other than assembly language) directly to native code (e.g. in a single transformation step) is not a realistic compiler. Realistic compilation to native code isn't trivial, and the need for multiple transformation steps and intermediate languages isn't much affected by whether the source language is C, a functional language, or a backtracking language. Constraining compilation to be "direct" in this context isn't useful.

Compilation to virtual machine instructions that are semantically close to the source language is obviously much simpler, and can be done more "directly".

I tend to think of compilers in terms of what I see on my debugger screen when it displays the machine codes generated by a line of source.

In that case, you're only looking at the input and final output of the compiler. It's easy to imagine a "direct" connection between the two if you didn't have to do any of the hard work of transforming one into the other.

Speaking of Intermediate Languages

I was recently recommended SPJ's paper on Henk: a typed intermediate language (I was mostly interested in the explanations of the Lambda Cube).

I was wondering whether Henk is still actively being pursued, as well as what others think about typed ILs in general. This talk of compiling stages reminds me that not all is necessarily known at compile time - some components may not be available until runtime. In which case, you can either revert back to dynamic linking, or have a compile-time type checker available during runtime. (Alice seems to rely on the type information being available to link packages at runtime).

Some successors to Henk

You may be interested in Pure Type Systems for Functional Programming, which describes a Henk-like language that can describe a large class of Pure Type Systems, including those of the Lambda Cube.

There's also JHC, a Haskell compiler that uses a Henk-inspired intermediate language.

Unusual control flow doesn't prevent compilation

Generating an executable seems to be a matter of compiling the source and the vm together as we talked about above.

The fact that it's possible to satisfy a formal definition of "compilation" this way doesn't imply that there aren't compilers of languages with unusual control flow rules which fit with the traditional spirit of compilation.

I've already mentioned Mercury. It's a statically typed hybrid of Prolog with functional languages, and there is a compiler which targets C. It really emits honest C code with functions corresponding to the program being compiled - it's not just a fixed set of functions with the program appended as data.

There are Scheme compilers which generate C or assembler, e.g. Chicken and Bigloo and various others. Scheme is dynamically typed (which shows that practical compilation is not limited to static typing) and has first-class multishot continuations. Continuations make it possible to implement backtracking (and more), and their availability prevents using the same function calling discipline as most other languages, because when a function returns, its local state might be needed again later, when the same computation returns another time.

The most straightforward compilation strategy of Scheme puts linked "stack" frames on the heap; they are reclaimed by the GC. I put the word "stack" in quotes because in general they form a tree rather than a stack. The pointer to a stack frame, together with a code pointer, can be materialized as an object and later entered again. The return address is not pushed onto the system stack but passed along with a jump, and it might later be stored on the heap.

So what? Scheme can be compiled in the same sense as C++, Eiffel or OCaml. Properties of the language do influence how function calls can be realized, but they are compatible with generating direct machine code with no central interpreter loop and without dispatching instructions at runtime. In what sense is it not a compiler?
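
To make the frames-on-the-heap point concrete, here is a small analogy in Python (not the output of any Scheme compiler): once code is in CPS, a "return address" is just a closure on the heap, so it can be stored and entered more than once:

saved = []                     # somewhere to stash a captured continuation

def add_one(x, k):
    saved.append(k)            # capture the pending "rest of the computation"
    return k(x + 1)

def main(x, k):
    # each k is a heap-allocated closure playing the role of a stack frame
    return add_one(x, lambda r: k(r * 10))

print(main(4, lambda r: r))    # 50: (4 + 1) * 10
print(saved[0](7))             # 70: the same frame entered again with a new value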