Hidden complexities of tail-call/tail-recursion optimization
In most imperative and functional languages, a function foo(x,y,z)
upon entry establishes a frame for itself on the stack. The frame contains
the values of the function's arguments, saved registers, as well as "local"
variables (bindings created during the execution of the function). Upon
return, the frame is popped off the stack. The existence of a
frame makes it easy to implement recursive functions. Each
separate invocation of fact(x) in the code below
int fact(int x) {
  if( x < 2 ) return 1;
  return x * fact(x-1);
}
has its own frame with its own copy of variable x.
Tail-call optimization applies to situations such as the following:
int foo(int x, int y)
{ ... return bar(1,y,x); }
Function foo() invokes function bar(); upon bar's return, foo pops its
frame off the stack and returns to its caller. It may be
possible to discard foo's frame _before_ the invocation of bar rather
than after it. That is the essence of the tail-call
optimization. Function foo() will discard its own frame, prepare the
parameters for bar() and enter bar() via a jump rather than a
'call' instruction. Function bar's frame will overwrite foo's
frame. When bar() returns, control will pass directly
to foo's caller. If foo() invokes itself in a tail
position, the tail-call optimization becomes a tail-recursion
optimization. The obvious benefit is that the recursion can execute in
constant stack space.
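For a self tail call, the effect of the transformation can be sketched in
portable C with a goto (a hypothetical illustration, not from the original
text; count_down is a made-up example): the recursive call becomes
"overwrite the arguments, then jump back to the entry point", so the stack
does not grow.

int count_down(int x)
{
entry:
  if( x == 0 ) return 0;
  x = x - 1;     /* reuse the current frame for the 'callee' */
  goto entry;    /* a jump rather than a 'call' instruction */
}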
More extensive discussion of tail recursion -- including a
reference to the formal definition, the rationale with a bit of history,
and the specification of a _tail context_ -- can be found in Sec. 3.5
"Proper tail recursion" of R5RS.
Although tail-recursion/tail-call optimization appears
straightforward, there are a number of complex issues; some of them,
in general, preclude such optimizations in languages like C and C++.
Evaluation of argument expressions for the tail call:
In the simplest case such as
int fact_tr(int x, int accum)
{ if( x < 1 ) return accum;
  return fact_tr(x-1,x*accum); }
The compiler can directly overwrite the existing frame: it can
multiply the old value of x by the old value of accum and write the
result back into the 'accum' variable. The compiler can then decrement
x (overwriting the previous value) and jump to the beginning of
fact_tr. Note how the compiler took advantage of the unspecified order
of evaluation of argument expressions. If the compiler were required to
evaluate the argument expressions in their lexical order, x-1 first and
x*accum later, it would have to use a temporary location to save
the value of x before decrementing it. In general, the compiler must
use temporary locations to save the results of evaluating the argument
expressions. Only after all argument expressions have been evaluated
should the compiler overwrite the current values of the function's
arguments. In Scheme, the use of temporary locations for argument
expression results is mandatory unless the compiler can prove that
none of the argument expressions capture continuations. See #1, #2 for more discussion.
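The general case can be sketched as follows (a hypothetical rewrite, not
from the original text; fact_tr_loop is a made-up name): this is roughly
what a compiler constrained to left-to-right evaluation might effectively
produce for fact_tr, evaluating every argument expression into a temporary
before overwriting the arguments and jumping back to the entry point.

int fact_tr_loop(int x, int accum)
{
entry:
  if( x < 1 ) return accum;
  {
    int t1 = x - 1;        /* evaluate every argument expression ...     */
    int t2 = x * accum;
    x = t1;                /* ... and only then overwrite the arguments  */
    accum = t2;
  }
  goto entry;
}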
Order of invocation of constructors and destructors:
Given the following C++ code
int objr(int x)
{ ObjA obj(x);
  if( x < 1 ) return obj.to_int();
  return objr(x-1);
}
evaluation of objr(1) leads to the following sequence of invocations
of ObjA's constructor and destructor:
ObjA() ObjA() ~ObjA() ~ObjA()
If the tail-recursion optimization is performed, the sequence will be
ObjA() ~ObjA() ObjA() ~ObjA()
The order of invocation may matter, since both the constructor and the
destructor may perform i/o or execute other side-effecting operations.
Unless the C++ compiler can prove that the order will not matter, the
compiler must not do the tail-call optimization.
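The original does not define ObjA; a minimal hypothetical definition that
makes the ordering observable could print from the constructor and the
destructor:

#include <cstdio>

struct ObjA {
  int v;
  explicit ObjA(int x) : v(x) { std::printf("ObjA()\n"); }   /* side effect */
  ~ObjA()                     { std::printf("~ObjA()\n"); }  /* side effect */
  int to_int() const          { return v; }
};

With such a definition the two sequences above produce different output,
which is why the compiler must preserve the order unless it can prove that
the difference is unobservable.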
In Scheme and in other languages, functional and not (e.g., Java),
only bindings are allocated on the stack while non-primitive values go on
the heap. When a frame is dropped and its bindings are destroyed, the
values are unaffected (and will survive garbage collection if, e.g.,
another binding points to them). In C not only the bindings but also
the values (e.g., structures, strings and other arrays) are allocated
on the stack. Dropping a frame therefore has more consequences.
FILE * open_png(char * file_name_stem)
{
  char file_name[256];
  if( strlen(file_name_stem) + 1 + 4 >= sizeof(file_name) )
    return (FILE*)0;
  sprintf(file_name,"%s.png",file_name_stem);
  return fopen(file_name,"rb");
}
The call to fopen() is a tail call. If open_png() destroys its own
frame before passing control to fopen(), the name of the file to
open is destroyed as well.
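One way to see the problem is to move the buffer out of open_png's frame.
In the following sketch (a hypothetical variant, not from the original
text; open_png_in is a made-up name) the caller supplies the storage, so
discarding the callee's frame before the tail call to fopen() would no
longer destroy the file name:

#include <stdio.h>
#include <string.h>

FILE * open_png_in(const char * file_name_stem, char file_name[256])
{
  if( strlen(file_name_stem) + 1 + 4 >= 256 )
    return (FILE*)0;
  sprintf(file_name,"%s.png",file_name_stem);
  return fopen(file_name,"rb");   /* the buffer outlives this frame */
}

Such a rewrite, however, changes the function's interface; a compiler
cannot make it on its own.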
Thus in general, tail-call optimization in C/C++ requires global code
analysis, and is possible only in exceptional cases.
Related topics: lazy semantics (the tail-call issues look different in a
lazy context) and capturing continuations.
Executing setjmp() prevents any tail-call optimizations in
C/C++. Indeed, if the function's frame is dropped, longjmp() will have
nowhere to return to. In contrast, in Scheme the execution of
call-with-current-continuation does not affect the tail-call
optimization. A Scheme compiler is obligated to do the tail-call
optimization whether continuations are captured or not.
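A hypothetical sketch of the setjmp() problem (the function names are made
up, not from the original text): if do_work's frame were discarded before
the tail call to handle_error, the environment recorded by setjmp() would
refer to a frame that no longer exists, and the longjmp() would have
nowhere to return to.

#include <setjmp.h>

static jmp_buf env;

static int handle_error(int code)
{
  if( code != 0 ) longjmp(env, code);   /* needs do_work's frame to be alive */
  return 0;
}

static int do_work(int code)
{
  if( setjmp(env) != 0 ) return -1;     /* longjmp() lands here */
  return handle_error(code);            /* a tail call whose frame must not be dropped */
}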