Expressions vs Statements

While trying to generalize the structure of a program and its source code, I came across this fundamental question, that I still don't manage to answer properly : "what is the difference between expressions and statements".

I am pretty sure that there are many folks here with a different answer from "42" ;)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Value

Expressions have a value, while statements do not. If you can pass it as an argument to a function, it's an expression. If you can't, it's a statement. Control flow constructs in C-derived languages are generally statements (you can't pass an 'if {}' block as a function parameter). Consequently, those languages tend to provide an expression form for 'if' (like the ternary operator). Of course, the whole point of many functional languages is that everything has a value, and that functions are first-class values, which means that everything is an expression. Statements usually appear in imperative languages, where you wish to write commands that don't necessarily return a value.

Like procedure v. function

Expressions have a value, while statements do not.

I don't think I ever made the connection between the difference between statement v. expression & Pascal's procedure v. function difference. (Which I thought was a useful distinction. About the only thing of value I got out of learning Pascal.)

Oh so many years ago when I moved from my first language (BASIC) to my second (C), one of the first things that struck me was how odd it was that assignment was an expression instead of a statement. Heck, even though I'm accostomed to it, it still seems odd to me.

Mandatory subject lines considered harmful

First it helps to make clear the purpose of expressions: the purpose of expressions is to have values. In a certain sense, a denotational semantics is nothing more than a set of rules for assigning values to expressions. (The rules are usually defined inductively in a syntax-directed manner but that is an "implementation detail".)

If I were forced to similarly specify the purpose of statements--and keep in mind that the statement vs expression distinction is not at all universal, not even among imperative Algol-style language--I would likely say: the sole purpose of statements is to have side-effects. Note that I am not saying that expressions cannot have side-effects. I am merely saying that since statements are not assigned values, they can only be good for side-effects, so that is their sole purpose. There is a grey area due to the 'return' operator in imperative languages but I think it's reasonable to consider 'return' to be internally side-effecting.

Now I'll sit back and quietly wait for the people who know a lot about programming languages to tear me to shreds. :)

The "do notation" of Haskell...

...is an example of "statements" in a purely functional language.

A statement has type m () or m a

A statement is either (a) an expression of the unit type in a monad or (b) a binding expression in a monad. Even in do, it's just sugar for expressions (using (>>)). In a language like C, the monad is an IO monad with some IORefs defined for global state.

When statements are a syntactical feature, the language has implicit state. (I would be very interested to be proven wrong here, and not on a technicality like declarations or directives.)

Oh, and 42 is an expression. In C, it's also a statement, because every expression is a statement.

Statements are commands, expressions are requests

Statements are commands, i.e. 'do this', 'open file', 'set collection size to 1', 'clear collection' etc.

Expressions are requests: 'calculate this', 'give me the contents of this file', 'give me a collection with size 1', 'give me a new collection'.

Superficially, they are both commands, in the sense that expressions are commands to perform calculations. But the real difference lies inside statements and expressions:

-In statements, the programmer tells the computer where to store the results of computations.

-In expressions, the programmer does not tell the computer where to store the result of computations, but takes the result in his/her hand, and passes it to other expressions.

The problem created by statements is that the programmer often forgets the right place to store the result of computations, and therefore uses a wrong place, thus creating havoc with a program; whereas in expressions the store does not matter, because the result is to be immediately used in a nearby expression.

It's a very subtle difference, but once someone understands it, he/she automatically understands the real value of functional programming.

Re: where to store results

That sounds interesting, although I think you have to be more clear/precise: When I bind a variable to a value that came from an expression, that sure looks an awful lot like telling the machine where it can shove the results. I would suspect that there are lots of people in the world who would see that as pretty much the same thing as assignment in C. How would you explain the difference further?

The C variable corresponds to

The C variable corresponds to a block of memory, the Haskell variable is literally just a name and may well have no run-time representation whatsoever.

Depends...

...on whether the C variable is an rvalue or an lvalue. ;> If it's an rvalue it could get optimized out of existence. Technically that's also possible for lvalues, but more likely for rvalues.

Re: blocks of memory

(Perhaps a tad devil's advocate, because I think it is honestly confusing to folks coming from the C side of things.) Technically, everything is stored in memory somewhere, be it a register, cache, a DIMM, on the hard-drive, on the Ethernet wire, etc. Perhaps addressing the idea of C having a machine model would be a way to explain the difference? I don't know enough about the Java spec, but maybe it would allow for Java to e.g.: not have to make a variable correspond to a block of memory in the fashion you are thinking?

[Edit] Given the optimizations that happen in C++ compilers mentioned, maybe the machine model ("if you use syntax X you will get a stack-allocated item every time") isn't the full story, but perhaps is still relevant and helpful to explain the nuances?

Depends on how things are declared and used

Local variables in C/C++ kind of live a dual life. Originally in C, it was assumed that variables (including function locals) would live in memory somewhere; the C language reflects this by provinding the address-of (unary &) operator which can be applied to most named terms. C also provided the "register" keyword; which did request that a given variable be register-allocated rather than stack-allocated.

Modern optimizing compilers, of course, try and ignore the "variable = storage location in main memory" paradigm as much as they can. As pointed out above; local variables are often cached in registers or optimized out of existence completely. Many such variables never find their way out of the CPU on many modern C/C++ implementations. The "register" keyword is considered to be deprecated.

Taking the address of local variables (usually not a good idea, other than for call-by-pointer in C) generally does require that the variable be assigned a location in memory; as does declaring it to be volatile.

lvalues versus rvalues

The difference shows up in the formal semantics, as well, and can best be understood in terms of lvalues and rvalues.

In Algol-like languages, there's a distinction between using an expression as an lvalue and as an rvalue -- that is, between using an expression as the targest of an assignment, and using it for its value. The reason this distinction exists is because the abstract semantics of Algol. In the semantics, a program state is a pair of a store and a heap, and a command updates the store and heap. By "store", I mean what C programmers would normally call "the stack", but in a formal semantics it's just a mapping from locations to their contents.

A variable is an expression that names a particular location in the store. When you use a variable as an lvalue, that expression denotes the location in the store. When you use it as an rvalue, you denote the contents of that location. You can write some expressions that calculuate lvalues -- for example, this is how array indexing works. a[10] = 5 is an expression that computes an lvalue and then stores 5 in the contents of that location.

So, to recap, the meaning of a variable as an l-expression is a location in the store, and its meaning as an r-expression is the value in the contents of that location. However, in a language like ML, there's no store in its semantics. Instead, variables are only a map from names to values, and there's no distinction between l-expressions and r-expressions. State gets expressed entirely in the heap -- you can have variables that denote pointers into the heap, and then update the contents of that pointer.

Since the only place that mutation can happen is in the heap, this is sometimes called having a pure reference semantics.

Incidentally, you can go much further than C does in making l-expressions and r-expressions equally powerful; probably the language that goes furthest in this direction is John Reynold's design for Forsythe. In it, you could write things like:

(if complex_condition then x else y) := 17;

PL/I pseudovariables

PL/I allows assignment "into" some of the builtin functions. The routine are called pseudovariables when used this way. See here for some examples.

Being able to create your own pseudovariables would be cool, wouldn't it?

SETF

Isn't this what Common Lisp's SETF does? By defining a setf expander, you can make any form - function call, if statement, array reference, whatever - into an L-value.

C++ reference returns

You can do similar stuff in C++ by returning references--while a returned value is an rvalue; a returned references is an lvalue. If the object for which a reference is returned has operator = defined (or is a primitive type), the result of a function call can be on the left hand side of an assignment.

The STL depends on this capability. If you have an iterator p defined as

vector<SomeClass>::iterator p = someVector.begin();

you can then do

*p = someObject

Which is essentially

p.operator *() = someObject

As operator * of any iterator will return a reference; it can be an lvalue (same goes for operator []). If we strip off one more layer of infix syntactic sugar, it becomes

p.operator *().operator =(someObject);.

Common Lisp, IIRC, has something similar. Of course, C++ references aren't first class (though C++ pointers are), and are frequently implicated in many nasty bugs (such as returning references to the functions locals), but they can be damn useful.

It should also be pointed out that the result of the ?: operator in C/C++ may be an lvalue. Thus

(complex_condition ? x : y) = 17;

is perfectly legal C/C++ code; assuming types are valid. Unfortunately, you cannot write your own lazy operators in C/C++ (if you want laziness, use Boost or write your own); and this doesn't work with if/else. ?: is also limited in that--coming back to the subject of this thread--it can only contain a single expression or statement (or a comma-separated chain), it can't contain multiple statements.

In C, too

While C doesn't have references or operator overloading; you can fake it with macros. The canonical example in any multi-threaded OS is of course:


int *__errno(); /* return ptr to thread-local errno */
#define errno (*(__errno())) /* errno expands to lvalue*/

Of course, such things are decidedly far from first-class constructs, and are generally regarded as ugly hacks... but such works. C++ references can be viewed as a formalization and extension of this technique:

int &r = foo;

is sort of equivalent to

int * const r_ptr = &foo;
#define r (*r_ptr)

But much better behaved.

Lisp

Incidentally, you can go much further than C does in making l-expressions and r-expressions equally powerful; probably the language that goes furthest in this direction is John Reynold's design for Forsythe. In it, you could write things like: [snip]

As a random aside, this kind of thing works in Lisp if you use SETF.

(defun f (n)
  (let ((x 0) (y 0))
    (progn
      (setf (if (eq n 0) x y) 1)
      (list x y))))

(print (f 0))    ;; prints (1 0)
(print (f 1))    ;; prints (0 1)

No


Error in FDEFINITION:  the function (SETF IF) is undefined.
   [Condition of type UNDEFINED-FUNCTION]

The nearest that I can think of is:

(set (if (> (random 5) 2) '*foo* '*bar*) 5)

which is admittedly much less sexy.

Strange

I actually tested my code using SLIME + whatever Lisp implementation comes with Lisp in a Box. Maybe this is a non-standard setf expander that happens to come with this implementation but isn't part of the standard?

Just to prove I'm not totally crazy, here's a transcript I've cut-and-pasted from my REPL session.

CL-USER> (defun f (n)
  (let ((x 0) (y 0))
    (progn
      (setf (if (eq n 0) x y) 1)
      (list x y))))

F
CL-USER> (f 0)
(1 0)
CL-USER> (f 1)
(0 1)

Maybe

I hate this about Lisp -- there is a real standard document so I can't just find out if code is correct by pasting it into a shell :-). Life is too short to RTFM!

Strange Lisp

That is a strange Lisp then, and it is not in the standard. To find out which Lisp you have you can call lisp-implementation-type.

That comes straight from Algol 60

The above is perfectly correct Algol 60 as well as correct Forsythe. What's more, you can pass such things by name to Algol 60 procedures and have the result depend on the current value of complex_condition at each assignment within the procedure.

First-class references

Your example:

(if complex_condition then x else y) := 17;

is actually legal ML code:

$ hamlet
HaMLet 1.2 - To Be Or Not To Be Standard ML
[loading standard basis library]
- val complex_condition = true ;
val complex_condition = true : bool
- val x = ref 1 ;
val x = ref 1 : int ref
- val y = ref 2 ;
val y = ref 2 : int ref
- (if complex_condition then x else y) := 17;
val it = () : unit
- !x ;
val it = 17 : int
- !y ;
val it = 2 : int

All you need is first-class references (and a compiler that can optimize away unneeded indirections).

Storage is an implementation issue though.

What you are saying is true, but it is only an implementation issue. For the computer, expressions are commands, and data have to be stored in some memory.

For the programmer though, there is a difference: an expression offers the result at hand, whereas a statement places the result of a computation at a predefined position. Using expressions makes for better software reuse, because expressions can be easily used as software building blocks (one expression passing its result to another), whereas statements can only be used the way the language creator imagined.

See it this way:

An expression is a piece of code with zero, one or more inputs and one output. From this simple definition, it's easily understandable that expressions can be combined in various ways.

A statement is a piece of code with zero, one or more inputs and no output. From this simple definition, it's easily understandable that statements can not be freely combined. Since statements don't return something, the programmer has to know the internals of a how a statement works in order to successfully combine two or more statements.

Therefore, an expression is a black box: we are only interested in its output (if we already know it is correct, of course). A statement though is not a black box: it has one or more external predefined references that it uses...thus the programmer must know jungle more things when using statements, and here exactly is the problem with the modern imperative programming languages.

I wonder what programming languages would exist in Star Trek...

Storage isn't entirely an imp

Storage isn't entirely an implementation issue - I can take a pointer in C, whereas I can't do this in Haskell (ST monad et al aside). I agree with your black box view though.

Star Trek's programming languages would be spoken english and feature supercomputational disambiguation algorithms and the like. Or perhaps I'm just being cynical?

Star Trek's programming languages...

...are declarative ones, similar to Prolog.

The main difference being is the following: When an inconsistent system is specified by a programmer (or a clever starship captain), rather than failing with a simple unification error, instead the underlying hardware malfunctions and goes haywire.

"Norman, I want you to listen to me carefully..."

:)

Well some people in the star

Well some people in the star trek universe seems to be programing in Var'aq.

When would a PL designer make an expression or a statement ?

I think you managed to state the difference really clearly. Looking at the "statement vs expressions" from a combination point of view (expression can be combined, whereas statements cannot) pointed out the right thing IMO.

Now, when would a language designer decide to make an operation a statement, and when would he make it an expression ? For instance: I have an operation to mutate the value of a slot (a name memory cell). Should I make it available as an expression or a statement ?

I would tend to say that everything should be an expression... but maybe things have to be turned to statements for performance (implementation-related) issues ?

Parsing

I believe that making some expressions statements makes the grammar simpler to deal with (for instance, there's lots of places you know you can't find an 'if' or 'while' in C).

That sort of thing actually m

That sort of thing actually makes the grammar harder to work with given modern parsing tools, simply because it makes the grammar itself more complex.

Error reporting

In the grammar department I think one good reason for having "exceptional" syntax for conditional statements and loop statements (as opposed to the approach taken by Lisp, say) is that it can make it easier to provide better syntax error reporting. When everything looks the same structurally, errors are sometimes only detected far away from the actual place of the error.

As far as programmer friendliness goes, I think perhaps the best reason for having exceptional syntax for such constructs is that it serves a similar function to punctuation and whitespace in human languages: it helps make code more easily scannable by human readers. I think this is the main disadvantage of Lisp syntax. It isn't actually harder to write when you get used to it but it is not as easily scannable as most languages with Algol-style syntax, even when good indentation is used.

Agreed. While I could live wi

Agreed. While I could live without if/then/else in Haskell, I'd hate for pattern-matching to become just another function somehow. Somehow I don't find it such a big deal in loops vs HOFs, but the only other time I find a really big chunk of code in parens is when I'm building big data structures manually.

A good way to supply more specific errors for frequently used HOFs would certainly be nice to have.

Re: errors for just another function

Ja, if the reason to keep things as 'special cases' is error reporting, then it seems to me a better situation would be for our systems to let us extend the error reporting for any given 'regular' function.

I think its a question of fle

I think its a question of flexibility vs safety. If you make everything an expression you can use anything anywhere, but you can do this both by purpose and by accident. Its basically the same tradeof as between static and dynamic typing.

for example:

if (x = 7) then ... ;

in a C-like language would often be an error what would be caught if "=" was a statment but not if it was a expression

fun {$ X}
 if x==7 then ...
 else {Raise ...}
 end
end

in Oz on the other hand doesn't work becourse {Raise ...} is a statment at a expression position.

Syntactic abuse

I'd say the problem with C rather is the abuse of the innocent looking, suggestive symbol "=" as an assignment operator. That just invites trouble, either way.

Use statements to sequence effects

Rule of thumb: Use expressions for "things" that play nice with substitution. Use statements when substitution doesn't work.

always expressions, never statements.

A destructive update (...I like the term 'destructive' because it can really destroy things by allowing logic bugs to creep in) should be modelled with the least number of interpretations possible. If assignment is declared as an expression, then people can interpret it in more ways than if it was a simple statement. For example:

if (a = x > 2) { printf("%\n", a); }

The above can be interpreted in many ways:

  1. a is assigned the boolean result of x > 2.
  2. a equals the boolean result of x > 2; when browsing code, eyes can not easily distinguish '=' from '=='.
  3. a is assigned the value of x.

There are also many ways to interpret the conditional: either a is being tested, or x > 2 is being tested.

I believe that a language shall be as clean as possible, even if it means typing more characters. IDEs can assist in typing, but they can not assist in program disambiguation.

Slogan version

axilmar says ... an expression is a black box ...

... and a statement is a black hole. (The results come out somewhere else, possibly a different universe.)

Assignment versus binding

The difference is of course if you are binding a new variable or assigning to an existing one. C-style languages unfortunately hide this difference in syntax, but if you bind a new variable, even if it shadows another one by having the same name, you are not risking putting this value in the wrong place, because you just created a new place for it. But an already existing variable may be referenced in any number of places, and they will all see the new variable that you replaced the old one with when assigning to it. "Putting the wrong value in the wrong place" is really only possible with assignment.

Re: binding vs. assigning

(Sorry to be thick as a brick, but I am trying to be as much an obnoxiously slow C/++ user as possible since in many ways that is probably what I am and because I think people in general haven't done a good job of succinctly explaining the difference to non-FP users. Maybe the answer is "you just have to use it for a while" but I hope there is something more pithy we can come up with.)

I would say we have to further tease out how those 2 things are happening in C++ because (at least with g++) I can have block scope shadow a variable, and when that block scope ends I get to see the shadowed value once again. So that sounds a lot like binding and a lot like it avoids the issue you describe.


int main()
{
     int iVal = 42;

     printf( "%d\n", iVal );

     for( int iVal = 0; iVal < 3; iVal++ )
     {
          printf( " %d\n", iVal );
     }

     printf( "%d\n", iVal );

     return( 0 );
}

The FP situation is equivalen

The FP situation is equivalent to only assigning to variables when you define them. Rebinding is creating a new variable in an inner scope such as that created by a let statement. Effectively, where C++ has block statements that contain a sequence of definitions and statements, FP languages typically have let statements consisting of a sequence of definitions and a single expression.

Binding in C/C++

The equivalent of binding in C/C++ would be to assign a value as the variable is created, i.e. in the variable declaration. Any further assignment to that same variable then consists of assignment. As you say, you can by introducing new pairs of braces introduce new scopes, in which you can shadow variables from outer scopes.

I agree that the machine model of C/C++ makes the difference between binding and assignment less clear, but even so it is still an important distinction to make.

Re: assign when created

It would be interesting to give people C++ homework assignments where they could only use binding. Make everything 'const' and use RTTI style? That might be a nice way to give folks a feel for the difference, and act as a stepping-stone of understanding?

RTTI style?

What does RTTI (runtime type identification) have to do with binding-vs-assignment?

Re: RTTI

Actually, nothing, as far as I know. I meant to say RAII (which is something to do with the Italian media hegemony? :-). D'oh.

Binding semantics

Actually, binding vs. assignment in C++ is perfectly clear. Binding occurs when the constructor of an object is called, and assignment occurs when the assignment operator is called. It's unfortunate that the same symbol is used to perform binding and assignment in some cases, but from a language perspective the distinction is crystal clear.

Objects and objects

Well yes, when you design your own classes. But then as you say you are allowed to make the syntax mean anything you want to anyway, which certainly can make things less than crystal clear. When you have variables of the primitive types, it is also not clear at all.

Example?

Show me an example where I cannot tell you if binding or assignment is occurring. Binding can *only* happen in a constructor. Assignment can occur anywhere else (except a c'tor), but once the constructor is called, binding cannot occur again. And primitive types are still "constructed", even if the constructor is trivial. That is, given any C++ program, I can tell you where binding occurs because I can tell you where constructors are called, even if those c'tors are only implicit for builtin types. One might be tempted to say that in the following code:

  int x;
  x = 5;

the second line is where the value 5 gets bound to x. But that would be incorrect. In fact, x has a value before the second line, but that value is singular. So the proper semantics of the code above is that x is bound to a singular value in the first line, and is assigned to 5 in the second line.

Of course you can tell. After

Of course you can tell. After all, the machine can tell too. I am saying that it is hard to, because these two very different operations are hidden under a similar syntax. This makes the users of those languages less aware of the difference, causing them to reuse variables instead of creating new ones. It doesn't help that original C required you to bind all variables at the top.

Machine Intelligence?

After all, the machine can tell too.

Do you really think the machine knows or cares about the distinction between binding and assignment? Isn't it all just code? It seems like a semantic notion that only exists at the level of the source language, and is only significant to programmers.

Re: Machine Intelligence?

Do you really think the machine knows or cares about the distinction between binding and assignment?

Sure it does. Think of it (a little loosely) like this: binding is creating an lvalue [associated with a name]. Assignment is updating an lvalue. Big difference; without the former, the latter couldn't happen. See neelk's comment "lvalues versus rvalues" for more on that subject.

Another way to put it is with the following code:

int x = 1;
int *p = &x;
x = 2;
cout << (p == &x);

The fact that this code outputs a truth value (1) indicates that there is some property of the variable x which is unchanged by the assignment "x = 2". That property is the "location" mentioned in neelk's comment — x is bound to a location, and the binding of x to that location does not change because of an assignment.

Definition of "machine"

I guess it depends on what level of abstraction we're talking about. I agree that at the source level, the compiler needs to know about binding in order to emit the proper code for allocating space, etc. But below that level, the distinction becomes less and less relevant.

Which machine

I'm talking about the machine that C is supposed to be close to. :)

Arguments of the form "it's all just bits in the end" aren't very enlightening. The point of saying something like "the machine can tell" is to point out that the distinction in question is not something purely imposed in our minds, but that it does make a difference to the operation of the machine, whether or not humans are always clear on the difference. Whether it's at the level of the runtime, or of the compiler that the "machine can tell" isn't the issue.

Besides, something like that constructor definition of binding you gave works pretty well at lower levels, too: e.g. for local variables, binding is happening when the stack gets pushed, whereas assignment is happening when a location on the stack is updated. This can usually be quite clearly distinguished at the opcode level.

Fair enough...

David B. Heid: I guess it depends on what level of abstraction we're talking about. I agree that at the source level, the compiler needs to know about binding in order to emit the proper code for allocating space, etc. But below that level, the distinction becomes less and less relevant.

Indeed it does, and since the "source level" means "the language level," I'd call it a critical point: we rely on our languages precisely to enforce a wide variety of abstractions, even more than we rely on our libraries, design patterns, etc. to!

Name

Just for future reference, my last name is "HELD", not "HEID". I used caps to highlight the difference in spelling, not to shout or anything. It's an Anglicization of the German "von Heldt". I just don't want to encourage any association with Dr. Jekyll. ;>

At some point the compiler ta

At some point the compiler takes the binding (introduction of a variable) or the assignment (to an existing variable) and produces machine code for it. This machine code looks different for these two cases, so obviously the compiler knows the difference.

This is actually binding

When you do something like TYPE VARIABLE = VALUE, you are doing a binding between a value and a variable (which is allocated). So more precisely, the compiler allocates a slot for the variable and assigns the slot with the given value.

When you do something like VARIABLE = VALUE, the variable is assigned with the value. This meands that the compiler must find the slot corresponding to the variable (this is resolution) and then assign it the value.

So in this case, binding = slot allocation + assignation.

Statements and Expressions as a design choice ?

I almost agree with you, excepted that I would say that some statements are not bound to computations, for instance those which manipulate the program control flow : "break", "continue" or "yield" for instance.

However, what you said made me think that maybe the difference between statement and expression is simply a choice of the language designer : what is hidden (not accessible by the language user) should be a statement, otherwise it is an expression.

For instance, in Python, statements like "break", "continue", or "yield" are there because there is no available value (object) that references the control flow and allow to operate on it. If there was any, then we could use operations/function to modify it and then have expression with identical effects to statements.

Anyway, I'm really not sure if what I'm saying is correct...

Elegant has an interesting answer

I'm sure that Lex Augusteijn's interesting DSL Elegant has been discussed on LtU before.

In Elegant, all expressions are evaluated with a "context type" which can be an empty type; such types are effectively monads, if I understand all the concepts correctly. The value of the expression is then coerced to the context type, which can have an effect (or side effect, if you prefer).

So, for example,

{ "Type " (t1) " is incompatible with " (t2) } astype Message

is an expression containing a block of four statements (two strings and two simple parenthesized expressions), all of which are evaluated in the context type Message. Message is an empty type which has coercions whose effects are to display on stderr.

Consequently, in Elegant you can define a function whose result type is Message (or, more usefully, Output); each statement/expression in the function body is then evaluated with that context type, with the consequence that the function is effectively a template.

I fear that I have not explained this very well, and refer interested people to the tutorial and reference manual on the Elegant home page. Elegant has a lot of interesting features, and is, in my opinion, worth the look.