Two Computers & An Ethernet Cable -- Also A Computer?

Say we want to run a computation on two computers, with a
network cable between them. This gets very complicated -- one
must make all sorts of provisions for network failure, failure
of one machine, &c -- unless one is willing to fail fast, to
fail the whole thing if any component fails.

After examining this problem for quite some time, I wonder if
the network computer is not really a computer, in the sense of
a general purpose computer, but rather a lesser machine, not
able to run all the algorithms that a general purpose computer
can run. However, I lack the formal tools to say anything
about this. Is there a clear case to be made one way or the
other? What's a good book to start with, assuming I would like
to demonstrate it for myself?

The network computer is a

The network computer is a stochastic computer in the sense that there's a probability of things failing - there was a lambda calculus mentioned on LtU a while back that deals with such things for reasoning about handling hardware failure. Which moves on to the second point: all hardware computers have that issue (just like none of them actually have unbounded memory), but the probability is higher with the network computer.

That said, where errors are detectable you can always just restart a (sub-)evaluation assuming the computer as a whole isn't now permanently broken. More generally, it's 'just' an engineering problem, just like any other hardware computer. You may, however, have just discovered the reason for triple redundancy :-)
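To make the "restart a (sub-)evaluation" idea concrete, here is a rough sketch in Python (the names and the failure model are made up for illustration, not taken from any real system): a pure thunk is simply re-run whenever the hardware reports a detectable fault.

    import random

    class DetectableFault(Exception):
        """Raised when the hardware reports a failure it was able to detect."""
        pass

    def flaky_eval(thunk, failure_rate=0.2):
        """Evaluate `thunk` on stochastic hardware: sometimes it faults, detectably."""
        if random.random() < failure_rate:
            raise DetectableFault("component failed during evaluation")
        return thunk()

    def eval_with_restart(thunk, max_attempts=5):
        """Restart the sub-evaluation on each detectable fault, assuming the
        computer as a whole isn't now permanently broken."""
        for _ in range(max_attempts):
            try:
                return flaky_eval(thunk)
            except DetectableFault:
                continue  # the thunk is pure, so redoing it is safe
        raise RuntimeError("giving up: the machine may be permanently broken")

    print(eval_with_restart(lambda: 2 + 2))  # almost always prints 4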

The network computer is a

The network computer is a stochastic computer in the sense that there's a probability of things failing - there was a lambda calculus mentioned on LtU a while back that deals with such things for reasoning about handling hardware failure. Which moves on to the second point: all hardware computers have that issue (just like none of them actually have unbounded memory), but the probability is higher with the network computer.

The difference that I see is that in the event of RAM failure on a single computer, it is acceptable for the whole system to fail. Whereas it is generally not okay for a network computer to fail if one piece of RAM on one computer fails. So it is really this failure handling that strikes me as different. The rate of partial failure for distributed applications is very high relative to the rate of failure of my laptop, say, so there is a real difference in style.

The difference that I see is

The difference that I see is that in the event of RAM failure on a single computer, it is acceptable for the whole system to fail.

Not necessarily! ;-)

The rate of partial failure for distributed applications is very high relative to the rate of failure of my laptop, say, so there is a real difference in style.

Except in harsher environments, like space, where pervasive radiation can flip many bits at random. The faulty lambda calculus mentioned above is a step toward addressing that at the software level.
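For the undetected case -- a silently flipped bit rather than a reported fault -- restarting doesn't help, which is where the triple redundancy mentioned above comes in. A toy sketch (in Python, with an invented fault model, not any real radiation model) of majority voting over three runs:

    import random
    from collections import Counter

    def with_bit_flip(value, flip_rate=0.1):
        """Model a radiation-induced fault: occasionally corrupt the result silently."""
        return value ^ 1 if random.random() < flip_rate else value

    def triple_redundant(compute):
        """Run the computation three times and take the majority answer,
        masking a single silent corruption without ever detecting it."""
        results = [with_bit_flip(compute()) for _ in range(3)]
        answer, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority: more than one copy was corrupted")
        return answer

    print(triple_redundant(lambda: 42))  # usually 42, even if one run is corrupted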

RAM Failure Is Out There

The difference that I see is that in the event of RAM failure on a single computer, it is acceptable for the whole system to fail.

Not necessarily! ;-)

Thanks for posting this link -- it's certainly a step in the right direction. However, you are unfairly pushing aside my point. In some of the cases mentioned by the paper -- for example eBay -- it would have been okay to fail the computation entirely had the failure been detected. So for them, total failure of a single system is still an acceptable way to handle RAM or processor failure.

More generally, we allow ourselves the convenient fiction that partial failure of a single computer amounts to total failure of a computation depending on the failing subsystem -- and in the case of RAM or processor failure, it is acceptable to push the entire OS over the brink. Outside of the space program, programmers don't write code that recovers from RAM failure (RAM striping?) or processor failure -- or even detects these things.

Why is three so special?

Why is three so special? Because you have a simple majority in every failure scenario?

Suggested reading

I am not sure exactly what you are driving at, since you seem to be suggesting a computability perspective, which is rather different from the way these issues are usually studied in the field. The usual approach (see this book, for example) is to study impossibility results showing that some distributed problems cannot be solved given certain amounts of failure (also depending on the types of failure). In one sense these results can be interpreted along the lines you suggest, but I don't recall seeing this perspective, since the algorithms being studied are essentially distributed, and thus not directly comparable to the single-machine case.

Keep in mind that, formally, multiple machines can always simulate a single machine exactly: simply do all the computation on one machine, without any communication.

Thank you for the

Thank you for the link.

...you seem to be suggesting a computability perspective which is rather different than the way these issues are usually studied in the field.

The counter-example you give, of many computers with only one performing computation, is a good sign that I am painting with too broad a brush. Distributed systems generally expose failure events to the programmer in a way that non-distributed programs don't, largely because distributed systems aim for high availability. In an "eventually consistent" storage system, for example, you can't reliably read back what you just wrote -- so the storage in your "computer" is stochastic. The metaphorical "tape walking machine" would have probabilistic behaviour in the presence of an eventually consistent paper tape.
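As a toy illustration of what I mean (a sketch, not any particular system's API): writes are acknowledged immediately but propagate to replicas only probabilistically, so a read may return a stale value for a while.

    import random

    class EventuallyConsistentStore:
        """Toy model: writes are acknowledged at once but reach each replica
        only with some probability per read, so reads can return stale data."""
        def __init__(self, replicas=3, propagation_rate=0.5):
            self.replicas = [dict() for _ in range(replicas)]
            self.pending = []  # (replica index, key, value) not yet applied
            self.propagation_rate = propagation_rate

        def write(self, key, value):
            # Acknowledge immediately; only queue the update for replication.
            self.pending.extend((i, key, value) for i in range(len(self.replicas)))

        def read(self, key):
            # Some pending updates land before this read...
            still_pending = []
            for (i, k, v) in self.pending:
                if random.random() < self.propagation_rate:
                    self.replicas[i][k] = v
                else:
                    still_pending.append((i, k, v))
            self.pending = still_pending
            # ...and the read hits an arbitrary replica, which may still be stale.
            return random.choice(self.replicas).get(key)

    store = EventuallyConsistentStore()
    store.write("x", 1)
    print([store.read("x") for _ in range(5)])  # e.g. [None, 1, None, 1, 1]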

Not two, but at least three...

Say we want to run a computation on two computers, with a
network cable between them. This gets very complicated -- one
must make all sorts of provisions for network failure, failure
of one machine, &c -- unless one is willing to fail fast, to
fail the whole thing if any component fails.

Yes, it's very complicated, but (with at least three participants, as Philippa points out) it's the way many HA clusters are built nowadays. And no, the whole thing does not (necessarily) fail if a component fails.

After examining this problem for quite some time, I wonder if
the network computer is not really a computer, in the sense of
a general purpose computer, but rather a lesser machine, not
able to run all the algorithms that a general purpose computer
can run. However, I lack the formal tools to say anything
about this. Is there a clear case to be made one way or the
other? What's a good book to start with, assuming I would like
to demonstrate it for myself?

If you don't take failures into account, then both the networked and the single-machine configurations are equivalent to Turing machines (as Ehud points out, you have the simple case of the network computer using just a single node), so both can be considered computers and can compute anything that is computable.

If you do take failures into account, then neither of them is a Turing machine, and thus neither can be guaranteed to complete any algorithm at all.

Usually Fail

It is more usual to take failure into account with distributed systems than with single computers.

Lamport Wrote About This Stuff...

I guess this is actually a large part of Leslie Lamport's work. I had no idea the guy did distributed systems -- I've always known him for LaTeX.

Distributed Computing

You may find my articles on Wikipedia enlightening: the Paxos Algorithm and the State Machine Approach.

To answer some of your questions:
- A network of computers can compute anything a single computer can compute.
- A network of computers is not more likely to fail.
- You must have more than two computers.

A distributed system must place nodes carefully to avoid correlated failures. If it does so, it can withstand any number and type of failure that is planned for, which is far better than single (even HA) computers. But the price for this resilience is the difficult task of writing software for it.

Algorithms like Paxos and system architectures like the State Machine Approach reduce this complexity to a minimal core component. That component encapsulates all distributed failure cases, freeing application programmers from managing this burden.
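As a sketch of the State Machine Approach (assuming the consensus layer, e.g. Paxos, is a black box that delivers the same commands in the same order to every replica), the application-level code can be as simple as applying deterministic commands to deterministic state:

    class ReplicatedCounter:
        """A replica of a deterministic state machine. All the distributed
        failure handling lives in the consensus layer (not shown); its only
        job is to hand every replica the same commands in the same order."""
        def __init__(self):
            self.value = 0

        def apply(self, command):
            op, amount = command
            if op == "set":
                self.value = amount
            elif op == "add":
                self.value += amount
            return self.value

    # Pretend consensus (e.g. Paxos) has agreed on this command log.
    agreed_log = [("set", 10), ("add", 5), ("add", -3)]

    replicas = [ReplicatedCounter() for _ in range(3)]
    for command in agreed_log:
        for replica in replicas:
            replica.apply(command)

    assert all(r.value == 12 for r in replicas)  # every replica reaches the same state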

Can you explain why I must

Can you explain why I must have more than two computers? Are odd numbers better?

I already read the Paxos article and I'm reading the "State Machine Approach" article right now.

Odd numbers would be better for democracy...

...as even numbers have the chance for a tie when there's a disagreement. But I don't know that democracy is a better algorithm (Minority Report).
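The arithmetic behind it is just majority quorums: a cluster of 2f+1 nodes tolerates f failures, and rounding up to an even size buys no extra tolerance while introducing the possibility of a tie. A quick sketch:

    def majority(n):
        """Smallest quorum such that any two quorums of n nodes must overlap."""
        return n // 2 + 1

    def tolerated_failures(n):
        """How many nodes can fail while a majority of the original n remains."""
        return n - majority(n)

    for n in range(2, 8):
        print(n, "nodes: quorum of", majority(n), "- tolerates", tolerated_failures(n), "failure(s)")

    # 2 nodes: quorum of 2 - tolerates 0 failure(s)   <- two computers are not enough
    # 3 nodes: quorum of 2 - tolerates 1 failure(s)
    # 4 nodes: quorum of 3 - tolerates 1 failure(s)   <- the extra even node buys nothing
    # 5 nodes: quorum of 3 - tolerates 2 failure(s)
    # ...and so on: only odd sizes increase the fault tolerance.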

This discussion is in danger

This discussion is in danger of becoming off-topic. The subject is huge and many of the results are rather counterintuitive (it's a fascinating subject). If anyone can suggest good places for discussion of distributed algorithms, go ahead. Otherwise, unless a PL connection is being explored, I'd recommend taking the discussion offline.